Question about principal component analysis

Andrew Garvin writes,

The CFNAI is the first principal component of 85 macroeconomic series – it is supposed to function as a measure of economic activity. My contention is that the first principal component of a diverse set of series will systematically overweight a subset of the original series, since the series with the highest weightings are the ones that explain the most variance. As an extreme example, say we do PCA on 100 series, where 99 of them are identical – then the first PC will just be the series that is replicated 99 times.

This is not necessarily a bad thing, but consider the CFNAI – most of the highly weighted series are from a) industrial production numbers, or b) non-farm payroll numbers. On the other hand, the series with relatively small weightings are very diverse. As I see it, then, the first principal component is not so much a measure of ‘economic activity’ as of ‘economic activity as primarily measured by industrial production and NFP’. Now, if I thought a priori that industrial production and NFP explained most of what was happening in economic activity, then this would not be such a bad outcome. However, it seems to me that the whole point of using PCA instead of an equal weighting is that we are naive about the true weightings of the various series composing our indicator – and so PCA conveniently gives us the most appropriate weightings. So, to me, PCA only works as a weighting strategy if we already have some idea of what the weights should be, which defeats the purpose of using PCA in the first place.

My question then is: Do you see this as a problem? a) If so, would you mind suggesting ways to deal with this problem, or perhaps pointing me to some reading material that might discuss this issue? b) If not, I would be curious to know what the flaw is in my argument above.

My reply: Hey–you’ve hit on something I know nothing about! My usual inclination is not to use principal components but rather to just take simple averages or total scores (see pages 69 and 295 of my book with Hill), on the theory that this will work about as well as anything else. But in that case you certainly have to be careful about keeping too many duplicate measures of the same thing. My impression was that principal component analysis gets around that problem, actually.

My more general advice is to check your reasoning by simulating some fake data under various hypothesized models (including examples where you have several near-duplicate series) and then see what the statistical procedure does.
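
For instance, a minimal fake-data sketch along these lines might look like the following (Python; the sample size, the ten distinct series, and the nine near-copies are all made up for illustration, and nothing here comes from the CFNAI itself):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500  # number of time periods (made up)

    # One underlying "activity" signal; every series measures it about equally well.
    activity = rng.normal(size=n)
    diverse = activity[:, None] + rng.normal(size=(n, 10))    # 10 distinct noisy series
    copies = diverse[:, :1] + 0.05 * rng.normal(size=(n, 9))  # 9 near-copies of series 0

    X = np.hstack([diverse, copies])
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each series

    # First principal component = leading eigenvector of the correlation matrix.
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    first_pc = eigvecs[:, -1]  # np.linalg.eigh sorts eigenvalues in ascending order

    dup_block = [0] + list(range(10, 19))  # series 0 plus its 9 near-copies
    print("weights on series 0 and its near-copies:", np.round(np.abs(first_pc[dup_block]), 2))
    print("weights on the other 9 distinct series: ", np.round(np.abs(first_pc[1:10]), 2))

If the argument in the question is right, the weights on series 0 and its near-copies should come out noticeably larger than the weights on the other series, even though by construction each series measures the underlying signal about equally well.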

5 thoughts on “Question about principal component analysis”

  1. This is a general problem in principal components analysis (and the related, but different, factor analysis). If you stick a lot of the same thing in, you'll get that to emerge. Generally, for things like PCA, you assume that there is a population of possible indicators, and you randomly sample from those indicators.

    However, the ideal of the population of possible indicators isn't achievable, at least without feeding in some knowledge of what you want. I think the questioner is correct – they should at least think about what the measures mean. Too many of one type of measure, and that will emerge as a (or the) principal component. Cattell referred to a 'bloated specific' – in terms of personality measures, if you rephrase the same item, it will emerge as a factor / component, but it won't mean anything. A component built from "I like going to parties," "I don't like parties," and "I'd rather stay at home than go to a party" is a measure of how much you like parties, not anything else.

    Psychologists have been using the first (unrotated) principal component approach for a long time to define g – a measure of general intelligence; you give a bunch of tests, which purport to measure intelligence, and you find the first unrotated PC – that's g. You hear people talk about the g-loading of a particular test; that's its component loading. However, that's a definition that's been refined and changed over time – as the nature of g has been refined, the kinds of tests that people are given have changed, which changes the nature of g. And so on.

    I'd say you've got to think about what you're measuring. PCA has its uses, but you can't use it to sort out a mess that you've got into because you (or someone else) haven't thought about the measures. And if you have thought about the measures, totals or averages are likely to be as good. (My copy of G&H is at work, so I haven't checked to see what they say about this – but I agree with their actions.)

    Just my rambling thoughts.

    Jeremy

  2. I found an interesting blog article on factor analysis in the context of intelligence testing yesterday. Part of the article describes how exploratory factor analysis, like principal components analysis, needs to be complemented by confirmatory factor analysis. Depending on what you intend to use the CFNAI for, the arguments on the limited usefulness of principal components for discovering causal structure might also be interesting.

  3. Jeremy is right — variable selection can have a big impact on what a PCA will tell you. A colorful summary of the point is, "Garbage in, garbage out."

    With PCA (or factor analysis), it is useful to think of variable selection like a sampling issue. If you imagine the universe of economic indicators that comprise "economic activity," the question is, are the 85 CFNAI variables in the PCA a representative sample of that universe? Or have you accidentally sampled a nonrepresentative subset, or even sampled the same "unit" multiple times (as in your hypothetical example of 99 redundant predictors)?

    The upshot is, I'd probably focus my critical eye on design considerations that come prior to the statistical analysis. The important question is which variables are included (is there garbage going in?). The more you trust that selection, the more you can be confident in the weights that the PCA produces (is there garbage coming out?).

  4. In geophysics, we use localized PCA quite often to filter 'random noise' from multi-channel data. Here we assume one channel is similar to its spatial neighbor, and that useful signal resides in the first few principal components while noise resides in the higher principal components. Basically, we discard the higher PCs. How localized the PCA needs to be is adjusted for each data set.
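
    A non-localized toy version of that filter might look like the sketch below (Python; the channel count, the noise level, and keeping k = 2 components are all assumptions for illustration, and a real localized PCA would work window by window rather than on the whole record):

        import numpy as np

        rng = np.random.default_rng(1)
        n_samples, n_channels, k = 1000, 12, 2  # k = number of PCs to keep (assumed)

        # Neighboring channels share a common waveform; each adds its own random noise.
        t = np.linspace(0.0, 10.0, n_samples)
        signal = np.sin(2 * np.pi * t)[:, None] * rng.uniform(0.8, 1.2, n_channels)
        data = signal + 0.5 * rng.normal(size=(n_samples, n_channels))

        # PCA via SVD of the centered data: keep the first k components, discard the rest.
        mean = data.mean(axis=0)
        U, s, Vt = np.linalg.svd(data - mean, full_matrices=False)
        filtered = (U[:, :k] * s[:k]) @ Vt[:k] + mean

        print("residual std before filtering:", np.std(data - signal).round(3))
        print("residual std after filtering: ", np.std(filtered - signal).round(3))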

  5. I think of PCA similarly to Sanjay above. The first component may combine nearly similar variables into a cleaner signal. The rest may include other factors or noise or both. I'm a newbie to PCA but that's how it strikes me.
