Also available as an Acrobat File

Editorial

Abstract

Introduction

The Projection Pursuit
  Context
  Method
  Interpretation
  Choice of I
  Geographically weighted regression

Appendix A - Datasets

Appendix B

Case Studies Index

An Investigation of Methods for Visualising Highly Multivariate Datasets

2. The Projection Pursuit Approach to Visualisation

2.1 Context

Suppose, for a set of cases, m variables are recorded. Then each case can be thought of as a point in m-dimensional space. Unfortunately, unless m is less than or equal to 3 it is not possible to view these points directly. However, it is possible to project an m-dimensional set of points onto a two-dimensional plane, or a three-dimensional volume. Here we will restrict the problem to projections onto two-dimensional planes. To visualise the concept of projection, figures 1 and 2 should be considered. In both figures a rectangular ´screen' is shown, either above or to the right of a set of three dimensional data. Imagine a very bright light on the other side of the data points. The shadows thrown on the screen from the data points are the projection. The dotted lines in the diagram link the data points to their projected images.

Figure 1: Example of point projection (1)

In figure 1 the data points are projected onto a plane to the right. Here the projected image shows two distinct clusters of points. In figure 2 the same data points are projected onto a plane above. Here the projected image shows only a single cluster of points. Obviously in this case the projection is from R³ to R², but similar (and sometimes more complex) phenomena occur when the projection is from R^m dimensions and m > 3.

Figure 2: Example of point projection (2)

The above example demonstrates that different projections of the same data set can reveal different aspects of the data structure - indeed some projections can fail to reveal any structure. There are in fact an infinite number of possible projections to choose from, so which one should be used? Projection pursuit (Jones and Sibson, 1987} is concerned with resolving this kind of problem.

2.2 The Projection Pursuit Method

To see how this technique operates, it is first worth noting that projections from R^m to R² are linear mappings. Thus, if X = {x_ij} is a matrix whose (i,j)th element is the jth variable for observation i, we can write the general projection from R^m to R² as (z₁, z₂) = (Xa', Xb'). Here a and b are m-dimensional row vectors defining the linear transform, and z₁and z₂are n-dimensional column vectors representing the points on the projection screen. The prime denotes transposition. Choosing a projection is now a matter of choosing a and b.

The next problem is to decide what kind of feature one wishes to detect. When this decision is made, one attempts to measure the degree to which this feature is exhibited in (z₁, z₂). Call this measure I(z₁, z₂) It is sometimes called the index function. For example, suppose one wishes to detect clusters. A common test statistic for clustering in two dimensional data is the mean nearest neighbour distance (MNND). Lower values of this statistic indicate greater clustering. Thus, here I(z₁, z₂) is the MNND for the data set (z₁, z₂). Noting that the expression I(..,..) can be written in the form I(Xa', Xb'), the projection choice problem can be thought of as an optimisation problem in which a and b must be chosen to minimise I. Essentially, this is the projection pursuit process.

At this stage, however, some careful thought should take place. Clearly the nearest neighbour distance index is scale dependent. Multiplying a and b by a constant will also multiply the MNND by this factor - so that one can make I as small as one likes with an appropriate choice of constant. This problem can be avoided by adding the constraint that z₁and z₂are standardized - that is that they both have a mean of zero and a variance of one. Also, it is helpful to add the constraint that z₁and z₂are not correlated. This ensures that maximum information is given in the two dimensional plot, in the sense that if two variables are correlated, they are ´sharing' some underlying one-dimensional feature pattern rather than each representing different patterns.

The projection pursuit algorithm is thus equivalent to a constrained optimisation problem. The difficulty with the specification given is that the constraints are given in terms of z₁and z₂rather than a and b. However, suppose each variable in X is mean-centred and then transformed to its principal components - and the principal components are standardised so that each has a variance of one, giving a transformed matrix Q. Q is a linear transform of X, say XP, where P is an m by m matrix, so that a linear mapping of Q is also a linear mapping of X. Thus, we can re-write the index in the form I(z₁, z₂) = I(Qc', Qd'). In this case, the column vectors c and d take the same form as a and b. In fact Pc' = a' and Pd' = b'. The advantage of the re-stated problem is the fact that if c'c = 1, d'd = 1 and c'd = 0 then z₁and z₂will be uncorrelated, have zero mean and variance of one. Thus, if constraints are imposed on c and d then z₁and z₂automatically satisfy the constraints proposed above. Thus, the projection pursuit problem may be stated:

Minimise

I(Qc', Qd')

Subject to

c'c = 1 and

d'd = 1 and

c'd = 0

This is a standard form for a constrained optimisation problem. Computationally, the difficulty of this problem depends on the index function I. Applying the method with I as the MNND is fairly intensive, mainly because it involves finding the nearest neighbour of every point in the projected data set.

Figure 3: Minimised MNND projection of census data

In figure 3 the result of applying this technique to the census data is shown. Due to the nature of the index function, the rotation of the point pattern obtained is arbitrary, so there is no clear interpretation of the individual axes in the plot.

No obvious clusters exist in the plot, suggesting perhaps that the data is not bimodal in any way detectable by projecting onto a two dimensional plane. However, some features are very clear, most notably a ´spur' in the lower part of the plot.

Graphics Multimedia Virtual Environments Visualisation Contents