An Investigation of Methods for Visualising Highly Multivariate Datasets
Suppose we have a set of $m$ continuous observed variables for each of $n$ cases, and denote the $j$th observation on the $i$th case by $x_{ij}$. Such a situation frequently arises when examining social data. For example, the cases might be census wards, and the observations might be rates computed from census variables, such as the percentage of households without cars, the percentage of households without central heating and so on. Before calibrating models based on these variables, it is generally useful to apply exploratory techniques to the data. Many of these techniques are graphical in nature: for example, histograms, box plots or scatter plots may be drawn. However, these approaches are limited by the fact that they can represent the relationship between at most two variables at any one time. In fact, apart from the scatter plot, the methods above provide graphical representations of only a single variable.
In order to decide how useful a representation is, one needs to consider the kind of feature in the data that one wishes to detect. Three common possibilities in social science data are clusters, outliers and geographical trends. Clusters are distinct groupings in the data points, usually corresponding to multimodality in the underlying probability distribution for the data. Outliers are one-off cases that have unusual combinations of observed values when compared to the remainder of the sample. Geographical trends are fairly self-explanatory, but it is worth noting that as well as univariate trends, such as house prices increasing in certain areas, there could be trends in the relationships between some variables. It is also worth noting that these trends are rarely linear.
For most types of feature, there is variability in subtlety. For example, an extremely high or low value of one particular variable would be a fairly crude type of outlier. This could be detected using a well-established univariate graphical tool such as a box-and-whisker plot (Velleman and Hoaglin, 1981). On the other hand, a more subtle outlier might be a point in the centre of a sphere, when all of the other points are close to its surface. The problem here is that none of the three coordinate values $(x_1, x_2, x_3)$ defining the central point is unusual in its own right and, even worse, none of the possible coordinate value pairs such as $(x_1, x_2)$ is unusual. Thus, no simple univariate or bivariate representation could detect this outlier. The problem would become even worse if, instead of a sphere, a ten-dimensional hypersphere were substituted in the previous example! Generally, the more subtle the feature, the more sophisticated the graphical representation needed to reveal it. This leads to the statement of a fundamental problem: "How can the interactions between large numbers of variables be represented in a manageable number of dimensions?"
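The sphere example can be made concrete with a short simulation. The following is only an illustrative sketch, not the simulated data set used in this Case Study: the sample size, the noise level, the neighbourhood radius of 0.25 and all variable names are assumptions chosen for the illustration. It places points close to the surface of a unit sphere, adds one point at the centre, and checks how that point appears in one, two and three dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative sample: n points close to the surface of a unit sphere.
n = 1000
directions = rng.normal(size=(n, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = 1.0 + rng.normal(scale=0.02, size=(n, 1))  # small radial noise
surface = directions * radii

# The 'pathological' outlier: a single point at the centre of the sphere.
centre = np.zeros((1, 3))
data = np.vstack([surface, centre])  # the outlier is the last row

# Univariate view: is each coordinate of the centre inside the
# interquartile range of the corresponding variable?
q1, q3 = np.percentile(surface, [25, 75], axis=0)
inside_iqr = (centre[0] >= q1) & (centre[0] <= q3)

# Bivariate view: in each two-variable projection, count how many
# surface points fall within 0.25 of the projected centre point.
pairs = [(0, 1), (0, 2), (1, 2)]
neighbours_2d = {
    pair: int(np.sum(np.linalg.norm(surface[:, list(pair)], axis=1) < 0.25))
    for pair in pairs
}

# Trivariate view: distance of every point from the sample centroid.
dist = np.linalg.norm(data - data.mean(axis=0), axis=1)
outlier_index = int(np.argmin(dist))

print("inside IQR on each variable:", inside_iqr)
print("neighbours in each 2-D projection:", neighbours_2d)
print("index of point closest to centroid:", outlier_index)
print("its distance:", dist[outlier_index])
print("smallest distance among surface points:", dist[:-1].min())
```

Under these assumptions, the central point sits inside the interquartile range of every variable and has neighbours in every two-variable projection, so neither a box plot nor a scatter plot singles it out. Only a quantity that uses all three variables together, here the distance from the centroid, separates it from the rest of the sample.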
In this Case Study, two data sets will be used to demonstrate a number of ways of addressing the above problem. The two data sets are described in detail in Appendix A, but, briefly, the first is a set of six socio-economic variables for northern England measured at census ward level, taken from the 1991 census, and the second is a simulated data set designed to have a 'pathological' outlier, as discussed above. The following sections each describe a particular approach to visualisation, giving examples using the census data set. After these sections, a number of specific issues are considered, including a comparison of the way each method responds to the synthesised data.