An Investigation of Methods for Visualising Highly Multivariate Datasets

9. Appendix A - Data Sets Used in the Study

In this study, two datasets are used. The first of these is derived from the 1991 UK census, at ward level for the Northern region of England, using variables as follows:

LLTI The percentage of persons in households in each ward where a member of the household has some limiting long-term illness. This is the response variable. Note that to control for different age profiles in areas, this is only computed for 45-65 year olds - an age category that is perhaps most likely to suffer LLTI as a result of working in the extractive industries.

CROWDING This is the proportion of households in each census ward having an average of more than one person per room. This is an attempt to measure the level of cramped housing conditions in each ward.

DENSITY This is the housing density of each ward, measured in millions per square kilometer. This is intended to measure the 'rurality' of areas. Note the differences between this and the previous variable - a remote village with poor housing conditions may well score low in this variable, but high in the previous.

UNEMP The proportion of male unemployment in an area. This is generally regarded as a measure of economic well-being for an area.

SC1 The proportion of heads of households whose jobs are classed in social class I in the Census. These are professional and managerial occupations. Whilst the previous variable measures general well-being, this measures affluence.

SP-FAM The proportion of single parent families in an area. This is an attempt to measure the nature of household composition in areas.

The second dataset is a synthesised, six-variable data set. The variables are named V1 to V6. Each data point lies on the surface of a six-dimensional hypersphere of radius one, with the exception of one outlier, which lies at the centroid of the hypersphere. This outlier is particularly 'pathological' in that in any five-dimensional subset of the six variables, the value of this outlier is not particularly unusual. While it is uncertain how often this situation will happen with 'real life' social science data, it does provide a yardstick for assessing each visualisation method in a worst case scenario.

The data may be generated as follows:

Note that a point on the circumference of a unit circle may be parametrised in terms of a single variable $\theta$ by the expression

$$(\cos\theta, \sin\theta)$$

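If $\theta$ is a uniform random variable, then random points on the circumference may be generated from this expression. Call this expression $C_2(\theta)$. Now let $C_n(\theta_1, \ldots, \theta_{n-1})$ be a point on the surface of an n-dimensional unit hypersphere. For example, if n=3, then $C_3(\theta_1, \theta_2)$ is a point on the surface of a sphere. Recursively, we can parametrise $C_{n+1}$ by

$$C_{n+1}(\theta_1, \ldots, \theta_n) = \left(\cos\theta_n \, C_n(\theta_1, \ldots, \theta_{n-1})\right) * \sin\theta_n$$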

where * is a vector concatenation operator such that (x,y)*z = (x,y,z). One can check inductively that if the squared elements of $C_n$ sum to one, then the squared elements of $C_{n+1}$ also sum to one. Since this can be checked directly for $C_2$, it is true for all n > 2 also.
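
The inductive check can be written out in a few lines. This is a sketch that assumes the recursion scales $C_n$ by $\cos\theta_n$ and appends $\sin\theta_n$ as the new last element; the placement of sine and cosine is an assumption, and swapping them gives the same result. Here $\lVert \cdot \rVert^2$ denotes the sum of squared elements.

```latex
% Base case: the unit circle
\lVert C_2(\theta_1) \rVert^2 = \cos^2\theta_1 + \sin^2\theta_1 = 1

% Inductive step: assume \lVert C_n \rVert^2 = 1
\begin{align*}
\lVert C_{n+1} \rVert^2
  &= \cos^2\theta_n \, \lVert C_n \rVert^2 + \sin^2\theta_n \\
  &= \cos^2\theta_n \cdot 1 + \sin^2\theta_n \\
  &= 1
\end{align*}
```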

Thus, the surface of an n-dimensional sphere can be parametrised by a vector $(\theta_1, \ldots, \theta_{n-1})$. By generating uniform random numbers for the elements of this vector and applying this transform, it is possible to generate points on the surface of the hypersphere. The simulation is then finalised by adding the origin point as the outlier in the data set.
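
The generation procedure described above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the code used in the study: the function names `hypersphere_point` and `make_dataset`, the sample size of 100, the angle range [0, 2π) and the placement of sine and cosine in the recursion are all assumptions, since the text does not specify them.

```python
import numpy as np


def hypersphere_point(thetas):
    """Map angles (theta_1, ..., theta_{n-1}) to a point on the unit n-sphere."""
    thetas = np.asarray(thetas, dtype=float)
    # Base case C_2(theta_1) = (cos theta_1, sin theta_1): a point on the unit circle.
    point = np.array([np.cos(thetas[0]), np.sin(thetas[0])])
    for theta in thetas[1:]:
        # Assumed recursion: scale the current point by cos(theta) and
        # concatenate sin(theta) as the new last coordinate.
        point = np.append(np.cos(theta) * point, np.sin(theta))
    return point


def make_dataset(n_points=100, n_dims=6, seed=0):
    """Generate n_points on the unit hypersphere surface plus one outlier at the origin."""
    rng = np.random.default_rng(seed)
    # Assumption: each angle is drawn uniformly from [0, 2*pi).
    thetas = rng.uniform(0.0, 2.0 * np.pi, size=(n_points, n_dims - 1))
    surface = np.array([hypersphere_point(t) for t in thetas])
    # The outlier sits at the centroid (origin) of the hypersphere.
    outlier = np.zeros((1, n_dims))
    return np.vstack([surface, outlier])


data = make_dataset()
# Every generated point except the last should have unit length.
norms = np.linalg.norm(data[:-1], axis=1)
```

With the default arguments, `data` has one row per observation and six columns corresponding to V1 to V6, with the outlier in the last row. Angles drawn uniformly in this way place every point on the surface, but they do not necessarily spread the points evenly over it.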
