An Investigation of Methods for Visualising Highly Multivariate Datasets
6. Finding Outliers - the Synthesised Data Set
Each of the methods of visualisation outlined above was applied to the synthesised data set described in the appendix. As noted there, this data set was deliberately chosen to exhibit 'pathological' behaviour, in the form of a single outlier. This outlier can be thought of as a sixth-order outlier, in that no five-way combination of the variables V_{1} to V_{6} appears to have any unusual observations. The outlier lies at the six-dimensional point (0,0,0,0,0,0), whilst all other observations satisfy V_{1}^{2} + V_{2}^{2} + V_{3}^{2} + V_{4}^{2} + V_{5}^{2} + V_{6}^{2} = 1.
The parallel coordinates plot of the data is shown in figure 14. As is typical of the technique, it is subset selection and highlighting that brings out patterns in the data. Here, the darker lines correspond to cases where |V_{2}| < 0.1. From this, it is clear that an unusual value of V_{1} occurs. In fact, selecting this line alone would reveal a straight line through the zero point of all six parallel axes. Thus, the parallel coordinates plot has shown reasonable success in detecting the outlier.
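The selection step described above can be sketched in a few lines of Python. This is only an illustration: the actual data set is defined in the appendix, so the generator below (the hypothetical `make_sphere_with_origin`, which draws points uniformly on the unit sphere and appends the origin) is an assumption, as are the sample size and the use of raw rather than standardised values for the |V_{2}| < 0.1 condition.

```python
import numpy as np

def make_sphere_with_origin(n=500, dim=6, seed=0):
    """Hypothetical stand-in for the synthesised data set: n points on the
    unit sphere in `dim` dimensions plus one extra point at the origin."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # scale each row to unit length
    return np.vstack([x, np.zeros(dim)])           # last row is the outlier

data = make_sphere_with_origin()

# Highlighting step: keep the cases where |V_2| < 0.1 (column index 1).
subset = np.flatnonzero(np.abs(data[:, 1]) < 0.1)

# Within the highlighted subset, find the case whose line stays closest
# to zero on every one of the six axes.
max_abs = np.abs(data[subset]).max(axis=1)
flat_case = subset[np.argmin(max_abs)]

print("highlighted cases:", subset.size)
print("flattest line is case", flat_case)
```

In this sketch the flattest highlighted line is the appended origin point, which corresponds to the straight line through the zero point of all six axes.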
Note that the parallel axes are calibrated in terms of standardised variables, so that the ranges for V_{1} to V_{6} extend beyond the range from -1 to 1.
Next, consider the projection pursuit approach to the data set. The results are shown in figures 15 and 16. Clearly, these do not show any obvious outliers. It is possible that the problem here is that, due to the nature of this particular data set, there are no linear projections that are able to identify the point at the origin. The three-dimensional case makes the difficulty easy to envisage: a point at the centre of a sphere is projected into the middle of the projected surface points, whichever direction of projection is chosen. Similarly disappointing results are experienced with RADVIZ (see figure 17). It is likely that similar arguments apply here (although the mapping from six to two dimensions is no longer linear).
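The argument about linear projections can be checked numerically. The sketch below is not the projection pursuit procedure used for figures 15 and 16; it simply tries a number of random two-dimensional linear projections of an assumed stand-in data set (500 points on the unit sphere in six dimensions, plus the origin) and counts how many ordinary points land close to the projected outlier. The helper `random_plane`, the sample size, and the radius of 0.2 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed stand-in data: 500 points on the unit sphere in 6-D, plus the origin.
sphere = rng.standard_normal((500, 6))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
origin = np.zeros(6)

def random_plane(dim, rng):
    """Return a dim x 2 matrix with orthonormal columns (a random 2-D projection)."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    return q

neighbour_counts = []
for _ in range(20):
    plane = random_plane(6, rng)
    projected = sphere @ plane   # 500 x 2 projected coordinates
    centre = origin @ plane      # the origin always maps to (0, 0)
    dist = np.linalg.norm(projected - centre, axis=1)
    # Count ordinary points that fall within 0.2 of the projected outlier.
    neighbour_counts.append(int((dist < 0.2).sum()))

print("fewest close neighbours over 20 projections:", min(neighbour_counts))
```

Because every projection places ordinary points close to the image of the origin, the outlier is hidden inside the projected cloud rather than separated from it.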
It seems that unmodified projection-based approaches do not work well with the data set generated here. However, it is possible that some more flexible non-linear approaches might be helpful. For example, if one were to analyse the squares of the variables, the outlier would be much more easily detected. To see this, note that V_{1}^{2} + V_{2}^{2} + V_{3}^{2} + V_{4}^{2} + V_{5}^{2} + V_{6}^{2} is equal to one for all points except the outlier, for which the expression is equal to zero. However, one would need very strong prior knowledge to consider using this particular transform! It should also be noted that the slicing technique discussed in the previous section might have helped to highlight the outlier in the projective methods. It is significant that the parallel coordinates plot showed few patterns until a subset was selected and highlighted.
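The effect of the squaring transform mentioned above is easy to illustrate. The sketch below again uses an assumed stand-in for the synthesised data (points drawn on the unit sphere in six dimensions with the origin appended as the last case), squares each variable, and flags any case whose squared values do not sum to one. After squaring, this sum is a linear combination of the transformed variables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed stand-in data: 500 points on the unit sphere in 6-D, plus the origin.
sphere = rng.standard_normal((500, 6))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
data = np.vstack([sphere, np.zeros(6)])

# Non-linear transform: replace each variable by its square.
squares = data ** 2

# The sum of the squared variables is 1 on the sphere and 0 at the origin.
row_sum = squares.sum(axis=1)
suspects = np.flatnonzero(~np.isclose(row_sum, 1.0))

print("cases flagged as unusual:", suspects)
```

Only the appended origin point is flagged, which is consistent with the remark that the transform works but relies on knowing the structure of the data in advance.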