The data with which we are primarily concerned consist of a series of observations and measurements made on various subjects, patients, objects or other entities of interest. They might comprise the results of applying a battery of cognitive tests to a sample of patients with Alzheimer's disease, the taxonomic characteristics of bacteria, or the relative proportions of several constituents of
different types of rock (or food), for example. One particular type of multivariate
data set involves the collection of repeated measures of the same characteristics over time. In a situation that might be termed doubly multivariate, we might
have a multidimensional set of features that are assessed at each of
several time points.
A typical multivariate data matrix, X, will have the form
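$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$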
where the typical element, $x_{ij}$, is the value of the jth variable for the ith individual.
If there are several distinct groups of individuals, one of the $x_{ij}$s might be a
categorical variable with values of 1, 2, etc. to distinguish these groups. The
number of individuals under investigation is n, and the number of observations
taken on each of these n individuals is p. Table 1.1 gives a hypothetical example
of such a multivariate data matrix. Here n = 10, p = 7 and, for example,
$x_{34}$ = 135.
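As a minimal sketch of this layout, the data matrix of Table 1.1 can be held in Python as a list of rows, one row per individual. The names `columns`, `X` and `element` are illustrative, and the use of `None` for NK is an assumption made here rather than anything prescribed by the text. The sketch reads the Individual identifier as the first of the p = 7 columns, which is the reading under which $x_{34}$ = 135.

```python
# Columns of Table 1.1, counted from 1 in the text (so IQ is column 4).
columns = ["Individual", "Gender", "Age", "IQ", "Depression", "Health", "Weight"]

# One row per individual; None stands in for NK (not known).
X = [
    [1, "Male", 21, 120, "Yes", "Very good", 150],
    [2, "Male", 43, None, "No", "Very good", 160],
    [3, "Male", 22, 135, "No", "Average", 135],
    [4, "Male", 86, 150, "No", "Very poor", 140],
    [5, "Male", 60, 92, "Yes", "Good", 110],
    [6, "Female", 16, 130, "Yes", "Good", 110],
    [7, "Female", None, 150, "Yes", "Very good", 120],
    [8, "Female", 43, None, "Yes", "Average", 120],
    [9, "Female", 22, 84, "No", "Average", 105],
    [10, "Female", 80, 70, "No", "Good", 100],
]

n = len(X)      # number of individuals (rows)
p = len(X[0])   # number of entries recorded per individual (columns)


def element(X, i, j):
    """Return x_ij using the 1-based indexing of the text."""
    return X[i - 1][j - 1]


x34 = element(X, 3, 4)   # fourth entry for the third individual
```

The only subtlety is the shift between the 1-based subscripts of the text and the 0-based indices of Python, which `element` hides.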
Table 1.1 Data matrix for a hypothetical example of 10 individuals

Individual  Gender  Age (yrs)  IQ   Depression  Health     Weight (lbs)
1           Male    21         120  Yes         Very good  150
2           Male    43         NK   No          Very good  160
3           Male    22         135  No          Average    135
4           Male    86         150  No          Very poor  140
5           Male    60         92   Yes         Good       110
6           Female  16         130  Yes         Good       110
7           Female  NK         150  Yes         Very good  120
8           Female  43         NK   Yes         Average    120
9           Female  22         84   No          Average    105
10          Female  80         70   No          Good       100

Note: NK = not known

In many cases, as in Table 1.1, the variables measured on each of the n
individuals will be of different types depending on whether they are conveying
quantitative or merely qualitative information. The most common way of
distinguishing these types is the following:
• Nominal - unordered categorical variables. Examples include treatment
allocation, the gender of the respondent, hair colour, presence or absence of
depression, and so on.
• Ordinal - where there is ordering but no implication of distance between
the different points of the scale. Examples include social class and self-perception
of health (each coded from I to V, say), and educational level
(no schooling, primary, secondary or tertiary education).
• Interval - where there are equal differences between successive points on the
scale, but the position of zero is arbitrary. The classic example is the measurement
of temperature using the Celsius or Fahrenheit scales. In some cases a
variable such as a measure of depression, anxiety or intelligence, for example,
might be treated as if it were interval-scaled when this, in fact, might be
difficult to justify. We take a practical approach to such problems
and frequently treat these variables as interval-scaled measures, but the reader
should always question whether this is a sensible thing to do and
what implications a wrong decision might have.
• Ratio - the highest level of measurement, where one can investigate the
relative magnitudes of scores as well as the differences between them, because the position of zero is fixed. The classic example is the absolute measure of
temperature (in Kelvin, for example), but other common ones include age (or any other time from a fixed event), weight and length.
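The distinction between the interval and ratio levels can be checked with a little arithmetic. The sketch below, in which the two temperatures and all names are chosen purely for illustration, shows that a difference between two temperatures is the same on the Celsius and Kelvin scales, whereas their ratio is only meaningful on the Kelvin scale, where zero is fixed.

```python
def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15


# Two illustrative temperatures, first in Celsius and then in Kelvin.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree across the two scales (up to floating-point rounding).
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios do not: 20 degrees Celsius is not "twice as hot" as 10 degrees.
ratio_c = b_c / a_c   # 2.0, an artefact of where Celsius places zero
ratio_k = b_k / a_k   # roughly 1.035 on the absolute scale
```

The same check fails for Fahrenheit as well, since its zero is equally arbitrary.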
The qualitative information in Table 1.1 could have been presented in terms
of numerical codes (as often would be the case in a multivariate data set) such
that Gender = 1 for males and Gender = 2 for females, for example, or Health = 5 when
very good and Health = 1 for very poor, and so on. But it is vital that both the
user and consumer of these data appreciate that the same numerical code (1,
say) will convey completely different information, depending on the scale of
measurement.
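A small sketch of such a coding scheme makes the point concrete. The text fixes only Gender = 1 or 2 and the two ends of the Health scale; the intermediate Health codes below, and the category "Poor" (which does not occur in Table 1.1), are assumptions made for illustration, as are all the variable names.

```python
from statistics import median_low

# Nominal codes: the numbers are labels only and carry no order.
gender_codes = {"Male": 1, "Female": 2}

# Ordinal codes: order matters, distances between codes do not.
# Codes 2 to 4 are assumed; the text gives only 1 and 5.
health_codes = {"Very poor": 1, "Poor": 2, "Average": 3, "Good": 4, "Very good": 5}

# Gender and Health columns of Table 1.1, in order of individual.
gender = ["Male"] * 5 + ["Female"] * 5
health = ["Very good", "Very good", "Average", "Very poor", "Good",
          "Good", "Very good", "Average", "Average", "Good"]

gender_coded = [gender_codes[g] for g in gender]
health_coded = [health_codes[h] for h in health]

# Reverse lookups, to decode a number back into its label.
gender_labels = {code: label for label, code in gender_codes.items()}
health_labels = {code: label for label, code in health_codes.items()}

# The same code, 1, means something quite different in each variable.
meaning_of_1 = (gender_labels[1], health_labels[1])

# Order-based summaries make sense for the ordinal variable only.
individual_1_healthier_than_4 = health_coded[0] > health_coded[3]
typical_health = health_labels[median_low(health_coded)]
```

Nothing stops one from averaging `gender_coded`, but the result would have no substantive interpretation, which is exactly the danger the paragraph above warns about.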
A further feature of Table 1.1 is that it contains missing values (NK). Age
has not been recorded for individual number 7, and no IQ value is available
for individuals 2 and 8. Missing observations arise for a variety of reasons,
and it is essential to put some effort into discovering why an observation is
missing. One explanation is that the variable might not be applicable to that individual. In a taxonomic study, for example, in which the investigator
might wish to classify dinosaur fossils, 'wing length' might be an essential
variable. Dinosaurs without wings will have missing values for this
variable! In other cases the measurement might be missing by accident or
because the respondent either forgot or refused to provide the information.
Occasionally, one might be able to obtain the information from elsewhere, or
to repeat the measurement, and so replace the missing value with an actual
observation.
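Before asking why values are missing, it helps to establish exactly where they are. The following minimal sketch lists the individuals with a missing Age or IQ in Table 1.1; the helper `missing_individuals` is hypothetical, and `None` is again used here to stand in for NK.

```python
# Age and IQ columns of Table 1.1, in order of individual; None stands in for NK.
age = [21, 43, 22, 86, 60, 16, None, 43, 22, 80]
iq = [120, None, 135, 150, 92, 130, 150, None, 84, 70]


def missing_individuals(values):
    """Return the 1-based individual numbers whose value is missing."""
    return [i for i, v in enumerate(values, start=1) if v is None]


missing_age = missing_individuals(age)
missing_iq = missing_individuals(iq)
```

The resulting lists identify individual 7 for Age and individuals 2 and 8 for IQ, in agreement with the description above.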
Missing values can cause problems for many of the methods of analysis
described in this text, particularly if there are many of them. Although there
are many ways of dealing with missing-data problems (both valid and invalid!),
these are, in general, beyond the scope of this text. One method with fairly general applicability, however, is to impute ('estimate') the missing values
from a knowledge of the data that are not missing. Such imputation methods
range from the very simple (replace the missing value with the mean of the
values from subjects with non-missing data, for example) to the technically
complex (multiple imputation acknowledging the stochastic nature of the
data) and are briefly described in Appendix B. However, one should always
keep in mind that the imputed values are not real measurements. We do not get something for nothing! And if a substantial proportion of the
individuals have large amounts of missing data, one should seriously question
whether any form of statistical analysis is worth the bother.
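The simplest of these approaches, mean imputation, can be sketched in a few lines. This is only an illustration of the "very simple" end of the range, applied to the Age and IQ columns of Table 1.1; the function name `impute_mean` and the use of `None` for NK are assumptions of the sketch, and it is not a substitute for the methods of Appendix B.

```python
from statistics import mean


def impute_mean(values):
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Age and IQ columns of Table 1.1, in order of individual; None stands in for NK.
age = [21, 43, 22, 86, 60, 16, None, 43, 22, 80]
iq = [120, None, 135, 150, 92, 130, 150, None, 84, 70]

age_imputed = impute_mean(age)   # individual 7 receives the mean observed age
iq_imputed = impute_mean(iq)     # individuals 2 and 8 receive the mean observed IQ
```

Filling in the mean leaves the mean of each column unchanged but understates its variability, which is one reason the imputed values should not be mistaken for real measurements.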