Statistics with Matlab

Contents

Matlab provides commands and operations for the computation of basic statistics.

randn('seed', 1);
x = randn(1000,1);
y = randn(1000,1);

The mean of a series can be computed by

mean(x)
ans =

    0.0020

If there are some NaNs in the vector x, this command will also be NaN. For this case we can use the command nanmean.

Further commands are

median

median(x)
ans =

    0.0197

standard deviation

std(x)
ans =

    1.0096

variance

var(x)
ans =

    1.0193

largest value

max(x)
ans =

    3.0206

smalles value

min(x)
ans =

   -3.4333

value range

range(x)
ans =

    6.4539

interquartile range (robust estimate for the spread of the data)

iqr(x)
ans =

    1.3076

covariance matrix

cov(x,y)
ans =

    1.0193   -0.0011
   -0.0011    0.9783

correlation matrix

corrcoef(x,y)
ans =

    1.0000   -0.0011
   -0.0011    1.0000

Sometimes it is desired to know the position of the kargest or smalles value in the vector. The commands max and min can also provide this information:

[a i] = max(x)
a =

    3.0206


i =

   317

Now the position of the largest value a in the vector x is in variable i.

To sort the values in the vector x we use the sort command:

xs = sort(x);

or with the index vector of the sorted values

[xs i] = sort(x);

A histogram of the data can be computed with

hist(x)

The default settings for a histogram are ten intervalls. To change this to another number of intervalls, e. g. 50, we call

hist(x, 50)

The number of counts h per intervall i can be accessed by

[h i] = hist(x,50);

Now we would like to know the 5% and 95% percentiles of the distribution of the series x.

prctile(x,[5 95])
ans =

   -1.6467    1.6592

The minimum, maximum, lower and upper quartile and median values can be represented by a box and whisker plot

boxplot([x y])

Where mean and standard deviation are the first two moments of the distribution, skewness and kurtosis are related to the third and fourth moments of the distributions. The skewness reflects the shape or asymmetry of the distribution: if it is negative, the data spread out more to the left of the mean, if it is positive, the data spread out more to the right.

skewness(x)
ans =

   -0.0786

The kurtosis represents how outlier-prone the distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3.

kurtosis(x)
ans =

    3.0258

All moments of a distribution can be derived with the command moment.

To assess statistical significance, the bootstrap statistics is sometimes helpful. With bootstrapping we can get new realisations of the original data series by a random resampling. The Matlab command bootstrp can be applied for a bootstrap statistic. To compute a bootstrap statistic of the mean of our vector x by using 500 new realisations, we call

m = bootstrp(500, 'mean', x);

To show the variation of the mean across all bootstrap realisations, we plot the histogram

hist(m)

As expected, the mean spreads around the origin.

Linear regression models can be useful for the study of relations between two data series. Matlab provides different commands to estimate linear regression coefficients and corresponding statistics.

Least squares fit can be performed by the command regress. Assume a linear system

randn('seed',1);
x = [1:10:500]';
y = 2.5 * x + 80 + 50 * randn(50,1);

plot(x,y,'.')

and estimate the linear fit and the confidence intervals within 95% with

[b, bint] = regress(y, [ones(length(x),1), x], 0.05)
b =

   78.1034
    2.4832


bint =

   50.3461  105.8608
    2.3859    2.5805

In order to get not only an estimate for the regression coefficient, but also the offset, we added a vector consisting of ones in the second argument of regress. The result b yiels the offset 85 and the regression coefficient 2.5, the corresponding confidence intervals are stored in bint.

hold on
plot(1:500, b(2) * [1:500] + b(1))

Residuals can be automatically computed with

[b, bint, residuals] = regress(y, [ones(length(x),1), x], 0.05);
clf, plot(residuals), ylabel('Residuals')

For nonlinear systems, the command polyfit is more appropriate. We consider the following nonlinear system

randn('seed', 1);
y = sin(x/50) ./ x + .002 * randn(50,1);
clf
plot(x,y,'.')

and fit a polynom of order 5

p = polyfit(x, y, 5);

hold on
plot(x,polyval(p,x))
Warning: Polynomial is badly conditioned. Add points with distinct X
         values, reduce the degree of the polynomial, or try centering
         and scaling as described in HELP POLYFIT.

Error bounds can be derived by

[p s] = polyfit(x, y, 5);
[y_fit delta] = polyval(p, x, s);
Warning: Polynomial is badly conditioned. Add points with distinct X
         values, reduce the degree of the polynomial, or try centering
         and scaling as described in HELP POLYFIT.

The variable delta contains the estimate of the standard deviation of the error in predicting a future observation at x by P(x). As error bounds we use 2 delta corresponding roughly to the 95% confidence interval.

plot(x, y_fit + 2 * delta, ':', x, y_fit - 2 * delta, ':')

Exercise

1. Consider the file ecg.dat and compute and interprete the mean, standard deviation, skewness and kurtosis. Plot and interprete the histogram (chose an appropriate number of intervals).

2. The file eeg.dat contains EEG measurements on 59 channels (arranged as columns). Compute a matrix of correlation coefficients between all channels.

3. Consider the first 120 data points in the file ecg.dat. Fit polynoms of different orders to this sequence of data. Compute bootstrap estimates of the last coefficient of the polynom and present the result in a histogram.