Surrogate Data

Marco Thiel

Stochastic process and realization

If you throw a die ten times, one possible outcome is the series {2,5,3,6,5,1,4,4,3,1}. If you repeat the experiment, the outcome will in general be different: {3,1,6,4,5,3,4,2,1,2}, or {3,3,2,6,5,6,4,2,3,1}, or {6,4,5,3,4,2,1,2,6,5}. The series {1,1,1,1,1,1,1,1,1,1} is exactly as probable as each of the other time series mentioned. Each outcome is called a realization of a stochastic process. The word "process" here simply means a game (of chance). Another such game or process is tossing a coin: if you toss it ten times, {head, head, tail, head, tail, head, tail, tail, tail, head} is a possible realization.
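The idea of several realizations of one process can be tried out on a computer. The following is a minimal sketch, assuming a fair six-sided die simulated with Python's pseudo-random generator; the function name `throw_die` and the seed are illustrative choices, not part of the original text. It also computes the probability of any one specific series of ten throws, which is the same for {1,1,1,1,1,1,1,1,1,1} as for every other series.

```python
import random
from fractions import Fraction


def throw_die(n_throws, rng):
    """Return one realization: a list of n_throws outcomes of a fair die."""
    return [rng.randint(1, 6) for _ in range(n_throws)]


# Fixed seed (an arbitrary choice) so that the run can be repeated.
rng = random.Random(0)

# Four realizations of the same stochastic process.
realizations = [throw_die(10, rng) for _ in range(4)]
for realization in realizations:
    print(realization)

# Every specific series of ten throws has probability (1/6)^10,
# whether it looks "random" or consists of ten 1's.
p_sequence = Fraction(1, 6) ** 10
print(p_sequence, float(p_sequence))
```

Each call to `throw_die` plays the game once; running the loop again without a fixed seed gives different realizations.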

A stochastic process is a game or experiment whose realizations are not identical. An example of such a process is the dice game mentioned above. A deterministic process is a game or experiment whose realizations, given the same initial conditions, are all identical. Examples of such processes abound in classical physics; think of the mathematical pendulum, for example.

In many physical systems of interest today we only have observational data in the form of a single realization of a stochastic process. In the dice game it is easy to repeat the experiment many times and so obtain a huge number of realizations. With meteorological data, however, it is impossible to repeat the experiment. Astrophysical data are a further example.

If someone gives you a measured time series {3,3,2,6,5,6,4,2,3,1} that is supposed to come from throwing a die, you may ask whether it is plausible that it really belongs to the dice game. Probability theory tells us that for a perfect die the probability of obtaining each of the 6 numbers is 1/6 ≈ 0.167. The mean or expected value is 3.5. The relative frequency of each number in our time series is: p(1)=1/10=0.1, p(2)=2/10=0.2, p(3)=3/10=0.3, p(4)=1/10=0.1, p(5)=1/10=0.1, p(6)=2/10=0.2. The sample time series has a mean of 3.5, equal to the theoretical value. Now consider the time series {1,1,1,1,1,1,1,1,1,1}. The relative frequencies of the numbers are: p(1)=1, p(2)=0, p(3)=0, p(4)=0, p(5)=0, p(6)=0. The mean is 1. We see that in the latter case the relative frequencies and the mean differ strongly from the theoretical values, whereas the first sample differs much less. These results indicate that the series {1,1,1,1,1,1,1,1,1,1} can in principle be a realization of the dice game, but one may suspect that something went wrong: perhaps the die was not perfect (perhaps a fool has made a die with only 1's!), perhaps the data record is wrong...
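The relative frequencies and means quoted above can be recomputed with a few lines of code. This is a minimal sketch; the helper names `relative_frequencies` and `mean` are introduced here only for illustration.

```python
from collections import Counter


def relative_frequencies(series, faces=range(1, 7)):
    """Fraction of throws showing each face of the die."""
    counts = Counter(series)
    n = len(series)
    return {face: counts[face] / n for face in faces}


def mean(series):
    """Arithmetic mean of the series."""
    return sum(series) / len(series)


sample = [3, 3, 2, 6, 5, 6, 4, 2, 3, 1]
all_ones = [1] * 10

for name, series in [("sample", sample), ("all ones", all_ones)]:
    print(name, relative_frequencies(series), "mean =", mean(series))

# Theoretical values for a perfect die, for comparison.
print("theory: p =", 1 / 6, "mean =", sum(range(1, 7)) / 6)
```

Comparing the printed values with the theoretical ones makes the argument of the paragraph explicit: the first series stays close to p = 1/6 and mean 3.5, the second does not.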

Hypothesis test and surrogate data

In the example above it was easy to test the hypothesis that the data sample was generated by our dice game, because we had a model. We knew many properties of the game theoretically: the probability of each number, the mean, and so on. So we were able to perform a hypothesis test based on theoretical knowledge.

A second possibility would have been to take a die, produce a lot of new realizations (surrogates) and compare them to our samples. We could calculate the mean and the relative frequency of occurrence of each number. If we have enough surrogates we can test our hypothesis. It is not necessary to use a real die; one can simulate a die with the computer. So if we have a model, we can easily generate surrogate data.
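A surrogate-based test of this kind can be sketched as follows. The choices made here are assumptions for illustration and not prescribed by the text: the test statistic is the mean of the series, 10000 surrogates are drawn from a simulated fair die, and the result is the fraction of surrogates whose mean lies at least as far from the average surrogate mean as the sample's mean does.

```python
import random


def simulate_die(n_throws, rng):
    """One surrogate: n_throws outcomes of a simulated fair die."""
    return [rng.randint(1, 6) for _ in range(n_throws)]


def mean(series):
    return sum(series) / len(series)


def surrogate_p_value(sample, n_surrogates=10000, seed=0):
    """Fraction of surrogates whose mean is at least as extreme as the sample's.

    The statistic (the mean) and the number of surrogates are
    illustrative assumptions; other statistics can be used instead.
    """
    rng = random.Random(seed)
    surrogate_means = [
        mean(simulate_die(len(sample), rng)) for _ in range(n_surrogates)
    ]
    centre = mean(surrogate_means)
    observed = abs(mean(sample) - centre)
    n_extreme = sum(1 for m in surrogate_means if abs(m - centre) >= observed)
    return n_extreme / n_surrogates


p_sample = surrogate_p_value([3, 3, 2, 6, 5, 6, 4, 2, 3, 1])
p_ones = surrogate_p_value([1] * 10)
print("sample:  ", p_sample)
print("all ones:", p_ones)
```

A small fraction means that hardly any surrogate looks like the sample with respect to the chosen statistic, which speaks against the hypothesis that the sample was generated by the dice game. No theoretical value enters the test; everything is estimated from the surrogates.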

In some cases we cannot perform the experiment and do not have any physical model with which to generate the surrogate data. In this case we can try to generate surrogate data from a more mathematical point of view. We can construct surrogate data that share certain properties with the measured data, for example the same mean, the same variance and the same periodogram (often called the spectrum; be careful, there is a difference!). After that we compare further properties of the sample and the surrogates and try to decide whether the mean, the variance and the periodogram capture all important properties of the data. To give you an idea of this method we present a sample and three surrogate time series that have the same mean, variance and periodogram.

Let the first graphic be the sample:

Now we present the first surrogate:

A second surrogate is:

A third:

You will notice that all surrogates have the same periodicity as the sample. But you can also see that the original sample is somehow different from the other data sets. It is possible to quantify the difference and to decide whether it is probable that the sample is fully described by the mean, variance and periodogram alone.
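One common way to build surrogates that share the mean, variance and periodogram of a sample is to randomize the phases of its Fourier transform. The text does not say which method was used for the graphics, so the following is only a sketch of that technique under stated assumptions: the periodogram is taken as |X_k|^2 / n, the test series is a noisy sine invented for illustration, and a plain discrete Fourier transform from the standard library is used to keep the example self-contained.

```python
import cmath
import math
import random


def dft(x):
    """Discrete Fourier transform (direct O(n^2) sum, fine for short series)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def inverse_dft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def periodogram(x):
    """Squared Fourier amplitudes; the 1/n normalization is an assumption."""
    n = len(x)
    return [abs(c) ** 2 / n for c in dft(x)]


def mean(x):
    return sum(x) / len(x)


def variance(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def phase_randomized_surrogate(x, rng):
    """Keep every Fourier amplitude, replace the phases by random ones."""
    n = len(x)
    X = dft(x)
    Y = list(X)  # Y[0] (the mean) and, for even n, Y[n // 2] stay untouched
    for k in range(1, (n - 1) // 2 + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        Y[k] = abs(X[k]) * cmath.exp(1j * phase)
        Y[n - k] = Y[k].conjugate()  # symmetry that makes the result real
    return [value.real for value in inverse_dft(Y)]


rng = random.Random(1)  # arbitrary seed for repeatability
# Hypothetical sample: a sine with period 8 plus Gaussian noise.
sample = [math.sin(2 * math.pi * t / 8) + 0.3 * rng.gauss(0.0, 1.0)
          for t in range(32)]
surrogate = phase_randomized_surrogate(sample, rng)

print("mean:    ", mean(sample), mean(surrogate))
print("variance:", variance(sample), variance(surrogate))
```

The surrogate differs from the sample point by point, yet its mean, variance and periodogram agree with those of the sample up to rounding. Any further property in which the sample stands out from many such surrogates is therefore not explained by these three quantities. For long series a fast Fourier transform would replace the direct sums used here.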

In this group we try to develop new methods to generate surrogates and to test various null hypotheses.