Last modified on 19. March 2026 at 20:16:59

“The limits of my language mean the limits of my world.” — Ludwig Wittgenstein

Emma sat on the wooden stage. She needed time to recover from what she had seen behind the curtains. What she had seen beneath the wood had now become reality. Dust danced in the light. From time to time, single light beams hit dust particles and exploded into golden rain. Sparkles dropped onto the wood.

“How is gold made?” Emma asked.

“This was a long break. I’m a bit surprised by your question. I thought we would discuss your findings from behind the stage. Not gold,” Jeff answered from his seat on the left.

“Gold cannot be formed through fusion in a star. This is only possible up to iron. You need a supernova or a collision of two neutron stars.” Emma spoke absently into the darkness.

“Interesting topic. Was this observed? The collision of two neutron stars and the resulting gold spewing out,” Jeff laughed.

“No, it was a computer simulation. Artificial data, if you will.” Emma looked around. She decided that it was time to begin the play.

“Then it’s not real, and it’s not an explanation for me at all,” Jeff stated.

“But the simulation was based on real observations.” Emma stood up and started building the scene for the play.

2.1 What is observed?

Figure 2.1: My son and I are observing the people riding bicycles on an early spring day. Given their clothing and headwear, my son tried to guess what season it really was. Some cyclists seem to be dressed for spring, others for winter, and one young woman for summer. He was very confused because it was really hard to guess. Why was there such a wide variation?

Can we observe anything? What does it mean to observe? We see things, but our vision is already influenced by our understanding of what they are. Additionally, by naming things and processes, we are already following a theory. “The sun is rising” is a popular example of a saying connected with the geocentric theory, which places the Earth at the center, with the Sun moving around it. Therefore, our observations are already shaped by our language. We can hardly speak about something we observe without giving the observation some meaning. It is possible, but it demands a high level of concentration.

Figure 2.2: An observation is a measurement or sensory input within a theoretical and linguistic framework. An observation cannot be made without a context. There is no such thing as an objective observation.

This book requires a highly specific type of observation: we need data. Very rarely is any observation we make first-hand data. For example, if you look outside your window and observe a bird, you will learn some information about it. But is this data? No, it is just information. Data is information, but structured information. Data is a type of observation that is stored in a specific way: we translate observations, and the information about them, into a set of numbers. Not every observation can be represented as a number in a data sheet or table. As their names suggest, data sheets and tables are two-dimensional. A sheet is a piece of paper on which organised observations can be written down, and the same is true for a table: it is a structured type of observation.

2.2 What is data?

We will now think about data. This will happen after we have discussed the science in the previous chapter, and it will also happen if you have skipped that chapter entirely. What is data? If you had asked me as a teenager, I would have referred to Star Trek: The Next Generation and Commander Data. He was an android without any emotions. Pure objectivity. He was an artificial being. However, this is not the type of data we want to discuss. We want to discuss data that has been observed and measured. Furthermore, our data is numeric or has a numeric counterpart. We use descriptive words, but rarely. Later on, we want to run algorithms and calculations on our data. Therefore, the data must be open to calculations.

Data are observations represented by numbers and letters. Can we use data in any format? No, we cannot. There is a general format for how data should be stored. In our case, we have observations in the rows and variables of measurement in the columns, with only one observation per row. This format is called the long format. Another possibility is the wide format, in which the measurements of one individual are spread across several columns of a single row, so the observations do not share a common structure down the rows. The wide format may be useful for time series, but I do not recommend it.
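As a small sketch in R (the column names and values here are invented for illustration, and the tidyr package is assumed), the same measurements once in long and once in wide format:

```r
library(tibble)
library(tidyr)

# Long format: one observation per row, variables of measurement in columns
long_tbl <- tibble(id     = c(1, 1, 2, 2),
                   week   = c(1, 2, 1, 2),
                   weight = c(70.1, 69.8, 82.4, 81.9))

# Wide format: the weekly measurements of one individual spread across columns
wide_tbl <- long_tbl |>
  pivot_wider(names_from = week, values_from = weight,
              names_prefix = "weight_week_")
```

The long table has four rows, one per measurement; the wide table has only two rows, one per individual, with the weeks as extra columns.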

The following Figure 2.3 shows a typical data set used in data science and statistics, and typical for those working in the life sciences. The matrix has \(n\) rows and \(p\) columns. To select a cell, we write the row and column position in brackets, with the rows indicated by \(i\) and the columns by \(j\). This is a simple way of extracting information from a data table, although such manual manipulations are rarely used anymore because they are very error-prone. Nevertheless, it is a simple yet powerful way of storing information: we translate our observations into numbers and letters, which we then put into a two-dimensional matrix.
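A minimal sketch of this bracket notation in base R, with made-up numbers:

```r
# A small matrix with n = 4 rows and p = 3 columns, filled column by column
dat <- matrix(1:12, nrow = 4, ncol = 3)

dat[2, 3]   # cell at row i = 2, column j = 3, here 10
dat[4, ]    # the whole fourth row
dat[, 1]    # the whole first column
```

As the text warns, hard-coded positions like `dat[2, 3]` break silently when rows or columns are reordered, which is why such manipulations are error-prone.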

Figure 2.3: Theoretical data set. It has \(n\) rows and \(p\) columns. The row and column positions can be used to access each cell of this two-dimensional matrix, which is indicated by the grey shaded area.

Figure 2.3 might be interesting, but it is not very helpful in the daily working process. Therefore, I will show you an annotated data table with example observations labelled with numbers and letters. The first column contains an identifier, or ID, for each observation. I always include one in my data sets because I want to know which individual is which; this is even more important when working with time series or repeated measurements. After the ID column, we find the experimental variables. We can also refer to these as ‘X’ variables, as we will plot them on the x-axis in our visualizations later on. We will often set the experimental variables before we start the experiment. Then, we find the measurements on the individuals. These variables are called outcomes or ‘Y’ variables, as they will be the variables on our y-axis later on. The terminology varies from scientific field to scientific field, so I will cover it in a separate chapter later on.
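A hypothetical example of such a table in R (all values invented; the column `jump_length_cm` is borrowed from the example data later in this chapter): an ID column, an experimental ‘X’ variable set before the experiment, and a measured ‘Y’ outcome:

```r
library(tibble)

example_tbl <- tibble(
  id             = 1:6,                              # identifier of each observation
  trt            = rep(c("ctrl", "dose"), each = 3), # experimental (X) variable
  jump_length_cm = c(8.1, 7.9, 8.4, 10.2, 9.8, 10.5) # measured (Y) outcome
)
```

Each row is one individual, the ID tells us which individual is which, and the X variable was fixed before the Y variable was measured.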

Figure 2.4: Annotated dataset. The individuals’ observations are shown in the rows. The columns show the unique identifier of each observation, the experimental factors set before the experiment, and the variables measured after the experiment has concluded.

Can we live with this two-dimensional representation of data? Not really. If you look at different fields, you will find a variety of other data tables. In genetics, we do not have data sets of this kind when we look at thousands of genes. Therefore, we have at least two data files: one contains all the information on the patients or individuals, and the other contains all the genetic information. Most of the time, the genetic file is saved in a transposed form, meaning that the variables or genes are in the rows and the individuals are in the columns. This is because it is easier to change rows in a data set than columns, which gives us an advantage when saving and manipulating the data, but makes it slightly more complicated to work with. Further, geology has its own data formats, while other fields use yet other ways of storing data. Storage methods sometimes depend on the analysis software used or on the storage and calculation capacities required.

Here, we will focus on the presented data set in Figure 2.4, making some modifications to the columns if we are working with time series or repeated measurements. However, we will stick to one data set and will not consider complicated representations. We will talk about the experimental variables and the different types of measurements, both of which are represented by different numbers and letters. There are still many topics to consider if we want to work with data to obtain information and knowledge. In the following sections, we will cover different attributes and limitations of data that are important for successful work.

2.3 Meta information of data

The data seems easy enough. We have a two-dimensional table with numbers and letters. That could be it. However, there are some things to consider. Information is often hidden in data. Where does the data come from? What are the internal generation processes? Are there unknown or known structures beneath the numbers and letters that cannot be seen from the row and column names?

First, we need to consider the source of the data. Depending on where you obtain your data, there may be more regulations to consider than in other fields.

Second, we must discuss how the data is generated. Are you conducting a controlled experiment where you change only one influential variable and observe what happens? Or are you doing nothing and only observing the data? How important is your own part in the generation process?

Third, we would like to discuss replication. We need replications in our data in order to perform statistical analyses later on. However, there are two types of replication: they can look similar in a dataset, but they have a significant impact on your analysis and on the answers to your questions.

Next, we will examine artificial or simulated data. This special type of data is used mainly in statistical research and teaching, and we will use it in this book. Non-statisticians, however, should not use artificial data in their daily research to fake entries in their data sheets.

Is all this information lost or hidden? In a sense, yes, because if you don’t have any information on the generation process, or if you can only see some cryptic column names, it might be impossible for you to gain any deeper knowledge of the data. However, it is always possible to find a pattern in anything.

Garbage in, garbage out

2.3.1 Source of data

Where does your data come from? Does it matter? Yes, because it has a huge influence on your work and how you share it. First, let’s define the different sources of data. One type is human data, such as blood samples or questionnaires. Humans can share information about social, economic, and other factors. Human data is often used in medicine and the social sciences. Then, we can examine data from animals. Mice are a special case because much basic research is conducted on mice before moving on to humans. The rest of the animal studies include all animals relevant to agriculture and leisure activities, such as dogs and horses. Naturally, there is a great deal of interest in chickens, pigs, and cows because of their role in food production. Finally, we have plants, including fungi and other plant-based sources. Our focus is mostly on crops that are interesting to agriculture, but the field is broad. On the sidelines, we have cell cultures, which have their own problems and limitations.

Figure 2.5 shows a visualization of the different sources of data and their connection to regulation and openness. We often refer to data from humans, animals and plants as in vivo data because it originates from living beings. This includes mice, the general animal model for humans. Artificial or simulated data, as used in statistics, is referred to as in silico data. In my view and in the context of this book, artificial data is common, but this is a biased perspective on data: most of the time, data is generated by experiments or observations rather than by algorithms. Nevertheless, human data is heavily regulated and cannot be shared openly. The same is partly true for animal models. Plant data and some types of cell cultures are easier to share, and if you simulate your data, you can share it as freely as you like.

Figure 2.5: Figure showing examples of sources of data. On the left are the three main in vivo sources of data: human, animal and plant, in order of decreasing regulatory requirements. Cell cultures can be derived from each of these sources. On the right is the in silico source of data, which uses artificial or simulated data. The openness and shareability of data move in the opposite direction to the level of regulation.

Ethical considerations. Can I do what I want?

2.3.2 Experimental vs. observed data

Experimental data comes from a controlled experiment.

Observed data is real-world data that can come from anywhere. We live in a data-driven world.

Bias

Real-world data vs. randomized clinical trials (RCTs)

Figure 2.6: Visualisation of the saying ‘garbage in, garbage out’. In other words, if the input data is bad, the statistical algorithm cannot work miracles: the output will also be garbage, regardless of the statistical algorithm. In such a setting, the algorithm is also seen as a black box, because if the data is not understood, the algorithm will not be understood either.

2.3.3 Biological vs. technical replication

Do we need randomization here?

We need replication because we need a probabilistic process to do statistics in science. With a probabilistic process, we can then calculate probabilities; without a probabilistic process and replication, we cannot have any probabilities.

The following Figure 2.7 illustrates the various types of replication. In general, we divide replication into biological and technical replication. Biological replication can also be carried out at many points in time. This may sound confusing, but the difference lies in the time between the measurements. Let us start with biological replication: we measure an outcome \(Y\) on different biological individuals. Depending on our research question, we might measure only a few individuals or hundreds to thousands. These individuals can be humans, plants or mice. Technical replication involves taking measurements many times on the same individual at the same time, which allows us to check that the measurement was correct. For example, you could measure your body weight three times in a row and then combine the three measurements into one overall result. If a large amount of time passes between the individual measurements, we speak of a repeated measurement: we are still measuring the same individual, but at different points in time, for example to learn how body weight changes in humans over weeks. This is biological replication at different points in time.
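A small sketch of the three types in R (all means, standard deviations and the seed are invented; `rnorm()` draws random values):

```r
set.seed(20260319)  # arbitrary seed so the random draws are reproducible

# (A) Biological replication: one measurement on five different individuals
bio_y <- rnorm(5, mean = 70, sd = 5)

# (B) Repeated measurement: the same individual measured in three different weeks,
#     with a slowly changing true weight
rep_y <- rnorm(3, mean = c(70, 69, 68), sd = 0.5)

# (C) Technical replication: one individual measured three times in a row,
#     then combined into a single value
tech_y <- rnorm(3, mean = 70, sd = 0.5)
tech_combined <- mean(tech_y)
```

In a data table, (A) fills five rows, (B) fills three rows for one ID, and (C) usually ends up as a single row holding `tech_combined`.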

Figure 2.7: Different types of replication can be observed in data tables and experiments. (A) First, there is biological replication, where a measured \(Y\) is observed on different biological individuals. (B) The second type is repeated measurement, where a measured \(Y\) is observed on each individual at different points in time; the time between measurements is long. (C) Third, technical replication involves measuring each individual several times in a row at one point in time; the time between measurements is negligible. The values measured for each individual are usually combined into a single measure.

2.3.4 Simulation and artificial data

You might be surprised to learn that, in statistical research, we statisticians work with simulated data, also known as artificial data. This data has no experimental background and is entirely simulated: we simulate the measured values based on the experimental values. To do so, we need to make some assumptions about the simulation process, so our artificial data is model-laden. We need a model that connects the experimental and measured variables, and if we change the model, the generated data will also change.
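A minimal sketch of this model-ladenness in R (all coefficients and the seed are invented): two different models connecting the same experimental variable \(x\) to a simulated outcome produce different data.

```r
set.seed(42)  # arbitrary seed for reproducibility
x <- 1:10     # experimental (X) variable

# Model A: a linear connection plus random noise
y_linear <- 2 * x + 1 + rnorm(10, mean = 0, sd = 1)

# Model B: a quadratic connection plus the same kind of noise;
# changing the model changes the generated data
y_quad <- 0.5 * x^2 + 1 + rnorm(10, mean = 0, sd = 1)
```

Both vectors were generated from the same \(x\), yet they tell different stories, because the assumed model, not an experiment, determines what the data looks like.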

In this book, we will simulate our data. This may seem counterintuitive, as we will use data with predefined effects and differences. What is the point of such data? What can we learn from it? The advantage of simulated data is that we already know what we want to find with our statistical algorithms [1]. Otherwise, we would not know whether there were no patterns in the data or whether our algorithm was unable to detect them.

In vivo versus in silico.

2.4 Ways to work with data

2.4.1 Long format vs. wide format

2.4.2 The elephant in the room is Excel


What problems are there with Excel?

  • Reach and shareability. If we do something in Excel by pointing and clicking, nobody else can reproduce it unless they are watching us.
  • Scalability and efficiency. If we have one task, it takes some time; if we repeat the same task with different numbers, it takes twice as long. We cannot really automate any processes.
  • Excel is error-prone. If we make a mistake, we will not really understand why things did not work. We can click again, but we have no protocol to follow, and others cannot retrace what we have done.
  • Excel is neither open nor free to use. We need a licence to use Excel and cannot easily share files with people who do not have access to Microsoft products.
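As a contrast to pointing and clicking, a scripted sketch in R (the file names, folder, and column are hypothetical): the same lines work for two files or two thousand, and the script itself is the protocol that others can read and rerun.

```r
library(readr)
library(dplyr)

# Write two small hypothetical input files into a temporary folder
dir <- tempdir()
write_csv(tibble::tibble(weight_mg = c(1.2, 1.4)), file.path(dir, "trial_a.csv"))
write_csv(tibble::tibble(weight_mg = c(2.3, 2.1)), file.path(dir, "trial_b.csv"))

# One loop handles every file in exactly the same way; nothing depends on clicks
files <- list.files(dir, pattern = "^trial_.*\\.csv$", full.names = TRUE)
means <- sapply(files, function(f) {
  mean(read_csv(f, show_col_types = FALSE)$weight_mg)
})
```

Adding a third trial file changes nothing in the script; rerunning it repeats every step identically, which is precisely what a point-and-click workflow cannot guarantee.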

2.4.3 Tidy data

2.5 R packages used

Do we want to simulate data here?

Figure 2.8: foo. (A) foo. (B)
Figure 2.9: foo.

2.6 Data

Is your data the fuel of discovery or the test of discovery?

R Code
library(tibble)
library(dplyr)

# Simulate jump lengths from the cubic model in Equation 2.1 plus random noise;
# without a fixed seed, rnorm() produces slightly different values on each run
jump_weight_tbl <- tibble(x = c(0.6, 1, 2.3, 3.5, 5.2, 7.1, 8.4, 9.2, 10),
                          y = 0.15*x^3 - 2.2*x^2 + 8.8*x + 3.2 + rnorm(9, 0, 0.5)) |>
  mutate(across(everything(), \(v) round(v, 1))) |>
  rename(weight_mg = x, jump_length_cm = y)
Table 2.1: foo.
weight_mg jump_length_cm
0.6 8.5
1.0 9.6
2.3 13.7
3.5 13.2
5.2 11.0
7.1 8.3
8.4 10.9
9.2 14.3
10.0 21.1

Equation 2.1

\[ y = 0.15\cdot x^3 - 2.2\cdot x^2 + 8.8 \cdot x + 3.2 \tag{2.1}\]

Figure 2.10: foo.
Figure 2.11: foo.

2.7 Outro

What do we want to do in this book? We want to perform inductive reasoning based on data. Statistical modeling is inductive reasoning, so it, too, is based on data. Therefore, we will need data for every statistical analysis we perform. In other words, we live in a data-driven world.

In this book, we will follow an inductive approach: first we make observations, then we try to find patterns in them. In this sense, our data is the fuel of discovery. Conversely, we can also use data to test a hypothesis; both approaches are valid and in use. Since this book is about statistics and data science, we cannot generate new theories: we are limited by the tools we have and the data available to us. If you read this book with a background in the natural sciences, you may use the statistical tools differently.

2.8 Alternatives

Further tutorials and R packages on XXX

2.9 Glossary

term

what does it mean.

2.10 The meaning of “Models of Reality” in this chapter

  • itemize with max. 5-6 words

2.11 Summary

References

[1]
Dormann CF, Ellison AM. Statistics by Simulation: A Synthetic Data Approach. Princeton University Press; 2025.
[2]
Hassenstein MJ, Jung K. Ten simple rules for effective research data management. PLOS Computational Biology. 2025;21(12):e1013779.
[3]
Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific data. 2016;3(1):1-9.