2 What is data?

Last modified on 24. February 2026 at 20:32:46

“The limits of my language mean the limits of my world.” — Ludwig Wittgenstein

Emma sat on the wooden stage. She needed time to recover from what she had seen behind the curtains. What she had seen beneath the wood had now become reality. Dust danced in the light. From time to time, a single beam of light hit a few dust particles, and they exploded into golden rain. Sparkles dropped onto the wood.

“How is gold made?” Emma asked.

“This was a long break. I’m a bit surprised by your question. I thought we would discuss your findings from behind the stage. Not gold,” Jeff answered from his seat on the left.

“Gold cannot be formed through fusion in a star. This is only possible up to iron. You need a supernova or a collision of two neutron stars.” Emma spoke absently into the darkness.

“Interesting topic. Was this observed? The collision of two neutron stars and the resulting gold spewing out,” Jeff laughed.

“No, it was a computer simulation. Artificial data, if you will.” Emma looked around. She decided that it was time to begin the play.

“Then it’s not real, and it’s not an explanation for me at all,” Jeff stated.

“But the simulation was based on real observations.” Emma stood up and started building the scene for the play.

2.1 General background

We will now think about data. Ideally, you come here after we have talked about science, but you can follow along even if you skipped that chapter entirely. What is data? If you had asked me when I was a teenager, I would have referred to Star Trek: The Next Generation and Commander Data. He was an android without any emotions: an artificial being of pure objectivity. However, this is not the type of data we want to discuss. We want to discuss data that has been observed and measured. Furthermore, our data is numeric or has a numeric counterpart. We use descriptive words, but rarely. Later on, we want to run algorithms and calculations on our data. Therefore, the data must be amenable to calculation.
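As a small illustration of a "numeric counterpart", descriptive words can be stored as a factor and mapped to numbers. The sketch below shows only one possible coding; the vector `answer` and the 0/1 coding are assumptions made for this example and do not come from the data used later.

```r
# Hypothetical descriptive answers, as they might be written down.
answer <- c("yes", "yes", "no", "no", "yes", "no")

# Store the words as a factor with an explicit level order.
answer_fct <- factor(answer, levels = c("no", "yes"))

# Numeric counterpart: 0 for "no", 1 for "yes".
answer_num <- as.integer(answer_fct) - 1L

# Now calculations are possible, e.g. the share of "yes" answers.
mean(answer_num)
```

The explicit `levels` argument matters here: without it, R orders the levels alphabetically, which happens to give the same result in this case but should not be relied on.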

Can we use data in any format? No, we cannot. There is a general format for how data should be stored. In our case, observations go in the rows and the measured variables go in the columns. There can be only one observation per row. This format is called the long format. The alternative is the wide format, where the values of a single variable are spread across several columns, so that one row holds more than one observation. The wide format may be useful for time series, but I do not recommend it.
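The difference between the two formats is easiest to see on a toy table. The following is a minimal sketch, assuming the `tibble` and `tidyr` packages are installed; the table `wide_tbl`, its column names, and all values are hypothetical and serve only as an illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one row per animal, one column per day.
# Each row therefore holds two observations.
wide_tbl <- tibble(animal = c("cat", "dog"),
                   weight_day_1 = c(3.2, 8.1),
                   weight_day_2 = c(3.4, 8.0))

# Reshape to long format: one row per observation (animal and day).
long_tbl <- wide_tbl |>
  pivot_longer(cols = starts_with("weight_day_"),
               names_to = "day",
               names_prefix = "weight_day_",
               values_to = "weight")

long_tbl
```

In the long table, `day` has become a variable of its own instead of being hidden in the column names, which is what most modelling and plotting functions expect.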

In this book, we will follow an inductive approach. First, we make observations; then we try to find patterns in them. In that sense, our data is the fuel of discovery. Conversely, we can also use data to test a hypothesis. Both approaches are valid and both are used. Since this book is about statistics and data science, we cannot generate new theories here. We are limited by the tools we have and the data available to us. If you come to this book with a background in natural science, you can use the statistical tools differently, for example to test a hypothesis from your own field.

Observations are theory-laden.

Observation = Sensory Input + Theoretical Framework

2.2 Simulation and artificial data

Artificial data is model-laden
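One way to make "model-laden" concrete is to simulate twice with identical noise but different models. In the sketch below, the cubic is the one from Equation 2.1 later in this chapter; the linear model, the seed, and all object names are assumptions chosen only for illustration.

```r
set.seed(20260224)  # arbitrary seed, only so that the sketch is repeatable

x <- seq(0, 10, by = 1)

# Two candidate models for the same relationship.
linear_model <- \(x) 2 + 1.5 * x                           # hypothetical
cubic_model  <- \(x) 0.15 * x^3 - 2.2 * x^2 + 8.8 * x + 3.2  # Equation 2.1

# The same random noise is added to both models.
noise <- rnorm(length(x), mean = 0, sd = 0.5)
sim_linear <- linear_model(x) + noise
sim_cubic  <- cubic_model(x) + noise

# Everything that differs between the two artificial data sets
# comes from the chosen model, not from chance.
all.equal(sim_linear - sim_cubic, linear_model(x) - cubic_model(x))
```

Whatever pattern we later find in such data was put there by the model we chose to simulate from.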

2.3 Technical vs. biological replication

2.4 Experimental vs. observed data

2.5 Clinical studies

2.6 Long format vs. wide format

2.7 The elephant in the room is Excel

2.8 Tidy data

2.9 Theoretical background

2.10 R packages used

Do we want to simulate data here?

2.11 Data

Is your data the fuel of discovery or the test of discovery?


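library(tidyverse)

# Simulate jump lengths from the cubic in Equation 2.1 plus normal noise (sd = 0.5).
# No seed is set, so the noise and the rounded values differ between runs.
jump_weight_tbl <- tibble(x = c(0.6, 1, 2.3, 3.5, 5.2, 7.1, 8.4, 9.2, 10),
                          y = 0.15*x^3 - 2.2*x^2 + 8.8*x + 3.2 + rnorm(9, 0, 0.5)) |>
  mutate(across(everything(), \(v) round(v, 1))) |>
  rename(weight_mg = x, jump_length_cm = y)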

Table 2.1: Simulated flea weights (mg) and jump lengths (cm).

weight_mg	jump_length_cm
0.6	8.5
1.0	9.6
2.3	13.7
3.5	13.2
5.2	11.0
7.1	8.3
8.4	10.9
9.2	14.3
10.0	21.1

Equation 2.1 is the model behind the simulated data, with x as the weight in mg and y as the jump length in cm:

\[ y = 0.15\cdot x^3 - 2.2\cdot x^2 + 8.8 \cdot x + 3.2 \tag{2.1}\]
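The two roles of data, test of discovery and fuel of discovery, can be tried out on Table 2.1. The following is a sketch in base R under stated assumptions: the values are copied from Table 2.1, the names `jump_weight_df` and `eq_2_1` are made up for this example, and the choice of a cubic for the fit is itself an assumption that we can only make because we know how the data was generated.

```r
# Values copied from Table 2.1.
jump_weight_df <- data.frame(
  weight_mg = c(0.6, 1.0, 2.3, 3.5, 5.2, 7.1, 8.4, 9.2, 10.0),
  jump_length_cm = c(8.5, 9.6, 13.7, 13.2, 11.0, 8.3, 10.9, 14.3, 21.1)
)

# Equation 2.1 as an R function.
eq_2_1 <- \(x) 0.15 * x^3 - 2.2 * x^2 + 8.8 * x + 3.2

# Test of discovery: the model is given and the data is checked against it.
res_known <- jump_weight_df$jump_length_cm - eq_2_1(jump_weight_df$weight_mg)
round(res_known, 2)

# Fuel of discovery: the cubic is estimated from the data alone.
fit <- lm(jump_length_cm ~ poly(weight_mg, degree = 3, raw = TRUE),
          data = jump_weight_df)
coef(fit)
```

With only nine noisy observations, the estimated coefficients need not match those of Equation 2.1 closely, even though the fitted curve cannot have a larger residual sum of squares than the known model.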

2.12 Alternatives

Further tutorials and R packages on XXX

2.13 Glossary

term: what does it mean.

2.14 The meaning of “Models of Reality” in this chapter.

itemize with max. 5-6 words

2.15 Summary

References
