Commit c0e57d2

feat: ✨ slides for talk on Parquet and use in DST (#173)
# Description

Slides and post for our in-person discussion session about Parquet and the Danish registers.

No review needed.

## Checklist

- [x] Formatted Markdown
- [x] Ran `just run-all`

1 parent 534ed75 commit c0e57d2

3 files changed

+280
-0
lines changed

posts/parquet-dst-2025/index.qmd

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
1+
---
2+
title: "A talk on using the Parquet data format in Statistics Denmark"
3+
description: |
4+
  A short presentation on what Parquet is, why it's useful, and
5+
  why it should be used in Statistics Denmark for faster
6+
  analysis and lower expenses.
7+
author:
8+
- "Luke W. Johnston"
9+
date: "2025-10-24"
10+
categories:
11+
- presentation
12+
- parquet
13+
- data format
14+
---
15+
16+
For our in-person sessions, we're going to discuss the Statistics
17+
Denmark project database at Steno Diabetes Center Aarhus, how we'll
18+
use it in our collaborating project
19+
[DP-Next](https://dp-next.github.io), and how we can align efforts at
20+
converting the registers into Parquet format rather than keeping them in SAS format.
21+
22+
This post contains the [slides](slides.qmd) used in the session for
23+
talking about Parquet.

posts/parquet-dst-2025/slides.qmd

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
1+
---
2+
title: "Parquet data format and using it in Statistics Denmark"
3+
exclude-from-listing: true
4+
author: "Luke W. Johnston"
5+
date: "2025-10-24"
6+
format:
7+
  revealjs:
8+
    theme:
9+
      - brand
10+
      - theme.scss
11+
    logo: /_extensions/seedcase-project/seedcase-theme/logos/seedcase-logo.svg
12+
    slide-number: true
13+
---
14+
15+
# Outline for this presentation {.center}
16+
17+
1. Refresher: How Statistics Denmark currently stores data.
18+
19+
2. Intro to the Parquet file format.
20+
21+
3. Problems that Parquet solves.
22+
23+
# Statistics Denmark data storage
24+
25+
## Data format: Proprietary SAS format {.center}
26+
27+
For example, the BEF register:
28+
29+
``` text
30+
bef2018.sas7bdat
31+
bef2019.sas7bdat
32+
bef2020.sas7bdat
33+
bef2021.sas7bdat
34+
bef2022.sas7bdat
35+
```
36+
37+
. . .
38+
39+
Challenge: Takes many minutes to load one year of data (in R).
40+
41+
::: notes
42+
This means that if you use R, Python, or Stata, you have to load these
43+
files, which can take many minutes per file.
44+
:::
45+
46+
## Data updates make more work for us {.center}
47+
48+
``` text
49+
bef2021.sas7bdat
50+
bef2022.sas7bdat
51+
December_2023/bef2022.sas7bdat
52+
December_2023/bef2023.sas7bdat
53+
```
54+
55+
> Can you see the issue?
56+
57+
::: notes
58+
One problem is that sometimes there's a new version of a year you
59+
already had, but you don't know what's been changed. You have to spend
60+
time checking what changed and whether it breaks anything for you. The
61+
second problem is that the updates are in a new folder. So building an
62+
automated pipeline to load the data is a bit of a pain because the
63+
structure changes for each update.
64+
:::
65+
66+
## Metadata is confusing and poorly documented {.center}
67+
68+
- Variables are not consistent across years.
69+
70+
- Finding the metadata is difficult.
71+
72+
- Some variables are numeric but actually categorical.
73+
74+
::: notes
75+
Metadata is a big problem. Documentation is relatively poor for most of
76+
the variables, and it's kept in another location that you have to dig
77+
through. Some variables hold numbers that are actually categories, but
78+
the documentation for what those numbers mean isn't in the same place,
79+
so finding it requires searching.
80+
:::
81+
82+
## Use something other than SAS? Data gets duplicated {.center}
83+
84+
E.g. Stata will create `.dta` files, doubling storage needs.
85+
86+
# Parquet file format {.center}
87+
88+
::: aside
89+
<https://parquet.apache.org/>
90+
:::
91+
92+
## Parquet is a column-based data storage format {.center}
93+
94+
Most data formats are row-based, like CSV. Newer formats tend to be
95+
column-based.
96+
97+
## Row vs column-based storage {.center}
98+
99+
::::: columns
100+
::: column
101+
### Row-based
102+
103+
``` text
104+
name,sex,age
105+
Tim,M,30
106+
Jenny,F,25
107+
```
108+
:::
109+
110+
::: {.column .fragment}
111+
### Column-based
112+
113+
``` text
114+
name,Tim,Jenny
115+
sex,M,F
116+
age,30,25
117+
```
118+
:::
119+
:::::
120+
121+
## Column-based storage has better compression {.center}
122+
123+
``` text
124+
sex,M,F,F,M,M,F,F,F
125+
age,30,30,25,32,31,40,39,50
126+
diabetes,0,1,0,0,1,0,0,0
127+
```
128+
129+
...becomes...
130+
131+
``` text
132+
sex,M,F{2},M{2},F{3}
133+
age,30{2},25,32,31,40,39,50
134+
diabetes,0,1,0{2},1,0{3}
135+
```
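
The shorthand above is a form of run-length encoding. As a rough sketch in base R (illustrative only; Parquet combines several encodings and this is not its actual implementation), `rle()` collapses each run of repeated neighbouring values into one value and a count:

```r
# Toy columns copied from the slide above.
sex <- c("M", "F", "F", "M", "M", "F", "F", "F")
diabetes <- c(0, 1, 0, 0, 1, 0, 0, 0)

# rle() stores each run of identical neighbouring values once,
# together with the length of that run.
sex_runs <- rle(sex)
diabetes_runs <- rle(diabetes)

sex_runs$values   # "M" "F" "M" "F"
sex_runs$lengths  # 1 2 2 3

# Decoding gives back the original column, so nothing is lost.
identical(inverse.rle(sex_runs), sex)  # TRUE
```

The longer the runs in a column, the more this saves, which is why columns with few distinct values tend to compress well.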
136+
137+
### Loading
138+
139+
- Computers read by lines.
140+
- Per line = same data type.
141+
- Only read needed columns.
142+
143+
Only need age? Only read that line:
144+
145+
::::: columns
146+
::: column
147+
``` text
148+
sex,M,F
149+
age,30,25
150+
diabetes,0,1
151+
```
152+
:::
153+
154+
::: column
155+
``` text
156+
age,30,25
157+
```
158+
:::
159+
:::::
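
Reading a single column can be sketched with the `arrow` package as follows. The `people` data frame and the temporary file are made up for illustration; they are not part of the registers.

```r
library(arrow)

# Hypothetical toy data; the real registers have far more columns.
people <- data.frame(
  sex = c("M", "F"),
  age = c(30L, 25L),
  diabetes = c(0L, 1L)
)
parquet_file <- tempfile(fileext = ".parquet")
write_parquet(people, parquet_file)

# Ask only for the `age` column; `sex` and `diabetes` are not
# returned and do not need to be read.
ages <- read_parquet(parquet_file, col_select = "age")
names(ages)  # "age"
```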
160+
161+
## Parquet is 50-75% smaller than other formats {.center}
162+
163+
| File type            | Size         |
164+
|----------------------|--------------|
165+
| SAS (`.sas7bdat`)    | 1.45 GB      |
166+
| CSV (`.csv`)         | \~90% of SAS |
167+
| Stata (`.dta`)       | 745 MB       |
168+
| Parquet (`.parquet`) | 398 MB       |
169+
170+
: File sizes of the CSV, Parquet, Stata, and SAS versions of the `bef` register for
171+
2017.
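
As a quick check of the numbers in the table, the reduction from moving to Parquet can be worked out like this. It assumes 1.45 GB is 1450 MB, which the table does not specify.

```r
# Sizes in MB from the table above, treating 1.45 GB as 1450 MB
# (an assumption; the table does not say whether 1 GB is 1000 or 1024 MB).
size_mb <- c(sas = 1450, stata = 745, parquet = 398)

# Fraction of space saved when moving from each format to Parquet.
reduction <- 1 - size_mb[["parquet"]] / size_mb[c("sas", "stata")]
round(100 * reduction)  # sas: 73, stata: 47
```

So for this one register, Parquet is about 73% smaller than SAS and about 47% smaller than Stata.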
172+
173+
## Personal experience: 500 GB SAS = 80 GB Parquet {.center}
174+
175+
## Can partition data by a value (e.g. year) {.center}
176+
177+
``` text
178+
bef/
179+
├── year=2018/
180+
│   └── part-0.parquet
181+
├── year=2019/
182+
│   └── part-0.parquet
183+
├── year=2020/
184+
│   └── part-0.parquet
185+
└── year=2021/
186+
    └── part-0.parquet
187+
```
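
A folder layout like the one above could be produced with `arrow::write_dataset()`. This is a minimal sketch using a made-up `bef_toy` data frame and a temporary folder, not the actual conversion pipeline for the registers.

```r
library(arrow)

# Hypothetical toy stand-in for the `bef` register.
bef_toy <- data.frame(
  year = c(2018L, 2019L, 2020L, 2021L),
  age = c(30L, 25L, 41L, 38L)
)

# Write one folder per value of `year` ("hive-style" partitioning).
dataset_dir <- file.path(tempdir(), "bef")
write_dataset(bef_toy, dataset_dir, format = "parquet", partitioning = "year")

# List the files that were written, one per `year=` folder.
written_files <- list.files(dataset_dir, recursive = TRUE)
written_files
```

A later data update then only needs to write the folders for the years that are new or changed.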
188+
189+
## Partitioned Parquet dataset can be loaded all at once {.center}
190+
191+
Load in R with `arrow` package:
192+
193+
``` r
194+
bef <- arrow::open_dataset("bef")
195+
```
196+
197+
> Loads all years in a fraction of a second, compared to \~5 min for one
198+
> year without using Parquet.
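
Part of the reason this is so fast is that `open_dataset()` does not read the rows straight away. As a sketch with `dplyr` on a made-up toy dataset (the folder name and columns are assumptions for illustration), rows are only read when `collect()` is called:

```r
library(arrow)
library(dplyr)

# Hypothetical toy dataset, partitioned by year, in a temporary folder.
dataset_dir <- file.path(tempdir(), "bef_lazy")
data.frame(
  year = c(2018L, 2019L, 2020L),
  age = c(30L, 25L, 41L),
  sex = c("M", "F", "F")
) |>
  write_dataset(dataset_dir, partitioning = "year")

# Opening the dataset only inspects file paths and metadata.
bef <- open_dataset(dataset_dir)

# Rows are read at collect(), and only for the year and columns asked for.
bef_2020 <- bef |>
  filter(year == 2020) |>
  select(year, age) |>
  collect()
```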
199+
200+
## Easy connection to DuckDB engine {.center}
201+
202+
DuckDB (<https://duckdb.org/>) is a recent, powerful SQL engine designed
203+
for analytical queries.
204+
205+
``` r
206+
bef <- arrow::open_dataset("bef") |>
207+
  arrow::to_duckdb()
208+
```
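
Once the dataset is handed to DuckDB, it can be queried with the usual `dplyr` verbs. A minimal sketch, assuming the `duckdb` and `dbplyr` packages are installed and using a made-up toy dataset:

```r
library(arrow)
library(dplyr)

# Hypothetical toy dataset in a temporary folder.
dataset_dir <- file.path(tempdir(), "bef_duckdb")
data.frame(
  year = c(2018L, 2018L, 2019L),
  age = c(30L, 25L, 41L)
) |>
  write_dataset(dataset_dir, partitioning = "year")

# DuckDB runs the aggregation; only the summary comes back to R.
per_year <- open_dataset(dataset_dir) |>
  to_duckdb() |>
  group_by(year) |>
  summarise(mean_age = mean(age, na.rm = TRUE)) |>
  arrange(year) |>
  collect()
```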
209+
210+
## SAS and Python can load Parquet, but Stata cannot {.center}
211+
212+
(But we should be pushing for R or Python use anyway.)
213+
214+
# Problems Parquet solves {.center}
215+
216+
## Less space used = less money spent {.center}
217+
218+
DST charges for storage used.
219+
220+
## Faster loading and analysis times {.center}
221+
222+
Parquet loads multiple files in seconds, compared to minutes for other
223+
formats.
224+
225+
## Researchers finish sooner = less money spent {.center}
226+
227+
DST charges per user on a project.

posts/parquet-dst-2025/theme.scss

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
/*-- scss:defaults --*/
2+
3+
$presentation-font-size-root: 48px !default;
4+
$presentation-h1-font-size: 2em !default;
5+
$presentation-h2-font-size: 1.5em !default;
6+
$code-block-font-size: 0.85em !default;
7+
8+
/*-- scss:rules --*/
9+
10+
.reveal .progress {
11+
height: 8px;
12+
color: $primary;
13+
top: 0;
14+
}
15+
16+
.reveal .slide-logo {
17+
top: 10px;
18+
left: 10px;
19+
max-height: 4rem !important;
20+
}
21+
22+
// Fix center align for Mermaid diagrams in RevealJS slides.
23+
svg {
24+
display: inline-block;
25+
max-width: 60% !important;
26+
}
27+
28+
.tb-mermaid svg {
29+
max-width: 30% !important;
30+
}
