Managing data projects

Andrés Cruz

2022-09-30

Roadmap (full)

1 Follow the one-project-one-folder rule

  • Structure your projects purposefully.
  • Use RStudio projects.

Structure your projects purposefully

  • Always keep raw datasets! Save temporary “clean” datasets if needed.

  • Common project structure:

my_project/
├─ raw/    # raw datasets
├─ temp/   # "clean" datasets
├─ output/ # figures, tables, model results
├─ 01_script_first.R
├─ 02_script_second.R # ...
├─ README.txt
  • Our example today:
project_example/
├─ raw/
│  ├─ qog_bas_cs_jan22.csv
├─ temp/
│  ├─ 01_qog_clean.csv
├─ output/
│  ├─ 02_model_coefplot.pdf
│  ├─ 02_model_results.rds
├─ 01_clean_data.R
├─ 02_run_model.R
├─ 99_paper_example.qmd
├─ 99_paper_example.pdf
├─ project_example.Rproj
├─ README_project_example.txt
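
  • The README is a plain-text note for whoever opens the project (including future you). A hypothetical sketch of its contents:

Replication materials for the project example.
Run the scripts in order: 01_clean_data.R, then 02_run_model.R.
raw/ is never modified by code; temp/ and output/ are regenerated by the scripts.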

Use RStudio projects

  • A project can be anything: a paper, a homework assignment, your CV…

  • RStudio Projects are just folders, like the ones we just saw, with an .Rproj file indicating that the folder is an RStudio Project.

  • Never worry about working directories again. Simply reference your files within the folder. For example: read_csv("raw/qog_bas_cs_jan22.csv").

  • Start afresh every time you open RStudio with the following options (Tools > Global Options… > General):

    • Uncheck “Restore .RData into workspace at startup”
    • Set “Save Workspace to .RData on exit” to “Never”

2 Automate your output creation

  • Use Quarto (or R Markdown)
  • Make publication-ready tables with modelsummary
  • Save plots with ggplot2::ggsave()

Use Quarto (or R Markdown)

  • Keep code and output in the same document! No more copy-pasting tables.

  • You can write short reports, presentations (with revealjs or Beamer), or papers! Let’s see an example in 99_paper_example.qmd.

  • It compiles automatically to .html or .pdf (via LaTeX): everything works!

    • Installing LaTeX locally is fairly easy. Just run the following code once and you’ll be able to compile to PDF:
install.packages("tinytex")
tinytex::install_tinytex()
  • You can insert figures, tables, citations, LaTeX equations…
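
  • A minimal sketch of what a .qmd source file looks like (the title, equation, and chunk contents here are just placeholders):

---
title: "My report"
format: pdf
---

Some text, an inline equation $y_i = \beta_0 + \beta_1 x_i + u_i$, and a
code chunk whose output gets embedded in the document:

```{r}
#| echo: false
summary(lm(mpg ~ hp, data = mtcars))
```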

Make publication-ready tables with modelsummary

m1 <- lm(mpg ~ hp, data = mtcars)
m2 <- fixest::feols(mpg ~ hp | cyl, data = mtcars, vcov = "cluster")
modelsummary::modelsummary(list(m1, m2))
              Model 1    Model 2
(Intercept)    30.099
              (1.634)
hp             −0.068     −0.024
              (0.010)    (0.015)
Num.Obs.           32         32
R2              0.602
R2 Adj.         0.589
AIC             181.2      163.9
BIC             185.6      166.8
Log.Lik.      −87.619
RMSE             3.74       2.94
Std.Errors               by: cyl
FE: cyl                        X

Make publication-ready tables with modelsummary (2)

modelsummary::modelsummary(
  list("(1) OLS" = m1, "(2) FE" = m2), 
  gof_omit = "Log|RMSE|[AB]IC", 
  coef_map = c("(Intercept)" = "Intercept", "hp" = "Horsepower")
)
             (1) OLS    (2) FE
Intercept     30.099
             (1.634)
Horsepower    −0.068    −0.024
             (0.010)   (0.015)
Num.Obs.          32        32
R2             0.602
R2 Adj.        0.589
Std.Errors             by: cyl
FE: cyl                      X

Make publication-ready tables with modelsummary (3)

  • If compiling to PDF/LaTeX, the same code will automatically produce a properly formatted LaTeX table.
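
  • You can also write the table directly to a file and \input{} it from your LaTeX document. A sketch (the file name is just an example; modelsummary infers the format from the extension):

modelsummary::modelsummary(
  list("(1) OLS" = m1, "(2) FE" = m2),
  output = "output/02_model_table.tex"
)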

Save plots with ggplot2::ggsave()

library(ggplot2)

p_mpg <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram()
p_mpg

Save plots with ggplot2::ggsave() (2)

  • Don’t use “Export” from RStudio. It’s not reproducible!

  • Use ggplot2::ggsave() instead:
ggsave(plot = p_mpg, filename = "output/plot_mpg.png",
       width = 8,  # other values to try: 10, 12
       height = 6, 
       scale = .8) # lower values will make text and other elements larger
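
  • ggsave() infers the file type from the extension, so filename = "output/plot_mpg.pdf" would save a vector PDF instead of a PNG.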

3 Write more solid code

  • Write abstract code when possible
  • Write code in a consistent style
  • Don’t trust copy-pasting; write iterations!
  • Add sanity checks with assertthat::assert_that()

Write abstract code when possible

  • We want our code to be flexible enough to withstand minor changes upstream. In general, use objects/names instead of numbers!

For columns:

# bad
mtcars[, 1]
# good
mtcars$mpg
mtcars[["mpg"]]

For rows:

# bad
sum(mtcars$mpg) / 32
# good
sum(mtcars$mpg) / nrow(mtcars)

Write code in a consistent style

  • I highly recommend the tidyverse style guide (even if you don’t use the tidyverse). But here are the most important parts, in my opinion.

  • Use descriptive object names. df_anes is better than data.

  • Follow the snake_case convention for object names (including columns). Don’t use capital letters or spaces. df_anes is better than dfAnes.

  • Use janitor::clean_names() to clean messy variable names:

readxl::read_excel("raw/example_excel.xlsx") %>% 
  janitor::clean_names()
# A tibble: 2 × 3
  id_variable gdp_agr_fish x2010_production_log
        <dbl>        <dbl>                <dbl>
1           1         3000                   30
2           2         2000                   20

Don’t trust copy-pasting; write iterations!

  • Let’s say that we want to multiply all of our variables by 100.

  • This gets old fast… and it’s prone to error.

mtcars$mpg  <- mtcars$mpg * 100
mtcars$cyl  <- mtcars$cyl * 100
mtcars$disp <- mtcars$disp * 100
# ...
  • Use iterations instead! Here’s a for-loop:
mtcars2 <- mtcars # keep an untouched copy of the original
for (var in colnames(mtcars)) {
  mtcars[[var]] <- mtcars[[var]] * 100
}
  • dplyr functions can be iterated over columns with across(). The lines below are alternatives, each meant to run on a fresh copy of the data; the printed output corresponds to the last one:
library(dplyr)
mtcars <- mtcars %>% mutate(across(everything(), ~ . * 100))      # every column
mtcars <- mtcars %>% mutate(across(where(is.numeric), ~ . * 100)) # only numeric columns
( mtcars <- mtcars %>% mutate(across(mpg:disp, ~ . * 100)) )      # only columns mpg through disp
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           2100 600 16000 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       2100 600 16000 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          2280 400 10800  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      2140 600 25800 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   1870 800 36000 175 3.15 3.440 17.02  0  0    3    2
Valiant             1810 600 22500 105 2.76 3.460 20.22  1  0    3    1
Duster 360          1430 800 36000 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           2440 400 14670  62 3.69 3.190 20.00  1  0    4    2
Merc 230            2280 400 14080  95 3.92 3.150 22.90  1  0    4    2
Merc 280            1920 600 16760 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           1780 600 16760 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          1640 800 27580 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          1730 800 27580 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         1520 800 27580 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  1040 800 47200 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 1040 800 46000 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   1470 800 44000 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            3240 400  7870  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         3040 400  7570  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      3390 400  7110  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       2150 400 12010  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    1550 800 31800 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         1520 800 30400 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          1330 800 35000 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    1920 800 40000 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           2730 400  7900  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       2600 400 12030  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        3040 400  9510 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      1580 800 35100 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        1970 600 14500 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       1500 800 30100 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          2140 400 12100 109 4.11 2.780 18.60  1  1    4    2

Add sanity checks with assertthat::assert_that()

  • Sometimes you perform an operation and then manually check whether it worked well. Formalize this so you don’t have to do it every time something else changes!

  • This is very common in joins. For instance, you might want your output to preserve the number of rows of the X data frame. If not, you know something went wrong…

( df_x <- data.frame(id = c("A", "B", "C"), x_val = 1:3) )
  id x_val
1  A     1
2  B     2
3  C     3
( df_y <- data.frame(id = c("B", "C", "C"), y_val = 11:13) )
  id y_val
1  B    11
2  C    12
3  C    13
df_merged <- left_join(df_x, df_y)
assertthat::assert_that(nrow(df_x) == nrow(df_merged))
Error: nrow(df_x) not equal to nrow(df_merged)
assertthat::assert_that(sum(is.na(df_merged)) == 0)
Error: sum(is.na(df_merged)) not equal to 0
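
  • Once a check fails, you can track down the problem and fix it explicitly. A hypothetical fix, assuming the duplicated "C" in df_y is an error (df_y_dedup is a name introduced here just for illustration):

df_y_dedup <- dplyr::distinct(df_y, id, .keep_all = TRUE) # drop the duplicated id
df_merged  <- dplyr::left_join(df_x, df_y_dedup, by = "id")
assertthat::assert_that(nrow(df_x) == nrow(df_merged))    # TRUE
  • The NA check would still fail, though, since "A" has no match in df_y.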

4 Make your workflow reproducible

  • Set seeds when randomizing
  • Use renv to make a package vault
  • Automate your entire workflow with targets

Set seeds when randomizing

  • Every time you run this, you’ll get a different result:
rnorm(n = 5, mean = 0, sd = 1)
[1]  0.53327428 -1.92674383 -0.09157776 -1.27632757  1.04407709
  • This is the case in many procedures that use randomness (MCMC, cross-validation, etc.).

  • Whenever you run code with a random component, set a seed first to ensure reproducibility:

set.seed(123)
rnorm(n = 5, mean = 0, sd = 1)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
set.seed(123)
rnorm(n = 5, mean = 0, sd = 1)
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774

Use renv to make a package vault

  • Keeping track of package versions is important for future reproducibility.

  • Using renv makes the most sense when making a replication package, or when sharing an advanced project with colleagues.

  • Run renv::init() in your project to create a record of all packages used and their versions. This will create a few new files and folders in your project (most notably, renv/ and the renv.lock record).

  • If you ever need to update this, run renv::snapshot().

  • When someone else wants to run your code, they start with renv::restore(). This will read the renv.lock record and install all the needed packages in their appropriate versions.
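
  • The whole cycle in code (a sketch; each line runs at a different stage, not all at once):

install.packages("renv") # once per machine
renv::init()             # start the vault: creates renv/ and renv.lock
renv::snapshot()         # later: update renv.lock with current versions
renv::restore()          # on another machine: install the recorded versions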

Automate your entire workflow with targets

  • When projects grow large, it’s hard to keep track of how everything’s related.

  • Say that you just caught a bug in one of your data cleaning scripts. What analyses do you need to rerun downstream?

  • targets builds a dependency graph for your datasets/scripts. If you edit a node anywhere in the flow, targets knows what other nodes become out of sync, and can rerun them automatically.
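
  • A minimal sketch of the _targets.R file that defines such a pipeline for our example project (clean_qog() and run_model() are hypothetical functions you would write yourself):

# _targets.R, at the project root
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  tar_target(raw_file, "raw/qog_bas_cs_jan22.csv", format = "file"),
  tar_target(data_clean, clean_qog(readr::read_csv(raw_file))),
  tar_target(model, run_model(data_clean))
)

  • Then targets::tar_make() builds the pipeline; after any edit, it reruns only the targets that are out of date.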

Other resources