01/05/2018

Dynamic documents in R Markdown

Massive disclaimer: I seriously do not know what I'm doing. I try stuff and it works at times. Please consider trying stuff too. It's fun.

I'm serious

Dynamic reporting in R Markdown

Massive disclaimer: I seriously do not know what I'm doing. I try stuff and it works at times. Please consider trying stuff too. It's fun.

Bing

A short history

  • Literate programming (Knuth, 1984)
  • Dynamic documents in R (Xie, 2015)

  • Sweave: Integration of R into LaTeX documents
  • knitr: Sweave + RMarkdown format, caching, + other language support (Python, C++, etc.)

  • knitr is backbone of R Markdown –> Focus of today

Why should everybody document dynamically - A tripartite

  1. Statistical inconsistencies are everywhere.. (Nuijten et al., 2016)
  2. Reproducibility = life
  3. Doing manual work in 2018 seems rather silly

Statistical inconsistencies are everywhere

  • Inconsistencies in psych: it's bad (Nuijten et al., 2016). Replication in management: same/worse (van Zelst et al., work in progress)
  • Reasons are well-known: typos, copy-paste errors, incorrect rounding, etc.

Solution 1: Statcheck

  • But statcheck can only read APA
  • Statcheck shows where you were wrong, but doesn't fix it

Statistical inconsistencies are everywhere (2)

Solution 2: tidystats

  • R-only, although I'm not sure why this is a problem (:
  • Does not yet support most analyses
  • Specific output (e.g. CIs) has to be added manually

Also: tidystats solves other problems than R Markdown does

Also: use tidystats and R Markdown together (Sleegers, 2018)

Statistical inconsistencies are everywhere (3)

R Markdown offers some encompassing solutions:

  • It's not after the fact: mistakes are prevented while writing
  • Supports all analyses (e.g., I managed to produce CIs in a table from meta-analytic structural equation modeling)
  • If R can analyze it, R Markdown can print it

Downside:

  • The good old steep learning curve

Reproducibility = problem then, problem now

"God blessed them and said to them, "Be fruitful and reproduce" (Genesis 1:28; 9:1)

Reproducibility

Reproducibility = life

  1. Will you capable of reproducing your own results in half a year from now?
  2. Will your co-authors be able to do this?
  3. Will a colleague that is not a co-author?
  4. An independent researcher who's in your field of expertise?

Data package requirement (mandatory!): be able to reproduce all your results in reasonable amounts of time

Reproducibility = life (2)

Remember those methods sections where you seem to be unable to understand what the authors did?

Even if you have the syntax/code, would you be able to figure out which analysis produces which result in the paper?

Reproducibility = life (3)

What did I just read

Reproducibility = life (4)

Complete R Markdown file can be self-contained (including main text, data manipulation, analyses, formatting)

You always understand which number was produced by which analysis at the exact place in your manuscript

  • Don't forget to keep annotating your code: you will forget what you did

Doing manual work in 2018 seems rather silly

"we have to manually copy over statistical results into or out of a manuscript" (Sleegers, 2018)

Manual work situations that can be prevented:

  • Redoing analyses when some new data came in (e.g., new papers for your meta-analysis)
  • "Can you try this other statistical specification and send me the table with output?"
  • "Let's submit to this other journal which has completely different formatting requirements"

That seems like a serious waste of time..

Conclusion

Several research-process related problems:

  1. Not all statistics are correctly reported
  2. Non-reproducible research is not research –> we cannot take it seriously
  3. Wasting time on non-useful actions

Dynamic documents/dynamic reporting solves this

Some general statements about R Markdown

Able to produce:

  • Word / ODT
  • PDF (LaTeX)
  • html / Github compatible markdown
  • Slides for presentation
  • Shiny documents

Can import:

  • data / other R-files / 'internet' files / bib-file / csl-file

R Markdown functionality

R Markdown integrates text (Markdown / LaTeX), data manipulation (R), analyses (R), and formatting (Markdown / LaTeX)

Data analysis workflow becomes:

  1. Draft intro, theory, methods within Markdown

  2. Data preparation

  3. Data analysis

  4. Write results using code in Markdown file

  5. Format document within Markdown file

  6. Print and submit!

R Markdown functionality (2)

How does it work?

You write your normal text (similar to e.g. Word) but add 'code chunks' in your text.

Code chunks act like normal R analyses. I actually often write my code in R to try it until it works and then copy paste it to R Markdown.

When you save results to objects (like in R), you can extract the results from those objects in your main text or put them into graphs/tables.

Example

So let's put this into practice #hoedan

Today, I will show some examples of how to build an R Markdown document, how text drafting looks and how you can insert results and other statistical output in your document.

Fasten your seatbelts.

Creating a document

yaml header

Write your draft

main text

How about analyses?

We'll start with the Student's sleep data (again)

res <- t.test(extra ~ group, data = sleep, var.equal = TRUE)

Code:

We found a significant positive effect of drugs on hours of sleep, r*t*(res$parameter) = round(res$statistic,3), p = round(res$p.value,3).

Produces:

We found a significant positive effect of drugs on hours of sleep, t(18) = -1.861, p = 0.079.

Let's move it up a notch

We conducted a meta-analysis on organizational performance feedback which tests to what extent firms change their strategies in response to negative performance feedback.

Obviously, we wanted to test some moderators and conducted a meta-regression.

spfb.wls <- rma(spfb , mods=~ public + finance*mean_age + service + 
                hightech  + change + search, var, method="DL", 
                data=dat, test="knha")
res.spfb <- list(spfb.wls)

Writing the results section

Code:

# For social performance feedback, we find that firms are significantly 
# more responsive to non-financial performance feedback below the 
# aspiration level ($\beta$ = ` round(spfb.wls$b[3],3)`, 
# *p*-value = ` round(spfb.wls$pval[3],3)`).

Produces:

For social performance feedback, we find that firms are significantly more responsive to non-financial performance feedback below the aspiration level (\(\beta\) = -0.081, p-value = 0.019).

Producing table

Let's say you also want to produce a table with the results.

knitr allows for easy table production, if you can configure your code right (this will take (many) trial and error runs)

      knitr::kable(data.frame(name=c("Constant","Public firm","Finance",
      "Firm age","Service","High-tech","Change","Search","Interaction MetricAge"),
      
      HPFBA = sapply(res.spfb, function(x) x`$`b), 
      CIlb = sapply(res.spfb, function(x) x`$`ci.lb),
      CIub = sapply(res.spfb, function(x) x`$`ci.ub),
      Pvalue=sapply(res.spfb, function(x) x`$`pval),
      digits=3,  col.names=c("Variable","PF < HA","CI~lb~","CI~ub~",
                              "*p-value*"),
      caption = "Table 3: Meta-analytic weighted 
       least squares for social performance feedback")

Table time

Table 3: Meta-analytic weighted least squares for social performance feedback
Variable PF < HA CIlb CIub p-value
Constant 0.057 -0.070 0.185 0.367
Public firm -0.143 -0.191 -0.096 0.000
Finance -0.081 -0.148 -0.014 0.019
Firm age 0.002 0.000 0.003 0.015
Service -0.006 -0.066 0.055 0.845
High-tech 0.033 -0.017 0.083 0.192
Change 0.015 -0.057 0.087 0.682
Search -0.040 -0.122 0.043 0.337

So we wanted to plot some meta-analytic interaction. We interact our 'performance metric' variable with 'organizational age'.

require(metafor)
dat$spfb <-0.5*log((1+dat$spfb_dv)/(1-dat$spfb_dv))
spfb.wls <- rma(spfb,mods=~ public + finance*mean_age + service + 
                  hightech + search + change,var,data=dat,
                method="DL",test="knha")
wi    <- 1/sqrt(dat$var)
size  <- 0.5 + 1.5*(wi - min(wi))/(max(wi) - min(wi))
plot(dat$mean_age, dat$spfb_dv, pch=19, xlim=c(0,120), ylim=c(-0.5,0.5), cex=size,
     xlab="Firm age (years)", ylab="Correlation", las = 1, bty="l")
preds.nonfinance <- predict(spfb.wls, newmods=cbind(1,0,0:120,0,0,0,0,0))
lines(0:120, preds.nonfinance$pred,col="red",lwd=2)
lines(0:120, preds.nonfinance$ci.lb,lty="dotted",col="red",lwd=2)
lines(0:120, preds.nonfinance$ci.ub,lty="dotted",col="red",lwd=2)
preds.finance <- predict(spfb.wls, newmods=cbind(1,1,0:120,0,0,0,0,0:120))
lines(0:120, preds.finance$pred,col="blue",lwd=2)
lines(0:120, preds.finance$ci.lb,lty="dotted",col="blue",lwd=2)
lines(0:120, preds.finance$ci.ub,lty="dotted",col="blue",lwd=2)
legend("topright", legend=c("Non-financial","Financial"),lty=c("dotted","solid"),col=c("red","blue"),lwd=2)
title("P < SA for public")

Plot

Reporting statistics

Because all statistics are structured in a manner you devised, it is very easy to write functions that return whatever-style output.

Using code inside your text allows you to specify where you want statistical output, which can be individual tests, tables, graphs, etc.

Self-contained files

Being smart with packages allows you to make a completely self-contained paper within your Rmd file

if(!require(devtools)){install.packages("devtools")}
require(devtools)
if(!require(osfr)){install_github("chartgerink/osfr",force=TRUE)}
require(osfr)
download_files('bj786', private=TRUE) #masterdata.csv
download_files('wqyz2', private=TRUE) #reference file for formatting
download_files('pz3kx', private=TRUE) #20170711studies.csv
download_files('3v9jf', private=TRUE) #csl file for AMR references
download_files('g6xur', private=TRUE) #references

Live example from a conference paper

Limitations of R Markdown

The pesky learning curve:

  • Producing a text file with numbers: rather easy
  • Producing a formatted docx/PDF with tables: doable'ish
  • Producing a presentation with analyses, tables, that is nicely formatted: ????

For PhD students:

  • "My supervisor doesn't understand this"

Concluding thoughts

I think reproduciblity is a daily process. Reproducible research starts today; not when your project is done.

When the TSB Science Committee requests to check your paper, you should be able to send them what they request within five minutes.

R Markdown is a great tool to help you do that.