The main benefit of using Notebooks (R Notebooks or Jupyter Notebooks) is that the document is reproducible: the reader knows exactly how the results of the analysis were obtained. I wrote about the use of Notebooks in an earlier post.
Most organizations have a certain report format: a certain cover sheet layout, a certain font, a log of revisions, etcetera. For the most part, organizations have an MS Word template for this report format. If you want to use a Notebook for your analysis and to write your report, you have a few options:
- You could write front matter in MS Word using your company’s report template and then attach the Notebook as an appendix.
- You could also use Pandoc (more about what this is later) to convert the Notebook into a .docx file and then merge it into the report template.
- You could create your own Pandoc template to convert a Notebook directly into a PDF with the correct formatting.
The first option of attaching a Notebook as an appendix to a report otherwise created in MS Word is effective, but it means that you need to maintain two different files: the MS Word report and the Notebook itself. The second option of exporting the Notebook to MS Word and merging it into the template is problematic when it comes to document revisions. If part of the analysis is revised, there is a temptation to change the affected part by either re-exporting only that section from the Notebook into .docx or, worse, making the change directly in MS Word. In both cases, there is the possibility of breaking the reproducibility. For example, let’s say that in your report you define some constants at the beginning and do some math using these constants:
P = 1000
A1 = 2
A2 = 4
sigma1 = P / A1
print(sigma1)
# 500
sigma2 = P / A2
print(sigma2)
# 250
Now let’s say that you ask your new intern to revise the document so that $P = 1200$. They just edit the MS Word version of the report thinking that they will save some time. They don’t notice that $P$ is used twice in the calculation and only update the result from the first time it’s used. Now the report reads:
P = 1200
A1 = 2
A2 = 4
sigma1 = P / A1
print(sigma1)
# 600
sigma2 = P / A2
print(sigma2)
# 250
The report is now wrong. In a simple case like this, you’ll probably notice the error when you review your intern’s work, but if the math were significantly more complex, there is a good chance that you wouldn’t catch the newly introduced error.
For this reason, I think that the best option is to create a Pandoc template for your company’s report format. This means that you’ll be creating a PDF directly from the Notebook. In order to revise the report, you have to re-run the Notebook: the whole Notebook.
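To make the point concrete, here is a minimal sketch of the earlier example written so that every result is recomputed from the inputs. The function name compute_stresses is hypothetical and only used for illustration; the constants are the ones from the example above.

```python
def compute_stresses(P, A1=2, A2=4):
    """Recompute every result that depends on the load P."""
    sigma1 = P / A1
    sigma2 = P / A2
    return sigma1, sigma2


# Original analysis with P = 1000
print(compute_stresses(1000))  # (500.0, 250.0)

# Revised analysis with P = 1200: both results change together
print(compute_stresses(1200))  # (600.0, 300.0)
```

Because the revised value of P flows through the whole calculation, the second stress becomes 300 rather than staying at the stale value of 250.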
For those unfamiliar with Pandoc, it is a program for converting between various file formats. It’s also free and open-source software. Commonly, it’s used for converting from Markdown into HTML or PDF (actually, Pandoc converts to a LaTeX format and LaTeX converts to PDF, but this happens transparently). Pandoc can also convert into MS Word (.docx) and several other formats.
When I decided to create a corporate format for use with notebooks, I looked at the types of notebooks that we use. Generally, statistics are done in an R-Notebook and other analysis is done in a Jupyter notebook. Unfortunately, R-Notebooks and Jupyter Notebooks use different templates. R-Notebooks use pandoc templates, while Jupyter uses its own template. Fortunately, there is a workaround. Jupyter is able to export to markdown, which can be read by pandoc and translated to PDF using a pandoc template. Thus, I made the decision to write a pandoc template.
When pandoc converts a markdown file to PDF, it actually uses LaTeX. The pandoc template is actually a template for converting markdown into LaTeX. Pandoc then calls pdflatex to turn this .tex file into a PDF.
When I first started figuring out how to write a template for converting markdown to PDF, I thought I was going to have to write a LaTeX class or style. I got scared. LaTeX classes are not for the faint of heart. But, I soon realized that I didn’t actually have to do that. The pandoc template that I needed to write was just a regular LaTeX document that has some parameters that pandoc can fill in. I’m not sure that I could figure out how to write a LaTeX class in a reasonable amount of time, but I sure can write a document using LaTeX. This is something that I learned to do when I wrote my undergraduate thesis, and while I don’t write LaTeX often anymore, it’s really not that hard.
A very basic LaTeX file would look something like this:
\documentclass{article}
\begin{document}
\title{My Report Title}
\author{A. Student}
\maketitle
\section{Introduction}
Some text
\end{document}
A pandoc template is just a LaTeX file, but with placeholders for the content that pandoc will insert. These placeholders are just variables surrounded with dollar signs. For example, pandoc has a variable called body. This variable will contain the body of the report. We would simply put $body$ in the part of the template where we want pandoc to insert the body of the report.
Pandoc also supports for and if statements. A common pattern is to check for the existence of a variable, use it if it exists, and fall back to a default value if it does not. The syntax for this would look something like:
$if(myvar)$
$myvar$
$else$
Default text
$endif$
I’ve written the above code on multiple lines for readability, but it could be written on a single line too.
Similarly, if a variable is a list, you’d use a for statement to iterate over the list. We’ll cover this later when we talk about adding logs of revisions.
Defining New Template Variables
Pandoc defines a number of variables by default. However, you’ll likely need to define some variables of your own. First of all, you’ll likely need to define a variable for the report number and the revision.
To create a variable, it’s just a matter of defining it in the YAML header of the markdown file. Variables can either have a single value or they can be lists. Elements of a list start with a dash at the beginning of the line.
Once we add the report number (which we’ll call report-no) and the revision (which we’ll call rev) to the YAML header, the YAML header will look like the following:
title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
(Bonus points if you immediately thought of William Sealy Gosset when you read that).
We’ll probably want to add a log of revisions to the report. The contents of this log of revisions will have to come from somewhere, and the YAML header is the most logical place. The log of revisions will be a list with one element of the list corresponding to each revision in the log. Lists can have nested members. In our case, an entry within the log of revisions will have a revision letter, a date and a description. Including the log of revisions, the YAML header will look like this:
title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
rev-log:
- rev: A
  date: 1-Jun-2019
  desc: Initial release
- rev: B
  date: 18-Jun-2019
  desc: Updated loads based on flight test data
We can now use these variables in our pandoc template. Using the variables report-no and rev is straightforward and will be just the same as using the default variables (like title and author).
Using the list variables will require the use of a for statement. In the case of a log of revisions, each revision will get a row in a LaTeX table. Using the variable rev-log, this table will look like this:
\begin{tabular}{| m{0.25in} | m{0.95in} | m{4.0in} |}
\hline
Rev Ltr & Date & Description \\
$for(rev-log)$
\hline
$rev-log.rev$ & $rev-log.date$ & $rev-log.desc$ \\
$endfor$
\hline
\end{tabular}
In the above LaTeX code, everything between $for(...)$ and $endfor$ gets repeated for each item in the list rev-log. We can access the nested members using dot notation.
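To picture what the loop produces, here is a small Python sketch that mimics the expansion for the two revisions in the YAML header. This is only an illustration: pandoc does the expansion itself, the function name expand_rev_log is made up for this example, and the exact whitespace pandoc emits may differ.

```python
# The rev-log list from the YAML header, written as Python data
rev_log = [
    {"rev": "A", "date": "1-Jun-2019", "desc": "Initial release"},
    {"rev": "B", "date": "18-Jun-2019",
     "desc": "Updated loads based on flight test data"},
]


def expand_rev_log(entries):
    """Build the table rows that the for/endfor block repeats per entry."""
    lines = []
    for entry in entries:
        lines.append(r"\hline")
        # Four backslashes in the source give the two that end a LaTeX row
        lines.append(f"{entry['rev']} & {entry['date']} & {entry['desc']} \\\\")
    return "\n".join(lines)


print(expand_rev_log(rev_log))
```

Each entry in the list contributes one \hline and one table row, which is exactly the part of the template that sits between the for and endfor markers.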
Using the Pandoc Template from an R-Notebook
RStudio handles a lot of the interface with pandoc. Adding the following to the YAML header of the R-Notebook should cause RStudio to use your new template when it compiles the R-Notebook to PDF. This should be all you need to do.
output:
  pdf_document:
    template: my_template_file.tex
    toc_depth: 3
    fig_caption: true
    keep_tex: false
    df_print: kable
Using the Pandoc Template from a Jupyter Notebook
Using your new pandoc template from a Jupyter Notebook is a bit more complicated because Jupyter doesn’t work directly with pandoc. First of all, we need to tell nbconvert to convert to markdown. I think that it’s best to re-run the notebook at the same time (to make sure that it is, in fact, fully reproducible). You can do this using nbconvert as follows:
jupyter nbconvert --execute --to markdown my-notebook.ipynb
But Jupyter notebooks don’t have YAML headers like R-Notebooks do, so we need a place to put all the variables that the template needs. The easiest way to do this is to create a cell at the beginning of the notebook with the cell type set as raw, then enter the YAML header into this cell, including the starting and ending fences (---). Cells of type raw simply get copied to the output, so this becomes the YAML header in the resulting markdown file. The cell would have content similar to the following:
---
title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
rev-log:
- rev: A
  date: 1-Jun-2019
  desc: Initial release
- rev: B
  date: 18-Jun-2019
  desc: Updated loads based on flight test data
---
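Adding the raw cell by hand in the Jupyter interface is the simplest route. If you would rather script it, the sketch below prepends a raw cell to a notebook using only the standard library. It assumes the notebook file follows the version 4 notebook JSON layout (a top-level "cells" list whose entries have "cell_type", "metadata" and "source"). The helper names prepend_yaml_cell and add_header_to_file are hypothetical, and newer notebook formats may also expect an "id" field on each cell, which this sketch does not add.

```python
import json

# A shortened version of the YAML header shown above
YAML_HEADER = """---
title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
---
"""


def prepend_yaml_cell(notebook, header):
    """Return a copy of the notebook dict with a raw header cell first."""
    raw_cell = {"cell_type": "raw", "metadata": {}, "source": header}
    updated = dict(notebook)
    updated["cells"] = [raw_cell] + list(notebook.get("cells", []))
    return updated


def add_header_to_file(path, header=YAML_HEADER):
    """Read a notebook file, prepend the header cell, and write it back."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(prepend_yaml_cell(notebook, header), f, indent=1)
```

Calling add_header_to_file("my-notebook.ipynb") would rewrite the file in place, so it is worth trying this on a copy first.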
Once you’ve used nbconvert to create the markdown file, you can call pandoc. You’ll have to provide the template as a command-line argument and also specify the output filename (so that pandoc knows you want a PDF) and also give the code highlighting style. The call to pandoc will look something like this:
pandoc my-notebook.md -N --template=my_template_file.tex -o my-notebook.pdf --highlight-style=tango
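If you build reports often, the two steps can be chained in a small script. The following is a sketch rather than a finished tool: the function names are hypothetical, the command-line flags are exactly the ones used in the two commands above, and it assumes that jupyter and pandoc are on your PATH and that nbconvert writes my-notebook.md into the directory it is run from.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook):
    """Command that re-runs the notebook and exports it to markdown."""
    return ["jupyter", "nbconvert", "--execute", "--to", "markdown",
            str(notebook)]


def pandoc_command(markdown, template, pdf, highlight_style="tango"):
    """Command that turns the markdown file into a PDF using the template."""
    return ["pandoc", str(markdown), "-N", f"--template={template}",
            "-o", str(pdf), f"--highlight-style={highlight_style}"]


def build_report(notebook, template):
    """Run both steps in the notebook's directory and return the PDF path."""
    notebook = Path(notebook).resolve()
    template = Path(template).resolve()
    workdir = notebook.parent
    markdown = notebook.with_suffix(".md").name
    pdf = notebook.with_suffix(".pdf").name
    # check=True stops the build if either tool reports an error
    subprocess.run(nbconvert_command(notebook.name), cwd=workdir, check=True)
    subprocess.run(pandoc_command(markdown, template, pdf), cwd=workdir,
                   check=True)
    return workdir / pdf
```

A call such as build_report("my-notebook.ipynb", "my_template_file.tex") would then regenerate the PDF from scratch. Running both commands from the notebook's directory keeps any relative paths to figures in the markdown file valid.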
Documentation of Your Template
A “trick” that I’ve used is to add some documentation about how to use the template inside the template itself. It’s pretty unlikely that the user will actually open up the template, but it’s relatively likely that the user will forget one of the variables that the template expects. Since pandoc allows if/else statements, I’ve added the following to my template:
$if(abstract)$
\abstract{$abstract$}
$else$
\abstract{
The documentation for using the template goes here
}
$endif$
This means that if the user forgets to define the abstract variable, the cover page of the report (where the abstract normally goes in my case) will contain the documentation for the template.
Change Bars: Future Work
One of the things that I haven’t yet figured out is change bars. In my organization, we put vertical bars in the margin of reports to indicate what part of a report has been revised. There are LaTeX packages for (manually) inserting change bars into documents. However, I haven’t yet figured out how to automatically insert these into a report generated using pandoc. I’m sure there’s a way, though.
Conclusion
I hope that this demystifies the process of writing a pandoc template to allow you to create reports directly from Jupyter Notebooks or R-Notebooks in your company’s report format.
(Edited to fix a few typos)