Purpose
This post illustrates how R can be used to extract numerical information from PDF files. The technique is illustrated by extracting the Credit Risk Premium and Valuation Uncertainty parameters from the PDF file containing the instructions for the PRA’s 2021 Solvency II Quantitative Impact Study (QIS).
(Note that since this post was written, the PRA have updated the QIS instructions. This has changed the pagination of the PDF file slightly. The code below has not been updated for this.)
This post is partly to illustrate how to extract data from PDF files, and partly preparation for a later one discussing the QIS in more detail, in particular the draft redesign of the Fundamental Spread. See this post for a general introduction to the existing Fundamental Spread.
Introduction
PDF stands for Portable Document Format. This format is designed for presentation, principally to look the same on-screen as when printed out, and so some work is needed to extract data from it.
Copying and pasting tables from PDF files into Word or Excel often does not work very well and needs manual adjustment, for example using Excel’s Convert Text to Columns feature.
The R library pdftools reads PDF files and does a great deal of work behind the scenes to represent their content as text, which provides an alternative to copying from a PDF viewer.
Using R is more auditable than manually copying/pasting, because all the code used is stored for later examination (it can be thought of as a ‘recording’ of the manipulation). Being code-based, it can in principle be automated for use-cases where PDFs in the same overall format are regularly updated (this does not apply to the QIS case study, but the point holds in general).
Download and open the PDF file
The calibrations tested in the PRA QIS are included in the ‘instructions’ PDF file at: https://www.bankofengland.co.uk/-/media/boe/files/prudential-regulation/solvency-ii/solvency-ii-reform-quantitative-impact-survey/qis-instructions.pdf.
The instructions file is downloaded to the working directory using R as follows. Note that on Windows the mode must be set to "wb" so the file is downloaded as binary; treating it as text would corrupt it.
url <- "https://www.bankofengland.co.uk/-/media/boe/files/prudential-regulation/solvency-ii/solvency-ii-reform-quantitative-impact-survey/qis-instructions.pdf"
pdf_file <- "qis-instructions.pdf"
download.file(url, pdf_file, mode = "wb")
First we use the pdftools::pdf_info function to look at some basic information:
pdf_i <- pdftools::pdf_info(pdf_file)
# Page count
pdf_i$pages
#> [1] 35
# Author, Producer, Title
pdf_i$keys
#> $Author
#> [1] "Prudential Regulation Authority, Bank of England"
#>
#> $Producer
#> [1] "Microsoft: Print To PDF"
#>
#> $Title
#> [1] "Review of Solvency II - QIS Instructions"
All is as expected, so we can now load the PDF file in full using pdftools::pdf_text. This returns an R character vector with one entry per page (so 35 entries, matching the page count above). Each page is a continuous stream of text, with embedded newlines ("\n" in R).
pdf_pages <- pdftools::pdf_text(pdf_file)
str(pdf_pages)
#> chr [1:35] "Review of Solvency II 㤼㸶 Quantitative Impact Study (QIS)\nInstructions for completing the QIS template(s)\n\nVer"| __truncated__ ....
Extract the Credit Risk Premium parameters
From looking at the file in a PDF viewer, the credit risk premium (CRP) parameters are on page 28. We use cat to view this page with newlines rendered as line breaks rather than shown as "\n". It can be seen that the ‘text’ version of the PDF provided by pdftools is reasonably faithful to the original, with the embedded tables laid out logically, so it will be possible to extract the numerical values.
cat(pdf_pages[[28]])
#> (ii) A percentage applied to the 5-year average spread on an index of the same sector
#> and CQS as the asset.
#>
#> 10. The percentages to be used in 9(i) and 9(ii) are as follows:
#>
#> Scenario A Scenario B
#> Percentage applied to current z-spread 25% 25%
#> Percentage applied to average spread 25% 0%
#>
#> 11. The 5-year average spreads to be used for the purposes of 9(ii) above are as follows:
#>
#> 5-year average spreads (bps)
#> Financials Non-Financials
#> CQS 0 90 79
#> CQS 1 121 112
#> CQS 2 180 159
#> CQS 3 276 197
#> CQS 4 462 379
#> CQS 5 693 656
#> CQS 6 693 656
#>
#> 12. For the purposes of this QIS, we will be testing both a floor and a cap applied to the total
#> CRP. The floor will be applied in Scenarios A and B and all the sensitivities tested under
#> them. The cap will be tested both on and off in Scenario A only19. The following table
#> sets out the floor and cap to be applied to the total CRP.
#>
#> CRP floor for Scenario A and B
#> CRP cap for Scenario A (bps)
#> (bps)
#> Financials Non-Financials Financials Non-Financials
#> CQS 0 11 4 56 42
#> CQS 1 27 19 101 86
#> CQS 2 43 28 153 124
#> CQS 3 72 40 229 166
#> CQS 4 168 117 507 403
#> CQS 5 391 184 1,037 623
#> CQS 6 391 184 1,207 793
#>
#> 13. The following table summarises the scenarios in which the floor and cap should be
#> applied.
#>
#> 19 As stated above, there are a number of cells in Scenario A of the template with the explicit direction that figures
#> are to be provided on the basis of a cap being applied. Unless such explicit direction is give, please provide
#> figures on the basis that the cap is not applied.
#>
#> 28
The text representation isn’t perfect – footnote 19 is not shown in superscript, for example – but it’s definitely usable.
To ease parsing of page 28, we use strsplit to break the text up at newlines, so we can address each line individually. For example, lines 6 to 8 hold the CRP percentages applied to z-spreads in each scenario, lines 13 to 20 hold the 5-year average spreads, and lines 27 to 37 hold the floors and caps:
p28_lines <- strsplit(pdf_pages[[28]], "\n")[[1]] # strsplit returns a 1-item list here and we want the first item itself
# Percentages
p28_lines[6:8]
#> [1] " Scenario A Scenario B"
#> [2] " Percentage applied to current z-spread 25% 25%"
#> [3] " Percentage applied to average spread 25% 0%"
# 5-year average spreads
p28_lines[13:20]
#> [1] " Financials Non-Financials"
#> [2] " CQS 0 90 79"
#> [3] " CQS 1 121 112"
#> [4] " CQS 2 180 159"
#> [5] " CQS 3 276 197"
#> [6] " CQS 4 462 379"
#> [7] " CQS 5 693 656"
#> [8] " CQS 6 693 656"
# Floors and caps
p28_lines[27:37]
#> [1] " CRP floor for Scenario A and B"
#> [2] " CRP cap for Scenario A (bps)"
#> [3] " (bps)"
#> [4] " Financials Non-Financials Financials Non-Financials"
#> [5] " CQS 0 11 4 56 42"
#> [6] " CQS 1 27 19 101 86"
#> [7] " CQS 2 43 28 153 124"
#> [8] " CQS 3 72 40 229 166"
#> [9] " CQS 4 168 117 507 403"
#> [10] " CQS 5 391 184 1,037 623"
#> [11] " CQS 6 391 184 1,207 793"
Now we need to pull out the numbers. This is hardly worth doing for the percentages, which could just be typed in, but it establishes the general principle we’ll use for the other tables.
The idea is to use regular expressions to search for numerical values systematically. The regular expression [0-9] means ‘a digit character’ and the + means ‘one or more of the preceding’, so [0-9]+ matches a run of digits of any length. This is sufficient to parse the percentages, but to cope with the embedded commas and decimal points in the later tables we expand it to [0-9,\\.]+, which matches one or more characters, each of which is a digit, a comma or a decimal point.
Outside a character class, the decimal point is a ‘special’ character that matches any single character, so it is escaped here with a \ (strictly speaking, the dot is already literal inside the square brackets, but the escape is harmless and makes the intent explicit), and in turn the \ has to be doubled for R to pass a literal backslash to the regular expression engine. The power of regular expressions makes it worthwhile to tolerate these fiddly subtleties!
This isn’t a fully general way of parsing numbers, but it suffices for the PRA QIS tables here.
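As a quick illustration of this pattern before wrapping it in a function, here is a minimal sketch run on a made-up line laid out like the tables above (the line itself is invented for the example):

```r
# A made-up line in the same layout as the QIS tables (illustration only)
x <- "  CQS 5   391   184   1,037   623"

# gregexpr locates every match of the pattern; regmatches extracts them
matches <- regmatches(x, gregexpr("[0-9,\\.]+", x))[[1]]
matches
#> [1] "5"     "391"   "184"   "1,037" "623"

# Stripping the commas leaves text that as.numeric can convert
as.numeric(gsub(",", "", matches))
#> [1]    5  391  184 1037  623
```

Note that the ‘5’ from the CQS label is matched too; that is deliberate, as the first matched value on each line is used later to label the rows.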
We wrap this into a function parse_numbers that matches the numerical text, deals with embedded commas, converts the results to numbers while retaining any embedded decimal points, and arranges the results as a matrix. The combination of gregexpr and regmatches extracts the matching substrings and discards the non-matching text. (The parentheses around the expression in the code below simply group it; gregexpr and regmatches return whole matches, so a capture group is not needed here.)
For simplicity we add the row and column names manually rather than by parsing.
parse_numbers <- function(text_lines) {
matches <- regmatches(text_lines, gregexpr("([0-9,\\.]+)", text_lines))
matches <- t(simplify2array(matches)) # t to transpose the result of simplify2array
nr <- nrow(matches) # Keep the number of rows so it can be restored later
matches <- gsub(",", "", matches) # Remove commas
matches <- matrix(as.numeric(matches), nrow = nr) # Convert to actual numbers rather than text representing numbers
return(matches)
}
crp_fac <- parse_numbers(p28_lines[7:8]) / 100 # Divide by 100 to convert percentages to decimal fractions
rownames(crp_fac) <- c("Current", "5yrAverage")
colnames(crp_fac) <- c("ScenarioA", "ScenarioB")
crp_fac
#> ScenarioA ScenarioB
#> Current 0.25 0.25
#> 5yrAverage 0.25 0.00
We use parse_numbers again to pull out the 5-year average spreads, and put them in a data frame with the CQS labels in the first column:
crp_5yr_avg_bp <- as.data.frame(parse_numbers(p28_lines[14:20]))
crp_5yr_avg_bp[,1] <- paste0("CQS", crp_5yr_avg_bp[,1])
colnames(crp_5yr_avg_bp) <- c("CQS", "Financial", "NonFinancial")
crp_5yr_avg_bp
#> CQS Financial NonFinancial
#> 1 CQS0 90 79
#> 2 CQS1 121 112
#> 3 CQS2 180 159
#> 4 CQS3 276 197
#> 5 CQS4 462 379
#> 6 CQS5 693 656
#> 7 CQS6 693 656
Finally we pull out the CRP floors and caps, which are laid out as a 5-column table, and split this into two separate data frames:
crp_floor_cap_bp <- as.data.frame(parse_numbers(p28_lines[31:37]))
crp_floor_cap_bp[,1] <- paste0("CQS", crp_floor_cap_bp[,1])
crp_floor_bp <- crp_floor_cap_bp[,1:3]
crp_cap_bp <- crp_floor_cap_bp[,c(1, 4:5)]
colnames(crp_floor_bp) <- colnames(crp_cap_bp) <- c("CQS", "Financial", "NonFinancial")
crp_floor_bp
#> CQS Financial NonFinancial
#> 1 CQS0 11 4
#> 2 CQS1 27 19
#> 3 CQS2 43 28
#> 4 CQS3 72 40
#> 5 CQS4 168 117
#> 6 CQS5 391 184
#> 7 CQS6 391 184
crp_cap_bp
#> CQS Financial NonFinancial
#> 1 CQS0 56 42
#> 2 CQS1 101 86
#> 3 CQS2 153 124
#> 4 CQS3 229 166
#> 5 CQS4 507 403
#> 6 CQS5 1037 623
#> 7 CQS6 1207 793
Extract Valuation Uncertainty parameters
The Valuation Uncertainty (VU) parameters are on page 29.
p29_lines <- strsplit(pdf_pages[[29]], "\n")[[1]]
# VU parameters
p29_lines[19:24]
#> [1] " CQS IFRS Level VU for Scenario A VU for Scenario B"
#> [2] " (bps) (bps)"
#> [3] " All Level 1 NIL Nil"
#> [4] " All Level 2 7.5 3.75"
#> [5] " 0-3 Level 3 7.5 3.75"
#> [6] " 4-6 Level 3 25.0 12.50"
We pull these out, omitting the row for IFRS Level 1 (its values are ‘NIL’ rather than numbers, which complicates the parsing) and starting the parse at character 30 to skip the CQS column, which would otherwise confuse matters. We then add the CQS column back in, manually expand the rows to refer to each CQS explicitly, and include zero values for IFRS Level 1 at the top.
vu_lines <- p29_lines[22:24]
vu_lines <- unname(vapply(vu_lines, function(x) substr(x, 30, nchar(x)), character(1)))
ex_cqs_vu_bp <- as.data.frame(parse_numbers(vu_lines))
ex_cqs_vu_bp[,1] <- paste0("IFRS", ex_cqs_vu_bp[,1])
colnames(ex_cqs_vu_bp) <- c("IFRSLevel", "ScenarioA", "ScenarioB")
vu_bp <- rbind(data.frame(CQS = paste0("CQS", 0:6), data.frame(IFRSLevel = "IFRS1", ScenarioA = 0, ScenarioB = 0)),
data.frame(CQS = paste0("CQS", 0:6), ex_cqs_vu_bp[1,]), # First row is IFRS2 all CQS
data.frame(CQS = paste0("CQS", 0:3), ex_cqs_vu_bp[2,]), # Second row is IFRS3 CQS0-3
data.frame(CQS = paste0("CQS", 4:6), ex_cqs_vu_bp[3,])) # Third row is IFRS3 CQS4-6
vu_bp
#> CQS IFRSLevel ScenarioA ScenarioB
#> 1 CQS0 IFRS1 0.0 0.00
#> 2 CQS1 IFRS1 0.0 0.00
#> 3 CQS2 IFRS1 0.0 0.00
#> 4 CQS3 IFRS1 0.0 0.00
#> 5 CQS4 IFRS1 0.0 0.00
#> 6 CQS5 IFRS1 0.0 0.00
#> 7 CQS6 IFRS1 0.0 0.00
#> 8 CQS0 IFRS2 7.5 3.75
#> 9 CQS1 IFRS2 7.5 3.75
#> 10 CQS2 IFRS2 7.5 3.75
#> 11 CQS3 IFRS2 7.5 3.75
#> 12 CQS4 IFRS2 7.5 3.75
#> 13 CQS5 IFRS2 7.5 3.75
#> 14 CQS6 IFRS2 7.5 3.75
#> 15 CQS0 IFRS3 7.5 3.75
#> 16 CQS1 IFRS3 7.5 3.75
#> 17 CQS2 IFRS3 7.5 3.75
#> 18 CQS3 IFRS3 7.5 3.75
#> 19 CQS4 IFRS3 25.0 12.50
#> 20 CQS5 IFRS3 25.0 12.50
#> 21 CQS6 IFRS3 25.0 12.50
Save the parameters
Finally we save the CRP and VU parameters to RDS (R format) and CSV (for Excel) files for later analysis.
saveRDS(crp_fac, "qis21_crp_fac.rds")
saveRDS(crp_5yr_avg_bp, "qis21_crp_5yr_avg_bp.rds")
saveRDS(crp_floor_bp, "qis21_crp_floor_bp.rds")
saveRDS(crp_cap_bp, "qis21_crp_cap_bp.rds")
saveRDS(vu_bp, "qis21_vu_bp.rds")
write.csv(crp_fac, "qis21_crp_fac.csv", row.names = FALSE)
write.csv(crp_5yr_avg_bp, "qis21_crp_5yr_avg_bp.csv", row.names = FALSE)
write.csv(crp_floor_bp, "qis21_crp_floor_bp.csv", row.names = FALSE)
write.csv(crp_cap_bp, "qis21_crp_cap_bp.csv", row.names = FALSE)
write.csv(vu_bp, "qis21_vu_bp.csv", row.names = FALSE)
Summary
In this post, we showed how the R package pdftools and R’s built-in regular expression functions can be used to pull numerical information out of PDF files.