Download this application exercise by pasting the code below into your console
download.file("https://sta101.github.io/static/appex/ae18.rmd",
destfile = "ae18.rmd")
library(tidyverse)
library(tidymodels)
library(reshape2)
By the end of today, you will…
Our data includes
This application exercise builds on the R
code found here
and The
Markov Chain Monte Carlo Revolution by Persi Diaconis.
Let’s load the data
warandpeace = readLines("https://sta101.github.io/static/appex/data/warandpeace.txt")
frequency = read.table("https://sta101.github.io/static/appex/data/frequencies.txt")
colnames(frequency) = c(toupper(letters), "") # edit column names
secret_message = readLines("https://sta101.github.io/static/appex/data/secret-message.txt")
Take a look at secret-message
. Each letter is a stand in
for exactly one other letter. This sort of cipher is known as a
“substitution cipher”.
‘A’ could be encoded as one of 26 characters (A, B, C, …). Once the encoding for ‘A’ is chosen, ‘B’ has 25 possibilities and so on so there are, in total \(26 \times 25 \times 24 \times \ldots \times 3 \times 2 \times 1\) possibilities.
n = 26
input = n + 1
keys = gamma(input)
keys
## [1] 4.032915e+26
That’s over \(4 \times 10^{26}\) possible keys! If you could check 10M keys per second, it would take approximately \(1 \times 10^{12}\) (trillion) years to check every possible key. Trying every possible key is known as a “brute force” approach.
Here we determine how often one character follows another using the text from the very long book, War and Peace. We include a whitespace character as a 27th character in our alphabet.
To reduce computational demand, we will load the object created by this analysis but leave the code below for reference.
The result of the below analysis is in the object
frequency
.
The letter
column denotes the first character in a 2
character sequence while subsequent columns determine the second
character in the character sequence.
# code here
Create a heatmap of character frequency, where the y-axis is the first letter and the x-axis is the second letter in a two letter chain.
Hint: We’ll use melt
from the reshape2
package. Click here
for an example.
melt_freq = melt(as.matrix(frequency))
# melt_freq %>%
# ggplot(aes(x = __, y = __, fill = __))
We will use a famous statistical algorithm, Markov chain Monte Carlo (MCMC) to break the substitution cipher.
MCMC is composed of three essential components: