library(tidyverse)
library(parallel)
library(MASS)
set.seed(42)
Martin Petr, PopGen 2022
Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. [It] is already clear that computer methods are very powerful. This is good. It […] permits people with limited mathematical knowledge to work on important problems […]
What minimum components are needed for the program to be useful?
We need populations.
We need genetics.
…a linear sequence of nucleotides…
… which accumulates mutations every generation at a given mutation rate.
A collection of individuals at a given point in time.
Each individual carrying two homologous chromosomes inherited from two parents in the previous generation.
Chromosomes recombine at a certain recombination rate.
Let’s make the algorithm even more minimal by focusing on the evolution of a single mutation across time (i.e. no mutation process, no recombination).
If you want to try the following couple of examples yourself, you will need to paste the following lines into your R session (either R in the terminal, or better an RStudio session):
N <- 500 # number of (haploid) individuals in the population
generations <- 500 # number of generations to run the simulation for
p_start <- 0.5 # initial allele frequency
# initialize an (empty) trajectory vector of allele frequencies
p_traj<- rep(NA, generations)
p_traj[1] <- p_start
# in each generation...
for (gen_i in 2:generations) {
p <- p_traj[gen_i - 1] # get the current frequency
# ... calculate the allele frequency in the next generation ...
p_next <- rbinom(1, N, p) / N
# ... and save it to the trajectory vector
p_traj[gen_i] <- p_next
}
p_traj
Input: \(N\), \(p_0\) and the number of generations
Output: allele frequency trajectory vector
simulate <- function(N, p_start, generations) {
# initialize an (empty) trajectory vector of allele frequencies
p_traj<- rep(NA, generations)
p_traj[1] <- p_start
# in each generation...
for (gen_i in 2:generations) {
p <- p_traj[gen_i - 1] # get the current frequency
# ... calculate the allele frequency in the next generation ...
p_next <- rbinom(1, N, p) / N
# ... and save it to the trajectory vector
p_traj[gen_i] <- p_next
}
p_traj
}
p_start <- 0.5
N <- 10000
factors <- fractions(c(3, 2, 1, 1/2, 1/5, 1/10))
final_frequencies <- parallel::mclapply(
seq_along(factors), function(i) {
f <- factors[i]
t <- as.integer(N * f)
# get complete trajectories as a matrix (each column a single trajectory)
reps <- replicate(10000, simulate(N = N, p_start = p_start, generations = t))
# only keep the last slice of the matrix with the final frequencies
data.frame(
t = sprintf("t = %s * N", f),
freq = reps[t, ]
)
},
mc.cores = parallel::detectCores()
) %>% do.call(rbind, .)
final_frequencies$t <- fct_rev(fct_relevel(final_frequencies$t, sprintf("t = %s * N", factors)))
final_frequencies %>% .[.$freq > 0 & .$freq < 1, ] %>%
ggplot() +
geom_histogram(aes(freq, y = ..density.., fill = t), position = "identity", bins = 100, alpha = 0.75) +
labs(x = "allele frequency") +
coord_cartesian(ylim = c(0, 3)) +
facet_grid(t ~ .) +
guides(fill = guide_legend(sprintf("time since\nthe start\n[assuming\nN = %s]", N))) +
theme_minimal() +
theme(strip.text.y = element_blank(),
axis.title.x = element_text(size = 15),
axis.title.y = element_text(size = 15),
axis.text.x = element_text(size = 15),
axis.text.y = element_text(size = 15),
legend.title = element_text(size = 15),
legend.text = element_text(size = 15))
If you’re bored later, try adding selection to this model!
In each generation, weight the frequencies based on fitness of each of the tree genotypes.
If you need a hint, take a look here.
The most famous and widely used are SLiM and msprime.
They are both extremely powerful…
… but they require a lot of programming knowledge and quite a lot of code to write non-trivial simulations (without bugs).
initialize() {
// create a neutral mutation type
initializeMutationType("m1", 0.5, "f", 0.0);
// initialize 1Mb segment
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 999999);
// set mutation rate and recombination rate of the segment
initializeMutationRate(1e-8);
initializeRecombinationRate(1e-8);
}
// create an ancestral population p1 of 10000 diploid individuals
1 early() { sim.addSubpop("p1", 10000); }
// in generation 1000, create two daughter populations p2 and p3
1000 early() {
sim.addSubpopSplit("p2", 5000, p1);
sim.addSubpopSplit("p3", 1000, p1);
}
// in generation 10000, stop the simulation and save 100 individuals
// from p2 and p3 to a VCF file
10000 late() {
p2_subset = sample(p2.individuals, 100);
p3_subset = sample(p3.individuals, 100);
c(p2_subset, p3_subset).genomes.outputVCF("/tmp/slim_output.vcf.gz");
sim.simulationFinished();
catn("DONE!");
}
The following is basically the same model as the SLiM script earlier:
import msprime
demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
sequence_length=10e6,
recombination_rate=1e-8,
samples={"A": 100, "B": 100},
demography=demography
)
source: link
Most researchers are not expert programmers
All but the most trivial simulations require lots of code
90%
create populations (splits and \(N_e\) changes)
specify if/how they should mix (rates and times)
save output (VCF, EIGENSTRAT)
Lot of code duplication across projects
slendr
& tidyverse
First run this:
Then run this (and wait for the Python installation to finish!):
RStudio sometimes interferes with Python setup that we need for simulation. To fix this, go to Tools
-> Global Options
in your RStudio and set the following options:
A name, size and the time of appearance must be given:
Splits are indicated by the parent = <pop>
argument:
resize()
Step size decrease:
Exponential increase:
Step size decrease:
Exponential increase:
pop1 <- population("pop1", N = 1000, time = 1)
pop2 <-
population("pop2", N = 1000, time = 300, parent = pop1) %>%
resize(N = 100, how = "step", time = 1000)
pop3 <-
population("pop3", N = 1000, time = 400, parent = pop2) %>%
resize(N = 2500, how = "step", time = 800)
pop4 <-
population("pop4", N = 1500, time = 500, parent = pop3) %>%
resize(N = 700, how = "exponential", time = 1200, end = 2000)
pop5 <-
population("pop5", N = 100, time = 600, parent = pop4) %>%
resize(N = 50, how = "step", time = 900) %>%
resize(N = 250, how = "step", time = 1200) %>%
resize(N = 1000, how = "exponential", time = 1600, end = 2200) %>%
resize(N = 400, how = "step", time = 2400)
slendr 'population' object
--------------------------
name: pop5
non-spatial population
stays until the end of the simulation
population history overview:
- time 600: split from pop4
- time 900: resize from 100 to 50 individuals
- time 1200: resize from 50 to 250 individuals
- time 1600-2200: exponential resize from 250 to 1000 individuals
- time 2400: resize from 1000 to 400 individuals
Typing the compiled model prints a brief summary:
slendr 'model' object
---------------------
populations: pop1, pop2, pop3, pop4, pop5
geneflow events: [no geneflow]
generation time: 1
time direction: forward
total running length: 3000 model time units
model type: non-spatial
configuration files in: /private/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T/RtmpgFKEFA/file168917f097902
The model is also saved to disk! (The location can be specified via path =
argument to compile_model()
):
[1] "checksums.tsv" "description.txt" "direction.txt"
[4] "generation_time.txt" "length.txt" "orig_length.txt"
[7] "populations.tsv" "ranges.rds" "resizes.tsv"
[10] "script.py" "script.slim"
We have the compiled model
, how do we simulate data?
slendr has two simulation engines already built-in:
Take a model object and use the built-in simulation engine:
ts
is a so-called “tree sequence”
(As full as possible) a representation of our samples’ history:
This simulates 20 \(\times\) 10000 chromosomes of 100 Mb.
In less than 30 seconds.
That’s a crazy amount of data!
ts <-
population("pop", time = 100000, N = 10000) %>%
compile_model(generation_time = 1, direction = "backward") %>%
msprime(sequence_length = 100e6, recombination_rate = 1e-8)
ts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 391441║
╟───────────────┼───────────╢
║Sequence Length│ 100000000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 20000║
╟───────────────┼───────────╢
║Total Size │ 66.0 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │1523118│ 46.5 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 10000│273.5 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │ 287040│ 7.7 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 1│222 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 1│ 1.3 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧═══════╧═════════╧════════════╝
A tree (sequence) can be represented by tables of:
# A tibble: 3 × 4
name pop node_id time
<chr> <fct> <int> <dbl>
1 pop_1 pop 0 0
2 pop_1 pop 1 0
3 pop_2 pop 2 0
# A tibble: 3 × 2
child_node_id parent_node_id
<int> <int>
1 0 43032
2 0 48715
3 0 54705
By default, msprime
function automatically loads the tree-sequence that the simulation produced in the background:
If we type ts
into an R console, we get…
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 141110║
╟───────────────┼───────────╢
║Sequence Length│ 100000000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 9400║
╟───────────────┼───────────╢
║Total Size │ 24.6 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges │567858│ 17.3 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│ 4700│128.5 KiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes │106198│ 2.8 MiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Populations│ 5│383 Bytes│ Yes║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│ 1│ 3.9 KiB│ No║
╟───────────┼──────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧══════╧═════════╧════════════╝
This R interface links to Python methods implemented in tskit.
Tree sequences make it possible to directly compute many quantities of interest without going via conversion to a genotype table/VCF!
Each “sampled” individual in slendr has a symbolic name, a sampling time, and a population assignment:
Each “sampled” individual in slendr has a symbolic name, a sampling time, and a population assignment:
Let’s say we have the following model and we simulate a tree sequence from it.
# sample 5 individuals
# (i.e. 10 chromosomes)
samples <-
ts_samples(ts) %>%
sample_n(5) %>%
pull(name)
# compute allele frequency
# spectrum from the given set
# of individuals
afs1 <- ts_afs(
ts, list(samples),
mode = "branch",
polarised = TRUE,
span_normalise = TRUE
)
afs1
[1] 40188.717 20286.332 13275.749 10084.000 8204.045 6736.871 5629.087
[8] 5140.529 4426.037 3913.154
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 392515║
╟───────────────┼───────────╢
║Sequence Length│ 100000000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 20000║
╟───────────────┼───────────╢
║Total Size │ 66.2 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │1527209│ 46.6 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 10000│273.5 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │ 288075│ 7.7 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 1│222 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 1│ 1.3 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧═══════╧═════════╧════════════╝
There is a duality between mutations and branch lengths in trees (more here).
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 392375║
╟───────────────┼───────────╢
║Sequence Length│ 100000000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 20000║
╟───────────────┼───────────╢
║Total Size │ 91.1 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │1527630│ 46.6 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 10000│273.5 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 420531│ 14.8 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │ 287272│ 7.7 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 1│222 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 2│ 2.1 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 419656│ 10.0 MiB│ No║
╚═══════════╧═══════╧═════════╧════════════╝
population()
compile_model()
msprime()
ts_samples()
ts_afs()
RStudio sometimes interferes with Python setup that we need for simulation. To fix this, go to Tools
-> Global Options
in your RStudio and set the following options:
1. Type library(slendr)
into the R console
slendr is pre-installed for the browser RStudio. If you use RStudio on your own computer (not in the browser), you can get it by install.packages("slendr")
.
2. Type setup_env()
When prompted to setup Python, answer “Yes”!
Collaborator Hestu gave you AFS computed from 10 individuals of a sub-species of the bushy-tailed squirrel discovered in the Forest of Spirits in the land of Hyrule:
Fossil evidence is consistent with constant size of the population over 100,000 generations of its history. An Oracle you met in the Temple of Time said that the true squirrel \(N_e\) has been between 1000 and 30000.
Use slendr to simulate history of this species. Use this to guess the likely value of squirrel’s \(N_e\) given the observed AFS.
Write a function that gets \(N_e\) as input and returns the AFS.
Find the \(N_e\) value that will give the closest AFS to the one you got from Hestu. Use whatever method you’re comfortable with based on your programming experience:
i ) Plot simulated AFS for different \(N_e\) with the AFS and just eye-ball \(N_e\) value that looks correct.
ii ) Simulate AFS across a grid of \(N_e\) values and find the closest matching one (maybe use mean-squared error?)
iii ) Run a mini-Approximate Bayesian Computation, using the Oracle’s range of [10000, 30000] as a uniform prior.
The squirrels have split into three species with different demographic histories (s1, s2, s3 – model on the right). Species s2 and s3 are daughter populations which split in generation 2 from the original species s1.
Help the Oracle predict the future shape of the AFS of the squirrels after 10000 generations, assuming starting \(N_e\) = 6543? Species s1 will remain constant, species s2 will expand 3X, species s3 will get 3X smaller.
Gene flow is programmed using the gene_flow()
function:
Multiple gene-flow events can be gathered in a list:
gene_flow()
checks admixture events for consistency
o <- population("o", time = 1, N = 100)
c <- population("c", time = 2500, N = 500, parent = o)
a <- population("a", time = 3000, N = 2000, parent = c)
b <- population("b", time = 3500, N = 4000, parent = a)
x1 <- population("x1", time = 3800, N = 8000, parent = c)
x2 <- population("x2", time = 4000, N = 10000, parent = x1)
gf <- gene_flow(from = b, to = x1, start = 5500, end = 6000, rate = 0.3)
model <- compile_model(
populations = list(a, b, x1, x2, c, o), gene_flow = gf,
generation_time = 1, simulation_length = 7000
)
ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8) %>%
ts_mutate(mutation_rate = 1e-8)
plot_model(model, proportions = TRUE)
ts_diversity()
Extract a list of lists of individuals’ names for each population:
ts_<...>()
function of slendr accepts a tree-sequence object as its first argument.
“CH” outgroup at time 7 Mya (\(N_e\) = 7k)
“AFR” splitting from “CH” at 6 Mya (\(N_e\) = 10k)
“ALT” splitting from “AFR” at 700 kya (\(N_e\) = 500)
“VI” splitting from “ALT” at 120 kya (\(N_e\) = 1k)
“EUR” splitting from “AFR” at 70 kya (\(N_e\) = 5k)
gene flow from “VI” to “EUR” at 3% over 50-40 kya
generation time 30 years
Then simulate 100Mb sequence with msprime()
, using recombination rate 1e-8 per bp per generation.
Use ts_diversity()
to compute diversity in each simulated population (CH
, AFR
, …).
Does the result match the expectation given demographic history?
What about divergence between all pairs of populations (ts_divergence()
)? Do your results recapitulate phylogenetic relationships? How about ts_f3()
or ts_fst()
?
# A tibble: 5 × 2
set diversity
<chr> <dbl>
1 ALT 0.0000200
2 VI 0.0000374
3 CH 0.000276
4 EUR 0.000383
5 AFR 0.000406
# A tibble: 10 × 3
x y divergence
<chr> <chr> <dbl>
1 ALT VI 0.000100
2 AFR EUR 0.000465
3 EUR VI 0.000847
4 ALT EUR 0.000849
5 AFR VI 0.000871
6 AFR ALT 0.000872
7 AFR CH 0.00427
8 CH VI 0.00427
9 ALT CH 0.00427
10 CH EUR 0.00427
By default, slendr records every individual living at the end of the simulation. Sampling can be also scheduled explicitly:
ts_f4()
Having simulated data from this model, can we detect gene flow from b into x1?
ts_f4()
Simulate tree sequence, add mutations on it:
x
individuals# extract information about samples from populations x1 and x2
x_inds <- ts_samples(ts) %>% filter(pop %in% c("x1", "x2"))
# create a vector of individual's names
# (used as input in ts_<...>() functions
x_names <- x_inds$name
# iterate over all sample names, compute
# f4(X, C; B, O)
f4_result<- map_dfr(
x_names,
function(w) ts_f4(ts, W = w, X = "c_1", Y = "b_1" , Z = "o_1")
)
# add the f4 value as a column to the sample information table
x_inds$f4 <- f4_result$f4
x
individualsggplot(x_inds) +
geom_rect(aes(xmin = 5500, xmax = 6000, ymin = -Inf, ymax = Inf), alpha = 0.5, fill = "gray") +
geom_jitter(aes(time, f4, color = pop)) +
geom_line(data = . %>% group_by(pop, time) %>% summarise(f4 = mean(f4)),
aes(time, f4, color = pop), size = 2) +
geom_hline(yintercept = 0, linetype = 2, size = 1) +
theme_minimal()
Take your model of Neanderthal introgression
Implement temporal sampling:
ts_simplify()
# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)
nodes <- ts_nodes(tree) %>%
as_tibble %>%
dplyr::select(node = phylo_id, pop)
p_tree <- ggtree(tree) %<+% nodes +
geom_tiplab(aes(color = pop, fill = pop)) +
guides(color = "none") +
scale_x_continuous(limits = c(-7e6, 900e3)) +
geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)
# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)
nodes <- ts_nodes(tree) %>%
as_tibble %>%
dplyr::select(node = phylo_id, pop)
p_tree <- ggtree(tree) %<+% nodes +
geom_tiplab(aes(color = pop, fill = pop)) +
guides(color = "none") +
scale_x_continuous(limits = c(-7e6, 900e3)) +
geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)
# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)
nodes <- ts_nodes(tree) %>%
as_tibble %>%
dplyr::select(node = phylo_id, pop)
p_tree <- ggtree(tree) %<+% nodes +
geom_tiplab(aes(color = pop, fill = pop)) +
guides(color = "none") +
scale_x_continuous(limits = c(-7e6, 900e3)) +
geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)
ts_<...>()
tree-sequence functions!You can extract data from a tree sequence in the form of a genotype table using the function ts_genotypes()
.
This function returns a simple R data frame:
Use the genotype table to compute your own ABBA/BABA statistic on the Neanderthal model data and check if you can recover the same signal as you get from ts_f4()
.