Open this link now so you have everything at hand later.
Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. [It] is already clear that computer methods are very powerful. This is good. It […] permits people with limited mathematical knowledge to work on important problems […].
How would you design an algorithm for a popgen simulation?
What minimum components are needed?
If we want to simulate population genetics
We need populations.
We need genetics.
The ‘genetics’ part…
…a linear sequence of nucleotides (a chromosome)
a list of characters (A/G/C/T nucleotides)
a list of 0 or 1 values (ancestral/derived allele),
which (maybe) mutates at a given mutation rate,
and (maybe) recombines at a certain recombination rate.
The ‘population’ part…
… a collection of individuals at a given point in time,
each carrying chromosomes inherited from its parents.
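As a toy illustration (plain base R, not slendr; the number of sites, number of individuals, and the mutation probability are arbitrary), these two components could be represented like this:
# 'genetics': a chromosome as a vector of 0/1 values (ancestral/derived allele)
chromosome <- rep(0, 100)   # 100 sites, all ancestral to begin with

# mutation: each site flips to the derived state with a small probability
mu <- 0.001
chromosome[runif(length(chromosome)) < mu] <- 1

# 'population': a collection of individuals, each carrying such chromosomes
individuals <- replicate(50, chromosome, simplify = FALSE)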
Single-locus Wright-Fisher simulation in R
“What is the expected trajectory of a neutral allele under the influence of genetic drift?”
From Fernando’s lecture on Monday…
A complete Wright-Fisher simulation
p_0 <- 0.5   # initial allele frequency ("50% marbles" from lecture #1)
N <- 500     # number of chromosomes in a population
T <- 2000    # number of generations to simulate

# a vector for storing frequencies over time
p_trajectory <- p_0

# in each generation:
for (gen_i in 2:T) {
  # get frequency in the previous generation
  p_prev <- p_trajectory[gen_i - 1]
  # calculate new frequency ("sampling marbles from a jar")
  p_trajectory[gen_i] <- rbinom(1, N, p_prev) / N
}

plot(p_trajectory)
\(N = 500\), \(p_0 = 0.5\)
plot(p_trajectory, type = "l", ylim = c(0, 1),
     xlab = "generations", ylab = "allele frequency")
abline(h = p_0, lty = 2, col = "red")
Let’s make it a function
Input: \(p_0\), \(N\), and the number of generations
Output: allele frequency trajectory vector
simulate <- function(p_0, N, T) {
  p_trajectory <- p_0
  for (gen_i in 2:T) {
    p_prev <- p_trajectory[gen_i - 1]
    p <- rbinom(1, N, p_prev) / N
    p_trajectory[gen_i] <- p
  }
  return(p_trajectory)
}
All but the most trivial simulations require lots of code
90% [citation needed] of simulations are basically the same!
create populations (splits and \(N_e\) changes)
specify if/how they should mix (rates and times)
Lot of code duplication across projects
slendr makes “standard” demographic simulations trivial and unlocks new kinds of spatial simulations
Let’s get started
Everything we do will be in R
Always start your R scripts with this (*):
library(slendr)
init_env()
My solutions will also use these two packages:
library(ggplot2)
library(dplyr)
Typical steps of a slendr R workflow
creating populations
scheduling population splits
programming \(N_e\) size changes
encoding gene-flow events
simulating a sequence of a given size
computing statistics from simulated outputs
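To preview how these steps fit together, here is a minimal end-to-end sketch (the population name, sizes, sequence length, and rates are placeholders; each step is explained in detail below):
library(slendr)
init_env()
library(dplyr)  # for the %>% pipe

# create a population and compile a minimal model
pop <- population("pop", N = 1000, time = 1)
model <- compile_model(list(pop), generation_time = 1,
                       simulation_length = 5000, direction = "forward")

# simulate a sequence of a given size and overlay mutations
ts <- msprime(model, sequence_length = 1e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)

# compute a statistic from the simulated output
ts_diversity(ts, sample_sets = ts_names(ts, split = "pop"))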
Creating a population()
Each needs a name, size and time of appearance:
pop1 <- population("pop1", N = 1000, time = 1)
This creates a normal R object. Typing it out gives a summary:
pop1
slendr 'population' object
--------------------------
name: pop1
non-spatial population
stays until the end of the simulation
population history overview:
- time 1: created as an ancestral population (N = 1000)
Programming population splits
Splits are indicated by the parent = <pop> argument:
pop2 <- population("pop2", N = 100, time = 50, parent = pop1)
The split is reported in the “historical summary”:
pop2
slendr 'population' object
--------------------------
name: pop2
non-spatial population
stays until the end of the simulation
population history overview:
- time 50: split from pop1 (N = 100)
Scheduling resize events – resize()
Step-wise size decrease:
pop1 <- population("pop1", N = 1000, time = 1)
pop1_step <- resize(pop1, N = 100, time = 500, how = "step")
Exponential increase:
pop2 <- population("pop2", N = 100, time = 50, parent = pop1)
pop2_exp <- resize(pop2, N = 10000, time = 500, end = 2000, how = "exponential")
A more concise way to express the same thing as before, using the %>% pipe operator:
Step-wise size decrease:
pop1 <- population("pop1", N = 1000, time = 1) %>%
  resize(N = 100, time = 500, how = "step")
Exponential increase:
pop2 <- population("pop2", N = 1000, time = 1) %>%
  resize(N = 10000, time = 500, end = 2000, how = "exponential")
A more complex model
pop1 <- population("pop1", N = 1000, time = 1)

pop2 <- population("pop2", N = 1000, time = 300, parent = pop1) %>%
  resize(N = 100, how = "step", time = 1000)

pop3 <- population("pop3", N = 1000, time = 400, parent = pop2) %>%
  resize(N = 2500, how = "step", time = 800)

pop4 <- population("pop4", N = 1500, time = 500, parent = pop3) %>%
  resize(N = 700, how = "exponential", time = 1200, end = 2000)

pop5 <- population("pop5", N = 100, time = 600, parent = pop4) %>%
  resize(N = 50, how = "step", time = 900) %>%
  resize(N = 1000, how = "exponential", time = 1600, end = 2200)
Each object carries its history!
pop5
slendr 'population' object
--------------------------
name: pop5
non-spatial population
stays until the end of the simulation
population history overview:
- time 600: split from pop4 (N = 100)
- time 900: resize from 100 to 50 individuals
- time 1600-2200: exponential resize from 50 to 1000 individuals
Gene flow / admixture
We can schedule gene_flow() from pop1 into pop2 with:
gf <- gene_flow(from = pop1, to = pop2, start = 2000, end = 2200, rate = 0.13)
Here rate = 0.13 means that 13% of the migrants over the given time window will come from “pop1” into “pop2”.
Multiple gene-flow events can be gathered in a list:
gf <- list(
  gene_flow(from = pop1, to = pop2, start = 500, end = 600, rate = 0.13),
  gene_flow(from = ..., to = ..., start = ..., end = ..., rate = ...),
  ...
)
Last step before simulation: compile_model()
model <- compile_model(
  list(pop1, pop2, pop3, pop4, pop5),
  generation_time = 1,
  simulation_length = 3000,
  direction = "forward"
)
Compilation takes a list of model components, performs internal consistency checks, and returns a single model object.
Last step before simulation: compile_model()
model <- compile_model(
  list(pop1, pop2, pop3, pop4, pop5),
  gene_flow = gf,          # <----- in case our model includes gene flow
  generation_time = 1,
  simulation_length = 3000,
  direction = "forward"
)
Gene flow(s) that we programmed are included via the gene_flow argument.
Model summary
Typing the compiled model into R prints a brief summary:
model
slendr 'model' object
---------------------
populations: pop1, pop2, pop3, pop4, pop5
geneflow events: 1
generation time: 1
time direction: forward
total running length: 3000 model time units
model type: non-spatial
configuration files in: /private/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T/RtmpyYatVY/fileaef453c85ae
plot(afs[-1], type = "b",
     xlab = "allele count bin",
     ylab = "frequency")
Note: We drop the first element (afs[-1]) for technical reasons related to tskit. You don’t have to worry about that here, but you can read this for more detail.
Exercise #2
Part a: One-population AFS simulator
In a new script model2.R write a function called simulate_afs(), which will take Ne as its only parameter.
It should create a one-population forward-time model (simulation_length 100000, generation_time 1), simulate a 10Mb tree sequence (recombination and mutation rates of 1e-8), compute the AFS for 10 samples, and return it.
Use this function to compute AFS vectors for various Ne values. Plot those AFS and observe how (and why?) they differ depending on the Ne you simulated.
Part a: Hint
You can start building model2.R from this “template”:
library(slendr); init_env()

simulate_afs <- function(Ne) {
  ... your one-population model code: population(), compile_model(), msprime() ...
  result <- ... compute AFS using ts_afs() on 10 samples, save it to `result` ...
  return(result)
}

afs_1 <- simulate_afs(Ne = 1000)
plot(afs_1, type = "o")
When used in R, your function should work like this:
You know that the population had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by the simulate_afs() function).
Guess the true \(N_e\) given the observed AFS by running single-population simulations for a range of \(N_e\) values and comparing each run to afs_observed.
Part b: Hints
Using your custom simulate_afs() function, find the value of Ne that will give the closest AFS to the observed AFS:
option #1 [easy]: Plot AFS vectors for various \(N_e\) values, then eyeball which looks closest to the observed AFS
option #2 [hard]: Simulate AFS vectors across a grid of possible Ne values (maybe lapply()?), then find the closest AFS programmatically (see the sketch below)
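A minimal sketch of option #2, assuming simulate_afs() from Part a and an already-loaded observed AFS vector afs_observed (the Ne grid and the sum-of-squared-differences distance are just illustrative choices, not the official solution):
# candidate Ne values to try
Ne_grid <- seq(1000, 30000, by = 1000)

# simulate one AFS vector per candidate Ne
afs_grid <- lapply(Ne_grid, simulate_afs)

# distance of each simulated AFS from the observed AFS
errors <- sapply(afs_grid, function(afs) sum((afs - afs_observed)^2))

# the candidate Ne giving the closest match
Ne_grid[which.min(errors)]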
Exercise #2: solution
Exercise #3
Exercise #3: more statistics! (a)
Use msprime() to simulate a 50Mb tree sequence ts from your introgression model in model1.R (if that takes more than two minutes, try just 10Mb).
(Remember to add mutations with ts_mutate().)
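For example, something along these lines (the model object comes from your model1.R; rates follow the exercise description):
ts <- msprime(model, sequence_length = 50e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)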
Exercise #3: more statistics! (b)
In model1.R compute (some of) these on your ts object:
the outgroup \(f_3\) statistic ts_f3(A; B, C), using CHIMP as the outgroup (A!), for different pairs of “B” and “C” populations
using Ben’s explanation on Wednesday, try to compute this \(f_3\) using a combination of \(f_2\) statistics (ts_f2(A, B))
You can find help by typing ?ts_diversity etc. into R!
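A sketch of what this could look like, assuming your model contains populations named CHIMP, AFR, and EUR (adjust to your own model1.R); samples is the named list described in the hint at the end of this exercise:
samples <- ts_names(ts, split = "pop")

# outgroup f3 with CHIMP as the outgroup A, for one pair of B and C populations
ts_f3(ts, A = samples$CHIMP, B = samples$AFR, C = samples$EUR)

# the f2 building blocks of the same statistic,
# via the identity f3(A; B, C) = (f2(A, B) + f2(A, C) - f2(B, C)) / 2
ts_f2(ts, A = samples$CHIMP, B = samples$AFR)
ts_f2(ts, A = samples$CHIMP, B = samples$EUR)
ts_f2(ts, A = samples$AFR, B = samples$EUR)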
Exercise #3: more statistics! (c)
Compute an \(f_4\) test of Neanderthal introgression in EUR:
Hint: check the values of these two statistics (ts_f4()):
\(f_4\)(<afr>, <eur>; <neand>, <chimp>)
\(f_4\)(<afr1>, <afr2>; <neand>, <chimp>)
Is one “much more negative” than the other as expected assuming introgression?
You’ve learned about symmetries in \(f_4\) depending on the arrangement of the “quartet”. How many unique \(f_4\) values involving a single quartet can you find and why? (When computing ts_f4() here, set mode = "branch").
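A sketch of the two checks (population names are placeholders for whatever your model1.R uses; samples comes from ts_names(ts, split = "pop")):
# f4(AFR, EUR; NEA, CHIMP) -- expected to be clearly negative if Neanderthals introgressed into EUR
ts_f4(ts, W = samples$AFR, X = samples$EUR, Y = samples$NEA, Z = samples$CHIMP, mode = "branch")

# f4(AFR1, AFR2; NEA, CHIMP) -- expected to be close to zero
ts_f4(ts, W = samples$AFR[1], X = samples$AFR[2], Y = samples$NEA, Z = samples$CHIMP, mode = "branch")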
Exercise #3: hint
For multipopulation statistics, you need a (named) list of samples recorded in each population.
ts_names() has a split = "pop" option just for this:
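For example (a sketch; ts is whatever tree sequence you simulated):
samples <- ts_names(ts, split = "pop")
samples   # a named list with one element of individual names per population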
Imagine we have pop1, pop2, … compiled in a model.
To record ancient individuals in the tree sequence, we can use schedule_sampling() like this:
schedule_sampling(
  model,                # compiled slendr model object
  times = c(100, 500),  # at these times (can be also a single number) ...
  list(pop1, 42),       # ... sample 42 individuals from pop1
  list(pop2, 10),       # ... sample 10 individuals from pop2
  list(pop3, 1)         # ... sample 1 individual from pop3
)
Sampling schedule format
The output of schedule_sampling() is a plain data frame:
schedule_sampling(model, times = c(40000, 30000, 20000, 10000), list(eur, 1))
# A tibble: 4 × 7
time pop n y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl>
1 10000 EUR 1 NA NA NA NA
2 20000 EUR 1 NA NA NA NA
3 30000 EUR 1 NA NA NA NA
4 40000 EUR 1 NA NA NA NA
We can bind multiple sampling schedules together, giving us finer control over sampling:
eur_samples <- schedule_sampling(model, times = c(40000, 30000, 20000, 10000, 0), list(eur, 1))
afr_samples <- schedule_sampling(model, times = 0, list(afr, 1))

samples <- bind_rows(eur_samples, afr_samples)
How to use a sampling schedule?
To sample individuals based on a given schedule, we use the samples = argument of the msprime() function:
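For example (sequence length and recombination rate here are placeholders):
ts <- msprime(model, samples = samples,
              sequence_length = 1e6, recombination_rate = 1e-8)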