Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. [It] is already clear that computer methods are very powerful. This is good. It […] permits people with limited mathematical knowledge to work on important problems […]
All but the most trivial simulations require lots of code
90% of simulations are basically the same!
create populations (splits and \(N_e\) changes)
specify if/how they should mix (rates and times)
save output (VCF, EIGENSTRAT)
Lot of code duplication across projects
Let’s get started
We will need slendr & tidyverse
# load data analysis and plotting packageslibrary(dplyr) library(ggplot2)library(magrittr)# load slendr itselflibrary(slendr)init_env()
(ignore the message about missing SLiM)
slendr haiku
Build simple models,
simulate data from them.
Just one plain R script.
Typical steps (outline of this tutorial)
creating populations
scheduling population splits
programming \(N_e\) size changes
encoding gene-flow events
simulation sequence of a given size
computing statistics from simulated outputs
Creating a population()
Each needs a name, size and time of appearance (i.e., “split”):
pop1 <-population("pop1", N =1000, time =1)
This creates a normal R object. Typing it gives a summary:
pop1
slendr 'population' object
--------------------------
name: pop1
non-spatial population
stays until the end of the simulation
population history overview:
- time 1: created as an ancestral population (N = 1000)
Programming population splits
Splits are indicated by the parent = <pop> argument:
pop2 <-population("pop2", N =100, time =50, parent = pop1)
The split is reported in the “historical summary”:
pop2
slendr 'population' object
--------------------------
name: pop2
non-spatial population
stays until the end of the simulation
population history overview:
- time 50: split from pop1 (N = 100)
Scheduling resize events – resize()
Step size decrease:
pop1 <-population("pop1", N =1000, time =1)pop1_step <-resize(pop1, N =100, time =500, how ="step")
Exponential increase:
pop2 <-population("pop2", N =100, time =50, parent = pop1)pop2_exp <-resize(pop2, N =10000, time =500, end =2000, how ="exponential")
A more concise way to express the same thing as before.
Step size decrease:
pop1 <-population("pop1", N =1000, time =1) %>%resize(N =100, time =500, how ="step")
Exponential increase:
pop2 <-population("pop2", N =1000, time =1) %>%resize(N =10000, time =500, end =2000, how ="exponential")
A more complex model
pop1 <-population("pop1", N =1000, time =1)pop2 <-population("pop2", N =1000, time =300, parent = pop1) %>%resize(N =100, how ="step", time =1000)pop3 <-population("pop3", N =1000, time =400, parent = pop2) %>%resize(N =2500, how ="step", time =800)pop4 <-population("pop4", N =1500, time =500, parent = pop3) %>%resize(N =700, how ="exponential", time =1200, end =2000)pop5 <-population("pop5", N =100, time =600, parent = pop4) %>%resize(N =50, how ="step", time =900) %>%resize(N =250, how ="step", time =1200) %>%resize(N =1000, how ="exponential", time =1600, end =2200) %>%resize(N =400, how ="step", time =2400)
Remember that slendr objects internally carry their whole history!
pop5
slendr 'population' object
--------------------------
name: pop5
non-spatial population
stays until the end of the simulation
population history overview:
- time 600: split from pop4 (N = 100)
- time 900: resize from 100 to 50 individuals
- time 1200: resize from 50 to 250 individuals
- time 1600-2200: exponential resize from 250 to 1000 individuals
- time 2400: resize from 1000 to 400 individuals
Last step before simulation: compile_model()
model <-compile_model(list(pop1, pop2, pop3, pop4, pop5),generation_time =1,simulation_length =3000)
Compilation takes a list of model components, performs internal consistency checks, returns a single model object.
Model summary
Typing the compiled model into R prints a brief summary:
model
slendr 'model' object
---------------------
populations: pop1, pop2, pop3, pop4, pop5
geneflow events: [no geneflow]
generation time: 1
time direction: forward
total running length: 3000 model time units
model type: non-spatial
configuration files in: /private/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T/RtmpmRWkEV/file33b924deedc
Model visualization
plot_model(model)
Exercise #1
Exercise #1 — write your own model!
You can use this “template”:
library(slendr)init_env()chimp <-population(...)# <... rest of your code ...>model <-compile_model(populations =list(chimp, ...),generation_time =30)plot_model(model) # verify visually
Exercise #1 — solution
Simulating data (finally…)
We have a compiled model, how do we simulate data?
pop <-population("pop", time =1e6, N =10000)model <-compile_model(pop, generation_time =30, direction ="backward")ts <-msprime(model, sequence_length =1e6, recombination_rate =1e-8)
(simulates 2 \(\times\) 10000 chromosomes of 100 Mb)
You know from fossil evidence that the population had constant \(N_e\) for the past 100,000 generations, and that the \(N_e\) was somewhere between 1000 and 30000.
Use slendr to guess the true value of\(N_e\) given the observed AFS by running single-population simulations of different \(N_e\) and comparing ts_afs() results to afs_observed.
Exercise #2 – hints
Write an R function (in a new script) that gets \(N_e\) as input, creates a slendr population, compiles a model, simulates a tree sequence, runs ts_afs() on it, and returns the AFS.
Find the \(N_e\) value that will give the closest AFS to the observed one. For instance, you could:
a ) Plot simulated AFS for different \(N_e\) with the AFS and just eyeball \(N_e\) value that looks correct from a graph.
b ) Simulate AFS automatically in steps of possible \(N_e\) values and find the closest matching one.
Exercise #2 solution (a) – eye-balling
Exercise #2 solution (b) – grid
Gene flow / admixture
gene_flow() events
Gene flow is programmed the gene_flow() function.
If we have populations p1 and p2, we schedule gene flow with:
gf <-gene_flow(from = p1, to = p2, start =500, end =600, rate =0.13)
Multiple gene-flow events can be gathered in a list:
gf <-list(gene_flow(from = p1, to = p2, start =500, end =600, rate =0.13),gene_flow(from =<..>, to =<..>, start =<...>, end =<...>, rate =<...>),< potentially many more ... >)
# A = Africans, B = Europeans, C = outgroupts_f3(ts, A =list(afr = samples$AFR), B =list(eur = samples$EUR), C ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 afr eur CHIMP_1 0.0000308
# A = Africans, B = Neanderthals, C = outgroupts_f3(ts, A =list(afr = samples$AFR), B =list(nea = samples$NEA), C ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 afr nea CHIMP_1 0.000205
# A = Europeans, B = Neanderthals, C = outgroupts_f3(ts, A =list(eur = samples$EUR), B =list(nea = samples$NEA), C ="CHIMP_1")
# A tibble: 1 × 4
A B C f3
<chr> <chr> <chr> <dbl>
1 eur nea CHIMP_1 0.000248
\(f_4\)-statistic
f4(AFR_1, X; NEA, CH) is “almost zero” for X = AFR:
ts_f4(ts, W ="AFR_1", X ="AFR_2", Y ="NEA_1", Z ="CHIMP_1")
# A tibble: 1 × 5
W X Y Z f4
<chr> <chr> <chr> <chr> <dbl>
1 AFR_1 AFR_2 NEA_1 CHIMP_1 -0.00000125
f4(AFR, X; NEA, CH) is “signif. negative” for X = EUR:
ts_f4(ts, W ="AFR_1", X ="EUR_1", Y ="NEA_1", Z ="CHIMP_1")
# A tibble: 1 × 5
W X Y Z f4
<chr> <chr> <chr> <chr> <dbl>
1 AFR_1 EUR_1 NEA_1 CHIMP_1 -0.0000189
\(f_4\)-statistic for all samples
Time-series data
Sampling aDNA samples through time
Imagine we have pop1, pop2, … compiled in a model.
To record ancient individuals in the tree sequence, we can use schedule_sampling() like this:
schedule_sampling( model, # compiled slendr model objecttimes =c(100, 500), # at these times (can be also a single number) ...list(pop1, 42), # ... sample 42 individuals from pop1list(pop2, 10), # ... sample 10 individuals from pop2list(pop3, 1) # ... sample 1 individual from pop 3)
Sampling schedule format
The output of schedule_sampling() is a plain data frame:
schedule_sampling(model, times =c(40000, 30000, 20000, 10000), list(eur, 1))
# A tibble: 4 × 7
time pop n y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl>
1 10000 EUR 1 NA NA NA NA
2 20000 EUR 1 NA NA NA NA
3 30000 EUR 1 NA NA NA NA
4 40000 EUR 1 NA NA NA NA
We can bind multiple sampling schedules together, giving us finer control about sampling:
eur_samples <-schedule_sampling(model, times =c(40000, 30000, 20000, 10000, 0), list(eur, 1))afr_samples <-schedule_sampling(model, times =0, list(afr, 1))samples <-bind_rows(eur_samples, afr_samples)
How to use a sampling schedule?
To sample individuals based on a given schedule, we use the samples = argument of the msprime() functions: