Simulations in population genetics

Authors

Martin Petr, PopGen 2022

Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. [It] is already clear that computer methods are very powerful. This is good. It […] permits people with limited mathematical knowledge to work on important problems […]

James F. Crow – interview

Why simulate genomic data?

Exploring expected behavior of statistics
Fitting model parameters (i.e. ABC)
Ground truth for method development

Exploring expected behavior of statistics

Image from Peter (2016)

Fitting model parameters (i.e. ABC)

Image from Wikipedia on ABC

Ground truth for testing and method development

Image from Schiffels and Durbin (2014)

What does it mean to simulate a genome?

How would you design an algorithm for a popgen simulation?

What minimum components are needed for the program to be useful?

If we want to simulate population genetics

We need populations.

We need genetics.

A chromosome is…

…a linear sequence of nucleotides…

a list of characters (A/G/C/T nucleotides)
a list of 0 or 1 values (ancestral/derived allele)

… which accumulates mutations every generation at a given mutation rate.

A population is…

A collection of individuals at a given point in time.

Each individual carrying two homologous chromosomes inherited from two parents in the previous generation.

Chromosomes recombine at a certain recombination rate.

Home-brewed single-locus simulations in R

Let’s make the algorithm even more minimal by focusing on the evolution of a single mutation across time (i.e. no mutation process, no recombination).

“What is the expected trajectory of an allele under the influence of genetic drift?”

Some basic setup

If you want to try the following couple of examples yourself, you will need to paste the following lines into your R session (either R in the terminal, or better an RStudio session):

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(parallel)
library(MASS)


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

set.seed(42)

Single-locus simulation

N <- 500 # number of (haploid) individuals in the population
generations <- 500 # number of generations to run the simulation for
p_start <- 0.5 # initial allele frequency

# initialize an (empty) trajectory vector of allele frequencies
p_traj<- rep(NA, generations)
p_traj[1] <- p_start

# in each generation...
for (gen_i in 2:generations) {
  p <- p_traj[gen_i - 1] # get the current frequency

  # ... calculate the allele frequency in the next generation ...
  p_next <- rbinom(1, N, p) / N

  # ... and save it to the trajectory vector
  p_traj[gen_i] <- p_next
}

p_traj

\(N\) = 500, \(p_0 = 0.5\)

Code

plot(p_traj, type = "l", ylim = c(0, 1),
     xlab = "generations", ylab = "allele frequency")
abline(h = p_start, lty = 2, col = "red")

Let’s make it a function

Input: \(N\), \(p_0\) and the number of generations

Output: allele frequency trajectory vector

simulate <- function(N, p_start, generations) {
  # initialize an (empty) trajectory vector of allele frequencies
  p_traj<- rep(NA, generations)
  p_traj[1] <- p_start

  # in each generation...
  for (gen_i in 2:generations) {
    p <- p_traj[gen_i - 1] # get the current frequency
    # ... calculate the allele frequency in the next generation ...
    p_next <- rbinom(1, N, p) / N
    # ... and save it to the trajectory vector
    p_traj[gen_i] <- p_next
  }

  p_traj
}

\(N\) = 500, \(p_0 = 0.5\) (20 replicates)

Code

reps <- replicate(20, simulate(N = 500, p_start = 0.5, generations = 2000))

matplot(reps, ylim = c(0, 1), xlab = "generations", ylab = "allele frequency", type = "l", lty = 1)

\(N\) = 1000, \(p_0 = 0.5\) (20 replicates)

Code

reps <- replicate(20, simulate(N = 1000, p_start = 0.5, generations = 2000))

matplot(reps, ylim = c(0, 1), xlab = "generations", ylab = "allele frequency", type = "l", lty = 1)

\(N\) = 5000, \(p_0 = 0.5\) (20 replicates)

Code

reps <- replicate(20, simulate(N = 5000, p_start = 0.5, generations = 2000))

matplot(reps, ylim = c(0, 1), xlab = "generations", ylab = "allele frequency", type = "l", lty = 1)

\(N\) = 10000, \(p_0 = 0.5\) (20 replicates)

Code

reps <- replicate(20, simulate(N = 10000, p_start = 0.5, generations = 20000))

matplot(reps, ylim = c(0, 1), xlab = "generations", ylab = "allele frequency", type = "l", lty = 1)

\(N\) = 10000, \(p_0 = 0.5\) (20 replicates)

Code

factors <- fractions(c(3, 2, 1, 1/2, 1/5, 1/10))
matplot(reps, ylim = c(0, 1), xlab = "generations", ylab = "allele frequency", type = "l", lty = 1)
abline(v = 10000 * factors, lwd = 5)

Expected allele frequency distribution

Code

p_start <- 0.5
N <- 10000
factors <- fractions(c(3, 2, 1, 1/2, 1/5, 1/10))

final_frequencies <- parallel::mclapply(
  seq_along(factors), function(i) {
    f <- factors[i]
    t <- as.integer(N * f)
    # get complete trajectories as a matrix (each column a single trajectory)
    reps <- replicate(10000, simulate(N = N, p_start = p_start, generations = t))
    # only keep the last slice of the matrix with the final frequencies
    data.frame(
      t = sprintf("t = %s * N", f),
      freq = reps[t, ]
    )
  },
  mc.cores = parallel::detectCores()
) %>% do.call(rbind, .)
final_frequencies$t <- fct_rev(fct_relevel(final_frequencies$t, sprintf("t = %s * N", factors)))

final_frequencies %>% .[.$freq > 0 & .$freq < 1, ] %>%
ggplot() +
  geom_histogram(aes(freq, y = ..density.., fill = t), position = "identity", bins = 100, alpha = 0.75) +
  labs(x = "allele frequency") +
  coord_cartesian(ylim = c(0, 3)) +
  facet_grid(t ~ .) +
  guides(fill = guide_legend(sprintf("time since\nthe start\n[assuming\nN = %s]", N))) +
  theme_minimal() +
  theme(strip.text.y = element_blank(),
        axis.title.x = element_text(size = 15),
        axis.title.y = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        legend.title = element_text(size = 15),
        legend.text = element_text(size = 15))

*“Diffusion Models in Population Genetics”*, Kimura (1964)

Bonus exercise #1

If you’re bored later, try adding selection to this model!

In each generation, weight the frequencies based on fitness of each of the tree genotypes.

If you need a hint, take a look here.

But now for something completely different.

Let’s do “real” simulations!

There are many pieces of simulation software

The most famous and widely used are SLiM and msprime.

They are both extremely powerful…

… but they require a lot of programming knowledge and quite a lot of code to write non-trivial simulations (without bugs).

We will learn simulation concepts using slendr,
a convenient R interface to both SLiM and msprime.

SLiM

What is SLiM?

Forward-time simulator

It’s fully programmable!

Massive library of functions for:
- Demographic events
- Various mating systems
- Selection, quantitative traits, …

> 700 pages long manual!

SLiMgui – IDE for SLiM

Simple neutral simulation in SLiM

initialize() {
    // create a neutral mutation type
    initializeMutationType("m1", 0.5, "f", 0.0);
    // initialize 1Mb segment
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 999999);
    // set mutation rate and recombination rate of the segment
    initializeMutationRate(1e-8);
    initializeRecombinationRate(1e-8);
}

// create an ancestral population p1 of 10000 diploid individuals
1 early() { sim.addSubpop("p1", 10000); }

// in generation 1000, create two daughter populations p2 and p3
1000 early() {
    sim.addSubpopSplit("p2", 5000, p1);
    sim.addSubpopSplit("p3", 1000, p1);
}

// in generation 10000, stop the simulation and save 100 individuals
// from p2 and p3 to a VCF file
10000 late() {
    p2_subset = sample(p2.individuals, 100);
    p3_subset = sample(p3.individuals, 100);
    c(p2_subset, p3_subset).genomes.outputVCF("/tmp/slim_output.vcf.gz");
    sim.simulationFinished();
    catn("DONE!");
}

msprime

What is msprime?

A Python module for writing coalescent simulations

Extremely fast (genome-scale, population-scale data)

You must know Python fairly well to build complex models

Simple simulation using msprime

The following is basically the same model as the SLiM script earlier:

import msprime

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")

ts = msprime.sim_ancestry(
  sequence_length=10e6,
  recombination_rate=1e-8,
  samples={"A": 100, "B": 100},
  demography=demography
)

source: link

www.slendr.net

Why a new package? – spatial simulations!

Why a new package?

Most researchers are not expert programmers
All but the most trivial simulations require lots of code
90% of simulations are basically the same!
- create populations (splits and \(N_e\) changes)
- specify if/how they should mix (rates and times)
- save output (VCF, EIGENSTRAT)
Lot of code duplication across projects

slendr makes “standard” simulations trivial (for newbies and experts) and unlocks new kinds of spatial simulations

Let’s get started

We will use `slendr` & `tidyverse`

First run this:

library(tidyverse) # table processing/filtering and plotting

library(slendr) # simulation and tree-sequence analysis

The interface to all required Python modules has been activated.


Attaching package: 'slendr'

The following object is masked from 'package:MASS':

    area

(ignore the message about missing SLiM)

Then run this (and wait for the Python installation to finish!):

setup_env()

The entire lecture & exercises will be in R!

Workaround for an RStudio bug

RStudio sometimes interferes with Python setup that we need for simulation. To fix this, go to Tools -> Global Options in your RStudio and set the following options:

slendr workflow:

Build models from simple components

Then simulate data from them

All within a single R script

Typical steps

creating populations
scheduling population splits
programming \(N_e\) size changes
encoding gene-flow events
simulation sequence of a given size
computing statistics from simulated outputs

Creating a population

A name, size and the time of appearance must be given:

pop1 <- population("pop1", N = 1000, time = 1)

Typing an object prints out a summary in the R console:

pop1

slendr 'population' object 
-------------------------- 
name: pop1 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 1: created as an ancestral population (N = 1000)

Programming population splits

Splits are indicated by the parent = <pop> argument:

pop2 <- population("pop2", N = 100, time = 50, parent = pop1)

The split is reported in the “historical summary”:

pop2

slendr 'population' object 
-------------------------- 
name: pop2 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 50: split from pop1

Scheduling resize events – `resize()`

Step size decrease:

pop1 <- population("pop1", N = 1000, time = 1)
pop1_step <- resize(pop1, N = 100, time = 500, how = "step")

Exponential increase:

pop2 <- population("pop2", N = 100, time = 50, parent = pop1)
pop2_exp <- resize(pop2, N = 10000, time = 500, end = 2000, how = "exponential")

Tidyverse-style pipe interface

Step size decrease:

pop1 <-
  population("pop1", N = 1000, time = 1) %>%
  resize(N = 100, time = 500, how = "step")

Exponential increase:

pop2 <-
  population("pop2", N = 1000, time = 1) %>%
  resize(N = 10000, time = 500, end = 2000, how = "exponential")

This accomplishes the same thing as the code on the previous slide, but it is a bit more “elegant”.

More complex full model

pop1 <- population("pop1", N = 1000, time = 1)

pop2 <-
  population("pop2", N = 1000, time = 300, parent = pop1) %>%
  resize(N = 100, how = "step", time = 1000)

pop3 <-
  population("pop3", N = 1000, time = 400, parent = pop2) %>%
  resize(N = 2500, how = "step", time = 800)

pop4 <-
  population("pop4", N = 1500, time = 500, parent = pop3) %>%
  resize(N = 700, how = "exponential", time = 1200, end = 2000)

pop5 <-
  population("pop5", N = 100, time = 600, parent = pop4) %>%
  resize(N = 50, how = "step", time = 900) %>%
  resize(N = 250, how = "step", time = 1200) %>%
  resize(N = 1000, how = "exponential", time = 1600, end = 2200) %>%
  resize(N = 400, how = "step", time = 2400)

Remember: each object carries its history

pop5

slendr 'population' object 
-------------------------- 
name: pop5 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 600: split from pop4
  - time 900: resize from 100 to 50 individuals
  - time 1200: resize from 50 to 250 individuals
  - time 1600-2200: exponential resize from 250 to 1000 individuals
  - time 2400: resize from 1000 to 400 individuals

Last step before simulation: compilation

model <- compile_model(
  list(pop1, pop2, pop3, pop4, pop5),
  generation_time = 1,
  simulation_length = 3000
)

Compilation takes a list of model components, performs internal consistency checks, returns a single model object.

Model summary

Typing the compiled model prints a brief summary:

model

slendr 'model' object 
--------------------- 
populations: pop1, pop2, pop3, pop4, pop5 
geneflow events: [no geneflow]
generation time: 1 
time direction: forward 
total running length: 3000 model time units
model type: non-spatial

configuration files in: /private/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T/RtmpYMM0mb/file115aa33c7fbbf

The model is also saved to disk! (The location can be specified via path = argument to compile_model()):

 [1] "checksums.tsv"       "description.txt"     "direction.txt"      
 [4] "generation_time.txt" "length.txt"          "orig_length.txt"    
 [7] "populations.tsv"     "ranges.rds"          "resizes.tsv"        
[10] "script.py"           "script.slim"

Model visualization

plot_model(model)

Simulating data (finally…)

We have the compiled model, how do we simulate data?

slendr has two simulation engines already built-in:

SLiM engine
msprime engine

You don’t have to write any msprime/SLiM code!

Take a model object and use the built-in simulation engine:

ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8)

ts is a so-called “tree sequence”

The output of a slendr simulation is a tree sequence

What is a tree sequence?

A record of full genetic ancestry of a set of samples
An encoding of DNA sequence carried by those samples
An efficient analysis framework

from the tskit project

Why a tree sequence?

What we usually have

What we usually want

(As full as possible) a representation of our samples’ history:

And this is exactly what tree sequences give us.

How does it work?

This simulates 20 \(\times\) 10000 chromosomes of 100 Mb.

In less than 30 seconds.

That’s a crazy amount of data!

And it only requires 66.1 Mb of memory!

ts <-
  population("pop", time = 100000, N = 10000) %>%
  compile_model(generation_time = 1, direction = "backward") %>%
  msprime(sequence_length = 100e6, recombination_rate = 1e-8)

ts

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │     393696║
╟───────────────┼───────────╢
║Sequence Length│  100000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │      20000║
╟───────────────┼───────────╢
║Total Size     │   66.4 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table      │Rows   │Size     │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges      │1531556│ 46.7 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│  10000│273.5 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │      0│  8 Bytes│          No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations  │      0│ 16 Bytes│          No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes      │ 288450│  7.7 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│      1│222 Bytes│         Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│      1│  1.3 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Sites      │      0│ 16 Bytes│          No║
╚═══════════╧═══════╧═════════╧════════════╝

How does this work?!

Tree-sequence tables (tskit docs)

A tree (sequence) can be represented by tables of:

nodes (“chromosomes”)
edges (branches) between nodes
mutations on edges
individuals (who carry nodes)
populations

A set of such tables is a tree sequence.

Tree-sequence tables in practice

Registered S3 method overwritten by 'ggtree':
  method      from 
  identify.gg ggfun

ggtree v3.2.1  For help: https://yulab-smu.top/treedata-book/

If you use ggtree in published research, please cite the most appropriate paper(s):

1. Guangchuang Yu. Using ggtree to visualize data on tree-like structures. Current Protocols in Bioinformatics. 2020, 69:e96. doi:10.1002/cpbi.96
2. Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution. 2018, 35(12):3041-3043. doi:10.1093/molbev/msy194
3. Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017, 8(1):28-36. doi:10.1111/2041-210X.12628


Attaching package: 'ggtree'

The following object is masked from 'package:tidyr':

    expand

Code

ts_nodes(ts) %>%
  as_tibble() %>%
  dplyr::select(name, pop, node_id, time) %>% head(3)

# A tibble: 3 × 4
  name  pop   node_id  time
  <chr> <fct>   <int> <dbl>
1 pop_1 pop         0     0
2 pop_1 pop         1     0
3 pop_2 pop         2     0

Code

ts_edges(ts) %>%
  as_tibble() %>%
  dplyr::select(child_node_id, parent_node_id) %>% head(3)

# A tibble: 3 × 2
  child_node_id parent_node_id
          <int>          <int>
1             0          23016
2             0          44520
3             0          52098

Let’s get back to the model we defined earlier

Simulating a tree-sequence output

By default, msprime function automatically loads the tree-sequence that the simulation produced in the background:

ts <- msprime(
  model,
  sequence_length = 100e6,
  recombination_rate = 1e-8
)

If we type ts into an R console, we get…

… a tree-sequence content summary

ts

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │     142013║
╟───────────────┼───────────╢
║Sequence Length│  100000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │       9400║
╟───────────────┼───────────╢
║Total Size     │   24.8 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤══════╤═════════╤════════════╗
║Table      │Rows  │Size     │Has Metadata║
╠═══════════╪══════╪═════════╪════════════╣
║Edges      │571820│ 17.5 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Individuals│  4700│128.5 KiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Migrations │     0│  8 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Mutations  │     0│ 16 Bytes│          No║
╟───────────┼──────┼─────────┼────────────╢
║Nodes      │107149│  2.9 MiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Populations│     5│383 Bytes│         Yes║
╟───────────┼──────┼─────────┼────────────╢
║Provenances│     1│  3.9 KiB│          No║
╟───────────┼──────┼─────────┼────────────╢
║Sites      │     0│ 16 Bytes│          No║
╚═══════════╧══════╧═════════╧════════════╝

What can we do with this?

R interface to tskit

This R interface links to Python methods implemented in tskit.

Here is the magic

Tree sequences make it possible to directly compute many quantities of interest without going via conversion to a genotype table/VCF!

Ralph et al. (2020)

How do we use this in practice?

To analyze tree sequences, we need to refer to “samples”

Extracting sample information

Each “sampled” individual in slendr has a symbolic name, a sampling time, and a population assignment:

Extracting sample information

Each “sampled” individual in slendr has a symbolic name, a sampling time, and a population assignment:

ts_samples(ts)

# A tibble: 4,700 × 3
   name     time pop  
   <chr>   <dbl> <chr>
 1 pop1_1      0 pop1 
 2 pop1_2      0 pop1 
 3 pop1_3      0 pop1 
 4 pop1_4      0 pop1 
 5 pop1_5      0 pop1 
 6 pop1_6      0 pop1 
 7 pop1_7      0 pop1 
 8 pop1_8      0 pop1 
 9 pop1_9      0 pop1 
10 pop1_10     0 pop1 
# … with 4,690 more rows
# ℹ Use `print(n = ...)` to see more rows

ts_samples(ts) %>% count(pop)

# A tibble: 5 × 2
  pop       n
  <chr> <int>
1 pop1   1000
2 pop2    100
3 pop3   2500
4 pop4    700
5 pop5    400

Analyzing tree sequences with slendr

Let’s say we have the following model and we simulate a tree sequence from it.

pop <- population("pop", N = 10000, time = 1)
model <- compile_model(pop, generation_time = 1, simulation_length = 10000)

ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8)

slendr provides a large library of functions for computing population genetic statistics on tree sequences

Example: allele frequency spectrum

# sample 5 individuals
# (i.e. 10 chromosomes)
samples <-
  ts_samples(ts) %>%
  sample_n(5) %>%
  pull(name)

# compute allele frequency
# spectrum from the given set
# of individuals
afs1 <- ts_afs(
  ts, list(samples),
  mode = "branch",
  polarised = TRUE,
  span_normalise = TRUE
)

afs1

 [1] 39842.189 20153.717 13448.952 10158.250  8053.971  6727.498  5899.735
 [8]  5169.740  4297.832  3859.274

plot(afs1, type = "b",
     xlab = "allele count bin",
     ylab = "frequency")

But wait, we don’t have any mutations!

ts

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │     391815║
╟───────────────┼───────────╢
║Sequence Length│  100000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │      20000║
╟───────────────┼───────────╢
║Total Size     │   66.1 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table      │Rows   │Size     │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges      │1524626│ 46.5 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│  10000│273.5 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │      0│  8 Bytes│          No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations  │      0│ 16 Bytes│          No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes      │ 287141│  7.7 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│      1│222 Bytes│         Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│      1│  1.3 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Sites      │      0│ 16 Bytes│          No║
╚═══════════╧═══════╧═════════╧════════════╝

How can we compute statistics?

There is a duality between mutations and branch lengths in trees (more here).

But what if we want mutations?

Coalescent and mutation processes can be decoupled!

This means we can add mutations
after the simulation.

This allows efficient, massive simulations (especially with SLiM)

ts_mutated <- ts_mutate(
  ts,
  mutation_rate = 1e-8
)

Or, with a shortcut:

ts <- msprime(
  model,
  sequence_length = 100e6,
  recombination_rate = 1e-8
) %>%
  ts_mutate(
    mutation_rate = 1e-8
  )

╔═══════════════════════════╗
║TreeSequence               ║
╠═══════════════╤═══════════╣
║Trees          │     395164║
╟───────────────┼───────────╢
║Sequence Length│  100000000║
╟───────────────┼───────────╢
║Time Units     │generations║
╟───────────────┼───────────╢
║Sample Nodes   │      20000║
╟───────────────┼───────────╢
║Total Size     │   91.6 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table      │Rows   │Size     │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges      │1538292│ 46.9 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│  10000│273.5 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │      0│  8 Bytes│          No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations  │ 422840│ 14.9 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes      │ 288962│  7.7 MiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│      1│222 Bytes│         Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│      2│  2.1 KiB│          No║
╟───────────┼───────┼─────────┼────────────╢
║Sites      │ 421984│ 10.1 MiB│          No║
╚═══════════╧═══════╧═════════╧════════════╝

Example: allele frequency spectrum

# sample 5 individuals
# (i.e. 10 chromosomes)
samples <-
  ts_samples(ts) %>%
  sample_n(5) %>%
  pull(name)

# compute allele frequency
# spectrum from the given set
# of individuals
afs2 <- ts_afs(
  ts, list(samples),
  polarised = TRUE
)

afs2

 [1] 41144 20165 13360  9838  7696  6768  5874  5151  4584  3971

plot(afs2, type = "b",
     xlab = "allele count bin",
     ylab = "frequency")
lines(afs1, type = "b", col = "red")
legend("topright", legend = c("mutation-based AFS", "theoretical branch-based AFS"),
       fill = c("black", "red"))

What we have learned so far

creating populations – population()
compiling models – compile_model()
simulating tree sequences – msprime()
extracting samples – ts_samples()
computing AFS – ts_afs()

Let’s put this to use!

Exercise #1

Open your RStudio session:

On-site participants: http://ricco.popgen.dk:3838
Remote participants: http://cloud.popgen.dk:8787

If you feel the servers are slow, you can also use your own laptop (setup instructions are here).

Workaround for an RStudio bug

RStudio sometimes interferes with Python setup that we need for simulation. To fix this, go to Tools -> Global Options in your RStudio and set the following options:

Before you start

1. Type library(slendr) into the R console

slendr is pre-installed for the browser RStudio. If you use RStudio on your own computer (not in the browser), you can get it by install.packages("slendr").

2. Type setup_env()

When prompted to setup Python, answer “Yes”!

Exercise #1

Collaborator Hestu gave you AFS computed from 10 individuals of a sub-species of the bushy-tailed squirrel discovered in the Forest of Spirits in the land of Hyrule:

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244, 264, 218, 
133, 173, 159, 142, 167, 129, 125, 143)

Fossil evidence is consistent with constant size of the population over 100,000 generations of its history. An Oracle you met in the Temple of Time said that the true squirrel \(N_e\) has been between 1000 and 30000.

Use slendr to simulate history of this species. Use this to guess the likely value of squirrel’s \(N_e\) given the observed AFS.

Exercise #1 – hints

Write a function that gets \(N_e\) as input and returns the AFS.
Find the \(N_e\) value that will give the closest AFS to the one you got from Hestu. Use whatever method you’re comfortable with based on your programming experience:
- i ) Plot simulated AFS for different \(N_e\) with the AFS and just eye-ball \(N_e\) value that looks correct.
- ii ) Simulate AFS across a grid of \(N_e\) values and find the closest matching one (maybe use mean-squared error?)
- iii ) Run a mini-Approximate Bayesian Computation, using the Oracle’s range of [10000, 30000] as a uniform prior.

Exercise #1 – eye-balling solution

solution

Exercise #1 – simulations on a grid

solution

Exercise #1 – ABC inference

solution

Exercise #2

The squirrels have split into three species with different demographic histories (s1, s2, s3 – model on the right). Species s2 and s3 are daughter populations which split in generation 2 from the original species s1.

Help the Oracle predict the future shape of the AFS of the squirrels after 10000 generations, assuming starting \(N_e\) = 6543? Species s1 will remain constant, species s2 will expand 3X, species s3 will get 3X smaller.

Exercise #2 – solution

link

What else can slendr do?

Gene flow events

Gene flow is programmed using the gene_flow() function:

gf <- gene_flow(from = pop1, to = pop2, start = 500, end = 600, rate = 0.13)

Multiple gene-flow events can be gathered in a list:

gf <- list(
  gene_flow(from = pop1, to = pop2, start = 500, end = 600, rate = 0.13),
  gene_flow(from = pop2, to = pop1, start = 700, end = 750, rate = 0.1)
)

gene_flow() checks admixture events for consistency

Let’s build another toy model

Code

o <- population("o", time = 1, N = 100)
c <- population("c", time = 2500, N = 500, parent = o)
a <- population("a", time = 3000, N = 2000, parent = c)
b <- population("b", time = 3500, N = 4000, parent = a)
x1 <- population("x1", time = 3800, N = 8000, parent = c)
x2 <- population("x2", time = 4000, N = 10000, parent = x1)

gf <- gene_flow(from = b, to = x1, start = 5500, end = 6000, rate = 0.3)

model <- compile_model(
  populations = list(a, b, x1, x2, c, o), gene_flow = gf,
  generation_time = 1, simulation_length = 7000
)

ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)

plot_model(model, proportions = TRUE)

Diversity – `ts_diversity()`

Extract a list of lists of individuals’ names for each population:

samples <-
  ts_samples(ts) %>%
  split(., .$pop) %>%
  lapply(pull, "name")

Compute diversity in each population:

ts_diversity(ts, samples) %>%
  arrange(diversity)

# A tibble: 6 × 2
  set    diversity
  <chr>      <dbl>
1 o     0.00000368
2 c     0.0000197 
3 a     0.0000545 
4 b     0.0000686 
5 x2    0.0000728 
6 x1    0.0000765

Basically every ts_<...>() function of slendr accepts a tree-sequence object as its first argument.

Exercise #3

Program the following model in slendr

The interface to all required Python modules has been activated.

Parameter values

“CH” outgroup at time 7 Mya (\(N_e\) = 7k)
“AFR” splitting from “CH” at 6 Mya (\(N_e\) = 10k)
“ALT” splitting from “AFR” at 700 kya (\(N_e\) = 500)
“VI” splitting from “ALT” at 120 kya (\(N_e\) = 1k)
“EUR” splitting from “AFR” at 70 kya (\(N_e\) = 5k)
gene flow from “VI” to “EUR” at 3% over 50-40 kya
generation time 30 years

Then simulate 100Mb sequence with msprime(), using recombination rate 1e-8 per bp per generation.

Exercise #3 – solution

link

Exercise #4

Diversity and divergence

Use ts_diversity() to compute diversity in each simulated population (CH, AFR, …).

Does the result match the expectation given demographic history?

What about divergence between all pairs of populations (ts_divergence())? Do your results recapitulate phylogenetic relationships? How about ts_f3() or ts_fst()?

Exercise #4 – solution (diversity)

Code

ts <-
  msprime(model, sequence_length = 50e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)

sample_names <- ts_samples(ts) %>% split(., .$pop) %>% lapply(pull, "name")

ts_diversity(ts, sample_sets = sample_names) %>% arrange(diversity)

# A tibble: 5 × 2
  set   diversity
  <chr>     <dbl>
1 ALT   0.0000198
2 VI    0.0000359
3 CH    0.000280 
4 EUR   0.000380 
5 AFR   0.000401

Exercise #4 – solution (divergence)

Code

divergence <-
  ts_divergence(ts, sample_names) %>%
  arrange(divergence)

divergence

# A tibble: 10 × 3
   x     y     divergence
   <chr> <chr>      <dbl>
 1 ALT   VI     0.0000975
 2 AFR   EUR    0.000459 
 3 ALT   EUR    0.000848 
 4 EUR   VI     0.000848 
 5 AFR   ALT    0.000869 
 6 AFR   VI     0.000870 
 7 ALT   CH     0.00426  
 8 CH    VI     0.00426  
 9 CH    EUR    0.00426  
10 AFR   CH     0.00427

Code

divergence %>%
  mutate(pair = paste(x, "-", y),
         pair = fct_reorder(pair, divergence)) %>%
  ggplot(aes(pair, divergence)) +
  geom_point() +
  labs(x = "", y = "divergence") +
  ggtitle("Pairwise genetic divergence between populations") +
  theme_minimal()

Sampling “ancient DNA” time-series

By default, slendr records every individual living at the end of the simulation. Sampling can be also scheduled explicitly:

x_schedule <- schedule_sampling(
  model, times = seq(from = 7000, to = 4000, by = -100),
  list(x1, 5), list(x2, 5)
)

abco_schedule <- schedule_sampling(
  model, times = 7000,
  list(a, 1), list(b, 1), list(c, 1), list(o, 1)
)

schedule <- rbind(x_schedule, abco_schedule)

The schedule can be used in an msprime() call like this:

ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8,
              samples = schedule)

Computing \(f_4\) statistic – `ts_f4()`

Having simulated data from this model, can we detect gene flow from b into x1?

Computing \(f_4\) statistic – `ts_f4()`

Simulate tree sequence, add mutations on it:

ts <- msprime(model, samples = schedule,
              sequence_length = 100e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)

ts_f4(ts, "x1_1", "x2_1", "b_1" , "o_1")

# A tibble: 1 × 5
  W     X     Y     Z               f4
  <chr> <chr> <chr> <chr>        <dbl>
1 x1_1  x2_1  b_1   o_1   -0.000000105

ts_f4(ts, "x1_100", "x2_100", "b_1" , "o_1")

# A tibble: 1 × 5
  W      X      Y     Z             f4
  <chr>  <chr>  <chr> <chr>      <dbl>
1 x1_100 x2_100 b_1   o_1   0.00000104

Computing \(f_4\) statistic for all `x` individuals

# extract information about samples from populations x1 and x2
x_inds <- ts_samples(ts) %>% filter(pop %in% c("x1", "x2"))

# create a vector of individual's names
# (used as input in ts_<...>() functions
x_names <- x_inds$name

# iterate over all sample names, compute
# f4(X, C; B, O)
f4_result<- map_dfr(
  x_names,
  function(w) ts_f4(ts, W = w, X = "c_1", Y = "b_1" , Z = "o_1")
)

# add the f4 value as a column to the sample information table
x_inds$f4 <- f4_result$f4

Computing \(f_4\) statistic for all `x` individuals

Code

ggplot(x_inds) +
  geom_rect(aes(xmin = 5500, xmax = 6000, ymin = -Inf, ymax = Inf), alpha = 0.5, fill = "gray") +
  geom_jitter(aes(time, f4, color = pop)) +
  geom_line(data = . %>% group_by(pop, time) %>% summarise(f4 = mean(f4)),
            aes(time, f4, color = pop), size = 2) +
  geom_hline(yintercept = 0, linetype = 2, size = 1) +
  theme_minimal()

`summarise()` has grouped output by 'pop'. You can override using the `.groups`
argument.

Exercise #5

Neanderthal ancestry trajectory

Take your model of Neanderthal introgression
Implement temporal sampling:

one “European” every 1000 yrs between 40 kya and today
“Altai” (70 ky old) and “Vindija” individuals (40 ky old)

Compute \(f_4\)-ratio statistic using:

ts_f4ratio(ts, X = '<vector of "European" individuals>',
           A = '<"Altai">', C = '<"African">', O = '<"Chimp">')

Plot the estimated trajectory of Neanderthal ancestry in Europe over time.

Exercise #5 – solution

solution

The interface to all required Python modules has been activated.

Tree-sequence simplification

Sometimes a tree sequence is too big

The interface to all required Python modules has been activated.

Simplification reduces the genealogies only to those nodes that form ancestors of a desired set of samples.

Meet `ts_simplify()`

ts_small <- ts_simplify(
  ts,
  simplify_to = c(
    "CH_1",
    "AFR_1", "AFR_2", "AFR_3",
    "ALT_1", "VI_1",
    "EUR_1", "EUR_2", "EUR_3", "EUR_4", "EUR_5"
  )
)

Only the coalescent nodes of genealogies involving
these samples will be retained.

Extracting a tree #1734

tree <- ts_phylo(ts_small, 1734, quiet = TRUE)

Code

# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)

nodes <- ts_nodes(tree) %>%
  as_tibble %>%
  dplyr::select(node = phylo_id, pop)

p_tree <- ggtree(tree) %<+% nodes +
  geom_tiplab(aes(color = pop, fill = pop)) +
  guides(color = "none") +
  scale_x_continuous(limits = c(-7e6, 900e3)) +
  geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)

Extracting a tree #1

tree <- ts_phylo(ts_small, 1, quiet = TRUE)

Code

# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)

nodes <- ts_nodes(tree) %>%
  as_tibble %>%
  dplyr::select(node = phylo_id, pop)

p_tree <- ggtree(tree) %<+% nodes +
  geom_tiplab(aes(color = pop, fill = pop)) +
  guides(color = "none") +
  scale_x_continuous(limits = c(-7e6, 900e3)) +
  geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)

Introgression?

Extracting a tree #903

tree <- ts_phylo(ts_small, 903, quiet = TRUE)

Code

# this needs another R package
# simply using the ape package and running `plot(tree)` will also work
library(ggtree)

nodes <- ts_nodes(tree) %>%
  as_tibble %>%
  dplyr::select(node = phylo_id, pop)

p_tree <- ggtree(tree) %<+% nodes +
  geom_tiplab(aes(color = pop, fill = pop)) +
  guides(color = "none") +
  scale_x_continuous(limits = c(-7e6, 900e3)) +
  geom_vline(xintercept = -700e3, linetype = 2, color = "red")
revts(p_tree)

Introgression?

Bonus exercise #2

Try to implement something relevant for your own work

Program a demographic model for your species of interest
Choose a statistic whose value “true value” you know from the literature (divergence? diversity? f4-statistic? ancestry proportion?)
Try to replicate the statistic in a slendr simulation using some of its ts_<...>() tree-sequence functions!

Bonus exercise #3

Try to implement your own D-statistic

You can extract data from a tree sequence in the form of a genotype table using the function ts_genotypes().

This function returns a simple R data frame:

each column for an individual
each row for a site
each cell giving the number of derived alleles in that individual at that site

Use the genotype table to compute your own ABBA/BABA statistic on the Neanderthal model data and check if you can recover the same signal as you get from ts_f4().

Why simulate genomic data?

Exploring expected behavior of statistics

Fitting model parameters (i.e. ABC)

Ground truth for testing and method development

What does it mean to simulate a genome?

If we want to simulate population genetics

A chromosome is…

A population is…

Home-brewed single-locus simulations in R

“What is the expected trajectory of an allele under the influence of genetic drift?”

Some basic setup

Single-locus simulation

\(N\) = 500, \(p_0 = 0.5\)

Let’s make it a function

\(N\) = 500, \(p_0 = 0.5\) (20 replicates)

\(N\) = 1000, \(p_0 = 0.5\) (20 replicates)

\(N\) = 5000, \(p_0 = 0.5\) (20 replicates)

\(N\) = 10000, \(p_0 = 0.5\) (20 replicates)

\(N\) = 10000, \(p_0 = 0.5\) (20 replicates)

Expected allele frequency distribution

Bonus exercise #1

But now for something completely different.

Let’s do “real” simulations!

There are many pieces of simulation software

SLiM

What is SLiM?

SLiMgui – IDE for SLiM

Simple neutral simulation in SLiM

msprime

What is msprime?

What is msprime?

Simple simulation using msprime

www.slendr.net

Why a new package? – spatial simulations!

Why a new package?

Let’s get started

We will use slendr & tidyverse

The entire lecture & exercises will be in R!

Workaround for an RStudio bug

slendr workflow:

Build models from simple components

Then simulate data from them

All within a single R script

Typical steps

Creating a population

Programming population splits

Scheduling resize events – resize()

Tidyverse-style pipe interface

More complex full model

Remember: each object carries its history

Last step before simulation: compilation

Model summary

Model visualization

Simulating data (finally…)

The output of a slendr simulation is a tree sequence

What is a tree sequence?

Why a tree sequence?

What we usually have

What we usually want

How does it work?

How does this work?!

Tree-sequence tables (tskit docs)

A set of such tables is a tree sequence.

Tree-sequence tables in practice

Let’s get back to the model we defined earlier

Simulating a tree-sequence output

… a tree-sequence content summary

What can we do with this?

R interface to tskit

Here is the magic

How do we use this in practice?

To analyze tree sequences, we need to refer to “samples”

Extracting sample information

Extracting sample information

Analyzing tree sequences with slendr

Example: allele frequency spectrum

But wait, we don’t have any mutations!

How can we compute statistics?

But what if we want mutations?

Coalescent and mutation processes can be decoupled!

We will use `slendr` & `tidyverse`

Scheduling resize events – `resize()`

This means we can add mutations
after the simulation.

Diversity – `ts_diversity()`

Computing \(f_4\) statistic – `ts_f4()`

Computing \(f_4\) statistic – `ts_f4()`

Computing \(f_4\) statistic for all `x` individuals

Computing \(f_4\) statistic for all `x` individuals

Meet `ts_simplify()`