A workshop on data analysis, visualization, and reproducibility for population genetics in R


13-17 October 2025, Tartu, Estonia

Why does this workshop even exist?


Why am I stressing “programming”, “automation”, and “fundamentals” so much in selling this workshop?


What format does the workshop have and
how can we get the most out of it?

Every field of science today is computational


Some less, some much more
(such as genomics)

It was possible to do cutting-edge science without much programming 10-15 years ago.


“In the good old days” you could easily be at the top of your field with nothing but Microsoft Excel. Not now.

Education has been catching up, but not as quickly as it could (and should).

The ECHO team said something very profound during early discussions

“When a new student starts their Master’s/PhD degree, they’re usually thrown into a pond because their supervisor doesn’t guide them through the basic computational competencies that they should know.”

Maybe they don’t have time.
Maybe it’s expected students will pick things up along the way.
Maybe they don’t realize that those skills are important…

… because getting results is the primary focus.

This isn’t just a university education problem.

We are in a “reproducibility crisis”


This has many causes


“Publish or perish”.

Incentives rewarding “fast” over “careful”.

Science grew faster than modern, user-friendly methods could be developed.

It’s still very rare for scientists to publish their code!

Why even do science?

After all, it should be about searching for the truth.


We should strive to do better.

Especially as junior researchers, all of us.


Sharing reproducible research code is a crucial part of this.

Counterpoint


Is perfectly reproducible science automatically reliable and more trustworthy?

Automatically reliable?


Definitely not.

Research can be incorrect, and still be perfectly reproducible in beautiful clean code.

More trustworthy?


Most definitely.

Having access to fully reproducible scripts and pipelines is the only way to discover and fix errors, and build towards better research.

The alternative of not publishing reproducible code and workflows is the equivalent of not publishing proper laboratory protocols.


Materials and Methods:
“We ran a couple of PCRs on a few drops of our DNA extracts.”

Reproducibility has multiple levels

Often it’s you yourself who must reproduce your project!
(And it often happens under somewhat stressful circumstances.)


Every field of science today is computational


Which is why trustworthy science requires building a solid basis (and confidence!) in computer programming and learning the best available techniques for computational research.


This workshop hopes to provide the opportunity to do exactly that.

(Focusing on reproducible R workflows as a general example.)

Wow, this guy sure likes to talk a lot.


Why should I listen to him?

Who am I? — a brief introduction

  • I really like programming computers
    • I’ve been privileged to do it since I was a small kid
    • (Yes, in school I was one of those weird, unpopular nerds)
  • I did a BSc in Computer Science after high school
  • I decided I didn’t want to work in an IT company
  • I loved physiology/immunology topics in the House TV series
  • To the horror of my mom, I did another BSc (Biochemistry)


  • Then I did an MSc in Cell Biology, but really sucked at lab work
  • … and I started missing programming again 🤦‍♀️
  • I applied for a PhD at the Max Planck Institute in Leipzig to study aDNA of Neanderthals, Denisovans, and ancient humans
  • Years of bioinformatics, data science, and statistics made me aware of the barriers to reliable computational biology
  • As an outsider (a computer scientist) and an insider (a biologist), I have a unique insight into both worlds

Writing R has been my job for 10 years


  1. admixr (bodkan.net/admixr) — \(f\)-statistics R package
  2. slendr (bodkan.net/slendr) — popgen simulations R package
  3. demografr (bodkan.net/demografr) — R package for simulation-based demographic inference (like ABC)


I design all my software (obsessively) to allow easy and reproducible research for everyone.

The challenge of teaching a course on


“The best of modern R for
reproducible data science”

is that our expectations and prior experiences are totally different.

#1 teaching rule: “know the audience”


To satisfy everyone, we would need…


This is hardly possible!


So how did we deal with this?


(i.e., enough of philosophizing…)

We made a workshop, not a lecture series

  • Lectures would be guaranteed to be boring for some of you
    • (regardless of that arbitrarily set “expected competence”)
  • Programming is a skill, not knowledge to be transferred!
  • We want to give as much time to “the practical” as possible:
    1. focus exclusively on “hands-on exercises”
    2. give you the chance to work on your own projects (hopefully integrating bits and pieces of the exercises)
  • Give you a safe space to do 1. and/or 2. at your own pace

This workshop is also a (work)book

www.bodkan.net/simgen

What new Master’s / PhD biology students need to know to hit the ground running. A future two-week course at UCPH.

Outline of this workshop


The main “exercise” should always be to ask yourself:
“How can I apply (some of) this material
to make (some of) my work better?”

Target audiences


Tips on getting the most out of this workshop depending on your level.

If you already know (a lot of) coding

  • Although many of the exercises might be easy for you, I suggest you go through all of them anyway. I wrote all code from scratch and still learned things I didn’t know about.
  • Every time you solve an exercise with a function you already knew, spend some minutes on its help page. Skim through arguments you didn’t know, run its (obscure) examples.
  • Each session has a “Bonus exercises and resources” heading with links to much more advanced topics that all expert R programmers should know. Use the remaining time for self-study.
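
The advice above about revisiting the help page of a function you already know could look something like the following minimal sketch. It uses base R's `paste()` purely as an illustration; the `samples` vector is made up for this example and is not part of the workshop materials.

```r
# Look at the full argument list of a function you think you already know
args(paste)
arg_names <- names(formals(paste))
arg_names

# An argument that is easy to overlook: `collapse` joins all elements
# of the result into a single string
samples <- c("ind1", "ind2", "ind3")  # hypothetical sample names
joined <- paste(samples, collapse = ",")
joined

# In an interactive session, open the help page and run its examples
if (interactive()) {
  help("paste")
  example(paste)
}
```

The same three steps (`args()`, the help page, `example()`) work for any documented R function, so they can be repeated after each exercise.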

If you’re just starting a coding journey

  • Many exercises might feel overwhelming
  • This isn’t a lack of ability — programming is about building mental models of a problem and solving it in code

    • This is a skill like any other (languages, knitting, all crafts)
    • As such, it takes practice
  • The exercises are designed to guide you through this process
  • All materials are both exercises and real-world recipes for you to study later and directly apply to your own work

A crucial note, regardless of your level…

99% of exercises come with solutions!

As you go through the materials, you’ll see this all the time:

This is the only way to make the workshop useful for all levels of experience (I learned this the hard way).

But please, let’s make a pact.

Do not look at my solutions unless…

  1. You’re utterly lost
  2. You asked me (or a colleague) for a hint and are still unsure
  3. Or you’re done with your solution and want to check the result

Learning to solve computing problems is supposed to be hard and (sometimes) frustrating. This is the only way to learn it.

When you do look, don’t blindly copy it into R…

  1. Read each line of my solution and figure out what it does
  2. Re-type it into your R session (ideally by hand)
  3. Then run it line by line, inspecting intermediate results
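
As a rough illustration of what "line by line, inspecting intermediate results" can mean in practice, here is a minimal base R sketch. The data frame and its column names (`sample`, `pop`, `coverage`) are invented for this example and do not come from the workshop exercises.

```r
# Hypothetical toy data (made up for illustration)
df <- data.frame(
  sample   = c("ind1", "ind2", "ind3", "ind4"),
  pop      = c("A", "A", "B", "B"),
  coverage = c(2.5, 0.4, 3.1, 1.8)
)

# A "solution" written as one dense expression is hard to follow at a glance
result <- tapply(df[df$coverage > 1, ]$coverage, df[df$coverage > 1, ]$pop, mean)

# The same computation, one step at a time, printing each intermediate result
keep <- df$coverage > 1    # step 1: which rows pass the coverage filter?
keep

filtered <- df[keep, ]     # step 2: keep only those rows
filtered

stepwise <- tapply(filtered$coverage, filtered$pop, mean)  # step 3: mean per population
stepwise
```

Printing `keep` and `filtered` along the way is the point: each intermediate object shows what a single line of the solution actually does before the next line builds on it.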

Exercises are here to help you learn to build “mental models” of programming problems. This is the only way to learn it.


I came up with each question and had to code up the answer myself. It was hard for me too! 🙂

We would like the workshop to be

interactive
and
collaborative

teacher
️⬇️
student

teacher-student
️⬆️
⬇️
student-teacher

We are all here to learn


Share your ideas!

Discuss how the things you’ve learned apply (or not!) to your own data and your own projects.

Let’s talk about how you could use the content of the workshop to improve your projects in the future.

Our time schedule


  • 10:00-12:00 Morning session
  • 12:00-14:00 Lunch break
  • 14:00-16:00 or 17:00 Afternoon session (depending on the day)

It’s impossible to map each “topic” to a specific time slot, because that would require knowing how much time each exercise (and each session of exercises) will take for all of you.

We’ll play it by ear.

Resources for self study (all free!)

Absolute must:

https://r4ds.hadley.nz

Absolute overkill:

https://adv-r.hadley.nz/