Forum:Mission Impossible: you have 1 minute to analyze the Ebola Genome
4
27
Entering edit mode
7.9 years ago

I teach an introductory bioinformatics course. For yesterday's lecture I wanted to demonstrate to students just how much you can get done by properly combining all these awesome tools with the unix command line.

And that got me thinking ... so how much can you get done in a day ... how about an hour? ... then .... well, how about a minute ... a minute you say??? ... yeah right that's just crazy talk, sounds like ... mission impossible. Or is it really?

So I googled the Mission Impossible theme song, I found a version that is about 1 minute long and I came up with a challenge with the following rules:

1. You may use any tool or background information that can be reasonably expected to be on a bioinformaticians' computer.
2. You have to start with an empty folder
3. Start the music and your script. Your script needs to finish before the theme song.
4. At the end of the run your folder needs to contain a piece of information that on its own is noteworthy and publication quality information (say an essential part of a prior publication)

All right then - and here is my entry. It produces all major single nucleotide polymorphisms of the 2014 Ebola genome as published in Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Science 2014. It requires parallel, efetch, bwa, samtools and freebayes.

The script follows. Let me tell you running it while the theme song is on makes it surprisingly exciting!

It even wastes 16 seconds for dramatic flair. Still finishes in time on a MacBook Air while running a presentation.

ebola mission-impossible • 4.7k views
6
Entering edit mode

In one minute Chuck Norris sequenced his genome. And found zero mutations.

4
Entering edit mode

That would mean that Chuck Norris and Craig Venter are the same person? (Might explain a few things...)

4
Entering edit mode

Took you longer than a minute to write the script though..

4
Entering edit mode

yeah but Tom Cruise also has time to prepare for each mission - that's still fair comparison

1
Entering edit mode

What's really impressive about this is the power behind such basic and standard tools.

btw, I found a bug. the last few lines should read:

echo "*** WARNING! The data will self destruct in one minute! ***"
echo
sleep 60
rm -r ~/edu/mission

1
Entering edit mode

ha, looks like I was caught bluffing - funny!

3
Entering edit mode
7.9 years ago
Pablo ★ 1.9k

Nice post!

It looks like you have plenty of time left to analyze what the variants mean:

java -jar snpEff.jar -v ebola_zaire ~/edu/mission/results.vcf > ~/edu/mission/results.eff.vcf


Note that the reference they used is "KJ660346" instead of KM034562 (at least for the annotations part).

0
Entering edit mode

that's pretty cool! just a few extra seconds - I've used the reference that the main paper used but one can just swap that out in the first lines.

0
Entering edit mode

It's weird that they mention KM034562 in the main paper, they asked me to provide "KJ660346" for their analysis...

1
Entering edit mode

Strange indeed, the VCF file (file S1 here http://www.sciencemag.org/content/345/6202/1369/suppl/DC1) is computed relative to KM034562

the difference between the two genome builds is:

18957M2S and MD:Z:1848C4433T5527T3787G3358A

2
Entering edit mode
7.1 years ago

You should try using HISAT. It can read directly from SRA without pesky bloated fastq. Might be fast enough to leave time for analyzing your output too.

0
Entering edit mode

interesting point - I will explore that as I plan to rework this example.

0
Entering edit mode
7.1 years ago

I think this is brilliant, but I feel that there is an issue with it: you don't verify that it actually worked, which I see as an essential part of these scripts (also, throwing away warnings and errors is not really something one should promote)

0
Entering edit mode

the reason to throw away the outputs was cosmetic - there is a message bloat going on that kind of messes up the cute messages I prepared.

I blame the tool developers - why can't their tools be silenced and instructed to only write to the output when the tool has actually something meaningful and unexpected to say?

In fact error messages would be lost in the chaos as tools run in parallel and had I not silenced them all there would loads of useless information printed on each alignment and data fetching that takes place in parallel that - hence it would be pointless to keep them.

0
Entering edit mode
7.1 years ago
DCGenomics ▴ 320

Might be worth noting that HISAT now works directly with SRA (i.e. if you had RNAseq, you might be able to go even faster!).