Question

Forum:Mission Impossible: you have 1 minute to analyze the Ebola Genome

28

Entering edit mode

9.7 years ago

Istvan Albert 101k

I teach an introductory bioinformatics course. For yesterday's lecture I wanted to demonstrate to students just how much you can get done by properly combining all these awesome tools with the unix command line.

And that got me thinking ... so how much can you get done in a day ... how about an hour? ... then .... well, how about a minute ... a minute you say??? ... yeah right that's just crazy talk, sounds like ... mission impossible. Or is it really?

So I googled the Mission Impossible theme song, I found a version that is about 1 minute long and I came up with a challenge with the following rules:

You may use any tool or background information that can be reasonably expected to be on a bioinformaticians' computer.
You have to start with an empty folder
Start the music and your script. Your script needs to finish before the theme song.
At the end of the run your folder needs to contain a piece of information that on its own is noteworthy and publication quality information (say an essential part of a prior publication)

All right then - and here is my entry. It produces all major single nucleotide polymorphisms of the 2014 Ebola genome as published in Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak Science 2014. It requires parallel, efetch, bwa, samtools and freebayes.

The script follows. Let me tell you running it while the theme song is on makes it surprisingly exciting!

It even wastes 16 seconds for dramatic flair. Still finishes in time on a MacBook Air while running a presentation.

ebola mission-impossible • 6.2k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.7 years ago by Istvan Albert 101k

7

Entering edit mode

In one minute Chuck Norris sequenced his genome. And found zero mutations.

ADD REPLY • link 9.7 years ago by dariober 14k

4

Entering edit mode

That would mean that Chuck Norris and Craig Venter are the same person? (Might explain a few things...)

ADD REPLY • link 9.7 years ago by Adamc ▴ 680

4

Entering edit mode

Took you longer than a minute to write the script though..

ADD REPLY • link 9.7 years ago by 5heikki 11k

4

Entering edit mode

yeah but Tom Cruise also has time to prepare for each mission - that's still fair comparison

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Istvan Albert 101k

1

Entering edit mode

What's really impressive about this is the power behind such basic and standard tools.

btw, I found a bug. the last few lines should read:

echo "*** WARNING! The data will self destruct in one minute! ***"
echo
sleep 60
rm -r ~/edu/mission

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Katie D'Aco ★ 1.1k

1

Entering edit mode

ha, looks like I was caught bluffing - funny!

ADD REPLY • link 9.7 years ago by Istvan Albert 101k

Ram · Answer 1 · 2014-11-12

3

Entering edit mode

9.7 years ago

Pablo ★ 1.9k

Nice post!

It looks like you have plenty of time left to analyze what the variants mean:

java -jar snpEff.jar -v ebola_zaire ~/edu/mission/results.vcf > ~/edu/mission/results.eff.vcf

Note that the reference they used is "KJ660346" instead of KM034562 (at least for the annotations part).

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Pablo ★ 1.9k

0

Entering edit mode

that's pretty cool! just a few extra seconds - I've used the reference that the main paper used but one can just swap that out in the first lines.

ADD REPLY • link 9.7 years ago by Istvan Albert 101k

0

Entering edit mode

It's weird that they mention KM034562 in the main paper, they asked me to provide "KJ660346" for their analysis...

ADD REPLY • link 9.7 years ago by Pablo ★ 1.9k

1

Entering edit mode

Strange indeed, the VCF file (file S1 here http://www.sciencemag.org/content/345/6202/1369/suppl/DC1) is computed relative to KM034562

the difference between the two genome builds is:

18957M2S and MD:Z:1848C4433T5527T3787G3358A

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Istvan Albert 101k

Ram · Answer 2 · 2015-09-04

2

Entering edit mode

8.9 years ago

osullivanchristopher ▴ 210

You should try using HISAT. It can read directly from SRA without pesky bloated fastq. Might be fast enough to leave time for analyzing your output too.

ADD COMMENT • link updated 22 months ago by Ram 44k • written 8.9 years ago by osullivanchristopher ▴ 210

0

Entering edit mode

interesting point - I will explore that as I plan to rework this example.

ADD REPLY • link 8.9 years ago by Istvan Albert 101k

Ram · Answer 3 · 2015-09-04

0

Entering edit mode

8.9 years ago

naim.matasci • 0

I think this is brilliant, but I feel that there is an issue with it: you don't verify that it actually worked, which I see as an essential part of these scripts (also, throwing away warnings and errors is not really something one should promote)

ADD COMMENT • link 8.9 years ago by naim.matasci • 0

0

Entering edit mode

the reason to throw away the outputs was cosmetic - there is a message bloat going on that kind of messes up the cute messages I prepared.

I blame the tool developers - why can't their tools be silenced and instructed to only write to the output when the tool has actually something meaningful and unexpected to say?

In fact error messages would be lost in the chaos as tools run in parallel and had I not silenced them all there would loads of useless information printed on each alignment and data fetching that takes place in parallel that - hence it would be pointless to keep them.

ADD REPLY • link updated 22 months ago by Ram 44k • written 8.9 years ago by Istvan Albert 101k

score 0 · Answer 4 · 2015-09-07

0

Entering edit mode

8.9 years ago

DCGenomics ▴ 330

Might be worth noting that HISAT now works directly with SRA (i.e. if you had RNAseq, you might be able to go even faster!).

ADD COMMENT • link 8.9 years ago by DCGenomics ▴ 330