Forum: Mission Impossible: you have 1 minute to analyze the Ebola Genome
5.4 years ago, Istvan Albert ♦♦ 83k (University Park, USA) wrote:

I teach an introductory bioinformatics course. For yesterday's lecture I wanted to demonstrate to students just how much you can get done by properly combining all these awesome tools with the Unix command line.

And that got me thinking ... so how much can you get done in a day ...  how about an hour? ... then .... well, how about a minute ... a minute you say???  ... yeah right that's just crazy talk, sounds like ... mission impossible. Or is it really? 

So I googled the Mission Impossible theme song, found a version that is about 1 minute long, and came up with a challenge with the following rules:

  1. You may use any tool or background information that can reasonably be expected to be on a bioinformatician's computer.
  2. You have to start with an empty folder.
  3. Start the music and your script. Your script needs to finish before the theme song does.
  4. At the end of the run your folder needs to contain a result that is noteworthy and publication quality on its own (say, an essential part of a prior publication).
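
The rules above can be checked mechanically. What follows is only a rough sketch of such a check, not part of the original entry: the function name `run_mission`, the `LIMIT` variable and the stand-in command are all made up for illustration. It creates a fresh empty folder, runs a command inside it, and reports whether the command finished within the limit and left something behind.

```shell
#!/usr/bin/env bash
# Hypothetical harness for the challenge rules; all names here are illustrative.
LIMIT=${LIMIT:-60}   # time limit in seconds (the song is about one minute)

run_mission() {
    local workdir start elapsed status=0
    workdir=$(mktemp -d) || return 1          # rule 2: start with an empty folder
    start=$SECONDS
    ( cd "$workdir" && "$@" ) || status=$?    # rule 3: run the script
    elapsed=$(( SECONDS - start ))
    echo "folder:  $workdir"
    echo "elapsed: ${elapsed}s (limit ${LIMIT}s), exit status: $status"
    # rule 4 (weak form): the folder must not be empty at the end
    if [ "$status" -eq 0 ] && [ "$elapsed" -le "$LIMIT" ] && [ -n "$(ls -A "$workdir")" ]; then
        echo "mission accomplished"
    else
        echo "mission failed"
        return 1
    fi
}

# Stand-in for a real analysis script.
run_mission sh -c 'echo "placeholder result" > results.txt'
```

Whether the folder content is actually noteworthy is of course something no script can judge; the check only confirms that the run succeeded, was on time, and produced a file.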

All right then - and here is my entry. It produces all major single nucleotide polymorphisms of the 2014 Ebola genome as published in "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak" (Science, 2014). It requires parallel, efetch, bwa, samtools and freebayes.

The script follows. Let me tell you running it while the theme song is on makes it surprisingly exciting!

It even wastes 16 seconds for dramatic flair. Still finishes in time on a MacBook Air while running a presentation.
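
The block below is not the mission script itself. It is a small preflight sketch, under the assumption that you want to confirm the five tools named above are on your PATH before the music starts. The function name `check_tools` is made up for this example.

```shell
#!/usr/bin/env bash
# Preflight sketch: the tool list comes from the post, the function name is illustrative.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=$(( missing + 1 ))
        fi
    done
    # Succeed only when every tool was found.
    [ "$missing" -eq 0 ]
}

check_tools parallel efetch bwa samtools freebayes \
    || echo "install the missing tools before starting the music"
```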

Tags: mission-impossible, ebola, forum • 3.5k views
modified 4.6 years ago by DCGenomics 320 • written 5.4 years ago by Istvan Albert ♦♦ 83k

In one minute Chuck Norris sequenced his genome. And found zero mutations.

written 5.4 years ago by dariober 11k

That would mean that Chuck Norris and Craig Venter are the same person? (Might explain a few things...)

written 5.4 years ago by Adamc 640

Took you longer than a minute to write the script though..

written 5.4 years ago by 5heikki 8.7k

Yeah, but Tom Cruise also has time to prepare for each mission - that's still a fair comparison.

written 5.4 years ago by Istvan Albert ♦♦ 83k

What's really impressive about this is the power behind such basic and standard tools.

Btw, I found a bug. The last few lines should read:

echo "*** WARNING! The data will self destruct in one minute! ***"
echo
sleep 60
rm -r ~/edu/mission
modified 5.4 years ago by Istvan Albert ♦♦ 83k • written 5.4 years ago by Katie D'Aco 1000

ha, looks like I was caught bluffing - funny!

written 5.4 years ago by Istvan Albert ♦♦ 83k
5.4 years ago, Pablo 1.9k (Canada) wrote:

Nice post!

It looks like you have plenty of time left to analyze what the variants mean:

java -jar snpEff.jar -v ebola_zaire ~/edu/mission/results.vcf > ~/edu/mission/results.eff.vcf

Note that the reference they used is "KJ660346" instead of KM034562 (at least for the annotations part).

modified 5.4 years ago • written 5.4 years ago by Pablo 1.9k

that's pretty cool! just a few extra seconds - I've used the reference that the main paper used but one can just swap that out in the first lines.

written 5.4 years ago by Istvan Albert ♦♦ 83k

It's weird that they mention KM034562 in the main paper; they asked me to provide "KJ660346" for their analysis...

written 5.4 years ago by Pablo 1.9k

Strange indeed, the VCF file (file S1 here http://www.sciencemag.org/content/345/6202/1369/suppl/DC1) is computed relative to KM034562.

The difference between the two genome builds, written as a SAM CIGAR string and MD tag, is:

18957M2S and MD:Z:1848C4433T5527T3787G3358A
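
For readers not used to MD tags: the numbers count matching bases and each letter marks a mismatch, giving the reference base at that spot. A minimal sketch of turning the tag into 1-based reference positions follows. The function name `md_mismatches` is made up, and the sketch deliberately handles only matches and mismatches (no deletions, which MD writes with a `^`).

```shell
#!/usr/bin/env bash
# Sketch: list mismatch positions encoded in a SAM MD tag.
# Handles only match counts and single mismatched bases; deletions (^...) are rejected.
md_mismatches() {
    local md=${1#MD:Z:}   # strip the tag prefix if present
    local pos=0
    while [[ -n $md ]]; do
        if [[ $md =~ ^([0-9]+)(.*)$ ]]; then
            # a run of matching bases: just advance along the reference
            pos=$(( pos + 10#${BASH_REMATCH[1]} ))
            md=${BASH_REMATCH[2]}
        elif [[ $md =~ ^([ACGTN])(.*)$ ]]; then
            # a mismatch: advance by one and report position and reference base
            pos=$(( pos + 1 ))
            printf '%d\t%s\n' "$pos" "${BASH_REMATCH[1]}"
            md=${BASH_REMATCH[2]}
        else
            echo "unsupported MD element: $md" >&2
            return 1
        fi
    done
}

md_mismatches "MD:Z:1848C4433T5527T3787G3358A"
```

With the tag exactly as posted this reports positions 1849, 6283, 11811, 15599 and 18958. The last one lies past the 18957 bases covered by the CIGAR string, so the trailing A in the posted tag may be a copy artifact rather than a fifth difference.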

modified 5.4 years ago • written 5.4 years ago by Istvan Albert ♦♦ 83k
4.6 years ago, osullivanchristopher 200 (United States) wrote:

You should try using HISAT. It can read directly from SRA without pesky bloated fastq. Might be fast enough to leave time for analyzing your output too.

modified 4.6 years ago • written 4.6 years ago by osullivanchristopher 200

interesting point - I will explore that as I plan to rework this example.

written 4.6 years ago by Istvan Albert ♦♦ 83k
4.6 years ago, naim.matasci 0 wrote:

I think this is brilliant, but I feel that there is an issue with it: you don't verify that it actually worked, which I see as an essential part of these scripts (also, throwing away warnings and errors is not really something one should promote)

written 4.6 years ago by naim.matasci 0

the reason to throw away the outputs was cosmetic - there is a message bloat going on that kind of messes up the cute messages I prepared.

I blame the tool developers - why can't their tools be silenced and instructed to write to the output only when the tool actually has something meaningful and unexpected to say?

In fact, error messages would be lost in the chaos as the tools run in parallel. Had I not silenced them all, there would be loads of useless information printed for each alignment and data fetch that takes place in parallel - hence it would be pointless to keep them.
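
One possible middle ground between the two positions in this exchange, offered only as a sketch and not as what the original script does: keep the terminal quiet, but send each job's messages to its own log file and check afterwards that the job succeeded and the output exists. The names `run_quiet`, `require_output` and `LOGDIR` are made up, and the `sh -c` commands stand in for real tools.

```shell
#!/usr/bin/env bash
# Sketch: silence tools on screen without throwing their messages away.
LOGDIR=${LOGDIR:-$(mktemp -d)}

# run_quiet NAME COMMAND...  runs COMMAND with stderr captured in $LOGDIR/NAME.log.
# Stdout is left alone so the caller can still redirect the data it carries.
run_quiet() {
    local name=$1; shift
    local log="$LOGDIR/$name.log" status=0
    "$@" 2>"$log" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "*** $name failed (exit $status), last lines of $log:" >&2
        tail -n 5 "$log" >&2
    fi
    return "$status"
}

# require_output FILE  fails when FILE is missing or empty.
require_output() {
    [ -s "$1" ] || { echo "*** expected output $1 is missing or empty" >&2; return 1; }
}

# Stand-in jobs: one noisy but successful, one that fails.
run_quiet demo_ok sh -c 'echo "noisy progress message" >&2; echo "data"' > "$LOGDIR/demo.txt"
require_output "$LOGDIR/demo.txt" && echo "demo output looks fine"
run_quiet demo_bad sh -c 'echo "something broke" >&2; exit 3' || echo "caught the failure"
```

Because every job writes to a separate log, messages from jobs running in parallel do not interleave, and the cute on-screen messages stay readable.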

modified 4.6 years ago • written 4.6 years ago by Istvan Albert ♦♦ 83k
4.6 years ago, DCGenomics 320 (United States) wrote:

Might be worth noting that HISAT now works directly with SRA (i.e. if you had RNAseq, you might be able to go even faster!).

written 4.6 years ago by DCGenomics 320