awesome: "bulky fruit salad of a data format".
While I of course never make stupid mistakes...ahem...I have many "friends" who:
but I'm sure there are some other very common pitfalls that are unique to bioinformatics programming. What are your favorites?
Invent a new weakly defined, internally redundant, ambiguous, bulky fruit salad of a data format. Again.
We should work on a standard... I keep saying that, but no one would listen to me anyway. Laughs & tears!
It made me laugh when I read the MIQE guidelines not long after this comic came out: "The nomenclature describing the fractional PCR cycle used for quantification is inconsistent, with threshold cycle (Ct), crossing point (Cp), and take-off point (TOP) currently used in the literature ... we propose the use of quantification cycle (Cq)"
I truncated many fasta files this way when trying to see which headers it contained:
grep > some.fasta
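The shell reads the unquoted > as output redirection, so the file is emptied before grep ever runs; quoting the pattern (grep ">" some.fasta) avoids that. As a minimal sketch of a safer habit, the headers can also be listed from Python, where a file opened for reading cannot be truncated. The function name fasta_headers and the example records are made up for illustration.

```python
def fasta_headers(lines):
    """Yield FASTA header lines (without the leading '>') from an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].rstrip("\n")


# In-memory example; for a real file, pass a handle from open(path, "r").
example = [">seq1 first record\n", "ACGT\n", ">seq2\n", "GGCC\n"]
print(list(fasta_headers(example)))
```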
I also see a lot of off-by-one errors due to switching between formats
BED is 0-based
GFF/GTF are 1-based
and switching between languages:
Python and nearly every other modern language use 0-based indexing
R is 1-based (as is Lua)
IMHO being off by one is the emperor of all bioinformatics mistakes - it rules them all - and probably causes tens of millions of dollars in wasted effort
Not only is the bed format 0-based, it's also "half-open", meaning the start position is inclusive, but the end position is not.
So if your region starts at position 100 and ends at 101 using standard 1-based coordinates with both start and end inclusive (i.e. it's two bases long), when you convert it to 0-based half-open coordinates for BED format the region now starts at 99 but it still ends at 101!
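The conversion described above can be sketched in a few lines. The function names are made up for illustration; the only rule used is the one stated in the post (subtract one from the start, leave the end alone).

```python
def one_based_to_bed(start, end):
    """Convert 1-based, fully inclusive coordinates to 0-based, half-open (BED style)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_one_based(start, end):
    """Convert 0-based, half-open (BED style) coordinates to 1-based, fully inclusive."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


bed_start, bed_end = one_based_to_bed(100, 101)
print(bed_start, bed_end)      # 99 101
print(bed_end - bed_start)     # 2, the length in bases
```

A handy side effect of the half-open convention is that the length is simply end minus start, with no "+ 1" to forget.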
Storing gene annotation in an Excel file and then finding out that some HUGO gene names have been mangled by Excel: SEPT9 becomes sept-9. Conclusion: do not use the .xls format to store your data.
Listening to people make this eternal mistake: "Hey, these two sequences are 50% homologous"
This is a popular one; DEC1 is another well-known example. But you can actually tell Excel not to do that autocorrection. Since you most often get the data from biologists who may have already processed it in Excel, it is better to use another ID column instead of the gene name column if one is available (you often receive both anyway). These errors can even occur in databases that you download data from or that are used for annotation, so it is good to check.
The MARCH genes have tripped me up in the past. Using Excel/Calc etc is fine as long as gene name column is set to 'text' during import.
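Since the advice above is "it is good to check", here is a rough sketch of such a check: flag gene-name values that look like dates. The pattern is an assumption about how mangled names tend to appear (for example 9-Sep, 1-Mar, or the sept-9 form mentioned above); it is not an exhaustive list of what a spreadsheet can produce, and suspicious_gene_names is a made-up helper name.

```python
import re

# Month abbreviations as they typically appear in spreadsheet-formatted dates (assumed).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept?|Oct|Nov|Dec"
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def suspicious_gene_names(names):
    """Return the values that look like dates rather than gene symbols."""
    return [name for name in names if DATE_LIKE.match(name.strip())]


column = ["SEPT9", "9-Sep", "1-Mar", "TP53", "sept-9", "DEC1"]
print(suspicious_gene_names(column))
```

Anything this flags is worth tracing back to a stable identifier column rather than trying to guess the original symbol.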
I feel like a lot of "stupid mistakes" revolve around betrayed trust and false assumptions
For example:
Trusting that a downloaded file is actually fully downloaded
Trusting that an aligner will accept a list of query files instead of just taking the first and ignoring the rest (quiz: which ones am I talking about?)
Assuming that the quality scores in a FASTQ file are from a great Sanger-encoded run instead of a very poor Illumina-1.3 run
Assuming chr1 is followed by chr2 not chr10
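The chr1/chr10 surprise comes from plain string sorting. A small sketch of a sort key that puts numbered chromosomes in numeric order follows; chrom_key is a made-up name, and placing non-numeric names (chrX, chrM, ...) after the numbered ones in alphabetical order is just one possible convention.

```python
import re


def chrom_key(name):
    """Sort key: numbered chromosomes numerically first, everything else alphabetically."""
    match = re.fullmatch(r"(?:chr)?(\d+)", name)
    if match:
        return (0, int(match.group(1)), "")
    return (1, 0, name)


chroms = ["chr10", "chr2", "chrX", "chr1", "chrM"]
print(sorted(chroms))                  # lexicographic: chr1, chr10, chr2, ...
print(sorted(chroms, key=chrom_key))   # numeric: chr1, chr2, chr10, ...
```

Whatever order you pick, the point is to make it explicit and to use the same one in every file a tool compares.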
If you forgive an attempt to be somewhat provocative, my two favorite mistakes are:
1 Letting academics build software
Academics need to publish papers, and one easy way to do that is to implement an algorithm, demonstrate that it works (more or less), and type it up in a manuscript. BT, DT. But robust and useful software requires a bit more than that, as evidenced by the sad state of affairs in typical bioinformatics software (I think I've managed to crash every de novo assembler I've tried, for instance. Not to mention countless hours spent trying - often in vain - to get software to compile and run). Unfortunately, you don't get a lot of academic credit for improved installation procedures, testing, software manuals, or especially, debugging of complicated errors. Much better and more productive to move on to the next publishable implementation.
2 Letting academics build infrastructure
Same argument as above, really. Academics are eager to apply to construct research infrastructures, but of course they aren't all that interested in doing old and boring stuff. So although today's needs might be satisfied by a $300 FTP server, they will usually start conjecturing about tomorrow's needs instead, and embark on ambitious, blue sky stuff that might result in papers, but not in actually useful tools. And even if you get a useful database or web application up and running (and published), there is little incentive to update or improve it, and it is usually left to bitrot, while the authors go off in search of the next publication.
To be clear, it's not a problem with academics themselves (after all, I'm one), just that the incentives are all wrong...
Yeah, I don't know why it is so hard for me to remember all the great bioinformatics software that has come from industry, like, uhh, Eland, or the great standards that have come from industry, like Phred-64 FASTQ.
I always wonder whether anyone ever checks the program/code that comes with a paper. In one paper, the authors hardcoded the input file in the code, which made me waste a whole afternoon figuring out what the hell was wrong with it.
@Jeremy: I'm not so sure industry is much better, and it's possible that academia is the democracy of development - worst, except for the others. Also, a lot of industry software consists of add-ons, designed to sell something else. FWIW, Newbler seems to be one of the better assemblers out there, and CLC is at least half-decent as an analysis platform for non-informaticians.
Out of the (relatively few) tools I have experience with, bowtie/tophat/cufflinks and also fastqc are the exceptions in terms of documentation, UI, maintenance, non-brittleness.
Well, I have a couple:
1) Running a batch BLAST job and forgetting to put the "-o something.out" option. Then switching off the monitor and coming back the next day to see a bunch of characters in my terminal.
2) Running "tar -zxvf" without checking the tar file first. I have decompressed thousands of files into my current directory, assuming they came in their own folder.
Forget the tar problem, just use atool: http://www.nongnu.org/atool/
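For anyone who would rather check by hand, listing the archive first (tar -tzf) shows what will be unpacked. The same check can be sketched with Python's standard tarfile module; top_level_entries and is_tarbomb are made-up helper names, and "more than one top-level entry" is only a rough heuristic for an archive that will spill into the current directory.

```python
import posixpath
import tarfile


def top_level_entries(tar_path):
    """Return the set of top-level names in a tar archive, without extracting anything."""
    tops = set()
    with tarfile.open(tar_path) as tar:  # default mode detects gzip/bzip2/xz transparently
        for name in tar.getnames():
            norm = posixpath.normpath(name)
            if norm not in (".", ""):
                tops.add(norm.split("/", 1)[0])
    return tops


def is_tarbomb(tar_path):
    """Heuristic: True if the archive does not unpack into a single top-level entry."""
    return len(top_level_entries(tar_path)) > 1
```

If is_tarbomb returns True, extracting into a fresh, empty directory keeps the mess contained.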
I'll offer this one, which is a bit on the general side: Deletion of data that appear to serve no relevance from the computational side, but which have importance to the biology/biologist. Often, this arises from a lack of clear communication between the two individuals/teams as to what everything means, what it exactly means and why it is relevant to the process being developed.
Masking out sequence in a FASTA file (e.g. s/TAAT/NNNN/ig) where the sequence is formatted, i.e. split onto multiple lines.
This will miss TAAT that is split over the end of one line and the start of the next!
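One way around this is to join each record's sequence lines before masking and re-wrap afterwards. The following is only a sketch under simple assumptions (the whole file fits in memory, headers start with ">"); read_fasta and mask_motif are made-up names, and an established parser such as BioPython's would do the reading part just as well.

```python
import re


def read_fasta(text):
    """Yield (header_line, joined_sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def mask_motif(text, motif="TAAT", width=60):
    """Replace each case-insensitive motif match with N's, then re-wrap to `width` columns."""
    out = []
    for header, seq in read_fasta(text):
        masked = re.sub(re.escape(motif), "N" * len(motif), seq, flags=re.IGNORECASE)
        out.append(header)
        out.extend(masked[i:i + width] for i in range(0, len(masked), width))
    return "\n".join(out) + "\n"


wrapped = ">s\nACTA\nATGG\n"           # TAAT spans the line break
print(mask_motif(wrapped, width=4))
```

Like the original s///g, this masks non-overlapping matches only.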
The classic mistake (also mentioned above by Casey) is not being aware that the genome assembly affects coordinates.
Doing pathway statistics or gene set enrichment statistics and then presenting the list of gene sets as a valuable result, instead of using those statistics just as a means to decide which pathways need to be evaluated.
(This is bad for many reasons: for instance, because the statistical contribution of a key regulatory gene in a pathway is equal to that of 1 out of 7 iso-enzymes that catalyze a non-relevant side reaction, because the significance of a pathway changes when you add a few non-relevant genes, and also because we have many overlapping pathways.)
Another typical mistake is to solve problems that nobody has.
No, I think it is actually wrong to publish a list of pathways without further judgement. I think not doing the judgement is a mistake. But I have to admit that I don't really understand your examples. So maybe my English is not good enough to understand the finer points of the difference between poor judgement and stupid mistakes.
I would define a stupid mistake as falling prey to a trivial but catastrophic pitfall; an error in judgment is more due to a fundamental lack of understanding or willful ignorance
Re-inventing the wheel. So often did I have to debug (or just replace) a bad implementation of a fasta-parser when BioPython/BioPerl have perfect implementations, I don't understand why no-one bothers to use them. 10 minutes in Google can save you 2 days of work and other people a week of work (you save 2 days of programming, they save a week of understanding your program to find the bug)
I fully agree, re-inventing the wheel is so tempting. We are way too eager to write a few lines of code each time. Plus, because you may have convinced yourself that you can resolve the code in 15 minutes, you don't bother about writing any documentation. In short, there is a very large tendency to re-invent the wheel... many, many times!
I gave my Amazon EC2 password to someone in my group who wanted to run something quickly (estimated cost, $2). I received the bill 2 months later: $156. This person forgot to close the instance. This is an 8-month story and I'm still waiting for my reimbursement... Conclusion: don't trust colleagues!
Easy one: you wait for 4 hours downloading a big DNA file (e.g. bam file) and you mistakenly delete it when trying to move it with a good old rm.
using "rm -rf * .fasta " in an unintended directory; especially if within the home directory...
yes the space between * and .fa has bitten me as well. maybe there is an idiot guard against that somewhere?
I made one a few months ago. I launched a heavy process in a pay-per-use cluster, it was running for one week. I thought, 6 pennies/hr cannot be too much money. I received a bill for $832 usd. I'm not using this cluster again unless I estimate the total cost of the process.
edit: the price is per core
By my count, 6 pennies per hour is $1.44 a day or about $10 a week. How did you get $832?
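The "price is per core" edit is what closes the gap. A back-of-the-envelope sketch: the rate and the one-week runtime come from the post, while the number of cores is not stated anywhere in the thread, so the figure below is only what the bill would imply if the job ran for exactly one week at exactly that rate.

```python
def cluster_cost(rate_per_core_hour, hours, cores=1):
    """Estimated cost in dollars for a job billed per core-hour."""
    return rate_per_core_hour * hours * cores


week_hours = 7 * 24                          # 168 hours
one_core = cluster_cost(0.06, week_hours)    # about $10 for a single core
implied_cores = 832 / one_core               # cores implied by the bill (back-calculated)

print(f"one core for a week: ${one_core:.2f}")
print(f"cores implied by an $832 bill: {implied_cores:.1f}")
```

Running this kind of estimate with the real core count before submitting the job is the cheap version of the lesson.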
I've just made one, which cost me a good headache trying to figure out the biology underlying my strange results!
I assumed the POS field to be the leftmost position of my mapped read on the '+' strand, and the rightmost position on the '-' strand.
Note to self: "Read the manual..."
not to mention how much work is to actually get the rightmost coordinate.
Isn't this just POS+length(SEQ)? I'm having doubts now...
That only applies if the alignment contains only matches or mismatches, meaning CIGAR strings that are composed of a number followed by M (like 76M). For all other alignments you will need to parse the CIGAR string and build the end coordinate from the start plus the lengths of the operations that consume reference bases.
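A minimal sketch of that CIGAR parsing, assuming SAM conventions: POS is 1-based and leftmost, and the operations M, D, N, = and X advance along the reference while I, S, H and P do not. The function name reference_end is made up; libraries such as pysam expose this directly, so this is for illustration only.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference


def reference_end(pos, cigar):
    """Return the 1-based, inclusive rightmost reference position of an alignment."""
    ops = CIGAR_OP.findall(cigar)
    # Reject anything the pattern did not fully account for (including '*').
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"cannot parse CIGAR string: {cigar!r}")
    ref_len = sum(int(n) for n, op in ops if op in REF_CONSUMING)
    return pos + ref_len - 1


print(reference_end(100, "76M"))               # 175
print(reference_end(100, "10M2I20M3D10M5S"))   # 142
```

Note the "- 1": with a 1-based inclusive POS, a 76M read starting at 100 ends at 175, not 176.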
Phew... I'm glad I only have matches and mismatches, so I fall in the easy category :-) Thanks a lot for adding this information, this can be a big trap!
I have a Perl parser that will change the read length based on the CIGAR if you ever want / need it.
Thanks a lot! I'll keep that in mind! Or maybe you can share it here as a Tool? or as an answer to this thread : http://biostars.org/post/show/41951/mapping-reads-with-bwa-and-bowtie/ ?
Hah! I just reverse complemented the reference genome, and redid the alignment. Admittedly, this was for 454 data.
But you should be careful. Doing that will misplace the indel position in a microsatellite.
If I understand you correctly, you are saying that this will inflate the number of variants, since many have ambiguous positions? Interesting - do aligners generally guarantee that such ambiguous variants are consistently placed for forward and reverse reads?
BWA always places the indel at the beginning of a microsatellite. If you align the read to the rev-complemented ref, the indel will be at the end. Many indel callers assume the bwa behavior, though there are also tools to left-align indels.
Running the bwa/GATK pipeline with a corrupt/incompletely generated bwa index of hg19. Everything still aligned, but one of 2 mates would have its strand set incorrectly. Other than the insert size distribution, everything seemed normal, until the TableRecalibration step downshifted all quality scores significantly and then UnifiedGenotyper called 0 SNPs. 1st time I've seen a problem with step 1 of a pipeline not become obvious until step 5+.
My common one
cat aVeryBigFile.whatever &
Can't even stop it now without closing the terminal
Cheers
just kill -9 $PID from another session, no?
can't you just 'fg' then ctrl-c ?
Some really great comments here, nice to know that such things happen to all genii ;). I have to say my most painful moments relate to my assumption that data obtained elsewhere is correct in every way. I also remember early in my career, using PDB files and realising that sometimes, chains are represented more than once, thus when manually checking calculations involving atomic coordinates, being utterly perplexed and wanting to break my computer. Oh the joys of Bioinformatics.
Assuming that the gene IDs in "knownGenes.gtf" from UCSC are actually gene IDs. Instead they just put the transcript ID as the gene ID.
This just caused me a bit of pain when doing read counting at the gene level. Basically, any constitutive exon in a gene with multiple splice forms was ignored because all the reads in that exon were treated as ambiguous.
I wouldn't say it's stupid, but I think a very common mistake is to not correct for batch effects in high-throughput data.
Batch effects can (best-case) hide the real effect that you're looking for, or (worst-case) make it look like your variable of interest is contributing to your findings when it's actually an artifact.
Leek + Irizarry et al. have a sobering review on this here.
Just read this article: "How Not To Be A Bioinformatician" Thought it would be interesting to post here....
meta: should this Q be community-wiki?
good way to boost reputation.
What is meant by generating random genomic sites without avoiding masked (NNN) gaps? Could you give more detail? I don't understand.
see this?
http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:1,10000
you wouldn't want to sample from it
Got it. Thank you.
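To make the point about masked gaps concrete, here is a small sketch of drawing random sites only from positions that are not N. It assumes the sequence is already in memory as a string, which is fine for a toy example but not for a whole chromosome; sample_unmasked_positions is a made-up name.

```python
import random


def sample_unmasked_positions(seq, n, rng=None):
    """Pick n distinct 0-based positions from seq, skipping masked (N/n) bases."""
    rng = rng or random.Random()
    candidates = [i for i, base in enumerate(seq) if base not in "Nn"]
    if n > len(candidates):
        raise ValueError("not enough unmasked positions to sample from")
    return sorted(rng.sample(candidates, n))


toy = "NNNNNACGTNNNACGT"
picked = sample_unmasked_positions(toy, 5, rng=random.Random(0))
print(picked, [toy[i] for i in picked])
```

Sampling uniformly over the whole string instead would land in the N runs in proportion to their length, which is exactly the trap described above.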