Question: What are the most common stupid mistakes in bioinformatics?
 
29
 
 

While I of course never make stupid mistakes...ahem...I have many "friends" who:

  1. forget to check both strands
  2. generate random genomic sites without avoiding masked (NNN) gaps
  3. confuse genome freezes and even species

but I'm sure there are some other very common pitfalls that are unique to bioinformatics programming. What are your favorites?

 
 
 

thanks for censoring my answer!

log in to reply • written 13 months ago by Casbon  235213
 
2

meta: should this Q be community-wiki?

log in to reply • written 13 months ago by Chris Miller  657524
 

good way to boost reputation.

log in to reply • written 21 days ago by Raygozak  27119
 
[ content deleted ]
log in to reply • written 19 days ago by Leonard  0

36 answers

 
49
 
 

Invent a new weakly defined, internally redundant, ambiguous, bulky fruit salad of a data format. Again.

 
 
 
7

awesome: "bulky fruit salad of a data format".

log in to reply • written 13 months ago by brentp  12151135
 
1

This still makes me laugh every time I read it...and cry a little inside too.

log in to reply • written 26 days ago by Daniel Standage  209213
 

We should work on a standard... I keep saying that, but no one hears me anyway. Laughs & tears!

log in to reply • written 23 days ago by madkitty  42
 
log in to reply • written 23 days ago by Daniel Standage  209213
 
 
40
 
 

I truncated many FASTA files this way when trying to see which headers they contained (the unquoted > is treated by the shell as output redirection, so the file is overwritten):

grep > some.fasta

I also see a lot of off-by-one errors due to switching between formats

  • BED is 0-based

  • GFF/GTF are 1-based

and switching between languages:

  • Python and nearly every other modern language use 0-based indexing

  • R is 1-based (as is Lua)

 
 
 
12

IMHO being off by one is the emperor of all bioinformatics mistakes - it rules them all - and probably causes tens of millions of dollars in wasted effort

log in to reply • written 13 months ago by Istvan Albert ♦♦ 145611133
 
2

Not only is the BED format 0-based, it's also "half-open", meaning the start position is inclusive, but the end position is not.

So if your region starts at position 100 and ends at 101 using standard 1-based coordinates with both start and end inclusive (i.e. it's two bases long), when you convert it to 0-based half-open coordinates for BED format the region now starts at 99 but it still ends at 101!
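
The conversion described above can be sketched in Python as follows. This is only a minimal illustration; the function names are made up for this example and do not come from any library.

```python
def one_based_to_bed(start, end):
    """Convert 1-based, fully inclusive coordinates to 0-based, half-open (BED)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts; the exclusive end has the same numeric value.
    return start - 1, end


def bed_to_one_based(start, end):
    """Convert 0-based, half-open (BED) coordinates to 1-based, fully inclusive."""
    if start < 0 or end <= start:
        # A zero-length BED interval has no 1-based inclusive equivalent.
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def bed_length(start, end):
    """Length of a half-open interval is simply end minus start."""
    return end - start


bed_start, bed_end = one_based_to_bed(100, 101)
print(bed_start, bed_end, bed_length(bed_start, bed_end))
```

Keeping the conversion in one named function, rather than scattering "+1" and "-1" through a script, makes it easier to see which coordinate system each variable is in.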

log in to reply • written 13 months ago by Nina  222
 
 
26
 
 

Storing gene annotation in an Excel file and then finding out that some HUGO gene names have been mangled by Excel: SEPT9 becomes sept-9. Conclusion: do not use the .xls format to store your data.

Hearing people make this eternal mistake: "Hey, these two sequences are 50% homologous" (homology is all-or-nothing; what they mean is 50% identical).

 
 
 
2

This is a popular one; DEC1 is another well-known example. But you can actually tell Excel not to do that auto-correction. Since you most often get the data from biologists who may already have handled the data in Excel, it is better to use another ID column and not the gene name column if one is available (you often receive both anyway). These errors can even occur in databases that you download data from or that are used for annotation, so it is good to check.
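
A check of this kind could look like the minimal Python sketch below. The pattern and the function name `date_like_symbols` are assumptions made for illustration; the exact strings a spreadsheet produces depend on its version and locale, so treat the pattern as a starting point rather than a complete list.

```python
import re

# Hypothetical pattern: a day number and a month abbreviation joined by a
# hyphen, in either order (e.g. "9-Sep", "1-Dec", "sept-9").
_MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sept?|oct|nov|dec)"
_DATE_LIKE = re.compile(
    rf"(?:\d{{1,2}}-{_MONTH}|{_MONTH}-\d{{1,2}})", re.IGNORECASE
)


def date_like_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if _DATE_LIKE.fullmatch(s.strip())]


column = ["SEPT9", "9-Sep", "sept-9", "TP53", "1-Dec", "MARCH1"]
print(date_like_symbols(column))
```

Anything this flags still needs a human to decide which gene was meant, which is one more reason to carry a stable ID column alongside the names.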

log in to reply • written 13 months ago by Chris Evelo  804722
 
9
log in to reply • written 13 months ago by Simon Cockell  5881927
 

@Simon Thanks for the publication

log in to reply • written 13 months ago by Fred Fleche  341513
 

The MARCH genes have tripped me up in the past. Using Excel/Calc etc. is fine as long as the gene name column is set to 'text' during import.

log in to reply • written 13 months ago by Ian  224311
 

Uh oh. Looks like something was lost here in the migration.

log in to reply • written 26 days ago by Daniel Standage  209213
 
 
20
 
 

Reminds me of that old joke...

There are only two hard things in computer science: naming things, cache invalidation and off-by-one errors

 
 
 
 
17
 
 

I feel like a lot of "stupid mistakes" revolve around betrayed trust and false assumptions

For example:

  1. Trusting that a downloaded file is actually fully downloaded

  2. Trusting that an aligner will accept a list of query files instead of just taking the first and ignoring the rest (quiz: which ones am I talking about?)

  3. Assuming that the quality scores in a FASTQ file are from a great Sanger-encoded run instead of a very poor Illumina-1.3 run

  4. Assuming chr1 is followed by chr2 not chr10
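
For the last item, a "natural" sort key is one way to get chr2 before chr10. The sketch below is a minimal Python illustration; `natural_key` is a name chosen for this example, and where non-numeric names such as chrX or chrM should land is a convention you have to pick yourself.

```python
import re


def natural_key(name):
    """Split a name into text and number parts so numbers compare numerically."""
    key = []
    for part in re.split(r"(\d+)", name):
        if part.isdecimal():
            key.append((0, int(part)))  # numeric chunk, compared as an integer
        else:
            key.append((1, part))       # text chunk, compared as a string
    return key


chromosomes = ["chr10", "chr2", "chr1", "chrX", "chrM"]
print(sorted(chromosomes))                   # plain string sort: chr10 before chr2
print(sorted(chromosomes, key=natural_key))  # numeric order within the chr prefix
```

Whatever order you choose, the important part is that every file in the pipeline (reference, BAM, BED, VCF) uses the same one.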

 
 
 
2

I like the last item. :)

log in to reply • written 11 months ago by Yuri  1041514
 
 
16
 
 
  • off-by-one errors
  • regex errors
  • parsing a complex alignment/file format incorrectly (e.g. BLAST or GenBank, probably the original rationale for developing BioPerl)
  • failing to account for strand
  • failing to revcomp sequences
  • failing to account for the last element in a file (because of an improper loop condition or no EOL character on last line)
  • failing to account for OS dependent line breaks
  • using the wrong assembly/annotation/release
  • using the wrong genome coordinate system
  • using the wrong file (multiple versions, version skew)
  • failing to account for nested/intercalated annotation features (e.g. genes)
  • assuming all jobs have completed on a cluster
  • deleting files
  • not randomizing your data properly
  • improper use of statistical tests
  • not documenting methods fully (to check and correct all of the above)
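
As an illustration of the strand and revcomp items, here is a minimal Python sketch that searches both strands for a motif. The function names are made up for this example, and it only handles A/C/G/T/N; IUPAC ambiguity codes are deliberately left out.

```python
# Translation table for complementing plain DNA (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(sequence, motif):
    """Return (0-based start, strand) for every motif hit on either strand.

    Minus-strand hits are reported at their position on the plus strand.
    A palindromic motif is reported once per strand.
    """
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)  # allow overlapping hits
    return hits


print(reverse_complement("ATGC"))
print(find_on_both_strands("AAGGCCTT", "AAGG"))
```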
 
 
 
2

+1 for OS dependent line breaks. This still trips me up on occasion when I get files from other groups and find that (gasp) they use Windows.

log in to reply • written 11 months ago by Wjeck  3419
 
 
15
 
 

If you forgive an attempt to be somewhat provocative, my two favorite mistakes are:

1 Letting academics build software

Academics need to publish papers, and one easy way to do that is to implement an algorithm, demonstrate that it works (more or less), and type it up in a manuscript. Been there, done that. But robust and useful software requires a bit more than that, as evidenced by the sad state of affairs in typical bioinformatics software (I think I've managed to crash every de novo assembler I've tried, for instance. Not to mention countless hours spent trying - often in vain - to get software to compile and run). Unfortunately, you don't get a lot of academic credit for improved installation procedures, testing, software manuals, or especially, debugging of complicated errors. It is much better and more productive to move on to the next publishable implementation.

2 Letting academics build infrastructure

Same argument as above, really. Academics are eager to apply to construct research infrastructures, but of course they aren't all that interested in doing old and boring stuff. So although today's needs might be satisfied by a $300 FTP server, they will usually start conjecturing about tomorrow's needs instead, and embark on ambitious, blue sky stuff that might result in papers, but not in actually useful tools. And even if you get a useful database or web application up and running (and published), there is little incentive to update or improve it, and it is usually left to bitrot, while the authors go off in search of the next publication.

 
 
 
1

To be clear, it's not a problem with academics themselves (after all, I'm one), just that the incentives are all wrong...

log in to reply • written 13 months ago by Ketil  15719
 

It's not like you're getting any incentives anyway once you've published something.. Otherwise I'll be a student forever

log in to reply • written 23 days ago by madkitty  42
 
8

Yeah, I don't know why it is so hard for me to remember all the great bioinformatics software that has come from industry, like uhh Eland, or the great standards that have come from industry, like Phred-64 FASTQ.

log in to reply • written 13 months ago by Jeremy Leipzig  820823
 

I always wonder whether anyone has ever checked the program/code that comes with a paper. In one paper, they hardcoded the input file in the code, which made me waste a whole afternoon figuring out what the hell was wrong with it.

log in to reply • written 12 months ago by Tg  204
 

@Jeremy: I'm not so sure industry is much better, and it's possible that academia is the democracy of development - worst, except for the others. Also, a lot of industry software consists of add-ons, designed to sell something else. FWIW, Newbler seems to be one of the better assemblers out there, and CLC is at least half-decent as an analysis platform for non-informaticians.

log in to reply • written 12 months ago by Ketil  15719
 

Out of the (relatively few) tools I have experience with, bowtie/tophat/cufflinks and also fastqc are the exceptions in terms of documentation, UI, maintenance, non-brittleness.

log in to reply • written 26 days ago by bw.  22
 

I am fine with point 2, but I have to disagree with 1. Your de novo assembler example is actually not a good one. De novo assembly is very complicated and highly data dependent. I doubt any assembler works for all data sets, whether it was developed by academia or by professional programmers.

log in to reply • written 22 days ago by lh3  11741223
 
 
11
 
 

trying to solve any problem with BioPerl :-)

but the

  • '+1' error
  • and the grep > some.fasta

are my favorite mistakes.

 
 
 
4

s/Bioperl/perl/ ;) - haters gonna hate!

log in to reply • written 13 months ago by Casbon  235213
 
 
10
 
 

Well, I have a couple:

1) Running a batch BLAST job and forgetting to add the "-o something.out" option, then switching off the monitor and coming back the next day to see a bunch of characters in my terminal.

2) Running "tar -zxvf" without checking the tar file first. I have decompressed thousands of files into my current directory, assuming they came in their own folder.
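
For 2), a check like the following Python sketch lists what an archive would unpack at the top level before anything is extracted. It uses the standard `tarfile` module; the helper names `top_level_entries` and `unpacks_into_many_entries` are invented for this example.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(member_names):
    """Return the set of first path components among archive member names."""
    tops = set()
    for name in member_names:
        parts = [p for p in PurePosixPath(name).parts if p not in (".", "/")]
        if parts:
            tops.add(parts[0])
    return tops


def unpacks_into_many_entries(archive_path):
    """True if extracting would create more than one entry in the target dir."""
    with tarfile.open(archive_path) as archive:  # compression is auto-detected
        return len(top_level_entries(archive.getnames())) > 1


print(top_level_entries(["pkg/README", "pkg/src/main.c"]))
print(top_level_entries(["a.txt", "b.txt", "./c/d.txt"]))
```

If the check reports many top-level entries, extracting into a fresh, empty directory is the safer move.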

 
 
 

+1 for 2) #mostly happens with downloaded software!

log in to reply • written 11 months ago by Nagarajan Paramasivam  3129
 
 
9
 
 

Using grep to find sequence (or other) IDs without using the -w switch: "grep 'seq12'" will also find seq121, seq122 and so on.
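
The same pitfall exists when filtering IDs in a script. Here is a minimal Python sketch of whole-word matching, roughly in the spirit of grep -w; `lines_with_exact_id` is a hypothetical helper name. Note that an ID like seq12.1 would still match, because "." is not a word character.

```python
import re


def lines_with_exact_id(lines, seq_id):
    """Keep lines where seq_id appears as a whole word, not as a prefix."""
    # The ID must not be directly preceded or followed by a letter, digit
    # or underscore.
    pattern = re.compile(r"(?<!\w)" + re.escape(seq_id) + r"(?!\w)")
    return [line for line in lines if pattern.search(line)]


headers = [">seq12 sample A", ">seq121 sample B", ">seq122", ">xseq12"]
print(lines_with_exact_id(headers, "seq12"))
```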

 
 
 
1

This tip is helpful indeed!

log in to reply • written 10 months ago by Zhaorong  3735
 
 
8
 
 
  • having manual components to an analysis pipeline (editing data sets, running scripts manually)
  • Not dealing with error conditions at all. This is one thing that I really noticed when I started with bioinformatics: code that would just merrily continue when it hit incorrect data and output gibberish or fail far away from the bad data. A debugging nightmare.
  • Not testing edge and corner cases for input data
  • Assuming that your input data is sane; I've run into all sorts of inconsistency issues with public data sets (e.g. protein domains at positions off the end of the protein). Usually fixed promptly if you complain, but you've got to find them first.
 
 
 
 
8
 
 

Trying to open microarray or, worse, NGS data files with Excel or Word...

 
 
 
 
7
 
 

I often encounter problems related to the fact that computer scientists index their arrays starting with 0, while biologists index their sequences starting with 1. A simple concept that drives the noobs mad and even trips up more experienced scientists every once in a while.

 
 
 
 
7
 
 

Masking out sequence in a FASTA file (e.g. s/TAAT/NNNN/ig) where the sequence is formatted, i.e. split onto multiple lines.

This will miss TAAT that is split over the end of one line and the start of the next!

The classic mistake (also mentioned above by Casey) is not being aware that the genome assembly affects coordinates.
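
One way around this is to join the sequence lines of each record before masking and re-wrap afterwards. The Python sketch below is a minimal illustration under that assumption; `mask_fasta` and its `width` parameter are invented for this example, and for real work an established FASTA parser is the safer choice.

```python
import re


def mask_fasta(lines, motif, width=60):
    """Mask every (case-insensitive) motif occurrence with N, per record.

    Sequence lines are joined before masking, so a motif that spans a line
    break is still found. Output sequence is re-wrapped to `width` columns.
    """
    pattern = re.compile(re.escape(motif), re.IGNORECASE)
    masked = []
    chunks = []

    def flush():
        # Mask and re-wrap the sequence collected for the current record.
        if chunks:
            sequence = pattern.sub("N" * len(motif), "".join(chunks))
            masked.extend(
                sequence[i:i + width] for i in range(0, len(sequence), width)
            )
            chunks.clear()

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            masked.append(line)
        elif line:
            chunks.append(line)
    flush()
    return masked


# TAAT straddles the line break between "ACGTA" and "ATCGG".
print(mask_fasta([">s1", "ACGTA", "ATCGG"], "TAAT", width=5))
```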

 
 
 
 
7
 
 

One mistake not unique to bioinformatics is: while editing one source file, compiling and running another file.

 
 
 
3

;) then hours debugging the wrong file

log in to reply • written 11 months ago by Tony  133213
 
 
6
 
 

Having separate files for each sequence.

Of a 454 run.

 
 
 

Or rather, using a file system that can't handle a few million files in a directory?

log in to reply • written 13 months ago by Ketil  6827
 

I was using Ubuntu Linux. It "handled" it, just slowly.

log in to reply • written 13 months ago by Andrewjgrimm  4118
 
 
5
 
 

Generating a multiple sequence alignment directly from a FASTA or other file and asking: why are there no gaps and deletions in the MSA viewer?

 
 
 
 
5
 
 

Doing pathway statistics or gene set enrichment statistics and then presenting the list of gene sets as a valuable result, instead of using the statistics just as a means to decide which pathways need to be evaluated.

(This is bad for many reasons: for instance, because the statistical contribution of a key regulatory gene in a pathway is equal to that of 1 out of 7 iso-enzymes that catalyze a non-relevant side reaction, because the significance of a pathway changes when you add a few non-relevant genes, and also because we have many overlapping pathways.)

Another typical mistake is to solve problems that nobody has.

 
 
 
2

these sound like poor judgments (e.g. Clinton-Lewinsky), not stupid mistakes (e.g. dangling chad)

log in to reply • written 13 months ago by Jeremy Leipzig  820823
 

No, I think it is actually wrong to publish a list of pathways without further judgement. I think not exercising that judgement is a mistake. But I have to admit that I don't really understand your examples. So maybe my English is not good enough to understand the finer points of the difference between poor judgement and stupid mistakes.

log in to reply • written 13 months ago by Chris Evelo  804722
 
2

I would define a stupid mistake as falling prey to a trivial but catastrophic pitfall; an error in judgment is more due to a fundamental lack of understanding or willful ignorance

log in to reply • written 13 months ago by Jeremy Leipzig  820823
 

In that case you are right, these would be judgement errors.

log in to reply • written 13 months ago by Chris Evelo  804722
 
 
5
 
 

I'll offer this one, which is a bit on the general side: Deletion of data that appear to serve no relevance from the computational side, but which have importance to the biology/biologist. Often, this arises from a lack of clear communication between the two individuals/teams as to what everything means, what it exactly means and why it is relevant to the process being developed.

 
 
 
 
5
 
 

Re-inventing the wheel. So often did I have to debug (or just replace) a bad implementation of a fasta-parser when BioPython/BioPerl have perfect implementations, I don't understand why no-one bothers to use them. 10 minutes in Google can save you 2 days of work and other people a week of work (you save 2 days of programming, they save a week of understanding your program to find the bug)

 
 
 

I fully agree, re-inventing the wheel is so tempting. We are way too eager to write a few lines of code each time. Plus, because you may have convinced yourself that you can write the code in 15 minutes, you don't bother writing any documentation. In short, there is a very strong tendency to re-invent the wheel... many, many times!

log in to reply • written 21 days ago by Javier Herrero  253
 
 
4
 
 
  1. Using Excel to sort or manage your CSV records.
  2. Using multiple alignments of highly diverse sequences or, worse, recombining sequences, and inferring evolutionary history based on the resulting tree.
 
 
 

I've made that mistake a couple of times. I didn't know it was stupid, though... Now I look like a fool in front of my computer reading this thread XD

log in to reply • written 23 days ago by madkitty  42
 
 
3
 
 

What kills me the most is the hand editing of data sets.

If you're reading this and do it, please stop -- and start using automated builds -- with clear documentation.

 
 
 
 
3
 
 

Losing an hour to learn that some files saved on a Mac have a strange 'beginning of file'-like character...

Normalize file standards already! x_o

 
 
 
 
3
 
 

Not keeping an adequate notebook.

 
 
 
 
2
 
 

One mistake: not looking to see that the 0x4 bit in the bitflag column of a SAM (or BAM) file indicates the entry is unmapped. RNAME, CIGAR, and POS may be set to something non-null (an actual string!) but these are not meaningful if the 0x4 flag says the read is unmapped.
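
A minimal Python sketch of that check on plain-text SAM lines is below. The helper names are made up for this example; in practice a SAM/BAM library would normally do the parsing.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(sam_line):
    """Return True if the FLAG column (second field) has the 0x4 bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & UNMAPPED)


def mapped_records(lines):
    """Yield alignment lines that are mapped, skipping headers and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        if not is_unmapped(line):
            yield line


mapped = "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII"
# RNAME and POS are filled in here, but FLAG 77 includes 0x4: unmapped.
unmapped = "r2\t77\tchr1\t100\t0\t*\t*\t0\t0\tACGT\tIIII"
print(list(mapped_records(["@HD\tVN:1.6", mapped, unmapped])))
```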

 
 
 
 
2
 
 

Forgetting to run 'dos2unix' and then spending a lot of time trying to figure out why there is no output
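
If a script has to cope with such files itself, something like the following Python sketch can detect and normalize the line endings first. The function names are hypothetical and this is not a replacement for dos2unix, just an illustration of the idea.

```python
def has_windows_newlines(data: bytes) -> bool:
    """True if the raw bytes contain at least one CRLF line ending."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF (Windows) and lone CR (old Mac) line endings to LF."""
    # Replace CRLF first so the CR-only pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


raw = b">seq1\r\nACGT\r\n"
print(has_windows_newlines(raw))
print(to_unix_newlines(raw))
```

Working on bytes keeps the check independent of whatever newline translation the text layer applies when a file is opened in text mode.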

 
 
 
 
2
 
 

tacking on another command line argument without looking through the rest of them

novoalign -a ATCTCGTATGCCGTCTTCTGCTTG -d genome.ndx -F ILMFQ -f query.fq -a -m -l 17 -h 60 -t 65 -o sam -o FullNW

the first adapter argument (-a ATCTCGTATGCCGTCTTCTGCTTG) is negated by the empty second one
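
A quick sanity check before launching a long job is to list any options that occur more than once. The Python sketch below does only that; `repeated_options` is an invented name, the check knows nothing about what any particular tool does with a repeated flag, and a repeat can be perfectly legitimate, so the output is a prompt to re-read the command and nothing more.

```python
import shlex
from collections import Counter


def _is_number(token):
    """True for tokens such as -5 or -0.1 that are values, not options."""
    try:
        float(token)
    except ValueError:
        return False
    return True


def repeated_options(command):
    """Return the dash-prefixed tokens that occur more than once."""
    tokens = shlex.split(command)[1:]  # drop the program name
    counts = Counter(
        t for t in tokens if t.startswith("-") and not _is_number(t)
    )
    return sorted(option for option, n in counts.items() if n > 1)


cmd = (
    "novoalign -a ATCTCGTATGCCGTCTTCTGCTTG -d genome.ndx -F ILMFQ "
    "-f query.fq -a -m -l 17 -h 60 -t 65 -o sam -o FullNW"
)
print(repeated_options(cmd))
```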

 
 
 
 
2
 
 

Easy one: you wait for 4 hours downloading a big DNA file (e.g. bam file) and you mistakenly delete it when trying to move it with a good old rm.

 
 
 
 
2
 
 

using "rm -rf * .fasta" in an unintended directory, especially within the home directory...

 
 
 

then do not use the recursive switch (-r) to delete files within the same directory :-P

log in to reply • written 26 days ago by Tony  133213
 

yes the space between * and .fa has bitten me as well. maybe there is an idiot guard against that somewhere?

log in to reply • written 26 days ago by Jeremy Leipzig  820823
 

I was about to add this one myself. It's bitten me a couple of times.

log in to reply • written 22 days ago by Travis  19910
 
 
1
 
 

What about building an interface not aimed at a community of (fellow) Biologists?

 
 
 
1

that might be stupid and it might be a mistake but it isn't a "stupid mistake". see comments under Chris Evelo's post.

log in to reply • written 13 months ago by Jeremy Leipzig  820823
 

Then I will conclude with the same comment as Chris.

log in to reply • written 13 months ago by Andra Waagmeester  283314
 
 
1
 
 

s/foo/bar/g without the g at the end as I just proved in the field

 
 
 

I have an opposite mistake. I accidentally replace a string in the whole document and something I didn't want to replace gets replaced. Now I visually select the area where I want to replace...

log in to reply • written 11 months ago by Sequencegeek  5519
 
 
1
 
 

Making claims without experimental validation. Especially involving studies utilizing multiplexed technologies such as microarrays and high-throughput sequencing.

 
 
 
 
1
 
 

developing algorithms and software around one single piece of low quality data with no prior knowledge while being ignorant about the entire problem.

 
 
 
 
1
 
 

My common one: cat aVeryBigFile.whatever &

Can't even stop it now without closing the terminal

Cheers

 
 
 

just kill -9 $PID from another session, no?

log in to reply • written 26 days ago by Simon Cockell  5881927
 

Good thought but, on a server (I should have told earlier, I commit this on server sessions), your process id's in one login are not known in other login/session.

log in to reply • written 26 days ago by Sukhdeep Singh  855
 
6

I'm pretty sure that ps aux | grep cat from another session will give you the PID of any running cat process on the server.

log in to reply • written 26 days ago by Daniel Standage  209213
 
2

can't you just 'fg' then ctrl-c ?

log in to reply • written 26 days ago by brentp  12151135
 

@brentp, @Daniel cool guys, now this mistake can be rectified :)

log in to reply • written 26 days ago by Sukhdeep Singh  855
 
 
1
 
 

Running the bwa/GATK pipeline with a corrupt/incompletely generated bwa index of hg19. Everything still aligned, but one of 2 mates would have its strand set incorrectly. Other than the insert size distribution, everything seemed normal, until the TableRecalibration step downshifted all quality scores significantly and then UnifiedGenotyper called 0 SNPs. 1st time I've seen a problem with step 1 of a pipeline not become obvious until step 5+.

 
 
 
 
0
 
 

Scripting an hour to do something you could have done in half an hour manually, and then never needing to repeat it again.

 
 
 
1

I tend to see the opposite: not spending the time up front to do it right and having to continue doing it manually/semi-manually ad nauseam

log in to reply • written 12 days ago by brentp  12151135
 