If you'll forgive an attempt to be somewhat provocative, my two favorite mistakes are:
Letting academics build software
Academics need to publish papers, and one easy way to do that is to implement an algorithm, demonstrate that it works (more or less), and type it up in a manuscript. BT,DT. But robust and useful software requires a bit more than that, as evidenced by the sad state of affairs in typical bioinformatics software (I think I've managed to crash every de novo assembler I've tried, for instance. Not to mention countless hours spent trying - often in vain - to get software to compile and run). Unfortunately, you don't get a lot of academic credit for improved installation procedures, testing, software manuals, or especially, debugging of complicated errors. Much better and more productive to move on to the next publishable implementation.
Letting academics build infrastructure
Same argument as above, really. Academics are eager to apply for grants to construct research infrastructure, but of course they aren't all that interested in doing old and boring stuff. So although today's needs might be satisfied by a $300 FTP server, they will usually start conjecturing about tomorrow's needs instead, and embark on ambitious, blue-sky stuff that might result in papers, but not in actually useful tools. And even if you get a useful database or web application up and running (and published), there is little incentive to update or improve it, and it is usually left to bitrot while the authors go off in search of the next publication.
Re-inventing the wheel. I've had to debug (or just replace) a bad FASTA-parser implementation so often, when BioPython/BioPerl have perfectly good ones, that I don't understand why no one bothers to use them. Ten minutes on Google can save you two days of work and other people a week of work (you save two days of programming; they save the week it takes to understand your program well enough to find the bug).
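For the record, parsing a FASTA with Biopython is about three lines; a minimal sketch (the file name is a placeholder):

```python
# Parse a FASTA file with Biopython instead of hand-rolling a parser;
# "example.fasta" is a placeholder path.
from Bio import SeqIO

for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id, len(record.seq))
```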
I gave my Amazon EC2 password to someone in my group who wanted to run something quickly (estimated cost: $2). I received the bill two months later: $156. This person forgot to shut down the instance. That was eight months ago and I'm still waiting for my reimbursement... Conclusion: don't trust colleagues!
I'll offer this one, which is a bit on the general side: deletion of data that appear to have no relevance from the computational side, but which are important to the biology/biologist. Often this arises from a lack of clear communication between the two individuals/teams about what the data mean, what exactly they mean, and why they are relevant to the process being developed.
Having manual components in an analysis pipeline (editing data sets by hand, running scripts manually).
Not dealing with error conditions at all. This is one thing I really noticed when I started in bioinformatics: code that would just merrily continue when it hit incorrect data and output gibberish, or fail far away from the bad data. A debugging nightmare.
Not testing edge and corner cases for input data
Assuming that your input data are sane; I've run into all sorts of inconsistency issues with public data sets (e.g. protein domains at positions off the end of the protein). These are usually fixed promptly if you complain, but you've got to find them first (a minimal fail-fast check is sketched below).
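The promised fail-fast sketch, using the off-the-end protein domain as the example (the fields and values are made up):

```python
# Validate input up front and fail loudly, close to the bad data,
# instead of outputting gibberish far downstream. All values are made up.
def check_domain(protein_id, protein_length, start, end):
    if not (1 <= start <= end <= protein_length):
        raise ValueError(
            f"{protein_id}: domain {start}-{end} lies outside "
            f"a protein of length {protein_length}"
        )

try:
    check_domain("P12345", 120, 100, 150)  # domain runs off the end
except ValueError as err:
    print("insane input:", err)
```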
I often encounter problems related to the fact that computer scientists index their arrays starting at 0, while biologists index their sequences starting at 1. It's a simple concept that drives the noobs mad and even trips up more experienced scientists every once in a while.
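A small illustration of the clash, assuming the usual conventions (1-based inclusive for the biologist, 0-based half-open for Python):

```python
# Biologists: 1-based, inclusive coordinates (bases 4..7 below).
# Python:     0-based, half-open slices (start inclusive, end exclusive).
def one_based_to_slice(start, end):
    """Convert a 1-based inclusive interval to Python slice bounds."""
    return start - 1, end  # only the start shifts

seq = "ACGTACGTAC"
lo, hi = one_based_to_slice(4, 7)
print(seq[lo:hi])  # TACG: bases 4 through 7 in 1-based terms
```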
Doing pathway statistics or gene set enrichment statistics and then presenting the list of gene sets as a valuable result in itself, instead of using those statistics merely as a means to decide which pathways need closer evaluation.
(This is bad for many reasons: for instance, because the statistical contribution of a key regulatory gene in a pathway is equal to that of one out of seven iso-enzymes that catalyze a non-relevant side reaction; because the significance of a pathway changes when you add a few non-relevant genes; and because we have many overlapping pathways.)
Another typical mistake is to solve problems that nobody has.
One mistake: not checking that the 0x4 bit in the FLAG column of a SAM (or BAM) file indicates the entry is unmapped. RNAME, CIGAR, and POS may be set to something non-null (an actual string!), but these are not meaningful if the 0x4 flag says the read is unmapped.
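A minimal illustration of checking the flag before trusting the other columns; plain Python on the integer FLAG field (with pysam you would use read.is_unmapped instead):

```python
# SAM FLAG bit 0x4 set means the read is UNMAPPED; RNAME, POS and CIGAR
# may still hold non-null placeholder values and must be ignored.
UNMAPPED = 0x4

def is_mapped(flag):
    return not (flag & UNMAPPED)

print(is_mapped(0))   # True: mapped
print(is_mapped(77))  # False: 77 includes the 0x4 bit, so unmapped
```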
Running the bwa/GATK pipeline with a corrupt/incompletely generated bwa index of hg19. Everything still aligned, but one of the two mates would have its strand set incorrectly. Other than the insert-size distribution, everything seemed normal, until the TableRecalibration step downshifted all quality scores significantly and then UnifiedGenotyper called 0 SNPs. First time I've seen a problem with step 1 of a pipeline not become obvious until step 5+.
I made one a few months ago. I launched a heavy process on a pay-per-use cluster, and it ran for a week. I thought six pennies an hour couldn't be too much money. I received a bill for $832 USD. I'm not using that cluster again unless I estimate the total cost of the process first.
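The fix is one line of arithmetic done up front; a back-of-the-envelope sketch in which the rate, core count, and runtime are all hypothetical placeholders:

```python
# Estimate the total cluster cost BEFORE launching. All numbers are
# placeholders, not the actual billing details from the story above.
rate_per_core_hour = 0.06  # the innocent-looking "6 pennies/hr"
cores = 64                 # the multiplier that is easy to forget
hours = 7 * 24             # one week

print(f"estimated cost: ${rate_per_core_hour * cores * hours:.2f}")
# ~$645 -- six cents an hour stops looking cheap once multiplied out
```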
Possibly: implementing methods that magically generate p-values from unreplicated RNA-seq experiments, possibly as a result of pressure from 'experimentalists'. I really would like to know the history behind their implementation (were the authors forced by reviewers, or by other groups?). Now we have to explain why those p-values are bogus, and why so few significantly differentially expressed genes are detected in an unreplicated analysis.
I had another good one recently. I was executing an untested bash script to generate output from two input files on our cluster, and I just let it run overnight. When I checked in the morning, it had generated about 40 TB of output (the expectation was about 20 MB). A tiny spelling mistake had led to an infinite loop. Oops. I was lucky to check it when I did, because there were still a few TB of space left, so at least other jobs didn't get killed because of it.
I have spent hours, on repeated occasions, looking for a mysterious error in a Perl script that turned out to be simply an = instead of an == within an if statement.
Another recurrent mistake: not documenting what I did and what my scripts do, in the belief that everything is so intuitive, organised, simple and natural that documentation won't be necessary. Then, some time later, I have to spend hours trying to guess what all that mess was.
Some really great comments here; nice to know that such things happen to all genii ;). I have to say my most painful moments relate to my assumption that data obtained elsewhere are correct in every way. I also remember, early in my career, using PDB files and realising that sometimes chains are represented more than once, and thus, when manually checking calculations involving atomic coordinates, being utterly perplexed and wanting to break my computer. Oh, the joys of bioinformatics.
Assuming that the gene IDs in "knownGenes.gtf" from UCSC are actually gene IDs. Instead, they just put the transcript ID in the gene ID field.
This just caused me a bit of pain when doing read counting at the gene level: any constitutive exon in a gene with multiple splice forms was ignored, because all the reads in that exon were treated as ambiguous.
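One workaround is to collapse the transcript IDs to real gene IDs before counting. A sketch, assuming you have dumped a two-column transcript-to-gene table (e.g., from UCSC's kgXref); the file names and column layout here are assumptions:

```python
import csv
import re

# Hypothetical two-column TSV mapping UCSC transcript IDs to gene symbols
# (e.g., exported from the kgXref table).
with open("kgXref.tsv") as fh:
    tx2gene = dict(csv.reader(fh, delimiter="\t"))

# Rewrite gene_id so exons from all isoforms of a gene share one gene_id,
# instead of each isoform masquerading as its own "gene".
with open("knownGenes.gtf") as fin, open("knownGenes.fixed.gtf", "w") as fout:
    for line in fin:
        m = re.search(r'gene_id "([^"]+)"', line)
        if m and m.group(1) in tx2gene:
            line = line.replace(
                f'gene_id "{m.group(1)}"',
                f'gene_id "{tx2gene[m.group(1)]}"',
            )
        fout.write(line)
```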
How about writing a tool and being so convinced it works perfectly that you start testing it on a complete dataset instead of testing it first on a subset, only to find out, after it has run for an hour or so, that you made a tiny mistake somewhere. Sooo much time wasted that I'll never get back :P
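The cheap insurance is a smoke test on a slice of the input first; a sketch in which the file name and cutoff are arbitrary:

```python
from itertools import islice

# Carve off the first 1,000 lines as a smoke-test input before the full run;
# "big_input.txt" and the cutoff are arbitrary placeholders.
with open("big_input.txt") as fin, open("test_subset.txt", "w") as fout:
    fout.writelines(islice(fin, 1000))

# Point the tool at test_subset.txt first: minutes, not hours,
# to discover the tiny mistake.
```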
This one was really good: embarking on sudo yum update when there was lots of stuff to update and swap space was very low. Ended up in a situation much like this. Took me a good four hours before I saw my desktop again.
Another fun one is when you develop a pipeline with a small test set, prioritizing speed above all, and then you increase your test set size and realize that you're creating TBs of temp data and using hundreds of GBs of RAM :)
Running statistical analyses with no understanding of the tests you're using, why you're using them, when it's appropriate to use them, or how your data need to be formatted, normalized, or scaled to use them... but hey, you made a volcano plot, so it must be publication quality, right? Read the damn paper...
The most pervasive and damaging problem afflicting bioinformatics studies is the failure to curate covariate information, resulting in problems of separation and/or complete separation.
This problem is particularly sinister because this frequently occurs long before the bioinformatician is consulted on the project, meaning the project already has unnecessary limitations before the data scientist/statistician/what-have-you receives the data.
At minimum, this limits statistical power; in certain cases, it means there is a confound that can't be mitigated without meta-analysis or extreme measures. In my experience this affects, oh I don't know, >70% of labs, including world-class, well-known labs that don't retain a statistician.
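A toy illustration of the worst case, complete separation: if an uncurated covariate such as batch perfectly predicts case/control status, no downstream model can disentangle the two. All data here are fabricated for the demo:

```python
import pandas as pd

# Fabricated metadata: every case was run in batch A and every control in
# batch B, so "batch" completely separates the outcome.
meta = pd.DataFrame({
    "status": ["case"] * 4 + ["control"] * 4,
    "batch":  ["A"] * 4 + ["B"] * 4,
})
print(pd.crosstab(meta["status"], meta["batch"]))
# batch    A  B
# status
# case     4  0
# control  0  4
# With a table like this, a logistic regression on batch has no finite
# estimate, and no analysis can untangle biology from batch after the fact.
```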
Staying in the office all day
Good way to boost reputation.
meta: should this Q be community-wiki?
What is meant by "generate random genomic sites without avoiding masked (NNN) gaps"? Could you explain in more detail? I don't understand.
Because you wouldn't want to sample from a masked (NNN) gap.