awesome: "bulky fruit salad of a data format".
Question: What are the most common stupid mistakes in bioinformatics? |
||
29
|
While I of course never make stupid mistakes...ahem...I have many "friends" who:
but I'm sure there are some other very common pitfalls that are unique to bioinformatics programming. What are your favorites? |
|
49
|
Invent a new weakly defined, internally redundant, ambiguous, bulky fruit salad of a data format. Again. |
|
26
|
Storing gene annotation in an Excel file and finding out that some HUGO gene names have been mangled by Excel: SEPT9 becomes sept-9. Conclusion: do not use the .xls format to store your data. Another one: hearing people repeat the eternal mistake "Hey, these two sequences are 50% homologs" (sequences are either homologous or not; what can be 50% is their identity). |
|
|
2
This is a popular one; DEC1 is another well-known example. But you can actually tell Excel not to apply that auto-correction. Since you most often get the data from biologists who may already have processed it in Excel, it is better to use another ID column rather than the gene name column if one is available (you often receive both anyway). These errors can even occur in databases that you download data from or that are used for annotation, so it is good to check.
9
http://www.biomedcentral.com/1471-2105/5/80 The MARCH genes have tripped me up in the past. Using Excel/Calc etc. is fine as long as the gene name column is set to 'text' during import. | ||
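For the "it is good to check" part, here is a minimal sketch of a sanity check for gene symbols that Excel may have turned into dates. The function names and the list of date shapes (`9-Sep`, `Sep-09`, `sept-9`) are my own assumptions; what Excel actually writes depends on version and locale, so treat the pattern as a starting point and not a complete detector.

```python
import re

# Month abbreviations that date-converted symbols tend to contain.
# "Sept?" also accepts the "sept-9" spelling mentioned in this thread.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sept?|Oct|Nov|Dec"

# Matches "9-Sep" style or "Sep-09" style strings, case-insensitively.
_DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{_MONTHS})|(?:{_MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)


def looks_date_mangled(symbol: str) -> bool:
    """True if the value looks like a date rather than a gene symbol."""
    return bool(_DATE_LIKE.match(symbol.strip()))


def find_mangled(symbols):
    """Return the entries that look like date-converted gene symbols."""
    return [s for s in symbols if looks_date_mangled(s)]
```

Running `find_mangled` over the gene name column of an incoming table gives you a short list to compare against the second ID column, if you received one.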
20
|
Reminds me of that old joke... There are only two hard things in computer science: naming things, cache invalidation and off-by-one errors |
|
|
| ||
17
|
I feel like a lot of "stupid mistakes" revolve around betrayed trust and false assumptions For example:
|
|
16
|
|
|
11
|
trying to solve any problem with BioPerl :-) but the
are my favorite mistakes. |
|
10
|
Well, I have a couple: 1) Running a batch BLAST job and forgetting the "-o something.out" option, then switching off the monitor and coming back the next day to find a bunch of characters in my terminal. 2) Running "tar -zxvf" without checking the tar file first: I have decompressed thousands of files into my current directory, assuming they came in their own folder. |
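For the second one, a minimal sketch of the "check before you extract" step using Python's standard `tarfile` module. The helper names are hypothetical; the idea is just to count the distinct top-level entries, which is the same thing you would eyeball after listing the archive with `tar -tzf`.

```python
import tarfile


def top_level_entries(names):
    """Return the set of first path components for the given member names."""
    tops = set()
    for name in names:
        # Treat './dir/file' and 'dir/file' as the same top-level name.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            tops.add(parts[0])
    return tops


def is_tarbomb(names):
    """True if extraction would drop more than one entry into the current directory."""
    return len(top_level_entries(names)) > 1


def archive_is_tarbomb(path):
    """Open a (possibly compressed) tar archive and apply the check to its members."""
    with tarfile.open(path) as tar:
        return is_tarbomb(tar.getnames())
```

If `archive_is_tarbomb("something.tar.gz")` returns `True`, make a fresh directory and extract there instead.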
|
9
|
Using grep to find sequence (or other) IDs without using the -w switch: "grep 'seq12'" will also find seq121, seq122 and so on. |
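If you do the lookup in a script instead of on the command line, the same trap applies to plain substring tests. A minimal sketch of a whole-word match in Python follows; the function name is made up, and the lookarounds are my approximation of what `grep -w` does (the ID must not be flanked by letters, digits or underscores).

```python
import re


def grep_whole_word(seq_id, lines):
    """Return lines containing seq_id as a whole word, roughly like grep -w."""
    # re.escape protects IDs that contain regex metacharacters such as '.' or '|'.
    rx = re.compile(rf"(?<!\w){re.escape(seq_id)}(?!\w)")
    return [line for line in lines if rx.search(line)]


headers = [">seq12 sample A", ">seq121 sample B", ">seq122 sample C"]
hits = grep_whole_word("seq12", headers)  # only the first header matches
```

Note that an ID such as `seq12.1` would still match, because `.` is not a word character. For exact ID lookups, splitting the header and comparing the first field is safer.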
|
8
|
|
|
|
| ||
7
|
I often encounter problems related to the fact that computer scientists index their arrays starting with 0, while biologists index their sequences starting with 1. A simple concept that drives the noobs mad and even trips up more experienced scientists every once in a while. |
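One habit that helps is to do the conversion in exactly one place. Below is a minimal sketch, with hypothetical function names, that converts between 1-based inclusive coordinates (the way biologists usually write them) and 0-based half-open coordinates (the way Python slices work).

```python
def one_based_to_slice(start, end):
    """Convert 1-based inclusive (start, end) to 0-based half-open (start, end)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    return start - 1, end


def slice_to_one_based(start, end):
    """Convert 0-based half-open (start, end) back to 1-based inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid 0-based interval: [{start}, {end})")
    return start + 1, end


seq = "ATGGCCTAA"
s, e = one_based_to_slice(4, 6)  # bases 4 to 6 as a biologist counts them
codon = seq[s:e]                 # "GCC"
```

Only the start shifts by one; the end stays the same number because inclusive and half-open cancel out.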
|
|
| ||
7
|
Masking out sequence in a FASTA file (e.g. s/TAAT/NNNN/ig) where the sequence is formatted, i.e. split onto multiple lines. This will miss any TAAT that is split over the end of one line and the start of the next! The classic mistake (also mentioned above by Casey) is not being aware that the genome assembly affects coordinates. |
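A minimal sketch of one way around the line-wrap problem: join each record's lines, mask on the joined sequence, then re-wrap. The function names and the 60-column default are my own choices, not anything from the post.

```python
import re


def read_fasta(text):
    """Parse FASTA text into (header, sequence) pairs with line breaks removed."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


def mask_fasta(text, motif="TAAT", width=60):
    """Replace every case-insensitive motif hit with Ns, then re-wrap the sequence."""
    out = []
    for header, seq in read_fasta(text):
        masked = re.sub(re.escape(motif), "N" * len(motif), seq, flags=re.IGNORECASE)
        out.append(header)
        out.extend(masked[i:i + width] for i in range(0, len(masked), width))
    return "\n".join(out) + "\n"


# The motif TAAT straddles the line break here: ...T / AAT...
wrapped = ">s1\nACGT\nAATGGC\n"
result = mask_fasta(wrapped, width=5)
```

A line-by-line substitution would leave `wrapped` untouched, while the joined version masks the hit that spans the break.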
|
|
| ||
7
|
One mistake not unique to bioinformatics is: while editing one source file, compile and run another file. |
|
6
|
Having separate files for each sequence. Of a 454 run. |
|
5
|
Generating a multiple sequence alignment directly from a FASTA or other file and asking: why are there no gaps or deletions in the MSA viewer? |
|
|
| ||
5
|
I'll offer this one, which is a bit on the general side: deleting data that appear to have no relevance from the computational side, but which matter to the biology/biologist. Often, this arises from a lack of clear communication between the two individuals/teams as to what everything means, what it exactly means, and why it is relevant to the process being developed. |
|
|
| ||
4
|
|
|
3
|
What kills me the most is the hand editing of data sets. If you're reading this and do it, please stop -- and start using automated builds -- with clear documentation. |
|
|
| ||
3
|
Losing an hour to learn that some files saved on a Mac have a strange 'beginning of file'-like character... Normalize file standards already! x_o |
|
|
| ||
2
|
One mistake: not looking to see that the 0x4 bit in the bitflag column of a SAM (or BAM) file indicates the entry is unmapped. |
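A minimal sketch of the flag check on plain SAM text, with no library involved. The function names are hypothetical; the only facts relied on are that FLAG is the second tab-separated column and that bit 0x4 set means the read is unmapped.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped


def is_unmapped(flag):
    """True if the 0x4 bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED)


def mapped_records(sam_lines):
    """Yield alignment lines whose FLAG does not have the 0x4 bit set."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if not is_unmapped(int(fields[1])):
            yield line


sam = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(mapped_records(sam))
```

Flag 77 is 1 + 4 + 8 + 64, so the 0x4 bit is set and `r2` is dropped; flag 99 is 1 + 2 + 32 + 64, so `r1` is kept.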
|
|
| ||
2
|
tacking on another command line argument without looking through the rest of them
the first adapter argument (-a ATCTCGTATGCCGTCTTCTGCTTG) is negated by the empty second one |
|
|
| ||
2
|
Easy one: you wait for 4 hours downloading a big DNA file (e.g. bam file) and you mistakenly delete it when trying to move it with a good old rm. |
|
|
| ||
2
|
Using "rm -rf * .fasta" in an unintended directory, especially within the home directory... |
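To see why the stray space hurts, here is a small sketch that uses Python's `shlex` to show how the command line is split into words. `shlex.split` only does the word splitting and no wildcard expansion, so the bare `*` shows up as its own argument, which the shell would then expand to every non-hidden entry in the directory. The helper name is made up.

```python
import shlex


def has_bare_wildcard(command):
    """True if any argument of the command is a lone '*'."""
    return "*" in shlex.split(command)


intended = shlex.split("rm -rf *.fasta")   # ['rm', '-rf', '*.fasta']
typo = shlex.split("rm -rf * .fasta")      # ['rm', '-rf', '*', '.fasta']
```

With the typo, `rm` receives two separate targets: everything, and a file literally named `.fasta`.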
|
1
|
What about building an interface not aimed at a community of (fellow) Biologists? |
|
1
|
s/foo/bar/g without the g at the end as I just proved in the field |
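For anyone moving between tools: in Perl and sed, leaving off the `g` replaces only the first match on each line. Python's `re.sub` behaves the other way around and replaces every match unless you pass `count`. A minimal sketch of both behaviours:

```python
import re

line = "foo foo foo"

# Like s/foo/bar/ without the g: only the first match is replaced.
first_only = re.sub("foo", "bar", line, count=1)

# Like s/foo/bar/g: every match is replaced (the re.sub default).
all_matches = re.sub("foo", "bar", line)
```

Checking the number of replacements with `re.subn` is a cheap way to notice when fewer substitutions happened than you expected.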
|
1
|
Making claims without experimental validation. Especially involving studies utilizing multiplexed technologies such as microarrays and high-throughput sequencing. |
|
|
| ||
1
|
developing algorithms and software around one single piece of low quality data with no prior knowledge while being ignorant about the entire problem. |
|
|
| ||
1
|
My common one
Can't even stop it now without closing the terminal. Cheers |
|
|
just | ||
1
|
Running the bwa/GATK pipeline with a corrupt/incompletely generated bwa index of hg19. Everything still aligned, but one of 2 mates would have its strand set incorrectly. Other than the insert size distribution, everything seemed normal, until the TableRecalibration step downshifted all quality scores significantly and then UnifiedGenotyper called 0 SNPs. 1st time I've seen a problem with step 1 of a pipeline not become obvious until step 5+. |
|
|
| ||
0
|
Scripting an hour to do something you could have done in half an hour manually, and then never needing to repeat it again. |
|
thanks for censoring my answer!
meta: should this Q be community-wiki?
good way to boost reputation.