Question: What Are The Most Common Stupid Mistakes In Bioinformatics?
46
3.1 years ago by
Philadelphia, PA
Jeremy Leipzig12k wrote:

While I of course never have stupid mistakes...ahem...I have many "friends" who:

  1. forget to check both strands
  2. generate random genomic sites without avoiding masked (NNN) gaps
  3. confuse genome freezes and even species

but I'm sure there are some other very common pitfalls that are unique to bioinformatics programming. What are your favorites?

ADD COMMENTlink modified 11 months ago by Manu Prestat2.8k • written 3.1 years ago by Jeremy Leipzig12k
2

meta: should this Q be community-wiki?

ADD REPLYlink written 3.1 years ago by Chris Miller12k
1

good way to boost reputation.

ADD REPLYlink written 24 months ago by Raygozak830

What do you mean by "generate random genomic sites without avoiding masked (NNN) gaps"? Could you give more detail? I don't understand.

ADD REPLYlink written 22 months ago by zhilongjia40

see this?

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:1,10000

you wouldn't want to sample from it
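
A minimal Python sketch of how to avoid this, assuming the sequence is already in memory as a string; the function name `sample_unmasked_positions` and the toy sequence are made up for illustration:

```python
import random

def sample_unmasked_positions(seq, n, rng=None):
    """Return n random 0-based positions in seq that are not 'N' (masked/gap)."""
    rng = rng or random.Random()
    # Collect the eligible positions once, so gap positions can never be drawn.
    eligible = [i for i, base in enumerate(seq) if base.upper() != "N"]
    if not eligible:
        raise ValueError("sequence contains only N")
    return [rng.choice(eligible) for _ in range(n)]

# Toy sequence: a leading gap, some real bases, an internal gap, more bases.
toy = "NNNNNNNNNNACGTACGTNNNNACGT"
sites = sample_unmasked_positions(toy, 5, random.Random(0))
```

For a whole genome, listing every eligible position costs a lot of memory; rejection sampling (draw a position, retry if it is N) is a common alternative.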

ADD REPLYlink modified 22 months ago • written 22 months ago by Jeremy Leipzig12k

Got it. Thank you.

ADD REPLYlink written 22 months ago by zhilongjia40
67
3.1 years ago by
Keith James5.4k
UK
Keith James5.4k wrote:

Invent a new weakly defined, internally redundant, ambiguous, bulky fruit salad of a data format. Again.

ADD COMMENTlink written 3.1 years ago by Keith James5.4k
9

awesome: "bulky fruit salad of a data format".

ADD REPLYlink written 3.1 years ago by brentp17k
6

This still makes me laugh every time I read it...and cry a little inside too.

ADD REPLYlink written 2.0 years ago by Daniel Standage3.0k

We should work on a standard .. I keep saying that but no one would hear me anyway Laughs & Tears ~!

ADD REPLYlink written 2.0 years ago by madkitty230
8

http://xkcd.com/927/

ADD REPLYlink written 2.0 years ago by Daniel Standage3.0k
1

It made me laugh when I read the MIQE guidelines not long after this comic came out: "The nomenclature describing the fractional PCR cycle used for quantification is inconsistent, with threshold cycle (Ct), crossing point (Cp),and take-off point (TOP) currently used in the literature ... we propose the use of quantification cycle (Cq)"

ADD REPLYlink written 22 months ago by dominic10
49
3.1 years ago by
brentp17k
Denver, Colorado
brentp17k wrote:

I truncated many fasta files this way when trying to see which headers it contained:

grep > some.fasta

I also see a lot of off-by-one errors due to switching between formats

  • Bed is 0 based

  • GFF/GTF are 1-based

and switching between languages:

  • Python and nearly every other modern language are 0-based indexing

  • R is 1-based (as is Lua)

ADD COMMENTlink written 3.1 years ago by brentp17k
16

IMHO being off by one is the emperor of all bioinformatics mistakes - it rules them all - and probably causes tens of millions of dollars in wasted effort

ADD REPLYlink written 3.1 years ago by Istvan Albert ♦♦ 39k
4

Not only is the bed format 0-based, it's also "half-open", meaning the start position is inclusive, but the end position is not.

So if your region starts at position 100 and ends at 101 using standard 1-based coordinates with both start and end inclusive (ie it's two bases long), when you convert it to 0-based half-open coords for bed format the region now starts at 99 but it still ends at 101!
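
A minimal Python sketch of that conversion, using the same numbers; the function names are made up for illustration:

```python
def one_based_to_bed(start, end):
    """Convert a 1-based, fully inclusive interval to 0-based, half-open (BED)."""
    # Only the start shifts; the end stays the same number.
    return start - 1, end

def bed_to_one_based(start, end):
    """Convert a 0-based, half-open (BED) interval to 1-based, fully inclusive."""
    return start + 1, end

bed_start, bed_end = one_based_to_bed(100, 101)
# In half-open coordinates the length is simply end - start.
length = bed_end - bed_start
```

A handy sanity check is that the length of the region must be the same in both systems: end - start in BED, end - start + 1 in 1-based inclusive coordinates.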

ADD REPLYlink written 3.0 years ago by Nina260
35
3.1 years ago by
Fred Fleche3.7k
Paris, France
Fred Fleche3.7k wrote:

Storing gene annotation in an Excel file and finding out that some HUGO gene names have been mangled by Excel: SEPT9 becomes sept-9 because it is read as a date. Conclusion: do not use the .xls format to store your data.

Hearing people make this eternal mistake: "Hey, these two sequences are 50% homologs." Homology is all-or-nothing; what they mean is 50% identity.

ADD COMMENTlink modified 2.0 years ago by Istvan Albert ♦♦ 39k • written 3.1 years ago by Fred Fleche3.7k
10

http://www.biomedcentral.com/1471-2105/5/80

ADD REPLYlink modified 21 months ago by Michael Dondrup27k • written 3.1 years ago by Simon Cockell6.6k
2

This is a popular one; DEC1 is another well-known example. But you can actually tell Excel not to do that autocorrection. Since you most often get the data from biologists who may have already processed it in Excel, it is better to use another ID column rather than the gene name column if one is available (you often receive both anyway). These errors can even occur in databases that you download data from or that are used for annotation, so it is good to check.

ADD REPLYlink written 3.1 years ago by Chris Evelo8.8k
2

"Hey these two sequences are 50% homologs"... I know. Whereas those are 45% homologs only ;-)

ADD REPLYlink written 17 months ago by Manu Prestat2.8k

@Simon Thanks for the publication

ADD REPLYlink written 3.1 years ago by Fred Fleche3.7k

The MARCH genes have tripped me up in the past. Using Excel/Calc etc is fine as long as gene name column is set to 'text' during import.

ADD REPLYlink written 3.0 years ago by Ian3.3k

Uh oh. Looks like something was lost here in the migration.

ADD REPLYlink written 2.0 years ago by Daniel Standage3.0k
29
3.1 years ago by
Casbon2.7k
Casbon2.7k wrote:

Reminds me of that old joke...

There are only two hard things in computer science: naming things, cache invalidation and off-by-one errors

ADD COMMENTlink written 3.1 years ago by Casbon2.7k
22
3.1 years ago by
Philadelphia, PA
Jeremy Leipzig12k wrote:

I feel like a lot of "stupid mistakes" revolve around betrayed trust and false assumptions

For example:

  1. Trusting that a downloaded file is actually fully downloaded

  2. Trusting that an aligner will accept a list of query files instead of just taking the first and ignoring the rest (quiz: which ones am i talking about?)

  3. Assuming that the quality scores in a FASTQ file are from a great Sanger-encoded run instead of a very poor Illumina-1.3 run

  4. Assuming chr1 is followed by chr2 not chr10
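
For the last item, a minimal Python sketch of a "natural" sort key that puts chr2 before chr10; the function name `natural_key` and the chromosome list are made up for illustration:

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'chr2' sorts before 'chr10'."""
    # re.split with a capturing group keeps the digit runs in the result,
    # e.g. 'chr10' -> ['chr', '10', ''].
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

chroms = ["chr10", "chr2", "chr1", "chrX", "chr21"]
lexical = sorted(chroms)                   # what plain string sorting gives you
natural = sorted(chroms, key=natural_key)  # numeric chunks compared as numbers
```

Whichever order you pick, the important part is that every file in the pipeline (reference, BAM, BED, VCF) uses the same one.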

ADD COMMENTlink written 3.1 years ago by Jeremy Leipzig12k
4

I like the last item. :)

ADD REPLYlink written 2.8 years ago by Yuri1.1k
21
3.1 years ago by
Casey Bergman15k
Manchester, UK
Casey Bergman15k wrote:
  • off-by-one errors
  • regex errors
  • parsing a complex alignment/file format incorrectly (e.g. BLAST or GenBank, probably the original rationale for developing BioPerl)
  • failing to account for strand
  • failing to revcomp sequences
  • failing to account for the last element in a file (because of an improper loop condition or no EOL character on the last line)
  • failing to account for OS dependent line breaks
  • using the wrong assembly/annotation/release
  • using the wrong genome coordinate system
  • using the wrong file (multiple versions, version skew)
  • failing to account for nested/intercalated annotation features (e.g. genes)
  • assuming all jobs have completed on a cluster
  • deleting files
  • not randomizing your data properly
  • improper use of statistical tests
  • not documenting methods fully (to check and correct all of the above)
ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Casey Bergman15k
4

+1 for OS dependent line breaks. This still trips me up on occasion when I get files from other groups and find that (gasp) they use windows.

ADD REPLYlink written 2.9 years ago by Wjeck430
21
3.0 years ago by
Ketil3.3k
Ketil3.3k wrote:

If you forgive an attempt to be somewhat provocative, my two favorite mistakes are:

1 Letting academics build software

Academics need to publish papers, and one easy way to do that is to implement an algorithm, demonstrate that it works (more or less), and type it up in a manuscript. BT,DT. But robust and useful software requires a bit more than that, as evidenced by the sad state of affairs in typical bioinformatics software (I think I've managed to crash every de novo assembler I've tried, for instance. Not to mention countless hours spent trying - often in vain - to get software to compile and run). Unfortunately, you don't get a lot of academic credit for improved installation procedures, testing, software manuals, or especially, debugging of complicated errors. Much better and more productive to move on to the next publishable implementation.

2 Letting academics build infrastructure

Same argument as above, really. Academics are eager to apply to construct research infrastructures, but of course they aren't all that interested in doing old and boring stuff. So although today's needs might be satisfied by a $300 FTP server, they will usually start conjecturing about tomorrow's needs instead, and embark on ambitious, blue sky stuff that might result in papers, but not in actually useful tools. And even if you get a useful database or web application up and running (and published), there is little incentive to update or improve it, and it is usually left to bitrot, while the authors go off in search of the next publication.

ADD COMMENTlink written 3.0 years ago by Ketil3.3k
10

Yeah, I don't know why it is so hard for me to remember all the great bioinformatics software that has come from industry, like uhh Eland, or the great standards that have come from industry, like Phred-64 FASTQ.

ADD REPLYlink written 3.0 years ago by Jeremy Leipzig12k
4

To be clear, it's not a problem with academics themselves (after all, I'm one), just that the incentives are all wrong...

ADD REPLYlink written 3.0 years ago by Ketil3.3k

It's not like you're getting any incentives anyway once you've published something.. Otherwise I'll be a student forever

ADD REPLYlink written 2.0 years ago by madkitty230
1

I am fine with point 2, but I have to disagree with 1. Your de novo assembler example is actually not a good one. De novo assembly is very complicated and highly data dependent. I doubt any assembler works for all data sets, whether developed by academia or by professional programmers.

ADD REPLYlink modified 24 months ago • written 24 months ago by lh320k

I always wonder if they have ever checked the program/code that comes with the paper. In one paper, they hardcoded the input file in the code, which made me waste a whole afternoon figuring out what the hell was wrong with it.

ADD REPLYlink written 3.0 years ago by Tg240

@Jeremy: I'm not so sure industry is much better, and it's possible that academia is the democracy of development - worst, except for the others. Also, a lot of industry software are add-ons, designed to sell something else. FWIW, Newbler seems to be one of the better assemblers out there, and CLC is at least half-decent as an analysis platform for non-informaticians.

ADD REPLYlink written 3.0 years ago by Ketil3.3k

Out of the (relatively few) tools I have experience with, bowtie/tophat/cufflinks and also fastqc are the exceptions in terms of documentation, UI, maintenance, non-brittleness.

ADD REPLYlink written 2.0 years ago by bw.60
18
2.9 years ago by
Shellfishgene230
Shellfishgene230 wrote:

Using grep to find sequence (or other) IDs without using the -w switch: "grep 'seq12'" will also find seq121, seq122 and so on.

ADD COMMENTlink written 2.9 years ago by Shellfishgene230
1

This tip is helpful indeed!

ADD REPLYlink written 2.8 years ago by Zhaorong530
1

Yea, until you make the assumption that -w actually works only on whitespace.

printf "foo-choo" | grep -Fw -e "foo"

returns foo-choo. Hate that.
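
A minimal Python sketch of exact ID matching that sidesteps both problems, assuming the ID is the first whitespace-delimited token of a FASTA header; the function name `headers_with_id` and the toy records are made up for illustration:

```python
def headers_with_id(fasta_lines, wanted_id):
    """Return header lines whose ID (first token after '>') equals wanted_id exactly."""
    hits = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # Compare the whole token, so 'seq12' matches neither 'seq121'
            # nor 'seq12-alt'.
            if tokens and tokens[0] == wanted_id:
                hits.append(line.rstrip("\n"))
    return hits

lines = [
    ">seq12 sample A\n", "ACGT\n",
    ">seq121 sample B\n", "GGCC\n",
    ">seq12-alt sample C\n", "TTAA\n",
]
hits = headers_with_id(lines, "seq12")
```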

ADD REPLYlink written 12 months ago by sjneph490
14
3.1 years ago by
France
Pierre Lindenbaum58k wrote:

trying to solve any problem with BioPerl :-)

but the

  • '+1' error
  • and the grep > some.fasta

are my favorite mistakes.

ADD COMMENTlink written 3.1 years ago by Pierre Lindenbaum58k
9

s/Bioperl/perl/ ;) - haters gonna hate!

ADD REPLYlink written 3.1 years ago by Casbon2.7k

why do you hate BioPerl :(?

ADD REPLYlink written 12 months ago by anin.gregory40
3

My top reason is that BioPerl is inefficient due to its OOP layer.

ADD REPLYlink written 12 months ago by lh320k
13
2.9 years ago by
Zhaorong530
State College, PA
Zhaorong530 wrote:

One mistake not unique to bioinformatics is: while editing one source file, compile and run another file.

ADD COMMENTlink written 2.9 years ago by Zhaorong530
6

;) then hours debugging the wrong file

ADD REPLYlink written 2.9 years ago by Tony1.7k
1

ouch... brings back bad memories...

ADD REPLYlink written 22 months ago by kajendiran5680
11
3.0 years ago by
Rayna200
Paris/Munich
Rayna200 wrote:

Try to open microarray or, worse, NGS datafiles with excel or word...

ADD COMMENTlink written 3.0 years ago by Rayna200
11
2.9 years ago by
Andres Pinzon110
Colombia
Andres Pinzon110 wrote:

Well, I have a couple:

1) Running a batch BLAST job and forgetting to add the "-o something.out" option. Then switching off the monitor and coming back the next day to see a bunch of characters in my terminal.

2) Running "tar -zxvf" without checking the tar file first. I have decompressed thousands of files into my current directory, assuming they came in their own folder.
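
For the second one, a minimal Python sketch of a pre-flight check using the standard `tarfile` module; the helper names `top_level_entries`, `is_tarbomb` and `archive_names` are made up for illustration:

```python
import tarfile

def top_level_entries(names):
    """Return the set of distinct top-level names among archive member paths."""
    tops = set()
    for name in names:
        # Ignore empty components and a leading './'.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            tops.add(parts[0])
    return tops

def is_tarbomb(names):
    """True if extracting would drop more than one entry into the current directory."""
    return len(top_level_entries(names)) > 1

def archive_names(path):
    """List member names of a tar archive without extracting anything."""
    with tarfile.open(path) as tar:
        return tar.getnames()

tidy = ["tool-1.0/README", "tool-1.0/src/main.c"]
messy = ["README", "src/main.c", "Makefile"]
```

Calling `is_tarbomb(archive_names("something.tar.gz"))` before extracting tells you whether to create a directory first.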

ADD COMMENTlink written 2.9 years ago by Andres Pinzon110
1

+1 for 2) #mostly happens with downloaded softwares !

ADD REPLYlink written 2.9 years ago by Naga350

Forget the tar problem, just use atool: http://www.nongnu.org/atool/

ADD REPLYlink written 18 months ago by Ryan Thompson2.4k
8
3.1 years ago by
Larry_Parnell15k
Boston, MA USA
Larry_Parnell15k wrote:

I'll offer this one, which is a bit on the general side: Deletion of data that appear to serve no relevance from the computational side, but which have importance to the biology/biologist. Often, this arises from a lack of clear communication between the two individuals/teams as to what everything means, what it exactly means and why it is relevant to the process being developed.

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Larry_Parnell15k
8
3.0 years ago by
Gareth Palidwor1.5k
Ottawa
Gareth Palidwor1.5k wrote:
  • having manual components to an analysis pipeline (editing data sets running scripts manually)
  • Not dealing with error conditions at all. This is one thing that I really noticed when I started with bioinformatics: code that would just merrily continue when it hit incorrect data and output gibberish or fail far away from the bad data. A debugging nightmare.
  • Not testing edge and corner cases for input data
  • Assuming that your input data is sane; I've run into all sorts of inconsistency issues with public data sets (i.e. protein domains at positions off the end of the protein, etc). Usually fixed promptly if you complain but you've got to find them first.
ADD COMMENTlink written 3.0 years ago by Gareth Palidwor1.5k
7
3.1 years ago by
Daniel Standage3.0k
Bloomington, Indiana, USA
Daniel Standage3.0k wrote:

I often encounter problems related to the fact that computer scientists index their arrays starting with 0, while biologists index their sequences starting with 1. A simple concept that drives the noobs mad and even trips up more experienced scientists every once in a while.

ADD COMMENTlink written 3.1 years ago by Daniel Standage3.0k
7
3.0 years ago by
Ian3.3k
University of Manchester, UK
Ian3.3k wrote:

Masking out sequence in a FASTA file (e.g. s/TAAT/NNNN/ig) where the sequence is formatted, i.e. split onto multiple lines.

This will miss TAAT that is split over the end of one line and the start of the next!
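
A minimal Python sketch of one way around it: join each record's sequence lines, mask, then re-wrap. The function name `mask_motif` and the toy record are made up for illustration:

```python
import re

def mask_motif(fasta_text, motif, width=60):
    """Mask every occurrence of motif with N, even when it spans a line break."""
    records = []  # list of [header, joined_sequence]
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line.strip()
    out = []
    for header, seq in records:
        # Substitute on the joined sequence, case-insensitively like s///ig.
        masked = re.sub(re.escape(motif), "N" * len(motif), seq,
                        flags=re.IGNORECASE)
        out.append(header)
        # Re-wrap to the requested line width.
        out.extend(masked[i:i + width] for i in range(0, len(masked), width))
    return "\n".join(out) + "\n"

# TAAT is split across the two sequence lines here: ...TA / AT...
fasta = ">demo\nACGTA\nATGGC\n"
masked = mask_motif(fasta, "TAAT", width=5)
```

A line-by-line substitution on the same input would leave both lines untouched.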

The classic mistake (also mentioned above by Casey) is not being aware that the genome assembly affects coordinates.

ADD COMMENTlink written 3.0 years ago by Ian3.3k
6
3.1 years ago by
Chris Evelo8.8k
Maastricht, The Netherlands
Chris Evelo8.8k wrote:

Doing pathway statistics or gene set enrichment statistics and then presenting the list of gene sets as a valuable result, instead of using those statistics just as a means to decide which pathways need to be evaluated.

(This is bad for many reasons, for instance because the statistical contribution of a key regulatory gene in a pathway is equal to that of 1 out of 7 iso-enzymes that catalyze a non-relevant side reaction, because the significance of a pathway changes when you add a few non-relevant genes, and also because we have many overlapping pathways.)

Another typical mistake is to solve problems that nobody has.

ADD COMMENTlink modified 21 months ago • written 3.1 years ago by Chris Evelo8.8k
2

these sound like poor judgments (e.g. Clinton-Lewinsky), not stupid mistakes (e.g. dangling chad)

ADD REPLYlink written 3.1 years ago by Jeremy Leipzig12k
2

i would define a stupid mistake as falling prey to a trivial but catastrophic pitfall, an error in judgment is more due a fundamental lack of understanding or willful ignorance

ADD REPLYlink written 3.1 years ago by Jeremy Leipzig12k

No, I think it is actually wrong to publish a list of pathways without further judgement. I think not doing the judgement is a mistake. But I have to admit that I don't really understand your examples. So maybe my English is not good enough to understand the finer points of the difference between poor judgement and stupid mistakes.

ADD REPLYlink written 3.1 years ago by Chris Evelo8.8k

In that case you are right, these would be judgement errors.

ADD REPLYlink written 3.1 years ago by Chris Evelo8.8k
6
3.1 years ago by
Andrewjgrimm410
Sydney, Australia
Andrewjgrimm410 wrote:

Having separate files for each sequence.

Of a 454 run.

ADD COMMENTlink written 3.1 years ago by Andrewjgrimm410

Or rather, using a file system that can't handle a few million files in a directory?

ADD REPLYlink written 3.0 years ago by Ketil3.3k

I was using Ubuntu Linux. It "handled" it, just slowly.

ADD REPLYlink written 3.0 years ago by Andrewjgrimm410
6
2.9 years ago by
Zhidkov450
Israel
Zhidkov450 wrote:

Forget to do 'dos2unix' and then spend a lot of time trying to figure out why there is no OUTPUT
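
If dos2unix is not at hand, a minimal Python sketch of the same normalization, operating on raw bytes; the function name `normalize_newlines` is made up for illustration:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first so the lone-CR pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

windows_text = b">seq1\r\nACGT\r\n"
fixed = normalize_newlines(windows_text)
```

A stray carriage return at the end of an ID is invisible in most terminals, which is why lookups against such a file silently return nothing.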

ADD COMMENTlink written 2.9 years ago by Zhidkov450

 

Classic. This one tricked me 3 times over the course of two years, spending one hour each time to figure out what the h... is going on.

ADD REPLYlink modified 6 days ago • written 6 days ago by Christian720
6
2.0 years ago by
Zev.Kronenberg7.5k
United States
Zev.Kronenberg7.5k wrote:

Not keeping an adequate notebook.

ADD COMMENTlink written 2.0 years ago by Zev.Kronenberg7.5k
5
3.1 years ago by
Thaman2.9k
Finland
Thaman2.9k wrote:

Generate Multiple Sequence Alignment direct from fasta or other file and ask: why there are no gaps & deletion in the MSA viewer?

ADD COMMENTlink written 3.1 years ago by Thaman2.9k
5
3.1 years ago by
hadasa930
hadasa930 wrote:
  1. Using excel to sort or manage your csv records.
  2. Using multiple alignments of highly diverse sequences or, worse, recombining sequences, and inferring evolutionary history based on the resulting tree.
ADD COMMENTlink written 3.1 years ago by hadasa930

I've made that mistake a couple of times. I didn't know it was stupid though .. Now I look like a fool in front of my computer reading this thread XD

ADD REPLYlink written 2.0 years ago by madkitty230
5
2.0 years ago by
Philipp1.4k
Brisbane, Australia
Philipp1.4k wrote:

Re-inventing the wheel. So often did I have to debug (or just replace) a bad implementation of a fasta-parser when BioPython/BioPerl have perfect implementations, I don't understand why no-one bothers to use them. 10 minutes in Google can save you 2 days of work and other people a week of work (you save 2 days of programming, they save a week of understanding your program to find the bug)

ADD COMMENTlink written 2.0 years ago by Philipp1.4k

I fully agree, re-inventing the wheel is so tempting. We are way too eager to write a few lines of code each time. Plus, because you may have convinced yourself that you can write the code in 15 minutes, you don't bother writing any documentation. In short, there is a very strong tendency to re-invent the wheel... many, many times!

ADD REPLYlink written 24 months ago by Javier Herrero280
5
17 months ago by
Manu Prestat2.8k
Berkeley
Manu Prestat2.8k wrote:

I gave my Amazon EC2 password to someone in my group who wanted to run something quickly (estimated cost, $2). I received the bill 2 months later: $156. This person forgot to shut down the instance. This story is 8 months old and I'm still waiting for my reimbursement... Conclusion: don't trust colleagues!

ADD COMMENTlink written 17 months ago by Manu Prestat2.8k
4
3.1 years ago by
Blunders940
Blunders940 wrote:

What kills me the most is the hand editing of data sets.

If you're reading this and do it, please stop -- and start using automated builds -- with clear documentation.

ADD COMMENTlink written 3.1 years ago by Blunders940
4
2.9 years ago by
Philadelphia, PA
Jeremy Leipzig12k wrote:

tacking on another command line argument without looking through the rest of them

novoalign -a ATCTCGTATGCCGTCTTCTGCTTG -d genome.ndx -F ILMFQ -f query.fq -a -m -l 17 -h 60 -t 65 -o sam -o FullNW

the first adapter argument (-a ATCTCGTATGCCGTCTTCTGCTTG) is negated by the empty second one

ADD COMMENTlink written 2.9 years ago by Jeremy Leipzig12k
4
2.3 years ago by
Pascal1.1k
Barcelona
Pascal1.1k wrote:

Easy one: you wait for 4 hours downloading a big DNA file (e.g. bam file) and you mistakenly delete it when trying to move it with a good old rm.

ADD COMMENTlink written 2.3 years ago by Pascal1.1k
4
2.0 years ago by
Rm6.5k
US
Rm6.5k wrote:

using "rm -rf * .fasta " in unintended directory; especially if within the home directory...

ADD COMMENTlink written 2.0 years ago by Rm6.5k

then do not use the recursive switch (-r) to delete files within the same directory :-P

ADD REPLYlink written 2.0 years ago by Tony1.7k

yes the space between * and .fa has bitten me as well. maybe there is an idiot guard against that somewhere?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Jeremy Leipzig12k

I was about to add this one myself. It's bitten me a couple of times.

ADD REPLYlink written 24 months ago by Travis2.2k

:) I did the same stupid thing many times !! I lost weeks of work with one click !

ADD REPLYlink written 21 months ago by Mchimich140
4
21 months ago by
JC4.8k
Seattle
JC4.8k wrote:

I made one a few months ago. I launched a heavy process in a pay-per-use cluster, it was running for one week. I thought, 6 pennies/hr cannot be too much money. I received a bill for $832 usd. I'm not using this cluster again unless I estimate the total cost of the process.

edit: the price is per core

ADD COMMENTlink modified 17 months ago • written 21 months ago by JC4.8k

By my count, 6 pennies per hour is $1.44 a day or about $10 a week. How did you get $832?

ADD REPLYlink written 17 months ago by Ryan Thompson2.4k

maybe its price per CPU and depending upon the size of cluster, BAM!!!

ADD REPLYlink written 17 months ago by Sukhdeep Singh4.6k

I used ALL the cores ...

ADD REPLYlink modified 17 months ago • written 17 months ago by JC4.8k

Ah well, 6 pennies per CPU-hour is a little different, isn't it? :)

ADD REPLYlink written 17 months ago by Ryan Thompson2.4k

yes, but it was not clear in the service description from our local provider

ADD REPLYlink written 17 months ago by JC4.8k
3
3.0 years ago by
Eric Normandeau7.1k
Quebec, Canada
Eric Normandeau7.1k wrote:

Losing an hour to learn that some files saved on a Mac have a strange 'beginning of file'-like character...

Normalize file standards already! x_o

ADD COMMENTlink written 3.0 years ago by Eric Normandeau7.1k
3
3.0 years ago by
Vince360
Davis, CA
Vince360 wrote:

One mistake: not checking the 0x4 bit in the bitflag column of a SAM (or BAM) file, which indicates that the entry is unmapped. RNAME, CIGAR, and POS may be set to something non-null (an actual string!) but these are not meaningful if the 0x4 flag says the read is unmapped.
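
A minimal Python sketch of the check on a raw SAM line, assuming tab-separated fields with FLAG in the second column as in the SAM specification; the function name `is_mapped` and the two toy records are made up for illustration:

```python
UNMAPPED = 0x4  # SAM FLAG bit: set when the read itself is unmapped

def is_mapped(sam_line):
    """Return True if the 0x4 (unmapped) bit is NOT set in the FLAG field."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return not (flag & UNMAPPED)

# Flag 0: mapped, forward strand.
mapped = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
# Flag 69 = 0x1 + 0x4 + 0x40: paired, unmapped, first in pair.
# RNAME and POS are filled in, but they must not be trusted.
unmapped = "r2\t69\tchr1\t100\t0\t*\t=\t100\t0\tACGT\tIIII"
```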

ADD COMMENTlink written 3.0 years ago by Vince360
3
2.3 years ago by
T S30
T S30 wrote:

developing algorithms and software around one single piece of low quality data with no prior knowledge while being ignorant about the entire problem.

ADD COMMENTlink written 2.3 years ago by T S30
3
21 months ago by
Leonor Palmeira3.2k
Liège, Belgium
Leonor Palmeira3.2k wrote:

I've just made one, which cost me a good headache trying to figure out the biology underlying my strange results!

  • expecting SAM's POS field to be the leftmost position of my mapped read on the '+' strand, and the rightmost position on the '-' strand

Note to self : "Read the manual..."

ADD COMMENTlink written 21 months ago by Leonor Palmeira3.2k

not to mention how much work it is to actually get the rightmost coordinate.

ADD REPLYlink written 21 months ago by Istvan Albert ♦♦ 39k
1

Hah! I just reverse complemented the reference genome, and redid the alignment. Admittedly, this was for 454 data.

ADD REPLYlink written 18 months ago by Ketil3.3k

Hehe! That's the best idea ever! This thread keeps on giving.

ADD REPLYlink written 18 months ago by Istvan Albert ♦♦ 39k

But you should be careful. Doing that will misplace the indel position in a microsatellite.

ADD REPLYlink written 17 months ago by lh320k

If I understand you correctly, you are saying that this will inflate the number of variants, since many have ambiguous positions? Interesting - do aligners generally guarantee that such ambiguous variants are consistently placed for forward and reverse reads?

ADD REPLYlink written 17 months ago by Ketil3.3k
2

BWA always places the indel at the beginning of a microsatellite. If you align the read to the rev-complemented ref, the indel will be at the end. Many indel callers assume the bwa behavior, though there are also tools to left-align indels.

ADD REPLYlink written 17 months ago by lh320k

Isn't this just POS+length(SEQ)? I'm having doubts now...

ADD REPLYlink written 21 months ago by Leonor Palmeira3.2k
2

that only applies if the sequence contains only matches or mismatches, this means edit strings that are composed of a number followed by M (like 76M) . For all other alignments you will need to parse the CIGAR string and build the end coordinate from the start + numbers in the edit string.
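
A minimal Python sketch of that calculation, following the SAM specification's rule that the M, D, N, = and X operations consume reference bases; the function name `reference_end` is made up for illustration. Note that with 1-based inclusive coordinates even a pure-match read ends at POS + length - 1.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def reference_end(pos, cigar):
    """Return the 1-based, inclusive rightmost reference coordinate of an alignment."""
    if cigar == "*":
        raise ValueError("CIGAR unavailable")
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    return pos + ref_len - 1

simple_end = reference_end(100, "76M")
# 10M + 5M + 3D + 20M consume 38 reference bases; 2I and 5S consume none.
gapped_end = reference_end(100, "10M2I5M3D20M5S")
```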

ADD REPLYlink modified 21 months ago • written 21 months ago by Istvan Albert ♦♦ 39k

Phew... I'm glad I only have matches and mismatches, so I fall in the easy category :-) Thanks a lot for adding this information, this can be a big trap!

ADD REPLYlink written 21 months ago by Leonor Palmeira3.2k

I have a Perl parser that will change the read length based on the CIGAR if you ever want / need it.

ADD REPLYlink written 21 months ago by Zev.Kronenberg7.5k

Thanks a lot! I'll keep that in mind! Or maybe you can share it here as a Tool? or as an answer to this thread : http://biostars.org/post/show/41951/mapping-reads-with-bwa-and-bowtie/ ?

ADD REPLYlink written 20 months ago by Leonor Palmeira3.2k
2
2.0 years ago by
bw.60
San Francisco
bw.60 wrote:

Running the bwa/GATK pipeline with a corrupt/incompletely generated bwa index of hg19. Everything still aligned, but one of 2 mates would have its strand set incorrectly. Other than the insert size distribution, everything seemed normal, until the TableRecalibration step downshifted all quality scores significantly and then UnifiedGenotyper called 0 SNPs. 1st time I've seen a problem with step 1 of a pipeline not become obvious until step 5+.

ADD COMMENTlink written 2.0 years ago by bw.60
2
23 months ago by
Niek De Klein2.0k
Netherlands
Niek De Klein2.0k wrote:

Scripting an hour to do something you could have done in half an hour manually, and then never needing to repeat it again.

ADD COMMENTlink written 23 months ago by Niek De Klein2.0k
6

I tend to see the opposite: not spending the time up-front to do it right and having to continue to do it manually/semi-manually ad nauseam

ADD REPLYlink written 23 months ago by brentp17k
1

but that one is in here already

ADD REPLYlink written 22 months ago by Niek De Klein2.0k
2
12 months ago by
Oakland, CA
girlwithglasses240 wrote:

Creating a new set of (public) identifiers for your database for entities that already have identifiers in a widely used public db.

Having unstable identifiers that are publicly available.

ADD COMMENTlink written 12 months ago by girlwithglasses240
1
3.1 years ago by
Maastricht, the Netherlands
Andra Waagmeester3.0k wrote:

What about building an interface not aimed at a community of (fellow) Biologists?

ADD COMMENTlink written 3.1 years ago by Andra Waagmeester3.0k
1

that might be stupid and it might be a mistake but it isn't a "stupid mistake". see comments under Chris Evelo's post.

ADD REPLYlink written 3.1 years ago by Jeremy Leipzig12k

Then I will conclude with the same comment as Chris.

ADD REPLYlink written 3.1 years ago by Andra Waagmeester3.0k
1
3.0 years ago by
Russh1.1k
U. Liverpool
Russh1.1k wrote:

s/foo/bar/g without the g at the end as I just proved in the field

ADD COMMENTlink written 3.0 years ago by Russh1.1k

I have an opposite mistake. I accidentally replace a string in the whole document and something I didn't want to replace gets replaced. Now I visually select the area where I want to replace...

ADD REPLYlink written 2.9 years ago by Sequencegeek640
1
2.9 years ago by
Gww2.4k
Canada
Gww2.4k wrote:

Making claims without experimental validation. Especially involving studies utilizing multiplexed technologies such as microarrays and high-throughput sequencing.

ADD COMMENTlink written 2.9 years ago by Gww2.4k
1
2.0 years ago by
Sukhdeep Singh4.6k
Germany
Sukhdeep Singh4.6k wrote:

My common one cat aVeryBigFile.whatever &

Can't even stop it now without closing the terminal

Cheers

ADD COMMENTlink written 2.0 years ago by Sukhdeep Singh4.6k
4

can't you just 'fg' then ctrl-c ?

ADD REPLYlink written 2.0 years ago by brentp17k
1

@brentp, @Daniel cool guys, now this mistake can be rectified :)

ADD REPLYlink written 2.0 years ago by Sukhdeep Singh4.6k

just kill -9 $PID from another session, no?

ADD REPLYlink written 2.0 years ago by Simon Cockell6.6k

Good thought but, on a server (I should have told earlier, I commit this on server sessions), your process id's in one login are not known in other login/session.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Sukhdeep Singh4.6k
6

I'm pretty sure that ps aux | grep cat from another session will give you the PID of any running cat process on the server.

ADD REPLYlink written 2.0 years ago by Daniel Standage3.0k

if you don't have other cat job running, you can also type killall cat. Even if you don't see your input, it should work.

ADD REPLYlink modified 11 months ago • written 11 months ago by Manu Prestat2.8k
1
22 months ago by
kajendiran5680
kajendiran5680 wrote:

Some really great comments here, nice to know that such things happen to all genii ;). I have to say my most painful moments relate to my assumption that data obtained elsewhere is correct in every way. I also remember early in my career, using PDB files and realising that sometimes, chains are represented more than once, thus when manually checking calculations involving atomic coordinates, being utterly perplexed and wanting to break my computer. Oh the joys of Bioinformatics.

ADD COMMENTlink written 22 months ago by kajendiran5680
1
18 months ago by
Ryan Thompson2.4k
TSRI, La Jolla, CA
Ryan Thompson2.4k wrote:

Using a statistical test on data that does not satisfy the assumptions of that test.

For example, finding differentially-expressed genes by doing ANOVA on log2(FPKM).

ADD COMMENTlink written 18 months ago by Ryan Thompson2.4k
1
17 months ago by
Ryan Thompson2.4k
TSRI, La Jolla, CA
Ryan Thompson2.4k wrote:

Assuming that the gene IDs in "knownGenes.gtf" from UCSC are actually gene IDs. Instead they just put the transcript ID as the gene ID.

This just caused me a bit of pain when doing read counting at the gene level. Basically, any constitutive exon in a gene with multiple splice forms was ignored because all the reads in that exon were treated as ambiguous.

ADD COMMENTlink written 17 months ago by Ryan Thompson2.4k
1
17 months ago by
axelwilhelm30
axelwilhelm30 wrote:

Using the same output (> oh.shit) for multiple commands.

ADD COMMENTlink written 17 months ago by axelwilhelm30
0
22 months ago by
brentp17k
Denver, Colorado
brentp17k wrote:

I wouldn't say it's stupid , but I think a very common mistake is to not correct for batch effects in high-throughput data.

Batch effects can (best-case) hide the real effect that you're looking for, or (worst-case) make it look like your variable of interest is contributing to your findings when it's actually an artifact.

Leek + Irizarry et al. have a sobering review on this here.

ADD COMMENTlink written 22 months ago by brentp17k
0
22 months ago by
Rm6.5k
US
Rm6.5k wrote:

Just read this article: "How Not To Be A Bioinformatician" Thought it would be interesting to post here....

ADD COMMENTlink written 22 months ago by Rm6.5k
0
11 months ago by
Leszek2.9k
Barcelona, Spain
Leszek2.9k wrote:

I found myself guilty of iterating through a loop and storing data, let's say, every 100 iterations... but not storing the very last bit of the data (i.e. lines 10001 to 10026) at the very end.
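
A minimal Python sketch of the pattern with the final flush in place, using the same numbers; the function name `write_in_batches` and the `sink` callback are made up for illustration:

```python
def write_in_batches(records, batch_size, sink):
    """Pass records to sink in lists of batch_size, including the final partial one."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            sink(batch)
            batch = []
    # The step that is easy to forget: flush whatever is left after the loop.
    if batch:
        sink(batch)

batches = []
write_in_batches(range(1, 10027), 100, batches.append)
```

With 10026 records and a batch size of 100, the loop alone stores 100 full batches; only the flush after the loop saves records 10001 to 10026.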

ADD COMMENTlink written 11 months ago by Leszek2.9k
0
11 months ago by
Manu Prestat2.8k
Berkeley
Manu Prestat2.8k wrote:

A double mistake combo: 1 - using tar to compress a single file, and 2 - inverting the command arguments

tar cvfz file file.tgz

instead of

tar cvfz file.tgz file

Bye bye file!

It happened to me so many times that I was considering getting a brain imaging check-up.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Manu Prestat2.8k