Question

Bowtie Inaccuracy Limit

0

12.1 years ago

Arpssss ▴ 40

From the Bowtie paper, I understand that it can find both exact and inexact matches. From the Bowtie manual, I learned how to build an index for a genomic database, so I built one with the command:

bowtie-build hg19.fa hg19

Now I want to align a query read file named "a493081_1.fastq" to find exact and inexact matches (allowing 1, 2, and 3 substitutions, as specified in the Bowtie paper) for reads of length 150 bp.

So I issued the command

./bowtie --all -v 0 hg19 a493081_1.fastq a.txt

to find all alignments with 0 mismatches. Bowtie outputs:

# reads processed: 200000
# reads with at least one reported alignment: 145692 (72.85%)
# reads that failed to align: 54308 (27.15%)
Reported 173932 alignments to 1 output stream(s)

However, all reads are taken from hg19, so Bowtie should report no reads that failed to align. I know Bowtie's matching is not perfectly accurate, but a failure rate of nearly 30% is far higher than what I have seen in various comparisons. Can anybody explain what could cause this inaccuracy, or suggest a way to make the alignment more accurate?

Additional: I should mention that my FASTQ file contains 150 bp single-end reads.

bowtie bowtie2 genome • 3.6k views

ADD COMMENT • link updated 12.1 years ago by Istvan Albert 101k • written 12.1 years ago by Arpssss ▴ 40

0

Can anybody tell me whether an error rate of 30% is possible for Bowtie 1, or am I making a mistake?

ADD REPLY • link 12.1 years ago by Arpssss ▴ 40

0

Show us a read that did not align.

ADD REPLY • link 12.1 years ago by Jeremy Leipzig 22k
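One way to pull out such a read is to compare the read names in the FASTQ file against the names that appear in the alignment output. The sketch below is only an illustration: it assumes a plain 4-line-per-record FASTQ and Bowtie 1's default tab-separated output, where the first field is the read name. The helper names are hypothetical, and both sides are trimmed to the first whitespace-delimited token so the comparison does not depend on how the header comment is handled.

```python
def read_names_from_fastq(lines):
    """Yield read names from FASTQ text, assuming exactly 4 lines per record."""
    for i, line in enumerate(lines):
        # Only the first line of each 4-line record is a header; a quality
        # line may also start with '@', so the position check matters.
        if i % 4 == 0 and line.startswith("@"):
            yield line[1:].split()[0]


def aligned_names_from_bowtie(lines):
    """Collect read names from alignment lines (first tab-separated field)."""
    names = set()
    for line in lines:
        if line.strip():
            names.add(line.split("\t")[0].split()[0])
    return names


def unaligned_reads(fastq_lines, bowtie_lines):
    """Return FASTQ read names that never appear in the alignment output."""
    aligned = aligned_names_from_bowtie(bowtie_lines)
    return [n for n in read_names_from_fastq(fastq_lines) if n not in aligned]


# Tiny in-memory example with made-up records (not data from this thread).
demo_fastq = ["@r1 extra", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "IIII"]
demo_bowtie = ["r1\t+\tchr1\t100\tACGT\tIIII\t0\t"]
print(unaligned_reads(demo_fastq, demo_bowtie))
```

With real files, the same functions can be fed open file handles for "a493081_1.fastq" and "a.txt". The Bowtie 1 manual also documents a `--un <filename>` option that writes unaligned reads to a file directly, which avoids the comparison altogether.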

0

Try using Bowtie 2. The documentation says that Bowtie 1 was developed with short reads in mind and that Bowtie 2 should perform much better with longer reads. See: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#how-is-bowtie-2-different-from-bowtie-1

ADD REPLY • link 12.1 years ago by Fidel ★ 2.0k

score 2 · Answer 1 · 2012-06-23

Your data is most likely flawed in some manner. Given the heuristic nature of high-throughput aligners, we cannot expect to map back all reads, even if they were simulated from the target reference genome. On the other hand, an error rate of 30% would be excessive and, frankly, would make the tool unusable for most purposes. That alone indicates that you are misusing either the data or the aligner.
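As a rough sanity check on how strict `-v 0` is for 150 bp reads, here is a back-of-the-envelope sketch. It rests on an assumption that is not stated in the thread: that the reads carry independent per-base errors at a uniform rate `e`. Under that assumption, the chance that a read of length `n` has no mismatches is `(1 - e) ** n`. The function names are hypothetical, and this is not a diagnosis of the actual data.

```python
def fraction_error_free(read_length, per_base_error):
    """Probability that a read has zero errors, assuming independent,
    uniform per-base errors (an illustrative assumption)."""
    return (1.0 - per_base_error) ** read_length


def implied_error_rate(read_length, exact_match_fraction):
    """Per-base error rate that would produce the given exact-match
    fraction under the same independence assumption."""
    return 1.0 - exact_match_fraction ** (1.0 / read_length)


# Hypothetical error rates, chosen only to show the trend for 150 bp reads.
for e in (0.0005, 0.001, 0.002, 0.005):
    print(f"e = {e}: {fraction_error_free(150, e):.1%} of reads error-free")

# The rate that would correspond to the 72.85% reported in the question.
print(f"implied per-base error rate: {implied_error_rate(150, 0.7285):.4%}")
```

Even a per-base error rate of a fraction of a percent leaves a sizable share of long reads with at least one mismatch, so it may be worth checking whether the reads are truly error-free copies of hg19 before concluding that the aligner is at fault.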

At the same time note that there is more to accuracy than simply accepting a reported match. One should also verify that the match is indeed a true positive. For a more thorough comparison of the accuracy of several mappers see Heng Li's ROC curves at:

http://lh3lh3.users.sourceforge.net/alnROC.shtml