Forum:Attention: Bowtie2 And Multiple Hits
4
29
Entering edit mode
8.6 years ago

We just tested Bowtie2 for reads mapping multiple times in a genome:

1 . We took one read with multiple (perfect) mapping loci in the genome (hg19):

FASTQ entry:

@test_sequence
CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII


UCSC - BLAT:

browser details YourSeq          101     1   101   101 100.0%     2   -  114359280 114359380    101
browser details YourSeq          101     1   101   101 100.0%    15   -  102519435 102519535    101
browser details YourSeq          101     1   101   101 100.0%     1   +      11635     11735    101


2 . We created a fastq file containing 1,000 times this entry.

3 . We started bowtie2 with the following command:

bowtie2 -x hg19 -U test.fq -S test.sam


4 . bowtie2 claims in its manual:

By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied).


5 . BUT:

=> For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output

=> Using default parameters, bowtie2 does NOT select the reported mapping randomly

Why is that a problem?

Two examples:

1. Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?
2. The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?

TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!

bowtie next-gen mapping Forum • 23k views
2
Entering edit mode

that's pretty scary. i assume it's the same if you set the seed?

0
Entering edit mode

What do you mean by "setting the seed"?

0
Entering edit mode

--seed <int> Use <int> as the seed for pseudo-random number generator. Default: 0. Use any number other than zero - what happens?

0
Entering edit mode

we did not try that, since the we wanted to use the default parameters.

3
Entering edit mode

We just checked it and when setting --seed, the locus switches to another one, but it is still only one single locus for all 1,000 reads! ;)

3
Entering edit mode

When you write code that requires a random number selection you typically must provide a seed value to start the pseudo-random number generation. Running the algorithm with the same seed will produce the same "random" output. The time when the program was run (expressed as an integer) is a typical choice of seed. Apparently Bowtie does not modify the seed value between reads once you start alignments, so the same "randomly selected" locus is chosen in your pathological (but not unreasonable) test data.

0
Entering edit mode

I wonder if it actually thinks chr1 position is the 'best' position for some odd reason. Is it an error in the randomly selection process or an error in finding best match.

3
Entering edit mode

If he's changing the random seed and getting a different location (see above), then it doesn't sound like one match is scoring better than the others.

0
Entering edit mode

scary bias that could throw off a lot of downstream expression analysis specially for genomes with high repeat content.

7
Entering edit mode
8.5 years ago

Here a small update from the mirrored discussion at seqanswers.com

fkrueger:

As of yesterday night a new version of Bowtie 2 has been released (Bowtie 2 version 2.0.2) which adds an option --non-deterministic:

 "Added --non-deterministic option, which better matches how some users
expect the pseudo-random generator inside Bowtie 2 to work.  Normally, if
you give the same read (same name, sequence and qualities) and --seed, you
get the same answer.  --non-deterministic breaks that guarantee.  This can
be more appropriate for datasets where the input contains many identical
reads (same name, same sequence, same qualities)."


Does this option address the multiple mapping issue reported by original poster?

Me:

But, in my opinion, the problem still exists.

When I run bowtie2 with default parameters, I still get one non-random hit for all reads. All tools that use bowtie2 (e.g. tophat2) for mapping, do not use --non-deterministic, or -a/-k parameter. So, downstream analysis will still be affected.

Why not set it as default parameter? Time complexity?

Me:

We now checked the behavior of the new version of bowtie2 (v.2.0.2) with our testset from above (3 equivalent mapping loci):

1 . (1) no --non-deterministic option + (2) same name for all reads:

• 100% of all reads map to one locus.

2 . (1) no --non-deterministic option + (2) different name for all reads:

• 99% of all reads map to one locus.
• 1% of the reads map to on of the other loci.

3 . (1) --non-deterministic option + (2) same name for all reads:

• 65% of the reads map to locus1
• 19% of the reads map to locus2
• 16% of the reads map to locus3

4 . (1) --non-deterministic option + (2) different name for all reads:

• same result as in 3.
3
Entering edit mode
8.6 years ago

Some ideas:

2) Are you sure that the reference build and index is complete, and that all chromosomes are present? Do you get the same results when using one of their pre-built indices?

3) what happens if you set "-k 2"? Does that return two mapping locations, as you expect?

4) have you contacted the developers to see if there's an explanation? If this truly is a bug, they will definitely want to know about it.

0
Entering edit mode

2) and 3) We get all three loci when setting -k 10 or -a (together with suboptimal hits)

4) We did not contact the developers

2
Entering edit mode

I'd encourage you to report the bug here, so they can either suggest what you might be doing wrong, or fix the issue in the next release: http://sourceforge.net/tracker/?func=add&group_id=236897&atid=1101606

0
Entering edit mode

I just reported the bug.

1
Entering edit mode
8.6 years ago

There are also several widely used tools that map the reads with bowtie1/2. It would be nice to check they use the -k paramter and if not, any bias/error is added to their results.

To name some of these tools:

0
Entering edit mode

I don't know if it has always read this way, but the bowtie manual states: "This randomness flows from a simple seeded pseudo-random number generator and is deterministic in the sense that Bowtie will always produce the same results for the same read when run with the same initial "seed" value (see --seed option)." So your test case, where you have the same read 1000 times, is going to produce the same results by design. You may not think of the reads as the same, but perhaps their internal implementation notices they are identical (do they have different read names or the same?).

1
Entering edit mode

They all have different names (@test_sequence_n). But even if bowtie2 was designed like that, it would not change the message, right? I just think that it is important to know what these mapping tools are really doing in order to do a correct downstream analysis.

0
Entering edit mode
8.3 years ago
dsbreak ▴ 150

From the description of the --non-deterministic option, it sounds like the pseudo-random generator only changes when there is a significant change in the name+sequence+qualities string. For your simulated 101 nt reads with identical quality scores, appending the n (1..1000) to @testsequence probably doesn't change it much (consistent with your Case #2). Have you tried changing your name+sequence+qualities more significantly? Even for putative "PCR duplicates", the .FASTQ data that we tend to analyze differs not only in the quality string but also in the name (tile, x-coordinate, and y-coordinate).

1
Entering edit mode

Of course you can change all the names and qualities more significantly. And I will not start a discussion about the similarity of the qualities of the same illumina sequences.... The message of the complete discussion is just that there is a bias when running bowtie2 with default parameters. Of course you can overcome that problem when changing the parameters or the names of the input reads. Don't get me wrong here! I think bowtie2 is a good mapping tool. Nevertheless, for short read data I would use the -a paramter, which is not set by default. And, in my opinion, the statement that the mapping location is 'randomly selected' is wrong, since it is only random selected, if you have significantly different names, which they cannot assume by default. I just wanna say: ATTENTION!