Question: Mapping With Second Best Hits
0
gravatar for Walter
6.8 years ago by
Walter0
Walter0 wrote:

I'm having some trouble with repetitively mapping Illumina data. I'd like to remove any reads that have a reasonably close second-best hit in the genome/transcriptome, so if the best hit has an edit distance of n, I'd like to remove it if the second-best hit is within n+x (say n < 5, x < 2).

For historical/collaboration reasons, we've been using bwa, which I know is designed to find second-best hits but not necessarily all hits up to a given edit distance. I believe that the bwa aln -N option is supposed to find all (more or less) hits up to a given edit distance (without the -N option, it appears to only find second-best hits with up to a given number of mismatches, not including indels). I'm getting garbage for the hits in the XA tag, eg:

readID1    16    chr5    67488388    20    101M    *    0    0    ACAGAGGAACCTTCTTAGTGCTCACCGACCTTATACCCAGCTATGGGGACCAGCTCCAGAAAGCACCTTTGCTCCCAATTGGTTAGCTATAGTAAGGTGCG    DDDDDDDC?DCDDDDDDDDDDBDDDDFHHHHHHG@JJJJJJIIHJJJJIJJJJJIJJJJJJJIHGJJJJJIIGJJJJJJJJJIJJJJIHHHHHFFFFF@<C    XT:A:U    NM:i:1    X0:i:1    X1:i:2    XM:i:1    XO:i:0    XG:i:0    MD:Z:43G57    XA:Z:chr5,-67488388,101M,6;chr5,-67488390,2S99M,6;

Where you'll note that the start coordinates are identical or very similar for all three hits. For reference, this is how I'm calling bwa:

bwa aln -b -O2 -n 6 -t 12 -N ref.fa input.bam | bwa samse ref.fa - input.bam | samtools view -h -b -S -o remapped_N.bam -

I've also tried bowtie2 (very slow) and masai (very memory intensive) for mapping. A few questions:

  1. Am I doing something wrong in how I'm calling bwa?

  2. Does anyone have a better mapper for my particular situation? (I'm actually mapping transcriptome data against a genome including an artificial transcriptome as the rna-seq specific mappers don't seem to do well with finding possible second-best hits, but I'm open to suggestions.)

PS I've looked at this post: For short reads, which aligners find all hits given certain edit distance threshold? but I don't think any of the output was checked for accuracy nor speed.

mapping bwa • 2.4k views
ADD COMMENTlink modified 6.8 years ago by Istvan Albert ♦♦ 84k • written 6.8 years ago by Walter0

Have you tried mapping quality cutoffs to achieve what you want? It's not exactly the same as an edit distance cutoff, since it depends on per-base quality scores, but it seems like it might help you if your goal is finding things that are "very unique".

And any reason the gap open penalty (-O) is so low? I haven't thought about it too much, but there's a chance that could be a reason why you're getting some of the "garbage hits".

ADD REPLYlink written 6.8 years ago by matted7.3k

also, I get the exact same output when I set -O to 5 instead of 2

ADD REPLYlink written 6.8 years ago by Walter0

As I mentioned before, I'm concerned with indels as well as with mismatches. I'm pretty sure that bwa ignores indels in determining the mapping quality. If possible, I would like to know where those second-best hits are, which should be in the XA tag.

ADD REPLYlink written 6.8 years ago by Walter0
1
gravatar for JacobS
6.8 years ago by
JacobS920
Cleveland, Ohio
JacobS920 wrote:

Just a thought... I've done studies that sought artificial gene duplication in large genome assemblies by using Bowtie2 and parsing the output .SAM file to find reads that mapped with high scores to more than one location. If you have paired-end reads, this is especially revealing, since the significance of the finding two mapping locations is increased when the R1 and R2 both map to each location as a proper pair.

When running your Bowtie2 command, use the -k <int> optional parameter. This tells Bowtie2 to report up to int best alignments per read pair (or single-end read). So, if you are looking for reads that indicate duplicated regions in the genome, I would map using Bowtie2 and -k 3. Then, parsing the .SAM file would tell me how many reads mapped uniquely (1 position), how many reads mapped to duplicates (2 positions), and how many reads were low complexity/poor scoring/too common for use (3 position, which include 3 or more places).

Find more in the doc for Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by JacobS920

my understanding was that results from the -k option are not necessarily in order of mapping quality and therefore I'm worried that I need to use the -a option -- which is super slow, although maybe not slower than other all-mappers given my combination of processors and memory.

ADD REPLYlink written 6.8 years ago by Walter0

eh, turns out using a high -k is the best I could come up with

ADD REPLYlink written 6.8 years ago by Walter0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 783 users visited in the last hour