Obtaining uniquely mapped reads from BWA mem alignment (filtering by q score does not seem to do the trick in my case)
2
5
Entering edit mode
5.3 years ago

Hello,

I am pretty new to bioinformatics and to this forum, and would very much appreciate any suggestions you could offer to help me with ways to filter my alignments (obtained using bwa mem) so I only have uniquely mapped reads, other than using q scores (which I have already done and doesn't solve my problem)? Here is a bit more explanation:

I am trying to obtain uniquely mapped reads from my bwa mem alignments to use in downstream analyses. I realize that "uniquely mapped reads" is a loaded term, and that most sources I have found on the topic suggest filtering by q score should do the trick. My problem is that I have tried filtering by progressively higher q scores and I still have a (very small) number of reads that appear to not be uniquely mapped. Because I am aligning my reads from each sample to 96 different loci (in bwa), I want to make sure my reads are only mapped to one locus.

For the one sample I am using to test out commands, I filtered my data so I only have properly paired reads and mapped reads (in the case of my unpaired reads) and also filtered by q score (I get nearly the same results at q of 10, 20, or 30, so it's not an issue of changing the q score level), merged all my bam files, and obtained 193,998 reads total for that sample. When I open the bam file in Geneious and count the number of reads for each locus they add up to 194, 050. I realize the difference between these two numbers is an incredibly small number of reads, and I am going to only use loci that have ~20 or 30 mapped reads for each sample (haven't yet decided on that threshold) so maybe this doesn't matter terribly much in the grand scheme, but I still would like to have these two numbers match so I am not incorrectly including loci that weren't actually sequenced for a particular sample.

I have also seen suggestions to filter by XT tags, but none of my files have this tag. I DO have the XS tag (most seem to be XS:i:0). My understanding of what XS:i:0 means is a bit shaky, but I think it means that the secondary alignment has a score of zero so these would be reads that only have one alignment? And maybe this could be a good tag to use for filtering?

So my question is this: Does anyone have any suggestions for other tags or filtering steps I can use for my data so that my reads only map to one location, other than filtering for q score (which I have already done and will do in addition to any other filtering steps)?

Here is the applicable portion of code that I am running on alignments obtained from bwa (sorted them by coordinate and converted them to bam already in a previous step). The first line is what I'm using on my paired-end file and the second line is what I am using on my unpaired read 1 file (using the same code for my unpaired read 2 file). Note that I used trimmomatic to remove low quality bases from my reads, which is why I have three different read files for each sample.

samtools view -q 10 -f 0x02 -b pe001_sorted.bam > pe001_sorted_properlypaired.bam
samtools view -q 10 -F 0x04 -b unpaired1_sorted.bam > unpaired1_sorted_mapped.bam


next-gen alignment bwa • 11k views
2
Entering edit mode

I was unsure about the best place to comment on whether the grep commands provided on the other site worked, but I am sure some of you are curious. I would appreciate any insights you all have into comparing the results of each set of commands. Here is the command I tried (that was posted on the other site) to attempt to only have "uniquely mapped reads"

samtools view -h pe001_sorted.bam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -q 10 -F 3332 -f 2 -b > pe001_sorted_properlypaired.bam


The commands worked in the sense that they gave me 193,100 mapped reads in the flagstats file (approximately what I expected), but Geneious is STILL saying I have more sequences than the flagstat file says (it says 193,156 sequences)!!! At this point I'm wondering if this is some sort of quirk with Geneious and will reach out to their help and ask if they have run across this particular problem. Additionally, the above commands cut out about 800 reads for this one sample as compared to the commands I used to filter by MAPQ and properly paired/mapped reads, so I am not sure I should go the grep route anyway.

The commands I used in which I just filtered by MAPQ scores and properly paired/mapped reads are:

samtools view -q 10 -F 3332 -f 2 -b pe001_sorted.bam > pe001_sorted_properlypaired.bam


The above commands gave me 193, 957 in flagstats and then 194,007 in Geneious (a difference of 50 as opposed to a difference of 46 using the other commands).

Any thoughts on which method you would use/which method makes sense? I don't want to cut out reads unnecessarily, and it seems as though the vast majority of the bioinformatics community just filters using MAPQ scores and does not really attempt to obtain "truly uniquely mapped reads". However, since I am mapping to multiple genomic regions I really don't want to incorrectly include loci that weren't actually mapped to.

I was hoping to open all of my bam files in Geneious and then group the consensus for each sample for each locus into a folder (so each folder would have a locus), and then align all the consensus sequences in a folder, then export the alignment for each locus for use in R. I am positive there is a bioinformatics solution that I could do rather than using Geneious, but maybe I should ask it in another question:)

1
Entering edit mode

I'm glad you caught our 3332 vs. 1284 mistake! :)

I'm not a fan of the grep route; I've long been an advocate for "filter by MAPQ and be done with it".

I encourage you to ask about the analysis you were planning in Geneious in another post. Please include what your goal is with the per-sample alignments, since that could be important.

1
Entering edit mode

Your issue is likely just supplemental alignments (use samtools view -cF 1284 -q 1 file.bam and see if that better matches what genious says. A question arises whether you want reads that are trully only reasonably mappable to one locus or are simply very likely only mapped to one locus. Filtering by MAPQ is giving you the latter, you'll need the grep -v command linked to by bioinformatics for the former.

0
Entering edit mode

I don't really understand the different between "only reasonably mapped to one locus" and "very likely only map to one locus" Devon - i'm not that advanced at SAM yet :P Just when i think i finally get it, i learn something new! Is it because the MAPQ can still be high if there's a poor second alignment, but with a special bwa tag reads can also be labelled as totally, positively, unique?

I don't really see where chimeric alignments come into this question either, re. the SA tag.

Man SAM is so complicated -_-

1
Entering edit mode

Yes, at the end of the day the question is "what's really unique?" BWA (and bowtie2 and most other aligners) can give a decently high MAPQ to an alignment even if there are other possible alignments (according to the parameters used by the aligner for defining this, since if you use a loose enough definition then everything becomes a possible alignment). This happens when the other alignments have a score significantly below that of the returned/preferred alignment. To me this seems quite reasonable, but from the reply elsewhere it can become ambiguous if one is searching for "very likely correct" (i.e., high MAPQ and forget the rest) alignments or only those where the aligner can't find anything else that's valid (i.e., grep -v the XA auxiliary tag and ignore the fact that what will go in there can change if you change the bwa settings).

Regarding the SA tag, my guess is that the count difference from BWA is simply due to it counting those rather than ignoring them.

0
Entering edit mode

Ah ok, i follow - thanks man :)

0
Entering edit mode

Thank you for your reply! This is giving me quite a bit to think about. It seems reasonable to me to filter using MAPQ and be ok with accepting that reads are "very likely only mapped to one locus" rather than "truly only mappable to one location". Particularly because it appears that for this one sample at least, there are only between 50-60 reads out of ~200,000 that are mapped to multiple locations. At least that is how I am interpreting the fact that flagstat tells me there are 193,998 reads but Geneious says 194,050.

I do have one follow-up question for you though. For the command you posted, what is the -F 1284? I know I have come across that at some point in trying to answer my question, but now I can't find where I read it.

Additionally, would I still need to include the -F 0x04 filter to remove the unmapped reads from the unpaired files and the -F 0x04 and -f 0x02 filters to remove unmapped reads and only include properly paired reads for my paired end files?

I have tried running some variations of these commands but am not having much luck. I suspect it's because I don't fully understand what the -F 1284 is doing.

Thank you again for helping me solve this problem!

1
Entering edit mode

Hello bioinfo

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/508/obtaining-uniquely-mapped-reads-from-bwa-mem-alignment/519

This is typically not recommended as it runs the risk of annoying people in both communities.

I acknowledge crossposting by bioinfor, not the original poster. Anyway, please do not just screen-scrape new question, because it triggers cross posting effects, irrespective of who actually did the cross posting.

0
Entering edit mode

I don't think she did it herself, I think someone else did. We're going to run into this a lot in the next few weeks, so i think we should have a little patience with the SE guys and just say they can link to our stuff, and they can copy our stuff, but they can't dump their noise in our signal :-)

1
Entering edit mode

I can't help but think this is a weird method to boost your post-count and attract users.

1
Entering edit mode

Yup. But so long as they attribute Biostars, and we figure out the correct answer by talking to betsy, then it's win-win. That's kind of the reason SE is doomed to fail in my opinion (but i'll certainly let them try!). The only person who know's what the real problem will be betsy when she figures it out. You can't solve questions about data by sitting in front of a log fire swirling your brandy and talking about the importance of the SA tag. You just have to work with the user and get to the bottom of their specific problem. That's what Biostars does best, and while i'm sure it upsets those (like me) on the autism spectrum that there's no 1 correct answer for every 1 well-defined question, that's the nature of data science.

1
Entering edit mode
0
Entering edit mode

I've posted this question to the Bioinformatics Stack Exchange (see here), because it's a well-formatted, specific question with a good chunk of research behind it.

2
Entering edit mode

Why not answer it here? Why cross-post someones else question to another site and answer it there?

0
Entering edit mode

I don't feel comfortable answering questions here, and I was encouraged to cross-post on the Stack Exchange Meta. See the discussion about that here:

https://bioinformatics.meta.stackexchange.com/a/43

1
Entering edit mode

I was encouraged to cross-post on the Stack Exchange Meta

I think this is only encouraged when the site is in beta. When it is fully active people will no longer tolerate cross-posting.

1
Entering edit mode

Even in beta, it was apparently not a good idea to encourage it / me. My post angered Michael Dondrup enough that he put out a "tut, tut" post on the matter:

edit: got the name wrong

3
Entering edit mode

That's actually Michael Dondrup, Robert just edited the tags. Having said that, I agree that this should be avoided.

2
Entering edit mode

How is this the accepted answer o_O

2
Entering edit mode

I don't understand why, and would rather not have a legacy of "cross-post" statements being considered answers... I guess this was my fault for posting this as an answer rather than a comment to the question.

0
Entering edit mode

That is absolutely not a problem, that's what mods are for.

1
Entering edit mode

It's good to keep us focussed and avoids that we do real work :-)

1
Entering edit mode

It is no longer, and it is not an answer! Moved to a comment.

1
Entering edit mode

I had moved it from an answer to a comment, then saw it was actually answered on the other site, so moved it back and bumped it to the top so it'd be more readily seen as "having been answered" so others didn't expand unnecessary effort on it. If this happens frequently then we should come up with a best practice.

0
Entering edit mode

I should add that I'm not entirely sure the answer on bioinfo.SE is correct, so I was probably hasty in bumping the answer to the top (mea culpa!).

0
Entering edit mode

I'm not entirely sure I need to go down the grep wormhole. I haven't tried the commands posted on the other site yet, but they appear to answer my exact question (i.e., get rid of reads that are not "truly mappable to only one location"). However, I'm now doubting whether I really need to get the number of reads in my flagstat file for my merged bam to match exactly with the number of reads shown in Geneious. I don't know how this is typically dealt with in other multi-locus studies similar to what I am doing.

What I really want to avoid is including genomic regions in my study that were not actually amplified and sequenced because of an error in the alignment software. However, I think the risk of this is very low because I plan to only include a locus for a particular sample if it has more than 20 or 30 reads. In addition to that, I am using the consensus sequence from all the reads that mapped to a locus for a particular sample. I am just not sure if that fact I don't have only "uniquely mapped reads" (however flimsy of a concept that is) will prevent the work from being published.

0
Entering edit mode

You can usually restrict your analysis to only the regions of interest, so I wouldn't worry about a small amount of off-target alignment. No one is going to complain if you just say you filtered low quality reads (<5 or so).

5
Entering edit mode
5.3 years ago
John 13k

The XT tag was used by early versions of bwa but not bwa-mem, which is like the new version for all intents and purposes. The XT tag is mainly seen these days as the tag for marking adapter sequences in the GATK best practices workflow.

Any tag that starts with an "X", "Y", or "Z" are user-defined, which means it's going to be unique to the tool that put them in the file. It's a real shame that tags have this 2-character limitation and can't just say what it is that they do, but meh. Bioinformatics amiright.

The issue your having is most likely due to a totally non-intuitive aspect of the SAM format, where data is allowed to be wrong provided you can figure out that it's wrong. For example, a read can have "mapped in a proper pair" set, but if it's also unmapped, then you are supposed to recognise that the "mapped in a proper pair == true" is actually false.

So long story short, you can't use 0x02 on it's own. You have to say "i don't want unmapped reads, and i do want properly paired reads" which is -F 0x04 -f 0x02. As Devon points out you probably also want to get rid of secondary alignments, so the grand total is:

samtools view -q 10 -F 1284 -f 0x02 -b pe001_sorted.bam > pe001_sorted_properlypaired.bam

1
Entering edit mode

Thank you for your reply! I am currently trying out these commands and wondered if you could clarify what the 1284 means? I know I have seen this explained somewhere but now I can't find that post or manual that I read about that in. I'm trying to understand if I would need to include a similar command for the unpaired files I have as well. Currently I have just been doing:

samtools view -q 10 -F 0x04 -b pe001_unpaired1_sorted.bam > pe001_sorted_mapped_unpaired1.bam


I was aiming to include only the mapped reads in each of the unpaired files and then combine the paired end file (with hopefully properly paired and mapped reads) and two unpaired files to get my final, merged bam file.

2
Entering edit mode

Yeah you add the numbers together for the ones you want to keep (-f), and the ones you want to remove (-F). A good tool to figure this out is Explain SAM flags. I don't remember any of these numbers off by heart, i use that tool to work it out each time. The "0x..." notation you were using before is hexidecimal (base 16), which happens to look like base 10 decimal for low numbers like 2 and 4, but yeah i'd avoid that and just use plain base 10 numbers like '1284' , not it's hexidecimal equivilent '0x504'.

Yeah if you're new to bioinformatics you've really thrown yourself into the deep end with this -_- haha it gets better though :)

EDIT: Also, i'm probably wrong about this all being due to not filtering out unmapped reads. It's probably the tag issue. Sorry :(

2
Entering edit mode

Ahhhhh, that explains it! I didn't really understand that it doesn't have to be written in the hexidecimal notation.

Good to know it gets better :) Definitely feel like I was thrown into the deep end.

And maybe I will force myself to at least try the grep commands on the other site to see if I can get it to work and compare the results to just filtering out low quality reads. I really want to move on to my downstream analyses, though!

1
Entering edit mode

1284 will filter out supplemental alignments (where parts of the read map to very distant regions of the genome, such as when you have structural variation or just a glitch in library prep), unmapped alignments, and secondary alignments.

0
Entering edit mode

I have a doubt in the command suggested by you. -f stands for only taking those reads. So if you have written -f 0x02 that means reads associated with all other flags will be discarded and only 0x02 associated reads will be considered. Then there is no need of mentioning those flag whose associated reads we don't want to see i.e. -F 1284 because we have already discarded those flags (0x100,0x400,0x04). Can you justify why -F is required here. Thanks.

0
Entering edit mode

Flags are bitwise and, thus, can be summed. -f 2 only discards entries with bit 2 unset, regardless of the status of other flags.

1
Entering edit mode
5.3 years ago
mforde84 ★ 1.4k

Try this, or something along these lines:

samtools view -bq 1 file.bam > unique.bam

0
Entering edit mode

samtools view -f 0x02 -bq 1 file.bam > unique.bam


I am still getting about the same number of additional reads in Geneious. For example, the flagstat file is showing that I have 194,308 properly paired and mapped reads (for the paired-end reads and unpaired reads, respectively) but when I load the merged bam file into Geneious I am getting 194,360 reads total (so 52 reads total).

Do I need to take out the -f 0x02 portion of the command (the part that asks for only properly paired reads?). Can you explain a bit about the reasoning behind the command you provided?