I am pretty new to bioinformatics and to this forum, and would very much appreciate any suggestions you could offer to help me with ways to filter my alignments (obtained using bwa mem) so I only have uniquely mapped reads, other than using q scores (which I have already done and doesn't solve my problem)? Here is a bit more explanation:
I am trying to obtain uniquely mapped reads from my bwa mem alignments to use in downstream analyses. I realize that "uniquely mapped reads" is a loaded term, and that most sources I have found on the topic suggest filtering by q score should do the trick. My problem is that I have tried filtering by progressively higher q scores and I still have a (very small) number of reads that appear to not be uniquely mapped. Because I am aligning my reads from each sample to 96 different loci (in bwa), I want to make sure my reads are only mapped to one locus.
For the one sample I am using to test out commands, I filtered my data so I only have properly paired reads and mapped reads (in the case of my unpaired reads) and also filtered by q score (I get nearly the same results at q of 10, 20, or 30, so it's not an issue of changing the q score level), merged all my bam files, and obtained 193,998 reads total for that sample. When I open the bam file in Geneious and count the number of reads for each locus they add up to 194, 050. I realize the difference between these two numbers is an incredibly small number of reads, and I am going to only use loci that have ~20 or 30 mapped reads for each sample (haven't yet decided on that threshold) so maybe this doesn't matter terribly much in the grand scheme, but I still would like to have these two numbers match so I am not incorrectly including loci that weren't actually sequenced for a particular sample.
I have also seen suggestions to filter by XT tags, but none of my files have this tag. I DO have the XS tag (most seem to be XS:i:0). My understanding of what XS:i:0 means is a bit shaky, but I think it means that the secondary alignment has a score of zero so these would be reads that only have one alignment? And maybe this could be a good tag to use for filtering?
So my question is this: Does anyone have any suggestions for other tags or filtering steps I can use for my data so that my reads only map to one location, other than filtering for q score (which I have already done and will do in addition to any other filtering steps)?
Here is the applicable portion of code that I am running on alignments obtained from bwa (sorted them by coordinate and converted them to bam already in a previous step). The first line is what I'm using on my paired-end file and the second line is what I am using on my unpaired read 1 file (using the same code for my unpaired read 2 file). Note that I used trimmomatic to remove low quality bases from my reads, which is why I have three different read files for each sample.
samtools view -q 10 -f 0x02 -b pe001_sorted.bam > pe001_sorted_properlypaired.bam samtools view -q 10 -F 0x04 -b unpaired1_sorted.bam > unpaired1_sorted_mapped.bam
Thank you very much for any help you can offer!
I was unsure about the best place to comment on whether the grep commands provided on the other site worked, but I am sure some of you are curious. I would appreciate any insights you all have into comparing the results of each set of commands. Here is the command I tried (that was posted on the other site) to attempt to only have "uniquely mapped reads"
The commands worked in the sense that they gave me 193,100 mapped reads in the flagstats file (approximately what I expected), but Geneious is STILL saying I have more sequences than the flagstat file says (it says 193,156 sequences)!!! At this point I'm wondering if this is some sort of quirk with Geneious and will reach out to their help and ask if they have run across this particular problem. Additionally, the above commands cut out about 800 reads for this one sample as compared to the commands I used to filter by MAPQ and properly paired/mapped reads, so I am not sure I should go the grep route anyway.
The commands I used in which I just filtered by MAPQ scores and properly paired/mapped reads are:
The above commands gave me 193, 957 in flagstats and then 194,007 in Geneious (a difference of 50 as opposed to a difference of 46 using the other commands).
Any thoughts on which method you would use/which method makes sense? I don't want to cut out reads unnecessarily, and it seems as though the vast majority of the bioinformatics community just filters using MAPQ scores and does not really attempt to obtain "truly uniquely mapped reads". However, since I am mapping to multiple genomic regions I really don't want to incorrectly include loci that weren't actually mapped to.
I was hoping to open all of my bam files in Geneious and then group the consensus for each sample for each locus into a folder (so each folder would have a locus), and then align all the consensus sequences in a folder, then export the alignment for each locus for use in R. I am positive there is a bioinformatics solution that I could do rather than using Geneious, but maybe I should ask it in another question:)
I'm glad you caught our 3332 vs. 1284 mistake! :)
I'm not a fan of the
greproute; I've long been an advocate for "filter by MAPQ and be done with it".
I encourage you to ask about the analysis you were planning in Geneious in another post. Please include what your goal is with the per-sample alignments, since that could be important.
Your issue is likely just supplemental alignments (use
samtools view -cF 1284 -q 1 file.bamand see if that better matches what genious says. A question arises whether you want reads that are trully only reasonably mappable to one locus or are simply very likely only mapped to one locus. Filtering by MAPQ is giving you the latter, you'll need the
grep -vcommand linked to by bioinformatics for the former.
I don't really understand the different between "only reasonably mapped to one locus" and "very likely only map to one locus" Devon - i'm not that advanced at SAM yet :P Just when i think i finally get it, i learn something new! Is it because the MAPQ can still be high if there's a poor second alignment, but with a special bwa tag reads can also be labelled as totally, positively, unique?
I don't really see where chimeric alignments come into this question either, re. the SA tag.
Man SAM is so complicated -_-
Yes, at the end of the day the question is "what's really unique?" BWA (and bowtie2 and most other aligners) can give a decently high MAPQ to an alignment even if there are other possible alignments (according to the parameters used by the aligner for defining this, since if you use a loose enough definition then everything becomes a possible alignment). This happens when the other alignments have a score significantly below that of the returned/preferred alignment. To me this seems quite reasonable, but from the reply elsewhere it can become ambiguous if one is searching for "very likely correct" (i.e., high MAPQ and forget the rest) alignments or only those where the aligner can't find anything else that's valid (i.e.,
XAauxiliary tag and ignore the fact that what will go in there can change if you change the bwa settings).
Regarding the SA tag, my guess is that the count difference from BWA is simply due to it counting those rather than ignoring them.
Ah ok, i follow - thanks man :)
Thank you for your reply! This is giving me quite a bit to think about. It seems reasonable to me to filter using MAPQ and be ok with accepting that reads are "very likely only mapped to one locus" rather than "truly only mappable to one location". Particularly because it appears that for this one sample at least, there are only between 50-60 reads out of ~200,000 that are mapped to multiple locations. At least that is how I am interpreting the fact that flagstat tells me there are 193,998 reads but Geneious says 194,050.
I do have one follow-up question for you though. For the command you posted, what is the -F 1284? I know I have come across that at some point in trying to answer my question, but now I can't find where I read it.
Additionally, would I still need to include the -F 0x04 filter to remove the unmapped reads from the unpaired files and the -F 0x04 and -f 0x02 filters to remove unmapped reads and only include properly paired reads for my paired end files?
I have tried running some variations of these commands but am not having much luck. I suspect it's because I don't fully understand what the -F 1284 is doing.
Thank you again for helping me solve this problem!
It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/508/obtaining-uniquely-mapped-reads-from-bwa-mem-alignment/519
This is typically not recommended as it runs the risk of annoying people in both communities.
I acknowledge crossposting by bioinfor, not the original poster. Anyway, please do not just screen-scrape new question, because it triggers cross posting effects, irrespective of who actually did the cross posting.
I don't think she did it herself, I think someone else did. We're going to run into this a lot in the next few weeks, so i think we should have a little patience with the SE guys and just say they can link to our stuff, and they can copy our stuff, but they can't dump their noise in our signal :-)
I can't help but think this is a weird method to boost your post-count and attract users.
Yup. But so long as they attribute Biostars, and we figure out the correct answer by talking to betsy, then it's win-win. That's kind of the reason SE is doomed to fail in my opinion (but i'll certainly let them try!). The only person who know's what the real problem will be betsy when she figures it out. You can't solve questions about data by sitting in front of a log fire swirling your brandy and talking about the importance of the SA tag. You just have to work with the user and get to the bottom of their specific problem. That's what Biostars does best, and while i'm sure it upsets those (like me) on the autism spectrum that there's no 1 correct answer for every 1 well-defined question, that's the nature of data science.
See my response on bioinfo se meta: https://bioinformatics.meta.stackexchange.com/questions/78/please-stop-cross-posting-behavior
I've posted this question to the Bioinformatics Stack Exchange (see here), because it's a well-formatted, specific question with a good chunk of research behind it.
Why not answer it here? Why cross-post someones else question to another site and answer it there?
I don't feel comfortable answering questions here, and I was encouraged to cross-post on the Stack Exchange Meta. See the discussion about that here:
I think this is only encouraged when the site is in beta. When it is fully active people will no longer tolerate cross-posting.
Even in beta, it was apparently not a good idea to encourage it / me. My post angered Michael Dondrup enough that he put out a "tut, tut" post on the matter:
edit: got the name wrong
That's actually Michael Dondrup, Robert just edited the tags. Having said that, I agree that this should be avoided.
How is this the accepted answer o_O
I don't understand why, and would rather not have a legacy of "cross-post" statements being considered answers... I guess this was my fault for posting this as an answer rather than a comment to the question.
That is absolutely not a problem, that's what mods are for.
It's good to keep us focussed and avoids that we do real work :-)
It is no longer, and it is not an answer! Moved to a comment.
I had moved it from an answer to a comment, then saw it was actually answered on the other site, so moved it back and bumped it to the top so it'd be more readily seen as "having been answered" so others didn't expand unnecessary effort on it. If this happens frequently then we should come up with a best practice.
I should add that I'm not entirely sure the answer on bioinfo.SE is correct, so I was probably hasty in bumping the answer to the top (mea culpa!).
I'm not entirely sure I need to go down the grep wormhole. I haven't tried the commands posted on the other site yet, but they appear to answer my exact question (i.e., get rid of reads that are not "truly mappable to only one location"). However, I'm now doubting whether I really need to get the number of reads in my flagstat file for my merged bam to match exactly with the number of reads shown in Geneious. I don't know how this is typically dealt with in other multi-locus studies similar to what I am doing.
What I really want to avoid is including genomic regions in my study that were not actually amplified and sequenced because of an error in the alignment software. However, I think the risk of this is very low because I plan to only include a locus for a particular sample if it has more than 20 or 30 reads. In addition to that, I am using the consensus sequence from all the reads that mapped to a locus for a particular sample. I am just not sure if that fact I don't have only "uniquely mapped reads" (however flimsy of a concept that is) will prevent the work from being published.
You can usually restrict your analysis to only the regions of interest, so I wouldn't worry about a small amount of off-target alignment. No one is going to complain if you just say you filtered low quality reads (<5 or so).