I am aligning paired end reads to contigs from an assembly using BWA and it looks like I get different results if I use the contigs or their reverse complements. Does this make sense? Which result should I use since they are different?
Are you getting these differences only with multimappers? That'd be expected.
Why would that be the case, doesn't the aligner align both the read and it's reverse complement to the contig? If the forward sense read maps in multiple places wouldn't the reverse complement map to the same places?
Also - what is the best way to check this, I've been dumping the alignments to bam and I noticed the difference in a downstream application. When I check the bam file sizes I see a differnce of a few kilobytes, but I'm not sure how to find the specific alignments that are different.
Of course, but the randomly returned alignment for multimappers wouldn't be the same, so the result you're seeing would be different. There's no point in actually checking this unless what you're seeing is differences in reliable (MAPQ >= 1 at least) alignments, since otherwise you're just seeing the effect of various random seeds.
You should qualify and quantify what you mean by " different results". Later you talk about differences in BAM size, that's normal and expected. Information presented differently will compress at different rates. Do you see more meaningful differences?
Yes, downstream I am using the alignments of the reads to bridge between contigs in the assembly graph and have different results.
I was surprised to see that these results are different depending on whether I use the sense or anti-sense sequence but so far the only difference I've been able to track down is that the bam file (and bam index and sorted bam file) sizes are different.
I'm not sure how to track down individual alignments that are different, but am surprised that bwa isn't agnostic to sense.
You say "assembly graphs" - bwa does not produce that - that sounds like a different tool producing a different output. Perhaps the tool producing your assembly graph caused the problem.
As I said before you need to first quantify the "difference" or at least have some evidence that the alignments are different. Sizes are not proof. I very much doubt that your alignments will be different in a systematic way when reverse complementing the reference sequences.
@Istvan - no, the other parts of the pipeline are deterministic:
First I assemble the reads, then, because the fastg format for assembly graphs contains both the forward and reverse complement sequences for each edge, I remove one, then I align the reads to the contigs to create a graph of paired end connections. I get different connectivity graphs when I choose the forward sequences in the fastg from when I choose the first sequence that appears, either forward or reverse complement.
The only thing that changes is the sense of some of the sequences being aligned to. I have observed different bam file sizes, but perhaps this is to be expected even if the alignments are exactly the same. My questions are:
1) How can I track down if there are some different alignments in the bam files?
2) Does anyone know whether it makes sense that BWA would do this, and understand why that would be the case? I was under the impression that the aligner should be agnostic to the sense of the reference sequences.
I think to clarify this you need to let go of all the other aspects of the problem. You do a lot of things and claim that every other step is deterministic yet it is not clear that they are. I think you are jumping ahead to saying bwa does something incorrect where it is far more likely that you not repeating exactly the same process - the data is not actually the same.
Take your sequences align them with BWA. Take an index stat, count how many reads align on one strand versus the other: flagstat and indxstats can do that. Now reverse complement the references and do the same. But note, don't just select another dataset that you "think" should be the reverse complement. Reverse complement the actual data you have used in the previous step. Now rerun the alignments then flagstat and indxstats this will summarize some of the differences. Now turn both files into BED files and visualize them. I think you'll find that both are the same.
@istvan - I did this with no change, as expected.
There aren't two datasets, there is one dataset which is in a fastg file so it contains the forward and reverse complement sequence of every node.
I create a fasta file by dumping half of these. In one case I keep only the forward sense nodes, in the other case I use the first direction, forward or reverse complement (think of this as randomly selected which to keep). The stats that you suggested to look at are slightly different for each. When I take the reverse complement myself instead of reading it from the fastg file, the results are of course the same as when I read the reverse complements from the file.
Any more suggestions as to why bwa works differently on the all forward vs the some forward some reverse nodes?
I did this with no change, as expected.
I did this with no change, as expected.
So if you did this it means that bwa does actually work as expected.
We keep talking about "differences" but in this entire thread, you have never actually shown a difference. File size is NOT a difference. Why don't you list some actual differences?
Please also only differences with MAPQ > 0.
I don't know how to find the differences - that is what I asked. Using flagstat as you suggested I get:
3214312 + 0 in total (QC-passed reads + QC-failed reads) \\ 3214312 + 0 in total (QC-passed reads + QC-failed reads)
3183485 + 0 mapped (99.04%:-nan%) \\ 3183473 + 0 mapped (99.04%:-nan%)
3214312 + 0 paired in sequencing \\ 3214312 + 0 paired in sequencing
2979678 + 0 properly paired (92.70%:-nan%) \\ 2978322 + 0 properly paired (92.66%:-nan%)
3169922 + 0 with itself and mate mapped \\ 3169906 + 0 with itself and mate mapped
13563 + 0 singletons (0.42%:-nan%) \\ 13567 + 0 singletons (0.42%:-nan%)
113930 + 0 with mate mapped to a different chr \\ 113804 + 0 with mate mapped to a different chr
107779 + 0 with mate mapped to a different chr (mapQ>=5) \\ 107765 + 0 with mate mapped to a different chr (mapQ>=5)
idxstats shows there are many edges with small differences in the number of mapped and unmapped. For example:
EDGE_1811_length_385_cov_18.5182 385 221 5 \\ EDGE_1811_length_385_cov_18.5182 385 216 5
Why these differences exist is a mystery to me since the sequences are the same, except in some cases one reference contains the reverse complement of the other. Is that how BWA is expected to work?
short read aligners are not "optimal" aligners, they are not guaranteed to find neither the best nor all alignments, though in general make very few mistakes. These mistakes are usually for alignments that are not perfect to begin with and could be false positives or false negatives.
For example, in this case, the number of mapped reads differ by 12 alignments out of over 3 million so that makes it about .0004 percent!
I would not lose much sleep over that. I would not quantify that as different beyond expected.
Thanks. Ineresting that even the small number of differences makes a difference when looking for paths in the graph formed by contigs and the paired-end reads connecting them. I guess the 1300 or so differences in the paired reads makes the difference. Wish I knew why there's a difference depending on the sense of the reference sequences, I thought it was reasonable to assume the aligner would be completely agnostic to this.