RNA-seq reads mapping to paralogs with high MAPQ. How to ensure this?
1
0
Entering edit mode
7.1 years ago
biotech ▴ 540

I've mapped 75 bp paired-end Illumina RNA-seq data to a bacterial draft genome. We are interested in a family of genes belonging to trimeric autotransporter family. As can be appreciated in the figure, it should be quite difficult to have reads mapping uniquely to one gene because all the genes have same domains. However, we have many reads mapping with MAPQ>20 to this domains.

There could be some explanations to this; however, both of these cases should be flagged as having multiple mappings:

- Subtle SNP differences enough for BWA to discriminate these similar regions

- BWA acting randomly when assigning best match

I would like to know if it's safe to rely in BWA MAPQ>20 value as threshold.

BWA code:

#BWA index
bwa index $genome_file #BWA mapping bwa mem -t 8$genome_file $fastq_file_R1$fastq_file_R2 | gzip -3 > P_S1_L001_aln-pe.sam_inverted.gz
bwa mem -t 8 $genome_file$fastq_file_R3 \$fastq_file_R4 | gzip -3 > V_S1_L001_aln-pe.sam_inverted.gz

Here is the figure showing some proteins of the family, from Pina S. 2009 et al.: http://www.ncbi.nlm.nih.gov/pubmed/19011035

Thanks, Bernardo

RNA-Seq MAPQ BWA paralogs • 2.7k views
1
Entering edit mode
7.1 years ago

If BWA is assigning a MAPQ of 20 then there's enough sequence divergence for it to do so. In such cases, it would be completely inappropriate for BWA to call the alignments multimapping, since they're not (a read is only multimapping if the best and second best alignment have the same score...meaning the MAPQ is 0 (probably...this ends up being aligner-dependent)).

Edit: BTW, you can just pipe to samtools, which uses blocked gzip compression. That'll give you better integration downstream as well as better performance if you ever need random access within the file.