How to modify a BAM file to avoid double-counting of variants from overlapping PE reads?
0
0
Entering edit mode
3.2 years ago
lechu ▴ 20

Hi, I need to find a way to avoid counting the same mismatch twice in cases when it (the mismatch) lies on the overlapping part of mate reads. One way to do it which I am aware of, and tried t implement is to set PHRED scores to zero on one of the read mates in aligned BAM file to zero. I know that many tools, e.g., Samtools mpileup or Picard deal with this problem behind the scenes. My problem is that Samtools produces a pileup file, but I need a bam file with base PHRED scores modified accordingly. Suggestions on how to do it will be very welcome. In case I am wrong, and the above-mentioned tools can give me the desired BAM file, then how do I make it?

I also tried to use the following script: https://github.com/brentp/bwa-meth/blob/master/bam-merge-pairs.py but it could not get past the error message:

 File "bam_merge_pairs_lk.py", line 19
 print "\t".join(pair[0])
          ^
 SyntaxError: invalid syntax

Cheers, Lech

RNA-Seq BSseq • 1.3k views
ADD COMMENT
0
Entering edit mode

The author removed this script from Github this morning, maybe after getting informed about this. Can you say what the final goal is, so you want to count mismatches, wouldn't it be better to use a variant caller and then filter or count based on the VCF file? Naive counting based on reads is probably not a good idea as you cannot say whether the variant is reliable.

ADD REPLY
0
Entering edit mode

Thank ATpoint, yes, I contacted the author.

I would like to count the number of mismatches per fragment in my RNAseq data. The mismatches are caused by RNA labeling and are not SNPs. I will use variant calling to be able to exclude SNPs (which in this case are false positives).

In this case, the problem is that I don't know if two mismatches should be counted twice (if present on both mates, but outside of the overlap) or once (if they are on the overlap of a given mate pair).

ADD REPLY
0
Entering edit mode

if present on both mates, but outside of the overlap

How would that be possible?

ADD REPLY
0
Entering edit mode

Sorry, I was not precise with my wording, but this is what I meant:

----------x----------------------->
                        <---------------------------x----------
count should be 2

vs

-----------------------------x--------->
                        <----x---------------------------------
count should be 1
ADD REPLY
0
Entering edit mode

How about merging the reads if they overlap, e.g. with FLASH https://ccb.jhu.edu/software/FLASH/ and then do the counting on these merged reads?

ADD REPLY
0
Entering edit mode

Thanks for the tip! I thought about it but prefer to do merging after mapping, using coordinates to ensure correct fuse. That being said, I found no software that does it (well, the only software I found - aftermerge.py - was either buggy, or I was unable to properly configure it). I am trying BBmerge right now, which seems to do the trick you suggested, although the approach with tagging the read positions would be more elegant and reliable (at least this is my impression).

ADD REPLY

Login before adding your answer.

Traffic: 2711 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6