Remove copied reads from a bam
0
0
Entering edit mode
6.6 years ago

I am working on reanalyzing a cohort of bams. Due to an issue in the previous processing some of these bams have reads that are literal copies, not duplicates. For example: ```

samtools view <>.cs.bam | grep HA49BADXX140730:1:2116:12269:27668

```

gives the below. I haven't seen this issue before and it doesn't seem like it's been addressed much. So, I'm looking for a simple method/script to be able to go through a (queryname sorted?) bam and remove reads that have the same name, chromosome and position. Any thoughts?

Thanks!

<h5>#</h5>
HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   12998670    817 AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   13109003    111150  AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  81  chr1    12998670    0   22S54M  =   12997907    -817    CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-13109003,22S54M,1;chr1,-12939535,27S49M,1;

HA49BADXX140730:1:2116:12269:27668  81  chr1    13109003    0   22S54M  =   12997907    -111150 CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-12998670,22S54M,1;chr1,-12939535,27S49M,1;
bam • 1.2k views
ADD COMMENT
0
Entering edit mode

Any thoughts?

you cannot trust this kind of results: re-analysis your data

ADD REPLY
0
Entering edit mode

Ok, I went back to the original bam (before I realigned it) and the reads are there:

HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1    OQ:Z:############################################################################


HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1

It looks like there was a snafu with some score recalibration at some point. But I'm not sure if the uniq approach would work, because of that OQ. Is there a way to do it based solely on read name and sequence?

ADD REPLY
0
Entering edit mode

I don't quite fully understand what happened to your data - are you sure that you know what has happened? Aren't the reads above already marked as a duplicate in your post above? - see PG:Z:MarkDuplicates.5 to the right.

If you want to throw these out of the BAM, you could try running PICARD and SAMtools on the aligned BAM:

java -jar picard.jar MarkDuplicates INPUT=MyStrangeBAM.bam OUTPUT=MyStrangeBAM_Dupes.bam ASSUME_SORTED=true METRICS_FILE=MyStrangeBAM_Dupes.txt VALIDATION_STRINGENCY=SILENT ;
samtools index MyStrangeBAM_Dupes.bam ;
samtools view -b -F 0x400 MyStrangeBAM_Dupes.bam > MyStrangeBAM_DupesRemoved.bam ;
samtools index MyStrangeBAM_DupesRemoved.bam ;

The other manual way to do it would be to convert it to SAM and then use awk to remove duplicates based on, from what I can see, columns 1 (read ID) and 10 (sequence):

awk '{if (x[$1$10]) { y[$1$10]++; print $0; if (y[$1$10] == 1) { print x[$1$10] } } x[$1$10] = $0}' MyStrangeBAM.sam

I have not tested this code

ADD REPLY

Login before adding your answer.

Traffic: 1777 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6