Entering edit mode
6.6 years ago
danny.k.wells
•
0
I am working on reanalyzing a cohort of bams. Due to an issue in the previous processing some of these bams have reads that are literal copies, not duplicates. For example: ```
samtools view <>.cs.bam | grep HA49BADXX140730:1:2116:12269:27668
```
gives the below. I haven't seen this issue before and it doesn't seem like it's been addressed much. So, I'm looking for a simple method/script to be able to go through a (queryname sorted?) bam and remove reads that have the same name, chromosome and position. Any thoughts?
Thanks!
<h5>#</h5>HA49BADXX140730:1:2116:12269:27668 161 chr1 12997907 0 71M5S = 12998670 817 AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT ############################################################################ NM:i:5 MD:Z:9T4G0T37C8T8 AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1
HA49BADXX140730:1:2116:12269:27668 161 chr1 12997907 0 71M5S = 13109003 111150 AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT ############################################################################ NM:i:5 MD:Z:9T4G0T37C8T8 AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1
HA49BADXX140730:1:2116:12269:27668 81 chr1 12998670 0 22S54M = 12997907 -817 CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG 35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11 NM:i:1 MD:Z:50T3 AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1 XA:Z:chr1,+13368515,54M22S,1;chr1,-13109003,22S54M,1;chr1,-12939535,27S49M,1;
HA49BADXX140730:1:2116:12269:27668 81 chr1 13109003 0 22S54M = 12997907 -111150 CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG 35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11 NM:i:1 MD:Z:50T3 AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1 XA:Z:chr1,+13368515,54M22S,1;chr1,-12998670,22S54M,1;chr1,-12939535,27S49M,1;
you cannot trust this kind of results: re-analysis your data
Ok, I went back to the original bam (before I realigned it) and the reads are there:
It looks like there was a snafu with some score recalibration at some point. But I'm not sure if the uniq approach would work, because of that OQ. Is there a way to do it based solely on read name and sequence?
I don't quite fully understand what happened to your data - are you sure that you know what has happened? Aren't the reads above already marked as a duplicate in your post above? - see PG:Z:MarkDuplicates.5 to the right.
If you want to throw these out of the BAM, you could try running PICARD and SAMtools on the aligned BAM:
The other manual way to do it would be to convert it to SAM and then use
awk
to remove duplicates based on, from what I can see, columns 1 (read ID) and 10 (sequence):I have not tested this code