Question

Remove copied reads from a bam

0

Entering edit mode

6.6 years ago

danny.k.wells • 0

I am working on reanalyzing a cohort of bams. Due to an issue in the previous processing some of these bams have reads that are literal copies, not duplicates. For example: ```

samtools view <>.cs.bam | grep HA49BADXX140730:1:2116:12269:27668

```

gives the below. I haven't seen this issue before and it doesn't seem like it's been addressed much. So, I'm looking for a simple method/script to be able to go through a (queryname sorted?) bam and remove reads that have the same name, chromosome and position. Any thoughts?

Thanks!

HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   12998670    817 AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   13109003    111150  AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  81  chr1    12998670    0   22S54M  =   12997907    -817    CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-13109003,22S54M,1;chr1,-12939535,27S49M,1;

HA49BADXX140730:1:2116:12269:27668  81  chr1    13109003    0   22S54M  =   12997907    -111150 CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-12998670,22S54M,1;chr1,-12939535,27S49M,1;

bam • 1.2k views

ADD COMMENT • link 6.6 years ago by danny.k.wells • 0

0

Entering edit mode

Any thoughts?

you cannot trust this kind of results: re-analysis your data

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Ok, I went back to the original bam (before I realigned it) and the reads are there:

HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1    OQ:Z:############################################################################


HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1

It looks like there was a snafu with some score recalibration at some point. But I'm not sure if the uniq approach would work, because of that OQ. Is there a way to do it based solely on read name and sequence?

ADD REPLY • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by danny.k.wells • 0

0

Entering edit mode

I don't quite fully understand what happened to your data - are you sure that you know what has happened? Aren't the reads above already marked as a duplicate in your post above? - see PG:Z:MarkDuplicates.5 to the right.

If you want to throw these out of the BAM, you could try running PICARD and SAMtools on the aligned BAM:

java -jar picard.jar MarkDuplicates INPUT=MyStrangeBAM.bam OUTPUT=MyStrangeBAM_Dupes.bam ASSUME_SORTED=true METRICS_FILE=MyStrangeBAM_Dupes.txt VALIDATION_STRINGENCY=SILENT ;
samtools index MyStrangeBAM_Dupes.bam ;
samtools view -b -F 0x400 MyStrangeBAM_Dupes.bam > MyStrangeBAM_DupesRemoved.bam ;
samtools index MyStrangeBAM_DupesRemoved.bam ;

The other manual way to do it would be to convert it to SAM and then use awk to remove duplicates based on, from what I can see, columns 1 (read ID) and 10 (sequence):

awk '{if (x[$1$10]) { y[$1$10]++; print $0; if (y[$1$10] == 1) { print x[$1$10] } } x[$1$10] = $0}' MyStrangeBAM.sam

I have not tested this code

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k