Question: Remove copied reads from a bam
0
gravatar for danny.k.wells
4 weeks ago by
danny.k.wells0 wrote:

I am working on reanalyzing a cohort of bams. Due to an issue in the previous processing some of these bams have reads that are literal copies, not duplicates. For example: ```

samtools view <>.cs.bam | grep HA49BADXX140730:1:2116:12269:27668

```

gives the below. I haven't seen this issue before and it doesn't seem like it's been addressed much. So, I'm looking for a simple method/script to be able to go through a (queryname sorted?) bam and remove reads that have the same name, chromosome and position. Any thoughts?

Thanks!

<h5>#</h5>
HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   12998670    817 AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  161 chr1    12997907    0   71M5S   =   13109003    111150  AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    NM:i:5  MD:Z:9T4G0T37C8T8   AS:i:46 XS:i:45 RG:Z:SA9755_T1.HA49B.1

HA49BADXX140730:1:2116:12269:27668  81  chr1    12998670    0   22S54M  =   12997907    -817    CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-13109003,22S54M,1;chr1,-12939535,27S49M,1;

HA49BADXX140730:1:2116:12269:27668  81  chr1    13109003    0   22S54M  =   12997907    -111150 CTAAACTCCAACCTGGTCAACAAGCATCATAACTCTCCCGGGGGGCAGGATACAGCTCCACGCATAAGTTTTGGAG    35=?>9>=3??=;?>;.(@>5..;;7);>3=)545-;'0AAA=B?1*=A:?2<7@C?)8?147<+<227@+<+=11    NM:i:1  MD:Z:50T3   AS:i:50 XS:i:50 RG:Z:SA9755_T1.HA49B.1  XA:Z:chr1,+13368515,54M22S,1;chr1,-12998670,22S54M,1;chr1,-12939535,27S49M,1;
bam • 134 views
ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by danny.k.wells0

Any thoughts?

you cannot trust this kind of results: re-analysis your data

ADD REPLYlink written 4 weeks ago by Pierre Lindenbaum99k

Ok, I went back to the original bam (before I realigned it) and the reads are there:

HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1    OQ:Z:############################################################################


HA49BADXX140730:1:2116:12269:27668  653 *   0   0   *   *   0   0   AACCCCAGGGGGAGCCTGCAGTCAGCTGAGATGGCACCATTGCACTCTAAACTGCAACCTGGGCAACAAGCATGAT    ############################################################################    PG:Z:MarkDuplicates.5   RG:Z:HA49B.1

It looks like there was a snafu with some score recalibration at some point. But I'm not sure if the uniq approach would work, because of that OQ. Is there a way to do it based solely on read name and sequence?

ADD REPLYlink modified 4 weeks ago by genomax34k • written 4 weeks ago by danny.k.wells0

I don't quite fully understand what happened to your data - are you sure that you know what has happened? Aren't the reads above already marked as a duplicate in your post above? - see PG:Z:MarkDuplicates.5 to the right.

If you want to throw these out of the BAM, you could try running PICARD and SAMtools on the aligned BAM:

java -jar picard.jar MarkDuplicates INPUT=MyStrangeBAM.bam OUTPUT=MyStrangeBAM_Dupes.bam ASSUME_SORTED=true METRICS_FILE=MyStrangeBAM_Dupes.txt VALIDATION_STRINGENCY=SILENT ;
samtools index MyStrangeBAM_Dupes.bam ;
samtools view -b -F 0x400 MyStrangeBAM_Dupes.bam > MyStrangeBAM_DupesRemoved.bam ;
samtools index MyStrangeBAM_DupesRemoved.bam ;

The other manual way to do it would be to convert it to SAM and then use awk to remove duplicates based on, from what I can see, columns 1 (read ID) and 10 (sequence):

awk '{if (x[$1$10]) { y[$1$10]++; print $0; if (y[$1$10] == 1) { print x[$1$10] } } x[$1$10] = $0}' MyStrangeBAM.sam

I have not tested this code

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Kevin Blighe3.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1380 users visited in the last hour