Duplicate reads remain in bam after picard MarkDuplicates --REMOVE_DUPLICATES true
0
0
Entering edit mode
2.2 years ago
kalavattam ▴ 190

When I use picard MarkDuplicates to deduplicate a merged bam file from a sci-ATAC-seq experiment, duplicates remain in the file.

The merged bam file comes from a processing pipeline that takes paired-end fastq files and associates individual reads with nuclei of origin via associated molecular barcodes, changing sequencing platform-specific information in fastq files to barcode information—and thus, after the fastq files are aligned, in bam qname fields. The pipeline aligns the fastq files with bowtie2 and merges resulting bam files with sambamba merge. I sort the merged bam file by coordinate with picard SamSort, and then I call picard MarkDuplicates:

picard MarkDuplicates \
--INPUT "${file}.bam" \
--OUTPUT "${file_rmdup}.bam" \
--METRICS_FILE "${file_rmdup}.txt" \
--REMOVE_DUPLICATES true

While a small number of duplicates appear to be removed, many more remain in the bam file:

samtools view "${file_rmdup}.bam" | head -20
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  99  chrX    3041    42  50M =   3205    214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT  AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  99  chrX    3041    42  50M =   3205    214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-3 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  147 chrX    3205    42  50M =   3041    -214    CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA  EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  147 chrX    3205    42  50M =   3041    -214    CTGGGTGGGACTCAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA  EEEEE<EEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:11A38  PG:Z:MarkDuplicates XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-3 YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  99  chrX    22016   39  50M =   22179   213 ACATTCTAGCGTGACCCCAGTCAATTCTGAGACTGAAGACAAACTTAATA  AAAAAEEEEEEEE/EEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEAEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-25    YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  99  chrX    22016   39  50M =   22179   213 ACATTCTAGCGTGACCCCAGTCAATTCTGAGACTGAAGACAAACTTAATA  AAAAAEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-30    YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  147 chrX    22179   39  50M =   22016   -213    TAAAATGGCTTGTCATAATGACTGAAATACTTGAGCCTACTTCCTTGTCC  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-15    YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  147 chrX    22179   39  50M =   22016   -213    TAAAATGGCTTGTCATAATGACTGAAATACTTGAGCCTACTTCCTTGTCC  EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-15    YS:i:0  YT:Z:CP
# GAATTCGTACGGCGTTAACTCTAACTCGTAATCTTA:0  99  chrX    23671   42  50M =   23786   165 CATTGATAATCCCAGCCTAAACGACCAGCAGTCACTCTGGAAGCCACAGA  A/A6A6A66AEAEAA//A6/A6//AE6/A6/AAA6666AA/<//A/66A/  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTACGGCGTTAACTCTAACTCGTAATCTTA:0  147 chrX    23786   42  50M =   23671   -165    TCAATCCCTCATTATCATCCTCAGACCTTTTGTTATAGTCTCTCCCACTA  6EEEEEEEAEE/EEEEEEEEEE/EEEEEEEEEE6E6EEEE6E/EEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  163 chrX    24055   40  50M =   24088   83  GAACCACAGAGAAAATAGAAACCAAGGAACAAAACACCCACCTCACAACG  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AEEEEEEEEEEE  MD:Z:48G1   PG:Z:MarkDuplicates XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-10    YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  99  chrX    24083   42  50M =   24178   145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT  AAAAAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE  MD:Z:20G15T13   PG:Z:MarkDuplicates XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YS:i:0  YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  99  chrX    24083   42  50M =   24178   145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT  AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE/EEE  MD:Z:20G15T13   PG:Z:MarkDuplicates XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YS:i:0  YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  99  chrX    24083   42  50M =   24178   145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT  AAAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:20G15T13   PG:Z:MarkDuplicates XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YS:i:0  YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0  83  chrX    24088   40  50M =   24055   -83 ACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAATACAAT  EEEEEEEEEEEEEEEEEEEEEEEA/EEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:15G15T18   PG:Z:MarkDuplicates XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10    YS:i:-5 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  147 chrX    24178   42  50M =   24083   -145    AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC  EE/EEEEEEEE/EEEEEEEEE6EEEEEEEAEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-10    YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  147 chrX    24178   42  50M =   24083   -145    AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC  EEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-10    YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  147 chrX    24178   42  50M =   24083   -145    AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC  EEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-10    YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  99  chrX    24589   42  50M =   24948   409 AAGTTTGGGTTGAAAGTTTTGTGGGTGAGTTTGAAGTGGAAACTTCTAGT  AAAAAEEEEEEAEEEEEEEEAEEEEEEEEEEE6E//EEEE/<EEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0  147 chrX    24948   42  50M =   24589   -409    AAAAACTCACCTAAGAACTGCTTGTGAGGTTCCTGCAGCAACTTGTAGTT  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-30    YS:i:0  YT:Z:CP
#  For example, what I observe:
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  99  chrX    3041    42  50M =   3205    214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT  AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  99  chrX    3041    42  50M =   3205    214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:-3 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  147 chrX    3205    42  50M =   3041    -214    CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA  EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  147 chrX    3205    42  50M =   3041    -214    CTGGGTGGGACTCAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA  EEEEE<EEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:11A38  PG:Z:MarkDuplicates XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-3 YS:i:0  YT:Z:CP

#  Instead, I expect this:
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  99  chrX    3041    42  50M =   3205    214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT  AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0  147 chrX    3205    42  50M =   3041    -214    CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA  EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA  MD:Z:50 PG:Z:MarkDuplicates XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  YS:i:0  YT:Z:CP

In checking the log that results from running picard Markduplicates, I see the following:

WARNING 2022-02-27 21:13:22 AbstractOpticalDuplicateFinderCommandLineProgram    A field field parsed out of a read name was expected to contain an integer and did not. Read name: GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0. Cause: String 'GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC' did not start with a parsable number.

Again, fastq sequence identifiers have been converted from sequencing platform-specific information to nucleus barcode information, and thus sequencing platform information is no longer found in bam qname fields, which I think is needed by picard MarkDuplicates.

Is it possible to use picard MarkDuplicates for deduplication of merged bam files resulting from this pipeline?

deduplication qname fastq Picard bam • 2.1k views
ADD COMMENT
0
Entering edit mode

the tool has multiple settings to remove duplicates, and the explanations are a bit complex,

in your case I would recommend filtering by flag, note how some of your flags don't really make sense

samtools flags 3041
0xbe1   3041    PAIRED,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,SUPPLEMENTARY

you have it labeled as both read1 and read 2, and also secondary and supplementary. Seems pretty odd frankly.

Or get this:

samtools flags 3205
0xc85   3205    PAIRED,UNMAP,READ2,DUP,SUPPLEMENTARY

unmapped, duplicated and supplementary? ...

long story short, I think you'll get better results and more control if you filter by flag rather than have the tool remove alignments

ADD REPLY
0
Entering edit mode

Thank you for the response. Sorry, I should have included a header for column names: 3041 and 3205 are in pos fields; the flags are 99 and 147.

samtools flags 99
# 0x63  99  PAIRED,PROPER_PAIR,MREVERSE,READ1

samtools flags 147
# 0x93  147 PAIRED,PROPER_PAIR,REVERSE,READ2
ADD REPLY
1
Entering edit mode

Ah yes, my bad I misread the file since every line started with a # it seemed like a more custom format.

I will say that replacing read names is a bad practice as it destroys very important information in the BAM file (information on the pairing) that later is impossible to reliably restore. I suspect Picard is misbehaving for that reason.

If you need to add the barcode then add it after a unique identifier @1-ATGC . (Just adding that won't solve the optical duplicate detection because that needs more information). The proper way by the way would be to tag each alignment with the read group that marks it with the appropriate information.

I would give the samtools rmdup a try to remove duplicates:

samtools rmdup
ADD REPLY
0
Entering edit mode

Thank you again. samtools rmdup works well in this situation, and repair from the Subread package is a nice and fast tool to get paired read-ends together.

ADD REPLY

Login before adding your answer.

Traffic: 1651 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6