Marking and Removing duplicates using picard (picard-tools-1.129)
1
3
Entering edit mode
7.3 years ago
ravi.uhdnis ▴ 210

hi, i ran the following command for 'marking and removing duplicates' from my WGS data from illumina HiSeq platform:

java -Xms4g -jar /usr/local/picard-tools-1.129/picard.jar MarkDuplicates INPUT=2102.bwa.sam.sort.bam OUTPUT=2102.bwa.sam.sort.rmdup1.bam METRICS_FILE=2102.bwa.sam.sort.rmdup.txt2 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT

I found my input file was input file was 6.8G whereas output file formed of 7.0G. Moreover, i didn't find any duplicates removed from the files after visualization by IGV or via command line by samtools i.e. "diff -c <(samtools view 2102.bwa.sam.sort.bam | cut -f -9) <(samtools view 2102.bwa.sam.sort.rmdup1.bam | cut -f -9) | less". Any suggestion where i am missing something ?. Thanks, Ravi

next-gen genome sequencing • 23k views
1
Entering edit mode

Try to use /usr/local/picard-tools-1.129/MarkDuplicates.jar directly. Alternatively you can try samtools rmdup to test if results change.

0
Entering edit mode

hi, thank you for suggestion. going to try it now.

0
Entering edit mode

Hi, the 'picard-tools-1.129' directory didn't have 'MarkDuplicates.jar' file. The only files available are : htsjdk-1.129.jar  libIntelDeflater.so  picard.jar  picard-lib.jar so i don't think i have the option to use 'MarkDuplicates' the way you suggested ?.

0
Entering edit mode

Can you please post the content (or the basic stats) from the Metrics file that was output? It should be 2102.bwa.sam.sort.rmdup.txt2 as you specified in your command line.

0
Entering edit mode

Thank you. Yes, here is the content of the above file (i shortened the names of files while posting this query):

## htsjdk.samtools.metrics.StringHeader
# picard.sam.markduplicates.MarkDuplicates INPUT=[2102_L001_R1R2_P.bwa.sam.sort.bam] OUTPUT=2102_L001_R1R2_P.bwa.sam.sort.rmdup1.bam METRICS_FILE=2102_L001_R1R2_P.bwa.sam
TION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REG
EX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_IN
DEX=false CREATE_MD5_FILE=false
# Started on: Thu Jun 04 14:43:36 EDT 2015

## METRICS CLASS    picard.sam.DuplicationMetrics
PLICATION    ESTIMATED_LIBRARY_SIZE
Unknown Library    1318015    55033367    1914385    83153    514977    93703    0.009993    3564090250

## HISTOGRAM    java.lang.Double
BIN    VALUE
1.0    1.001692
2.0    1.988036
3.0    2.959267
4.0    3.915616
5.0    4.857312
6.0    5.784578
7.0    6.697637
8.0    7.596705
9.0    8.481997
10.0    9.353724
11.0    10.212095
12.0    11.057313
13.0    11.88958
14.0    12.709094
15.0    13.516052
16.0    14.310645
17.0    15.093063
18.0    15.863493
19.0    16.622117
20.0    17.369118
21.0    18.104673
22.0    18.828957
23.0    19.542143
24.0    20.244402
25.0    20.9359
26.0    21.616803
27.0    22.287272
28.0    22.947469
29.0    23.597549
30.0    24.237669
31.0    24.867981
32.0    25.488634
33.0    26.099778
34.0    26.701557
35.0    27.294116
36.0    27.877595
37.0    28.452134
38.0    29.01787
39.0    29.574937
40.0    30.123468
41.0    30.663595
42.0    31.195445
43.0    31.719146
44.0    32.234823
45.0    32.742598
46.0    33.242593

47.0    33.734927
48.0    34.219717
49.0    34.697079
50.0    35.167127
51.0    35.629972
52.0    36.085725
53.0    36.534495
54.0    36.976389
55.0    37.411512
56.0    37.839968
57.0    38.261859
58.0    38.677285
59.0    39.086346
60.0    39.489139
61.0    39.88576
62.0    40.276305
63.0    40.660865
64.0    41.039532
65.0    41.412398
66.0    41.77955
67.0    42.141077
68.0    42.497064
69.0    42.847596
70.0    43.192758
71.0    43.532631
72.0    43.867296
73.0    44.196833
74.0    44.521321
75.0    44.840837
76.0    45.155457
77.0    45.465256
78.0    45.770309
79.0    46.070687
80.0    46.366463
81.0    46.657707
82.0    46.944488
83.0    47.226875
84.0    47.504935
85.0    47.778735
86.0    48.048339
87.0    48.313812
88.0    48.575218
89.0    48.832618
90.0    49.086074
91.0    49.335647
92.0    49.581395
93.0    49.823379
94.0    50.061654
95.0    50.296278
96.0    50.527308
97.0    50.754797
98.0    50.978801
99.0    51.199372
100.0    51.416564

2
Entering edit mode

I should also add that the discrepancy in size is likely due to the fact that MarkDuplicates adds an entry to each record in the BAM file. See the statement which starts at line 182:

0
Entering edit mode

yes, i was also assuming this fact that picard might have add some 'mark' in .bam file, which leads in increase in size but at the same time i have read that after 'duplicate removal' the overall size of the file should be reduced. So i am bit confused with my file's size (& number of lines) outcome. Sorry, i can't understand java. Thanks for comment, Ravi

1
Entering edit mode

Thanks! There are definitely some duplicates there that Picard caught. Looks like your duplication rate is just under 1%.

Try running this command on both your input BAM and your MarkDuplicates output BAM:

samtools view [BAMfile] | wc -l


What value do you get for each file?

0
Entering edit mode
7.3 years ago
ravi.uhdnis ▴ 210

hi, thanks for the comment and command. here is the output of this command on both .bam files. Certainly the number of lines seems reduced in 'Mark & Remove Duplicate' file. here it is:

[rsindhu@master test_2102]$samtools view 2102_L001_R1R2_P.bwa.sam.sort.bam | wc -l 113465731 [rsindhu@master test_2102]$ samtools view 2102_L001_R1R2_P.bwa.sam.sort.rmdup1.bam | wc -l
112352624

3
Entering edit mode

Well congrats. You have a genomic library of high complexity. I have been analyzing NGS data for quiet a while but have never seen any library with PCR duplication rate < 1%. You should be excited rather getting tensed about increase in the bam size. The increase could be due to the mark added by picard as suggested by Dan OR it could be difference in the compression ratio used in the original bam and the bam generated by picard. Not sure though.

0
Entering edit mode

Hi, thank you for the suggestion. I am new for this whole area so seeking help at almost every step. Here, since i am using data of only 'ONE' lane out of 8 in my short 'test' experiment so it might be the reason for such small duplication rate. Once i'll use the whole data of a sample (which is about 10 times the current test sample) then the original duplicate rates will appear, which might be much more than 1%. It's my assumption but yes, that's also true that the data quality is really good for this run. Thanks for the comment.

1
Entering edit mode

Run samtools view [Dedupped BAM] | less

What you should see is that each record in the BAM file has Picard MarkDuplicates (or something similar) in the optional fields. That's what is increasing your file size after the removal of duplicates.

0
Entering edit mode

By the way, you could replace the pipe with samtools view -c, which counts the records itself rather than emitting the full bam only to have it consumed by wc.

In addition, if you want to check to only count reads that have not been flagged as duplicates, you could further replace your calls with samtools view -cF1024 {bam}. -F means skip all reads with any bits in parameter set. This skips all reads marked as duplicates.

I typically use 2816 or 3840 when counting my bams. 2816 just out of habit, as the 1024 bit is usually unset in my workflows.