Question: Will Picard MarkDuplicates also un-mark duplicates?
1
gravatar for Chris Miller
3.3 years ago by
Chris Miller18k
Washington University in St. Louis, MO
Chris Miller18k wrote:

If I take a bam that's already aligned and has dups marked, remove a bunch of reads, then re-run Picard's mark-duplicates, will it correctly change the flags of reads that are no longer duplicates (but may have been before ditching the reads)?

This question is proving surprisingly hard to google for.

ADD COMMENTlink modified 3.3 years ago by Pierre Lindenbaum98k • written 3.3 years ago by Chris Miller18k

@Pierre Lindenbaum

Actually I have a question rather than an answer here. I am very hesitating if I should use 'REMOVE_DUPLICATES=true' or 'REMOVE_DUPLICATES=false'. What difference it will make by only marking the flag to duplicates and removing duplicates? Thank you,

Lisa

 

ADD REPLYlink written 2.1 years ago by zhanghon0

Depends on what downstream tools you're using. In general, my preference is not to remove them, as changes in software or approach may make that data valuable at some later date. My general inclination is always to avoid throwing away any data, though.

ADD REPLYlink written 2.1 years ago by Chris Miller18k
5
gravatar for Pierre Lindenbaum
3.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum98k wrote:

let's test it. I use the following BAM ( 3 pairs of reads, same sequence/cigar)

@HD	VN:1.0 SO:coordinate	SO:coordinate
@SQ	SN:seq1	LN:5000
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:577	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:578	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:579	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	MF:i:18	Aq:i:35	NM:i:2	UQ:i:38	H0:i:0	H1:i:1
EAS54_61:4:143:69:577	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_61:4:143:69:578	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0
EAS54_61:4:143:69:579	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	MF:i:18	Aq:i:35	NM:i:1	UQ:i:5	H0:i:1	H1:i:0

I run markduplicate:

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:577	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:578	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:577	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35
EAS54_61:4:143:69:578	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35
EAS54_61:4:143:69:579	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35

I remove some reads. I keep one pair of reads marked as dup :

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:579	1123	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	1171	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates	NM:i:1	UQ:i:5	Aq:i:35

I re-run picard, the flag is removed.

@HD	VN:1.4	SO:coordinate
@SQ	SN:seq1	LN:5000
@PG	ID:MarkDuplicates	PN:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[sorted.bam] OUTPUT=marked.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@PG	ID:MarkDuplicates.1	PN:MarkDuplicates	PP:MarkDuplicates	VN:1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)	CL:picard.sam.MarkDuplicates INPUT=[jeter2.bam] OUTPUT=marked2.bam METRICS_FILE=jeter    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
@CO	Example of SAM/BAM file format.
EAS54_61:4:143:69:579	99	seq1	36	98	35M	=	185	184	GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG	===;=====48=844;=;+=5==*57,2+5&,5+5	H0:i:0	H1:i:1	MF:i:18	PG:Z:MarkDuplicates.1	NM:i:2	UQ:i:38	Aq:i:35
EAS54_61:4:143:69:579	147	seq1	185	98	35M	=	36	-184	ATTGGGAGCCCCTCTAAGCCGTTCTATTTGTAATG	222&<21<<<<12<7<01<<<<<0<<<<<<<20<<	H0:i:1	H1:i:0	MF:i:18	PG:Z:MarkDuplicates.1	NM:i:1	UQ:i:5	Aq:i:35
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Pierre Lindenbaum98k
2

Well played, sir.  In retrospect, doing the experiment myself would have been faster than that 10 minutes of searching and typing this post.  Thanks!

ADD REPLYlink written 3.3 years ago by Chris Miller18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1389 users visited in the last hour