Question: How can I identify which read is optical duplicate in SAM file?
0
gravatar for Tori
3.4 years ago by
Tori80
Finland
Tori80 wrote:

I used Picard's markeduplicates and got following output in the terminal.

[Tue Nov 17 15:16:43 EET 2015] net.sf.picard.sam.MarkDuplicates INPUT=[../Tophat2-downsampled/28391/accepted_hits.bam] OUTPUT=28391.deduped.sam METRICS_FILE=28391.txt REMOVE_DUPLICATES=false OPTICAL_DUPLICATE_PIXEL_DISTANCE=75    ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Tue Nov 17 15:16:43 EET 2015] Executing as bishwa@portal.rack2 on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17; Picard version: 1.77(1266)
INFO    2015-11-17 15:16:43    MarkDuplicates    Start of doWork freeMemory: 502696952; totalMemory: 506462208; maxMemory: 5726797824
INFO    2015-11-17 15:16:43    MarkDuplicates    Reading input file and constructing read end information.
INFO    2015-11-17 15:16:43    MarkDuplicates    Will retain up to 22725388 data points before spilling to disk.
INFO    2015-11-17 15:16:45    MarkDuplicates    Read 191568 records. 0 pairs never matched.
INFO    2015-11-17 15:16:46    MarkDuplicates    After buildSortedReadEndLists freeMemory: 436332280; totalMemory: 638582784; maxMemory: 5726797824
INFO    2015-11-17 15:16:46    MarkDuplicates    Will retain up to 178962432 duplicate indices before spilling to disk.
INFO    2015-11-17 15:16:47    MarkDuplicates    Traversing read pair information and detecting duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Traversing fragment information and detecting duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Sorting list of duplicate records.
INFO    2015-11-17 15:16:47    MarkDuplicates    After generateDuplicateIndexes freeMemory: 632341752; totalMemory: 2070413312; maxMemory: 5726797824
INFO    2015-11-17 15:16:47    MarkDuplicates    Marking 2681 records as duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Found 129 optical duplicate clusters.
INFO    2015-11-17 15:16:48    MarkDuplicates    Before output close freeMemory: 2301791296; totalMemory: 2313682944; maxMemory: 5726797824
INFO    2015-11-17 15:16:49    MarkDuplicates    After output close freeMemory: 2314374208; totalMemory: 2326265856; maxMemory: 5726797824
[Tue Nov 17 15:16:49 EET 2015] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=2326265856

 

The terminal output says there are 129 optical duplicates clusters. How can I know the optical duplicate cluster's tile number, x coordinate and y coordinate?

sam picard • 1.9k views
ADD COMMENTlink modified 3.4 years ago by Pierre Lindenbaum118k • written 3.4 years ago by Tori80
2
gravatar for Pierre Lindenbaum
3.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

> How can I know the optical duplicate cluster's tile number, x coordinate and y coordinate?

see the illumina  read name nomenclature: https://en.wikipedia.org/wiki/FASTQ_format

 

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139 the unique instrument name
136 the run id
FC706VJ the flowcell id
2 flowcell lane
2104 tile number within the flowcell lane
15343 'x'-coordinate of the cluster within the tile
197393 'y'-coordinate of the cluster within the tile
1 the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y Y if the read is filtered, N otherwise
18 0 when none of the control bits are on, otherwise it is an even number
ATCACG index sequence
ADD COMMENTlink written 3.4 years ago by Pierre Lindenbaum118k

Those field you mentioned are of fastq files. Picard's MarkDuplicates take BAM/SAM as input and outputs BAM/SAM.

ADD REPLYlink written 3.4 years ago by Tori80

but they are part of the ILLUMINA sam reads. Look at the name of your reads.

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum118k

One of the reads from SAM file looks likes this

J00117:26:H3YC3BBXX:5:1224:28384:20918  419     chr1    12734   3       101M    =       12855   222     TAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCATCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTT   AAFFFKKFKKKAFKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKAFKK,AKKKKFFKKKKKKKFKKKKKKKKF7AKKAFFFKFK<AKFKK,FAAFFFKFKK,   AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:49G51      YT:Z:UU NH:i:2  CC:Z:chr15      CP:i:102518336  HI:i:0

How do I know if this read is optical duplicate?

ADD REPLYlink modified 3.3 years ago • written 3.4 years ago by Tori80

it's not event a 'any-kind-of' duplicate because 419 doesn't contain the DUP flag (1024) : https://broadinstitute.github.io/picard/explain-flags.html

in other case, view the reads before and after this read

e.g.

samtools view your.bam | grep -F "READNAME" -A 10 -B 10

check,  there a rDUP ead at the same chrom+pos , use the names of the reads to check if the cartesian distance is small.

 

ADD REPLYlink written 3.4 years ago by Pierre Lindenbaum118k

@Pierre Lindenbaum  Thanks for your informative comment. So, PCR duplicates and optical duplicates can not be identified from BAM/SAM file because both have same flag 1024. If I plot x and y coordinates I can possibly see in the plot the location of the duplicates. If there are points very close to one another, then they are optical duplicates. Please correct me if I am wrong. By the way, what does it mean if the flag 1024 is negative. I also found negative 1024 flag in my SAM file.

ADD REPLYlink written 3.3 years ago by Tori80

> If I plot x and y coordinates I can possibly see in the plot the location of the duplicates.

yes,... possibly

> it mean if the flag 1024 is negative

it can't be negative because it's an array of bits. If there is a negative flag, your bam is corrupted.

 

 

 

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2480 users visited in the last hour