Question: Warning in GATK Picard MarkDuplicates
3 days ago by
Mehulsharma.253 wrote:

I ran MarkDuplicates on my sorted BAM file with the Picard tools in GATK 4.0.11, but got the following warning:

WARNING 2018-12-07 10:14:23 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: ERR044626.41637014. Cause: String 'ERR044626.41637014' did not start with a parsable number.

Do I need to worry about this? What causes this warning? ValidateSam didn't report any errors other than incorrect NM tags.

3 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

MarkDuplicates expects to find a pattern in the read names; it uses that pattern to find the lane, the tile, and the X,Y position of each read within the lane/tile.

Since CASAVA 1.8, the '@' line of a FASTQ record has had the following format:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

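EAS139 is the unique instrument name, followed by the run ID (136), the flowcell ID (FC706VJ), the lane (2), the tile (2104), and the x (15343) and y (197393) coordinates of the cluster within the tile.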
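To make the layout of that '@' line concrete, here is a minimal Python sketch that splits a CASAVA 1.8 read name into its seven colon-separated fields. The class name `IlluminaReadName` and the function `parse_casava18_name` are made up for illustration; this is not Picard code, only a way to check by hand whether your own read names follow the format.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    """The seven colon-separated fields of a CASAVA 1.8 read name (hypothetical helper type)."""
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_casava18_name(header: str) -> IlluminaReadName:
    """Parse the read-name part of a CASAVA 1.8 FASTQ '@' line."""
    # Drop the leading '@' and the space-separated comment (e.g. "1:Y:18:ATCACG").
    name = header.lstrip("@").split(" ", 1)[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(
            f"expected 7 colon-separated fields, got {len(fields)}: {name!r}"
        )
    instrument, run_id, flowcell_id, lane, tile, x, y = fields
    return IlluminaReadName(
        instrument, int(run_id), flowcell_id, int(lane), int(tile), int(x), int(y)
    )


if __name__ == "__main__":
    print(parse_casava18_name("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
```

A name such as ERR044626.41637014 contains no colons at all, so this parser rejects it with a `ValueError`; there is simply no tile or coordinate information in it to extract.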


If your read names don't follow a standard format (ERR044626.41637014 has no colon-separated fields at all), MarkDuplicates won't be able to find this information. There is an option, READ_NAME_REGEX, to handle this:

Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values.
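The rule described in that option text can be sketched in a few lines of Python. This only mirrors the documented behaviour (split on ':' for the default, three capture groups matching the whole name for a custom regex); it is not Picard's actual implementation, the function names `tile_x_y_default` and `tile_x_y_regex` are hypothetical, and Picard itself uses Java regular expressions rather than Python's `re`.

```python
import re
from typing import Optional, Tuple

TileXY = Tuple[int, int, int]


def tile_x_y_default(read_name: str) -> Optional[TileXY]:
    """Documented default rule: split on ':' and pick tile, x, y by position."""
    fields = read_name.split(":")
    if len(fields) == 5:
        picked = fields[2:5]   # 3rd, 4th and 5th elements
    elif len(fields) == 7:
        picked = fields[4:7]   # 5th, 6th and 7th elements (CASAVA 1.8)
    else:
        return None            # no positional information available
    try:
        tile, x, y = (int(v) for v in picked)
    except ValueError:
        return None            # a field was not an integer
    return tile, x, y


def tile_x_y_regex(read_name: str, pattern: str) -> Optional[TileXY]:
    """Custom-regex rule: the pattern must match the ENTIRE read name and
    contain exactly three capture groups (tile, x, y), in that order."""
    match = re.fullmatch(pattern, read_name)
    if match is None:
        return None
    tile, x, y = (int(g) for g in match.groups())
    return tile, x, y


if __name__ == "__main__":
    print(tile_x_y_default("EAS139:136:FC706VJ:2:2104:15343:197393"))
    print(tile_x_y_default("ERR044626.41637014"))
    # Example pattern (an assumption, written for 7-field names only).
    seven_field = r"[^:]+:[^:]+:[^:]+:[^:]+:([0-9]+):([0-9]+):([0-9]+)"
    print(tile_x_y_regex("EAS139:136:FC706VJ:2:2104:15343:197393", seven_field))
```

For a name like ERR044626.41637014 no regex can recover tile, x and y, because the name does not contain them. In that case the documented way out is to set the option to null, which disables optical duplicate detection; as far as I understand, ordinary duplicate marking still runs, so the warning mainly affects the optical duplicate counts and the library size estimate.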
