Duplicate marking, read names, and the SRA
5 months ago
Luka • 0

Like many of us, I've written pipelines that implement best-practice workflows for alignment of whole-genome sequencing data. These include a duplicate marking step, usually with Picard or Samtools, aimed at identifying PCR and optical duplicates in raw sequencing data.

Illumina has a standard sequence identifier that is used for read names:

@[instrument_name]:[flowcell_lane]:[tile]:[x]:[y]#[index]/[read_number]

These names, particularly the x and y coordinates, are vital for detecting optical duplicate reads.
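To make the role of the coordinates concrete, here is a minimal sketch of pulling tile, x and y out of a read name in the older-style layout above. The function and field names are my own (hypothetical), the first example name is only illustrative, and the second is a made-up SRA-style renamed record. Newer Illumina headers carry more colon-separated fields and would need a different pattern.

```python
import re
from typing import NamedTuple, Optional


class ReadPosition(NamedTuple):
    instrument: str
    lane: str   # second colon-separated field of the read name
    tile: int
    x: int
    y: int


# Older-style identifier: five colon-separated fields, then an optional
# "#index" and an optional "/read_number" suffix.
_OLD_STYLE = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<lane>[^:\s]+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<read>[12]))?$"
)


def parse_read_name(name: str) -> Optional[ReadPosition]:
    """Return the flowcell position encoded in a read name, or None if absent."""
    m = _OLD_STYLE.match(name.strip())
    if m is None:
        return None  # e.g. a name that was replaced by an accession-based ID
    return ReadPosition(
        m["instrument"], m["lane"], int(m["tile"]), int(m["x"]), int(m["y"])
    )


print(parse_read_name("@HWUSI-EAS100R:6:73:941:1973#0/1"))
print(parse_read_name("@SRR0000000.1 1 length=36"))
```

When the name has been replaced, the parser returns None, which is exactly the situation where an optical duplicate can no longer be told apart from any other duplicate.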

The Sequence Read Archive (SRA) is an extremely valuable resource that has enabled genomics research to a staggering degree. The SRA also completely discards read names, a decision that is fairly controversial on the SRA GitHub. Apparently you can recover original read names using the cloud data delivery service, but it costs money to download the data from S3/GCP, which adds up very quickly when dealing with sequencing data. This bit also confuses me: if the NCBI can deliver data with original read names, then clearly they are storing them, so why doesn't the standard SRA toolkit return them? But I digress.

I'm not going to argue with their decision here - that's been done plenty in other areas. However, there appears to be little readily available data or discussion on the actual impact of optical duplicates, which become impossible (or at least much more difficult) to identify once the data have been churned through the NCBI/SRA. I'm sure that such data exist, but either they're difficult to find or I'm very bad at finding them.

Are optical duplicates that remain in SRA data meaningfully harmful to downstream analyses? Can we do anything about them?

SRA MarkDuplicates • 957 views
5 months ago
GenoMax 146k

To be specific, not every dataset in SRA loses the original identifiers. In many cases those remain recoverable using the -F option when you dump the data (cloud download not required). This may be true (not 100% certain) for data that is directly submitted to NCBI (example above was submitted to ENA).

Optical duplicates are only relevant for patterned flowcells (and may be a problem only under specific circumstances e.g. overloaded flowcells). You will have to go through the trouble of identifying the kind of flowcell the data is from before you can decide to tackle the optical duplicates for public data.

An alignment-free tool, clumpify.sh, is available to assess duplicates (see "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."). It can remove all duplicates, PCR or otherwise, if no sequence duplicates are desired in the data, without the need for alignment first.


In many cases those remain recoverable using the -F option when you dump the data (cloud download not required)

The -F option (or any equivalent) was removed in fasterq-dump, and fastq-dump is deprecated.

You will have to go through the trouble of identifying the kind of flowcell the data is from before you can decide to tackle the optical duplicates for public data.

Right, the intent of making this post is to discuss how much bioinformaticians should care about specifically optical duplicates. Technical details of sequencing runs, such as specific characteristics of duplicates, aren't always rigorously taught.

One further thing I'm curious about, based on your reply: since we distinguish PCR and optical duplicates, are the two meaningfully different in how they present in terms of their sequences? I.e., if the reads are identical, then a script (such as clumpify.sh) would be able to identify/remove both sets in one pass. Unfortunately, in cases where one would want to _only_ remove optical duplicates, this wouldn't work, but in e.g. WGS this would be fine (and I'd expect standard duplicate-marking tools to remove the optical duplicates anyway, if this were true).


fastq-dump is deprecated

Curious if there is a reference for this? fasterq-dump seems to allow for parallel streams, but fastq-dump is still included in the sratoolkit. fasterq-dump appears to have --use-name and --seq-defline options; perhaps they would be relevant for recovering the original headers (if present).

are the two meaningfully different in how they present in terms of their sequences?

Not in terms of sequence (unless the data has UMIs), but clumpify.sh can distinguish between the two by using the cluster distance option.

I am not sure how big of a problem optical duplicates are. If the sequencing facility is following loading recommendations/good practices, they should not be a big problem. Do you have a specific reason/experience to be worried about them?


Curious if there is a reference for this?

Apparently it's not deprecated yet, but the SRA intends to deprecate it at some point "soon". In either case, with some SRA data I've looked at, the read names were still stripped even with -F. I suppose the data were originally submitted with stripped read names, but that just makes the original question all the more relevant: even if the SRA tracked read names, a lack of coordinate information for reads is still a real-world problem we encounter.

Not in terms of sequence (unless the data has UMIs), but clumpify.sh can distinguish between the two by using the cluster distance option.

The source code of BBMap/clumpify indicates that clumpify computes cluster distance based on the X and Y coordinates from an Illumina-formatted read name. Therefore, if the read names do not follow the Illumina spec, clumpify.sh cannot compute distances and cannot distinguish optical from PCR duplicates.
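To illustrate why the coordinates matter, here is a rough sketch of the general distance idea. It is not clumpify's actual code; the function name, the coordinate tuple layout and the 100-unit default threshold are all assumptions for the example, and a sensible threshold depends on the platform and flowcell type.

```python
import math
from typing import Optional, Tuple

# (lane, tile, x, y) as recovered from a read name; None if the name was stripped.
Coord = Tuple[str, int, int, int]


def classify_duplicate(
    a: Optional[Coord], b: Optional[Coord], max_dist: float = 100.0
) -> str:
    """Label a pair of sequence-identical reads using flowcell position only."""
    if a is None or b is None:
        return "unknown"      # no coordinates, so the two kinds look the same
    if a[:2] != b[:2]:
        return "non_optical"  # different lane or tile: not physically adjacent
    dist = math.hypot(a[2] - b[2], a[3] - b[3])
    return "optical" if dist <= max_dist else "non_optical"


print(classify_duplicate(("6", 73, 941, 1973), ("6", 73, 950, 1980)))
print(classify_duplicate(("6", 73, 941, 1973), ("6", 74, 941, 1973)))
print(classify_duplicate(("6", 73, 941, 1973), None))
```

The "unknown" branch is the case under discussion here: once the names are gone, every sequence-identical pair lands in it.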

I am not sure how big of a problem optical duplicates are. If the sequencing facility is following loading recommendations/good practices, they should not be a big problem. Do you have a specific reason/experience to be worried about them?

I am not sure either if they're in practice a big problem, which brings us back to my original question, which in hindsight I could have phrased more generally:

Are optical duplicates [that cannot be distinguished due to stripped read names] meaningfully harmful to downstream analyses? Can we do anything about them?

Although if they present the same as PCR duplicates when coordinate information is absent, then one can probably assume that they would be caught and removed by duplicate-marking software, i.e. the final BAM is the same in either case, although it might take a bit longer because the sequences need to be checked. But this won't work in cases where you wouldn't want to mark all duplicates (e.g. amplicon sequencing).


With public data one has to work with what is there. If the information about position is absent, then the only thing that can be done is to look at sequence information. That can be done without alignment (with clumpify) or after alignment (with Picard).

If removing duplicates is your aim, then either of the above would work. Nothing can be done specifically about optical (or really cluster) duplicates in that case.
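As a rough illustration of the sequence-only approach, the sketch below estimates how many reads in a FASTQ stream are exact copies of an earlier read. It is a simplification and not what clumpify or Picard actually do (those tools handle mismatches, pairs and alignment positions); the function names and the tiny in-memory FASTQ are made up for the example.

```python
import io
from collections import Counter
from typing import Iterable, Iterator


def fastq_sequences(handle: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def duplicate_fraction(seqs: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of another read's sequence."""
    counts = Counter(seqs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    return (total - len(counts)) / total


# Hypothetical records with accession-style names and no coordinates.
fastq = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
    "@r3\nTTTTACGT\n+\nIIIIIIII\n"
    "@r4\nGGGGACGT\n+\nIIIIIIII\n"
)
print(duplicate_fraction(fastq_sequences(fastq)))
```

Note that this count lumps PCR and optical duplicates together, which is the limitation described above.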
