Question: How do tools parse the QNAME?
2.2 years ago, John (12k, Germany) wrote:

Hey all :)

The SAM spec is pretty lenient on what is and is not allowed in a QNAME. The only really relevant part is:

QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.

So each template gets a unique QNAME, and the same QNAME can appear multiple times in the file (for multiple sequenced reads, secondary alignments, etc).

The problem I'm facing right now is that there's obviously a lot more to QNAMEs than just that. First off, I thought a template was not a read but the whole fragment as it entered the sequencer, so paired-end reads are two reads from the same template and should have the same QNAME. In other words, putting /1 and /2 or anything else at the ends of the QNAME to denote the read pair should go against the standard?

QNAMEs often also encode the position on the flowcell that the template came from, which some programs use to detect optical duplicates. However, I suspect that duplicate detection works exactly the same without it; those duplicates would just be marked as PCR duplicates rather than optical ones (a distinction that is usually thrown away anyway).

So my question is: if I were to rename all the QNAMEs in a BAM to something unique to the template, but with neither the flowcell info nor the /1 /2 mate info, would that cause any actual problems downstream? Are there any tools that just won't work if I rigorously follow the standard here? If so, what would a good 'fake' QNAME look like?


"putting /1 and /2 or anything else at the ends of the QNAME to denote the read pair should go against the standard?"

Yes, that is against the standard and will break tools. Don't do that. /1 and /2 go to FLAG.
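For anyone cleaning up old files, here is a minimal Python sketch of that move, assuming the only mate markers are a trailing /1 or /2. The function name is made up for illustration, and setting the 0x1 (paired) bit alongside 0x40/0x80 is an assumption of the sketch; a real fix-up would also have to get the other mate-related FLAG bits right.

```python
import re

# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1   # template has multiple segments in sequencing
FLAG_READ1 = 0x40   # first segment in the template
FLAG_READ2 = 0x80   # last segment in the template

_MATE_SUFFIX = re.compile(r"/([12])$")


def move_mate_suffix_to_flag(qname, flag):
    """Strip a trailing /1 or /2 from qname and set the matching FLAG bits.

    Hypothetical helper: names without such a suffix are returned unchanged.
    """
    match = _MATE_SUFFIX.search(qname)
    if match is None:
        return qname, flag
    mate_bit = FLAG_READ1 if match.group(1) == "1" else FLAG_READ2
    # Assumption: a mate suffix implies the read is paired, so 0x1 is set too.
    return qname[:match.start()], flag | FLAG_PAIRED | mate_bit


print(move_mate_suffix_to_flag("read42/1", 0))  # ('read42', 65)
print(move_mate_suffix_to_flag("read42/2", 0))  # ('read42', 129)
```

After this both mates carry the same QNAME, which is what lets tools group them as one template.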

written 2.2 years ago by lh3 (30k)

Awesome, hahah - I thought I was going mad for a moment, but glad the QNAMEs are (or at least should be) exactly what you said they should be when you wrote the spec :)

written 2.2 years ago by John (12k)
2.2 years ago, Devon Ryan (81k, Freiburg, Germany) wrote:

Short answer: read names don't matter except for sorting (by read name, not coordinate), pairing (e.g., counting with featureCounts or htseq-count), and marking optical duplicates. Of course optical duplicates will get marked as duplicates regardless, so who really cares about that.

BTW, typically aligners strip /1 and such off, though not always. One should generally not rely on qnames for anything unless absolutely needed.
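If you do want to rename everything, a rough Python sketch (not taken from any existing tool) could look like the following. It works on SAM text lines, for example piped through `samtools view -h`; the function name and the "t" prefix are arbitrary choices for illustration. The one property it tries to preserve is that records sharing an old QNAME still share the new one, so pairing and name-sorting keep working.

```python
def rename_qnames(sam_lines, prefix="t"):
    """Yield SAM lines with each distinct QNAME replaced by prefix + counter.

    Header lines (starting with '@') pass through unchanged. Records that
    shared a QNAME before still share one afterwards, so mates and
    secondary alignments keep grouping by template.
    """
    new_names = {}  # old QNAME -> new QNAME; grows with the number of templates
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        old = fields[0]
        if old != "*":  # '*' means the name is unavailable; leave it alone
            fields[0] = new_names.setdefault(old, f"{prefix}{len(new_names) + 1}")
        yield "\t".join(fields) + "\n"


# Truncated toy records, just to show the mapping.
records = ["@HD\tVN:1.6\n", "a\t99\tchr1\t1\n", "b\t0\tchr1\t5\n", "a\t147\tchr1\t9\n"]
for out in rename_qnames(records):
    print(out, end="")
```

The dictionary holds one entry per template, which is needed for coordinate-sorted input where mates can be far apart. Optical duplicate detection obviously stops working once the flowcell coordinates are gone.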


"Of course optical duplicates will get marked as duplicates regardless, so who really cares about that."

This is an important consideration with patterned flowcells (to see whether the duplicates are optical). I have been trying the Picard MarkDuplicates option to identify these. With the limited number of samples I have looked at, this has not worked reliably: I get some duplicates, but none have been marked optical with the settings suggested by the GATK tutorials (unless the lab did a great job of loading the flowcells with just the right concentration of libraries).

modified 2.2 years ago • written 2.2 years ago by genomax (49k)

One method recommended by the GATK for determining sequencing efficiency is to mark duplicates per-lane, then mark duplicates again once you've merged the lanes.

It took me some time to figure out why anyone would do this, but then it "clicked": any extra duplicates marked in the second round of deduping must come solely from the PCR process, because optical duplicates sit next to each other on the flowcell and so can only occur within a lane. With that you should be able to estimate (although I'm not sure how you calculate it exactly) PCR duplication vs non-PCR duplication (and therefore optical duplication) without having to deal with pixel distances, etc. I heard that the way Illumina reports pixel distances changed (or something like that), so tools like MarkDuplicates require a pixel distance threshold of either 10 or 1000 (or some other several-orders-of-magnitude difference like that) and there's no easy way to tell which you need. If you're getting literally 0 optical dupes, that could be why.

Anyway, it's no longer part of the GATK best practices to mark dupes twice, because it's not very exciting and FastQC has some pretty good metrics for that now.
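To make the pixel-distance idea concrete, here is a small Python sketch of how a tool could pull coordinates out of a QNAME and compare two reads. It assumes the common colon-separated Illumina layout (instrument:run:flowcell:lane:tile:x:y); the names are hypothetical, the box-shaped distance check is a simplification rather than Picard's actual code, and the threshold is left as a parameter because the right value depends on the flowcell.

```python
from typing import NamedTuple


class FlowcellPosition(NamedTuple):
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_qname(qname):
    """Return a FlowcellPosition for an Illumina-style QNAME, else None."""
    fields = qname.split(":")
    if len(fields) < 7:
        return None  # e.g. a renamed read with no flowcell info
    try:
        lane, tile, x, y = (int(f) for f in fields[3:7])
    except ValueError:
        return None
    return FlowcellPosition(fields[2], lane, tile, x, y)


def looks_optical(a, b, max_distance):
    """True if both positions are on the same tile and within max_distance in x and y."""
    if a is None or b is None:
        return False
    same_tile = (a.flowcell, a.lane, a.tile) == (b.flowcell, b.lane, b.tile)
    return (same_tile
            and abs(a.x - b.x) <= max_distance
            and abs(a.y - b.y) <= max_distance)


a = parse_illumina_qname("M01:7:FC1:1:1101:1000:2000")
b = parse_illumina_qname("M01:7:FC1:1:1101:1040:2030")
print(looks_optical(a, b, 100))  # True
print(looks_optical(a, b, 10))   # False
```

The same pair flips from optical to not optical purely on the threshold, which is why a wrong setting can give you zero optical dupes.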

written 2.2 years ago by John (12k)

@John: This is an issue specific to patterned flowcells and is due to "pad-hopping", or contamination of nearby nanowells during ExAmp clustering. It is related to library characteristics and loading concentration. I saw that recently discussed here. Sounds like we will get a new tag that will hopefully be consumed by Picard in the near future.

written 2.2 years ago by genomax (49k)

I hadn't realized that the optical duplicate rate had gotten high enough to matter on patterned flow cells. That would indeed be an issue for them then.

written 2.2 years ago by Devon Ryan (81k)

Thank you Devon, that's massively helpful (and relieving) to hear :)

written 2.2 years ago by John (12k)