Question

What is a read group?

3

5.7 years ago

crisagazzola ▴ 30

Could someone explain to me like I'm 5 years old what a read group is? I've read several definitions of it. For example, "A read group is the set of reads that were generated from a single run of a sequencing instrument". So in this definition, is the set of reads the same thing as the set of all the base pair sequence segments that are generated after the DNA has been run through the sequencing machine? Are the "set of reads" the ones that are contained in the fastq file?

I've read other definitions that use the terms "lane" and "flow cell". I've looked up these terms as well but still don't understand what the read group refers to. I think I've spotted it in some .fastq files. I'm a software developer with no background in bioinformatics who has been playing around with the Picard tools, and for some of the tools you must pass a read group as an argument. I want to make sure I understand what I'm passing in and what it does. Thank you.

read group sequencing sequence next-gen • 3.2k views

ADD COMMENT • link updated 5.7 years ago by Devon Ryan 104k • written 5.7 years ago by crisagazzola ▴ 30

0

Which Picard tools are you trying to use?

This page has some good discussion of read group.

ADD REPLY • link 5.7 years ago by goodez ▴ 640

0

I've been using FastqToSam, which takes a read group as an argument.

ADD REPLY • link 5.7 years ago by crisagazzola ▴ 30

0

I guess I'm just looking for some confirmation on the meaning of the basic terminology. For example, the page you linked to states that "There is no formal definition of what is a read group, but in practice, this term refers to a set of reads that were generated from a single run of a sequencing instrument".

So does the "set of reads" refer to the same strings found in the FASTQ or SAM file, each of which describes a segment of DNA? For example, "ACTTTAGAAATTTACTTTTA". Is that a "read"? And is the entire set of them found in a FASTQ file the "read group"?

ADD REPLY • link 5.7 years ago by crisagazzola ▴ 30

0

Past thread of interest:
Read Group In Sam/Bam Files: What Do They Exactly Describe?

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Always nice to see non-wet lab people going to the effort of really understanding the process! :) +1

ADD REPLY • link 5.7 years ago by Joe 21k

score 3 · Answer 1 · 2018-07-30

is the set of reads the same thing as the set of all the base pair sequence segments that are generated after the DNA has been run through the sequencing machine?

Bases, not "base pairs", but yes.

Are the "set of reads" the ones that are contained in the fastq file?

Yes

More generally, a "read group" is a set of sequences (in one or more fastq files) having a common set of metadata. This metadata generally includes patient/sample ID, library ID (the library is the preparation of the patient/sample DNA that's actually sequenced and there can be more than one library made per patient/sample) and flow cell.
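Since you're a software developer, it may help to see that metadata written out. In a SAM/BAM file a read group is declared as an `@RG` header line made of tab-separated `TAG:value` pairs, where `ID`, `SM` (sample), `LB` (library), `PU` (platform unit, often flow cell plus lane) and `PL` (platform) are tags from the SAM specification. The following is only a rough sketch: the helper name `make_rg_line` and all of the example values are made up for illustration and are not what any particular tool emits.

```python
def make_rg_line(rg_id, sample, library, platform_unit, platform="ILLUMINA"):
    """Build a SAM @RG header line from read group metadata.

    Hypothetical helper for illustration; the tag names (ID, SM, LB, PU, PL)
    come from the SAM specification, everything else is an assumption.
    """
    fields = [
        ("ID", rg_id),          # unique read group identifier
        ("SM", sample),         # patient/sample ID
        ("LB", library),        # library ID
        ("PU", platform_unit),  # platform unit, e.g. flowcell.lane
        ("PL", platform),       # sequencing platform
    ]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Made-up example values: one sample, one library, flow cell FLOWCELLA, lane 1.
rg_line = make_rg_line("FLOWCELLA.1", "patient1", "lib1", "FLOWCELLA.1")
print(rg_line)
```

Every read that belongs to this group then just carries the group's `ID`, so the shared metadata is stored once in the header rather than on each read.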

A "flow cell" is the physical device (it's a partially hollow glass slide) on the sequencer where the sequencing actually takes place. These are typically single-use. The flow cell is always a component of the read group, since it can represent a batch effect that downstream software may need to deal with (e.g., the software may be written to model some sort of sequencing bias on a per-flow-cell basis). Flow cells themselves are composed of one or more lanes, which quite literally are lanes through the flow cell in which DNA and fluids flow. Theoretically one could conceive of lane-specific biases that software could be written to handle. In practice this isn't really an issue (for that reason, fastq files commonly contain sequences from multiple lanes), but you'll still see references to lane effects in software that was written a number of years ago.
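If you want to see which flow cells and lanes the reads in a fastq file came from, the read names often tell you. The sketch below assumes the common Illumina-style naming, `instrument:run:flowcell:lane:tile:x:y`, optionally followed by a space and a comment; other platforms and older pipelines name reads differently, so treat that layout as an assumption. The function name and the example read names are invented for illustration.

```python
from collections import Counter


def flowcell_and_lane(read_name):
    """Return (flowcell_id, lane) parsed from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y, which is
    common but not universal.
    """
    # Drop the leading "@" and anything after the first space (the comment).
    core = read_name.lstrip("@").split(" ", 1)[0]
    fields = core.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[2], int(fields[3])


# Invented read names, as they would appear on FASTQ header lines.
example_names = [
    "@INSTR1:42:FLOWCELLA:1:1101:1000:2000 1:N:0:ACGT",
    "@INSTR1:42:FLOWCELLA:1:1101:1500:2100 1:N:0:ACGT",
    "@INSTR1:42:FLOWCELLA:2:1101:1200:2050 1:N:0:ACGT",
    "@INSTR2:7:FLOWCELLB:1:2203:900:1800 1:N:0:ACGT",
]

# Count reads per (flow cell, lane); each key is a candidate platform unit.
reads_per_unit = Counter(flowcell_and_lane(name) for name in example_names)
for (flowcell, lane), count in sorted(reads_per_unit.items()):
    print(f"{flowcell}.{lane}\t{count}")
```

A file whose reads all share one flow cell (and one sample and library) can reasonably be given a single read group; a file that mixes flow cells is a hint that it covers more than one.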