Question

read group, fastq files, and multiple run ids

4.2 years ago

Cricket ▴ 10

I have paired FASTQ files that I need to run through a variant detection pipeline. The only information I have on the files is what I can glean from the sequence identifiers. The files are quite large (~40G per direction), and upon closer inspection, I discovered that they contain several run IDs.

@E00566:94:HLFT5CCXY:8:2224:9993:9660
@E00566:93:HM5CHCCXY:1:1101:10003:10275

I think it is best to split the files by run ID, define read groups based on the Broad Institute's documentation (see below), run each split through the steps to get to a GVCF file, then merge all the GVCF files that belong to the same sample. However, I am concerned about marking duplicates across multiple runs.
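As a rough sketch of the read group step, the fields can be pulled from the Illumina read name, which is laid out as instrument:run:flowcell:lane:tile:x:y. The function names and the `sample` and `library` values here are hypothetical, since neither can be recovered from the headers. The `ID` is built as flowcell.lane, following the GATK read group article linked below; `PU` normally also carries a sample barcode, which these headers do not show, so it is left as flowcell.lane.

```python
def parse_illumina_header(header):
    """Split an Illumina read name into its colon-separated fields."""
    # Drop the leading '@' and anything after the first whitespace.
    name = header.lstrip("@").split()[0]
    instrument, run, flowcell, lane, tile, x, y = name.split(":")[:7]
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
    }


def read_group_line(header, sample, library):
    """Build an @RG header line from one read name.

    `sample` and `library` are assumptions supplied by the user;
    they are not encoded in the read name.
    """
    fields = parse_illumina_header(header)
    unit = f"{fields['flowcell']}.{fields['lane']}"
    return "\t".join([
        "@RG",
        f"ID:{unit}",
        f"PU:{unit}",  # no sample barcode available in these headers
        f"SM:{sample}",
        f"LB:{library}",
        "PL:ILLUMINA",
    ])


print(read_group_line("@E00566:94:HLFT5CCXY:8:2224:9993:9660", "sample1", "lib1"))
```

Whether both flowcells should share one `LB` value depends on whether the same library was sequenced on both, which the headers alone cannot tell you.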

The Broad Institute kinda implies this in their updated read group description (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671), but this comment (C: Adding read group to bam files from multiplexed samples) implies something else.

If someone could weigh in on the appropriateness of my approach as well as my duplicates concerns I would be more than grateful.

picard gatk fastq readgroups next-gen • 1.2k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 4.2 years ago by Cricket ▴ 10

"However, I am concerned about marking duplicates across multiple runs."

Was the sample library/pool sequenced on multiple flowcells?

ADD REPLY • link 4.2 years ago by GenoMax 141k

Yes. I believe this can be determined from looking at the sequence identifiers, which also show different lanes (though on their own, I don't think the lanes are too important):

@E00566:94:HLFT5CCXY:8:2224:9993:9660
@E00566:93:HM5CHCCXY:1:1101:10003:10275
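To see exactly which run, flowcell and lane combinations are present before splitting, one option is to tally them from the header lines. This is a minimal sketch that assumes well-formed four-line FASTQ records; `count_run_units` is a hypothetical helper name.

```python
from collections import Counter


def count_run_units(fastq_lines):
    """Count reads per (run, flowcell, lane) from an iterable of FASTQ lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # first line of each 4-line record is the header
            fields = line.lstrip("@").split()[0].split(":")
            counts[(fields[1], fields[2], fields[3])] += 1
    return counts


example = [
    "@E00566:94:HLFT5CCXY:8:2224:9993:9660\n", "ACGT\n", "+\n", "IIII\n",
    "@E00566:93:HM5CHCCXY:1:1101:10003:10275\n", "TTGA\n", "+\n", "IIII\n",
]
for unit, n in sorted(count_run_units(example).items()):
    print(unit, n)
```

For a real file, the same function can be fed an open handle, for example from `gzip.open(path, "rt")` if the FASTQ is compressed, so the 40G file is streamed rather than loaded into memory.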

ADD REPLY • link 4.2 years ago by Cricket ▴ 10

No. Those are just flowcell barcodes. Do you know if the same library/pool was run on both of those flowcells? If so, you could consider those runs as technical replicates.

ADD REPLY • link 4.2 years ago by GenoMax 141k

Ohhh...that I do not know, nor can I determine it or ask anyone. The only things I can glean from the files are what is listed in the sequence headers. Given that I don't have this information, would you suggest splitting as I mentioned previously?

ADD REPLY • link 4.2 years ago by Cricket ▴ 10

Do you think someone merged files from multiple runs because they were technical replicates to begin with? Otherwise that sounds like a strange thing to do. It is certainly not making your life any easier.

ADD REPLY • link 4.2 years ago by GenoMax 141k

Darn tootin'! I really don't know. Nor do I know how many hands the files passed through before they got to me. I do know that the investigator wanted very high coverage (very rare disease).

ADD REPLY • link 4.2 years ago by Cricket ▴ 10

Then I would tend to think that these are technical replicates, but one can't be sure without evidence. You could separate the files and look for common SNPs to confirm.
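One way to sketch that check: call variants separately per flowcell, then compare genotypes at the sites both call sets share. The helper names below are hypothetical, and the input is assumed to be a dict mapping (chromosome, position) to a genotype string, however you choose to extract it from your VCFs. No concordance threshold is implied here; what counts as "same sample" is a judgment call.

```python
def normalize_genotype(gt):
    """Treat phased and unphased calls alike and ignore allele order."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def genotype_concordance(calls_a, calls_b):
    """Return (number of shared sites, fraction with matching genotype)."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0, None
    matches = sum(
        1 for site in shared
        if normalize_genotype(calls_a[site]) == normalize_genotype(calls_b[site])
    )
    return len(shared), matches / len(shared)


# Toy call sets, purely illustrative.
flowcell_a = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/1"}
flowcell_b = {("chr1", 100): "1|0", ("chr1", 200): "0/1", ("chr3", 5): "1/1"}
print(genotype_concordance(flowcell_a, flowcell_b))
```

High concordance across many shared SNPs would be consistent with both runs coming from the same individual; it does not by itself prove they came from the same library.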

ADD REPLY • link 4.2 years ago by GenoMax 141k

Woof. Having said that, that is the best recommendation I have heard. Thank you @genomax.

ADD REPLY • link 4.2 years ago by Cricket ▴ 10