Question: STAR add RG
0
gravatar for umn_bist
4.1 years ago by
umn_bist370
umn_bist370 wrote:

So I am transitioning into using STAR as our primary aligner for RNA-seq workflow and was curious if there's a way to add RG without knowing what it is beforehand - primarily because I am working with publicly available TCGA fastq files. These were not sequenced in-house.

I see --outSAMattrRGline and --outSAMreadID under STAR options but 1) the file needs to be in SAM/BAM format to extract RG (?) but our premapped files are in fastq format and 2) to add RGID I need to manually load (have prior knowledge of the RGID of our samples).

I looked up the filename format for TCGA files but it does not involve specific read group info. I am wondering if there's a way to extract RGID from fastq file and automatically input into the aligner for downstream analysis using GATK.

rna-seq rgid star gatk • 2.4k views
ADD COMMENTlink modified 3.8 years ago by Biostar ♦♦ 20 • written 4.1 years ago by umn_bist370
2
gravatar for Devon Ryan
4.1 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

You're over thinking this.

You already know the read group information for the TCGA samples, it's whatever you want it to be.

Also, fastq files don't contain meta-data.

ADD COMMENTlink written 4.1 years ago by Devon Ryan94k

Oh okay, I see. Double checking - is it recommended/preferred to have the same RG for paired-end reads? I will ultimately have them merged into a single bam for variant calling downstream.

ADD REPLYlink written 4.1 years ago by umn_bist370
1

It is my understanding (although refer to Devon when he comes back) that the RGID is supposed to be unique to the lane/chip/sequencer as its the primary unit GATK uses to model noise in the data.
The samples ID (RGSM) needs to be the same for all samples you want to merge together and call SNPs on.

So my RGID is like "RG:Z:DBV2SVN1.D263EACXX.7.CTTGTA" in one bam,
"RG:Z:DBV2SVN1.D263EACXX.3.AATTCA" in another bam, (same chip, different lane and multiplex barcode),
but in BOTH bams the RGSM is "C57BL/6J" because they're both from the same isogenic mouse line.

When i merge them together, GATK will (hopefully) be able to call SNPs right... lets see :P

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k
1

I'm in complete agreement with all of your comments :) For samples that you download you typically just use the SRA/ENA/whatever accession as the sample ID and library, unless the metadata online happens to suggest something more reasonable.

ADD REPLYlink written 4.1 years ago by Devon Ryan94k

Since these files are single samples, I think I'm going to add an arbitrary RGID after merging paired-ends using picard (addorreplacereadgroup function). For variant calling (we're looking at rare, uncharacterized somatic mutations) is it preferred to merge all my samples into a single bam or call variants individually/separately?

ADD REPLYlink written 4.1 years ago by umn_bist370
2

Again, im not the authority on this - just someone who's read the docs a lot lately.
My understanding is that it's better to do joint calling than to do the calling individually and then look at overlapping (or all) SNPs called. To do joint calling however, you don't have to merge the BAM - you just have to make gVCFs for each BAM, and then run the joint caller on those gVCFs to 'merge' them at the variant level. However, as I found out the hard way, those gVCFs will have the sample ID in them, and they need to be the same for them to be joined.
So yeah, just give everything you want to treat as "the same" (genetically) the same sample ID, and you should be fine.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by John12k

Awesome. The info will be very helpful. Thank you! Upvoted.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by umn_bist370
1

You should be aligning mates in a pair at the same time...

ADD REPLYlink written 4.1 years ago by Devon Ryan94k
0
gravatar for umn_bist
4.1 years ago by
umn_bist370
umn_bist370 wrote:

Deleted. See comment above.

ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by umn_bist370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1500 users visited in the last hour