Question

STAR_2PASS for SNP calling from RNA-seq data

0

9.5 years ago

thjnant ▴ 160

Hello,

I am following the STAR_2PASS step of the GATK pipeline to call SNPs from RNA-seq data.

I have run the first-pass alignment for my 6 samples. For the second pass, I need to run this command:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

For this option:

--sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab

Should I use the SJ.out.tab file from just one of my samples for all of them, or should each sample use its own SJ.out.tab file?

Thanks in advance

RNA-Seq star • 3.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thjnant ▴ 160

0

I would think that you'd get the best results from merging the SJ.out.tab files from all samples and then using the merged file.

ADD REPLY • link 9.5 years ago by Devon Ryan 104k
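A minimal sketch of the merge suggested above. It assumes that genome generation only needs the first four columns of SJ.out.tab (chromosome, intron start, intron end, strand), so the per-sample read-count columns can be dropped and duplicate junctions collapsed. The function name `merge_sj` and the paths are hypothetical. The STAR manual also describes `--sjdbFileChrStartEnd` as accepting more than one file, which may make a manual merge unnecessary; check this against the version you run.

```shell
# Hypothetical helper: merge SJ.out.tab files from several first-pass runs.
# Keeps only columns 1-4 (chromosome, intron start, intron end, strand)
# and drops junctions that appear in more than one sample.
merge_sj() {
    cut -f1-4 "$@" | LC_ALL=C sort -u
}

# Example (paths are placeholders):
#   merge_sj /path/to/1pass/sample*/SJ.out.tab > /path/to/1pass/merged.SJ.tab
```

The merged file would then be the one passed to `--sjdbFileChrStartEnd` when building the second-pass genome.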

1

Or by running the STAR first pass on a large subset of your entire dataset, i.e. FASTQ files from multiple representative samples (or all of them).

ADD REPLY • link 9.5 years ago by Sean Davis 26k
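One rough way to build such a subset is to take the first N reads from each FASTQ file. This is only a sketch: it assumes standard 4-line FASTQ records, the function name `first_n_reads` is hypothetical, and taking the head of a file is not a random sample. For paired-end data, use the same N for both mate files so the pairs stay in sync.

```shell
# Hypothetical helper: pass through the first N reads of a FASTQ stream.
# Assumes each record is exactly 4 lines (header, sequence, '+', qualities).
first_n_reads() {
    head -n "$(( $1 * 4 ))"
}

# Example (file names are placeholders):
#   gzip -cd sample1_R1.fastq.gz | first_n_reads 10000000 > sample1_R1.subset.fastq
#   gzip -cd sample1_R2.fastq.gz | first_n_reads 10000000 > sample1_R2.subset.fastq
```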

0

Yup and that'd probably be a bit faster since you don't need all of the instances to run to completion. Do you happen to know if anyone's looked for an optimal subset percentage? While the real value will vary, I expect there's a decent ball-park starting place to be found (perhaps as a function of total number of reads).

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

0

If you believe the old RUM paper, perhaps 40-100M reads will get you the vast, vast majority of splice junctions that are available in a dataset. One can always test by simply staging the analysis. Run 5%, 10%, 15%, etc. to see where the return plateaus, but that is probably overkill.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Sean Davis 26k
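If someone does want to try the staged approach, the plateau can be checked by counting unique junctions after each stage. A sketch, assuming each staged run wrote its own SJ.out.tab and that a junction is identified by its first four columns; the function name `count_unique_sj` and the directory names are hypothetical.

```shell
# Hypothetical helper: count unique junctions (chromosome, start, end, strand)
# across one or more SJ.out.tab files.
count_unique_sj() {
    cut -f1-4 "$@" | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Example (directory names are placeholders): one line per staged run.
#   for pct in 5 10 15; do
#       printf '%s%%\t%s\n' "$pct" "$(count_unique_sj "run_${pct}pct/SJ.out.tab")"
#   done
```

When the count stops growing noticeably from one stage to the next, adding more reads is unlikely to reveal many new junctions.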

0

The rarefaction curve route would end up taking as long as just processing everything at once (well, unless you really had a LOT of samples). 40-100M reads seems reasonable.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Devon Ryan 104k