Question: STAR_2PASS for SNP calling from RNA seq data
thjnant (Germany) wrote, 5.0 years ago:

Hello,

I am going through the STAR 2-pass step of the GATK pipeline to call SNPs from RNA-seq data.

I have run the first round of alignment for my 6 samples. Now I am at the second round, where I must run this command:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

For this option:

    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab

 

Should I take the SJ.out.tab file from only one of my samples and use it for all the others, or should each sample use its own SJ.out.tab file?

Thanks in advance.

Tags: rna-seq, star

I would think that you'd get the best results from merging the SJ.out.tab files from all samples and then using the merged file for the second pass.

written 5.0 years ago by Devon Ryan
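A minimal sketch of the merge suggested above. It assumes the standard 9-column SJ.out.tab layout (column 7 holding the number of uniquely mapped reads that cross the junction) and per-sample output directories under /path/to/1pass. The function name `merge_sj` and its read-support threshold are illustrative choices, not STAR options.

```shell
# merge_sj MIN_UNIQ FILE...
# Combine first-pass SJ.out.tab files into one de-duplicated junction list.
# Keeps columns 1-4 (chr, intron start, intron end, strand code) of every
# row whose unique-read count (column 7) is at least MIN_UNIQ, so a junction
# survives if at least one sample supports it that well.
merge_sj() {
    min_uniq="$1"
    shift
    awk -v min="$min_uniq" 'BEGIN { OFS = "\t" } $7 >= min { print $1, $2, $3, $4 }' "$@" \
        | sort -u -k1,1 -k2,2n -k3,3n -k4,4n
}

# Example (hypothetical sample directories):
#   merge_sj 3 /path/to/1pass/sample*/SJ.out.tab > /path/to/1pass/SJ.merged.tab
```

The merged file would then be passed to `--sjdbFileChrStartEnd` in the genomeGenerate command from the question. The STAR manual also describes supplying several SJ.out.tab files to `--sjdbFileChrStartEnd` directly, so it is worth checking the documentation for your STAR version before merging by hand.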

Or by running the first pass of STAR on a large subset of your entire dataset, i.e. FASTQ files from several representative samples (or from all of them).

written 5.0 years ago by Sean Davis
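A rough sketch of how such a first-pass subset could be pulled from each sample. It simply takes the first N records of a FASTQ file, which is not a random sample and assumes plain 4-line records; the function name `first_reads` and the file names in the example are hypothetical.

```shell
# first_reads N FILE
# Print the first N records (4 lines each) of a FASTQ file, gzipped or plain.
first_reads() {
    n="$1"
    fq="$2"
    case "$fq" in
        *.gz) gzip -dc "$fq" | head -n "$((n * 4))" ;;
        *)    head -n "$((n * 4))" "$fq" ;;
    esac
}

# Example (hypothetical file names): 10 million reads from one sample.
#   first_reads 10000000 sample1_R1.fastq.gz > subset/sample1_R1.fastq
# For paired-end data, use the same N on both mate files so pairs stay in sync.
```

The subset files from all samples can then be given to a single first-pass STAR run, and its SJ.out.tab used for the second-pass genome.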

Yup, and that'd probably be a bit faster, since you don't need all of the instances to run to completion. Do you happen to know if anyone's looked for an optimal subset percentage? While the real value will vary, I expect there's a decent ballpark starting point to be found (perhaps as a function of the total number of reads).

written 5.0 years ago by Devon Ryan

If you believe the old RUM paper, perhaps 40-100M reads will get you the vast, vast majority of the splice junctions that are available in a dataset. One can always test this by simply staging the analysis: run 5%, 10%, 15%, etc. of the reads and see where the number of newly found junctions plateaus, but that is probably overkill.

written 5.0 years ago by Sean Davis
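For anyone who does want to try the staged check, the plateau can be read off by counting distinct junctions at each stage. This is only a sketch: it assumes one SJ.out.tab per staged first-pass run in a hypothetical `staged/<percent>/` layout, and `count_junctions` is an illustrative helper name.

```shell
# count_junctions FILE...
# Number of distinct junctions (chr, intron start, intron end) across the
# given SJ.out.tab files.
count_junctions() {
    cut -f1-3 "$@" | sort -u | wc -l | tr -d ' '
}

# Example (hypothetical layout: staged/05/SJ.out.tab, staged/10/SJ.out.tab, ...):
#   for d in staged/*/; do
#       printf '%s\t%s\n' "$(basename "$d")" "$(count_junctions "${d}SJ.out.tab")"
#   done
```

When the count stops growing noticeably from one stage to the next, adding more reads to the first pass is unlikely to add many junctions.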

The rarefaction curve route would end up taking as long as just processing everything at once (well, unless you really had a LOT of samples). 40-100M reads seems reasonable.

written 5.0 years ago by Devon Ryan