Question: STAR aligner with singletons and short reads
gravatar for umn_bist
4.7 years ago by
umn_bist380 wrote:

So before mapping, I run trim my files using BBDuk and PRINSEQ - and PRINSEQ outputs separate singleton.fa for each of my paired-ends so that I end up with 4 total files (out1.fa, singleton1.fa, out2.fa, singleton2.fa)

${BBDuk} -Xmx120g in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}"  ref="${adapter}" trimq=10

paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}" 
paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}" 

perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 -custom_params "A 75%;T 75%;G 75%;C 75%" min_qual_mean 25 -min_len 40 -out_format 3 -out_good "${file1%_1.fastq}_tsa" -out_bad null -log

With TopHat I fed all 4 files for alignment but from my understanding, STAR is singleton aware (?) and only needs the PE as inputs, but is it still possible to feed all 4 files?

#${STAR} --runMode alignReads --twopassMode Basic --runThreadN 24 --outSAMtype BAM SortedByCoordinate --sjdbGTFfile "${sjdb}" --genomeDir "${reference}" --readFilesIn "${file7}" "${file8}"

Lastly, Alex recommends to use a large value for --sjdbOverhang (length of seq on each side of splice junction) like 100. But knowing that my reads may be very short I would like to find the optimal value (mate_length - 1).

How could I go about, first, finding mate_length and then automatically feeding the value without intervention for each of my RNA seq files?

EDIT: I saw in GATK RNA-seq variant calling best practice that they use --sjdbOverhang 75 and was curious if I should just set a hard value for all of my files instead (like 75 or default 100).

rna-seq star • 1.5k views
ADD COMMENTlink modified 4.6 years ago by Biostar ♦♦ 20 • written 4.7 years ago by umn_bist380

To do adapter-trimming with BBDuk, you also need to specify a few other parameters. Your above command will do adapter-filtering, not adapter-trimming. You can also set a min length cutoff and min quality cutoff if you wish with the minlen and maq flags (minlen=40 maq=25) but personally I think a minimum average quality of 25 is extremely high and likely to be detrimental.

{BBDuk} -Xmx120g in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}"  ref="${adapter}" trimq=10 qtrim=r ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=40 outm=discarded.fq

When a pair has a read that falls below your minimum length threshold, it's easiest to discard the entire pair than to process both the pairs and singletons. I don't really recommend that in most situations where these singletons are only a very small percentage of your total data.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1258 users visited in the last hour