Question

STAR aligner with singletons and short reads

0

Entering edit mode

8.2 years ago

umn_bist ▴ 390

So before mapping, I run trim my files using BBDuk and PRINSEQ - and PRINSEQ outputs separate singleton.fa for each of my paired-ends so that I end up with 4 total files (out1.fa, singleton1.fa, out2.fa, singleton2.fa)

${BBDuk} -Xmx120g in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}"  ref="${adapter}" trimq=10

paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}" 
paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}" 

perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 -custom_params "A 75%;T 75%;G 75%;C 75%" min_qual_mean 25 -min_len 40 -out_format 3 -out_good "${file1%_1.fastq}_tsa" -out_bad null -log

With TopHat I fed all 4 files for alignment but from my understanding, STAR is singleton aware (?) and only needs the PE as inputs, but is it still possible to feed all 4 files?

#${STAR} \
  --runMode alignReads \
  --twopassMode Basic \
  --runThreadN 24 \
  --outSAMtype BAM SortedByCoordinate \
  --sjdbGTFfile "${sjdb}" \
  --genomeDir "${reference}" \
  --readFilesIn "${file7}" "${file8}"

Lastly, Alex recommends to use a large value for --sjdbOverhang (length of seq on each side of splice junction) like 100. But knowing that my reads may be very short I would like to find the optimal value (mate_length - 1).

How could I go about, first, finding mate_length and then automatically feeding the value without intervention for each of my RNA seq files?

EDIT: I saw in GATK RNA-seq variant calling best practice that they use --sjdbOverhang 75 and was curious if I should just set a hard value for all of my files instead (like 75 or default 100).

RNA-Seq STAR • 2.3k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.2 years ago by umn_bist ▴ 390

0

Entering edit mode

To do adapter-trimming with BBDuk, you also need to specify a few other parameters. Your above command will do adapter-filtering, not adapter-trimming. You can also set a min length cutoff and min quality cutoff if you wish with the minlen and maq flags (minlen=40 maq=25) but personally I think a minimum average quality of 25 is extremely high and likely to be detrimental.

{BBDuk} -Xmx120g in1="${file1}" in2="${file2}" out1="${file3}" out2="${file4}"  ref="${adapter}" trimq=10 qtrim=r ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=40 outm=discarded.fq

When a pair has a read that falls below your minimum length threshold, it's easiest to discard the entire pair than to process both the pairs and singletons. I don't really recommend that in most situations where these singletons are only a very small percentage of your total data.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k