I am analyzing my RNA-seq data using Trimmomatric for trimming, Tophat2, bowtie for alignments/mapping, samtools for sort, and htseq for counting. However, after aligning my sequencing on the genome reference using tophat2 and bowtie, my multiple alignment % is 66.2% which is too high to be validated. Would you mind to review my script and give any suggestions? THANK YOU!!
Trimmomatric
java -jar /home/apps/trimmomatic/trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 12 -phred33 -trimlog KC1_trimmomatic_log.txt ~/data/genome_raw_seq/bicolor_87_C12_1.fastq ~/results/trimmomatic/bicolor_87_C12_1_trimmed.fastq ILLUMINACLIP:myAdapters.fa:2:30:10 TRAILING:28 LEADING:28 MINLEN:30
tophat2
tophat2 -o ~/results/align/87C12_1_align_ERtophat --read-mismatches 0 --read-gap-length 0 --read-edit-dist 0 --max-insertion-length 0 --min-intron-length 25 --num-threads 16 --max-multihits 10 --microexon-search --library-type fr-firststrand --b2-very-sensitive --GTF ~/PhytozomeV11/Sbicolor/annotation/Sbicolor_313_v3.1.gene_exons.gff3 Sbicolorv3 ~/results/trimmomatic/bicolor_87_C12_1_trimmed.fastq
samtools
samtools sort -n ~/results/align/87C12_1_align/accepted_hits.bam ~/results/align/87C12_1_align/87C12_1_ER_sorted
htseq
htseq-count -q -f bam -m intersection-nonempty -s no -t gene -i ID ~/results/align/87C12_1_align/87C12_1_ER_sorted.bam ~/PhytozomeV11/Sbicolor/annotation/Sbicolor_313_v3.1.gene_exons.gff3 > ~/results/Counts/COUNToutputfile_87_C12_1.txt
Have you checked to see if the multi-mappers represent rRNA reads?
I thought about rRNA reads but is it necessary to remove rRNAs before aligning?
No, it's not necessary. But if a large fraction of your reads are rRNA then a lot of multimapping reads is not unexpected.
With regard to all parameters you specified for tophat2: unless you know what you are doing it's always better to stick to the defaults.
That said, you should know that the old 'Tuxedo' pipeline of Tophat and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using kallisto or salmon.
Thank you!!
I was trying to use STAR, but having a hard time to produce genome_index. Do you know where I could find any template scripts of HISAT2 and StringTie and Ballgown. Thank you!
If I'm not mistaken the manual is fairly clear about this.
The manual, presumably.
Just a detail, but what you call "a script" here, is more commonly called "a command".