Question

Low alignment with Trinity but high alignment in SAM files

0

Entering edit mode

5.7 years ago

Moneeb Bajwa ▴ 10

Hello,

I have performed alignment with Trinity and got only around 2% alignment. My SAM files and trimmed reads seemed to be fine; the SAM files had high alignment rate based on what was returned to me with samtools. Is this possible or am I doing something wrong with Trinity?

I appreciate the help

alignment genome RNA-Seq next-gen Trinity • 2.7k views

ADD COMMENT • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by Moneeb Bajwa ▴ 10

1

Entering edit mode

Could you please share the commands used for both alignment as well for getting stat from SAM.

ADD REPLY • link 5.7 years ago by Jeffin Rockey ★ 1.3k

0

Entering edit mode

I just used gsnap for the SAM files by aligning trimmed mouse reads to the mouse genome (GRCm38.primary_assembly.genome.fa from GENCODE). I used gmap to build the database first and built intron index with the gencode.vM18.primary_assembly.annotation.gff3 file. I used this for the SAM stat:

#Count the total number of reads in the alignment
total="$(samtools view -c $samFile)"
#Count the aligned reads in the alignment
mapped="$(samtools view -F4 -c $samFile)"

For Trinity I did this (they were single-end, strand-specific forward reads):

nice -n19 $trinityPath/util/align_and_estimate_abundance.pl \
--transcripts Trinity-GG.fasta --aln_method bowtie2 --est_method eXpress \
--trinity_mode --output_dir $outPath$sample --seqType fq --SS_lib_type F \
--single $Reads --thread_count 1

And for the Trinity-GG.fasta I did this:

Trinity --genome_guided_bam MergedBams.bam \
--genome_guided_max_intron 10000 \
--max_memory 10G --CPU 4

ADD REPLY • link 5.7 years ago by Moneeb Bajwa ▴ 10

1

Entering edit mode

What is your library prep kit? Is it really stranded? Is it from Illumina? In general, stranded Illumina kits use dUTP method, with Trinity, you should use --SS_lib_type R for single-end reads, and --SS_lib_type RF for paired-end reads.

Are the reads you are trying to align the same used to assemble the transcriptome?

Are you working with Mus musculus? Do you really need to assemble a transcriptome?

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Here are the reads' link which gives all the info needed: https://www.ncbi.nlm.nih.gov/bioproject/PRJDB5631. I need to do differential expression analysis so I need to assemble the transcriptome right? Also the merged BAM file was only like 10G while the SAM files in total were around 150G; is that normal?

ADD REPLY • link 5.7 years ago by Moneeb Bajwa ▴ 10

0

Entering edit mode

Yea actually it is still not working after rerunning the sorting and merging; the BAM files are quite small still and I believe that is messing everything up

ADD REPLY • link 5.7 years ago by Moneeb Bajwa ▴ 10

0

Entering edit mode

I see a similar issue here in this post: Sam To Bam - Loss Of Data Or Just Great Compression?. Should I just accept the Trinity results and go ahead with differential expression analysis? These also are cancer derived organoids if that makes a difference.

ADD REPLY • link 5.7 years ago by Moneeb Bajwa ▴ 10

score 2 · Accepted Answer · 2018-07-27

First of all I will answer the question you asked on a comment, but it is the most important question here:

I need to do differential expression analysis so I need to assemble the transcriptome right?

You do not need to assemble the transcriptome to perform gene / transcript differential expression analysis. There is a very, very good genome assembly for mouse. Also, there is a very, very good annotation for the mouse genome - and consequently, there is a very good transcriptome for the mouse. You do not need to assemble, it is better to map / count using the existing reference genome / transcrptome. In addition, the reads from the project you linked are single-end and only 36bp long, they will not provide a good transcriptome assembly.

You can perform gene-level differential expression by first mapping the reads to the genome with STAR to get gene expression counts (as I have told you before here), and then using DESeq2 or edgeR to perform the differential expression statistical analysis.

You can perform transcript-level differential expression quantifying transcript counts with Salmon or kallisto, and performing the statistical analysis with sleuth.

Now, for your original question:

I did map one of the files to the mouse genome with STAR, obtaining gene counts:

Sox17   9368    81      9287
Mrpl15  4473    30      4443
Lypla1  3122    20      3102
Tcea1   397     6       391

One can use these counts to infer strandedness of the reads. I know the protocol is stranded, so the first column of counts is automatically excluded, as it represents unstranded counts (check STAR manual). It is easy to see then that the true counts are in the fourth column, which the STAR manual describes as:

column 4:  counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

This means the library prep kit produces what we call "reverse" stranded reads. The Trinity manual says the correct parameter to indicate this library type is --SS_lib_type R, which is the opposite of what you are using in your call to the align_and_estimate_abundance.pl script. So maybe if you use align_and_estimate_abundance.pl --SS_lib_type R you will get good mapping rate.

But remember, you do not need to assemble the transcriptome.