3' RNA-seq (Lexogen Quant-seq) Mapping Problems with Salmon
3.5 years ago
bdy8 ▴ 80

Hello all. First of all, I apologise if I miss any information out; this is only my second Biostars post, so please let me know if anything is missing.

I have 3' RNA-seq reads prepped with the Lexogen QuantSeq FWD kit. Read counts range from 4 million to 15 million per sample. My species is a stony coral (Acropora palmata).

I have so far done adapter and quality trimming as recommended by Lexogen, and checked the results with FastQC (everything looks good):

bbduk.sh in=stdin.fq out=${i}_trimmed_clean ref=/data/resources/polyA.fa.gz,/data/resources/truseq_rna.fa.gz k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20
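For reference, the command above reads from stdin (in=stdin.fq) and uses a loop variable ${i}; a minimal sketch of how it might be driven per sample, assuming gzipped single-end FASTQ files named ${i}.fastq.gz (the file naming and loop are assumptions, not taken from the post):

# hypothetical per-sample loop; the .fq extension on the output tells bbduk to write FASTQ
for i in sample1 sample2; do
    zcat ${i}.fastq.gz | bbduk.sh in=stdin.fq out=${i}_trimmed_clean.fq \
        ref=/data/resources/polyA.fa.gz,/data/resources/truseq_rna.fa.gz \
        k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20
done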

I am now having some problems, as I am trying two methods.

1. STAR align and Salmon quant.

I am using the STAR alignment parameters recommended by Lexogen (see the code chunk below, which is taken from their website and so does not include my exact read paths). My STAR index was built from the gff3 and masked FASTA available for the Acropora genome. I have added the extra flags --outFilterScoreMinOverLread 0.5 and --outFilterMatchNminOverLread 0.5, which results in unique mapping of around 65% for all samples. I probably do have reads from the symbiotic Symbiodinium, and 65% is in line with values reported in other coral TAG-seq studies.

STAR --runThreadN 8 --genomeDir /data/star/human --readFilesIn fastq/${sample}_R1.fastq --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMattributes NH HI NM MD --outSAMtype BAM SortedByCoordinate --outFileNamePrefix star_out/${sample}
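The Salmon step of this route is not shown above; a minimal sketch of alignment-based Salmon quantification, assuming STAR is also run with --quantMode TranscriptomeSAM so that a transcriptome-coordinate BAM (Aligned.toTranscriptome.out.bam) is produced (that flag is not in the Lexogen command above, and paths are shortened for readability):

# -l SF is the QuantSeq FWD library type (see the resolution at the end of this thread)
salmon quant -t Apal_3.1_gff_t_.fa -l SF -p 8 -a star_out/${sample}Aligned.toTranscriptome.out.bam -o salmon_aln_quant/${sample}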
2. Salmon index and Salmon quant.

I have also been testing the Salmon Index and Quant for my data. Here is where my query comes in.

I generate a transcriptome FASTA using gffread for Salmon indexing. These are the same FASTA and gff3 files as used for STAR.

gffread -F -w /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa -g /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_200911.masked.fasta /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_notrna_200911.gff3
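A quick sanity check on the extracted transcriptome is to count the sequences and eyeball the headers; the count should match the number of transcripts in the gff3 (~32k here):

grep -c ">" /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa    # number of transcript sequences
grep ">" /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa | head    # do the IDs look sensible?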

I then utilise this in the Salmon index command. As I have short reads I have gone with a small k value (k = 9 in the run below; I have also tried larger values, see the replies further down).

salmon index -t /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa -i /nethome/bdy8/apal_genome/version3.1/Apal_trans_index_3.1 -k 9

This index is then used in Salmon quant.

salmon quant -i /nethome/bdy8/apal_genome/version3.1/Apal_trans_index_3.1 -l SR -p 12 --minScoreFraction 0.3 --noLengthCorrection -r /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/trimmed_reads/${PALPAL}_tr.fastq -o /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/salmon_quant/${PALPAL}_salmon

This is where I have problems. My mapping rate is ~3%, and a large number of mappings are being discarded because of alignment score ('Number of mappings discarded because of alignment score' in the log), which I do not understand. Comparing this to what I get from alignment with STAR, I would expect at least a 40% mapping rate. From reading other questions and the salmon quant help text, I have used --noLengthCorrection (recommended for QuantSeq data in the man page) and --minScoreFraction 0.3 to try to increase the mapping. Dropping --minScoreFraction from 0.3 to 0 increases the mapping rate to ~8%, but that is still awful. I have also tried older versions of the genome with no luck so far. I have also checked the library type reported in the Salmon quant outputs, and SR is what it detects my reads as.
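A useful place to check what Salmon actually inferred about the library, versus what was requested with -l, is the lib_format_counts.json file that Salmon writes into the quant output directory by default (QUANT_DIR below just points at the -o directory used above; exact field names can differ slightly between Salmon versions):

QUANT_DIR=/scratch/projects/transcriptomics/ben_young/POR/tagseq/host/salmon_quant/${PALPAL}_salmon
grep -E '"expected_format"|"compatible_fragment_ratio"' ${QUANT_DIR}/lib_format_counts.json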

On GenoMax's recommendation I have also tried the --softclip flag, and this had no discernible effect on the mapping rate.

Any and all advice would be seriously appreciated, and please let me know if you need any additional information. I appreciate the help.

RNA-Seq salmon index quant

I discussed this with @Rob Patro (author of salmon).

He has asked that you run Salmon with the --softclip option and report what happens. He thinks that, since this is an organism with a (presumably) not-great reference, paired with 3' sequencing, it's possible there is a lot of soft-clipping going on.

What fraction of your STAR-mapped reads is getting assigned to annotations? Is that the 40% you refer to?

He is busy today but will likely be visiting this thread tomorrow.


Thank you so much for this recommendation. I will run this now and update the answer below as well as in the question above.

Re my organism: the reference is not the best, but it is comparable to other coral genomes that have been used for 3' RNA-seq.

The 40% I refer to was just a rough estimate of what I would expect the Salmon index and quant to produce, based on reading other questions about mapping rates (i.e. I would expect to see a slight reduction compared with mapping to the genome).

Re the STAR-mapped reads, I apologize if I get this wrong. I have around 65% uniquely mapped, and when quantifying those alignments with Salmon I end up with around 25% quantified (of this, 90% are uniquely mapped and 10% ambiguously mapped, according to the salmon_quant.log file). Of these, I have counts for around 10k of the transcripts provided in the gff3 file (total number of transcripts = 32k).


Hi genomax, so incorporating --softclip increased the mapping rate marginally, by around half a percent:

[2020-11-03 16:43:17.801] [jointLog] [info] Mapping rate = 3.98194%

As before, the problem still seems to be the number of mappings and fragments discarded due to alignment score.

[2020-11-03 16:43:17.797] [jointLog] [info] Number of mappings discarded because of alignment score : 21,343,074
[2020-11-03 16:43:17.797] [jointLog] [info] Number of fragments entirely discarded because of alignment score : 2,829,440

One thing that has changed is in quant.sf: without --softclip I had ~6k transcripts with NumReads > 0.5, and with --softclip this has gone up to ~14k transcripts. I identified this with a simple awk (column 5 of quant.sf is NumReads):

awk -F"\t" '$5>0.01' quant.sf | wc -l

How long are your reads? I used k=19 for a published 2x25 bp dataset; k=11, as you are using, seems unreasonably small.


Hi there. They are around 40 bp long.

I have been trying everything and have altered the k value from 9 up to 25, with little to no change in the mapping rate.
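A rough sketch of that k sweep (paths shortened, not the exact script; -l SR is kept here as in the original runs, although -l SF turned out to be the correct library type, see the accepted answer below):

for K in 9 11 13 15 17 19 21 23 25; do
    salmon index -t Apal_3.1_gff_t_.fa -i Apal_trans_index_k${K} -k ${K}
    salmon quant -i Apal_trans_index_k${K} -l SR --noLengthCorrection -r ${PALPAL}_tr.fastq -o salmon_quant_k${K}
    grep "Mapping rate" salmon_quant_k${K}/logs/salmon_quant.log    # compare mapping rates across k
done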

3.5 years ago
bdy8 ▴ 80

Hi all

Thank you for the help; I'm an idiot.

A friend who has had a similar problem with Salmon in the past told me to double-check the -l SR parameter. Lo and behold, after going through the kit specifications, I needed -l SF for the QuantSeq FWD (it's even in the name).
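For anyone landing here later, the working call is simply the salmon quant command from the question with the library type changed from -l SR to -l SF (a reconstruction, with paths shortened for readability):

salmon quant -i Apal_trans_index_3.1 -l SF -p 12 --minScoreFraction 0.3 --noLengthCorrection -r ${PALPAL}_tr.fastq -o ${PALPAL}_salmon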

Hopefully, if anyone else encounters this problem, this post will help them find the fix without spending a whole day troubleshooting.

Thank you for the help from everyone I really appreciate it.


Happens to all of us. You can go ahead and accept this answer (green checkmark) to provide closure to this thread.
