Quantification of stranded RNA-seq data
1
0
Entering edit mode
10 weeks ago

Greeting.

I downloaded a dataset from my species of interest (link). I checked the publication but no information was given about the library preparation they used. To infer the standess of my sample I ran salmon on the data. Salmon then inferred the most likely library type to be ISR (inward, stranded reverse)

After trimming my data I aligned the samples using hisat2 and performed quantification using featureCounts using the following commands:

hisat2 --dta --rf  -x ../indexes/cro_index -1 fastq_trimmed/$SRA\_1_paired.trimmed.fastq -2 fastq_trimmed/$SRA\_2_paired.trimmed.fastq | samtools view -C -T ../cro_v2_asm.fasta - | samtools sort > cram_files_trimmed/$SRA.cram; featureCounts -p -a -s2 transcriptome_assembly.mRNA.gtf -t exon -g gene_id -o featureCounts/$SRA.tsv cram_files_trimmed/\$SRA.cram


However, only around 30-40~ of the aligned reads were assigned to a feature. Reruning featureCounts without the -s2 option increased the assignment to 60-70% which is more in line with other samples from the same species.

I'm not sure if I'm misinterpreting the -s option in featureCounts and the -s2 option only considers reads mapped in the reverse strand. Should featureCoutns be used with -s 0 in this case? Am I missing something?

featureCounts RNA-seq stranded Hisat2 • 544 views
0
Entering edit mode

You should ideally run salmon using the whole genome sequence as decoy (How does salmon deal with decoy? ). That is the recommendation from salmon developers. You are trying to compare two different methodologies so it is not surprising that there is no 1:1 correlation in read assignment. featureCounts ignores multi-mapped reads. salmon excels at assigning them to transcripts.

Comparison at DE result level is what you should look at.

0
Entering edit mode

Hi! I think you missed the point my question. I will update the post to make it more clear. I was not trying to compare salmon with featureCounts. I was not even trying to quantify using salmon. The only thing I wanted from salmon was to understand what type of library preparation method was used (i.e stranded or not stranded, fr or rf).

My objective is to quantify my samples using hisat2 + featureCounts

2
Entering edit mode

Way your post was written and the title quantification of stranded RNAseq led to my comment above.

In light of explanation take a look at infer_experiment.py from RSeQC and/or another tool GUESSmyLT. Examining the alignments in IGV is also helpful (Answer: Infering strand specificity of bam files using rseqc? )

0
Entering edit mode

I used the RSeQC script and I obtained the following results:

This is PairEnd Data

Fraction of reads failed to determine: 0.0065
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5002
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4932


From this result, it appears my library is not stranded. I find this weird since aligning with hisat2 with the --rf option had a > 95% alignment ratio

1
Entering edit mode

--rf does not change HISAT2 alignment algorithm, but it changes whether the read will be properly paired or not. Read mapped in proper pair is a flag (0x2) in sam/bam alignment format that highlight read pairs that mapped on the same chromosome, in a reasonable distance from each other, and in proper orientation (given by --rf in this case). The number of properly mapped reads is probably lower than it should be in your alignment file because of this.

0
Entering edit mode
10 weeks ago

Hi

First of all, -s2 is the correct featureCounts option for reversely stranded libraries, meaning that the read#1 is reverse to gene annotation and the read#2 is forward (so, no, -s2 option does not only considers reads mapped in the reverse strand).

The results of your quantification suggest that the libraries are unstranded, so -s0 would be the correct option. However, you need to confirm this (following GenoMax' excellent suggestions).

It would also be useful to know for what reason the mapped pairs are unmapped when you change -s0 to -s2. Look at the summary statistics of featureCounts and see what categories increases the most (e.g. unassigned(no_feature)) .

0
Entering edit mode

Hi! After checking GenoMax' suggestion It appears my library is unstranded. I did check the featureCounts summary, and most of the reads that were not assigned were in the "Unassigned_NoFeatures" category

0
Entering edit mode

Depending on the quality of annotation for this genome it is possible that the annotation is not complete so you could potentially have real data that you don't have annotation for.