Question

Can't count features with featurecount

0

Entering edit mode

3.0 years ago

avelarbio46 ▴ 30

Hello everyone!

I'm trying to count features of my reads from paired-end RNAseq. The samples are not so good, because we extracted from FFPE tissue, and for some samples the RNA was extracted long ago and preserved in -80°C. Still, I got about 10-40M reads uniquely aligned (15-90%) to the human genome with STAR on gencode GTF primary assemble version and HG38. Because this is RNA derived from FFPE, we observed that some samples had high rRNA contamination, which mostly translates to the low uniquely aligned reads. This is the code I used in star (mapping to non trimmed reads, as this was the best aligment)

 STAR --genomeDir /home/references --readFilesIn FILE1 FILE2
 --outReadsUnmapped None --twopassMode Basic --readFilesCommand "gunzip -c" --outSAMstrandField intronMotif --outSAMunmapped Within --chimSegmentMin 12  --chimJunctionOverhangMin 8 --chimOutJunctionFormat 1  --alignSJDBoverhangMin 10 --alignMatesGapMax 100000 --alignIntronMax 100000 --alignSJstitchMismatchNmax 5 -1 5 5 --outSAMattrRGline ID:GRPundef --chimMultimapScoreRange 3 --chimScoreJunctionNonGTAG -4 --chimMultimapNmax 20 --chimNonchimScoreDropMin 10 --peOverlapNbasesMin 12 --peOverlapMMp 0.1 --alignInsertionFlush Right --alignSplicedMateMapLminOverLmate 0 --alignSplicedMateMapLmin 30 --outFileNamePrefix /home/Análises/Teste1/star/twopass/OUTPUT --runThreadN 16 --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3

But I got a very specific problem with featurecounts program, from the subread package. We used the Total RNA stranded kit from Illumina with ribossomal RNA depletion. As observing the manual, I got that in Read 1, sequencing reads map to the antisense strand, and Read 2 maps to the sense strand (https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/samplepreps_truseq/truseq-stranded-total-rna-workflow/truseq-stranded-total-rna-workflow-reference-1000000040499-00.pdf).

But when using featurecounts with both -s 1 and then -s 2 aligns very little reads (1-4%). Using unstranded options, I get better results (7% to 70%). I'm also using the same GTF as I used in star aligment (gencode version 37)

I'm not sure what is the problem, as the reads should contain some strand information.

I used:

> parallel --verbose --link -j 4  'featureCounts -F GTF -s (0, 1 or 2)
> -f -p -T 8 -a /home/references/gencode.v37.primary_assembly.annotation.gtf  -o
> {1/.}.txt {1} ' ::: /home/star/twopass/*.fqAligned.out.sam

What am I missing?

RNAseq featurecounts • 1.9k views

ADD COMMENT • link 3.0 years ago by avelarbio46 ▴ 30

2

Entering edit mode

Some issues:

do a qc on your alignments with the output from star and bamtools stats or similar. How many of your alignments are in fact properly paired?
why -f, that summarizes on exon level, do you want that?
- what happens with options -M -O?
- I'd double check if the chromosome names are consistent between alignment and GTF
- I'd always specify -g and -t

ADD REPLY • link 3.0 years ago by Michael 54k

0

Entering edit mode

I'm not sure about three things:

What does -M and -O do exactly that could benefit in my case?

Why should I specify -t if I don't necessarily need exons or other types of RNA specifically? Verifying everything should be the best approach to compare multiple samples, including low quality ones

What is the diference on specifying the genome?

Anyways, I will try all options!

ADD REPLY • link 3.0 years ago by avelarbio46 ▴ 30

0

Entering edit mode

-M and -O add multi-mapping reads and reads that are map to genes that overlap other genes. In case you get higher rates using them, you know that many reads have that property. Specifying -g and -t is just a pre-caution. Not specifying them leads to using the defaults, not to count "everything", and the defaults may or may not be in alignment with your G[T/F]F file. -t exon -g gene_id are the defaults for GTF I think, but if your annotation is in GFF file instead, you might need to use -t exon -g parent. Do you think you might have a lot of reads in introns? -t transcript -g gene_id might work in this case.

If you have unexpected results, there is no way around doing a thorough QC at every step, especially with FFPE samples, where the RNA might be more degraded. Did you get RIN values for the extracted RNA?

ADD REPLY • link 3.0 years ago by Michael 54k

0

Entering edit mode

After using these options, I got decreased aligment in general than with "-F GTF -s 0 -p -T 8 -a ". I have not tried with -M and -O on 1 or 2 strandeness. Using the same genome, with -M and -O on -s 0 gave me less aligned reads!

ADD REPLY • link 3.0 years ago by avelarbio46 ▴ 30

0

Entering edit mode

That is a bit weird, so you cannot get around a QC. I suspect that your mRNA is degraded due to storage as FFPE and therefore only few reads are paired and only the most stable RNA species like rRNA remain.

Please do the following:

Check if the Nanondrop RIN values are "ok".
Provide the STAR log output from one of the samples
QC the BAM files with samtools stats and flagstats, look specifically for the proportion of properly paired reads. I suspect this is very low.
After you did those things, combine all your QC files, including fastqc output, using multiqc and post it.
Optionally: provide one of the worst bam files for inspection

The final outcome of this might be that the mRNA was too degraded for a useful analysis. There was maybe one or the other paper where FFPE samples were used, but that doesn't mean it always works or even in any other case. In the end, one might have analyzed first and foremost RNA-stability and not RNA-expression.

ADD REPLY • link 3.0 years ago by Michael 54k

0

Entering edit mode

I suspect that for some samples this is indeed the case! I'm trying to run STAR with default options because I ran it with STAR-fusion optimized options (found on the manual of STAR-fusion). For example, one sample had 77M reads, but only 1.8% aligned with feature counts. We expected aprox. 30% samples to not work, but I'm trying to be sure what samples are good, with the best pipeline. Currently, we expect a minimun of 10M features (my mean is 30M, even with low alignments). Because we did a ribo-depletion, some samples do have high rate of rRNA!

The RIN of some samples were ok, but I don't expect RIN to be a very good measure in the case of FFPE according to some papers.

This is the BAM of the worst sample I've found yet: https://transfer.sh/T8BHE/randomsample_coordinatesorted.bam

ADD REPLY • link 3.0 years ago by avelarbio46 ▴ 30