Problem obtaining RNA-seq counts for whole gene (intron and exon), exon only, and intron only
2
0
Entering edit mode
15 days ago
Ian 6.1k

Hello,

I have been trying to run RNA-seq analyses for the purpose of chrRNA-seq. My first chrRNA-seq analysis used mapped reads from STAR (no annotation) counting reads into full gene coordinates using featureCounts [1]. The group then wanted an exon only analysis, so I used the same STAR reads and featureCounts using the GTF file using an exon wise analysis [2]. Lastly they wanted an intron only analysis. To do this I took the gene counts of the previous two analysis and used R to subtract the exon counts from the full gene counts. HOWEVER, when running the final intron only counts through DESeq2 I was informed that there are negative counts, see extract of the count files below [3]. The biggest mystery to me is why the full gene count for ENSG00000001631.17 has nearly 0 counts, but the exon has thousands.

I would welcome any advice to explain what i am doing wrong. Something to do with strandedness...? Any more reliable method? Sundry thoughts that may be important: it is a stranded analysis. Negative values appear in ~9000 genes of ~65K.

Thank you!

[1] Intron + Exon:

Extract gene coords:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.v41.annotation.gtf", format="gtf")
genes <- genes(txdb)
write.table(as.data.frame(genes)[,-4], file="gencode.v41.annotation_genes.txt", col.names=F, sep="\t", quote=F)

Cut out first four columns to create the SAF file for feature Counts:

cut -f1-5 gencode.v41.annotation_genes.txt > gencode.v41.annotation_genes.saf

Add required header:

GeneID Chr Start End Strand

Remove PAR_Y genes:

cat gencode.v41.annotation_genes.saf | sed '/PAR_Y/d' > gencode.v41.annotation_genes_minus_PAR_Y.saf

Run featureCounts on a sample

featureCounts -F SAF -a gencode.v41.annotation_genes_minus_PAR_Y.saf -s 2 -p -C -T 8 -o KWR1_Neg_1_hg38_STAR_277a_defaults_noanno_featureCounts_whole_gene.txt KWR1_Neg_1_hg38_STAR_277a_defaults_noannoAligned.out.bam

[2] Exon only ('-t exon' is default):

featureCounts -a gencode.v41.annotation.gtf -s 2 -p -C -T 8 -o KWR1_Neg_1_hg38_STAR_277a_defaults_noanno_featureCounts_exons.txt KWR1_Neg_1_hg38_STAR_277a_defaults_noannoAligned.out.bam

[3] Extracts of counts files

Intron + exon counts:

ENSG00000001631.17      2       16      13      6       4       
ENSG00000002016.18      3132    5265    5222    3298    2828    
ENSG00000002079.14      0       13      10      4       0       
ENSG00000002330.14      588     1041    1064    782     766

Exon only counts:

ENSG00000001631.17      1611    2579    2875    2045    1582  
ENSG00000002016.18      915     1377    1400    847     778  
ENSG00000002079.14      0       0       0       3       0  
ENSG00000002330.14      170     189     218     160     154  

Intron only counts:

ENSG00000001631.17      -1609   -2563   -2862   -2039   -1578  
ENSG00000002016.18      2217    3888    3822    2451    2050   
ENSG00000002079.14      0       13      10      1       0    
ENSG00000002330.14      418     852     846     622     612   
rna-seq featurecounts • 521 views
ADD COMMENT
1
Entering edit mode

Whenever I encounter something like this I move in R with GenomicAlignments and do a targeted analysis (in your case you could focus on the 4 genes in your list).I

My instinct says either the strandedness is causing an issue or that the exon/intron distinction is causing bad counting. A question that might be important is how are counts overlapping both exons and introns being counted (like for alternative transcript isoforms).

Also how are you getting the exon/intron counts with -t exon,intron ?

ADD REPLY
0
Entering edit mode

Thank you for your reply @benformatics! Like you I wonder whether all the different transcripts within the initial gene footprint could do odd things to counting. Both runs of featureCounts in [1] and [2] do not set '-t' so the default is 'exon'. Although in [1] a SAF file is used and so -t isn't used (I think...). I will also try an unstranded count to see the affect.

ADD REPLY
3
Entering edit mode
15 days ago
DGTool ▴ 260

There was a similar issue on biostars before (How to extract intron counts from total RNA Sequencing?) where it was mentioned one possibility to include intronic regions in the GTF file (using e.g. AGAT) and then use that for feature counting.

On the other hand I was wondering whether it might be possible that gene level counts also include reads mapped to TSS/TES + UTR regions, but that shouldn't lead to negative numbers when you subtract exon from full gene (rather it should give you higher count values).
(There are other options such as -O which is responsible for summarising exons reads to their meta-features which could also affect the numbers you get, but would need to re-check the manual for how -t, -g, and -O interact with each other)

Could try that for gene-level counts with -t gene -g "gene_id", and for exon-level -t exon -g "gene_id" or something along those lines?

ADD COMMENT
1
Entering edit mode

Thank you for your reply @dgtool. -O looks interesting, might help if a fragment is both exon and intron in different transcripts. I looked at AGAT, another good suggestion. I do wonder now why I didn't just use -t gene instead of my convoluted SAF file, but then I am a fCount newbie.

ADD REPLY
1
Entering edit mode

Thank you your suggestion on using -O worked. Makes sense in hind-sight. Sorry I couldn't select your answer as the "winner" as you posted as a comment.

ADD REPLY
0
Entering edit mode

It has been moved to a comment (looks like you are a moderator that seems to have come back after a long break). You can accept DGTool answer as well.

ADD REPLY
0
Entering edit mode

Thank you I did not realise I could do that.

ADD REPLY
1
Entering edit mode
15 days ago
Ian 6.1k

It appears that DGTool had the answer. I needed to use the featureCounts -O flag (allowMultiOverlap) to allow the counting of a features that overlaps a different feature, such as happens with genes with lots of transcripts. Also, my initial method of generating a SAF file for full length gene coordinates was not required, I just needed to use '-t gene', instead of '-t exon'.

ADD COMMENT

Login before adding your answer.

Traffic: 1114 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6