featureCounts : GFF3 gene identifier and low percentage of assigned alignments
5 months ago
liyong • 0

Hi,

I am very new to RNA-seq analysis. I have Illumina paired-end RNA-seq data. After QC, the reads were aligned to the genome with STAR using a GFF3 annotation file (uniquely mapped reads: 80%), and I then used featureCounts (version 2.0.1) in a conda environment to count reads per gene.

The parameters I used for featureCounts are listed below:

featureCounts \
analysis/aligned_sequences/SRR1171897/Aligned.sortedByCoord.out.bam \
-a data/annotation/Cs_genes_v2_annot.gff3 \
-o analysis/final_counts/SRR1171897/featureCounts.txt \
-T 10 \
-p \
-F "GFF3" \
-g "Parent"


I am not sure whether -g "Parent" is the right choice for this GFF3 annotation file. Its first few lines are shown below:

Chr1    AAFC_NRC    gene    1   6504    .   -   .   ID=Csa01g001000;Name=Csa01g001000;Note=methyl-CPG-binding domain 9

Chr1    AAFC_NRC    gene    1   6504    .   -   .   ID=Csa01g001000;Name=Csa01g001000

Chr1    AAFC_NRC    mRNA    1   6504    .   -   .   ID=Csa01g001000.1;Name=Csa01g001000.1;Parent=Csa01g001000;Note=methyl-CPG-binding domain 9

Chr1    AAFC_NRC    five_prime_UTR  6380    6504    .   -   .   ID=Csa01g001000.1.utr5p1;Parent=Csa01g001000.1

Chr1    AAFC_NRC    exon    5865    6504    .   -   .   ID=Csa01g001000.1.exon1;Parent=Csa01g001000.1


After featureCounts finished, I got the following results:

My questions are: which gene identifier should I use for the -g parameter when running featureCounts, and why did I get only 51.1% successfully assigned alignments? Is my result correct, and is there anything I could do to improve it?

Thank you very much.

In my experience, also setting -M (which counts multi-mapping reads) slightly increases the percentage of assigned alignments. You can also try -t gene (the default is -t exon) and see whether anything changes. Note that the gene lines in your excerpt carry ID and Name but no Parent attribute, so -t gene would need to be paired with an attribute those lines actually have, such as -g ID. However, an assignment rate of ~50-60% does not necessarily mean that the counting was unsuccessful; many of your reads may come from regions that are not annotated (such as some non-coding regions).
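To see where the unassigned alignments went, it can help to look at the summary file that featureCounts writes next to its output (the output path with .summary appended). The sketch below turns that tab-delimited table into percentages per status. It assumes a single BAM column, and the function name summarize_assignment is made up for illustration:

```shell
# Print each non-zero status from a featureCounts .summary file
# with its count and its share of the total, largest first.
# Assumes one BAM column (column 2) and a header line starting with "Status".
summarize_assignment() {
  awk -F'\t' '
    NR == 1 { next }                      # skip the "Status <bam>" header
    { count[$1] = $2; total += $2 }
    END {
      if (total > 0)
        for (s in count)
          if (count[s] > 0)
            printf "%s\t%d\t%.1f%%\n", s, count[s], 100 * count[s] / total
    }
  ' "$1" | sort -k2,2nr
}
```

For the run in the question this would be called as summarize_assignment analysis/final_counts/SRR1171897/featureCounts.txt.summary. A large Unassigned_NoFeatures share would point to reads outside annotated features, while a large Unassigned_Ambiguity share would point to reads overlapping more than one counted feature group.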

I would also suggest trying a different annotation file (.gtf), as the annotation itself can affect the results. I am not sure where yours came from, but I usually stick to the ones from Ensembl.
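Before swapping annotation files, it may be worth checking which attribute keys each feature type in the current GFF3 carries, since the attribute given to -g has to be present on the feature type given to -t. A minimal sketch, assuming a tab-delimited GFF3 with attributes in column 9; the function name gff_attr_keys is hypothetical:

```shell
# List, for each feature type (column 3), which attribute keys occur
# in column 9 and on how many lines.
# Output columns: feature type, attribute key, number of lines.
gff_attr_keys() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }               # skip comments and incomplete lines
    {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        split(attrs[i], kv, "=")          # kv[1] is the key, e.g. ID or Parent
        if (kv[1] != "") seen[$3 "\t" kv[1]]++
      }
    }
    END { for (k in seen) print k "\t" seen[k] }
  ' "$1" | sort
}
```

Running gff_attr_keys data/annotation/Cs_genes_v2_annot.gff3 on the file from the question should show whether exon lines carry only ID and Parent. In the excerpt above, the Parent of an exon is the mRNA ID (Csa01g001000.1), so -t exon -g Parent would probably group exons per transcript rather than per gene, and reads overlapping several isoforms of one gene might then be reported as ambiguous.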

Hello Marco,

Thank you for the suggestion. Sounds good; I will play around with the -M and -t parameters.

The .gff3 file comes from our own lab; I will ask around to see if there is a different version.

Thanks, Liyong