Question

RNA-seq - Creating SAF from NCBI gff for Subread featureCounts - keep 'gene' or 'exon'

0

Entering edit mode

8 months ago

BioinfGuru ★ 2.1k

Hi all,

I'm setting up an RNA-seq pipeline with a sample dataset of 2 samples (1 x condition1, 1 x condition 2). Each sample contains 100,000 paired end reads (4 fastq files) which I have mapped to chromosome 1 of my reference genome. I set the pipeline up based on the demo in the RNA-seq by example, so I am using featureCounts for quantification. I'm getting vastly different results, depending on whether the source of my reference data is NCBI or Ensembl. I use NCBI annotation.gff for NCBI reference genome, and I use Ensembl annotation.gtf for Ensembl reference genome. I intend to run differential gene expression with Deseq2 or edgeR analysis after this step.

The problem:

The NCBI annotation.gff format does not meet the requirements for featureCounts so I created an SAF format which does meet the requirements with the following code:

   # Create SAF
   grep 'gene' ncbi.annotations.gff |         # extract all gene features (includes exons, cds etc)
    cut -d ';' -f1 |            # keep only first attribute (ID=)
    sed 's/ID=//g' |            # remove "ID=" from ID string
    grep NC_ |                  # keep only primary assembly (NC maps to chromosomes)
    #grep gene |                 # keep only "gene"
    #sed 's/gene-//g' |          # remove "gene-" from ID string
    grep exon |                 # keep only "exon"
    sed 's/exon-//g' |          # remove "exon-" from ID string
    awk 'BEGIN {FS="\t"; OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $9,$1,$4,$5,$7}' > ncbi.annotations.gff

Note: 2 lines are commented out under "# Create SAF", I am unsure whether to select gene or exon. I'm aware the focus should be on exons.

With the featureCount parameters "-F SAF -g GeneID -p --countReadPairs" I get the following results:

# When using NCBI data and " grep exon | sed 's/exon-//g' "
Assigned            1839    1728
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   1157    1521
Unassigned_Ambiguity    4603    4814

# When using NCBI data and " grep gene | sed 's/gene-//g' "
Assigned            6626    7008
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   441 494
Unassigned_Ambiguity    532 561

Finally, as a control I used Ensembl data which meets the file format requirement of featureCounts without any processing. With the featureCount parameters " -t exon -p --countReadPairs " I get the following results:

# When using Ensembl data with " -t exon "
Assigned            6045    6232
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   1303    1585
Unassigned_Ambiguity    251 246

# When using Ensembl data with " -t gene "
Assigned    6740    7124
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   441 472
Unassigned_Ambiguity    418 467

Note: All other metrics are zero (e.g. Unassigned_Read_Type 0 0)

I trust the results from the Ensembl data more as I had minimal input but I'd like to know for future reference what I am doing wrong with the NCBI data. The results of the NCBI data is alot closer to that of the Ensembl data when I use the "gene" option instead of "exon" option, which is counter intuitive to me.

2 questions:

What am I missing regarding using featureCounts with NCBI reference data, I would like to have confidence in my ability and not have to never use NCBI data if I have to use featureCounts.
If going forward I choose the Ensembl reference data, am I correct in sticking with the "exon" option.

Thanks in advance. Kenneth

featureCounts NCBI Ensembl GFF rnaseq • 1.1k views

ADD COMMENT • link updated 8 months ago by Gordon Smyth ★ 7.5k • written 8 months ago by BioinfGuru ★ 2.1k

1

Entering edit mode

8 months ago

Gordon Smyth ★ 7.5k

Does NCBI provide a GTF format file for your species? If it does, then

library(Rsubread)
SAF <- flattenGTF("annotation.gtf")

will usually work.

If you are working with human or mouse, there are pre-made SAF files at https://bioinf.wehi.edu.au/Rsubread/annot.

ADD COMMENT • link 8 months ago by Gordon Smyth ★ 7.5k

score 2 · Accepted Answer · 2024-03-02

After some time and research I came up with the following:

I created an SAF from the NCBI GFF. First, I selected all gff lines of type 'exon' (3rd colum), then isolated the "gene=" attribute of the 8th column. After writing the SAF, I set -g "GeneID" in the featureCounts command. GeneID is the name of the first SAF column.

    awk -F "\t" '$3 ~ /exon/' ncbi_annotations.gff |                            # extract all lines of type 'exon' (3rd column)
    grep gene= |                                                                # delete lines without "gene=" (<50 lines that allocate to mitochondrial DNA, removal allows uniform format to create SAF)
    awk '{FS="\t"; OFS="\t"; sub(/.*;gene=/,"",$9); sub(/;.*/,"",$9)}1' |       # reduce column 9 to only include the value of "gene=" 
    awk 'BEGIN {FS="\t"; OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $9,$1,$4,$5,$7}' > ncbi_annotations.saf

The results are very close to those when I run htseq-count, and to the results when an ensembl gtf annotation file is used. The main difference is the Unnassigned_Ambiguity range is +/- 200. Randomly searching individual genes in the results, show identical results (although I've only checked a few genes).