RNA-seq - Creating SAF from NCBI gff for Subread featureCounts - keep 'gene' or 'exon'
2
0
Entering edit mode
8 weeks ago
BioinfGuru ★ 1.7k

Hi all,

I'm setting up an RNA-seq pipeline with a sample dataset of 2 samples (1 x condition1, 1 x condition 2). Each sample contains 100,000 paired end reads (4 fastq files) which I have mapped to chromosome 1 of my reference genome. I set the pipeline up based on the demo in the RNA-seq by example, so I am using featureCounts for quantification. I'm getting vastly different results, depending on whether the source of my reference data is NCBI or Ensembl. I use NCBI annotation.gff for NCBI reference genome, and I use Ensembl annotation.gtf for Ensembl reference genome. I intend to run differential gene expression with Deseq2 or edgeR analysis after this step.

The problem:

The NCBI annotation.gff format does not meet the requirements for featureCounts so I created an SAF format which does meet the requirements with the following code:

   # Create SAF
   grep 'gene' ncbi.annotations.gff |         # extract all gene features (includes exons, cds etc)
    cut -d ';' -f1 |            # keep only first attribute (ID=)
    sed 's/ID=//g' |            # remove "ID=" from ID string
    grep NC_ |                  # keep only primary assembly (NC maps to chromosomes)
    #grep gene |                 # keep only "gene"
    #sed 's/gene-//g' |          # remove "gene-" from ID string
    grep exon |                 # keep only "exon"
    sed 's/exon-//g' |          # remove "exon-" from ID string
    awk 'BEGIN {FS="\t"; OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $9,$1,$4,$5,$7}' > ncbi.annotations.gff

Note: 2 lines are commented out under "# Create SAF", I am unsure whether to select gene or exon. I'm aware the focus should be on exons.

With the featureCount parameters "-F SAF -g GeneID -p --countReadPairs" I get the following results:

# When using NCBI data and " grep exon | sed 's/exon-//g' "
Assigned            1839    1728
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   1157    1521
Unassigned_Ambiguity    4603    4814

# When using NCBI data and " grep gene | sed 's/gene-//g' "
Assigned            6626    7008
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   441 494
Unassigned_Ambiguity    532 561

Finally, as a control I used Ensembl data which meets the file format requirement of featureCounts without any processing. With the featureCount parameters " -t exon -p --countReadPairs " I get the following results:

# When using Ensembl data with " -t exon "
Assigned            6045    6232
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   1303    1585
Unassigned_Ambiguity    251 246

# When using Ensembl data with " -t gene "
Assigned    6740    7124
Unassigned_Unmapped 91553   91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures   441 472
Unassigned_Ambiguity    418 467

Note: All other metrics are zero (e.g. Unassigned_Read_Type 0 0)

I trust the results from the Ensembl data more as I had minimal input but I'd like to know for future reference what I am doing wrong with the NCBI data. The results of the NCBI data is alot closer to that of the Ensembl data when I use the "gene" option instead of "exon" option, which is counter intuitive to me.

2 questions:

  • What am I missing regarding using featureCounts with NCBI reference data, I would like to have confidence in my ability and not have to never use NCBI data if I have to use featureCounts.
  • If going forward I choose the Ensembl reference data, am I correct in sticking with the "exon" option.

Thanks in advance. Kenneth

featureCounts NCBI Ensembl GFF rnaseq • 616 views
ADD COMMENT
2
Entering edit mode
6 weeks ago
BioinfGuru ★ 1.7k

After some time and research I came up with the following:

I created an SAF from the NCBI GFF. First, I selected all gff lines of type 'exon' (3rd colum), then isolated the "gene=" attribute of the 8th column. After writing the SAF, I set -g "GeneID" in the featureCounts command. GeneID is the name of the first SAF column.

    awk -F "\t" '$3 ~ /exon/' ncbi_annotations.gff |                            # extract all lines of type 'exon' (3rd column)
    grep gene= |                                                                # delete lines without "gene=" (<50 lines that allocate to mitochondrial DNA, removal allows uniform format to create SAF)
    awk '{FS="\t"; OFS="\t"; sub(/.*;gene=/,"",$9); sub(/;.*/,"",$9)}1' |       # reduce column 9 to only include the value of "gene=" 
    awk 'BEGIN {FS="\t"; OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $9,$1,$4,$5,$7}' > ncbi_annotations.saf

The results are very close to those when I run htseq-count, and to the results when an ensembl gtf annotation file is used. The main difference is the Unnassigned_Ambiguity range is +/- 200. Randomly searching individual genes in the results, show identical results (although I've only checked a few genes).

ADD COMMENT
1
Entering edit mode
6 weeks ago
Gordon Smyth ★ 7.0k

Does NCBI provide a GTF format file for your species? If it does, then

library(Rsubread)
SAF <- flattenGTF("annotation.gtf")

will usually work.

If you are working with human or mouse, there are pre-made SAF files at https://bioinf.wehi.edu.au/Rsubread/annot.

ADD COMMENT

Login before adding your answer.

Traffic: 2660 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6