Hi all,
I'm setting up an RNA-seq pipeline with a sample dataset of 2 samples (1 x condition1, 1 x condition 2). Each sample contains 100,000 paired end reads (4 fastq files) which I have mapped to chromosome 1 of my reference genome. I set the pipeline up based on the demo in the RNA-seq by example, so I am using featureCounts for quantification. I'm getting vastly different results, depending on whether the source of my reference data is NCBI or Ensembl. I use NCBI annotation.gff for NCBI reference genome, and I use Ensembl annotation.gtf for Ensembl reference genome. I intend to run differential gene expression with Deseq2 or edgeR analysis after this step.
The problem:
The NCBI annotation.gff format does not meet the requirements for featureCounts so I created an SAF format which does meet the requirements with the following code:
# Create SAF
grep 'gene' ncbi.annotations.gff | # extract all gene features (includes exons, cds etc)
cut -d ';' -f1 | # keep only first attribute (ID=)
sed 's/ID=//g' | # remove "ID=" from ID string
grep NC_ | # keep only primary assembly (NC maps to chromosomes)
#grep gene | # keep only "gene"
#sed 's/gene-//g' | # remove "gene-" from ID string
grep exon | # keep only "exon"
sed 's/exon-//g' | # remove "exon-" from ID string
awk 'BEGIN {FS="\t"; OFS="\t"; print "GeneID\tChr\tStart\tEnd\tStrand"}{print $9,$1,$4,$5,$7}' > ncbi.annotations.gff
Note: 2 lines are commented out under "# Create SAF", I am unsure whether to select gene or exon. I'm aware the focus should be on exons.
With the featureCount parameters "-F SAF -g GeneID -p --countReadPairs" I get the following results:
# When using NCBI data and " grep exon | sed 's/exon-//g' "
Assigned 1839 1728
Unassigned_Unmapped 91553 91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures 1157 1521
Unassigned_Ambiguity 4603 4814
# When using NCBI data and " grep gene | sed 's/gene-//g' "
Assigned 6626 7008
Unassigned_Unmapped 91553 91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures 441 494
Unassigned_Ambiguity 532 561
Finally, as a control I used Ensembl data which meets the file format requirement of featureCounts without any processing. With the featureCount parameters " -t exon -p --countReadPairs " I get the following results:
# When using Ensembl data with " -t exon "
Assigned 6045 6232
Unassigned_Unmapped 91553 91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures 1303 1585
Unassigned_Ambiguity 251 246
# When using Ensembl data with " -t gene "
Assigned 6740 7124
Unassigned_Unmapped 91553 91289
Unassigned_MultiMapping 929 1023
Unassigned_NoFeatures 441 472
Unassigned_Ambiguity 418 467
Note: All other metrics are zero (e.g. Unassigned_Read_Type 0 0)
I trust the results from the Ensembl data more as I had minimal input but I'd like to know for future reference what I am doing wrong with the NCBI data. The results of the NCBI data is alot closer to that of the Ensembl data when I use the "gene" option instead of "exon" option, which is counter intuitive to me.
2 questions:
- What am I missing regarding using featureCounts with NCBI reference data, I would like to have confidence in my ability and not have to never use NCBI data if I have to use featureCounts.
- If going forward I choose the Ensembl reference data, am I correct in sticking with the "exon" option.
Thanks in advance. Kenneth