Question: Number of discovered genes in genome does not correspond with published annotation
0
gravatar for P16226
5 weeks ago by
P162260
P162260 wrote:

Greetings,

I have tried to run RNA-seq analysis for crucial carp samples on following genome: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Carassius_auratus/100/

Author claims to discover 53,065 coding genes, but GFF file contain only fraction of presented number:

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tCDS\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3434

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\texon\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4455

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tgene*\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

4092

cat GCF_003368295.1_ASM336829v1_genomic.gff | grep -P "\tmRNA\t" | awk -F'\t' '{print $1}' | sort | uniq | wc -l

3419

Is there any secret? I apologize I am pretty new to any bioinformatics ...

Thank you for your time.

annotation rna-seq gff genome ncbi • 228 views
ADD COMMENTlink modified 5 weeks ago by kashiff00750 • written 5 weeks ago by P162260

claims to discover 53,065 coding genes,

Are those total number of transcripts by any chance?

ADD REPLYlink written 5 weeks ago by genomax64k

Full information of your GFF file

awk '{print$3}' | sort | uniq -c your.gff
ADD REPLYlink written 5 weeks ago by kashiff00750

What is this supposed to do? Please explain.

ADD REPLYlink written 5 weeks ago by vkkodali990

It extract 3 column of gff file (which has information like gene, intron, exon, miRNA etc) and then count the number of occurrence in whole gff file. In other words, you will get the total number of genes in your gff file.

ADD REPLYlink written 4 weeks ago by kashiff00750
1

It extract 3 column of gff file

No it doesn't. Please make sure your code works as intended before posting.

Even if it did, it will count the number of, for example, CDS lines in the GFF3 file which will be wildly off the actual number. That's because a given gene can have many alternately spliced transcripts with multiple coding exons, each of them represented in the GFF3 file as a single CDS row.

ADD REPLYlink written 4 weeks ago by vkkodali990

Yes, you are right. Since your question mainly concerned about number of "discovered genes" in annotation file, it will for sure give the total count of genes (which may have more that one CDS for sure). Also, I have not mentioned in my previous comment that it will provide the total number of CDS for each genes, or count one CDS for one gene. Although, it will only provide the total number entity in whole gff file or in whole genome. I always double check my code before posting it. Thanks

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by kashiff00750
2
gravatar for vkkodali
5 weeks ago by
vkkodali990
United States
vkkodali990 wrote:

You are counting the wrong column. Try this instead.

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53614

Apparently, there are a bunch of CDS features without protein accessions. That is the reason for the discrepancy in the numbers (53065 vs 53614). You can get to that count by doing the following:

$ zgrep -v '^#' GCF_003368295.1_ASM336829v1_genomic.gff.gz \
    | awk 'BEGIN{FS="\t";OFS="\t"}($3=="CDS"){print $9}' \
    | grep 'protein_id=' \
    | grep -Po 'GeneID:\d+' \
    | sort -u \
    | wc -l
53065
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by vkkodali990

Thank you very much! You are right!

ADD REPLYlink written 5 weeks ago by P162260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1211 users visited in the last hour