Why is the number of CDS smaller than the number of corresponding genes [M. tb]
6 weeks ago
Alewa ▴ 100

[sta55@cbsukim tb_genes_fasta]$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efilter -feature gene | efetch -format gene_fasta | grep "^>" | wc -l
4008
[sta55@cbsukim tb_genes_fasta]$ esearch -db nuccore -query 'Mycobacterium tuberculosis H37Rv[Organism] AND NC_000962.3[ACCN]' | efilter -feature gene | efetch -format fasta_cds_aa | grep "^>" | wc -l
3906


Background

I'm extracting the nucleotide sequences of M. tb genes and their corresponding CDS (protein) sequences. https://www.ncbi.nlm.nih.gov/nuccore/NC_000962#locus_448814763

NCBI entrez bash genes

At a guess, one explanation could be that some features annotated as genes are RNAs etc., which don't code for proteins, so there are more genes than CDSs (i.e. more functionally annotated "things" than just proteins).


Joe - thanks for chiming in. But in my case there were fewer CDSs than genes. Or maybe I'm not doing the gene filtering right? :(


That's what I said, no? You have fewer annotated CDSs than genes. Remember what "CDS" actually means: coding sequences.

This is usually taken to mean they give rise to a functional protein, but the definition of a gene is broader these days and can include non-coding RNAs.

Hence number of CDS + number of non-CDS functional elements = number of "genes".

Or more simply: gene != CDS.

This is still a guess on my part as it could be due to any number of annotation artefacts etc, but I don't see an obvious problem here - the numbers you've retrieved make intuitive sense.
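
If you want to check this rather than take my guess for it, one option is to tally the feature types in the record's feature table and see whether CDS plus the RNA features roughly adds up to the gene count. The sketch below is untested against this particular record. The helper name `count_feature_types` is made up for illustration, and it assumes the input is an NCBI five-column feature table (which I believe `efetch -format ft` returns), where each feature starts with a line of the form start, stop, feature key, separated by tabs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a tab-delimited feature table on stdin and
# print one "feature_key<TAB>count" line per feature type, sorted by key.
count_feature_types() {
  awk -F'\t' '
    # A feature starts on a line whose 1st and 3rd columns are both filled
    # (start, stop, key). Qualifier lines begin with empty columns, and
    # extra interval lines have no 3rd column, so both are skipped.
    $1 != "" && $3 != "" { count[$3]++ }
    END { for (key in count) printf "%s\t%d\n", key, count[key] }
  ' | sort
}

# Assumed usage (needs EDirect installed and network access):
# efetch -db nuccore -id NC_000962.3 -format ft | count_feature_types
```

If the gap between the gene and CDS rows is mostly covered by the tRNA, rRNA and ncRNA rows, that supports the gene != CDS explanation. Anything left over would more likely be pseudogenes or other annotation quirks, which this tally does not break down any further.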