Question

CDS From UCSC Not Always a Multiple of 3

1

11.0 years ago

Max ▴ 150

When I retrieve all CDS sequences for the human reference genome, the overwhelming majority have a length divisible by 3, start with ATG, and end in a stop codon.

However, approximately 10% have a length that is not divisible by 3 and lack a stop codon (although they do have a start codon in the first exon).

Are these likely to be errors?
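The three checks described above can be sketched in a few lines of Python. This is only an illustration: the function name `check_cds` is made up here, and the stop-codon test looks at the last three bases regardless of reading frame, which is an assumption about how "ends in a stop codon" is meant.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(seq):
    """Report length, divisibility by 3, ATG start and stop-codon end for one CDS."""
    seq = seq.upper()
    return {
        "length": len(seq),
        "divisible_by_3": len(seq) % 3 == 0,
        "starts_with_atg": seq.startswith("ATG"),
        # Only the final three bases are inspected, whatever the frame.
        "ends_with_stop": seq[-3:] in STOP_CODONS,
    }

def failing_records(cds_by_name):
    """Return the names of records that fail at least one of the three checks."""
    failing = []
    for name, seq in cds_by_name.items():
        result = check_cds(seq)
        if not (result["divisible_by_3"]
                and result["starts_with_atg"]
                and result["ends_with_stop"]):
            failing.append(name)
    return failing
```

Running `failing_records` over a dictionary of name-to-sequence pairs gives the list of entries worth inspecting by hand.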

To retrieve the sequences, I used the Table Browser, i.e.

-- Select refSeq genes for the track and ccdsInfo for the table
-- Under output, select "selected fields from primary…"
-- Click "get output"
-- On the next page, which gives the option to select additional tables, select "ccdsGene"
-- Click "Allow Selection from…"
-- That expands another list of fields, etc.
-- Click "Check all" at the top, then click "Get output"

ucsc • 2.8k views

ADD COMMENT • link updated 7.2 years ago by Petr Ponomarenko ★ 2.8k • written 11.0 years ago by Max ▴ 150

0

You might give an example as one can come up with a number of likely explanations.

ADD REPLY • link 11.0 years ago by Devon Ryan 104k

0

For instance, this sequence (length=119), which I obtained by selecting CDS exons with 1 FASTA record per gene:

>hg19_refGene_NM_001187 range=chr21:11097543-11098737 5'pad=0 3'pad=0 strand=- repeatMasking=none
ATGGCGGCCGGAGCGGTTTTTCTGGCATTGTCTGCCCAGCTGCTCCAAGCCAGGCTGATGAAGGAGGAGTCCCCTGTGGTGAGCTGGAGGTTGGAGCCTGAAGATGGCACAGCTCTGTG
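A quick way to sanity-check a record like this is to parse the header and compare the genomic span with the stated CDS length. The sketch below assumes the `range=` field is 1-based and inclusive; the helper name `parse_ucsc_header` is hypothetical.

```python
import re

def parse_ucsc_header(header):
    """Split a Table Browser FASTA header into its name and key=value fields."""
    name, *fields = header.lstrip(">").split()
    info = dict(field.split("=", 1) for field in fields)
    chrom, start, end = re.match(r"(\w+):(\d+)-(\d+)$", info["range"]).groups()
    return {
        "name": name,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": info["strand"],
    }

HEADER = ("hg19_refGene_NM_001187 range=chr21:11097543-11098737 "
          "5'pad=0 3'pad=0 strand=- repeatMasking=none")

record = parse_ucsc_header(HEADER)
# Assumption: range coordinates are 1-based and inclusive.
genomic_span = record["end"] - record["start"] + 1
reported_cds_length = 119  # length stated in the post

print(genomic_span)             # 1195
print(reported_cds_length % 3)  # 2
```

The span (1195 bp) is much larger than the 119 bases returned because the range covers introns as well as the CDS exons, and 119 leaves a remainder of 2 when divided by 3.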

ADD REPLY • link 11.0 years ago by Max ▴ 150

0

Another example is NM_001077693. The UCSC CDS exons give a length of 326. Moreover, when the sequence is translated, there are several internal stop codons and a frameshift that yields an amino acid sequence inconsistent with the GenBank record for this gene.
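Internal stop codons of the kind described here can be located with a small translation routine. The sketch below uses the standard genetic code and made-up function names (`translate`, `internal_stops`); it is an illustration, not the method used to produce the numbers above.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq, frame=0):
    """Translate complete codons from the given frame; unknown codons become 'X'."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def internal_stops(protein):
    """Return 0-based positions of stop codons other than the final residue."""
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]
```

Translating the same sequence with `frame=0`, `frame=1` and `frame=2` and comparing the number of internal stops is a simple way to spot where a frameshift may have been introduced.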

ADD REPLY • link 11.0 years ago by Max ▴ 150

0

I'm not sure what's going on with that one. If you look it up in the NCBI nucleotide database, its sequence diverges about halfway through, which suggests that something is off between NCBI and UCSC.

ADD REPLY • link 11.0 years ago by Devon Ryan 104k

0

Look at the cdsStartStat field (hint, exon #3 is missing) in the refGene table for that entry.
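For anyone who wants to apply that check across the whole table, here is a rough sketch. It assumes a tab-separated refGene dump with the usual extended genePred column order, 0-based half-open coordinates, and `cmpl` as the status value for a complete CDS end; the example row is invented for illustration and is not a real transcript.

```python
# Assumed column order of a refGene table dump (extended genePred plus bin).
REFGENE_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]

def parse_refgene_line(line):
    """Turn one tab-separated refGene line into a dict with typed coordinates."""
    row = dict(zip(REFGENE_COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount"):
        row[key] = int(row[key])
    for key in ("exonStarts", "exonEnds"):
        row[key] = [int(x) for x in row[key].split(",") if x]
    return row

def cds_length(row):
    """Sum the overlap of each exon with the CDS interval (0-based, half-open)."""
    total = 0
    for start, end in zip(row["exonStarts"], row["exonEnds"]):
        total += max(0, min(end, row["cdsEnd"]) - max(start, row["cdsStart"]))
    return total

def cds_is_complete(row):
    """True only when both CDS ends are flagged as complete ('cmpl')."""
    return row["cdsStartStat"] == "cmpl" and row["cdsEndStat"] == "cmpl"

# Hypothetical row, invented for this example.
EXAMPLE = "0\tNM_FAKE\tchr1\t+\t100\t400\t150\t350\t2\t100,300,\t200,400,\t0\tFAKE\tincmpl\tcmpl\t0,2,"
row = parse_refgene_line(EXAMPLE)
print(cds_length(row), cds_length(row) % 3, cds_is_complete(row))
```

Entries where `cds_is_complete` is false are the ones I would expect to show up with a length that is not a multiple of 3.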

ADD REPLY • link 11.0 years ago by Devon Ryan 104k

score 1 · Answer 1 · 2017-05-25

1

7.2 years ago

Petr Ponomarenko ★ 2.8k

A CDS does not have to be a multiple of 3 even in the real world, let alone when it results from aligning RefSeq to the reference sequence. Some sequences are partial, some were aligned incorrectly, and some are known to have "weird" features, such as polymerase slippage at a certain position.

ADD COMMENT • link 7.2 years ago by Petr Ponomarenko ★ 2.8k