Question: UCSC Table Browser: redundant CDS sequences in generated fasta
lilla.davim (France) wrote, 5.6 years ago:

Hello,

I used UCSC Table Browser to generate a fasta where each entry is a CDS within a given region. The parameters I used are the following:

  • group: Genes and gene predictions
  • track: Ensembl genes
  • table: ensGene
  • region: position chr20:250000-1000000
  • output format: sequence
  • output file: myfile.fasta

Then:

  • sequence type for Ensembl Genes : genomic

And in "sequence retrieval region options": CDS Exons (only) and "One FASTA record per region (exon, intron, etc.)".

Now in the resulting fasta some identical sequences occur several times, with the same range but a different ID, for instance:

>hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

> (...)

>hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

Why does this occur and how can I obtain a fasta where each entry is a unique range?

Thanks.

ucsc cds table browser • 2.0k views

That gene has multiple transcripts that share this CDS exon, which is what you're seeing. To get unique ranges, download the whole genome and the annotation file and extract the sequences yourself with R or BioPerl/Biopython.
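
As a rough illustration of the annotation-based route, the sketch below collects each distinct CDS interval once from GTF-formatted annotation lines. It assumes the standard nine-column, tab-separated GTF layout with 1-based inclusive coordinates; the function name `unique_cds_intervals` and its parameters are hypothetical, and chromosome names in the annotation are assumed to match the ones you query with (for example `chr20` versus `20`). Extracting the actual sequence for each interval from the genome is not shown.

```python
def unique_cds_intervals(gtf_lines, chrom=None, start=None, end=None):
    """Return sorted, de-duplicated (chrom, start, end, strand) tuples for CDS features.

    gtf_lines: iterable of GTF lines (assumed: 9 tab-separated columns).
    chrom/start/end: optional region filter; a CDS is kept if it overlaps the region.
    """
    intervals = set()
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 holds the feature type; only CDS rows are of interest here.
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        seqname = fields[0]
        feat_start, feat_end = int(fields[3]), int(fields[4])
        strand = fields[6]
        if chrom is not None and seqname != chrom:
            continue
        if start is not None and feat_end < start:
            continue
        if end is not None and feat_start > end:
            continue
        # A set stores each interval once, however many transcripts contain it.
        intervals.add((seqname, feat_start, feat_end, strand))
    return sorted(intervals)
```

Because the transcript ID is deliberately left out of the key, two transcripts that share a CDS exon contribute a single interval.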

Comment written 5.6 years ago by Devon Ryan

Bert Overduin (Edinburgh Genomics, The University of Edinburgh) wrote, 5.6 years ago:

This occurs because most genes have multiple alternative transcripts annotated, and the CDSs of these can overlap, partially or completely. Ensembl annotates one transcript per gene as canonical (from their glossary: "For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."), so you could consider taking the CDSs from only these transcripts. However, the only way to do this, as far as I am aware, is by using the Ensembl Perl API. I am happy to provide you with some code to accomplish this, but you would have to install the Ensembl API yourself (the easiest way is to use the Ensembl virtual machine). If you decide to do this and have questions or run into problems with the API installation, please contact the Ensembl Helpdesk at helpdesk@ensembl.org.

Also, I don't know what your ultimate goal is, but you probably should ask yourself if just taking one CDS per gene is the right thing to do for what you want to accomplish.

Addendum: Reading your question again, I think I may have misunderstood it. Do you want to remove only those CDSs that are exactly identical, or overlapping ones as well? If the former, you should just filter your output file for unique locations; if the latter, my reply above still holds.
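
Filtering for unique locations could look roughly like the following sketch. It assumes the header layout shown in the question (`range=chr20:368655-368945 ... strand=+`) and keeps the first record seen for each range and strand; the function names and the file-handling helper are hypothetical, not part of any UCSC tool.

```python
import re

# Fields as they appear in the Table Browser FASTA headers from the question.
RANGE_RE = re.compile(r"range=(\S+)")
STRAND_RE = re.compile(r"strand=([+-])")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def unique_by_range(records):
    """Keep only the first record for each (range, strand) pair."""
    seen = set()
    for header, seq in records:
        rng = RANGE_RE.search(header)
        strand = STRAND_RE.search(header)
        # Fall back to the whole header if no range field is present.
        key = (rng.group(1) if rng else header,
               strand.group(1) if strand else None)
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


def dedupe_fasta(in_path, out_path):
    """Write a copy of in_path that contains one record per unique range."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in unique_by_range(read_fasta(src)):
            dst.write(f">{header}\n{seq}\n")
```

For example, `dedupe_fasta("myfile.fasta", "myfile.unique.fasta")` would keep the `ENST00000217233_0` record and drop the later `ENST00000449710_0` record with the same range (the output file name here is only an illustration). Note that the surviving record still carries the ID of whichever transcript happened to come first, and that each sequence is written on a single line.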

Comment written 5.6 years ago by Bert Overduin