Question

Can you suggest tools or a script to get canonical sequences from a CDS FASTA file?

10 weeks ago

Yuto • 0

I have a CDS FASTA file from NCBI, and it contains around 11k duplicated transcripts. I would like to keep only the canonical sequences (that would be the longest transcript per gene, I suppose?), so that there are no duplicated coding sequences in my downstream analyses. Can you please suggest tools I can easily use?

Thanks!

genome cds fasta • 348 views

updated 10 weeks ago by Ram 43k • written 10 weeks ago by Yuto • 0

If you are working with human data then look into the MANE project: https://www.ncbi.nlm.nih.gov/refseq/MANE/#Select

10 weeks ago by GenoMax 141k

The longest transcript is not necessarily the canonical one: "canonical" can also mean, for example, the transcript encoding the most abundant protein or the one most conserved among species. 11k seems too small for human/mammalian transcripts. Anyway, Ensembl GTF files should have a canonical tag (no idea if this is valid for annotations of more exotic species), so you can get the canonical set by selecting the tagged transcripts and extracting their sequences from the FASTA using, e.g., bedtools.
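A minimal sketch of the selection step, assuming the GTF marks canonical transcripts with a `tag "Ensembl_canonical";` attribute on `transcript` lines (check that your release and species actually carry it). The function name `canonical_transcript_ids` is made up for illustration:

```python
import re


def canonical_transcript_ids(gtf_lines, tag="Ensembl_canonical"):
    """Return transcript IDs of 'transcript' records that carry the given tag.

    Assumes GTF attributes of the form: transcript_id "X"; tag "Y";
    """
    ids = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at transcript-level records
        attrs = fields[8]
        # A record can carry several tag attributes; collect them all.
        tags = re.findall(r'tag "([^"]+)"', attrs)
        if tag not in tags:
            continue
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if match:
            ids.append(match.group(1))
    return ids


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python canonical_ids.py annotation.gtf
    with open(sys.argv[1]) as handle:
        for transcript_id in canonical_transcript_ids(handle):
            print(transcript_id)
```

The resulting ID list can then be used to filter the GTF or the FASTA headers before extracting sequences.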

10 weeks ago by Darked89 4.6k

score 0 · Answer 1 · 2024-02-15

Which organism is it, and for what purpose do you need the data? For human and mouse you can use the CCDS database. The meaning of "canonical" may also vary by organism. In the end it may be a rather subjective choice (for human it is "what's annotated as canonical in the database") and it depends on your purpose. I would prefer experimentally validated or manually annotated sequences over the longest sequence.
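If no curated set exists for your organism and you fall back on the longest CDS per gene, it could look roughly like the sketch below. It assumes each FASTA header contains a `[gene=...]` field, as NCBI CDS FASTA headers usually do; records without one are keyed by their sequence ID. The helper names `read_fasta` and `longest_cds_per_gene` are made up for illustration, and the longest CDS is only a proxy, not a guarantee of the canonical isoform.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_cds_per_gene(records):
    """Keep the longest sequence per gene; on ties the first one seen wins.

    Assumption: the gene name is given as [gene=NAME] in the header.
    """
    best = {}
    for header, seq in records:
        match = re.search(r"\[gene=([^\]]+)\]", header)
        gene = match.group(1) if match else header.split()[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python longest_cds.py cds.fna > longest.fna
    with open(sys.argv[1]) as handle:
        for header, seq in longest_cds_per_gene(read_fasta(handle)).values():
            print(f">{header}\n{seq}")
```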