Can you suggest tools or a script to get canonical sequences from a CDS FASTA file?
10 weeks ago
Yuto • 0

I have a CDS FASTA file from NCBI that contains around 11k duplicated transcripts. I would like to keep only the canonical sequence for each gene (which I suppose would be the longest transcript?), so that there are no duplicated coding sequences in my downstream analyses. Can you please suggest tools that are easy to use?

Thanks!

genome cds fasta • 348 views

If you are working with human data then look into the MANE project: https://www.ncbi.nlm.nih.gov/refseq/MANE/#Select


The longest transcript is not necessarily the canonical one; "canonical" can also mean the transcript encoding the most abundant protein, the one most conserved among species, etc. 11k seems too small for human/mammalian transcripts. Anyway, Ensembl GTF files should have a canonical tag (no idea if this is valid for annotations of more exotic species), so you can get the canonical transcripts by selecting the tagged records and extracting their sequences from the FASTA using e.g. bedtools.
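A minimal sketch of the tag-based selection, in case it helps. It assumes the tag is spelled `Ensembl_canonical` and sits on `transcript` records, as in recent Ensembl GTF releases; check your own file, since older releases and other annotation sources may differ. The function name is just illustrative.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def canonical_transcript_ids(gtf_lines, tag="Ensembl_canonical"):
    """Return {gene_id: transcript_id} for transcript records carrying `tag`.

    Assumption: the canonical transcript is marked with tag "Ensembl_canonical".
    """
    canonical = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records are of interest
        attrs = ATTR_RE.findall(fields[8])
        # A record can carry several tag attributes, so collect them all
        tags = {value for key, value in attrs if key == "tag"}
        if tag in tags:
            info = dict(attrs)
            canonical[info["gene_id"]] = info["transcript_id"]
    return canonical
```

You would call it as `canonical_transcript_ids(open("annotation.gtf"))` (the file name is a placeholder) and then keep only the FASTA records whose transcript ID is in the returned values.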

10 weeks ago
Michael 54k

Which organism is it, and for what purpose do you need the data? For human and mouse you can use the CCDS database. The meaning of "canonical" may also vary by organism. In the end it may be a rather subjective choice (for human it is "what's annotated as canonical in the database") that depends on your purpose. I would prefer an experimentally validated or manually annotated transcript over the longest sequence.
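If no curated set exists for your organism and you fall back to the longest CDS per gene anyway, here is a rough sketch in plain Python. It assumes the headers carry a `[gene=...]` field, as NCBI `cds_from_genomic` FASTA files usually do; records without that field are keyed by their own ID and therefore kept as-is. The function names are illustrative, and ties keep the first record seen.

```python
import re

# Assumption: NCBI-style headers such as
# >lcl|..._cds_XP_000001.1_1 [gene=abc] [protein_id=XP_000001.1] ...
GENE_RE = re.compile(r"\[gene=([^\]]+)\]")

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def longest_cds_per_gene(lines):
    """Return {gene: (header, sequence)} keeping the longest CDS per gene."""
    best = {}
    for header, seq in read_fasta(lines):
        match = GENE_RE.search(header)
        # Fall back to the record ID when there is no [gene=...] field
        gene = match.group(1) if match else header.split()[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

To write the result back out, loop over `longest_cds_per_gene(open("cds.fna")).values()` (file name is a placeholder) and print `">" + header` followed by the sequence.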
