How to remove multiple transcripts per gene from a genome-wide predicted CDS annotation file and keep only one?
9.9 years ago
cjy8709 • 0

Hi Everyone

I'm planning to conduct a genome-wide ortholog search for a couple of species that have sequenced genomes and predicted gene annotations (i.e. predicted CDS). My plan is to do a reciprocal-best-BLAST-hit type of analysis, going through each CDS of one species and comparing it to the other species. There's no central database with ortholog information for my species of interest.
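The reciprocal-best-hit idea described above can be sketched roughly as follows. This is a minimal illustration, not a complete ortholog pipeline: it assumes both BLAST runs were saved in the standard 12-column tabular format (`-outfmt 6`, with query ID in column 1, subject ID in column 2 and bit score in column 12), and the function names `parse_tabular`, `best_hits` and `reciprocal_best_hits` are made up for this example.

```python
def parse_tabular(text):
    """Split BLAST tabular text (12 tab-separated columns) into rows."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def best_hits(rows):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for row in rows:
        query, subject, bitscore = row[0], row[1], float(row[11])
        # Keep the first hit seen, then replace it only on a strictly better score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b_rows, b_vs_a_rows):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b_rows)
    reverse = best_hits(b_vs_a_rows)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Ties in bit score are resolved here by keeping the first hit encountered, which is an arbitrary choice; a real analysis would probably also filter on e-value and alignment coverage.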

Anyway, as a first step I was planning to edit the CDS file so that, for genes with multiple predicted transcripts, all but the longest sequence are removed (to limit redundancy in the later BLAST searches). I'm familiar with Perl scripting, but I'm trying to force myself to use BioPerl (since it would probably help downstream with BLAST), and I was wondering if people had suggestions on how such a script might work in my case?

Thank you!

bioperl sequence-edit scripting • 3.3k views
9.9 years ago
Prakki Rama ★ 2.7k

You can extract all the predicted sequences listed in the annotation file and then run CD-HIT-EST, which clusters sequences above a chosen identity threshold and keeps the longest sequence of each cluster as its representative. This reduces redundancy, retains the longest sequences, and outputs a FASTA file. If you have too many sequences, try the standalone version.

Also, check whether this Biostars post helps you.
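If the gene each transcript belongs to can be read from the FASTA headers, the longest-transcript filter described in the question can also be done directly, without clustering. The sketch below is in plain Python rather than BioPerl and rests on a loud assumption: transcript IDs are the gene ID followed by a numeric isoform suffix such as `geneA.1`, `geneA.2`. Real annotation files often use other conventions, so the hypothetical `gene_id` helper would need adjusting.

```python
import re


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """ASSUMPTION: the ID is '<gene>.<isoform number>'; strip the suffix."""
    transcript_id = header.split()[0]
    return re.sub(r"\.\d+$", "", transcript_id)


def longest_per_gene(records):
    """Keep, for each gene, the record with the longest sequence."""
    best = {}
    for header, seq in records:
        gid = gene_id(header)
        # Replace only on a strictly longer sequence, so ties keep the first isoform.
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (header, seq)
    return list(best.values())


example = """>geneA.1
ATGAAA
>geneA.2
ATGAAACCCTAA
>geneB.1
ATGTTTTAA
""".splitlines()

for header, seq in longest_per_gene(parse_fasta(example)):
    print(f">{header}\n{seq}")
```

For a real file, an open file handle can be passed to `parse_fasta` in place of the example lines.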


Thanks for the reply. The Biostars post you linked had some bits and pieces that I think will help me in the future.
