Question

How do I remove multiple transcripts per gene from a genome-wide predicted CDS annotation file and keep only one?

0

9.9 years ago

cjy8709 • 0

Hi Everyone

I'm planning to conduct a genome-wide ortholog search for a couple of species that have sequenced genomes and currently have predicted gene annotations (i.e., predicted CDS). My plan is to do a reciprocal BLAST hit type of analysis, going through each CDS of one species and comparing it to the other species. There is no central database with ortholog information for my species of interest.

As a first step, I was planning to edit the CDS file so that, for genes with multiple predicted transcripts, all but the longest sequence are removed (to limit redundancy in later BLAST searches). I'm familiar with Perl scripting, but I'm trying to force myself to use BioPerl (since it would probably help downstream with BLAST), and I was wondering whether people had suggestions on how the script might work in my case.

Thank you!

bioperl sequence-edit scripting • 3.3k views

updated 2.5 years ago by Ram 43k • written 9.9 years ago by cjy8709 • 0

Ram · Answer 1 · 2014-06-13

1

9.9 years ago

Prakki Rama ★ 2.7k

You can extract all the predicted sequences listed in the annotation file and then try CD-HIT-EST, which collapses shorter sequences into longer ones above a chosen sequence identity threshold. It reduces redundancy, retains the longest sequence of each cluster, and outputs a FASTA file. If you have too many sequences, try the standalone version.

Also, check whether this Biostars post helps you.
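
If you would rather script the longest-transcript filter from the question directly instead of clustering, here is a minimal sketch. It is plain Python rather than BioPerl, and it rests on one loud assumption: that each FASTA header starts with a transcript ID made of the gene ID plus a suffix such as `.t1` or `.2`. The `gene_id` helper and that naming pattern are hypothetical, so adapt them to however your annotation links transcripts to genes.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Derive a gene ID from a FASTA header.

    ASSUMPTION (hypothetical naming scheme): the first word of the header
    is a transcript ID such as 'geneA.t1' or 'geneA.2', and stripping the
    trailing '.t<N>' or '.<N>' gives the gene ID.
    """
    transcript = header.split()[0]
    return re.sub(r"\.t?\d+$", "", transcript)


def longest_per_gene(records):
    """Keep only the longest sequence for each gene.

    Ties keep the first transcript seen. Genes come out in the order in
    which they first appear in the input.
    """
    best = {}
    for header, seq in records:
        gid = gene_id(header)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (header, seq)
    return list(best.values())
```

To use it, open the CDS FASTA file, pass the file handle to `read_fasta`, pass the result to `longest_per_gene`, and write each returned `(header, sequence)` pair back out as a FASTA record. If the transcript-to-gene mapping lives in a GFF/GTF file rather than in the headers, replace `gene_id` with a lookup built from that file.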

updated 4.3 years ago by Ram 43k • written 9.9 years ago by Prakki Rama ★ 2.7k

0

Thanks for the reply. The Biostars post you linked had some bits and pieces that I think will help me in the future.

9.9 years ago by cjy8709 • 0