TransDecoder: Filtering *_longest_orfs.pep Results
5.8 years ago
mduhon8 • 0

I've obtained TransDecoder output from 9 different Trinity transcriptomes, one from each of 9 eukaryotic species. Within the _longest_orfs.pep output files, there are multiple predicted protein sequences for each isoform (see data subset below). My goal is to isolate the longest, most complete protein sequence from each isoform. Can someone here provide me with a Python/Perl script that will identify these target protein sequences and write them, with their corresponding IDs, to a secondary file under the following conditions: (1) the protein sequence must be at least 50 amino acids in length, (2) the protein sequence must begin with a methionine (M) residue, and (3) the protein sequence must end with a stop codon (*).

>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p1 type:5prime_partial len:303 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:2-910(+)
GANAGSSAQETGANTESGKQGTGANAGSSAQETGANTESGKQGTGANAGSSGPKTGAAAVNEGQATSADAGSKPTGTNTQPSTGEAEVGGQADTETPLSAGPTGVEQGTAAETVESSPNAGEPAETLEGGHSSECQQFAYRNVGNKIVFDVEVDCQLSVDVTTAQASKTKGVGPDRILAEMQTSSGNKAAVGSCPELVTYTKPGGIIHIKTSQNCLIIIYPERKATGRKGAVPRNVLFTVDSGTQVETQAEGRQVKVAKKVTGKAEKVMGMKETVKVRQIKRTEKVTKPKKTGEGKAKANKI*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p2 type:complete len:132 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:813-418(-)
MPITFSALPVTFFATLTCLPSACVSTWVPLSTVNKTFRGTAPFLPVAFLSGYIIMRQFCDVLIWIIPPGFVYVTSSGHEPTAALFPLDVCISAKILSGPTPFVLDACAVVTSTDNWQSTSTSKTILLPTFR*
>Chione_cancellata_TRINITY_DN43053_c4_g1_i2.p3 type:complete len:117 gc:universal Chione_cancellata_TRINITY_DN43053_c4_g1_i2:985-635(-)
MCDSSIMQDLSLNNLIYVFFFKGSVLYFICFRFSLAGFLWFSHFFRSFNLSNFDCLLHAHNFLCFTCHFLCYLNLSTLSLRFYLGSTVYCEQNISWDSAFPACRFSFRIYYYETVL*

Thank you in advance!

next-gen sequence cDNA TransDecoder transcriptome • 2.5k views
5.8 years ago
h.mon 35k

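Use seqkit fx2tab to convert the FASTA file to tab-delimited format, then grep for "type:complete" (no space after the colon, as in the TransDecoder headers), then convert back from tab to FASTA with seqkit tab2fx. ORFs flagged "type:complete" already begin with M and end with a stop codon, so this covers conditions (2) and (3). Finally, if I am not mistaken, Trinity has a script to filter longest transcripts which might work for TransDecoder as well.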
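
If you would rather do it all in one step, here is a minimal Python sketch of the same idea, not a tested pipeline. It assumes the ORF IDs end in ".p<number>" (as in your data subset), so stripping that suffix gives the isoform ID, and it assumes the 50 aa minimum should not count the trailing "*". The function names (read_fasta, isoform_id, passes, longest_complete_orfs) are just illustrative, not part of TransDecoder or Trinity.

```python
import re
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def isoform_id(header):
    """Strip the '.pN' ORF suffix (assumed naming) to get the isoform ID."""
    orf_id = header.split()[0]
    return re.sub(r"\.p\d+$", "", orf_id)


def passes(seq, min_len=50):
    """Starts with M, ends with '*', and has >= min_len residues before the '*'."""
    return seq.startswith("M") and seq.endswith("*") and len(seq) - 1 >= min_len


def longest_complete_orfs(records, min_len=50):
    """Keep the longest passing ORF per isoform; the first one wins on ties."""
    best = {}
    for header, seq in records:
        if not passes(seq, min_len):
            continue
        key = isoform_id(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return best


def main(in_path, out_path):
    with open(in_path) as fin:
        best = longest_complete_orfs(read_fasta(fin))
    with open(out_path, "w") as fout:
        for header, seq in best.values():
            fout.write(f">{header}\n{seq}\n")


if __name__ == "__main__":
    # Usage: python filter_orfs.py in_longest_orfs.pep out_filtered.pep
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
```

On your data subset this would drop .p1 (it starts with G, not M) and keep .p2 over .p3 for that isoform, since both pass the filters and .p2 is longer. Run it once per species file.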
