Question

Longest transcript variant per gene

0

Entering edit mode

10 weeks ago

EnriqueSan_Mx ▴ 110

Hello everyone,

I am running orthofinder on several proteomes that I obtained after an annotation with BRAKER3 and I want to perform an analysis with orthofinder. A suggestion to make the analysis faster and more precise is to extract the longest transcript variant per gene. Orthofinder provides a script for this but it only applies to files downloaded from Ensembl. Does anyone know a tool that can help me with this?

Basically I have these files for each species:

braker.aa
braker.codingseq
braker.gff3

My protein file looks like this:

head -n 2000 braker.aa

>g176.t1
MTKLTKRLELQMESSRLGLLRSHSRARSSKLASSQSKADGPPAEPEDAAKDATKDAAKEE
KQEKSRSWFSPRRKDSVASTILRSSKSFRRLSAVSSAASSQAAISTPTTPSFSRGSYESD
AAFHHRKLSSVESGEAPLSEPAPTDPPSIPYPPHRPERPQGDLFEGTPLEKANQLEKANQ
FEKANQLEKANQPSQRPSRSRSILRPFSPFPGPSIRDLRRLSQHSEHKPQANNAADTATT
TASKANRLSEEAEQVHEPADVAMTPASPKPSTYSVLSAPAPIVPNRRASLRPSSKGSETG
LARKSSIASSFHFSIHRRPSHVTAEIESRKHARSSRWTLTENMAEMFKGQHIKTDKQAMT
PSQIEAIWNGQDNGGAAAAAAKQKQKKSKERRMKSASDTSASIPRSFGGPLTEQQTNMFS
EPFQWPDKMSSPVPMSRADIKMRALPFEITVPPPPSTILVSPEIIHGDVSPKSPTKIDRR
HMERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFR
SPPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDS
VVTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGF
GFETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQ
NIFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGME
HGLEAFLGFTIDDSY*


>g176.t2
MERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFRS
PPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDSV
VTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGFG
FETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQN
IFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGMEH
GLEAFLGFTIDDSY*

transcript longest variant orthofinder • 614 views

ADD COMMENT • link 10 weeks ago by EnriqueSan_Mx ▴ 110

2

Entering edit mode

How to extract the longest isoform from multi fasta file
How do you extract longest transcript from fasta file using length and header IDs?

ADD REPLY • link 10 weeks ago by GenoMax 144k

0

Entering edit mode

Thanks so much

ADD REPLY • link 10 weeks ago by EnriqueSan_Mx ▴ 110

score 2 · Answer 1 · 2024-05-11

Hi, why not write code to extract the longest variant?

I copied these two in a txt file called "input.txt", and made three dummy variants of a gene "d163.t1" through "d163.t3", where d163.t3 is the longest of the transcripts. Thus, my output file should contain only g176.t1 (which is the longest of the two you provided), and d163.t3 (Which is the dummy gene I made that is a longer transcript). Here is some code I quickly wrote to solve this:

def extract_longest_transcript(input_file, output_file):
   gene_transcripts = {}
   current_gene = None
   current_transcript = ''
   current_variant = ''
   with open(input_file, 'r') as f:
       for line in f:
           line = line.strip()
           if line.startswith('>'):
               if current_gene:
                   if current_gene in gene_transcripts:
                       existing_transcript, _, _ = gene_transcripts[current_gene]
                       if len(current_transcript) > len(existing_transcript):
                           gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
                   else:
                       gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
               current_gene = line[1:].split('.')[0]  # Extracting the gene identifier
               current_variant = line[1:].split('.')[1] # Extracting the variant
               current_transcript = ''
           else:
               current_transcript += line

       # Last transcript
       if current_gene:
           if current_gene in gene_transcripts:
               existing_transcript, _, _ = gene_transcripts[current_gene]
               if len(current_transcript) > len(existing_transcript):
                   gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
           else:
               gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)

   # Write longest transcripts to file
   with open(output_file, 'w') as out:
       for gene, (transcript, _, variant) in gene_transcripts.items():
           out.write(f'>{gene}.{variant}\n{transcript}\n')

This results in an output file containing only the longest variants for each. Now of course you have different file types and this code may not translate exactly with the rest of the formatting, but this is just to maybe spark some ideas for you.

Please also see GenoMax's post, who has linked several answers to very similar questions before.