how to collecting all isoforms of a gene from Trinity.fasta file ?
1
1
Entering edit mode
7.9 years ago
Farbod ★ 3.4k

Dear Friends, Hi ( I'm not native in English so, be ready for some possible language flaws).

I have done DEG analysis on de novo assembly created via Trinity software.

Now I want to pull out my DEG sequences for annotation, and I have created two collection.fasta files containing the ID names of those DEGs (one for Male sample DEGs and one for female sample DEGs)

but the DEG result IDs is as: TRINITY_DN212724_c0_g1 and the structure of Trinity.fasta is as TRINITY_DN212724_c0_g1_i# and it is usual that each gene has several isoformes.

I have used a program called "faSomeRecords" but it seems that it could not collect all the isoform of a gene ID automatically.

I have add "_i* " at the end of all gene IDs in the DEG files, but no chance !

I need a tool to collect all isoforms of some selected genes (collection.fasta file) from Trinity.fasta file, please.

Thank you in advance

sequence sequence manipulation • 3.7k views
ADD COMMENT
1
Entering edit mode
7.9 years ago
st.ph.n ★ 2.7k

I wrote a script awhile back that takes the longest isoform per gene and writes it to a new fasta. I've adapted it to solve your problem. Rename your collections.fasta to collections.txt, if it is not in fasta format. This script will take the following inputs: your trinity fasta containing all transcripts, your gene ids file (one ids per line), and the name of your output fasta.

#!/usr/bin/env python

import sys
from collections import defaultdict
from Bio import SeqIO

fasta_file = sys.argv[1]
gene_ids_file = sys.argv[2]
output_file = sys.argv[3]

with open(fasta_file, "rU") as file:
        records = list(SeqIO.parse(file, "fasta"))

with open(gene_ids_file, 'r') as f:
        gene_ids = [line.strip() for line in f]

with open(output_file, 'w') as out:
        for i in gene_ids:
                for n in range(len(records)):
                        if i in records[n].id:
                                print >> out, '>' + records[n].id, '\n', records[n].seq

Save this to a file, and run with python script.py Trinity.fasta collections.txt outputfile

ADD COMMENT
1
Entering edit mode

Dear st.ph.n,

Thank you for sharing your valuable script!

I will check it.

ADD REPLY
1
Entering edit mode

Hope it works well for you. You should also change your question title to 'Question: how to collect all isoforms of a gene from Trinity.fasta file?'

ADD REPLY
1
Entering edit mode

Yes, I have corrected it.

ADD REPLY
1
Entering edit mode

It seems that it works, Thank you

ADD REPLY
1
Entering edit mode

Dear st.ph.n Hi,

Do you have any script that can check the Trinity.fasta file and give a list of those transcripts (IDs) that have only one isoform ?

I will appreciate if you share your valuable script with me.

Thanks!

ADD REPLY
0
Entering edit mode

Dear st.ph.n, Could you provide for me the original script you used that takes the longest isoform per gene and that writes it to a new fasta? It would be greatly appreciated.

ADD REPLY
0
Entering edit mode

Sorry to bring an older post back up, but I've been trying to use this script and it doesn't seem to be outputting anything for me. Any help would be appreciated!

ADD REPLY

Login before adding your answer.

Traffic: 3662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6