Get all isoforms and proteins that correspond to list of Trinity gene IDs
0
0
Entering edit mode
15 months ago
cdsparks • 0

Hello,

After using TXImport to summarize my transcript abundance to the gene level, and then running DESeq2, I have a large list of genes that I would like to further investigate the identity and homology. I would like to take the list of gene IDs, and capture all of the isoform and potentially peptide, sequences from my original trinity.fasta and my transdecoder.pep file, respectively.

So my gene IDs look like:

geneIDs

For any of these I want to get the sequences of all available isoforms, which will have an "_i[1-99] appended to the ID such as:

isoformIDs

I found a previous answer that has a Python Script that appeared to do this, however it seems to be written for python 2. I cannot get it to work on python 3. I am awful at python and can't quite get it worked out. I am getting the error:

TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

Can anyone edit this python script or write another one to do this?

Alternatively, I was wondering if some type of regex with grep could be used to accomplish this. I have used seqkit grep in the past to get a fasta of sequences based a pattern file of geneIDs, but in this case I would need to somehow indicate to grep that every string in my pattern file will have "_i[1-99]", or allow it to get any result that contains the geneID.

My previous seqkit grep command might look something like this:

seqkit grep --pattern-file DEgenes.txt Trinity.fa > DEgenes.fasta

That only works if I run the DE analysis at the transcript level and keep the isoform IDs, but in this case I want to continue with the gene level analysis. Can I add some regex to indicate that after each string in my pattern file there will be the "_i" and a number? I have played with adding "--id-regexp" but can't quiet work that on.

If it's not too much trouble I would also like to do the same when there is an additional protein ID, denoted by ".p[1-99]":

proteinID

Any help is very much appreciated!

fasta grep python isoforms • 671 views
ADD COMMENT
1
Entering edit mode

To fix the error as suggested in python3:

TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

change last line of script file

from:

print >> out, '>' + records[n].id, '\n', records[n].seq

to:

print( '>' + records[n].id + '\n' + records[n].seq, file=out)
ADD REPLY
0
Entering edit mode

This worked for getting the sequences of all isoforms of the genes on my list. Thank you very much!

If anyone else winds up on this question I also had to change the line:

with open(fasta_file, "rU") as file:

to

with open(fasta_file, "r") as file:

So, just remove the "U" due to syntax differences from python 2 to 3.

Now to see if I can figure out the next step of simply grabbing each protein per isoform. I am still open to suggestions but I have some ideas.

Thanks again, mohammadhassanj

ADD REPLY

Login before adding your answer.

Traffic: 2636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6