Get all isoforms and proteins that correspond to list of Trinity gene IDs
0
0
Entering edit mode
9 days ago
cdsparks • 0

Hello,

After using TXImport to summarize my transcript abundance to the gene level, and then running DESeq2, I have a large list of genes that I would like to further investigate the identity and homology. I would like to take the list of gene IDs, and capture all of the isoform and potentially peptide, sequences from my original trinity.fasta and my transdecoder.pep file, respectively.

So my gene IDs look like:

For any of these I want to get the sequences of all available isoforms, which will have an "_i[1-99] appended to the ID such as:

I found a previous answer that has a Python Script that appeared to do this, however it seems to be written for python 2. I cannot get it to work on python 3. I am awful at python and can't quite get it worked out. I am getting the error:

TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

Can anyone edit this python script or write another one to do this?

Alternatively, I was wondering if some type of regex with grep could be used to accomplish this. I have used seqkit grep in the past to get a fasta of sequences based a pattern file of geneIDs, but in this case I would need to somehow indicate to grep that every string in my pattern file will have "_i[1-99]", or allow it to get any result that contains the geneID.

My previous seqkit grep command might look something like this:

seqkit grep --pattern-file DEgenes.txt Trinity.fa > DEgenes.fasta

That only works if I run the DE analysis at the transcript level and keep the isoform IDs, but in this case I want to continue with the gene level analysis. Can I add some regex to indicate that after each string in my pattern file there will be the "_i" and a number? I have played with adding "--id-regexp" but can't quiet work that on.

If it's not too much trouble I would also like to do the same when there is an additional protein ID, denoted by ".p[1-99]":

Any help is very much appreciated!

fasta grep python isoforms • 236 views
1
Entering edit mode

To fix the error as suggested in python3:

TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?

change last line of script file

from:

print >> out, '>' + records[n].id, '\n', records[n].seq


to:

print( '>' + records[n].id + '\n' + records[n].seq, file=out)

0
Entering edit mode

This worked for getting the sequences of all isoforms of the genes on my list. Thank you very much!

If anyone else winds up on this question I also had to change the line:

with open(fasta_file, "rU") as file:

to

with open(fasta_file, "r") as file:

So, just remove the "U" due to syntax differences from python 2 to 3.

Now to see if I can figure out the next step of simply grabbing each protein per isoform. I am still open to suggestions but I have some ideas.