After using TXImport to summarize my transcript abundance to the gene level, and then running DESeq2, I have a large list of genes that I would like to further investigate the identity and homology. I would like to take the list of gene IDs, and capture all of the isoform and potentially peptide, sequences from my original trinity.fasta and my transdecoder.pep file, respectively.
So my gene IDs look like:
For any of these I want to get the sequences of all available isoforms, which will have an "_i[1-99] appended to the ID such as:
I found a previous answer that has a Python Script that appeared to do this, however it seems to be written for python 2. I cannot get it to work on python 3. I am awful at python and can't quite get it worked out. I am getting the error:
TypeError: unsupported operand type(s) for >>: 'builtin_function_or_method' and '_io.TextIOWrapper'. Did you mean "print(<message>, file=<output_stream>)"?
Can anyone edit this python script or write another one to do this?
Alternatively, I was wondering if some type of regex with grep could be used to accomplish this. I have used seqkit grep in the past to get a fasta of sequences based a pattern file of geneIDs, but in this case I would need to somehow indicate to grep that every string in my pattern file will have "_i[1-99]", or allow it to get any result that contains the geneID.
My previous seqkit grep command might look something like this:
seqkit grep --pattern-file DEgenes.txt Trinity.fa > DEgenes.fasta
That only works if I run the DE analysis at the transcript level and keep the isoform IDs, but in this case I want to continue with the gene level analysis. Can I add some regex to indicate that after each string in my pattern file there will be the "_i" and a number? I have played with adding "--id-regexp" but can't quiet work that on.
If it's not too much trouble I would also like to do the same when there is an additional protein ID, denoted by ".p[1-99]":
Any help is very much appreciated!