Question

How to write out an ID and AA sequence from a Swiss-Prot database file into a new file in a specific order using Python?

0

2.4 years ago

M.O.L.S ▴ 100

I have a Swiss-Prot database file that contains several Swiss-Prot Files.

They are copied and pasted underneath each other.

Therefore there is one Swiss-Prot entry after another listed in the same file.

I want to write the ID into another file as the header. Immediately underneath, I want to write the amino acid sequence.

So far I can only read a single Swiss-Prot entry and get one ID and one amino acid sequence as output. In other words, I have managed to print out the ID header first and the amino acid sequence second.

How can this code work to read multiple Swiss-Prot file entries from one single file?

How do I do this sequentially for every ID and amino acid sequence from each Swiss-Prot entry listed in the file?

bright_cyan = "\033[0;96m"
bright_yellow = "\033[0;33m"
bright_green = "\033[0;32m"
reset = "\033[0m"
#--------------------------------------------------------------------
import sys
import re  
#--------------------------------------------------------------------
def read_data(SPROT_FILE):
    ''' This function is what is is aint it '''

    flag = ''

    try:
        DNAfile = open(SPROT_FILE , 'r')
    except IOError as error:
        print(bright_cyan + "double check and see if you entered the correct filename :> ", str(error))
        sys.exit(1)


    # create a FASTA file to copy the information to and write. 
    new_outfile = open("first.fsa", 'w')

    amino_acid_sequence  = ''

    for line in DNAfile:
        #print(line, end = '')

        if re.match(r'ID', line):
            ID = line[5:20]     

        # Stateful Parsing of the amino acid sequence.  
        if re.match(r'//', line):
            flag = False
        if flag:
            amino_acid_sequence += line
        if re.match(r'SQ', line):
            flag = True

        # Find the modified amino acid residue. 
        if re.match(r'FT   MOD_RES', line):
            FT = line
            position_switch = ','.join(re.findall(r'\d+',FT))
            header_line = '>'+ID.strip()+" phospho:"+position_switch
            print(header_line)
            #print('>'+ID.strip()+" phospho:"+position_switch, file = new_outfile)


    # Print each amino acid sequence outside of the loop.
    amino_acid_sequence = amino_acid_sequence.replace(' ', '')
    print(amino_acid_sequence)


    # Write the amino acid sequence to the file. 
    print(amino_acid_sequence, file = new_outfile)

    DNAfile.close()
    new_outfile.close() 


# Not sure about this part...
files = input(bright_yellow + 'Type possibly filenames :> ').split()
for filename in files:

    read_data(filename)

I hope the question is clear.

It would be great if you could offer some help.

Thanks in advance

protein python • 4.2k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 2.4 years ago by M.O.L.S ▴ 100

1

Can you give an example of what you mean by swiss-prot file? I think you're describing a fasta file with amino acids. In that case use BioPython to parse the fasta file.

ADD REPLY • link 2.4 years ago by Mark ★ 1.5k

0

Yes. Here is an example of the file. It is a few thousand lines long so I won't put the whole thing.

ID   002L_FRG3G              Reviewed;         320 AA.
AC   Q6GZX3;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   05-JUN-2019, entry version 38.
DE   RecName: Full=Uncharacterized protein 002L;
GN   ORFNames=FV3-002L;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; Iridoviridae; Alphairidovirinae; Ranavirus.
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH   NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH   NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH   NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the
RT   genus Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
CC   -!- SUBCELLULAR LOCATION: Host membrane {ECO:0000305}; Single-pass
CC       membrane protein {ECO:0000305}.
DR   EMBL; AY548484; AAT09661.1; -; Genomic_DNA.
DR   RefSeq; YP_031580.1; NC_005946.1.
DR   GeneID; 2947774; -.
DR   KEGG; vg:2947774; -.
DR   Proteomes; UP000008770; Genome.
DR   GO; GO:0033644; C:host cell membrane; IEA:UniProtKB-SubCell.
DR   GO; GO:0016021; C:integral component of membrane; IEA:UniProtKB-KW.
DR   InterPro; IPR004251; Pox_virus_G9/A16.
DR   Pfam; PF03003; Pox_G9-A16; 1.
PE   4: Predicted;
KW   Complete proteome; Host membrane; Membrane; Reference proteome;
KW   Transmembrane; Transmembrane helix.
FT   CHAIN         1    320       Uncharacterized protein 002L.
FT                                /FTId=PRO_0000410509.
FT   TRANSMEM    301    318       Helical. {ECO:0000255}.
FT   COMPBIAS    263    295       Pro-rich.
SQ   SEQUENCE   320 AA;  34642 MW;  9E110808B6E328E0 CRC64;
     MSIIGATRLQ NDKSDTYSAG PCYAGGCSAF TPRGTCGKDW DLGEQTCASG FCTSQPLCAR
     IKKTQVCGLR YSSKGKDPLV SAEWDSRGAP YVRCTYDADL IDTQAQVDQF VSMFGESPSL
     AERYCMRGVK NTAGELVSRV SSDADPAGGW CRKWYSAHRG PDQDAALGSF CIKNPGAADC
     KCINRASDPV YQKVKTLHAY PDQCWYVPCA ADVGELKMGT QRDTPTNCPT QVCQIVFNML
     DDGSVTMDDV KNTINCDFSK YVPPPPPPKP TPPTPPTPPT PPTPPTPPTP PTPRPVHNRK
     VMFFVAGAVL VAILISTVRW
//
ID   012L_FRG3G              Reviewed;         297 AA.
AC   Q6GZW3;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   05-JUN-2019, entry version 29.
DE   RecName: Full=Uncharacterized protein 012L;
GN   ORFNames=FV3-012L;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; Iridoviridae; Alphairidovirinae; Ranavirus.
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH   NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH   NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH   NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the
RT   genus Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
DR   EMBL; AY548484; AAT09671.1; -; Genomic_DNA.
DR   RefSeq; YP_031590.1; NC_005946.1.
DR   GeneID; 2947784; -.
DR   KEGG; vg:2947784; -.
DR   Proteomes; UP000008770; Genome.
PE   4: Predicted;
KW   Complete proteome; Reference proteome.
FT   CHAIN         1    297       Uncharacterized protein 012L.
FT                                /FTId=PRO_0000410530.
SQ   SEQUENCE   297 AA;  32669 MW;  9B1D9247FF7E5D25 CRC64;
     MCAKLVEMAF GPVNADSPPL TAEEKESAVE KLVGSKPFPA LKKKYHDKVP AQDPKYCLFS
     FVEVLPSCDI KAAGAEEMCS CCIKRRRGQV FGVACVRGTA HTLAKAKQKA DKLVGDYDSV
     HVVQTCHVGR PFPLVSSGMA QETVAPSAME AAEAAMDAKS AEKRKERMRQ KLEMRKREQE
     IKARNRKLLE DPSCDPDAEE ETDLERYATL RVKTTCLLEN AKNASAQIKE YLASMRKSAE
     AVVAMEAADP TLVENYPGLI RDSRAKMGVS KQDTEAFLKM SSFDCLTAAS ELETMGF
//
ID   015R_FRG3G              Reviewed;         322 AA.
AC   Q6GZW0;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   05-JUN-2019, entry version 40.
DE   RecName: Full=Uncharacterized protein 015R;
GN   ORFNames=FV3-015R;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; Iridoviridae; Alphairidovirinae; Ranavirus.
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH   NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH   NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH   NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the
RT   genus Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
DR   EMBL; AY548484; AAT09674.1; -; Genomic_DNA.
DR   RefSeq; YP_031593.1; NC_005946.1.
DR   PRIDE; Q6GZW0; -.
DR   GeneID; 2947735; -.
DR   KEGG; vg:2947735; -.
DR   Proteomes; UP000008770; Genome.
DR   InterPro; IPR027417; P-loop_NTPase.
DR   SUPFAM; SSF52540; SSF52540; 1.
PE   4: Predicted;
KW   Complete proteome; Reference proteome.
FT   CHAIN         1    322       Uncharacterized protein 015R.
FT                                /FTId=PRO_0000410504.
SQ   SEQUENCE   322 AA;  36098 MW;  8E5F5B3DA9CDFF8A CRC64;
     MEQVPIKEMR LSDLRPNNKS IDTDLGGTKL VVIGKPGSGK STLIKALLDS KRHIIPCAVV
     ISGSEEANGF YKGVVPDLFI YHQFSPSIID RIHRRQVKAK AEMGSKKSWL LVVIDDCMDN
     AKMFNDKEVR ALFKNGRHWN VLVVIANQYV MDLTPDLRSS VDGVFLFREN NVTYRDKTYA
     NFASVVPKKL YPTVMETVCQ NYRCMFIDNT KATDNWHDSV FWYKAPYSKS AVAPFGARSY
     WKYACSKTGE EMPAVFDNVK ILGDLLLKEL PEAGEALVTY GGKDGPSDNE DGPSDDEDGP
     SDDEEGLSKD GVSEYYQSDL DD
//
ID   023R_IIV3               Reviewed;         106 AA.
AC   Q197D7;
DT   16-JUN-2009, integrated into UniProtKB/Swiss-Prot.
DT   11-JUL-2006, sequence version 1.
DT   18-SEP-2019, entry version 20.
DE   RecName: Full=Uncharacterized protein 023R;
GN   ORFNames=IIV3-023R;
OS   Invertebrate iridescent virus 3 (IIV-3) (Mosquito iridescent virus).
OC   Viruses; Iridoviridae; Betairidovirinae; Chloriridovirus.
OX   NCBI_TaxID=345201;
OH   NCBI_TaxID=7163; Aedes vexans (Inland floodwater mosquito) (Culex vexans).
OH   NCBI_TaxID=42431; Culex territans.
OH   NCBI_TaxID=332058; Culiseta annulata.
OH   NCBI_TaxID=310513; Ochlerotatus sollicitans (eastern saltmarsh mosquito).
OH   NCBI_TaxID=329105; Ochlerotatus taeniorhynchus (Black salt marsh mosquito) (Aedes taeniorhynchus).
OH   NCBI_TaxID=7183; Psorophora ferox.
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=16912294; DOI=10.1128/jvi.00464-06;
RA   Delhon G., Tulman E.R., Afonso C.L., Lu Z., Becnel J.J., Moser B.A.,
RA   Kutish G.F., Rock D.L.;
RT   "Genome of invertebrate iridescent virus type 3 (mosquito iridescent
RT   virus).";
RL   J. Virol. 80:8439-8449(2006).
DR   EMBL; DQ643392; ABF82053.1; -; Genomic_DNA.
DR   RefSeq; YP_654595.1; NC_008187.1.
DR   GeneID; 4156230; -.
DR   KEGG; vg:4156230; -.
DR   OrthoDB; 16183at10239; -.
DR   Proteomes; UP000001358; Genome.
PE   4: Predicted;
KW   Complete proteome; Reference proteome.
FT   CHAIN         1    106       Uncharacterized protein 023R.
FT                                /FTId=PRO_0000377945.
SQ   SEQUENCE   106 AA;  12767 MW;  6620465F6FC52A18 CRC64;
     MGSYMLFDSL IKLVENRNPL NHEQKLWLID VINNTLNLEG KEKLYSLLIV HNKQQTKIYD
     PKEPFYDIEK IPVQLQLVWY EFTKMHLKSQ NEDRRRKMSL YAGRSP
//
ID   048L_FRG3G              Reviewed;          83 AA.
AC   Q6GZS8;
DT   28-JUN-2011, integrated into UniProtKB/Swiss-Prot.
DT   19-JUL-2004, sequence version 1.
DT   05-JUN-2019, entry version 27.
DE   RecName: Full=Uncharacterized protein 048L;
GN   ORFNames=FV3-048L;
OS   Frog virus 3 (isolate Goorha) (FV-3).
OC   Viruses; Iridoviridae; Alphairidovirinae; Ranavirus.
OX   NCBI_TaxID=654924;
OH   NCBI_TaxID=8295; Ambystoma (mole salamanders).
OH   NCBI_TaxID=30343; Dryophytes versicolor (chameleon treefrog).
OH   NCBI_TaxID=8404; Lithobates pipiens (Northern leopard frog) (Rana pipiens).
OH   NCBI_TaxID=8316; Notophthalmus viridescens (Eastern newt) (Triturus viridescens).
OH   NCBI_TaxID=45438; Rana sylvatica (Wood frog).
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15165820; DOI=10.1016/j.virol.2004.02.019;
RA   Tan W.G., Barkman T.J., Gregory Chinchar V., Essani K.;
RT   "Comparative genomic analyses of frog virus 3, type species of the
RT   genus Ranavirus (family Iridoviridae).";
RL   Virology 323:70-84(2004).
DR   EMBL; AY548484; AAT09707.1; -; Genomic_DNA.
DR   RefSeq; YP_031626.1; NC_005946.1.
DR   GeneID; 2947827; -.
DR   KEGG; vg:2947827; -.
DR   Proteomes; UP000008770; Genome.
PE   4: Predicted;
KW   Complete proteome; Reference proteome.
FT   CHAIN         1     83       Uncharacterized protein 048L.
FT                                /FTId=PRO_0000410516.
SQ   SEQUENCE   83 AA;  9566 MW;  52A13E9E325273F6 CRC64;
     MTAKTLDPSD YNVRDDSTTG MFTPVDRFVC DPESDRIIVR KIPPEWTIGN SMRFVHFTKE
     FTQTFDPSES PSNIVRHTNG KKK
//

Hopefully that is clearer now.

Best

ADD REPLY • link 2.4 years ago by M.O.L.S ▴ 100

score 3 · Answer 1 · 2021-12-09

3

2.4 years ago

Mensur Dlakic ★ 27k

If you download an older version of HMMer (say, 2.3.2), it includes a program called sreformat that will convert this format directly to FASTA.

http://hmmer.org/download.html

ADD COMMENT • link 2.4 years ago by Mensur Dlakic ★ 27k

0

There is a problem installing this old HMMer program as it says:

ehmmcalibrate.c:25:10: fatal error: 'emboss.h' file not found
#include "emboss.h"

ADD REPLY • link 2.4 years ago by M.O.L.S ▴ 100

0

The 2.3.2 version of HMMer doesn't have the ehmmcalibrate.c file - I just checked. That would indicate that you are working with a different version.

Separately, HMMer is one of the best-behaved programs I have ever encountered in terms of compiling. I have tried all the major versions of it, and not once did I have to do anything more than a simple:

./configure ; make

I downloaded and recompiled 2.3.2 as I was writing this, and it took less than 30 seconds. sreformat is part of HMMer's squid library and it will be in the corresponding directory. If you have problems compiling HMMer, chances are that something standard is missing from your system.

ADD REPLY • link 2.4 years ago by Mensur Dlakic ★ 27k

0

Yes, this error occurs after the make command.

All the way at the very end.

The system is MacOS Big Sur Version 11.3.1

I likely will not use this old program if it does not simply compile.

Also, I am looking for a more pythonic approach.

ADD REPLY • link 2.4 years ago by M.O.L.S ▴ 100

3

You say you aren't interested in this route in part because it isn't 'pythonic'. I'd argue that not re-implementing software that already exists is very Pythonic, as it adheres to the 'DRY' principle.
In fact, most command-line software can be run from within a Python script if you want to use Python as the backbone of a workflow; often os.system(<command_here>) is easier than subprocess. I demonstrate this near the end of the notebook that I suggest checking out below.
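
As a minimal sketch of driving a command-line converter from Python, the helper below runs a command with subprocess.run and saves its standard output to a file. The function name run_to_file is made up for illustration, and the sreformat invocation mentioned in the comment is an assumption about how that tool is called, so check its help output before relying on it.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(command, output_path):
    """Run `command` (a list of strings) and write its stdout to `output_path`.

    Hypothetical helper for illustration. check=True makes subprocess raise
    CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    with open(output_path, "w") as handle:
        handle.write(result.stdout)
    return result.returncode


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere: the current Python
    # interpreter prints one FASTA-like record.
    # With HMMer 2.3.2 installed, the command might instead look like
    # ["sreformat", "fasta", "entries.sprot"] (assumed invocation).
    demo_command = [sys.executable, "-c", "print('>demo'); print('MSIIG')"]
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.fasta")
        run_to_file(demo_command, target)
        with open(target) as handle:
            print(handle.read(), end="")
```

Passing the command as a list avoids shell quoting problems with file names, which is one reason some people prefer subprocess over os.system despite the extra typing.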

Be that as it may, others may want to follow this route...

This is a case where having another system to do your work on can be handy. It's especially nice because there is zero chance of messing up your own system while trying to install old software. Plus, it makes the work more reproducible by eliminating the 'it works on my machine' issue.

Put https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/HEAD?urlpath=lab/tree/demo in your browser's address bar and hit return, or click that URL to launch a JupyterLab session.

We don't need a special environment here, and so it is the same as available for JupyterLab from Try Jupyter.

When the session comes up, open a new Jupyter notebook and paste in the following in a cell and run it with Shift-Enter:

!curl -OL https://gist.githubusercontent.com/fomightez/cb3a7f13a9b1ff74f55ac23835eb28a5/raw/56b098763f50be78f718642ad7e1a956a9df99e3/Guide_to_using_advice_posted_in_Biostars_answer_9500884.ipynb

That will get a notebook Guide_to_using_advice_posted_in_Biostars_answer_9500884.ipynb. When that shows up after a few seconds, in the file navigation panel on the left, double-click to open it and then execute the entire thing by selecting Run > Run All Cells from the toolbar menu. Alternatively, you can just step through running the cells with Shift-Enter to follow along. The 'Preparation' section installs the software and sets up to process your example data. You won't be able to run the lower cells until you install the software on the new session.

The session is ephemeral, so if you use it to do the conversion, make sure to download anything useful before it ends.
All but the use of Python at the end of the linked Jupyter notebook could be done on the command line with the same commands without the exclamation signs or percent symbols. The notebook just makes it easier to share the commands and the result.

Direct link to static version of that notebook:
here in nbviewer which presently renders gists better than github on my system

ADD REPLY • link 16 months ago by Wayne ★ 1.9k

0

You say you aren't interested in this route in part because it isn't 'pythonic'. I'd argue that not re-implementing software that already exists is very Pythonic, as it adheres to the 'DRY' principle.

Amen !

ADD REPLY • link 2.4 years ago by hugo.avila ▴ 490

1

If the error occurs at the very end, chances are that sreformat may have been compiled successfully before the compilation stopped. It should be in the squid subdirectory. If you don't have a squid subdirectory, you are not compiling the 2.3.2 version.

ADD REPLY • link 2.4 years ago by Mensur Dlakic ★ 27k

0

If the older version of HMMer contains any 32-bit code, it is not going to work on macOS 11.x.

ADD REPLY • link 2.4 years ago by GenoMax 141k

GenoMax · Answer 2 · 2021-12-09

0

2.4 years ago

hugo.avila ▴ 490

Hi, is this what you want to do?

import os
import sys
import re


def main(input_fname: str, output_fname: str) -> None:
    with open(output_fname, 'w') as f_out:
        with open(input_fname, 'r') as f_in:
            for record in re.split('//', f_in.read())[:-1]:
                record_id = re.split(r'\s+', record[record.index('ID'):])[1]
                sequence = ''.join(record[record.index('SQ'):].split('\n')[1:]).replace(' ','')
                f_out.write(f'>{record_id}\n{sequence}\n')

if __name__ ==  '__main__':
    main(input_fname=sys.argv[1], output_fname=sys.argv[2])

ADD COMMENT • link updated 2.4 years ago by GenoMax 141k • written 2.4 years ago by hugo.avila ▴ 490

0

I would like all of the amino acid sequences to be directly underneath the ID line, so that the output file is:

ID LINE
AA sequence
ID LINE
AA sequence
ID LINE
AA sequence and so on ....

There is a ValueError ("substring not found") in the last line that I am trying to iron out.

#!/usr/bin/env python3
import sys
import re

# Get the input and output file names.
if len(sys.argv) == 3:
    input_fname = sys.argv[1]
    output_fname = sys.argv[2]
else:
    sys.stderr.write("Usage: pythonfile.py <input filename> <output filename>\n")
    sys.exit(1)


def main(input_fname: str, output_fname: str) -> None:
    with open(output_fname, 'w') as f_out:
        with open(input_fname, 'r') as f_in:
            # Can you write a comment here to explain the line below?
            for record in re.split('//', f_in.read())[:-1]:
                # What is the record[record.index(...)] part doing?
                record_id = re.split(r'\s+', record[record.index('ID'):])[1]
                # Yes, except this only works for one ID and one sequence, as noted in my initial post.
                # The error is raised on the next line. I want to split the file from SQ to //
                # so that each AA sequence is separate, i.e. until the next '//'.
                sequence = ''.join(record[record.index('SQ'):].split('\n')[1:]).replace(' ', '')
                f_out.write(f'>{record_id}\n{sequence}\n')


if __name__ == '__main__':
    main(input_fname=sys.argv[1], output_fname=sys.argv[2])
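
One possible cause of the "substring not found" error, offered as a guess since the failing entry is not shown: re.split('//', ...) cuts the text at every '//', including those inside URLs such as http://... in comment lines. That leaves fragments that contain no 'ID' or no 'SQ', and str.index raises ValueError on them. The sketch below splits only on lines that consist of '//' alone; the function name split_entries and the sample text are invented for illustration.

```python
import re


def split_entries(text):
    """Split Swiss-Prot flat-file text into entries.

    Only a line that consists of '//' alone is treated as a terminator,
    so '//' inside a URL does not cut an entry in two.
    """
    chunks = re.split(r"^//[ \t]*\r?$", text, flags=re.MULTILINE)
    return [chunk.strip("\r\n") for chunk in chunks if chunk.strip()]


# Invented two-entry sample; the CC line contains '//' inside a URL.
sample = (
    "ID   TEST1_DEMO   Reviewed;   5 AA.\n"
    "CC   -!- WEB RESOURCE: URL=\"http://example.org/\";\n"
    "SQ   SEQUENCE   5 AA;\n"
    "     MSIIG\n"
    "//\n"
    "ID   TEST2_DEMO   Reviewed;   4 AA.\n"
    "SQ   SEQUENCE   4 AA;\n"
    "     MCAK\n"
    "//\n"
)

# Splitting on every '//' breaks the first entry at the URL.
naive = re.split("//", sample)[:-1]
print(len(naive))                  # 3 fragments; the second one has no 'ID'
print(len(split_entries(sample)))  # 2 entries
```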

ADD REPLY • link 2.4 years ago by M.O.L.S ▴ 100

0

import sys
import re


def main(input_fname: str, output_fname: str) -> None:
    with open(output_fname, 'w') as f_out:
        with open(input_fname, 'r') as f_in:
            # The entries in the input file are terminated by '//', so splitting the
            # file on these characters gives a list of records that can be looped over:
            # "record1//record2//record3//" --split('//')--> ["record1", "record2", "record3", ""].
            # As shown above, the last item of the split string is "" (an empty string),
            # so we need to ignore it, like this:
            # ["record1", "record2", "record3", ""][:-1] -> ["record1", "record2", "record3"].
            for record in re.split('//', f_in.read())[:-1]:
                # I can't see why the first code only returned the first ID; I ran it and it
                # worked for the sample. I think there must be a format error in one of the
                # records. This try block is kind of ugly, but it will go to the end of your
                # file, write the output and print the badly formatted records (if any).
                try:
                    # Split the record into a list of lines and keep only the one that starts with 'ID'.
                    # I think you may want the whole line, so I did not pull out only the ID.
                    id_line = list(filter(lambda x: x.startswith('ID'), record.split('\n')))[0]
                    sequence = ''.join(record[record.index('SQ'):].split('\n')[1:]).replace(' ', '')
                    f_out.write(f'>{id_line}\n{sequence}\n')
                except Exception:
                    print(f'Unformatted record\n{record}')

if __name__ == '__main__':
    main(input_fname=sys.argv[1], output_fname=sys.argv[2])

ADD REPLY • link 2.4 years ago by hugo.avila ▴ 490

0

This still doesn't parse correctly, because "SQ" occurs on multiple lines in the file, not just on the sequence header lines. The except block doesn't catch it because the records themselves are well formed. For example, "SQ " is found in the FTId lines.
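
To illustrate how this can happen: record.index('SQ') returns the position of the first 'SQ' anywhere in the entry, not the start of the SQ line. The sketch below uses an invented entry whose entry name happens to contain 'SQ'; the record, the accession and the helper name sequence_after are all hypothetical.

```python
import re

# Invented entry: the entry name 'SQD1_DEMO' contains the letters 'SQ'.
record = (
    "ID   SQD1_DEMO   Reviewed;   6 AA.\n"
    "AC   X00001;\n"
    "SQ   SEQUENCE   6 AA;\n"
    "     MSQIIG\n"
)

# Substring search: stops at the 'SQ' inside the ID line (offset 5).
naive_start = record.index("SQ")

# Line-anchored search: only matches 'SQ' plus three spaces at the start of a line.
match = re.search(r"^SQ {3}", record, flags=re.MULTILINE)
sq_line_start = match.start()


def sequence_after(text, start):
    """Join every line after the one containing `start`, with spaces removed."""
    lines = text[start:].split("\n")[1:]
    return "".join(lines).replace(" ", "")


print(sequence_after(record, naive_start))    # AC and SQ lines leak into the 'sequence'
print(sequence_after(record, sq_line_start))  # MSQIIG
```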

ADD REPLY • link 16 months ago by M.O.L.S ▴ 100

2

To reliably find the SQ line, you need to look for lines starting with "SQ" and then followed by 3 spaces.

Also, I strongly recommend using the primary accession number (first identifier on the first line starting with AC), and not the ID line. The ID line contains an entry name/mnemonic which cannot be guaranteed to remain stable (https://www.uniprot.org/help/entry_name vs https://www.uniprot.org/help/accession_numbers).
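
Both suggestions can be sketched in plain Python as a line-by-line parser: take the first identifier on the first AC line as the header, start collecting residues after a line that begins with "SQ" plus three spaces, and emit the record when "//" is reached. The function names and the two-entry sample are invented for illustration, and the sketch assumes well-formed entries.

```python
import io


def iter_fasta_records(handle):
    """Yield (primary_accession, sequence) for each entry in a Swiss-Prot flat file."""
    accession = None
    in_sequence = False
    residues = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith("//"):
            # End of entry: emit it and reset the state for the next one.
            if accession is not None:
                yield accession, "".join(residues)
            accession, in_sequence, residues = None, False, []
        elif in_sequence:
            # Sequence lines are indented and grouped in blocks of ten residues.
            residues.append(line.replace(" ", ""))
        elif line.startswith("AC   ") and accession is None:
            # Primary accession = first identifier on the first AC line.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("SQ   "):
            in_sequence = True


def write_fasta(handle, out):
    """Write each entry as a '>accession' header followed by its sequence."""
    count = 0
    for accession, sequence in iter_fasta_records(handle):
        out.write(f">{accession}\n{sequence}\n")
        count += 1
    return count


# Invented sample with the same layout as the entries shown in this thread.
sample = io.StringIO(
    "ID   TEST1_DEMO   Reviewed;   12 AA.\n"
    "AC   X00001; X00009;\n"
    "SQ   SEQUENCE   12 AA;\n"
    "     MSIIGATRLQ ND\n"
    "//\n"
    "ID   TEST2_DEMO   Reviewed;   5 AA.\n"
    "AC   X00002;\n"
    "SQ   SEQUENCE   5 AA;\n"
    "     MCAKL\n"
    "//\n"
)
out = io.StringIO()
write_fasta(sample, out)
print(out.getvalue(), end="")
```

For a real file, the two StringIO objects would be replaced by handles from open(), for example inside a with statement. Because the state is reset at every "//", the file is read one line at a time and each entry is written as soon as it ends.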

If you have a list of accession numbers, you can post them to https://www.uniprot.org/id-mapping and download in FASTA format.

If not, you could download the Swissknife package from https://swissknife.sourceforge.net/ and run this script over your data:

use strict;

use IO::File;

use SWISS::Entry;

my $inputfile = $ARGV[0];
my $fh = new IO::File $inputfile or 
    die "Cannot open input file $inputfile: $!";


    $/ = "\n\/\/";
    while(<$fh>) {
        s/\r//g;
        (my $entry_txt = $_) =~ s/^\s+//;
        next unless $entry_txt;
        $entry_txt .= "\n";
        my $entry = SWISS::Entry->fromText( $entry_txt );
        print $entry->toFasta();
    }

ADD REPLY • link 16 months ago by Elisabeth Gasteiger ★ 2.4k

0

Hey, cool implementation !

ADD REPLY • link 16 months ago by hugo.avila ▴ 490

0

Hi, could you post a sample of the input file where the code fails ?

ADD REPLY • link 16 months ago by hugo.avila ▴ 490

0

I added a demo of both of hugo.avila's scripts, along with the use of sreformat, to the Jupyter notebook linked in my post above, which was originally meant to install and demonstrate sreformat.

ADD REPLY • link 2.3 years ago by Wayne ★ 1.9k

2

Hey @Wayne, really nice job with the notebook! I didn't know about Binder; I'll be using it from now on, it seems very practical.

ADD REPLY • link 2.4 years ago by hugo.avila ▴ 490