Question: Creating Table From Fasta File In Python
1
gravatar for Peiska
7.5 years ago by
Peiska30
Finland
Peiska30 wrote:

I have a fasta file with more than 500 sequences, and from that file I need to build a table, so I am trying to write a script to do it instead of doing it by hand, with copy-paste. I am reading the file using Biopython : seq=SeqIO.parse(handle, "fasta")

From each sequence I want to know the specie the protein sequence belongs, the name of the protein and the Uniprot ID. when I use the SeqIO to parse a fasta file I noticed that there is not much info I can parse from it.

Here is an subset of my fasta file:

>gi|194757291|ref|XP_001960898.1| GF11270 [Drosophila ananassae] >gi|190622196|gb|EDV37720.1| GF11270 [Drosophila ananassae] MSAARTSQDCDCTAKCRLRQHGNTITAALTKRSISSQNLAAFVYKTCGNFANILDDLGRSAVHMSASTGRYEILEWLLNH GAYINGQDYESGSSPLHRALYYGSIDCAVLLLRYGASMELLDEDTCCPLQAICRKCDVDDFATDSQNDVLVWGSNKNYNL GVGSEQNTNAPQSVDFFRKSNIWIEQVALGAYHSLFLDKKGHLYAVGHGKGGRLGTGGENTLPAPKRVKVSSKLGSEDSI RCISVSRQHSLVLTHRSLVFACGLNSDCQLGVRDAPEHLAQFKEVVALRDKGASDLVRVIACDQHSIAYGSRCVYVWGAN QGQFGISANIASIVVPTLIKLPARTTIRFVEANNAATVIYSEEKMIYLYYAEKTRAIKTPNYEDLKSISVMGGHIKNSAK GSAAALKLLMLTETNVVYLWYENTQQFYRCNFLPIRLPQIKKILYKCNQVMVLSEDGCVYRGKCNQIALPASELQEKSRP NLDIWQNNDQNRTEISREHVIRIELQRVPNIDRAVDISCDEGFSSFAVLQESQGKYFRKPTLPRKEHSFKKLLHDTSDCD AVHDVVFHVDGEKYPAHKYIIYSRAPGLRELVRMYLDKDIYLNFENLTGKMFELVLKHIYTNYWPTEDDIDCIQQSLGPA NPQNRSRTCQMFLPHLEKFQLTELAKYVKSYVQDHQFPLPSARQRLPRLHRSDYPELYDVKIKCEDGQVLQAHKCMLVAR LEYFEMMFMHSWAERSSVTMEGVPAEYMEPVLDYLYSLEAEAFCKQAYLETFLYNMITICDQYFIESLQNLCELLILDKI SIRKCGEMLEFATMYNCKLLLKGCMDFICQNLARVLCYRSIEQCDGETLKCLNDHYRNMFSRVFDYRQITPFSEAIEDEL LLSFIDGLEVDLEYRMDAESKAKQAAKTKQKDLRKLNARHQYEQRAISSMMRSISISESNPAPEVATSPQESARSETNNW SRVIDKKEQKRKQAETALKVNKTLKQETSPEPEMVPIERTPVNEQTPPPLSPETEPSTPLNKSYNLDFSSLTPQSQKLSQ KQRKRLSSESKSWRGNSSALLESPTTPVPVPNAWGVTTTPSSSFNDSYTSPTTGSSSDPTSFANMMRSQAASSSATSKDQ SQNFSKILADERRQRESYERMRNKSLVHTQIEETAIAELREFYNVDNIDDEKITIARKSRPSDINFSTWIRQ
>gi|198456847|ref|XP_001360463.2| GA20796 [Drosophila pseudoobscura pseudoobscura] >gi|198135774|gb|EAL25038.2| GA20796 [Drosophila pseudoobscura pseudoobscura] MSTAKAQEYDCTAKCTCRQHGNSITAALTKRSIDNQNLGAFIFKTCGNFANIIDDLGRSAVHMSASVARYEILEWLLNHG AYINGLDYESGSSPLHRALYYGSIDCAVLLLRYGASLELLDEDTRCPLQAICRKCDEDFTTESQNDVLVWGSNKNYNLGI GNEQNTNAPQAVDFFRKSNIWIEQVALGAYHSLFCDKKGHLYAVGHGKGGRLGIGVENSLPAPKRVKVSSKLNDDSIMCI SVSRQHSLLLTRRSLVFACGINTDHQLGVRDAPENLTQFREVVALRDKGASDLLRVIACDQHSIAYSTKCVYVWGANQGQ FGISRTTDTIMAPTLIKLPARTSIRFVEANNAATVIYTEEKMITLFYGDKTRYIKTPNYEDLKSIAVIGGHLKSSTKGSA AALKLLMLTETNVVFLWYENTQQFYRCNFSPIRLPEIKKILYKCNQVLILSLDGCVYRGKCNQIALPAGILEEKSKPNMD IWHNNDQNRTEISREHVIRIELQRVPNIDRATDIFCDESFSSFAVLQESHMKYFRKPPLPRREHNFKKLYHDTCESDAVH DVVFHVDGERFAAHKFILYSRAPGLRELTRIYLDKDVYLNFENLTGKMFELILKYIYTSYWPTEDDIDCIQESLGPANPR ERSRACEMFIPHLEMFQLVDLARYLQSYVRDNQFPIPSTRQRFNRLHRSDYPELYDVRIVCEDSKVLEAHKCMLVSRLEY FEMMFTHSWAERTTVNMEGVPAEYMEPVLDYLYSLDTEAFCKQNYTETFLYNMVTFCDQYFIESLQNVCESLILDKISIR KCGEMLDFAAMYNCKLLHKGCMDFICHNLARVLCYRSIEQCDEATLKCLNDHYRKMFSNVFDYRQITPFSEAIEDELLLS FVVDCDIDLDYRMDPETKLKAAAKHKQKDLRRQDARHYYEQQAISSMMRSLSVSESASGPEATTGPQESTRSEGKNWSRV VDKKEQKRKLADTALKVNNTLKLEEPPRPELEVIERALMKEQTPPPTSPAEETSTPLSKSYNLDLSSLTPQSQKLSQKQR KRLSSESKSWRSPLVEQEPTTPVAVPNAWGLPPATPSSSSFTDSPATGSISDPTSFANMMRGQAAAATTPTEKGQSFSRI LADERRQRESFERMRNKSLAHTQIEETAIAELREFYNVDNTDDETITIERKSRPTDINFSTWLKH
>gi|355695434|gb|AES00009.1| inhibitor of Bruton agammaglobulinemia tyrosine kinase [Mustela putorius furo] KPGNKLKLNQKKCSFLCDVTMKSVDGKEFTCHKCVLCARLEYFHSMLSSSWIEASTCTALEMPIHSDILKVILDYLYTDE AVVIKESQNVDFVCSVLVVADQLLITRLKGMCEVALTEKLTLKNAAMILEFAAMYNAEQLKLSCLQFIGLNM

Is there any way I get the Protein name, Uniprot ID and organism from those sequences? For example I thought on parsing the genebank id from the seq.description and then search in genebank with that ID but I think it cannot be made, not all the sequences have genebank ID. Any suggestions how to do this? Any help would be really appreciated.

example of desired output:

name    organism    uniprot id  family
GF11270 Sophophora  B3MFN0  
GA20796 Sophophora  Q291S4
python • 3.3k views
ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Peiska30

What you pasted there is not valid Fasta.

ADD REPLYlink written 7.5 years ago by Neilfws48k

Check out the different string split functions in the Python and other like startswith, endswith.

ADD REPLYlink written 7.5 years ago by Thaman3.3k
2
gravatar for Peiska
7.5 years ago by
Peiska30
Finland
Peiska30 wrote:

Solution found:

handle = open("seq.fasta")
orchid_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
for key in orchid_dict:
    # print key
    #print orchid_dict[key].description
    if orchid_dict[key].description.find("|gb|") !=-1:
        gb_id=re.findall(r'gb\|([^\|]+)', orchid_dict[key].description)
        newstr = str(gb_id).replace("'", "")
        newstr = newstr.replace("[", "")
        newstr = newstr.replace("]", "")

        params = ( ('query',newstr),

                  ('format','tab'),
                  ('columns','protein names,organism,id,families,existence'  )          
                  )

        data = urllib.urlencode(params)

        request = urllib2.Request(url, data)
        contact = "" # Please set your email address here to help us debug in case of problems.
        request.add_header('User-Agent', 'Python contact')
        response = urllib2.urlopen(request)
        page = response.read(200000)
        f.writelines( page)
ADD COMMENTlink written 7.5 years ago by Peiska30

You should probably quote where you found this solution. From the orchid reference I guess on the Biopython tutorial?

ADD REPLYlink written 7.5 years ago by skymningen330

I used the code from the Biopython tutorial to read the fast file, and to access the Uniprot I found the information here: http://www.uniprot.org/faq/28#downloading

ADD REPLYlink written 7.5 years ago by Peiska30

thanks for following up with an answer!

ADD REPLYlink written 7.5 years ago by Istvan Albert ♦♦ 84k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 688 users visited in the last hour