Question

get organism name from fasta file header (PYTHON)

1

Entering edit mode

3.6 years ago

biohacker_tobe ▴ 80

For example, is it possible to get the organism name as one gets the sequence and gene id for example:

>sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans GN=F10B5.2 PE=3 SV=1
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCS
IKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYE
NYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE

>sp|Q18007|ACM1_CAEEL Probable muscarinic acetylcholine receptor gar-1 OS=Caenorhabditis elegans GN=gar-1 PE=2 SV=3
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVS
NRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLI
TADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAEC
SAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL

When taking out the following code I can get the sequence and protein id but not the organism name, how can this be done? :)

from Bio import SeqIO
import re
import pandas as pd

input_file = "Streptomyces_Uniprot.fasta" 
pattern = "\|(.*?)\|"
substring = re.search(pattern, s).group(1)
sequence_list = []
id_list = []

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
    fasta_id, sequence, description = fasta.id, str(fasta.seq), fasta.description
    fasta_id = re.search(pattern, fasta_id).group(1)
    print (fasta_id)

How would I pull when OS="ORGANISM_NAME" for example from the description?

genome sequence fasta • 3.4k views

ADD COMMENT • link updated 3.6 years ago by Joe 21k • written 3.6 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Hello biohacker_tobe!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/63922667/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 3.6 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

3.6 years ago

Pierre Lindenbaum 161k

assuming all the fields are in the same order:

awk -F= '/^>/ {X=$2;gsub(/[^ ]*$/,"",X);print X}' in.fasta

ADD COMMENT • link 3.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

With linux I am aware on how to do it, but is it possible directly with python?

ADD REPLY • link 3.6 years ago by biohacker_tobe ▴ 80

0

Entering edit mode

Since there is no formal specification for fasta headers (beyond > and some identifier that follows), you can't do it using Biopython. You will likely have to do this by pattern search after you grab the header.

ADD REPLY • link 3.6 years ago by GenoMax 141k

0

Entering edit mode

Fair enough, thanks for the pointer :)

ADD REPLY • link 3.6 years ago by biohacker_tobe ▴ 80

score 2 · Accepted Answer · 2020-09-16

If the OS and GN identifiers are reliable, you could try this:

from Bio import SeqIO
import sys, re

new_recs = []
for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    rec.description = re.search(r'OS=(.*?) GN=', rec.description).group(1)
    new_recs.append(rec)


for new_rec in new_recs:
    print(">" + new_rec.description + "\n" + new_rec.seq)

Output:

>Caenorhabditis elegans
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCSIKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYENYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE
>Caenorhabditis elegans
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVSNRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLITADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAECSAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL