get organism name from fasta file header (PYTHON)
2
1
Entering edit mode
3.6 years ago

For example, is it possible to get the organism name as one gets the sequence and gene id for example:

>sp|Q09305|AAR2_CAEEL Protein AAR2 homolog OS=Caenorhabditis elegans GN=F10B5.2 PE=3 SV=1
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCS
IKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYE
NYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE

>sp|Q18007|ACM1_CAEEL Probable muscarinic acetylcholine receptor gar-1 OS=Caenorhabditis elegans GN=gar-1 PE=2 SV=3
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVS
NRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLI
TADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAEC
SAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL

When taking out the following code I can get the sequence and protein id but not the organism name, how can this be done? :)

from Bio import SeqIO
import re
import pandas as pd

input_file = "Streptomyces_Uniprot.fasta" 
pattern = "\|(.*?)\|"
substring = re.search(pattern, s).group(1)
sequence_list = []
id_list = []

fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
    fasta_id, sequence, description = fasta.id, str(fasta.seq), fasta.description
    fasta_id = re.search(pattern, fasta_id).group(1)
    print (fasta_id)

How would I pull when OS="ORGANISM_NAME" for example from the description?

genome sequence fasta • 3.4k views
ADD COMMENT
0
Entering edit mode

Hello biohacker_tobe!

It appears that your post has been cross-posted to another site: https://stackoverflow.com/questions/63922667/

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
2
Entering edit mode
3.6 years ago
Joe 21k

If the OS and GN identifiers are reliable, you could try this:

from Bio import SeqIO
import sys, re

new_recs = []
for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    rec.description = re.search(r'OS=(.*?) GN=', rec.description).group(1)
    new_recs.append(rec)


for new_rec in new_recs:
    print(">" + new_rec.description + "\n" + new_rec.seq)

Output:

>Caenorhabditis elegans
MGGALPPEIVDYMYRNGAFLLFLGFPQASEFGIDYKSWKTGEKFMGLKMIPPGVHFVYCSIKSAPRIGFFHNFKAGEILVKKWNTESETFEDEEVPTDQISEKKRQLKNMDSSLAPYPYENYRSWYGLTDFITADTVERIHPILGRITSQAELVSLETEFMENAEKEHKDSHFRNRVDRE
>Caenorhabditis elegans
MPNYTVPPDPADTSWDSPYSIPVQIVVWIIIIVLSLETIIGNAMVVMAYRIERNISKQVSNRYIVSLAISDLIIGIEGFPFFTVYVLNGDRWPLGWVACQTWLFLDYTLCLVSILTVLLITADRYLSVCHTAKYLKWQSPTKTQLLIVMSWLLPAIIFGIMIYGWQAMTGQSTSMSGAECSAPFLSNPYVNMGMYVAYYWTTLVAMLILYKGIHQAAKNLEKKAKAKERRHIALILSQRL
ADD COMMENT
0
Entering edit mode

Thanks, this is exactly what I was looking for :D

ADD REPLY
1
Entering edit mode
3.6 years ago

assuming all the fields are in the same order:

awk -F= '/^>/ {X=$2;gsub(/[^ ]*$/,"",X);print X}' in.fasta
ADD COMMENT
0
Entering edit mode

With linux I am aware on how to do it, but is it possible directly with python?

ADD REPLY
0
Entering edit mode

Since there is no formal specification for fasta headers (beyond > and some identifier that follows), you can't do it using Biopython. You will likely have to do this by pattern search after you grab the header.

ADD REPLY
0
Entering edit mode

Fair enough, thanks for the pointer :)

ADD REPLY

Login before adding your answer.

Traffic: 2527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6