Question: How to parse a .fasta file in Python?
0
7 weeks ago by
2001linana20
2001linana20 wrote:

I have a .fasta file formatted like this:

>NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT...
>MW326508.1 |Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-DSHS-1443/2020 ORF1ab polyprotein (ORF1ab), ORF1a polyprotein (ORF1ab), surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), ORF7b (ORF7b), ORF8 protein (ORF8), nucleocapsid phosphoprotein (N), and ORF10 protein (ORF10) genes, complete cds
CTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACT
CGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAG
GACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCG...

While searching online, I came across this thread: Correct Way To Parse A Fasta File In Python

It looks like Biopython can be used to solve this. Yet I doubt it, since the .fasta file has no sequence or annotation fields to indicate which part of the input is the sequence and which part is the sequence ID. I guess I will write the code from scratch without using Biopython modules. Or are there any suggestions for this task?

sequence python fasta • 216 views
modified 7 weeks ago by trausch1.6k • written 7 weeks ago by 2001linana20

A FASTA file is a text file containing sequence information: each sequence ID sits on a line starting with the '>' character, and the sequence itself follows on the lines right after it, up to the next '>' line. In your example, you have 2 sequences, NC_045512.2 and MW326508.1, which use the ID line to provide annotation right after the '|' character. Although you haven't mentioned what exactly you'd like to do with those sequences, that example should be perfectly parseable by any FASTA parser.
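To make the layout described above concrete, here is a minimal from-scratch sketch in plain Python. The function name parse_fasta and the shortened toy sequences are made up for illustration; it assumes every record starts with a '>' line and ignores blank lines.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle.

    Hypothetical helper: the header is everything after '>', and the
    sequence is the concatenation of the lines that follow it.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the last record, which has no following '>' line.
        yield header, "".join(chunks)


# Toy input with shortened sequences (not the real genomes).
example = StringIO(
    ">NC_045512.2 |Severe acute respiratory syndrome coronavirus 2\n"
    "ATTAAAGGTT\n"
    "TATACCTTCC\n"
    ">MW326508.1 |Severe acute respiratory syndrome coronavirus 2\n"
    "CTTTCGATCT\n"
)
records = list(parse_fasta(example))
for header, seq in records:
    seq_id = header.split()[0]  # text up to the first whitespace
    print(seq_id, len(seq))
```

For a real file, the same generator could be fed an opened file object instead of the StringIO used here.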

modified 7 weeks ago • written 7 weeks ago by Jorge Amigo12k
3
7 weeks ago by
Joe18k
United Kingdom
Joe18k wrote:

Biopython most assuredly is the 'right' way to parse a file like this for simple applications.

Once you have parsed a file with SeqIO.parse(...), you can access the ID of each record with the object.id attribute. This is the string following the > up to the first whitespace. Alternatively, you can use the object.description attribute to access the full header string after the >.

If you need to do any more complicated parsing of the headers, you need to do this yourself by applying string manipulation operations to the object.description.
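As a rough illustration of that kind of string manipulation, the sketch below splits a header of the form used in the question ('ID |annotation') into its two parts. The function name split_header is hypothetical, and the '|' convention is an assumption taken from the example file, not a FASTA rule.

```python
def split_header(description):
    """Split a header like 'ID |annotation' into (id, annotation).

    Hypothetical helper. If there is no '|' in the header, the whole
    string is returned as the ID and the annotation is empty.
    """
    seq_id, _, annotation = description.partition("|")
    return seq_id.strip(), annotation.strip()


# Header text from the question, without the leading '>'.
desc = (
    "NC_045512.2 |Severe acute respiratory syndrome coronavirus 2 "
    "isolate Wuhan-Hu-1, complete genome"
)
seq_id, annotation = split_header(desc)
print(seq_id)
print(annotation)
```

With Biopython, the input would be record.description for each record yielded by SeqIO.parse(path, "fasta").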

modified 7 weeks ago • written 7 weeks ago by Joe18k
1
7 weeks ago by
colindaven2.6k
Hannover Medical School
colindaven2.6k wrote:

Try Biopython. From memory, it uses delimiters such as spaces to separate the ID from the name or annotation fields. Depending on what you want to do, it might help to use sed or a similar tool on Linux to modify the FASTA headers so that the ID and name "stick together".

You are right though, fasta headers are notoriously unstructured.
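The header rewrite suggested above can also be done in Python rather than sed. This is only a sketch: the function name join_header_fields and the choice of an underscore as the joining character are assumptions, so adapt them to whatever your downstream tool expects.

```python
import re


def join_header_fields(lines):
    """Replace whitespace runs in header lines with underscores.

    Hypothetical helper: sequence lines pass through unchanged, so a
    whitespace-splitting parser then sees the whole header as the ID.
    """
    for line in lines:
        if line.startswith(">"):
            body = line[1:].strip()
            yield ">" + re.sub(r"\s+", "_", body) + "\n"
        else:
            yield line


# Toy input with a shortened sequence line.
original = [
    ">NC_045512.2 |Severe acute respiratory syndrome coronavirus 2\n",
    "ATTAAAGGTT\n",
]
rewritten = list(join_header_fields(original))
print("".join(rewritten), end="")
```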

written 7 weeks ago by colindaven2.6k
0
7 weeks ago by
Matt Shirley9.5k
Cambridge, MA
Matt Shirley9.5k wrote:

I'll link to my answer from almost 7 years ago. Use Biopython (pure Python iterative parsing), pyfaidx (pure Python file offset-based parsing), or pyfastx (C/Python file offset-based parsing). I can vouch for the first two methods. I haven't used pyfastx, though it looks like a good implementation, especially if you need to index FASTQ files as well.
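To show what "file offset-based parsing" means in general, here is a toy sketch using only the standard library. It is not the pyfaidx or pyfastx API and not how either library is implemented; build_index and fetch are hypothetical names, and the index simply records the byte offset of each header line so that one record can be read without scanning the whole file again.

```python
import os
import tempfile


def build_index(path):
    """Map each sequence ID to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break  # end of file
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode()] = offset
    return index


def fetch(path, index, seq_id):
    """Seek straight to one record and return its sequence as a string."""
    with open(path, "rb") as fh:
        fh.seek(index[seq_id])
        fh.readline()  # skip the header line itself
        chunks = []
        for line in fh:
            if line.startswith(b">"):
                break  # next record begins
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Toy file with made-up records, written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seqA first toy record\nACGT\nACGT\n>seqB second toy record\nGGCC\n")
    path = tmp.name

index = build_index(path)
seq_a = fetch(path, index, "seqA")
seq_b = fetch(path, index, "seqB")
os.remove(path)
print(seq_a, seq_b)
```

The iterative approach reads every record in order, while an index like this trades an initial pass over the file for fast random access afterwards.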

written 7 weeks ago by Matt Shirley9.5k
0
7 weeks ago by
trausch1.6k
Germany
trausch1.6k wrote:

readfq supports reading FASTA and FASTQ in various programming languages (including Python).

written 7 weeks ago by trausch1.6k