Question

How to convert a FASTA file into a pandas DataFrame with a parallel Python script?

3.3 years ago

sunyeping ▴ 110

I wish to use Python to read in a FASTA sequence file and convert it into a pandas DataFrame. I use the following script:

from Bio import SeqIO
import pandas as pd

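def fasta2df(infile):
    records = SeqIO.parse(infile, 'fasta')
    seqList = []
    for record in records:
        desp = record.description
        # str(record.seq) avoids relying on the private _data attribute
        seq = list(str(record.seq).upper())
        # one row per record: description followed by one residue per column
        seqList.append([desp] + seq)
    # build the DataFrame once, after all records have been read
    seq_df = pd.DataFrame(seqList)
    print(seq_df.shape)
    seq_df.columns = ['strainName'] + list(range(1, seq_df.shape[1]))
    return seq_df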


if __name__ == "__main__":
    path = 'path/to/the/fasta/file'
    # 'infile' avoids shadowing the built-in input(); '/' separates directory and file name
    infile = path + '/GISAIDspikeprot0119.selection.fasta'
    df = fasta2df(infile)
The 'GISAIDspikeprot0119.selection.fasta' file can be found at https://drive.google.com/file/d/1DYwhzUDH0LNgZXFuY2ud0CWkWLL9SBid/view?usp=sharing

The script runs on my Linux workstation using only one CPU core. Is it possible to run it on more cores (multiple processes) so that it finishes much faster? What would the code for that look like?

Many thanks!


I think parallel computation may depend not on the code but on the configuration of the workstation.
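
One quick way to see how many cores the workstation makes available to a Python process is sketched below. `available_cores` is a hypothetical helper name, not part of any library, and the affinity call is only present on some platforms such as Linux. This only reports what is available; a plain Python loop still runs in a single process.

```python
import os


def available_cores():
    """Return the number of CPU cores this Python process may use."""
    try:
        # Available on Linux: respects CPU affinity masks (e.g. set with taskset)
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity
        return os.cpu_count() or 1


print(available_cores())
```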

3.3 years ago by davidenoma ▴ 50
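
For the question as asked, a minimal sketch of a multi-process version follows. It makes several assumptions that are not in the original post: a small hand-written reader, `read_fasta`, stands in for `SeqIO.parse` so the sketch does not need Biopython, and `batched`, `rows_from_batch`, `fasta2df_parallel`, `workers` and `batch_size` are hypothetical names chosen for illustration. Records are sent to a `multiprocessing.Pool` in batches, and the DataFrame is built once at the end.

```python
import os
from itertools import islice
from multiprocessing import Pool

import pandas as pd


def read_fasta(path):
    """Yield (description, sequence) pairs from a FASTA file (simplified reader)."""
    desc, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if desc is not None:
                    yield desc, "".join(chunks)
                desc, chunks = line[1:], []
            else:
                chunks.append(line)
    if desc is not None:
        yield desc, "".join(chunks)


def batched(iterable, size):
    """Yield lists of up to `size` items, so each task carries many records."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def rows_from_batch(batch):
    """Turn each record into a row: description, then one residue per column."""
    return [[desc] + list(seq.upper()) for desc, seq in batch]


def fasta2df_parallel(path, workers=None, batch_size=1000):
    """Build the DataFrame, spreading row construction over `workers` processes."""
    workers = workers or os.cpu_count() or 1
    batches = batched(read_fasta(path), batch_size)
    if workers == 1:
        # Serial fallback: no worker processes are started
        parts = list(map(rows_from_batch, batches))
    else:
        with Pool(workers) as pool:
            # imap returns results in input order
            parts = list(pool.imap(rows_from_batch, batches))
    rows = [row for part in parts for row in part]
    if not rows:
        return pd.DataFrame(columns=["strainName"])
    df = pd.DataFrame(rows)
    df.columns = ["strainName"] + list(range(1, df.shape[1]))
    return df
```

It could be called as `df = fasta2df_parallel(infile, workers=4)` from inside an `if __name__ == "__main__":` guard. Splitting a sequence into characters is cheap, so the cost of passing rows between processes may cancel out the gain; it is worth timing this against a single-process version that builds the DataFrame once after the loop.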