Question

Extracting subsequence from FASTA file using python

6.1 years ago

shawnt1234 • 0

Hi, I would like to extract subsequences from a large FASTA file and write the extracted sequences to a new FASTA file, preferably using Python.

I have a csv file with the following format:

id, start, stop, header
id1, 3, 10, Contig0
id2, 12, 25, Contig1
id3, 19, 40, Contig2

The input FASTA file has the following format:

>Contig0
(Contig0 sequence)
>Contig1
(Contig1 sequence)
>Contig2
(Contig2 sequence)

I would like a FASTA output file with the following format:

>id1
(Contig0 sequence from bp 3-10)
>id2
(Contig1 sequence from bp 12-25)
>id3
(Contig2 sequence from bp 19-40)

If anyone has any suggestions or a script that can do this, any help would be greatly appreciated.

fasta sequence python • 3.0k views

ADD COMMENT • link updated 6.1 years ago by Bastien Hervé 5.3k • written 6.1 years ago by shawnt1234 • 0

score 2 · Answer 1 · 2018-03-21

6.1 years ago

Bastien Hervé 5.3k

It's possible with Biopython:

1) Create a dataframe from your csv file (use the header column, i.e. the contig name, as the index)

2) Iterate over your fasta file using SeqIO

3) For each record from the iteration, find the corresponding row in your dataframe (something like: df.loc[[record.id]])

4) Once you have the right row, replace the record's header with the id from that row

5) Slice the record's sequence (record.seq) using the start and stop from that row, and replace the sequence with the slice

6) Write the record to a new file

7) Go back to step 3 for the next record

I'll let you try this on your own; if you want some help, comment below :)
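For reference, the steps above can be sketched as follows. This is only a minimal illustration, not the answerer's script: to keep it dependency-free it swaps Biopython's SeqIO and the dataframe for a small hand-written FASTA reader and the standard csv module. The function names are made up for this example, and it assumes the start/stop columns are 1-based and inclusive, which the question does not state.

```python
import csv


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header line as the contig name
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def load_regions(handle):
    """Map each contig name to a list of (new_id, start, stop) tuples."""
    regions = {}
    # skipinitialspace handles the "id, start, stop, header" spacing
    for row in csv.DictReader(handle, skipinitialspace=True):
        contig = row["header"].strip()
        entry = (row["id"].strip(), int(row["start"]), int(row["stop"]))
        regions.setdefault(contig, []).append(entry)
    return regions


def extract(fasta_handle, csv_handle, out_handle):
    """Write one FASTA record per CSV row, named by the row's id."""
    regions = load_regions(csv_handle)
    for name, seq in read_fasta(fasta_handle):
        for new_id, start, stop in regions.get(name, []):
            # ASSUMPTION: coordinates are 1-based and inclusive,
            # so bp 3-10 corresponds to the Python slice [2:10]
            out_handle.write(f">{new_id}\n{seq[start - 1:stop]}\n")
```

To use it, open the CSV, the input FASTA, and an output file, then pass the three handles to `extract`. Because the FASTA file is streamed one record at a time and the CSV lookup is a dictionary, the run time should grow roughly linearly with the file size.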

Thanks for the help! I wrote a script, but it was not very efficient and ran very slowly, so I did some more research and found bedtools getfasta, which worked for me.
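For anyone following the same route: bedtools getfasta reads its regions from a BED file (passed with -bed, alongside -fi for the input FASTA and -fo for the output), so the CSV from the question has to be converted first. A minimal sketch of that conversion is below. The function name is made up for this example, and it assumes the CSV coordinates are 1-based and inclusive; BED is 0-based and half-open, so only the start is shifted.

```python
import csv


def csv_to_bed(csv_handle, bed_handle):
    """Write one BED line (contig, start, end, name) per CSV row."""
    for row in csv.DictReader(csv_handle, skipinitialspace=True):
        start = int(row["start"])
        stop = int(row["stop"])
        contig = row["header"].strip()
        name = row["id"].strip()
        # ASSUMPTION: CSV is 1-based inclusive; BED is 0-based half-open,
        # so subtract 1 from start and keep stop unchanged
        bed_handle.write(f"{contig}\t{start - 1}\t{stop}\t{name}\n")
```

The fourth BED column carries the id so that bedtools can use it as the output header; which name option produces a bare id (rather than id plus coordinates) appears to depend on the bedtools version, so check the getfasta help for the installed release.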

ADD REPLY • link 6.1 years ago by shawnt1234 • 0

score 1 · Answer 2 · 2018-03-21

6.1 years ago

GenoMax 141k

pyfaidx by Matt Shirley.