Question

Extracting substring from fasta file

0

6.9 years ago

Burgenix • 0

I have an Excel sheet from which I extract some values with openpyxl. I want to use two of those values, let's say start and end, as the boundaries for extracting a substring from a FASTA file.

For example, if start is 34 and end is 4000 (as read from two cells in Excel, FILE A), I want to write the corresponding string of characters (letters) from FILE B into another file (FILE C).

Any ideas?

python • 3.4k views

updated 6.9 years ago by st.ph.n ★ 2.7k • written 6.9 years ago by Burgenix • 0

2

And please, don't use Excel.

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 161k

0

Did you search this site for a similar question?

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 161k

0

Are you asking someone to do it for you in Python?

ADD REPLY • link 6.9 years ago by st.ph.n ★ 2.7k

score 2 · Answer 1 · 2017-06-06

You'll need samtools for this, not Excel.

For example:

samtools faidx file.fasta name_of_seq:34-4000 > another_file.fasta

name_of_seq is the name of your sequence in the fasta file.
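If the start and end values are already available in Python, the same command could be assembled from them. The following is only a sketch: the helper names `faidx_command` and `extract_region` are hypothetical, it assumes samtools is installed and on the PATH, and it relies on samtools treating the region as 1-based and inclusive.

```python
import subprocess


def faidx_command(fasta_path, seq_name, start, end):
    """Build the argument list for `samtools faidx` on one region.

    start and end are 1-based and inclusive, which is how samtools
    interprets the name:start-end region syntax.
    """
    region = "{}:{}-{}".format(seq_name, int(start), int(end))
    return ["samtools", "faidx", fasta_path, region]


def extract_region(fasta_path, seq_name, start, end, out_path):
    """Run samtools and redirect its FASTA output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            faidx_command(fasta_path, seq_name, start, end),
            stdout=out,
            check=True,  # raise if samtools exits with an error
        )


if __name__ == "__main__":
    # Show the command that would be run for the example above.
    print(" ".join(faidx_command("file.fasta", "name_of_seq", 34, 4000)))
```

Calling `extract_region("file.fasta", "name_of_seq", 34, 4000, "another_file.fasta")` would then be equivalent to the shell command shown above.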

First try to work out for yourself how to get your coordinates from file A. If it does not work, show us what you have tried and someone will help you further.

score 2 · Answer 2 · 2017-06-06

Suppose your Excel file looks like this (ideally, remove every column other than the IDs and start/end values and save it as tab-delimited text, my_coords.txt; as Pierre said, don't use Excel):

id1    34    4000
id2    45    3156
id3    33    3764

And your FASTA looks like this (if you have a multi-line FASTA, linearize it):

>id1
sequence
>id2
sequence
>id3
sequence
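Linearizing here means joining wrapped sequence lines so that every record is exactly one header line followed by one sequence line. One possible way to do that is sketched below; the function names `linearize_fasta` and `write_linearized` are made up for illustration and are not part of the original answer.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Any sequence text that appears before the first header is ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


def write_linearized(in_path, out_path):
    """Write a one-sequence-line-per-record copy of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in linearize_fasta(src):
            dst.write(header + "\n" + seq + "\n")
```

For example, `write_linearized('my_multiline.fasta', 'my_fasta.fasta')` (file names assumed) would produce input in the layout shown above.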

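#!/usr/bin/env python

# Read id -> (start, end) from the tab-delimited coordinates file.
with open('my_coords.txt', 'r') as f1:
    pos = {}
    for line in f1:
        fields = line.strip().split('\t')
        pos[fields[0]] = (int(fields[1]), int(fields[2]))

# Read id -> sequence from the linearized FASTA (one sequence line per header).
with open('my_fasta.fasta', 'r') as f2:
    seqs = {}
    for line in f2:
        if line.startswith('>'):
            seqs[line.strip().split('>')[1]] = next(f2).strip()

# Write each trimmed record. Python slices are 0-based and exclude the end position.
with open('my_fasta_trimmed.fasta', 'w') as out:
    for i in seqs:
        out.write('>' + i + '\n' + seqs[i][pos[i][0]:pos[i][1]] + '\n')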

Condensed, write directly to output:

#!/usr/bin/env python

with open('my_coords.txt', 'r') as f1:
    pos = {}
    for line in f1:
        fields = line.strip().split('\t')
        pos[fields[0]] = (int(fields[1]), int(fields[2]))

with open('my_fasta.fasta', 'r') as f2:
    with open('my_fasta_trimmed.fasta', 'w') as out:
        for line in f2:
            if line.startswith('>'):
                start, end = pos[line.strip().split('>')[1]]
                # next(f2) is the sequence line that follows this header.
                out.write(line.strip() + '\n' + next(f2).strip()[start:end] + '\n')
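
One detail to check before relying on either script: the start and end values go straight into a Python slice, which counts from 0 and excludes the end position, whereas a `samtools faidx` region is 1-based and inclusive. If the spreadsheet coordinates follow the samtools convention, a small conversion helper along these lines may be needed (the name `slice_1based_inclusive` is hypothetical):

```python
def slice_1based_inclusive(seq, start, end):
    """Return the characters of seq from position start to end,
    where both positions are 1-based and inclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return seq[start - 1:end]


seq = "ABCDEFGHIJ"
print(seq[3:6])                           # plain slice: DEF
print(slice_1based_inclusive(seq, 3, 6))  # 1-based inclusive: CDEF
```

With the plain slice, the result is shifted by one position and one character shorter than the 1-based inclusive reading of the same numbers.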