Question: Extract pairs of sequences from a fasta file based on a CSV list of headers
0
gravatar for ajsakla
12 days ago by
ajsakla0
ajsakla0 wrote:

I have a fasta file of primer sequences and I have a csv file with the sequence ids of primer sequence pairs that I need to parse out into their own fasta files. What I need to do is extract each pair of sequences from the fasta and put them into their own fasta files.

So my files look something like this:

My fasta file of primer sequences:

>01
ACGTACGT
>02
ACGTACGT
>03
ACGTACGT
>04
ACGTACGT

My CSV list of primer pairs:

column1,column2
01,03
02,04

Based on this example data set, what I need to end up with is 2 fasta files of the pairs of fasta sequences i.e., File1.fa will contain sequences 01 and 03 and File2.fa will contain sequences 02 and 04.

Thanks in advance!

extract fasta • 151 views
ADD COMMENTlink modified 12 days ago by st.ph.n1.6k • written 12 days ago by ajsakla0
1
gravatar for st.ph.n
12 days ago by
st.ph.n1.6k
Philadelphia, PA
st.ph.n1.6k wrote:
#!/usr/bin/env python
with open('pairs.txt', 'r') as f:
    # Get pairs, ',' separated
    pairs = []
    for line in f:
        pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

with open('file.fasta', 'r') as fa:
    #open fasta to dictionary 
    primer = {}
    for line in fa:
        if line.startswith(">"):
            primer[line.strip().split('>')[1]] = next(fa).strip()

#For each pair, open new file, and print pair in fasta format
for p in pairs:
    with open('file' + str(pairs.index(p)) + '.fasta', 'w') as out:
        out.write('>' + p[0] + '\n' + primer[p[0]])
        out.write('>' + p[1] + '\n' + primer[p[1]])
ADD COMMENTlink modified 12 days ago • written 12 days ago by st.ph.n1.6k

Thanks. I'm having some trouble getting this to run.

  1. I've copied and pasted this script into a text file as extract.sh
  2. I've chmod u+x extract.sh
  3. I've replaced the 'pairs.txt' and 'file.fasta' with the appropriate file names
  4. I've run ./extract.sh within the folder containing the two files.

But I get this error:

Traceback (most recent call last): File "./parse.sh", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLYlink written 12 days ago by ajsakla0

This is python code, not bash. mv extract.sh extract.py, run as python extract.py

ADD REPLYlink modified 12 days ago • written 12 days ago by st.ph.n1.6k

I still get the same error:

Traceback (most recent call last): File "parse.py", line 11, in <module> for line in f: ValueError: I/O operation on closed file

ADD REPLYlink written 12 days ago by ajsakla0

I edited the error.

ADD REPLYlink written 12 days ago by st.ph.n1.6k

So now I get this error:

Traceback (most recent call last): File "extract.py", line 18, in <module> with open('file' + str(pairs.index(p)) + '.fasta') as out: IOError: [Errno 2] No such file or directory: 'file0.fasta'

It looks like it's looking for the file or directory file0.fasta so when I make an empty file0.fasta I get this error:

Traceback (most recent call last): File "extract.py", line 19, in <module> out.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

ADD REPLYlink written 12 days ago by ajsakla0

Should be fixed now.

ADD REPLYlink written 12 days ago by st.ph.n1.6k

I'm getting a similar error as before. Key error is the id of the first sequence in the first pair:

Traceback (most recent call last): File "extract.py", line 18, in <module> file.write('>' + p[0] + '\n' + primer[p[0]]) KeyError: 'ccRepeat-46449'

If it helps, here's a sampling of my primer pairs list.

ccRepeat-46449,ge00012
ccRepeat-52142,ge00379
ccRepeat-48210,ge00387
cc00375,ge00431
ccRepeat-46825,ge01057
ccRepeat-49062,ge01469
cc15291,ge01941
ccRepeat-46651,ge03344
cc22391,ge03673
ccRepeat-81763,ge03838

It's just occurred to me that python may not be reading this file as csv formatted. Could that be an issue?

ADD REPLYlink written 12 days ago by ajsakla0

this line is separating the ids by ',' into a tuple:

pairs.append((line.strip().split(',')[0], line.strip().split(',')[1]))

The error you're getting is that the keys in the dictionary (headers from your fasta), are not matching the primers. Double check that the headers match the primers, and if there is anything else in the header, this needs parsed out. The script was based on your OP example, so there may be further things to consider where the script needs to be modified. If you're unable to troubleshoot on your own, it's best to find another method, ideally by your own testing and come back with errors. Biostars is not a coding service.

ADD REPLYlink written 12 days ago by st.ph.n1.6k

I realize now I should provide exact examples of the data I'm working with. I was able to get your script to work with my data which saved me many hours of parsing thousands of pairs manually. Thank you for your time.

ADD REPLYlink written 11 days ago by ajsakla0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1194 users visited in the last hour