Question: Extract a subset of fasta file based on header in python
0
gravatar for xupbuy
9 months ago by
xupbuy10
xupbuy10 wrote:

I have a fasta file like this:

>Contig1
aattata
>Contig2
ttgtgtgt
>Contig3
cgcgtgtgt
>Contig4
tgtgttgtttttt
>Contig5
cccgcgctg
>Contig6
tgtgtgtgtgtgt
>Contig7
aacgctg
>Contig8
acgtgtgc

I have a 2nd txt file:

1
5
7

I want extract a subset from the fasta file if the header contains EXACTLY the number in the txt file, like this:

    >Contig1
    aattata
    >Contig5
    cccgcgctg
    >Contig7
    aacgctg

I don't want >Contig10 even it contains the number 1.

How to do this in python? I tried scripts other people provided but they both give me empty output file. I don't know what I did wrong: script1:

"""                                                                                 
%prog some.fasta wanted-list.txt                                                    
"""                                                                                 
from Bio import SeqIO                                                               
import sys                                                                          

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')      
sys.stdout = open('output.fasta', 'w')                               
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
sys.stdout.close()

Script2:

from Bio import SeqIO
import sys                                                                          

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")
python fasta • 578 views
ADD COMMENTlink modified 9 months ago by Siya Diya0 • written 9 months ago by xupbuy10
1

Why are you reinventing the wheel? There already exist a bunch of tools that do this!

ADD REPLYlink written 9 months ago by RamRS22k
1

You can try this:

  • rename 1,2,3 to Contig1, contig2, contig3 (using awk)
  • samtools index file.fasta,
  • xargs samtools faidx fasta.file < contigs.txt >> output.fasta
ADD REPLYlink modified 9 months ago • written 9 months ago by Alice280
0
gravatar for WouterDeCoster
9 months ago by
Belgium
WouterDeCoster40k wrote:

Print your seq.id, you'll see it will never be in wanted since it looks like... Contig1

ADD COMMENTlink written 9 months ago by WouterDeCoster40k

Indeed! Thank you finally! Worked!

ADD REPLYlink written 9 months ago by xupbuy10

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLYlink written 9 months ago by genomax69k
0
gravatar for Siya Diya
9 months ago by
Siya Diya0
Thrissur
Siya Diya0 wrote:

Try these updated scripts

Script 1:

"""
%prog some.fasta wanted-list.txt
"""
from Bio import SeqIO
import sys, re

wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
SeqIO.write((seq for seq in seqiter if any(item == re.findall('\d+',seq.id)[-1] for item in wanted)), sys.stdout, "fasta")
sys.stdout.close()

Script 2:

from Bio import SeqIO
import sys, re

fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
            print(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if any(item == re.findall('\d+',seq.id)[-1] for item in wanted):
            SeqIO.write([seq], f, "fasta")
ADD COMMENTlink written 9 months ago by Siya Diya0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1456 users visited in the last hour