Question

Filtering a fasta file based on blast results file

0

Entering edit mode

18 months ago

Bruna • 0

hi! I'm fairly new at all this and have been trying to filter a fasta file based on blast results file and I'm having a lot of trouble doing so, as I don't really understand how to code this. i have file 1 in this format:

>SampleName11.16891;size=8308
GATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGC
GCCCCTTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCTCTGCTTGGAATTGGGCACCGTCCT
CACTGCGGACGCGCCTCAAAGACCTCGGCGGTGGCTGTTCAGCCCTCAAGCGTAGTAGAATACACCTCGCTTTGGAGCGG
TTGGCGTCGCCCGCCGGACGAACCTTCTGAACTTTTCTCAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTA
AGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGAGTGAAGCGGCAACAG

>SampleName10.739;size=7844
GATGAAGAACGTAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGC
GCCCCTTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCTCTGCTTGGAATTGGGCACCGTCCT
CACTGCGGACGCGCCTCAAAGACCTCGGCGGTGGCTGTTCAGCCCTCAAGCGTAGTAGAATACACCTCGCTTTGGAGCGG
TTGGCGTCGCCCGCCGGACGAACCTTCTGAACTTTTCTCAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTA
AGCATATCAATAAGCGGAGGAAAAGAAACCAACAGGGATTGCCTTAGTAACGGCGAGTGAAGCGGCAACAG
...

And then file 2 in this:

SampleName11.4501;size=7446 Fungi_sp|MT237019|SH3568601.08FU|reps_singleton|k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__Fungi_sp    80.7
SampleName11.1591;size=7622 Lasiodiplodia_citricola|GU945354|SH2131013.08FU|refs|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Lasiodiplodia;s__Lasiodiplodia_citricola   100.0
SampleName11.1591;size=7622 Fungi_sp|MT237019|SH3568601.08FU|reps_singleton|k__Fungi;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__Fungi_sp    81.0
SampleName10.611;size=6979  Lasiodiplodia_citricola|GU945354|SH2131013.08FU|refs|

And I would like to filter file 1 to only have the samples (sequence and name) that appear on file 2, do any of you could give me some lights on how to do that? Thank you

fasta biopython • 1.0k views

ADD COMMENT • link updated 18 months ago by acvill ▴ 340 • written 18 months ago by Bruna • 0

1

Entering edit mode

18 months ago

iraun 6.2k

Hi! filterbyname.sh script from BBMap toolkit could do the trick.

filterbyname.sh -Xmx10g in=input.fa out=output.fa names=ids.txt include=t substring=t

Before running the command you should probably clean a bit your file 2, and generate a file (in the previous command called ids.txt) containing just the sequence ids:

cut -d';' -f1 file2.txt > ids.txt

ADD COMMENT • link 18 months ago by iraun 6.2k

0

Entering edit mode

thank you! I'll start by cleaning file two so I only have the ids

ADD REPLY • link 18 months ago by Bruna • 0

0

Entering edit mode

18 months ago

acvill ▴ 340

Here's a solution using awk and grep:

while read hit; do 
  awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print ">"$1; $1=""; print}' test.fasta | \
  grep -A1 ">${hit}$"
done < <(awk -F'\t' '{print $1}' blast.txt | sort | uniq)

Assuming blast.txt is tab-delimited, this first extracts the unique sequence names from the first field of the blast results. Then, the hits are processed in a while read loop, extracting deflines and sequences from the unwrapped fasta file test.fasta. This will be inefficient for larger fasta files, since the fasta file is unwrapped again with every iteration of the loop, though I like this solution because you don't have to preprocess the inputs.

ADD COMMENT • link 18 months ago by acvill ▴ 340

score 4 · Accepted Answer · 2022-10-19

4

Entering edit mode

18 months ago

Andrzej Zielezinski 11k

You can first open file 2 and read sequence ids into a set. Then, open file 1 (with FASTA sequences) and loop over the records to check if the sequence id is present in the set. If it is present, then write the record to the output file.

from Bio import SeqIO

IN_FILE1 = 'file1.fasta'
IN_FILE2 = 'file2.txt'
OUT_FILE = 'out.fasta'

seq_ids = set()
with open(IN_FILE2) as fh:
    for line in fh:
        seq_id = line.split()[0]
        seq_ids.add(seq_id)

with open(OUT_FILE, 'w') as oh:
    for seq_record in SeqIO.parse(IN_FILE1, 'fasta')
        if seq_record.id in seq_ids:
            oh.write(seq_record.format('fasta'))

ADD COMMENT • link 18 months ago by Andrzej Zielezinski 11k

0

Entering edit mode

thank you very much, I'm gonna try that!

ADD REPLY • link 18 months ago by Bruna • 0

0

Entering edit mode

Update: That worked very well. Thank you so much :)

ADD REPLY • link 18 months ago by Bruna • 0