Question: Extract fasta sequences from a large file using a list of names
3
gravatar for fhsantanna
5.5 years ago by
fhsantanna490
Brazil
fhsantanna490 wrote:

I have a large fasta file of 16S sequences and I want to retrieve sequences using a list of organism names.

Do you know a script capable of doing it?

EDIT:

Headers look like that:

>S000000859 Bacillus sp. USC14; AF346495

sequence

>S000001027 Paenibacillus borealis; KN25; AJ011325

sequence

And I have a list like the following:

Paenibacilus borealis

Paenibacillus sp. 1-18

Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11

 

I want to retrieve those sequences that match with names present in the list.

 

 

extract 16s fasta • 11k views
ADD COMMENTlink modified 5.5 years ago by h.mon31k • written 5.5 years ago by fhsantanna490

Can you show us some example headers from your FASTA file? Right now I'm thinking to put all the organism names in a text file and simply use grep -f

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Dan D7.1k

I believe I was not clear. My 16S fasta file has sequences of hundreds of species. I have to retrieve just dozens of them using a list of organisms of interest.

ADD REPLYlink written 5.5 years ago by fhsantanna490
1

If the fasta sequence is in one line it's": 

grep -A1 'organism name'

If it's not it would be easier to format it into one line and then do grep.

ADD REPLYlink written 5.5 years ago by marina.v.yurieva520

A good way to start. Very simple. Thanks. The problem is when the sequences have different sizes.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by fhsantanna490
5
gravatar for tlawrence3
5.5 years ago by
tlawrence370
United States
tlawrence370 wrote:

You can do this easily by installing the FAST: Fast Analysis of Sequences Toolbox (publication)(github).

You can install it by using this command (only use sudo if you need to):

(sudo) perl -MCPAN -e 'install FAST'

Once it is installed here is a small bash script to do what you need:

#!/bin/bash

while read line ;do
    cat original_fasta.fa | fasgrep -di "$line"  >> reduced_fasta_file.fa
done < species.txt

 

 

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by tlawrence370

Asking just for clarification. In above bash script which one is the file containing the fasta sequence and which one is the file containing list of organism names?

ADD REPLYlink written 5.5 years ago by venu6.7k

Updated the example to make it more clear.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by tlawrence370

Very good. I was able to extract the sequences running the program one by one.

But when I try the bash script, the following error message appears: "fasgrep: expects one regex argument when input is on STDIN."

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by fhsantanna490
try ./faSomeRecords very simple one as shown
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by venu6.7k

tlawrence, do you know why I had trouble with your bash script (error message appears: "fasgrep: expects one regex argument when input is on STDIN.")?

It seems that fasgrep is not recognizing the $line variable.

ADD REPLYlink written 5.5 years ago by fhsantanna490

It is something in the bash script. I don't know bash scripting well enough to figure it out.

This works on the command line:

fasgrep -di "Paenibacilus borealis" fasta_file.fa

However, it doesn't work in the script. Maybe someone here can help.

ADD REPLYlink written 5.5 years ago by tlawrence370

I updated the script. It should work now.
 

ADD REPLYlink written 5.5 years ago by tlawrence370

It did not work for me.

ADD REPLYlink written 5.5 years ago by fhsantanna490

What didn't work? Did you get an empty fasta file? The wrong sequences? An error? Does doing the command manually give the expected results?

ADD REPLYlink written 5.5 years ago by tlawrence370

As stated before, fasgrep works fine manually (I accept your answer, as you can see).

However, the bash script did not work (I had to write one without using a loop, but writing a line for every organism of interest). Your bash script did not filtered the organisms of interest, all of the original sequences were included in the output and more...

ADD REPLYlink written 5.5 years ago by fhsantanna490

I guess my bash scripting skills need some improvement. Since I am one of the authors of the toolkit I want to make sure it is working for as many people as possible.

ADD REPLYlink written 5.5 years ago by tlawrence370

Nevermind. Great toolkit. Thanks again.

Just one thing, maybe I found a bug.

Considering a fasta file like the following (note the absence of a underscore between "S000001027" and "Paenibacillus_borealis"):

    >S000001027 Paenibacillus_borealis;_KN25;_AJ011325
    AAAAAAAAAATTTTTTT

When I type:

    fasgrep -di "Paenibacillus_borealis" input.fas

Fasgrep is able to find the sequence of interest.

However, if the fasta sequence has a underscore between "S000001027" and "Paenibacillus_borealis" (>S000001027_Paenibacillus_borealis;_KN25;_AJ011325), fasgrep cannot find the sequence.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by fhsantanna490
1

It is not a bug. fastA format is loosely defined as:

>identifier<space>description<newline>
<sequence data>

So when you don't have a space in the header line all you have is an identifier which fasgrep searches by default. So instead of:

fasgrep -di

which searches the description field you would use:

fasgrep -i
ADD REPLYlink written 5.4 years ago by tlawrence370

+1 for properly handling the fasta defline info field. 

ADD REPLYlink written 5.4 years ago by Matt Shirley9.4k
4
gravatar for Juke34
5.5 years ago by
Juke344.8k
Sweden
Juke344.8k wrote:

BASH

IFS=$'\n'; for i in $(cat headersFILE);do  line=$(grep -nr "$i" multiFastaFILE); if [[ ! -z $line ]];then for j in $line;do lineNumber=$(echo $j | cut -d':' -f1); sed -n "$lineNumber p" multiFastaFILE; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' multiFastaFILE; done;fi;done

Where headersFILE is the file containing one header per line ; multiFastaFILE is obviously your multifasta file.

Every time a fasta sequence header of your multiFastaFILE contains a name described in your headersFILE, the header and its sequence will be displayed.

ADD COMMENTlink modified 4.8 years ago • written 5.5 years ago by Juke344.8k

It worked very well! Thanks!

ADD REPLYlink written 5.5 years ago by fhsantanna490
4
gravatar for h.mon
5.5 years ago by
h.mon31k
Brazil
h.mon31k wrote:

The following script should work:   

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers, $_.";");
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);

Call it with:

./getFastas.pl list.txt sequences.fasta sequences.out

 

ADD COMMENTlink modified 5.4 years ago • written 5.5 years ago by h.mon31k

It did not work. Blank output.

ADD REPLYlink written 5.5 years ago by fhsantanna490

I tested with the following mock files:

list.txt

Paenibacillus borealis
Paenibacillus sp. 1-18
Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11

sequences.fasta:

>S000000859 Bacillus sp. USC14; AF346495
ATGCATGCATGCATGCATGCATGC
>S000001027 Paenibacillus borealis; KN25; AJ011325
GCGCGCATATATGCGCGCATATAT
>S000001026 Paenibacillus borealisensis; KN26; AJ011326
GCGCGCATATATGCGCGCATATAT
>Simaginary Streptomyces borealis; KN25; AJ011325
AAAAAAAAAAAAAAAAAAAAAAAA
>S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas
GGGGGGGGGGGGGGGGGGGGGGGG
>S000001027_Paenibacillus borealis; KN25; AJ011325
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

worked all right for me, retrieving three sequences ( "S000001027 Paenibacillus borealis; KN25; AJ011325", "S000001027_Paenibacillus borealis; KN25; AJ011325" and "S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas" ).

ADD REPLYlink written 5.4 years ago by h.mon31k

For the mock sequences.fasta, it works fine. However, it does not work if the sequences.fasta has the following sequence:

>AB042938.1.1462 Paenibacillus borealis

GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGGACUUUGCCUUCGGGUAAAGUUAGCGGCGGACGGGUGAG

UAACACGUGGGUAACCUGCCCAUAAGACUGGGAUAACAUUCGGAAACGAAUGCUAAUACCGGAUACGCGAAUCGGUCGCA

UGAUCGAAUCGGGAAAGGCGGAGCAAUCUGCCACUUAUGGAUGGACCCGCGGCGCAUUAGCUAGUUGGUGGGGUAACGGC

UCACCAAGGCGACGAUGCGUAGCCGACCUGAGAGGGUGAUCGGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGG

GAGGCAGCAGUAGGGAAUCUUCCGCAAUGGACGAAAGUCUGACGGAGCAACGCCGCGUGAGUGAUGAAGGUUUUCGGAUC

GUAAAGCUCUGUUGCCAGGGAAGAACGCUUGCGAGAGUAACUGCUCGCAAGGUGACGGUACCUGAGAAGAAAGCCCCGGC

UAACUACGUGCCAGCAGCCGCGGUAAUACGUAGGGGGCAAGCGUUGUCCCGGAAUUAUUGGGCGUAAAGCGCGCGCAGGC

GGCCUUGUAAGUCUGUCGUUUAAACUCGGAGCUCAACUUCGAGUCGCGAUGGAAACUGCAAAGCUUGAGUGCAGAAGAGG

AAAGUGGAAUUCCACGUGUAGCGGUGAAAUGCGUAGAGAUGUGGAGGAACACCAGUGGCGAAGGCGACUUUCUGGGCUGU

AACUGACGCUGAGGCGCGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCCGUAAACGAUGAAUGCU

AGGUGUUAGGGGUUUCGAUACCCUUGGUGCCGAAGUUAACACAUUAAGCAUUCCGCCUGGGGAGUACGGUCGCAAGACUG

AAACUCAAAGGAAUUGACGGGGACCCGCACAAGCAGUGGAGUAUGUGGUUUAAUUCGAAGCAACGCGAAGAACCUUACCA

GGUCUUGACAUCCCUCUGACCGUCCUAGAGAUAGGGCUUUCCUUCGGGACAGAGGAGACAGGUGGUGCAUGGUUGUCGUC

AGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGCAACGAGCGCAACCCUUGAUCUUAGUUGCCAGCACUUUGGGUGGGCA

CUCUAGGAUGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAAAUCAUCAUGCCCCUUAUGACCUGGGCUAC

ACACGUACUACAAUGGCCGAUACAACGGGAAGCGAAACCGCGAGGUGGAGCCAAUCCUAUCAAAGUCGGUCUCAGUUCGG

AUUGCAGGCUGCAACUCGCCUGCAUGAAGUCGGAAUUGCUAGUAAUCGCGGAUCAGCAUGCCGCGGUGAAUACGUUCCCG

GGUCUUGUACACACCGCCCGUCACACCACGAGAGUUUACAACACCCGAAGCCGGUGGGGUAACCGCAAGGAGCCAGCCGU

CGAAGGUGGGGUAGAUGAUUGG

 

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by fhsantanna490
1

No it doesn't, because it lacks the semicolon ( ; ) at the end of the species / strain name in the fasta header. I've used the semicolon because all the fasta headers you provided had a semicolon, and it was a good way to, e.g., select "Paenibacillus borealis" (which is included in your list) and avoid "Paenibacillus borealisensis" (which I made up and is not included in your list). That is, it was just a way of avoiding longer names which could contain shorter ones, but were not in the list.

Anyways, if you change:

push(@headers, $_.";");

to:

push(@headers, $_);

it should work again.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by h.mon31k
3
gravatar for shelly.deforte
5.5 years ago by
United States
shelly.deforte190 wrote:

You could use Biopython's fasta parser. I didn't test this, but this would be the basic idea: 

from Bio import SeqIO
handle = open("example.fasta", "rU")
organism_name = "Paenibacilus borealis"
for record in SeqIO.parse(handle, "fasta"):
    if organism_name in record.id:
        # Do whatever you want with these
        print record.id
        print record.seq
handle.close()

There's more info here: http://biopython.org/wiki/SeqIO

ADD COMMENTlink written 5.5 years ago by shelly.deforte190

I have a similar problem, I have many species name in one fasta file and I want them to copy those files which contain the desired species sequence.

The file is like this

>CanaanFir115301
PQPIVPPAASTSSEATTSTDVPQPIVHPAASTSSEATTSPVAFPARDNVAGTSTTSTDVPQPIVPPAASTSSEATTSPVA
FPARDNVAGTSSSSSVSKASPLIHDIVAAARAVVTDADLDDLIGNIFSLLEHIKAKKTETKSS
>FraserFir31480
MGKTTLADAVFARVYVEGCKYSMVRLFDDVTSTPSTSHIVKLQKYILEDLKMGTPEETTHE
>CanaanFir58880
SSTDVLPESNEIAPVDHFSVTSTAAPNATDHGPQPIVPPDASSSSTDVLPESNEIAPVDHFSVTSTAAPNATDDALKHPQ
HGPQPILPPDASSSSTDVLPDSNEPQPIVHPDLSSSSPLAFPAHDEVYQVSEAASSSNAATDNHKVADDQLQGTPSSSTV
SNAPHLINNILTAANALDT
>BalsamFir4479
MEKKSCRCRLFSCFQCCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSTDVPQPIVHPAASASSKFSCFQ
CCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSADVPQPIVHPAASTSSEATT

I want to copy the files which contain the sequence (Balsam, Fraser and Canaan) species. The rest will be discarded.

How can i do that?

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by adeel.maliks2010
2
gravatar for venu
5.5 years ago by
venu6.7k
Germany
venu6.7k wrote:

As your list is containing 'Paenibaci' in common, I have followed this approach

grep '^>' input.fasta > headers.txt

sed -n '/Paenibaci/p' headers.txt > filter_headers.txt

$ ./faSomeRecords input.faa filter_headers.txt output.faa

A: perl code to extract sequences from multi-line fasta works on all test files but

ADD COMMENTlink modified 5.5 years ago • written 5.5 years ago by venu6.7k

Not worked. Output file is empty.

ADD REPLYlink written 5.5 years ago by fhsantanna490

Can you provide some more organism names from the organisms file? You have only provided one name, post 5-10 line from organism file

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by venu6.7k

Edited.

 

ADD REPLYlink written 5.5 years ago by fhsantanna490

try the edited one..it might work

ADD REPLYlink written 5.5 years ago by venu6.7k

I think you have not understood. I already have a "header" file (organism name), which was not derived from a fasta file (so, grep and sed commands are unnecessary). I want to collect those sequences that contain specif organism names ("headers") from a FASTA file.

ADD REPLYlink written 5.5 years ago by fhsantanna490
2
gravatar for Matt Shirley
5.5 years ago by
Matt Shirley9.4k
Cambridge, MA
Matt Shirley9.4k wrote:
pip install --user pyfaidx 
xargs faidx 16s.fasta < organisms.txt
ADD COMMENTlink written 5.5 years ago by Matt Shirley9.4k

I wasn't familiar with that package and it looks really nice. Thanks!

ADD REPLYlink written 5.5 years ago by shelly.deforte190

Firstly, the pip install did not work. I had to install using setup.py.

When I try to run the script, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 9, in <module>
    load_entry_point('pyfaidx==0.4.0', 'console_scripts', 'faidx')()
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 149, in main
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 40, in write_sequence
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 50, in fetch_sequence
KeyError: 'Paenibacillus_alginolyticus_DSM_5050'

My command was:  xargs faidx rdp_download_2882seqs.fasta < organisms.txt (the list of organism names)

ADD REPLYlink written 5.5 years ago by fhsantanna490

Regarding the pip install issue, you probably don't have pip installed, but it seems that you figured that out. The KeyError is telling you that the organism name "Paenibacillus_alginolyticus_DSM_5050" does not exist as a FASTA header line in your file. I should probably catch this and warn the user instead of having the script bail. 

ADD REPLYlink written 5.5 years ago by Matt Shirley9.4k

Yes, I had pip installed, and it installed pyfaidx. However faidx was not executable it in terminal, only in python.

In my list of species, each line has a organism name. "Paenibacillus_alginolyticus_DSM_5050" was the first one. So, if the first name is not located in my fasta file, faidx does not try to locate other names?

ADD REPLYlink written 5.5 years ago by fhsantanna490

That is correct, but I just submitted a change that fixes that behavior. The current github master branch will just warn the user and move to the next query without exiting with an error code. 

ADD REPLYlink written 5.5 years ago by Matt Shirley9.4k

I've just pushed a new version of pyfaidx (v0.4.1) that will produce a warning for each sequence that is not found in your file, but will continue without exiting. You can redirect stderr to a log file to figure out which organisms are missing:

xargs faidx rdp_download_2882seqs.fasta < organisms.txt 2> missing.log
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Matt Shirley9.4k

Not worked for me. Here is the log:

...

warning: Paenibacillus not found in file
warning: sp. not found in file
warning: 1-49 not found in file

...

 

Of course the strain Paenibacillus sp. 1-49 is in rdp_download_2882seqs.fasta and organisms.txt files.

ADD REPLYlink written 5.5 years ago by fhsantanna490

Ah, yes. There can be no spaces in the sequence names. "xargs" passes each line in organisms.txt as the arguments for "faidx". If there are spaces, they are interpreted as separate sequence names. Generally it is advisable to not include spaces in the FASTA sequence names, as most programs will truncate anything after the first space.

ADD REPLYlink written 5.5 years ago by Matt Shirley9.4k

Still not working. Here is the log:

warning: Paenibacillus_sp._1-49 not found in file

In my rdp_download_2882seqs.fasta I have the following sequence: 

>ASRY01000111.1.1112_Paenibacillus_sp._1-49
GAGUAACACGUAGGCAACCUGCCCAUGAGACUGGGAUAACUACCGGAAACGGUAGCUAAUACCGGAUACAUCCUUUCCCU
GCAUGGGGAGGGGAGGAAAGACGGAGCAAUCUGUCACUGAUGGAUGGGCCUGCGGCGCAUUAGCUAGUUGGUGGGGUAAU

ADD REPLYlink written 5.5 years ago by fhsantanna490

The issue here is that you are expecting to find an inexact match. You can do this using:

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log

The --regex option specifies that the sequence names are regular expressions patterns to be matched.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Matt Shirley9.4k

Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 9, in <module>
    load_entry_point('pyfaidx==0.4.1', 'console_scripts', 'faidx')()
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 153, in main
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 16, in write_sequence
  File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 577, in __init__
  File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 214, in __init__
IOError: [Errno 2] No such file or directory: 'Paenibacillus_alginolyticus_DSM_5050'

ADD REPLYlink written 5.5 years ago by fhsantanna490

See the updated command above. Note that order of arguments matters. 

ADD REPLYlink written 5.5 years ago by Matt Shirley9.4k

faidx: error: unrecognized arguments: Paenibacillus_sp._1-49

I also have tried with simpler data, but no output has been generated (I also have utilized the -o option).

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by fhsantanna490

For future online interactions, it is usually preferred to tell people what you did (what commands you tried), the error, and show some of the data you're working with. At this point I see no reason why you should not have a solution that works.

 

ADD REPLYlink written 5.5 years ago by Matt Shirley9.4k

Firstly, thanks. I had two solutions, as you can see. I just tried to give you a feedback, since you expend some time helping me.

I have tested your script with the data I provided above (the only difference is the presence of underscores instead of spaces). The command is exactly what you send me (the file names are the same): 

   

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log > filtered.out

Here is the error message:

    faidx: error: unrecognized arguments: Paenibacillus_sp._1-18 Paenibacillus_sp._1-49 Paenibacillus_sp._A9 Paenibacillus_sp._Aloe-11

Certainly your script recognized "Paenibacilus_borealis), but the output was empty.

I tested the same data with fasgrep, and I had a positive result.

 

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by fhsantanna490
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1423 users visited in the last hour