Extract fasta sequences from a large file using a list of names
6
3
Entering edit mode
7.2 years ago
fhsantanna ▴ 600

I have a large fasta file of 16S sequences and I want to retrieve sequences using a list of organism names.

Do you know a script capable of doing it?

EDIT:

>S000000859 Bacillus sp. USC14; AF346495

sequence

>S000001027 Paenibacillus borealis; KN25; AJ011325

sequence

And I have a list like the following:

Paenibacilus borealis

Paenibacillus sp. 1-18

Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11

I want to retrieve those sequences that match with names present in the list.

16s fasta extract • 14k views
0
Entering edit mode

Can you show us some example headers from your FASTA file? Right now I'm thinking to put all the organism names in a text file and simply use grep -f

0
Entering edit mode

I believe I was not clear. My 16S fasta file has sequences of hundreds of species. I have to retrieve just dozens of them using a list of organisms of interest.

1
Entering edit mode

If the fasta sequence is in one line it's":

grep -A1 'organism name'

If it's not it would be easier to format it into one line and then do grep.

0
Entering edit mode

A good way to start. Very simple. Thanks. The problem is when the sequences have different sizes.

5
Entering edit mode
7.2 years ago
tlawrence3 ▴ 70

You can do this easily by installing the FAST: Fast Analysis of Sequences Toolbox (publication)(github).

You can install it by using this command (only use sudo if you need to):

(sudo) perl -MCPAN -e 'install FAST'

Once it is installed here is a small bash script to do what you need:

#!/bin/bash

cat original_fasta.fa | fasgrep -di "$line" >> reduced_fasta_file.fa done < species.txt ADD COMMENT 0 Entering edit mode Asking just for clarification. In above bash script which one is the file containing the fasta sequence and which one is the file containing list of organism names? ADD REPLY 0 Entering edit mode Updated the example to make it more clear. ADD REPLY 0 Entering edit mode Very good. I was able to extract the sequences running the program one by one. But when I try the bash script, the following error message appears: "fasgrep: expects one regex argument when input is on STDIN." ADD REPLY 0 Entering edit mode try ./faSomeRecords very simple one as shown ADD REPLY 0 Entering edit mode tlawrence, do you know why I had trouble with your bash script (error message appears: "fasgrep: expects one regex argument when input is on STDIN.")? It seems that fasgrep is not recognizing the$line variable.

0
Entering edit mode

It is something in the bash script. I don't know bash scripting well enough to figure it out.

This works on the command line:

fasgrep -di "Paenibacilus borealis" fasta_file.fa

However, it doesn't work in the script. Maybe someone here can help.

0
Entering edit mode

I updated the script. It should work now.

0
Entering edit mode

It did not work for me.

0
Entering edit mode

What didn't work? Did you get an empty fasta file? The wrong sequences? An error? Does doing the command manually give the expected results?

0
Entering edit mode

As stated before, fasgrep works fine manually (I accept your answer, as you can see).

However, the bash script did not work (I had to write one without using a loop, but writing a line for every organism of interest). Your bash script did not filtered the organisms of interest, all of the original sequences were included in the output and more...

0
Entering edit mode

I guess my bash scripting skills need some improvement. Since I am one of the authors of the toolkit I want to make sure it is working for as many people as possible.

0
Entering edit mode

Nevermind. Great toolkit. Thanks again.

Just one thing, maybe I found a bug.

Considering a fasta file like the following (note the absence of a underscore between "S000001027" and "Paenibacillus_borealis"):

>S000001027 Paenibacillus_borealis;_KN25;_AJ011325
AAAAAAAAAATTTTTTT

When I type:

fasgrep -di "Paenibacillus_borealis" input.fas

Fasgrep is able to find the sequence of interest.

However, if the fasta sequence has a underscore between "S000001027" and "Paenibacillus_borealis" (>S000001027_Paenibacillus_borealis;_KN25;_AJ011325), fasgrep cannot find the sequence.

1
Entering edit mode

It is not a bug. fastA format is loosely defined as:

>identifier<space>description<newline>
<sequence data>


So when you don't have a space in the header line all you have is an identifier which fasgrep searches by default. So instead of:

fasgrep -di

which searches the description field you would use:

fasgrep -i
0
Entering edit mode

+1 for properly handling the fasta defline info field.

4
Entering edit mode
7.2 years ago
Juke34 ★ 7.1k

BASH

IFS=$'\n'; for i in$(cat headersFILE);do  line=$(grep -nr "$i" multiFastaFILE); if [[ ! -z $line ]];then for j in$line;do lineNumber=$(echo$j | cut -d':' -f1); sed -n "$lineNumber p" multiFastaFILE; awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print$0 }' multiFastaFILE; done;fi;done

Where headersFILE is the file containing one header per line ; multiFastaFILE is obviously your multifasta file.

0
Entering edit mode

It worked very well! Thanks!

4
Entering edit mode
7.2 years ago
h.mon 34k

The following script should work:

#!/usr/bin/env perl
my $list_file =$ARGV[0];
my $fasta_in =$ARGV[1];
my $fasta_out =$ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n"; open(FASTA_IN, "<",$fasta_in) or die "could not open '$fasta_in' :$! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open$fasta_out : $! \n"; my @headers = (); while(<LIST_FILE>) { chomp; next if ( /^\s*$/ );
push(@headers, $_.";"); } my$pat = join '|', map quotemeta, @headers;
$/ = ">"; while(<FASTA_IN>) { chomp; if ( /$pat/ ) { print FASTA_OUT ">$_"; } } close(LIST_FILE); close(FASTA_IN); close(FASTA_OUT); Call it with: ./getFastas.pl list.txt sequences.fasta sequences.out ADD COMMENT 0 Entering edit mode It did not work. Blank output. ADD REPLY 0 Entering edit mode I tested with the following mock files: list.txt Paenibacillus borealis Paenibacillus sp. 1-18 Paenibacillus sp. 1-49 Paenibacillus sp. A9 Paenibacillus sp. Aloe-11 sequences.fasta: >S000000859 Bacillus sp. USC14; AF346495 ATGCATGCATGCATGCATGCATGC >S000001027 Paenibacillus borealis; KN25; AJ011325 GCGCGCATATATGCGCGCATATAT >S000001026 Paenibacillus borealisensis; KN26; AJ011326 GCGCGCATATATGCGCGCATATAT >Simaginary Streptomyces borealis; KN25; AJ011325 AAAAAAAAAAAAAAAAAAAAAAAA >S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas GGGGGGGGGGGGGGGGGGGGGGGG >S000001027_Paenibacillus borealis; KN25; AJ011325 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA worked all right for me, retrieving three sequences ( "S000001027 Paenibacillus borealis; KN25; AJ011325", "S000001027_Paenibacillus borealis; KN25; AJ011325" and "S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas" ). ADD REPLY 0 Entering edit mode For the mock sequences.fasta, it works fine. However, it does not work if the sequences.fasta has the following sequence: >AB042938.1.1462 Paenibacillus borealis GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGGACUUUGCCUUCGGGUAAAGUUAGCGGCGGACGGGUGAG UAACACGUGGGUAACCUGCCCAUAAGACUGGGAUAACAUUCGGAAACGAAUGCUAAUACCGGAUACGCGAAUCGGUCGCA UGAUCGAAUCGGGAAAGGCGGAGCAAUCUGCCACUUAUGGAUGGACCCGCGGCGCAUUAGCUAGUUGGUGGGGUAACGGC UCACCAAGGCGACGAUGCGUAGCCGACCUGAGAGGGUGAUCGGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGG GAGGCAGCAGUAGGGAAUCUUCCGCAAUGGACGAAAGUCUGACGGAGCAACGCCGCGUGAGUGAUGAAGGUUUUCGGAUC GUAAAGCUCUGUUGCCAGGGAAGAACGCUUGCGAGAGUAACUGCUCGCAAGGUGACGGUACCUGAGAAGAAAGCCCCGGC UAACUACGUGCCAGCAGCCGCGGUAAUACGUAGGGGGCAAGCGUUGUCCCGGAAUUAUUGGGCGUAAAGCGCGCGCAGGC GGCCUUGUAAGUCUGUCGUUUAAACUCGGAGCUCAACUUCGAGUCGCGAUGGAAACUGCAAAGCUUGAGUGCAGAAGAGG AAAGUGGAAUUCCACGUGUAGCGGUGAAAUGCGUAGAGAUGUGGAGGAACACCAGUGGCGAAGGCGACUUUCUGGGCUGU AACUGACGCUGAGGCGCGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCCGUAAACGAUGAAUGCU AGGUGUUAGGGGUUUCGAUACCCUUGGUGCCGAAGUUAACACAUUAAGCAUUCCGCCUGGGGAGUACGGUCGCAAGACUG AAACUCAAAGGAAUUGACGGGGACCCGCACAAGCAGUGGAGUAUGUGGUUUAAUUCGAAGCAACGCGAAGAACCUUACCA GGUCUUGACAUCCCUCUGACCGUCCUAGAGAUAGGGCUUUCCUUCGGGACAGAGGAGACAGGUGGUGCAUGGUUGUCGUC AGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGCAACGAGCGCAACCCUUGAUCUUAGUUGCCAGCACUUUGGGUGGGCA CUCUAGGAUGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAAAUCAUCAUGCCCCUUAUGACCUGGGCUAC ACACGUACUACAAUGGCCGAUACAACGGGAAGCGAAACCGCGAGGUGGAGCCAAUCCUAUCAAAGUCGGUCUCAGUUCGG AUUGCAGGCUGCAACUCGCCUGCAUGAAGUCGGAAUUGCUAGUAAUCGCGGAUCAGCAUGCCGCGGUGAAUACGUUCCCG GGUCUUGUACACACCGCCCGUCACACCACGAGAGUUUACAACACCCGAAGCCGGUGGGGUAACCGCAAGGAGCCAGCCGU CGAAGGUGGGGUAGAUGAUUGG ADD REPLY 1 Entering edit mode No it doesn't, because it lacks the semicolon ( ; ) at the end of the species / strain name in the fasta header. I've used the semicolon because all the fasta headers you provided had a semicolon, and it was a good way to, e.g., select "Paenibacillus borealis" (which is included in your list) and avoid "Paenibacillus borealisensis" (which I made up and is not included in your list). That is, it was just a way of avoiding longer names which could contain shorter ones, but were not in the list. Anyways, if you change: push(@headers,$_.";");

to:

push(@headers, $_); it should work again. ADD REPLY 3 Entering edit mode 7.2 years ago You could use Biopython's fasta parser. I didn't test this, but this would be the basic idea: from Bio import SeqIO handle = open("example.fasta", "rU") organism_name = "Paenibacilus borealis" for record in SeqIO.parse(handle, "fasta"): if organism_name in record.id: # Do whatever you want with these print record.id print record.seq handle.close() There's more info here: http://biopython.org/wiki/SeqIO ADD COMMENT 0 Entering edit mode I have a similar problem, I have many species name in one fasta file and I want them to copy those files which contain the desired species sequence. The file is like this >CanaanFir115301 PQPIVPPAASTSSEATTSTDVPQPIVHPAASTSSEATTSPVAFPARDNVAGTSTTSTDVPQPIVPPAASTSSEATTSPVA FPARDNVAGTSSSSSVSKASPLIHDIVAAARAVVTDADLDDLIGNIFSLLEHIKAKKTETKSS >FraserFir31480 MGKTTLADAVFARVYVEGCKYSMVRLFDDVTSTPSTSHIVKLQKYILEDLKMGTPEETTHE >CanaanFir58880 SSTDVLPESNEIAPVDHFSVTSTAAPNATDHGPQPIVPPDASSSSTDVLPESNEIAPVDHFSVTSTAAPNATDDALKHPQ HGPQPILPPDASSSSTDVLPDSNEPQPIVHPDLSSSSPLAFPAHDEVYQVSEAASSSNAATDNHKVADDQLQGTPSSSTV SNAPHLINNILTAANALDT >BalsamFir4479 MEKKSCRCRLFSCFQCCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSTDVPQPIVHPAASASSKFSCFQ CCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSADVPQPIVHPAASTSSEATT  I want to copy the files which contain the sequence (Balsam, Fraser and Canaan) species. The rest will be discarded. How can i do that? ADD REPLY 2 Entering edit mode 7.2 years ago venu 7.0k As your list is containing 'Paenibaci' in common, I have followed this approach grep '^>' input.fasta > headers.txt sed -n '/Paenibaci/p' headers.txt > filter_headers.txt$ ./faSomeRecords input.faa filter_headers.txt output.faa

A: perl code to extract sequences from multi-line fasta works on all test files but

0
Entering edit mode

Not worked. Output file is empty.

0
Entering edit mode

Can you provide some more organism names from the organisms file? You have only provided one name, post 5-10 line from organism file

0
Entering edit mode

Edited.

0
Entering edit mode

try the edited one..it might work

0
Entering edit mode

I think you have not understood. I already have a "header" file (organism name), which was not derived from a fasta file (so, grep and sed commands are unnecessary). I want to collect those sequences that contain specif organism names ("headers") from a FASTA file.

2
Entering edit mode
7.2 years ago
pip install --user pyfaidx
xargs faidx 16s.fasta < organisms.txt
0
Entering edit mode

I wasn't familiar with that package and it looks really nice. Thanks!

0
Entering edit mode

Firstly, the pip install did not work. I had to install using setup.py.

When I try to run the script, I get the following error:

Traceback (most recent call last):
File "/usr/local/bin/faidx", line 9, in <module>
File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 149, in main
File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 40, in write_sequence
File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 50, in fetch_sequence
KeyError: 'Paenibacillus_alginolyticus_DSM_5050'

My command was:  xargs faidx rdp_download_2882seqs.fasta < organisms.txt (the list of organism names)

0
Entering edit mode

Regarding the pip install issue, you probably don't have pip installed, but it seems that you figured that out. The KeyError is telling you that the organism name "Paenibacillus_alginolyticus_DSM_5050" does not exist as a FASTA header line in your file. I should probably catch this and warn the user instead of having the script bail.

0
Entering edit mode

Yes, I had pip installed, and it installed pyfaidx. However faidx was not executable it in terminal, only in python.

In my list of species, each line has a organism name. "Paenibacillus_alginolyticus_DSM_5050" was the first one. So, if the first name is not located in my fasta file, faidx does not try to locate other names?

0
Entering edit mode

That is correct, but I just submitted a change that fixes that behavior. The current github master branch will just warn the user and move to the next query without exiting with an error code.

0
Entering edit mode

I've just pushed a new version of pyfaidx (v0.4.1) that will produce a warning for each sequence that is not found in your file, but will continue without exiting. You can redirect stderr to a log file to figure out which organisms are missing:

xargs faidx rdp_download_2882seqs.fasta < organisms.txt 2> missing.log
0
Entering edit mode

Not worked for me. Here is the log:

...

...

Of course the strain Paenibacillus sp. 1-49 is in rdp_download_2882seqs.fasta and organisms.txt files.

0
Entering edit mode

Ah, yes. There can be no spaces in the sequence names. "xargs" passes each line in organisms.txt as the arguments for "faidx". If there are spaces, they are interpreted as separate sequence names. Generally it is advisable to not include spaces in the FASTA sequence names, as most programs will truncate anything after the first space.

0
Entering edit mode

Still not working. Here is the log:

>ASRY01000111.1.1112_Paenibacillus_sp._1-49
GAGUAACACGUAGGCAACCUGCCCAUGAGACUGGGAUAACUACCGGAAACGGUAGCUAAUACCGGAUACAUCCUUUCCCU
GCAUGGGGAGGGGAGGAAAGACGGAGCAAUCUGUCACUGAUGGAUGGGCCUGCGGCGCAUUAGCUAGUUGGUGGGGUAAU

0
Entering edit mode

The issue here is that you are expecting to find an inexact match. You can do this using:

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log

The --regex option specifies that the sequence names are regular expressions patterns to be matched.

0
Entering edit mode

Traceback (most recent call last):
File "/usr/local/bin/faidx", line 9, in <module>
File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 153, in main
File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 16, in write_sequence
File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 577, in __init__
File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 214, in __init__
IOError: [Errno 2] No such file or directory: 'Paenibacillus_alginolyticus_DSM_5050'

0
Entering edit mode

See the updated command above. Note that order of arguments matters.

0
Entering edit mode

faidx: error: unrecognized arguments: Paenibacillus_sp._1-49

I also have tried with simpler data, but no output has been generated (I also have utilized the -o option).

0
Entering edit mode

For future online interactions, it is usually preferred to tell people what you did (what commands you tried), the error, and show some of the data you're working with. At this point I see no reason why you should not have a solution that works.

0
Entering edit mode

Firstly, thanks. I had two solutions, as you can see. I just tried to give you a feedback, since you expend some time helping me.

I have tested your script with the data I provided above (the only difference is the presence of underscores instead of spaces). The command is exactly what you send me (the file names are the same):

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log > filtered.out

Here is the error message:

faidx: error: unrecognized arguments: Paenibacillus_sp._1-18 Paenibacillus_sp._1-49 Paenibacillus_sp._A9 Paenibacillus_sp._Aloe-11

Certainly your script recognized "Paenibacilus_borealis), but the output was empty.

I tested the same data with fasgrep, and I had a positive result.