I have a group of fasta sequences numbered sequentially 1,2, etc. and would like to extract a subset by their numbers/identifiers. Using bp_index and bp_fetch in bioperl enables me to extract a single sequences, but I am not familiar enough with perl to automate this for a large group. Any suggestions, ideally in python but also in perl? Thanks.
Here is a quick solution in Python.
from Bio import SeqIO fasta_file = "fasta_file.fasta" # Input fasta file wanted_file = "wanted_file.txt" # Input interesting sequence IDs, one per line result_file = "result_file.fasta" # Output fasta file wanted = set() with open(wanted_file) as f: for line in f: line = line.strip() if line != "": wanted.add(line) fasta_sequences = SeqIO.parse(open(fasta_file),'fasta') with open(result_file, "w") as f: for seq in fasta_sequences: if seq.id in wanted: SeqIO.write([seq], f, "fasta")
FWIW, you can do this from the commandline if you have pyfasta installed with:
pyfasta extract --header --fasta my.fa header1 header2 header3 subset.fa
or if the headers you want are listed in a file:
pyfasta extract --header --fasta my.fa --file header-list.txt > subset.fa
Here is a solution with GNU pcregrep, this is what I generally use to extract an entry (title+sequence) from a fasta file (in this example
gi|30261391 in NCBI nr.fasta, you can add more patterns with
pcregrep --buffer-size=16M -M -e 'gi[[:punct:]]116108791[[:punct:]].*\s([[:alpha:]]+\s*)*' -e 'gi[[:punct:]]30261391[[:punct:]].*\s([[:alpha:]]+\s*)*' nr.fasta
--buffer-size to increase the default buffer-size which was not sufficient for NCBInr
-M for multiline matching
-e for adding a pattern
Depending of your platform, use single/double quotes. This has been tested with the Windows GNU cygwin version 8.21 2011-12-12 of PCREGREP but should work with the pcregrep binary of every Linux distribution. The main limitation is that it has to parse the entire file to return control. I tried to add the '--match-limit' parameter to the number of patterns but gave me some weird errors.
If you have a list of sequence IDs that you want to extract, you can use the UCSC utilities faOneRecord (to extract just a single record), or faSomeRecords (multiple sequences):
$ faSomeRecords inputfile.fasta listfile outputfile.fasta
Also, using the option '-exclude' will output sequences not present in 'listfile'
If besides an identifier you had sequence length, you could create a simple BED file and use "bedtools getfasta".
An example would be:
- Create a bed file from your whole fasta file.
- Grep the bed lines of your identifers and pipe it to
bedtools getfasta -fi whole_fasta -bed - -fo myfasta