Question: perl code to extract sequences from multi-line fasta works on all test files but not on research files
gravatar for kdiaz17
4.8 years ago by
United States
kdiaz170 wrote:

I have headers from a BLAST output file that I would like to create a subset database from for use in HMMer, so I pulled some code from a class I had to create a subset fasta file from the main fasta by giving a perl script a list of headers and pulling sequences with matching headers from that fasta and creating a subset fasta with the matching sequences.  It works on the class test files, but not on my own research files.  I've tried different scripts with the same end and all have the same effect.  Even if I just use one sequence that I know matches as the input fasta, the output is always empty once I run any of the scripts.  I'm not sure what's wrong- my files appear to be formatted just like the test files (multi-line wrapping, > included in the headers file, not tab delimited, etc), and i can't tell if it's my input fasta or the headers file or what.  It's not the different extensions either- it worked on other files.  Anyone know what's wrong?  It has to be something in my files themselves.

For reference my own files: my database file; the sequences I want to extract from my database file

a couple of the scripts I've been using: and

the test files: test headers; test fasta 1; test fasta 2



ADD COMMENTlink modified 4.8 years ago by Brian Bushnell17k • written 4.8 years ago by kdiaz170

Some of your links are dead. Empty or wrong files might be one of the problems.

ADD REPLYlink written 4.8 years ago by Alex Reynolds29k

Just checked them, they all work (just one wasn't, and it was a Dropbox parsing error) and permissions are there.  These are also only copies of what I'm working off of,  I only uploaded them to Dropbox for examples.  They actual files are stored on my computer.

ADD REPLYlink written 4.8 years ago by kdiaz170
gravatar for venu
4.8 years ago by
venu6.3k wrote:

Perl is a very nice language. But when there are simple tools available for this kind of tasks, why to go for perl.

Try using faSomeRecords

Here you can find it.

$ chmod +x faSomeRecords

$ ./faSomeRecords input.faa ids.txt output.faa

Working properly with your files.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by venu6.3k

I downloaded it and ran it, but all output files are empty (even my test files this time) even with downloading slightly different versions of it.  Which is a shame, because it looks like it would clear this up right away.  Thank you though.

ADD REPLYlink written 4.8 years ago by kdiaz170

Don't download different versions. Download the same. Remove '>' symbol from ids file. Then it will work. 

ADD REPLYlink written 4.8 years ago by venu6.3k

It worked, thank you VERY much sir!  Can't believe I didn't try removing the ">" before. But regardless, now I can actually get some serious work done this weekend. Thank you again for introducing me to this valuable tool, I appreciate it very much!

ADD REPLYlink written 4.8 years ago by kdiaz170
gravatar for Brian Bushnell
4.8 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

There is a program for this purpose called FilterByName, in the BBTools package. Usage: in=sequences.fasta names=names.txt include=t out=filtered.fasta

It's very flexible - the sequences can be in fasta, fastq, scarf, sam, bam, and compressed in various formats. The names can be just about anything:

A name or list of literal names, comma-delimited, such as names=chr1,chr7,chr18

A file or list of files containing names (one name per line) such as names=names.txt

A list of sequence files, such as names=blacklist.fasta,junk.fastq,unmapped.sam

...or any combination of the above.

The include flag sets the mode. If include=t, then only sequences with matching names will be retained. If include=f, then only sequences with nonmatching names will be retained (so anything with a name matching your name list will be discarded).

ADD COMMENTlink modified 4 months ago by RamRS25k • written 4.8 years ago by Brian Bushnell17k

I did know about this tool. It looks really nice. I tried to run it and it runs, but I do not understand where is the output. This is what I obtain:

Input is being processed as unpaired
Time:               0.228 seconds.
Reads Processed:    62933     276.42k reads/sec
Bases Processed:    18378339     80.72m bases/sec
Reads Processed:    29
Bases Processed:    14320
ADD REPLYlink modified 4 months ago by RamRS25k • written 4.1 years ago by dago2.6k

Ah, sorry! I forgot to add the out= flag. I've fixed the command above.

ADD REPLYlink modified 4 months ago by RamRS25k • written 4.1 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2068 users visited in the last hour