Question

perl code to extract sequences from multi-line fasta works on all test files but not on research files

0

Entering edit mode

9.0 years ago

kdiaz17 • 0

I have headers from a BLAST output file that I would like to create a subset database from for use in HMMer, so I pulled some code from a class I had to create a subset fasta file from the main fasta by giving a perl script a list of headers and pulling sequences with matching headers from that fasta and creating a subset fasta with the matching sequences. It works on the class test files, but not on my own research files. I've tried different scripts with the same end and all have the same effect. Even if I just use one sequence that I know matches as the input fasta, the output is always empty once I run any of the scripts. I'm not sure what's wrong- my files appear to be formatted just like the test files (multi-line wrapping, > included in the headers file, not tab delimited, etc), and I can't tell if it's my input fasta or the headers file or what. It's not the different extensions either- it worked on other files. Anyone know what's wrong? It has to be something in my files themselves.

For reference my own files:

My database file: https://www.dropbox.com/s/1dehw638mm6soyo/assembly71_201206_P1.fa?dl=0
The sequences I want to extract from my database file: https://www.dropbox.com/s/k3tcrxp69xwdx7s/test_2_filtered_GIs.txt?dl=0

A couple of the scripts I've been using:

The test files:

Test headers: https://www.dropbox.com/s/9z0twid9hjuqd5m/Ceratopteris1_headers.txt?dl=0
Test fasta 1: https://www.dropbox.com/s/ns4hpcitlhjnewt/Ceratopteris1.fasta?dl=0
Test fasta 2: https://www.dropbox.com/s/y2hz3ai23t43yp3/Ceratopteris2.fasta?dl=0

perl fasta • 5.1k views

ADD COMMENT • link updated 22 months ago by Ram 43k • written 9.0 years ago by kdiaz17 • 0

0

Entering edit mode

Some of your links are dead. Empty or wrong files might be one of the problems.

ADD REPLY • link 9.0 years ago by Alex Reynolds 35k

0

Entering edit mode

Just checked them, they all work (just one wasn't, and it was a Dropbox parsing error) and permissions are there. These are also only copies of what I'm working off of, I only uploaded them to Dropbox for examples. They actual files are stored on my computer.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by kdiaz17 • 0

2

Entering edit mode

9.0 years ago

Brian Bushnell 20k

There is a program for this purpose called FilterByName, in the BBTools package. Usage:

filterbyname.sh in=sequences.fasta names=names.txt include=t out=filtered.fasta

It's very flexible - the sequences can be in fasta, fastq, scarf, sam, bam, and compressed in various formats. The names can be just about anything:

A name or list of literal names, comma-delimited, such as names=chr1,chr7,chr18

A file or list of files containing names (one name per line) such as names=names.txt

A list of sequence files, such as names=blacklist.fasta,junk.fastq,unmapped.sam

...or any combination of the above.

The include flag sets the mode. If include=t, then only sequences with matching names will be retained. If include=f, then only sequences with nonmatching names will be retained (so anything with a name matching your name list will be discarded).

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 9.0 years ago by Brian Bushnell 20k

0

Entering edit mode

I did know about this tool. It looks really nice. I tried to run it and it runs, but I do not understand where is the output. This is what I obtain:

Input is being processed as unpaired
Time:               0.228 seconds.
Reads Processed:    62933     276.42k reads/sec
Bases Processed:    18378339     80.72m bases/sec
Reads Processed:    29
Bases Processed:    14320

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.3 years ago by dago ★ 2.8k

0

Entering edit mode

Ah, sorry! I forgot to add the out= flag. I've fixed the command above.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.3 years ago by Brian Bushnell 20k

Ram · Accepted Answer · 2015-04-03

6

Entering edit mode

9.0 years ago

venu 7.1k

Perl is a very nice language. But when there are simple tools available for this kind of tasks, why to go for perl.

Try using faSomeRecords

Here you can find it.

$ chmod +x faSomeRecords

$ ./faSomeRecords input.faa ids.txt output.faa

Working properly with your files.

ADD COMMENT • link 9.0 years ago by venu 7.1k

0

Entering edit mode

I downloaded it and ran it, but all output files are empty (even my test files this time) even with downloading slightly different versions of it. Which is a shame, because it looks like it would clear this up right away. Thank you though.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by kdiaz17 • 0

3

Entering edit mode

Don't download different versions. Download the same. Remove '>' symbol from ids file. Then it will work.

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by venu 7.1k

0

Entering edit mode

It worked, thank you VERY much sir! Can't believe I didn't try removing the ">" before. But regardless, now I can actually get some serious work done this weekend. Thank you again for introducing me to this valuable tool, I appreciate it very much!

ADD REPLY • link updated 22 months ago by Ram 43k • written 9.0 years ago by kdiaz17 • 0