Sorry if similar questions have been answered already - I had a look and can't see anything that suits my problem exactly.
I'm trying to use Samtools to extract Fasta sequences from a database file using a list. (Sorry if my terminology is dreadful, I'm new to computational biology).
So I have my database file as a multi-FASTA file, and my list purely as a list of gene headers in a text file, e.g.
ND3005_accessory_2122_length_3432
etc. These do exactly match the headers in the database, with the exception of not having a >
at the start of each sequence, but I've tried with these as well.
The only output that I get is a file with the headers in, e.g.
>ND3005_accessory_2122_length_3432
and an example of the terminal output is the following:
not found in FASTA file, returning empty sequence0_length_2460
Here are the commands I've been using:
$ samtools faidx Gene_db.fasta
$ xargs samtools faidx Gene_db.fasta < GH2_list.txt > GH2_output.fasta
Any thoughts would be appreciated. Thanks!
Hello spidcock01 ,
I'm not familiar with
xargs
. I would useparallel
like this:And please use the formatting bar (especially the
code
option) to present your post better. I've done it for you this time.Thank you!
fin swimmer
Unfortunately this still produces a file with just the headers. This means that the problem is probably with my list file, right?
Thank you for advice regarding formatting!
If you get just the headers, I would suspect that something with your fasta file is wrong.
What's the output of:
Also you should check if the line ending terminators are correct:
If the output is
Gene_db.fasta: ASCII text, with CRLF line terminators
you should installdos2unix
and rundos2unix Gene_db.fasta
.fin swimmer
Did you try my solution bellow?
Thanks for your help! You were right in that it was the line ending terminators.
Thank you again!