Question: How to extract a sequence below a fasta header (">") from a fasta file ?
1
gravatar for v.berriosfarias
6 weeks ago by
v.berriosfarias10 wrote:

Hello I have a lot of sequences in a FASTA file, and I want to extarct a specific sequence knowing the header ID. for example the header of a sequence is:

NODE_19_length_5758_cluster_19_candidate_1

I know that with grep I can extract the header, but i want the below sequences to appear on stdout.

How can I do this on bash?

bash fasta • 308 views
ADD COMMENTlink modified 6 weeks ago by ATpoint46k • written 6 weeks ago by v.berriosfarias10
$ seqkit grep -w 0 -ip NODE_19_length_5758_cluster_19_candidate_1 input.fa

for printing sequence only:

$ seqkit grep -ip NODE_19_length_5758_cluster_19_candidate_1 test.fa  | seqkit seq -s -w 0
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by cpad011215k
9
gravatar for ATpoint
6 weeks ago by
ATpoint46k
ATpoint46k wrote:

The proposed solutions are probably all fine but have the limitation that they first have to iteratively find the correct sequence which can take time if the file is (very) large. The by far simplest solution would be to use samtools faidx which in a first step indexes your fasta file and then can use this index to retrieve any sequence in basically no time. The index generation takes like 10 seconds for the entire mouse genome and is therefore not really a limitation, and it only has to be done once per fasta file:

samtools faidx your.fasta NODE_19_length_5758_cluster_19_candidate_1

For many sequences manually:

samtools faidx your.fasta seq1 seq2 seq(...N)

or if you want it automated

(upstream cmd creating "\n"-sep list of seq.names) | xargs samtools faidx your.fasta

Please be sure to browse existing threads, this has been asked many times before.

How To Extract A Sequence From A Big (6Gb) Multifasta File ?

How do I extract Fasta Sequences based on a list of IDs?

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by ATpoint46k

this was very useful , thanks so much!

ADD REPLYlink written 6 weeks ago by v.berriosfarias10

v.berriosfarias : Since you received multiple answers you should test/accept all of them. It is fine to accept multiple answers if they all work.

ADD REPLYlink written 6 weeks ago by GenoMax96k
3
gravatar for i.sudbery
6 weeks ago by
i.sudbery11k
Sheffield, UK
i.sudbery11k wrote:

I'm not sure you can in pure bash, but you can with awk (I think).

If you make the record separator \n> rather than \n, and the field separator \n, I think you should be able to do:

awk -v RS="\n>" -v FS="\n" '$1=="NODE_19_length_5758_cluster_19_candidate_1" {print ">"$0}' my_fasta.fasta
ADD COMMENTlink written 6 weeks ago by i.sudbery11k

yes , this worked for me thanks so much :)

ADD REPLYlink written 6 weeks ago by v.berriosfarias10

can you explain the $0 and the $1 on this situation ? I'll be so grateful if you explain this

ADD REPLYlink written 6 weeks ago by v.berriosfarias10
1

In awk $X references "record X", this usually means "column X" when the record separator is something like space, tab or comma. So $1 is record/column 1. As we are using \n as a field separator, in this case it means "line 1" of the record.

$0 means the whole record.

ADD REPLYlink written 6 weeks ago by i.sudbery11k

Very interesting. I had never thought about using awk's record and separators like this to split files having grouped lines.

ADD REPLYlink written 6 weeks ago by Jorge Amigo12k

Just leaving a note about a problem with this:

Originally I set RS="\n>" this doesn't for the very first entry, because it starts with >, not \n. This means the record name of the first record will include >, while all the others exclude it.

You could set RS=">", but then if you have the symbol > anywhere else other than the start of a new record (e.g. if the title of a record has a mutaiton in it written as A>T) you'll get the records split in the wrong place.

As it is, I would just go with @ATPoints answer, which I had completely forgotten about when I wrote this.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by i.sudbery11k

another awk solution for flattend fasta file:

$ awk -v OFS="\n" '/^>/ {getline seq} {if ($0==">NODE_19_length_5758_cluster_19_candidate_1") print $0,seq}' test.fa

for printing sequence only:

$ awk -v OFS="\n" '/^>/ {getline seq} {if ($0==">NODE_19_length_5758_cluster_19_candidate_1") print seq}' test.fa
ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by cpad011215k
1
gravatar for jenn.drummond
6 weeks ago by
United States
jenn.drummond80 wrote:

The awk solution is more general, but for simplicity, if I have single-line rather than hard-wrapped FASTAs, I prefer to do this with grep -A and tail.

grep -wA 1 '>NODE_19_length_5758_cluster_19_candidate_1' example.fasta | tail -n 1

The -A 1 tells grep to return both the matched line and the line immediately after it, and then tail takes the last (second) line of that result. (The w is just for full-word matching in case you have similar sequence names that are subsets of each other.)

If you're sure that the entire sequence is on the next line, I find grep is easier to use in a parameterized loop than the awk version. I always get tangled up with the quoting. In fact, with the grep approach, you can even use a file to hold your list of desired headers. If you want both the headers and their sequences, just drop the second grep.

grep -wf list-of-headers.txt -A 1 example.fasta | grep -v '^>'

If your sequences are spread across multiple lines, then use one of the awk solutions, or "unwrap" your FASTA first with a tool like fastx_toolkit.

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by jenn.drummond80
0
gravatar for Jorge Amigo
6 weeks ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

I would definitely go for @ATpoint's samtools solution in general, as it should work always and it should be really fast.

If programming alternatives like @i.sudbery's awk solution in general or @jenn.drummond's grep if your fasta file is linearized would be considered, here is a perl alternative just for having where to choose:

perl -ne 'if (/^>NODE_19_length_5758_cluster_19_candidate_1/) { $o = 1; print; next } if ($o) { last if /^>/; print }' my_fasta.fasta

It should also work always, it allows you to use a sequence pattern rather than the entire sequence name (this could be convenient in certain cases), and it woud be faster than the first of samtools faidx. If any other sequences would be queried afterwards, having your fasta file already indexed would definitely be faster.

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by Jorge Amigo12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 993 users visited in the last hour
_