Extract several parts from fasta header

5.1 years ago

rah ▴ 20

I'm looking for a way to create a text file containing some information about sequence reads, extracted from a .fasta file, using grep, sed or awk.

Basically, I have several FASTA sequences which I have trimmed. Here is an example of a header from a trimmed FASTA file, which carries both the original and the trimmed length:

>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060

So the information I want from this header is:

read name: ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e

original read length: 3826

trimmed length: 0-1060

So far I've got this:

grep -o -E "^>\w+|.read=\w+|.trim=\w+" test.fasta

Which yields the output

>ca51a0fa
read=3826
trim=0

What I'm looking for would be either this:

>ca51a0fa
read=3826
trim=0-1060

Or this

>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e
read=3826
trim=0-1060

I can't really get it to work. Would any of you have a suggestion? Thanks.

fasta grep sed unix bash • 1.7k views

ADD COMMENT • link 5.1 years ago by rah ▴ 20

Why not use awk, delimit on space and then print the fields you need?
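
A minimal sketch of that idea, assuming the header fields are separated by single spaces as in the example above. The file name header_info.txt and the ACGT sequence line are made up for illustration; the loop looks for read= and trim= by name rather than by position, in case the field order differs between files.

```shell
# Sample file with the header from the question (the sequence line is made up).
cat > test.fasta <<'EOF'
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060
ACGT
EOF

# On header lines only, print the read name (field 1),
# then every field that starts with read= or trim=.
awk '/^>/ {
    print $1
    for (i = 2; i <= NF; i++)
        if ($i ~ /^(read|trim)=/) print $i
}' test.fasta > header_info.txt

cat header_info.txt
```

If the fields are always in the same position, printing $1, $3 and $7 directly would also do, but that breaks as soon as a header has an extra or missing field.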

ADD REPLY • link 5.1 years ago by GenoMax 141k

Because I didn't think of that. All of the examples I could find for handling FASTA headers used grep, so I thought I might as well stay with grep. That worked perfectly, thanks.

ADD REPLY • link 5.1 years ago by rah ▴ 20

$ grep -o -E "^>\w+|.read=\w+|.trim=\w+\W\w+" test.txt
>ca51a0fa
 read=3826
 trim=0-1060


$ grep -Eio ">(\w+\W){5}|read=\w+|trim=\w+\W\w+" test.txt
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e 
read=3826
trim=0-1060
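
For context on why the pattern in the question stopped at trim=0: \w matches only letters, digits and the underscore, so the match ends at the first hyphen. The commands above work around that by appending \W\w+ (and by counting five groups for the read name). A variant sketch, assuming the fields contain no spaces, is to match "everything up to the next space" instead. The file name header_info_grep.txt and the ACGT sequence line are made up for illustration.

```shell
# Sample file with the header from the question (the sequence line is made up).
cat > test.fasta <<'EOF'
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060
ACGT
EOF

# [^ ]+ consumes everything up to the next space, so hyphens are included
# and the number of hyphen-separated groups in the read name does not matter.
grep -oE '^>[^ ]+|(read|trim)=[^ ]+' test.fasta > header_info_grep.txt

cat header_info_grep.txt
```

This also avoids the leading or trailing space that the patterns with . and \W pick up in the output.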

ADD REPLY • link 5.1 years ago by cpad0112 21k

Thanks for your suggestions for both options.

ADD REPLY • link 5.1 years ago by rah ▴ 20

SEDA (https://www.sing-group.org/seda/) has an operation for processing FASTA headers that can do this type of thing. It is called 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header) and may be useful to you. You do not even need to install SEDA: you can use the Docker image with the latest version, available at Docker Hub (https://hub.docker.com/r/pegi3s/seda/). Regards!