Extract several parts from fasta header
2.7 years ago
rah ▴ 20

I'm looking for a way to create a text file containing some information about sequence reads, extracted from a .fasta file, using grep, sed or awk.

Basically, I have several FASTA sequences which I have trimmed. Here is an example header from a trimmed FASTA file, for a sequence where I have the original as well as the trimmed length:

>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060


So the information I want from this header is the:

trimmed length: 0-1060

So far I've got this:

grep -o -E "^>\w+|.read=\w+|.trim=\w+" test.fasta


Which yields the output

>ca51a0fa
trim=0


What I'm looking for would be either this

>ca51a0fa
trim=0-1060


Or this

>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e
trim=0-1060


I can't really get it to work. Would any of you have a suggestion? Thanks

Why not use awk, delimit on space and then print the fields you need?

Because I didn't think of that. All of the examples I could find for handling FASTA headers were using grep, so I thought I might as well stick with grep. Well, that worked perfectly, thanks.
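For anyone landing here later, the awk suggestion above could be sketched roughly as follows. This is only a sketch: the function name extract_id_and_trim is made up for illustration, the sequence line in the demo file is invented, and it assumes the header fields are separated by single spaces as in the example header. It looks for the field starting with trim= rather than hard-coding a field number, so it should still work if the number of fields varies.

```shell
# Hypothetical helper: print the read ID and the trim= field of each FASTA header.
extract_id_and_trim() {
    awk '/^>/ {
        print $1                        # field 1 is ">" followed by the read ID
        for (i = 2; i <= NF; i++)       # scan the remaining space-separated fields
            if ($i ~ /^trim=/) print $i
    }' "$1"
}

# Demo file built from the header in the question; the sequence line is invented.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060
ACGTACGT
EOF

extract_id_and_trim "$demo"
rm -f "$demo"
```

On the demo file this prints the full ID line (>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e) followed by trim=0-1060. To also get the read number, the same loop could test for /^read=/.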

$ grep -o -E "^>\w+|.read=\w+|.trim=\w+\W\w+" test.txt
>ca51a0fa
read=3826
trim=0-1060

$ grep -Eio ">(\w+\W){5}|read=\w+|trim=\w+\W\w+" test.txt
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e
trim=0-1060
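A note on why the original attempt stopped at trim=0: \w only matches letters, digits and underscore, so it cannot cross the hyphen in 0-1060 or in the read ID. A possible variant that spells out the character classes instead is sketched below. It assumes a grep that supports -o (GNU and BSD grep both do) and that the ID ends at the first space; the function name grep_id_and_trim and the sequence line in the demo file are made up for illustration.

```shell
# Hypothetical helper: same extraction as above, using grep only.
# ^>[^ ]+              matches ">" plus everything up to the first space (the full ID)
# trim=[0-9]+-[0-9]+   matches the trim range, hyphen included
grep_id_and_trim() {
    grep -oE '^>[^ ]+|trim=[0-9]+-[0-9]+' "$1"
}

# Demo file built from the header in the question; the sequence line is invented.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>ca51a0fa-e6e5-4fd7-bd00-91cba70ca87e runid=f51153f9c3ec50d37d212f8f83dc387ac416f3c9 read=3826 ch=60 start_time=2018-11-21T16:47:21Z barcode=barcode01 trim=0-1060
ACGTACGT
EOF

grep_id_and_trim "$demo"
rm -f "$demo"
```

Because the pattern does not depend on the ID having exactly five hyphen-separated groups, it should also cope with IDs in other formats.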

Thanks for your suggestions for both options.

SEDA (https://www.sing-group.org/seda/) has an operation for processing FASTA headers that can do this kind of thing. It is called 'Rename header' (https://www.sing-group.org/seda/manual/operations.html#rename-header) and may be useful to you. You do not even need to install SEDA: you can use the Docker image with the latest version, available at Docker Hub (https://hub.docker.com/r/pegi3s/seda/). Regards!

It looks really useful. Thanks!