Question

How to match an exact string in a fasta header whilst excluding matches followed by hyphen?

2.5 years ago

alexander.byrne ▴ 10

Hi,

I have a FASTA file that contains a set of influenza proteins (PB2, PB1, PB1-F2, PA, PA-X, HA, NP, NA, M1, M2, NS1 and NS2). I'm trying to use grep to pull out the headers for individual proteins, e.g. grep "PB2" fastafile

This works fine for most of the proteins, but for PB1 and PA (grep "PB1" fastafile or grep "PA" fastafile) it returns not only the headers containing PB1 or PA but also those containing PB1-F2 and PA-X.

I've tried playing around with regexes (e.g. "PB1$"), but that doesn't appear to solve the issue either.

Does anyone have an idea of how to solve this?

grep hyphen header fasta • 1.3k views

updated 2.5 years ago by cpad0112 21k • written 2.5 years ago by alexander.byrne ▴ 10

Please post example input headers and expected output headers. Without any data to test on, I would suggest trying grep -w "PA" fastafile, although grep alone may not be sufficient for multi-line FASTA.

2.5 years ago by cpad0112 21k

Hi, thanks for getting back to me.

The headers look something like this:

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA

A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

I want to be able to pull out the PB1/PA headers separately from the PB1-F2/PA-X headers. If I try:

grep -w "PB1" fastafile

it returns both the PB1 and the PB1-F2 headers, and the same happens with "PA". Any ideas?

2.5 years ago by alexander.byrne ▴ 10

score 1 · Answer 1 · 2021-11-09

$ cat test.txt          
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

$ grep -w "PA" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X

$ grep -w "PA$" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA

$ grep -w "PB1" test.txt 
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2

$ grep -w "PB1$" test.txt
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
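
Why -w on its own is not enough: grep counts only letters, digits and underscores as word characters, so the hyphen in PA-X is a word boundary and "PA" still matches as a whole word. Anchoring the pattern to the end of the line is what excludes PA-X. A minimal sketch of the same idea that also anchors on the | separator, so -w is not needed at all. It recreates the four example headers in a scratch file (the name headers.txt is only for illustration) and assumes the protein name is always the last field of the header:

```shell
# Recreate the four example headers in a scratch file (hypothetical name).
tmp=$(mktemp -d)
cat > "$tmp/headers.txt" <<'EOF'
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PB1-F2
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA
A/bird/England/00000001/2021_|HxNx|_dd/mm/yyyy|PA-X
EOF

# -w alone: the hyphen is a non-word character, so PA-X still counts as a hit.
w_count=$(grep -cw 'PA' "$tmp/headers.txt")

# In a basic regex an unescaped | is a literal pipe; $ anchors to end of line.
pa_hits=$(grep '|PA$' "$tmp/headers.txt")
pb1_hits=$(grep '|PB1$' "$tmp/headers.txt")
pax_hits=$(grep '|PA-X$' "$tmp/headers.txt")

echo "$pa_hits"
echo "$pb1_hits"
echo "$pax_hits"
```

The same patterns work on real FASTA headers that start with ">", since only the end of the line is anchored.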

For this job, I would suggest using seqkit, as below:

$ seqkit -w 0 grep -irp "\|PA$" test.fa

If you want a separate FASTA file for each entry (PA, PA-X, PB1, PB1-F2), see the following example:

$ tree .               
.
└── test.fa

0 directories, 1 file

$ seqkit -w 0 split -i --id-regexp ".*\|(.*)$" -2 test.fa -O out --quiet
[INFO] create FASTA index for test.fa

$ tree .
.
├── out
│   ├── test.id_PA.fasta
│   ├── test.id_PA-X.fasta
│   ├── test.id_PB1-F2.fasta
│   └── test.id_PB1.fasta
├── test.fa
└── test.fa.seqkit.fai

1 directory, 6 files

$ cd out 

$ rename -n 's/test\.id_//g' *.fasta                                    
'test.id_PA.fasta' would be renamed to 'PA.fasta'
'test.id_PA-X.fasta' would be renamed to 'PA-X.fasta'
'test.id_PB1-F2.fasta' would be renamed to 'PB1-F2.fasta'
'test.id_PB1.fasta' would be renamed to 'PB1.fasta'
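
Note that -n only prints what rename would do; drop it to actually rename the files. The rename command also differs between distributions (the Perl version used above and the util-linux version take different arguments), so a plain shell loop may be more portable. A minimal sketch, assuming the split files are named as in the listing above; strip_prefix is a made-up helper name:

```shell
# strip_prefix DIR PREFIX  (hypothetical helper)
# Rename every DIR/PREFIX*.fasta so that PREFIX is dropped from the file name.
strip_prefix() {
    dir=$1
    prefix=$2
    for f in "$dir/$prefix"*.fasta; do
        [ -e "$f" ] || continue            # glob matched nothing: skip
        base=${f##*/}                      # file name without the directory
        mv -- "$f" "$dir/${base#"$prefix"}"
    done
}
```

Called as strip_prefix out test.id_ from the directory that contains out, this should turn out/test.id_PA.fasta into out/PA.fasta, and likewise for the other files.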

You can also do the extraction with awk, but you need to flatten your FASTA file first (one sequence line per record) to keep the awk code simple.
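
A minimal sketch of that awk route, assuming the headers end in |PROTEIN as in the examples above; the function names flatten_fasta and pick_protein are made up for illustration. The first step joins wrapped sequence lines so that every record is a header line followed by a single sequence line. The second keeps the records whose last |-separated header field equals the requested protein; because this is an exact string comparison, PA does not match PA-X.

```shell
# flatten_fasta FILE  (hypothetical helper)
# Print each record as a header line followed by one unwrapped sequence line.
flatten_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}

# pick_protein NAME  (hypothetical helper)
# Read FASTA on stdin and keep the records whose last |-separated
# header field is exactly NAME.
pick_protein() {
    awk -F'|' -v want="$1" '/^>/ { keep = ($NF == want) } keep'
}
```

For example, flatten_fasta test.fa | pick_protein PA > PA.fasta would write only the PA records, sequences included.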