Question

Could you please give me some advice on how to change FASTA headers?

0

2.5 years ago

Riku ▴ 80

Hi, all.

I would like to edit the headers of a FASTA file.

The headers in my file carry a variable set of fields, as follows (fields are separated by spaces):

>3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed uniprot=G5EFG7 insdc=CAA21777.2 product="POT1PC domain-containing protein"
>2RSSE.1a wormpep=CE32785 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A4F337 insdc=CCD61138.1 product="Rho-GAP domain-containing protein"
>2L52.1a wormpep=CE32090 gene=WBGene00007063 status=Confirmed uniprot=A4F336 insdc=CCD61130.1

I would like to edit them as follows:

>3R5.1a gene=WBGene00007065 locus=pot-3
>2RSSE.1a gene=WBGene00007064 locus=rga-9
>2L52.1a gene=WBGene00007063 -

The file is very long, so I need a command-line solution, but I don't know how to edit headers whose fields vary from one record to the next. Could you please help me?

I really appreciate your help in advance! Thank you.

bash Linux BLAST fasta • 1.5k views

ADD COMMENT • link updated 2.5 years ago by cpad0112 21k • written 2.5 years ago by Riku ▴ 80

1

One lazy way:

$ cut -d " " -f1,3,4 your.fa > new.fa

If the locus field is not present in every header, then whatever field comes next will be picked up instead, e.g. status in line 3 of the example above.
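To make that caveat concrete, here is a small sketch that mimics the cut call in Python on two of the example headers (shortened for readability). The helper name cut_fields is only for illustration and is not part of any tool.

```python
# Two headers from the question, shortened; the second has no locus field.
headers = [
    ">3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed",
    ">2L52.1a wormpep=CE32090 gene=WBGene00007063 status=Confirmed uniprot=A4F336",
]

def cut_fields(line, fields=(1, 3, 4), delim=" "):
    """Mimic `cut -d " " -f1,3,4`: keep the listed 1-based fields."""
    pieces = line.split(delim)
    return delim.join(pieces[i - 1] for i in fields if i <= len(pieces))

for header in headers:
    print(cut_fields(header))
# >3R5.1a gene=WBGene00007065 locus=pot-3
# >2L52.1a gene=WBGene00007063 status=Confirmed
```

The second header keeps status=Confirmed in the fourth position, which is exactly the case to watch for.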

ADD REPLY • link 2.5 years ago by GenoMax 141k

1

2.5 years ago

Istvan Albert 100k

This is one of those problems that is hard to solve reliably because the delimiters are inconsistent. Spaces separate the fields but also appear inside field values, where they are protected by quotes. You would need to write a parser regardless, and depending on the file structure the implementation may be quite difficult.

As long as the desired key=value pairs are listed before any fields that contain quoted whitespace, the Python code could look like this:

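import sys

# Keys of the key=value fields to keep in each header.
targets = {'gene', 'locus'}

# Read the file named on the command line, defaulting to mydata.fa.
fname = sys.argv[1] if len(sys.argv) > 1 else 'mydata.fa'

with open(fname) as stream:
    for line in stream:
        line = line.strip()
        if line.startswith(">"):
            pieces = line.split(" ")
            # Keep the sequence ID plus the targeted key=value fields.
            kept = [p for p in pieces[1:] if p.split("=")[0] in targets]
            # Mark a missing locus with a dash.
            if not any(p.startswith("locus=") for p in kept):
                kept.append("-")
            line = " ".join([pieces[0]] + kept)
        print(line)

run it as:

python fixme.py mydata.fa > new.fa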
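If a quoted value could ever contain text that looks like one of the wanted fields, a quote-aware split is safer than splitting on single spaces. The following is only a sketch, assuming the values are protected by ordinary double quotes as in the question; the function name rewrite_header and the example header with gene=xyz-1 inside the product text are made up for illustration.

```python
import shlex

def rewrite_header(line, targets=("gene", "locus"), missing="-"):
    """Reduce a FASTA header to its ID plus the wanted key=value fields."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    # shlex.split keeps a double-quoted value together as one token
    # (and drops the quotes), so spaces inside product="..." are safe.
    tokens = shlex.split(line)
    fields = dict(tok.split("=", 1) for tok in tokens[1:] if "=" in tok)
    kept = [f"{key}={fields[key]}" if key in fields else missing for key in targets]
    return " ".join([tokens[0]] + kept)

# Hypothetical header: the quoted product text contains "gene=xyz-1",
# which a plain split on spaces would mistake for a real gene field.
example = '>3R5.1a product="paralog of gene=xyz-1" gene=WBGene00007065 locus=pot-3'
print(rewrite_header(example))
# >3R5.1a gene=WBGene00007065 locus=pot-3
print(rewrite_header(">2L52.1a wormpep=CE32090 gene=WBGene00007063 status=Confirmed"))
# >2L52.1a gene=WBGene00007063 -
```

One thing to be aware of: shlex.split raises ValueError on unbalanced quotes, so a stray apostrophe in an unquoted value would need separate handling.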

ADD COMMENT • link 2.5 years ago by Istvan Albert 100k

0

I don't know Python, but it is good to know there is a way to do this. I will try it as well.

I appreciate your quick answer!

ADD REPLY • link 2.5 years ago by Riku ▴ 80

1

2.5 years ago

cpad0112 21k

with seqkit:

$ seqkit replace -p '(.*) worm.* (gene=\w+) (locus=\w+\W\w+)*.*' -r '${1} ${2} ${3}' test.fa

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 
gtc

with sed:

$ sed -r '/^>/ s/(>[A-Za-z0-9.]+).*worm.*\s(gene=\w+)\s(locus=\w+\W[0-9]|-)*.*/\1 \2 \3-/g' test.fa | sed -r 's/([0-9])-$/\1/g'

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 -
gtc

with awk (assuming that locus, when present, always comes right after wormpep and gene); either of these equivalent commands works:

$ awk '/^>/ {print $1,$3, ($4 ~ /locus/) ? $4: "-"};!/^>/' test.fa
$ awk '/^>/ {print $1,$3, ($4 ~ /locus/) ? $4: "-";next}1' test.fa

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 -
gtc

ADD COMMENT • link 2.5 years ago by cpad0112 21k

0

I moved it to an answer, probably the simplest approach that also generalizes well.

ADD REPLY • link 2.5 years ago by Istvan Albert 100k

score 1 · Accepted Answer · 2021-10-21

1

2.5 years ago

jared.andrews07 ★ 16k

awk -F 'status' '{print $1}' file.fa > new_file.fa

ADD COMMENT • link 2.5 years ago by jared.andrews07 ★ 16k

2

This will keep the wormpep field, which the OP's example does not want. It can be removed with sed.

awk -F 'status' '{print $1}' file.fa | sed 's/wormpep=CE.....//' > new.fa
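Note that the pattern CE..... matches exactly five characters after CE, and removing the field this way leaves two adjacent spaces behind. As a hedged alternative, the same trimming can be sketched in Python with patterns that do not depend on the ID length; the function name trim_header is just for illustration.

```python
import re

def trim_header(line):
    """Drop the wormpep field and everything from status= onward."""
    if not line.startswith(">"):
        return line  # leave sequence lines untouched
    # Cut the header at " status=", together with the space before it.
    line = re.sub(r"\s+status=.*$", "", line)
    # Remove the wormpep field whatever the length of its ID.
    return re.sub(r"\s+wormpep=\S+", "", line)

header = ">3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed uniprot=G5EFG7"
print(trim_header(header))
# >3R5.1a gene=WBGene00007065 locus=pot-3
```

Like the awk and sed pipeline, this assumes status follows the wanted fields and does not add a dash when locus is missing.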

ADD REPLY • link 2.5 years ago by GenoMax 141k

0

Oh, good call, didn't even notice that.

ADD REPLY • link 2.5 years ago by jared.andrews07 ★ 16k

0

I see. That helped a lot. I was able to solve this problem with the advice from GenoMax and Jared.

$ awk -F 'status' '{print $1}' file.fa | cut -d " " -f 1,3,4 > new.fa

But the sed method is simpler, so I will use that.

I really appreciate both of your quick answers. Thank you very much!

ADD REPLY • link 2.5 years ago by Riku ▴ 80