Question

Edit Headers of Fasta file

0

Entering edit mode

2.9 years ago

NewtoBioinfo • 0

Hi, I have a fasta file for a genome assembly with 646 sequences. The first 7 are pseudochromosomes and the rest are unassigned scaffolds. And the headers look something like this:

>GWHABKY00000001    Chromosome 1    Complete=F  Circular=F  OriSeqID=chr1   Len=553355525
>GWHABKY00000002    Chromosome 2    Complete=F  Circular=F  OriSeqID=chr2   Len=740519526
>GWHABKY00000003    Chromosome 3    Complete=F  Circular=F  OriSeqID=chr3   Len=676969686
>GWHABKY00000004    Chromosome 4    Complete=F  Circular=F  OriSeqID=chr4   Len=612967577
>GWHABKY00000005    Chromosome 5    Complete=F  Circular=F  OriSeqID=chr5   Len=625473173
>GWHABKY00000006    Chromosome 6    Complete=F  Circular=F  OriSeqID=chr6   Len=584270320
>GWHABKY00000007    Chromosome 7    Complete=F  Circular=F  OriSeqID=chr7   Len=744096988
>GWHABKY00000008    OriSeqID=scaffold1  Len=1816015
>GWHABKY00000009    OriSeqID=scaffold10 Len=942477
>GWHABKY00000010    OriSeqID=scaffold100    Len=268586
>GWHABKY00000011    OriSeqID=scaffold101    Len=265196
>GWHABKY00000012    OriSeqID=scaffold102    Len=259718
>GWHABKY00000013    OriSeqID=scaffold103    Len=258511
>GWHABKY00000014    OriSeqID=scaffold104    Len=258489
>GWHABKY00000015    OriSeqID=scaffold105    Len=257418
>GWHABKY00000016    OriSeqID=scaffold106    Len=256425
...

I want to edit the headers so that the first 7 just say:

>Chr1E
>Chr2E
...

And for the rest I just want the scaffold ID

>scaffold1
>scaffold10
...

What's the best way to do this using sed/awk?

header fasta • 1.4k views

ADD COMMENT • link updated 2.9 years ago by Ram 43k • written 2.9 years ago by NewtoBioinfo • 0

0

Entering edit mode

This question has been addressed multiple times on the forum. Please use the search bar. Sample search: https://www.biostars.org/post/search/?query=fasta+header

ADD REPLY • link 2.9 years ago by Ram 43k

0

Entering edit mode

Assuming that sequences are in single line:

$ awk -v OFS="\n" -F "OriSeqID=| Len" '/^>/ {getline seq} ; {($2 ~ "chr"? ($2=$2"E"):$2);print ">"$2,seq}' test.fa

or using perl:

$ perl -lne 'if (/>.+OriSeqID\=(\w+)\s/) { print ">\u$1" } else { print }' test.fa

ADD REPLY • link 2.9 years ago by cpad0112 21k

score 2 · Accepted Answer · 2021-06-17

2

Entering edit mode

2.9 years ago

JC 13k

$ perl -lne 'if (/>.+Chromosome (\d)/) { print ">Chr$1E" } elsif (/>.+(scaffold\d+)/) { print ">$1" } else { print }' < in 
>Chr1E
>Chr2E
>Chr3E
>Chr4E
>Chr5E
>Chr6E
>Chr7E
>scaffold1
>scaffold10
>scaffold100
>scaffold101
>scaffold102
>scaffold103
>scaffold104
>scaffold105
>scaffold106

ADD COMMENT • link 2.9 years ago by JC 13k

0

Entering edit mode

Thanks. Can I have the output in a new file?

ADD REPLY • link 2.9 years ago by NewtoBioinfo • 0

1

Entering edit mode

Google "linux redirect output". Please put some effort before asking to be helped.

ADD REPLY • link 2.9 years ago by Ram 43k

0

Entering edit mode

In response to your first helpful response, I looked at all the prior posts about fasta headers before I posted mine. The reason I didn't find a solution through any of them was because my file didn't have a standard format of headers throughout. The first seven were different to the rest so I couldn't workout a common sed/awk type command. In any case looking at the solution posted I would have never been able to come up with that with what little knowledge I have of command line. Sometimes people who are new to bioinformatics don't know what to search for either. With regards to your second helpful response, I wasn't expecting a solution with perl so wasn't sure how to output to a new file when the input was being piped in at the end but I did make the effort simultaneosuly and got a solution. So thanks for your help.

ADD REPLY • link 2.9 years ago by NewtoBioinfo • 0

0

Entering edit mode

I apologize - I missed your special case.

A good strategy is to treat each such category separately and join/merge/concatenate the resulting content later. awk or bioawk would be really useful here.

ADD REPLY • link 2.9 years ago by Ram 43k

0

Entering edit mode

It worked!

$cat in.fa | perl -lne 'if (/>.+Chromosome (\d)/) { print ">Chr$1E" } elsif (/>.+(scaffold\d+)/) { print ">$1" } else { print }' > out.fa

ADD REPLY • link 2.9 years ago by NewtoBioinfo • 0