Edit Headers of Fasta file
1
0
Entering edit mode
2.9 years ago

Hi, I have a fasta file for a genome assembly with 646 sequences. The first 7 are pseudochromosomes and the rest are unassigned scaffolds. And the headers look something like this:

>GWHABKY00000001    Chromosome 1    Complete=F  Circular=F  OriSeqID=chr1   Len=553355525
>GWHABKY00000002    Chromosome 2    Complete=F  Circular=F  OriSeqID=chr2   Len=740519526
>GWHABKY00000003    Chromosome 3    Complete=F  Circular=F  OriSeqID=chr3   Len=676969686
>GWHABKY00000004    Chromosome 4    Complete=F  Circular=F  OriSeqID=chr4   Len=612967577
>GWHABKY00000005    Chromosome 5    Complete=F  Circular=F  OriSeqID=chr5   Len=625473173
>GWHABKY00000006    Chromosome 6    Complete=F  Circular=F  OriSeqID=chr6   Len=584270320
>GWHABKY00000007    Chromosome 7    Complete=F  Circular=F  OriSeqID=chr7   Len=744096988
>GWHABKY00000008    OriSeqID=scaffold1  Len=1816015
>GWHABKY00000009    OriSeqID=scaffold10 Len=942477
>GWHABKY00000010    OriSeqID=scaffold100    Len=268586
>GWHABKY00000011    OriSeqID=scaffold101    Len=265196
>GWHABKY00000012    OriSeqID=scaffold102    Len=259718
>GWHABKY00000013    OriSeqID=scaffold103    Len=258511
>GWHABKY00000014    OriSeqID=scaffold104    Len=258489
>GWHABKY00000015    OriSeqID=scaffold105    Len=257418
>GWHABKY00000016    OriSeqID=scaffold106    Len=256425
...

I want to edit the headers so that the first 7 just say:

>Chr1E
>Chr2E
...

And for the rest I just want the scaffold ID

>scaffold1
>scaffold10
...

What's the best way to do this using sed/awk?

header fasta • 1.4k views
ADD COMMENT
0
Entering edit mode

This question has been addressed multiple times on the forum. Please use the search bar. Sample search: https://www.biostars.org/post/search/?query=fasta+header

ADD REPLY
0
Entering edit mode

Assuming that sequences are in single line:

$ awk -v OFS="\n" -F "OriSeqID=| Len" '/^>/ {getline seq} ; {($2 ~ "chr"? ($2=$2"E"):$2);print ">"$2,seq}' test.fa

or using perl:

$ perl -lne 'if (/>.+OriSeqID\=(\w+)\s/) { print ">\u$1" } else { print }' test.fa
ADD REPLY
2
Entering edit mode
2.9 years ago
JC 13k
$ perl -lne 'if (/>.+Chromosome (\d)/) { print ">Chr$1E" } elsif (/>.+(scaffold\d+)/) { print ">$1" } else { print }' < in 
>Chr1E
>Chr2E
>Chr3E
>Chr4E
>Chr5E
>Chr6E
>Chr7E
>scaffold1
>scaffold10
>scaffold100
>scaffold101
>scaffold102
>scaffold103
>scaffold104
>scaffold105
>scaffold106
ADD COMMENT
0
Entering edit mode

Thanks. Can I have the output in a new file?

ADD REPLY
1
Entering edit mode

Google "linux redirect output". Please put some effort before asking to be helped.

ADD REPLY
0
Entering edit mode

In response to your first helpful response, I looked at all the prior posts about fasta headers before I posted mine. The reason I didn't find a solution through any of them was because my file didn't have a standard format of headers throughout. The first seven were different to the rest so I couldn't workout a common sed/awk type command. In any case looking at the solution posted I would have never been able to come up with that with what little knowledge I have of command line. Sometimes people who are new to bioinformatics don't know what to search for either. With regards to your second helpful response, I wasn't expecting a solution with perl so wasn't sure how to output to a new file when the input was being piped in at the end but I did make the effort simultaneosuly and got a solution. So thanks for your help.

ADD REPLY
0
Entering edit mode

I apologize - I missed your special case.

A good strategy is to treat each such category separately and join/merge/concatenate the resulting content later. awk or bioawk would be really useful here.

ADD REPLY
0
Entering edit mode

It worked!

$cat in.fa | perl -lne 'if (/>.+Chromosome (\d)/) { print ">Chr$1E" } elsif (/>.+(scaffold\d+)/) { print ">$1" } else { print }' > out.fa
ADD REPLY

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6