Question: Remove text flanking .. on fasta-headers
1
18 months ago by
genomes_and_MGEs10 wrote:

Hi guys,

I have a multi-fasta like this

>Citrobacter_freundii_D8_6645..17576
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012_3830..23574
atggacgatagagaaagaggcttagcatttttatttgcaatt

And I would like to eliminate the numbers flanking .., to have an output like this

>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt

Since the numbers are variable, I guess a command that just removes a fixed number of characters from the end of each FASTA header won't be enough. Thanks!

sequence genome • 436 views
modified 18 months ago by Joe18k • written 18 months ago by genomes_and_MGEs10
2
18 months ago by
ATpoint38k
Germany
ATpoint38k wrote:

If the example is representative, then you basically intend to keep the first three elements that are separated by _. If so, do:

awk ' $1 ~ /^>/ { split($0,a,"_"); print a[1]"_"a[2]"_"a[3];next} {print}'

The command splits every line that starts with > at each _ and then prints the first three fields joined by _ again. Obviously, that only works if all FASTA headers look like the ones you showed.
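
If the headers do not all have exactly three underscore-separated fields, one variation is to strip the trailing coordinates with a regex instead of counting fields. This is only a sketch: it assumes every header ends in _start..end (digits only), and strip_coords is a made-up helper name, not part of any tool.

```shell
# strip_coords: remove a trailing "_<digits>..<digits>" from FASTA header lines.
# Assumption: the coordinates always sit at the very end of the header.
# With no file argument, awk reads from standard input.
strip_coords() {
    awk '/^>/ { sub(/_[0-9]+\.\.[0-9]+$/, "") } { print }' "$@"
}

# Example with the two records from the question:
printf '%s\n' \
    '>Citrobacter_freundii_D8_6645..17576' \
    'gtgatcgtcaagaaggttaagaacccgcagaaggcagca' \
    '>Enterobacter_hormaechei_35012_3830..23574' \
    'atggacgatagagaaagaggcttagcatttttatttgcaatt' |
    strip_coords
```

Because the pattern is anchored to the end of the header, names with more or fewer underscores are handled the same way, and sequence lines are never modified.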

modified 18 months ago • written 18 months ago by ATpoint38k
2
18 months ago by
cpad011214k
India
cpad011214k wrote:
$ sed -E '/^>/ s/_[0-9]+\.\..*$//' test.fa
>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt
written 18 months ago by cpad011214k

Thanks, saved my day!

written 18 months ago by genomes_and_MGEs10

You can accept more than one answer, if they all work. Just so you know.

written 18 months ago by genomax89k
0
18 months ago by
Joe18k
United Kingdom
Joe18k wrote:

A bash-only solution*, for good measure (because I can't help myself):

$ while IFS= read -r l; do printf '%s\n' "${l%_*}"; done < seqs.fasta

*Assumes the coordinates are the only thing after the last underscore in each header, and that no sequence line contains an underscore.
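
The loop above cuts at the last underscore on every line, not just on headers. A slightly more defensive pure-shell sketch only touches lines that start with > and passes everything else through unchanged. The function name trim_headers is made up for illustration, and it still assumes the coordinates are whatever follows the last underscore.

```shell
# trim_headers: pure-shell variant that only edits lines starting with ">".
# Assumption: the part to drop is everything from the last underscore onwards.
trim_headers() {
    # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*_*) printf '%s\n' "${line%_*}" ;;  # header: cut at last "_"
            *)      printf '%s\n' "$line" ;;        # anything else: leave as is
        esac
    done
}

# Example with one record from the question:
trim_headers <<'EOF'
>Citrobacter_freundii_D8_6645..17576
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
EOF
```

On a real file this would be called as trim_headers < seqs.fasta. For large genomes the awk or sed versions will be much faster, since a shell read loop processes one line at a time.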

ADD COMMENTlink modified 18 months ago • written 18 months ago by Joe18k