Question: Remove text flanking .. on fasta-headers
12 months ago, genomes_and_MGEs wrote:

Hi guys,

I have a multi-fasta like this

>Citrobacter_freundii_D8_6645..17576
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012_3830..23574
atggacgatagagaaagaggcttagcatttttatttgcaatt

And I would like to eliminate the numbers flanking .., to have an output like this

>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt

Since the numbers vary in length, I guess a command that just removes a fixed number of characters from the end of each fasta header won't be enough. Thanks!

12 months ago, ATpoint wrote:

If the example is representative, then you basically intend to keep the first three elements that are separated by _. If so, do:

awk ' $1 ~ /^>/ { split($0,a,"_"); print a[1]"_"a[2]"_"a[3];next} {print}'

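The command splits every line that starts with > at each _ and prints the first three fields joined by _ again; all other lines are printed unchanged. Obviously that only works if all fasta headers look like the ones you showed, i.e. have exactly three underscore-separated fields before the coordinates.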
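If the number of underscore-separated fields differs between headers, a variant that anchors on the coordinate pattern itself may be more robust. This is a minimal sketch, assuming the coordinates always appear as _<digits>..<digits> at the very end of the header; strip_coords is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: remove a trailing "_start..end" from FASTA headers.
# Reads the files given as arguments, or stdin if none are given.
strip_coords() {
  awk '
    /^>/ { sub(/_[0-9]+\.\.[0-9]+$/, "") }  # header lines only: drop the coordinate suffix
    { print }                               # print every line, modified or not
  ' "$@"
}

# Small demonstration on inline data
printf '%s\n' '>Citrobacter_freundii_D8_6645..17576' 'gtgatcgtcaag' | strip_coords
```

Headers without a trailing coordinate range pass through unchanged, so the function can be run on mixed input.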

12 months ago, cpad0112 wrote:

$ sed '/>/ s/_[0-9]\+\.\..*$//g' test.fa
>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt

Thanks, saved my day!

(reply by genomes_and_MGEs, 12 months ago)

You can accept more than one answer, if they all work. Just so you know.

(reply by genomax, 12 months ago)

12 months ago, Joe wrote:

A bash only solution*, for good measure (because I can't help myself):

$ while IFS= read -r l; do printf '%s\n' "${l%_*}"; done < seqs.fasta

*Assumes the coordinate range is the last underscore-separated field of every header, and that no other line contains an underscore.
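To relax that assumption, the loop can be restricted to header lines and made to strip the last field only when it looks like a start..end range. A minimal sketch using shell built-ins only (it should run under bash or any POSIX sh); strip_coords_sh is a hypothetical name, and the "digit..digit" check is an assumption about what the coordinates look like:

```shell
# Hypothetical helper: built-ins only, reads FASTA from stdin.
strip_coords_sh() {
  # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '>'*_*)                                   # header containing an underscore
        last=${line##*_}                        # text after the final underscore
        case $last in
          *[0-9]..[0-9]*) line=${line%_*} ;;    # drop "_start..end"
        esac
        ;;
    esac
    printf '%s\n' "$line"
  done
}

# Small demonstration on inline data
printf '%s\n' '>Citrobacter_freundii_D8_6645..17576' 'gtgatcgtcaag' | strip_coords_sh
```

For large files the awk or sed answers will be considerably faster, since a shell read loop processes one line at a time.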
