Question: need code for sorting fasta header
13 months ago by
divyaranib.100 wrote:

Hello All,

I would like to sort FASTA header lines (annotations). Below is an example of my data, which is stored in a .txt file:

>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]

>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]

I would like to get the data as below: just the accession number and protein name, preferably in table format, with everything after the protein name removed.

example:

>AHF21055.1     ribosomal protein S4 

>AAM96597.1     ATP synthase F0 subunit 6

>AAM96598.1     ATP synthase F0 subunit 8 

>AAM96599.1     ATP synthase F0 subunit 9

Thank you in advance!!

sequencing sequence forum • 292 views
ADD COMMENTlink modified 13 months ago by mike-zx150 • written 13 months ago by divyaranib.100

Assuming (mitochondrion) is always present, this is what I can think of off the top of my head: cut -f1 -d'(' header.txt | sort. Each line will end with a trailing space, which can be removed with sed 's/ *$//'.
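
As a minimal sketch, the steps in this comment can be chained into one pipeline. The function name clean_headers is hypothetical, and the sketch assumes every header contains a "(" right after the protein name, as in the sample data. Blank lines are dropped first so they do not end up in the sorted output.

```shell
# Hypothetical helper: reads FASTA header lines on stdin.
# Assumes the protein name is followed by " (" on every line.
clean_headers() {
    grep -v '^[[:space:]]*$' |   # drop blank lines
    cut -d'(' -f1 |              # keep the text before the first "("
    sed 's/ *$//' |              # strip the trailing space left by cut
    LC_ALL=C sort                # byte-wise sort, independent of locale
}

# Small demonstration on two of the headers from the question.
printf '%s\n' \
    '>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]' \
    '' \
    '>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]' |
    clean_headers
```

With a real file, the same function would be called as clean_headers < header.txt.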

ADD REPLYlink modified 13 months ago • written 13 months ago by Eric Lim1.6k

thank you Eric Lim for your reply!

ADD REPLYlink written 13 months ago by divyaranib.100

What have you tried?

PS: Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

ADD REPLYlink modified 13 months ago • written 13 months ago by RamRS25k

Sure, Ram, I will do that from next time. Thanks a lot! I am kind of new to this forum.

ADD REPLYlink written 13 months ago by divyaranib.100

Do all of your entries follow that format? Will there be some where the string (mitochondrion) is not there?

ADD REPLYlink written 13 months ago by Joe15k
13 months ago by
mike-zx150
mike-zx150 wrote:

This gives me the exact output you want as long as (mitochondrion) is present in all lines:

sed '/^[[:space:]]*$/d' old_fasta_headers | cut -d'(' -f1 | sed 's/\(\.[[:digit:]]*\) /\1\t/ ; s/$/\n/' \
> new_fasta_headers

Hope this helps.
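
For headers where (mitochondrion) might be missing, a hedged variant is to cut at the first "(" or "[" instead. This is only a sketch: the function name headers_to_table is hypothetical, it assumes the accession is the first space-delimited field, and it would truncate any protein name that itself contains a bracket or parenthesis. It avoids \t in the sed replacement so it does not rely on GNU sed.

```shell
# Hypothetical helper: turns header lines on stdin into "accession<TAB>protein".
# Assumes annotations start at the first "(" or "[" on each line.
headers_to_table() {
    tab=$(printf '\t')               # literal tab, portable across sed versions
    sed -e '/^[[:space:]]*$/d' \
        -e 's/ *[[(].*$//' \
        -e "s/ /${tab}/"
    # 1st expression: delete blank lines
    # 2nd expression: remove everything from the first "(" or "[" onward
    # 3rd expression: replace the first space (after the accession) with a tab
}

# Demonstration: the second header has no "(mitochondrion)" part.
printf '%s\n' \
    '>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]' \
    '>AAM96597.1 ATP synthase F0 subunit 6 [Chaetosphaeridium globosum]' |
    headers_to_table
```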

ADD COMMENTlink modified 13 months ago • written 13 months ago by mike-zx150

thanks a lot mike!! it solved my problem!

ADD REPLYlink written 13 months ago by divyaranib.100
13 months ago by
RamRS25k
Houston, TX
RamRS25k wrote:

From my experience, FASTA headers consist of two parts - the ID and the description. You can use a tool like bioawk to extract just the identifier and then sort the output, or you can use any combination of command line utilities, such as grep -o or cut or sed, much like Eric Lim's comment.
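
A rough sketch of the plain command-line route, assuming the ID is everything between ">" and the first space on each header line. The function name fasta_ids is hypothetical; it works on a full FASTA file as well as on a file that contains only headers.

```shell
# Hypothetical helper: prints the sorted IDs of all FASTA records on stdin.
# Assumes the ID ends at the first space of the header line.
fasta_ids() {
    grep '^>' |          # keep header lines only, skip sequence lines
    cut -d' ' -f1 |      # the ID is the first space-delimited field
    sed 's/^>//' |       # drop the leading ">"
    LC_ALL=C sort        # byte-wise sort, independent of locale
}

# Demonstration on a tiny made-up FASTA snippet (sequences are placeholders).
printf '%s\n' \
    '>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]' \
    'MKV' \
    '>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]' \
    'MSL' |
    fasta_ids
```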

ADD COMMENTlink written 13 months ago by RamRS25k