need code for sorting fasta header
5.5 years ago

Hello All,

I would like to sort the FASTA header lines (annotation). Below is an example of my data, which is stored in a .txt file:

>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]

I would like to get the data as shown below: just the accession number and protein name, preferably in table format, with everything after the protein name removed.

example:

>AHF21055.1     ribosomal protein S4
>AAM96597.1     ATP synthase F0 subunit 6
>AAM96598.1     ATP synthase F0 subunit 8
>AAM96599.1     ATP synthase F0 subunit 9

Thank you in advance!

fasta sequence • 1.1k views

Assuming the (mitochondrion) is always there, this is what I can think of off the top of my head: cut -f1 -d'(' header.txt | sort. That leaves a trailing space at the end of each line, which can be removed with sed 's/ *$//'.
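The two commands from this comment can be chained into one pipeline. The following is a minimal sketch, assuming every header contains an opening parenthesis and that the headers live in header.txt (the file name used in the comment). The sample input is copied from the question, the output file name sorted_headers.txt is made up for illustration, and LC_ALL=C is added only to make the sort order independent of the locale.

```shell
# Sample input copied from the question; header.txt is the name used in the comment.
cat > header.txt <<'EOF'
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96598.1 ATP synthase F0 subunit 8 (mitochondrion) [Chaetosphaeridium globosum]
>AAM96599.1 ATP synthase F0 subunit 9 (mitochondrion) [Chaetosphaeridium globosum]
EOF

# 1. cut:  keep everything before the first '('
# 2. sed:  strip the trailing space(s) that cut leaves behind
# 3. sort: order the lines byte-wise (LC_ALL=C), i.e. by accession
# sorted_headers.txt is a hypothetical output name.
cut -f1 -d'(' header.txt | sed 's/ *$//' | LC_ALL=C sort > sorted_headers.txt

cat sorted_headers.txt
```

Note that a header without any '(' passes through cut unchanged, so the organism in square brackets would still be there.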


thank you Eric Lim for your reply!


What have you tried?

PS: Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Sure, Ram, I will do that from next time. Thanks a lot! I am kind of new to this forum.


Do all of your entries follow that format? Will there be some where the string (mitochondrion) is not there?

5.5 years ago
n,n ▴ 360

This gives me the exact output you want as long as (mitochondrion) is present in all lines:

cat old_fasta_headers | sed '/^[[:space:]]*$/d' | cut -d\( -f1 | sed 's/\(\.[[:digit:]]*\) /\1\t/g ; s/$/\n/g' \
> new_fasta_headers

Hope this helps.
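If some headers lack the (mitochondrion) field, cutting at the first parenthesis no longer works. The sketch below is one possible alternative, not part of the answer above: it uses awk to drop a trailing [Organism] field and, if present, one trailing parenthesised field, then prints the accession and protein name separated by a tab. The file names follow the answer; the third input line is a made-up header added only to show the case without parentheses. It assumes the protein name itself does not end in a parenthesised term, which would also be stripped.

```shell
# Two headers from the question plus one hypothetical header without "(...)".
cat > old_fasta_headers <<'EOF'
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
>XYZ00001.1 hypothetical protein [Example species]
EOF

awk '/^>/ {
    sub(/ *\[[^]]*\] *$/, "")      # drop a trailing [Organism] field
    sub(/ *\([^()]*\) *$/, "")     # drop one trailing (location) field, if any
    id = $1                        # accession, including the leading ">"
    desc = $0
    if (!sub(/^[^ ]+ +/, "", desc))
        desc = ""                  # header had no description at all
    print id "\t" desc             # tab-separated: accession, protein name
}' old_fasta_headers > new_fasta_headers

cat new_fasta_headers
```

Lines that do not start with ">" are skipped, so the same command can be pointed at a full FASTA file rather than a file of headers only.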


Thanks a lot, Mike! It solved my problem!

5.5 years ago
Ram 43k

From my experience, FASTA headers consist of two parts - the ID and the description. You can use a tool like bioawk to extract just the identifier and then sort the output, or you can use any combination of command line utilities, such as grep -o or cut or sed, much like Eric Lim's comment.
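As a rough illustration of the grep -o route, the sketch below pulls out only the identifier (everything from ">" up to the first whitespace) from each header and sorts the result. The file names proteins.fasta and sorted_ids.txt are hypothetical, and the sequence lines are placeholders rather than real data; a bioawk-based version would be an alternative where that tool is installed.

```shell
# Hypothetical FASTA file: headers from the question, placeholder sequences.
cat > proteins.fasta <<'EOF'
>AHF21055.1 ribosomal protein S4 (mitochondrion) [Helianthus annuus]
MPLACEHOLDER
>AAM96597.1 ATP synthase F0 subunit 6 (mitochondrion) [Chaetosphaeridium globosum]
MPLACEHOLDER
EOF

# grep -o prints only the matched part: ">" followed by non-whitespace,
# which is the ID portion of the header. Sequence lines never match '^>'.
grep -o '^>[^[:space:]]*' proteins.fasta | LC_ALL=C sort > sorted_ids.txt

cat sorted_ids.txt
```

The description is everything after the first whitespace, so swapping grep -o for cut -d' ' -f2- on the header lines would return that part instead.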
