Question: Fasta header unique sequence
Mehmet (Japan) wrote 3.4 years ago:

Hi,

I have a FASTA file in which some headers are identical, as shown below. The records have different sequences but the same header. How can I merge them, or what else should I do? I want to run orthoMCL, but it requires unique headers.

>c12358_g1_i9

>c12358_g1_i9


It seems that your upstream tool emitted different fragments of the same sequence. Merging them with 'N' padding in between may work, but the quicker and better method is to make the headers unique.

written 3.4 years ago by biocyberman
biocyberman (Denmark) wrote 3.4 years ago:

I don't know about orthoMCL, but if you just want to change the headers to make them unique, do the following (on Linux; on Windows, install GnuWin32 from http://getgnuwin32.sourceforge.net/ to get the gawk command):

gawk '{if ($0 ~/^>/) {h[$1]++; $1=$1 "_" h[$1]} print}' myfasta.fa >updatedIDs_myfasta.fa

# myfasta.fa is your fasta file.
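
For readers without gawk, the same idea can be sketched in Python. This is only an illustration of the command above, not a tested replacement for it: the function name `make_headers_unique` and the short sequences in the example are made up, and the header ID is assumed to be the first whitespace-separated word.

```python
from collections import Counter


def make_headers_unique(lines):
    """Append a running count to the first word of every FASTA header.

    Hypothetical helper: like the gawk one-liner, it numbers every
    header (also the first occurrence) and leaves sequence lines alone.
    """
    seen = Counter()
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the ID (first word) from the optional description.
            parts = line.split(None, 1)
            seq_id = parts[0]
            seen[seq_id] += 1
            line = f"{seq_id}_{seen[seq_id]}"
            if len(parts) > 1:
                line += " " + parts[1]
        out.append(line)
    return out


# Made-up sequences, only to show the renaming.
example = [">c12358_g1_i9", "ATGC", ">c12358_g1_i9", "GGCC"]
print("\n".join(make_headers_unique(example)))
```

To use it on a real file, read the lines from `myfasta.fa` and write the returned list to a new file, one entry per line.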

Hi, I used your command, but it didn't change the duplicated headers. Do you have any other solution?

written 3.4 years ago by Mehmet

That's weird, my gawk-fu can't be failing :-) Could you post an excerpt of the fasta file with sequences trimmed to about 10 bases?

written 3.4 years ago by biocyberman

>c10047_g1_i1|m.4145 c10047_g1_i1|g.4145  ORF c10047_g1_i1|g.4145 c10047_g1_i1|m.4145 type:complete len:387 (-) c10047_g1_i1:511-1671(-)

>c10047_g2_i1|m.4146 c10047_g2_i1|g.4146  ORF c10047_g2_i1|g.4146 c10047_g2_i1|m.4146 type:5prime_partial len:589 (+) c10047_g2_i1:2-1768(+)

These are the headers in my FASTA file. I want to merge or remove the identical headers before my next step. The records with identical headers have different sequences.

written 3.4 years ago by Mehmet

Oh, this is different from what you gave in the question. In your FASTA file, the tools that generated it already form unique headers, for example c10047_g1_i1|m.4145 and c10047_g2_i1|m.4146, but orthoMCL probably only considers the part of the header before the pipe sign '|'. You can therefore replace the pipe like this:

gawk 'BEGIN{FS=" "}{if ($0 ~/^>/){gsub("\\|", "pp", $1)} print}' myfasta.fa >updatedIDs_myfasta.fa

Change "pp" to anything you like, but keep it distinguishable.
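
To check whether this is really what happens in a given file, the reasoning can be sketched in Python. The sketch assumes, as guessed above and not confirmed for orthoMCL, that only the text before '|' counts as the ID. The function names are made up for illustration, and the second header in the example is hypothetical, added only to produce a collision.

```python
from collections import Counter


def id_before(header, delimiter="|"):
    """Return the first word of a header, cut at `delimiter`, without '>'."""
    words = header[1:].split(None, 1)
    first_word = words[0] if words else ""
    return first_word.split(delimiter, 1)[0]


def colliding_ids(headers, delimiter="|"):
    """List IDs that stop being unique once they are cut at `delimiter`."""
    counts = Counter(id_before(h, delimiter) for h in headers)
    return sorted(i for i, n in counts.items() if n > 1)


def replace_pipe(header, replacement="pp"):
    """Replace '|' in the first word only; the description is left as-is."""
    parts = header.split(None, 1)
    parts[0] = parts[0].replace("|", replacement)
    return " ".join(parts)


headers = [
    ">c10047_g1_i1|m.4145 c10047_g1_i1|g.4145",
    ">c10047_g1_i1|m.9999 c10047_g1_i1|g.9999",  # hypothetical second ORF
    ">c10047_g2_i1|m.4146 c10047_g2_i1|g.4146",
]
print(colliding_ids(headers))
print([replace_pipe(h) for h in headers])
```

If `colliding_ids` returns an empty list for your file, the pipe is probably not the cause of the duplicate-header complaint.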

written 3.4 years ago by biocyberman

Thank you so much, you saved my day :) 

written 3.4 years ago by Mehmet