Question

Modifying Fasta file header

0

7.1 years ago

Blaise • 0

Dear all, I would like to change my FASTA file header from:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence

to

> Streptococcus pyogenes 71-724 plasmid pDN571

Please could somebody help?

Many thanks

sequence • 5.5k views

ADD COMMENT • link 7.1 years ago by Blaise • 0

1

1. This isn't a 'forum' question.
2. You haven't told us how, in what manner, or to what you would like it modified.
3. This is probably the single most asked question on the forum, so you can use the search function to find plenty of existing solutions.

ADD REPLY • link 7.1 years ago by Joe 21k

0

Do you want a general-purpose solution for multiple files, or do you just want this exact FASTA header modified?

ADD REPLY • link 7.1 years ago by Joe 21k

0

@genomax and @sej, to run BLAT I need to change:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence

to

> pDN571

Thank you

ADD REPLY • link updated 7.1 years ago by GenoMax 141k • written 7.1 years ago by Blaise • 0

0

The question is whether ALL of your FASTA headers follow that exact format (where the spaces are, and so on). That is why we need more than one record.

BTW: This request already does not match what you originally asked for.

Try this in the meantime: awk -F " " '{ if ($0 ~ /^>/) { print ">"$6;} else { print $0}}' input.fa | sed -e 's/,//' > output.fa
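
If the plasmid name does not always land in field 6, a variant that keys on the word "plasmid" rather than on a field number may be more forgiving. This is only a sketch: the function name plasmid_name_headers is made up for illustration, and it assumes that every header of interest contains the word "plasmid" immediately followed by the plasmid name. Headers without that word are passed through unchanged.

```shell
# Hypothetical helper (name invented for this sketch).
# Assumption: the plasmid name is the word right after "plasmid".
plasmid_name_headers() {
  awk '/^>/ {
    # Scan the header fields for the literal word "plasmid".
    for (i = 1; i < NF; i++) {
      if ($i == "plasmid") {
        name = $(i + 1)
        sub(/,$/, "", name)   # drop a trailing comma, if present
        print ">" name
        next
      }
    }
  }
  # Sequence lines, and headers without "plasmid", are printed as is.
  { print }' "$@"
}
```

It could then be run as plasmid_name_headers input.fa > output.fa.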

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

@genomax2, sorry, I was not very clear in my request. It is in fact a multi-FASTA file that I want to use with BLAT.

ADD REPLY • link 7.1 years ago by Blaise • 0

0

No need to be sorry, but we need additional information to make sure that the solutions provided so far will work.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

I have to transform:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence
>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence
>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence

and so on....

to:

> pDN571
> pCCK3259
> pCCK381

and so on....

ADD REPLY • link updated 7.1 years ago by GenoMax 141k • written 7.1 years ago by Blaise • 0

0

Can you try the awk solution I posted above? It should work. The assumption here is that the sequence lines themselves are left as they are.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

A simple regex for a lowercase p followed by one or more uppercase letters or digits should work, I think.
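
As a rough illustration of that idea, the sketch below uses sed with an extended regex to keep only a token of the form p[A-Z0-9]+ on header lines. The function name plasmid_id_headers is invented here, and the pattern assumes that the plasmid name is the only word in the header that starts with a lowercase p followed by uppercase letters or digits. LC_ALL=C is set so that [A-Z] is not affected by locale collation.

```shell
# Hypothetical helper (name invented for this sketch).
# Assumption: exactly one word per header looks like p[A-Z0-9]+.
plasmid_id_headers() {
  # Only header lines (starting with ">") are rewritten;
  # sequence lines and non-matching headers pass through untouched.
  LC_ALL=C sed -E '/^>/ s/^>.*[[:space:]](p[A-Z0-9]+).*$/>\1/' "$@"
}
```

Words such as "pyogenes" or "plasmid" are not captured, because the p must be followed by at least one uppercase letter or digit.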

ADD REPLY • link 7.1 years ago by Joe 21k

score 1 · Answer 1 · 2017-03-15

1

7.1 years ago

Sej Modha 5.3k

A simple sed solution:

sed -e 's/gi.*| //g' input.fa > output.fa

ADD COMMENT • link 7.1 years ago by Sej Modha 5.3k

1

This keeps the ", complete sequence" part.

If that part is common to all sequences, it can also be removed with: sed -e 's/gi.*| //g' -e 's/,.*//' input.fa > output.fa
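
A slightly more defensive version of the same two substitutions restricts them to header lines, so that sequence lines can never be altered. This is a sketch only: the function name strip_gi_prefix is made up, and it assumes that the gi prefix always ends with "| " and that everything after the first comma in a header can be discarded.

```shell
# Hypothetical helper (name invented for this sketch).
# Assumptions: the gi|...| prefix ends with "| ", and anything
# after the first comma in a header is unwanted.
strip_gi_prefix() {
  # The /^>/ address limits both substitutions to header lines.
  sed -e '/^>/ s/gi|.*| //' -e '/^>/ s/,.*$//' "$@"
}
```

Usage would be strip_gi_prefix input.fa > output.fa.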

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Thank you @Sej. I still have a problem, since it is a multi-FASTA file. How can I also remove the bacterial species name and keep just

71-724 plasmid pDN571

Thanks

ADD REPLY • link 7.1 years ago by Blaise • 0

0

Please post the headers of two or more records in your original post; otherwise we are going to go back and forth as we get additional bits of information from you.

ADD REPLY • link 7.1 years ago by GenoMax 141k

0

Hello guys, I want to change the headers in a FASTA file from:

>rna18149 gene=LOC103954069 Dbxref=GeneID:103954069,Genbank:XM_009365876.1 Name=XM_009365876.1 gbkey=mRNA product=uncharacterized LOC103954069 transcript_id=XM_009365876.1 gene_biotype=protein_coding
>rna18150 gene=LOC103953996 Dbxref=GeneID:103953996,Genbank:XM_009365794.1 Name=XM_009365794.1 gbkey=mRNA product=non-structural maintenance of chromosomes element 4 homolog A transcript_id=XM_009365794.1 gene_biotype=protein_coding
>rna18151 gene=LOC103953997 Dbxref=GeneID:103953997,Genbank:XM_009365795.1 Name=XM_009365795.1 gbkey=mRNA product=enoyl-[acyl-carrier-protein] reductase [NADH]%2C chloroplastic-like transcript_id=XM_009365795.1 gene_biotype=protein_coding
>rna18152 gene=LOC103954070 Dbxref=GeneID:103954070,Genbank:XM_009365877.1 Name=XM_009365877.1 gbkey=mRNA product=protein NRT1/ PTR FAMILY 7.1-like transcript_id=XM_009365877.1 gene_biotype=protein_coding

to:

>LOC103954069
>LOC103953996
>LOC103953997
>LOC103954070

Would you please give me some tips using sed or awk? Thank you!

ADD REPLY • link 3.8 years ago by Kurban ▴ 230

0

Assuming that your FASTA file is called test.fa, you could try:

cut -f1-2 -d ' ' test.fa |sed 's/^.*=/>/g'
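
The one-liner above relies on gene= being the second space-separated field. If that cannot be guaranteed, a sketch like the following looks for the gene= attribute wherever it occurs in the header. The function name gene_id_headers is invented for illustration, and it assumes the gene identifier contains no spaces; headers without a gene= attribute are printed unchanged.

```shell
# Hypothetical helper (name invented for this sketch).
# Assumption: the value of gene= contains no spaces.
gene_id_headers() {
  awk '/^>/ && match($0, /gene=[^ ]+/) {
    # RSTART/RLENGTH describe the match; skip the 5 characters of "gene=".
    print ">" substr($0, RSTART + 5, RLENGTH - 5)
    next
  }
  # Sequence lines and headers without gene= are printed as is.
  { print }' "$@"
}
```

It could be called as gene_id_headers test.fa > renamed.fa, where renamed.fa is just a placeholder output name.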

ADD REPLY • link 3.8 years ago by Sej Modha 5.3k

0

Thank you for the quick reply, that one worked like a charm.

ADD REPLY • link 3.8 years ago by Kurban ▴ 230