Question: Modifying Fasta file header
3.4 years ago, Blaise0 wrote:

Dear all, I would like to change my FASTA file header from:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence

to

> Streptococcus pyogenes 71-724 plasmid pDN571

Please could somebody help?

Many thanks

modified 3.4 years ago • written 3.4 years ago by Blaise0
1. This isn't a 'forum' question.
2. You haven't told us how, in what manner, or to what you would like it modified.
3. This is probably the single most asked question on the forum so you can use the search function to find plenty of existing solutions.
written 3.4 years ago by Joe17k

Do you want a general purpose solution for multiple files or do you just want this exact fasta modified?

written 3.4 years ago by Joe17k

@genomax and @sej, to run BLAT I need to change:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence

to

> pDN571

Thank you

modified 3.4 years ago by genomax87k • written 3.4 years ago by Blaise0

The question is whether ALL of your FASTA headers follow that exact format (where the spaces are, etc.). That is why we need more than one record.

BTW: This request already does not match what you had originally asked.

Try this in the meantime:

awk -F " " '{ if ($0 ~ /^>/) { print ">"$6;} else { print $0}}' input.fa | sed -e 's/,//' > output.fa

modified 3.4 years ago • written 3.4 years ago by genomax87k
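
The awk one-liner above relies on the plasmid name always being the sixth space-separated field, which is exactly why the header format matters. A minimal sketch of a variant that keeps the last word before the first comma instead, so the number of words in the organism name no longer matters. The function name `last_word_before_comma` and the sequence lines are made up for illustration, and it still assumes every header looks like ">... <plasmid name>, complete sequence":

```shell
# Hypothetical helper: reduce each FASTA header to ">" plus the last word
# that appears before the first comma. Sequence lines pass through untouched.
last_word_before_comma() {
    awk '
        /^>/ { sub(/,.*/, ""); print ">" $NF; next }  # header: cut at the comma, keep last word
        { print }                                      # sequence line: print unchanged
    '
}

# Two headers from the thread; the sequences are placeholders.
printf '%s\n' \
    '>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence' \
    'ACGTACGT' \
    '>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence' \
    'TTGACCA' |
    last_word_before_comma
```

On a real file this would be used as `last_word_before_comma < input.fa > output.fa`. A header without a comma is reduced to its last word, which may not be what you want, so check a few records first.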

@genomax2, sorry, I was not very clear in my request. It is in fact a multi-FASTA file for running BLAT.

written 3.4 years ago by Blaise0

No need to be sorry but we need additional information to ensure that solutions provided so far will work.

modified 3.4 years ago • written 3.4 years ago by genomax87k

I have to transform:

>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence
>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence
>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence

and so on....

to:

 > pDN571
 > pCCK3259
 > pCCK381

and so on....

modified 3.4 years ago by genomax87k • written 3.4 years ago by Blaise0

Can you try the awk solution I posted above? It should work. The assumption here is that the sequence lines are left as is.

modified 3.4 years ago • written 3.4 years ago by genomax87k

A simple regex for a lowercase p followed by one or more uppercase letters or digits should work, I think.

written 3.4 years ago by Joe17k
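
The regex idea above (a lowercase p followed by uppercase letters or digits) could be sketched with sed roughly as follows. This is only an illustration: the function name `plasmid_name_only` is made up, and the pattern assumes the plasmid name is preceded by a space and followed directly by a comma, as in the three headers posted:

```shell
# Hypothetical helper: replace a header by ">" plus the plasmid name, where
# the name is "p" followed by one or more uppercase letters or digits.
# LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
# Headers that do not match the pattern are left unchanged.
plasmid_name_only() {
    LC_ALL=C sed -E '/^>/ s/^>.* (p[A-Z0-9]+),.*$/>\1/'
}

# The three headers from the thread; the sequences are placeholders.
printf '%s\n' \
    '>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence' \
    'ACGTACGT' \
    '>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence' \
    'TTGACCA' \
    '>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence' \
    'GGCATT' |
    plasmid_name_only
```

Because unmatched headers pass through untouched, something like `grep '^>' output.fa | grep '|'` afterwards is a quick way to spot records whose plasmid name does not fit the pattern.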
3.4 years ago, Sej Modha4.7k (Glasgow, UK) wrote:

Simple sed solution

sed -e 's/gi.*| //g' input.fa > output.fa
written 3.4 years ago by Sej Modha4.7k

This keeps the ", complete sequence" part.

If that is common to all sequences, then it could also be removed by doing:

sed -e 's/gi.*| //g' -e 's/,.*//' input.fa > output.fa

written 3.4 years ago by genomax87k
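
The two sed expressions above run on every line of the file, and `gi.*| ` is unanchored and greedy. A slightly more defensive sketch, which only touches header lines and anchors the match at the start of the header. The function name `strip_gi_prefix` is made up, and it assumes every header starts with ">gi|" and that the identifier block contains no spaces:

```shell
# Hypothetical helper: on header lines only, drop the leading "gi|...| "
# identifier block, then drop everything from the first comma onwards.
strip_gi_prefix() {
    sed -e '/^>/ s/^>gi|[^ ]* />/' -e '/^>/ s/,.*$//'
}

# One header from the thread; the sequence is a placeholder.
printf '%s\n' \
    '>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence' \
    'ACGTACGT' |
    strip_gi_prefix
```

This writes the description directly after ">" with no space in between, since many tools take the first word after ">" as the sequence identifier.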

Thank you @Sej. I still have a problem since it is a multi-FASTA file. How can I also remove the bacterium name and keep just

71-724 plasmid pDN571

Thanks

written 3.4 years ago by Blaise0
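
For the request above (keep only "71-724 plasmid pDN571"), one possible sketch drops the first three space-separated words of each header, that is, the identifier, the genus and the species, and then cuts at the comma. The function name `drop_id_and_binomial` is made up, and the approach assumes every organism name is exactly two words, which would need checking against the real file:

```shell
# Hypothetical helper: on header lines, remove the identifier plus a two-word
# organism name, then remove everything from the first comma onwards.
drop_id_and_binomial() {
    sed -E -e '/^>/ s/^>[^ ]+ [^ ]+ [^ ]+ />/' -e '/^>/ s/,.*$//'
}

# Two headers from the thread; the sequences are placeholders.
printf '%s\n' \
    '>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence' \
    'ACGTACGT' \
    '>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence' \
    'TTGACCA' |
    drop_id_and_binomial
```

A header with a subspecies or other three-word organism name would keep the extra word, which is one reason several example records are needed before choosing an approach.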

Please post the headers of 2+ records in your original post; otherwise we are going to have a back and forth as we get additional bits of information from you.

written 3.4 years ago by genomax87k

Hello guys, I want to change the headers in a FASTA file from:

>rna18149 gene=LOC103954069 Dbxref=GeneID:103954069,Genbank:XM_009365876.1 Name=XM_009365876.1 gbkey=mRNA product=uncharacterized LOC103954069 transcript_id=XM_009365876.1 gene_biotype=protein_coding
>rna18150 gene=LOC103953996 Dbxref=GeneID:103953996,Genbank:XM_009365794.1 Name=XM_009365794.1 gbkey=mRNA product=non-structural maintenance of chromosomes element 4 homolog A transcript_id=XM_009365794.1 gene_biotype=protein_coding
>rna18151 gene=LOC103953997 Dbxref=GeneID:103953997,Genbank:XM_009365795.1 Name=XM_009365795.1 gbkey=mRNA product=enoyl-[acyl-carrier-protein] reductase [NADH]%2C chloroplastic-like transcript_id=XM_009365795.1 gene_biotype=protein_coding
>rna18152 gene=LOC103954070 Dbxref=GeneID:103954070,Genbank:XM_009365877.1 Name=XM_009365877.1 gbkey=mRNA product=protein NRT1/ PTR FAMILY 7.1-like transcript_id=XM_009365877.1 gene_biotype=protein_coding

to:

>LOC103954069
>LOC103953996
>LOC103953997
>LOC103954070

Would you please give me some tips using sed or awk? Thank you!

modified 27 days ago • written 27 days ago by Kurban190

Assuming that your fasta file is called test.fa, you could try

cut -f1-2 -d ' ' test.fa |sed 's/^.*=/>/g'

modified 27 days ago • written 27 days ago by Sej Modha4.7k

Thank you for the quick reply, that one worked like a charm.

written 27 days ago by Kurban190
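
The cut and sed pipeline above works because `gene=` is always the second space-separated field: cut keeps ">rna18149 gene=LOC103954069", and sed then replaces everything up to the "=" with ">". A sketch of an alternative that looks for the `gene=` key explicitly, wherever it sits in the header. The function name `gene_id_headers` is made up, and it assumes each header contains exactly one space-delimited `gene=` key:

```shell
# Hypothetical helper: replace each header by ">" plus the value of its
# gene= key. Headers without a gene= key and sequence lines are unchanged.
gene_id_headers() {
    sed -E '/^>/ s/^>.*[[:space:]]gene=([^[:space:]]+).*$/>\1/'
}

# Two headers from the thread; the sequences are placeholders.
printf '%s\n' \
    '>rna18149 gene=LOC103954069 Dbxref=GeneID:103954069,Genbank:XM_009365876.1 Name=XM_009365876.1 gbkey=mRNA product=uncharacterized LOC103954069 transcript_id=XM_009365876.1 gene_biotype=protein_coding' \
    'ACGTACGT' \
    '>rna18152 gene=LOC103954070 Dbxref=GeneID:103954070,Genbank:XM_009365877.1 Name=XM_009365877.1 gbkey=mRNA product=protein NRT1/ PTR FAMILY 7.1-like transcript_id=XM_009365877.1 gene_biotype=protein_coding' \
    'TTGACCA' |
    gene_id_headers
```

Note that several transcripts of the same gene would end up with identical headers, so it is worth checking for duplicates, for example with `grep '^>' output.fa | sort | uniq -d`.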