Question: How to edit fasta header with underscores
7 months ago by
imda10
imda10 wrote:

Hi everyone! I want to remove part of my FASTA headers. Could somebody help me, please?

I have this:

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

and I just want this part

>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

or

>CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

In my same fasta file, I have other sequences which are not in the same format as the sequence above:

>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

But in general, I just want the last part:

>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

This is needed because the program that I am using adds the species name to the ID.


I appreciate your answers. Your scripts worked well for some kinds of sequences but not for all. The problem is that the headers of my sequences are not uniform: I have thirteen kinds of sequences (from different species, hence different headers). I want to extract the headers to get the CDS from another FASTA file and carry out a selection analysis, so the headers need to match the headers of my CDS FASTA file. For some reason, a previous analysis adds the species name to the original sequence headers.

These are the thirteen kinds of sequences that I have, each followed by the header that I need:

input

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTR

Required

>cvCM334_CA01g24260
--------------------------------------

input

>Capsicum_annuum_glabriusculum_Capang04g001871
SLSSSVEPIPIKKPCFNNGMSRVIWTEKEVERMKTTENLQYVVIGKFMDG
QILMNYESKFDTNVKRECQIGVLKNRHILMRFNSEEDFINITLKPSYYIL
SKDGYSYMMRTIIYDTKFNVKEVTTLAMAWISFLDLQPTFFVKESIFSIA
LDIEKP

Required

>Capang04g001871
---------------------------------------

input

>Datura_stramonium_Teo1_Datura_stramonium_Teo12749-RA gene=Datura_stramonium_Teo12749 name=Datura_stramonium_Teo12749 seq_id=opera_scaffold_353_pilon_pilon type=cds
MSPPPPETSTEDGNTQFPPLPTTQTQKTHQTQPPIDYGKLFTNSTTQTKPQIDPIPMKPV

Required

>Datura_stramonium_Teo12749-RA 
--------------------------------------------

input

>Datura_stramonium_Tic23_Datura31638-RA gene=Datura_stramonium30201 name=Datura_stramonium30201 seq_id=opera_scaffold_7375_pilon type=cds
MTRINVIENIQHAIVRKFSHDWPSLEELRALIPKQYDYSRNKHVLNRFKLMEDFSNIMSK
SSYHMHPLIYDAKFRTNEETTQAMEQ

Required

>Datura31638-RA
-------------------------------------------------

input

>Nicotiana_attenuata_NIATv7_g62846.t1   unknown
MEVGQSSFNPKPLPQIASNPNPIQNYAKLLQPQAFNAPMHVNSINLKPVELLHGEPMVRW
KKSEVKKSIIQQGFHLAVLGKFSYGKPVIQELRKAIPIQCELKGSCLVGLIEDSHVLIKL
SFMEDYIHLLSKPAFYLKAQGEF

Required

>NIATv7_g62846.t1 
------------------------------------------

input

>Nicotiana_sylvestris_mRNA_25148_cds mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
MNQIERLEFAVVGKFTYDWSDLEELRKIIPQQCGVKGGCQIGLFRSKHILIRLSLQEDFVNLVSKGAFYIT

Required

>mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
------------------------------------------------

input

>Nicotiana_tabacum_Nitab4.5_0003269g0070.1
MATMASGQLPANTRTPPQPPLNITQPCTTTINVPKTMDYANAVKPTTSTSTMQDRAAVVD
PIPPRQAQFFQGQPTCGIKADCNIGYLRDR

Required

>Nitab4.5_0003269g0070.1
---------------------------------------------------

input

>Nicotiana_tomentosiformis_mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc knuckle _CCHC-type_ family protein';id=Solyc03g071760.1.1_evalue=3e-30_annot$
MATNASPQPLVAGELIQNNVNPNPNPTLQTPYAATLKQQPTIQNLPISKLKPVEFVHGEPTLK

Required

>mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc
------------------------------------------------

input

>Petunia_inflata_Peinf101Ctg13805532g00002.1 Unknown protein
MKYDVWFDPLEETSIVVTWISFPGILPEFFVQETAIRKPLQFDIAPKSKTRPGGAKVKVEMDLLVNHPHH

Required

>Peinf101Ctg13805532g00002.1
-------------------------------------------------------

input

>Solanum_lycopersicum_Solyc02g030550.1.1 LOW QUALITY_MLP-like protein 423 _AHRD V3.3 --* AT1G24020.2_
MAKIDSPQPQAEKERPEKPSHATIPNPSTCIQK

Required

>Solyc02g030550.1.1
------------------------------------------------------------

input

>Solanum_pennellii_Sopen10g018820.1 hypothetical protein
MRNQSGEVMEKWIKIRYDYVPKDCKTCMIQGHNKEQCYVIHQELYPKEKTGHKEGQTQEHR

Required

>Sopen10g018820.1
------------------------------------------------------

input

>Solanum_pimpinellifolium_Sopim01g017000.0.1
MPMYCKNYNLQGHKESECFILHPELRMEEEKVDVSEEPRGNSPIDKDKNIGNDEMNTLIK
ILKFTERDNDVLP

Required

>Sopim01g017000.0.1
-----------------------------------------

input

>Solanum_tuberosum_Sotub01g015640.1.1 - [64]
MAVTTACGSSPPEDFPPLPNRSKPGATPIPSSPQTNQYANLLKPRSLLPQITKVLPKPVNIVHE

Required

>Sotub01g015640.1.1
written 7 months ago by imda (modified by cpad0112)
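
Because the wanted IDs follow no single pattern, one option is to list the species prefixes explicitly and strip whichever one matches. The following is only a minimal sketch under several assumptions: the file names `prefixes.txt`, `proteins.fa` and `renamed.fa` are hypothetical, the prefix list is read off the examples above and is incomplete, the descriptions and sequences are shortened, and only the first word of each header is kept.

```shell
# Hypothetical prefix list, one per line, read off the example headers.
cat > prefixes.txt <<'EOF'
Capsicum_annuum_glabriusculum_
Capsicum_annuum_
Datura_stramonium_Teo1_
Datura_stramonium_Tic23_
Solanum_lycopersicum_
EOF

# Small test input (descriptions and sequences shortened).
cat > proteins.fa <<'EOF'
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like
MIAWISFPAV
>Capsicum_annuum_glabriusculum_Capang04g001871
SLSSSVEPIP
>Datura_stramonium_Teo1_Datura_stramonium_Teo12749-RA gene=Datura_stramonium_Teo12749
MSPPPPETST
>Solanum_lycopersicum_Solyc02g030550.1.1 LOW QUALITY_MLP-like protein 423
MAKIDSPQPQ
EOF

# First file: load the prefixes.
# Second file: on header lines keep the first word, then remove the
# longest listed prefix that the ID starts with. Other lines pass through.
awk '
NR == FNR { if ($0 != "") prefix[$0] = 1; next }
/^>/ {
    id = substr($1, 2)
    best = ""
    for (p in prefix)
        if (index(id, p) == 1 && length(p) > length(best)) best = p
    print ">" substr(id, length(best) + 1)
    next
}
{ print }
' prefixes.txt proteins.fa > renamed.fa

cat renamed.fa
```

On this input the headers come out as `>cvCM334_CA01g24260`, `>Capang04g001871`, `>Datura_stramonium_Teo12749-RA` and `>Solyc02g030550.1.1`. The remaining species would need their own lines in `prefixes.txt`, and the two `mRNA_..._cds` header types, where the wanted text is not simply the first word, would need an extra rule.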

Thanks for providing the detailed examples. This is not a trivial task, because there is no clean pattern for the names you would like to keep.

"For some reason, a previous analysis adds the species name to the original sequence headers."

So you have a file with the "original sequence headers"? What do they look like there?

written 7 months ago by finswimmer

Hi! I have a .fasta file with proteins from every species. They look like this:

>CA01g00010 Detected protein of unknown function
FRRNLELVRADRPNAFSN...
>CA01g00020 PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]
MLTSSTERFQKIQKGAPAEYQKYLV...

However, the program that I used to detect orthologues (OrthoFinder) can also give me all the protein sequences that belong to each orthogroup or gene family. I want to run some analyses with HyPhy, but that program requires CDS sequences. I also have all the CDS for each species, so I need the headers of all the sequences in each gene family (from OrthoFinder) in order to pull out the matching CDS.

written 7 months ago by imda (modified by finswimmer)
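
Once the IDs match, pulling the corresponding CDS records could look like the sketch below. It assumes a hypothetical `ids.txt` with one cleaned ID per line and a hypothetical `cds.fa` whose headers start with the same ID; the nucleotide sequences are made-up placeholders.

```shell
# Hypothetical list of wanted IDs, one per line.
cat > ids.txt <<'EOF'
CA01g00020
EOF

# Hypothetical CDS file; the sequences are placeholders.
cat > cds.fa <<'EOF'
>CA01g00010 Detected protein of unknown function
ATGTTTCGTCGT
AATCTTGAA
>CA01g00020 PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]
ATGCTTACTTCT
TCTACTGAA
EOF

# First file: remember the wanted IDs.
# Second file: on each header decide whether the record is wanted,
# then print header and sequence lines for as long as keep is true.
awk '
NR == FNR { want[$1] = 1; next }
/^>/ { keep = (substr($1, 2) in want) }
keep
' ids.txt cds.fa > selected_cds.fa

cat selected_cds.fa
```

This works on wrapped (multi-line) sequences as well, because the `keep` flag only changes on header lines. With seqkit installed, `seqkit grep -f ids.txt cds.fa` should do the same job.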

You would need seqkit to linearize your FASTA file.

input:

$ cat test.fa                                                        
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

output:

$ seqkit seq -w 0 test.fa | sed '/^>/ s/^>\w\+_/>/' | sed '/^>/ s/\s.*//'

>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIFIRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTKTFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKVLVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAKNQCRNL
written 7 months ago by cpad0112
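
If seqkit is not available, the linearizing step (joining wrapped sequence lines into one line per record) can also be sketched in plain awk. The file names here are hypothetical and the sequence is a shortened copy of the example above.

```shell
# Shortened copy of the wrapped example record.
cat > test.fa <<'EOF'
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
NQCRNL
EOF

# On a header: flush the sequence collected so far, then print the header.
# On any other line: append it to the current sequence.
# At the end: flush the last sequence.
awk '
/^>/ { if (seq != "") print seq; seq = ""; print; next }
{ seq = seq $0 }
END { if (seq != "") print seq }
' test.fa > linear.fa

cat linear.fa
```

The result has exactly two lines per record: the unchanged header and the full sequence on a single line.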

Dear cpad0112, could you help me with the questions that I pointed out below? Thank you.

written 7 months ago by imda
7 months ago by
finswimmer12k
Germany
finswimmer12k wrote:

Try this:

$ awk -v FS=" " '/^>/ {n=split($1, id, "_"); $0=">"id[n]}1' input.fasta > output.fasta
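
To see what this does: it keeps only the text after the last underscore in the first word of each header. A quick check on three of the headers from the thread (sequences shortened, default field separator used):

```shell
cat > input.fasta <<'EOF'
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAV
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFV
>Nicotiana_attenuata_NIATv7_g62846.t1   unknown
MEVGQSSFNP
EOF

# Split the first word of each header on "_" and keep the last piece;
# the trailing 1 prints every line, modified or not.
awk '/^>/ {n = split($1, id, "_"); $0 = ">" id[n]} 1' input.fasta > output.fasta

cat output.fasta
```

The first two headers become `>CA01g24260` and `>Capang01g001768`, as asked for in the question. The third becomes `>g62846.t1` rather than the `>NIATv7_g62846.t1` requested in the comments, which illustrates why a single "last underscore" rule cannot cover all thirteen header types.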