How to edit fasta header with underscores
5.1 years ago
imda ▴ 10

Hi everyone! I want to remove part of my FASTA headers. Could somebody help me, please?

I have this:

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

and I just want this part:

>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

or

>CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD

The same FASTA file also contains sequences whose headers are not in the same format as the one above:

>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

But in general, I just want the last part:

>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

This is because the program that I am using adds the name of the species to the ID.


I appreciate your answers. Your scripts worked well for some kinds of sequences but not for all of them. The problem is that the headers of my sequences are not uniform: I have thirteen kinds of sequences (from different species, so different headers). I want to extract the headers and use them to get the CDS from another FASTA file in order to carry out a selection analysis. Therefore, I need the headers to match the headers of my CDS FASTA file. For some reason, a previous analysis adds the name of the species to the original sequence headers.

These are the thirteen kinds of headers that I have. For each one, I show the input header and the header that I need:

>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTR

Required

>cvCM334_CA01g24260
--------------------------------------

input

>Capsicum_annuum_glabriusculum_Capang04g001871
SLSSSVEPIPIKKPCFNNGMSRVIWTEKEVERMKTTENLQYVVIGKFMDG
QILMNYESKFDTNVKRECQIGVLKNRHILMRFNSEEDFINITLKPSYYIL
SKDGYSYMMRTIIYDTKFNVKEVTTLAMAWISFLDLQPTFFVKESIFSIA
LDIEKP

Required

>Capang04g001871
---------------------------------------

input

>Datura_stramonium_Teo1_Datura_stramonium_Teo12749-RA gene=Datura_stramonium_Teo12749 name=Datura_stramonium_Teo12749 seq_id=opera_scaffold_353_pilon_pilon type=cds
MSPPPPETSTEDGNTQFPPLPTTQTQKTHQTQPPIDYGKLFTNSTTQTKPQIDPIPMKPV

Required

>Datura_stramonium_Teo12749-RA 
--------------------------------------------

input

>Datura_stramonium_Tic23_Datura31638-RA gene=Datura_stramonium30201 name=Datura_stramonium30201 seq_id=opera_scaffold_7375_pilon type=cds
MTRINVIENIQHAIVRKFSHDWPSLEELRALIPKQYDYSRNKHVLNRFKLMEDFSNIMSK
SSYHMHPLIYDAKFRTNEETTQAMEQ

Required

>Datura31638-RA
-------------------------------------------------

input

>Nicotiana_attenuata_NIATv7_g62846.t1   unknown
MEVGQSSFNPKPLPQIASNPNPIQNYAKLLQPQAFNAPMHVNSINLKPVELLHGEPMVRW
KKSEVKKSIIQQGFHLAVLGKFSYGKPVIQELRKAIPIQCELKGSCLVGLIEDSHVLIKL
SFMEDYIHLLSKPAFYLKAQGEF

Required

>NIATv7_g62846.t1 
------------------------------------------

input

>Nicotiana_sylvestris_mRNA_25148_cds mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
MNQIERLEFAVVGKFTYDWSDLEELRKIIPQQCGVKGGCQIGLFRSKHILIRLSLQEDFVNLVSKGAFYIT

Required

>mRNA_25148 gene_14162|id=AT2G01050.1_evalue=7e-07_annot='zinc ion binding';id=Solyc01g021700.1.1_evalue=4e-12_annot='Unknown Protein'
------------------------------------------------

input

>Nicotiana_tabacum_Nitab4.5_0003269g0070.1
MATMASGQLPANTRTPPQPPLNITQPCTTTINVPKTMDYANAVKPTTSTSTMQDRAAVVD
PIPPRQAQFFQGQPTCGIKADCNIGYLRDR

Required

>Nitab4.5_0003269g0070.1
---------------------------------------------------

input

>Nicotiana_tomentosiformis_mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc knuckle _CCHC-type_ family protein';id=Solyc03g071760.1.1_evalue=3e-30_annot$
MATNASPQPLVAGELIQNNVNPNPNPTLQTPYAATLKQQPTIQNLPISKLKPVEFVHGEPTLK

Required

>mRNA_3163_cds mRNA_3163 gene_1805|id=AT5G32613.1_evalue=2e-04_annot='Zinc
------------------------------------------------

input

>Petunia_inflata_Peinf101Ctg13805532g00002.1 Unknown protein
MKYDVWFDPLEETSIVVTWISFPGILPEFFVQETAIRKPLQFDIAPKSKTRPGGAKVKVEMDLLVNHPHH

Required

>Peinf101Ctg13805532g00002.1
-------------------------------------------------------

input

>Solanum_lycopersicum_Solyc02g030550.1.1 LOW QUALITY_MLP-like protein 423 _AHRD V3.3 --* AT1G24020.2_
MAKIDSPQPQAEKERPEKPSHATIPNPSTCIQK

Required

>Solyc02g030550.1.1
------------------------------------------------------------

input

>Solanum_pennellii_Sopen10g018820.1 hypothetical protein
MRNQSGEVMEKWIKIRYDYVPKDCKTCMIQGHNKEQCYVIHQELYPKEKTGHKEGQTQEHR

Required

>Sopen10g018820.1
------------------------------------------------------

input

>Solanum_pimpinellifolium_Sopim01g017000.0.1
MPMYCKNYNLQGHKESECFILHPELRMEEEKVDVSEEPRGNSPIDKDKNIGNDEMNTLIK
ILKFTERDNDVLP

Required

>Sopim01g017000.0.1
-----------------------------------------

input

>Solanum_tuberosum_Sotub01g015640.1.1 - [64]
MAVTTACGSSPPEDFPPLPNRSKPGATPIPSSPQTNQYANLLKPRSLLPQITKVLPKPVNIVHE

Required

>Sotub01g015640.1.1

Thanks for providing the detailed examples. This is not a trivial task, because there is no clean pattern for the names you would like to keep.

For some reason, a previous analysis adds the name of the species to the original sequence headers.

So you have a file with the "original sequence headers"? What do they look like there?
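
One way around the missing pattern, sketched here under the assumption that the previous analysis prepends one fixed label per species (which is what the thirteen examples suggest), is to list those labels explicitly and strip whichever one matches. The prefix list below is read off the examples in this thread and is an assumption, not something taken from the program's documentation; `strip_prefix` is a hypothetical helper name. Only the ID is kept, so any description after the first whitespace is dropped.

```shell
#!/bin/sh
# Sketch only: the prefix list is an assumption read off the examples in this
# thread. List longer prefixes before shorter ones that they start with.
PREFIXES="Capsicum_annuum_glabriusculum_ Datura_stramonium_Teo1_ \
Datura_stramonium_Tic23_ Capsicum_annuum_ Nicotiana_attenuata_ \
Nicotiana_sylvestris_ Nicotiana_tabacum_ Nicotiana_tomentosiformis_ \
Petunia_inflata_ Solanum_lycopersicum_ Solanum_pennellii_ \
Solanum_pimpinellifolium_ Solanum_tuberosum_"

# strip_prefix (hypothetical helper): reads FASTA from files or from stdin.
strip_prefix() {
  awk -v prefixes="$PREFIXES" '
    BEGIN { n = split(prefixes, p, " ") }
    /^>/ {
      id = substr($1, 2)                 # first token, without the ">"
      for (i = 1; i <= n; i++)
        if (index(id, p[i]) == 1) {      # prefix sits at the very start
          id = substr(id, length(p[i]) + 1)
          break
        }
      print ">" id                       # description after the ID is dropped
      next
    }
    { print }                            # sequence lines pass through
  ' "$@"
}

printf '%s\n' \
  '>Capsicum_annuum_glabriusculum_Capang04g001871' 'SLSS' \
  '>Nicotiana_attenuata_NIATv7_g62846.t1   unknown' 'MEVG' | strip_prefix
# prints:
# >Capang04g001871
# SLSS
# >NIATv7_g62846.t1
# MEVG
```

Headers whose ID starts with none of the listed prefixes are left with their ID unchanged, so a missing label shows up as an unmodified header rather than a wrong one.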


Hi! I have a .fasta file with proteins from every species. They look like this:

>CA01g00010 Detected protein of unknown function
FRRNLELVRADRPNAFSN...
>CA01g00020 PREDICTED: protein ECERIFERUM 3-like [Solanum tuberosum]
MLTSSTERFQKIQKGAPAEYQKYLV...

However, the program that I used to detect orthologues can also give me all the protein sequences that belong to each orthogroup or gene family. I want to carry out some analyses with the HyPhy program, but it requires CDS sequences to work. I also have all the CDS for each species, so I need to use the headers of all the sequences that belong to each gene family (from OrthoFinder) in order to obtain the corresponding CDS.
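
Since the original per-species files already contain the IDs that are needed, another option is to let those files decide: collect every original ID, then for each prefixed header keep the longest known ID that its first token ends with. This is a minimal sketch, assuming the prefixed token always ends with exactly the original ID and that the prefix is joined with an underscore; `match_original_ids`, `originals.fa` and `orthogroup.fa` are hypothetical names.

```shell
#!/bin/sh
# match_original_ids (hypothetical helper)
#   $1: FASTA with the original headers (e.g. all species concatenated)
#   $2: FASTA with the species-prefixed headers
match_original_ids() {
  awk '
    # First file: remember every original ID (first token, without ">").
    NR == FNR { if (/^>/) known[substr($1, 2)] = 1; next }

    # Second file, header lines: drop one "_"-separated piece at a time
    # from the left until what remains is a known original ID.
    /^>/ {
      tok = substr($1, 2)
      rest = tok
      while (!(rest in known)) {
        k = index(rest, "_")
        if (k == 0) { rest = ""; break }   # ran out of pieces, no match
        rest = substr(rest, k + 1)
      }
      if (rest != "") {
        print ">" rest
      } else {
        print ">" tok                      # keep the ID as it was
        print "no original ID found for " tok > "/dev/stderr"
      }
      next
    }
    { print }                              # sequence lines pass through
  ' "$1" "$2"
}

# Usage: match_original_ids originals.fa orthogroup.fa > orthogroup.ids.fa
```

Because the full token is tried first and pieces are removed from the left, IDs that themselves contain underscores (such as `NIATv7_g62846.t1`) are recovered whole as long as they occur in the original file.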


You would need seqkit to linearize your FASTA file (one line per sequence).

input:

$ cat test.fa                                                        
>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane 9 superfamily member 4-like [Solanum tuberosum]
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capsicum_annuum_glabriusculum_Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIF
IRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTK
TFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKV
LVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAK
NQCRNL

output:

$ seqkit seq -w0 test.fa | sed '/^>/ s/^>\w\+_/>/1' | sed 's/\s.*//g'

>CA01g24260
MIAWISFPAVPLNFFGKEVVFSFATAVDNPLHVDLATQNKTRPSCAKVKMKINLLGEFPKRINVGMRMKTREVKEKGVNISYDYVPKYCKTFKLQDYNEKECFILHPKLYPKD
>Capang01g001768
VNKMENLQFVVIGKFTUEWTDIEELRNIIPQQCNINGDCQIGVFRNKHIFIRUTQESDFINIIISKGAFYLHCKDVLLFDANTLTLIYDPLFKETLETTKTFAWISFPNLLPTYYVKECLFSLAATVGKPVQLDLATINRTRPSCARIKVLVDLKADFSKSVRMDIENEESGKCRTIVKRIKFDHIPKYCHECNMQVHAKNQCRNL
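
If seqkit is not installed, the same two steps can be approximated with standard tools. The sketch below assumes plain ASCII headers; `linearize_fasta` and `trim_headers` are hypothetical helper names, and the POSIX class `[[:alnum:]_]` stands in for the `\w` used in the sed command above.

```shell
#!/bin/sh
# linearize_fasta (hypothetical helper): join wrapped sequence lines so that
# every record is one header line followed by one sequence line.
linearize_fasta() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; print; next }
    { seq = seq $0 }
    END { if (seq != "") print seq }
  ' "$@"
}

# trim_headers (hypothetical helper): on header lines only, drop everything
# up to the last "_" of the leading run of word characters, then drop the
# description that follows the first whitespace.
trim_headers() {
  sed -E -e '/^>/ s/^>[[:alnum:]_]+_/>/' -e '/^>/ s/[[:space:]].*//'
}

# Usage: linearize_fasta test.fa | trim_headers
```

Restricting both substitutions to header lines means the sequence lines are never touched, whatever they contain.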

Dear cpad0112, could you help me with the questions that I pointed out below? Thank you.

5.1 years ago

Try this:

$ awk -v FS=" " '/^>/ {n=split(substr($1, 2), id, "_"); $0=">"id[n]}1' input.fasta > output.fasta
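
The one-liner above keeps only what follows the last underscore of the first header token, which is why it suits some of the header styles in this thread and not others. Here is a small demonstration of the same split-on-underscore logic on three headers taken from the follow-up comment (`keep_last_field` is a hypothetical wrapper name):

```shell
#!/bin/sh
# keep_last_field (hypothetical wrapper): split the first header token on "_"
# and keep only the last piece; sequence lines are printed unchanged.
keep_last_field() {
  awk '/^>/ { n = split(substr($1, 2), id, "_"); $0 = ">" id[n] } 1' "$@"
}

printf '%s\n' \
  '>Capsicum_annuum_cvCM334_CA01g24260 PREDICTED_ transmembrane' \
  '>Nicotiana_attenuata_NIATv7_g62846.t1   unknown' \
  '>Nicotiana_sylvestris_mRNA_25148_cds mRNA_25148' | keep_last_field
# prints:
# >CA01g24260
# >g62846.t1
# >cds
```

IDs that contain an underscore themselves lose their first part, so this approach only fits species whose original IDs have no underscore.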