Format UniProt FASTA headers
Hi,
I have a multi-FASTA file with headers in the following format:
>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

I would like to reformat the headers so that they contain only the UniProt accession (ACC) or the entry name, giving either:
>Q9Y5Q8

or
>TF3C5_HUMAN

I think sed can do it, but I don't know the exact regular expression to use.
Thanks

Tags: sequence, fasta-header
Updated 3.6 years ago by GenoMax • written 8.5 years ago by jfertaj

awk '{if ($0 ~ /^>/) {split($0,a,"|"); print ">"a[2]} else {print}}' your_file > new_file

If you want the entry names (TF3C5_HUMAN and the like) instead:
awk '{if ($0 ~ /^>/) {split($0,a,"|"); split(a[3],b," "); print ">"b[1]} else {print}}' your_file > new_file
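
For readers who prefer Python, the same split-on-'|' idea used by the awk commands above can be sketched as follows. This is only an illustration: the function name parse_uniprot_header is hypothetical, and it assumes every header has the db|accession|entry-name layout shown in the question.

```python
def parse_uniprot_header(header):
    """Split a UniProt FASTA header into (database, accession, entry_name)."""
    # Drop the leading '>' and split on the first two '|' characters only,
    # so a '|' inside the free-text description cannot shift the fields.
    db, accession, rest = header.lstrip(">").split("|", 2)
    # The entry name runs up to the first whitespace; the description follows.
    entry_name = rest.split(None, 1)[0]
    return db, accession, entry_name


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
db, accession, entry_name = parse_uniprot_header(header)
print(">" + accession)    # >Q9Y5Q8
print(">" + entry_name)   # >TF3C5_HUMAN
```

A header without two '|' separators would raise a ValueError here, which may or may not be what you want for mixed files.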
 
awk -F '|' '/^>/ {printf(">%s\n",$2);next;} {print;}' input.fasta
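
The one-liner above works as a stream filter: header lines are rewritten and every other line is passed through unchanged. A rough Python counterpart, assuming the same '|'-separated layout, might look like this. The name rewrite_headers is made up for the example, and the fallback for headers without a second field is an addition that the awk version does not have.

```python
import io


def rewrite_headers(lines):
    """Yield FASTA lines with each UniProt header reduced to '>accession'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) > 1:
                yield ">" + parts[1]   # second field is the accession
            else:
                yield line             # not a UniProt-style header: keep as is
        else:
            yield line                 # sequence line: pass through


# In-memory stand-in for an open FASTA file.
demo = io.StringIO(
    ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5\n"
    "LASJDQSMLASKDNAL\n"
)
result = list(rewrite_headers(demo))
print("\n".join(result))
```

Because it is a generator, it never holds more than one line in memory, which helps with large proteome files.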
 
Save the script below as script.py and run it with Python 3 as
python script.py file.fasta
It writes the reformatted records to file_extracted.fasta, for example:
>Q9Y5Q8
LASJDQSMLASKDNAL

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys
##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''
##########################################################################################
if len(sys.argv) != 2:
    print(syntax)
    sys.exit(1)
##########################################################################################
# Output name: input name up to the first '.', plus '_extracted.fasta'
prefix = sys.argv[1].split('.')[0]
records = {}   # accession -> sequence (dicts keep insertion order in Python 3.7+)
name = None
seq = ""
with open(sys.argv[1], 'r') as fasta_seqs:
    for line in fasta_seqs:
        line = line.rstrip('\n')
        if line.startswith('>'):
            if name is not None:
                records[name] = seq        # store the record that just ended
            name = line.split('|')[1]      # second '|' field is the accession
            seq = ""
        else:
            seq = seq + line               # join wrapped sequence lines
if name is not None:
    records[name] = seq                    # store the last record
with open(prefix + '_extracted.fasta', 'w') as outfile:
    for key, value in records.items():
        outfile.write('>' + key + '\n' + value + '\n')

Feel free to modify it as you need.
 
If you only want the unique identifiers and not the sequences:
awk -F '|' '/^>/ {print $2}' proteome.fasta > identifiers.txt
 
Example input: 
>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
AVAREIRHKEHEFTEVKQKIAYHRKYLERADSPREYERLLQERQKLIERAYKLSEEIYE
RRKYEALREEEEKLHQKEDEIEEKIHKLKKEYRALLNELKGLIEELNRKAREIIEKYGL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
HPGSHKVPFFRRRQHHTCPCLPSLLCSRCLDGRYRCSTDLKNINF
 
Example output: 
O67453 
A0A384D5E1
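
Note that the awk command prints one identifier per header and does not itself remove repeats. If the same accession could appear more than once in the file, a small Python sketch like the one below keeps only the first occurrence of each. The function name unique_accessions is hypothetical, the headers are shortened copies of the example input, and the duplicated entry is added here only to show the effect.

```python
def unique_accessions(lines):
    """Return accessions from UniProt-style headers, first occurrence only."""
    seen = set()
    ordered = []
    for line in lines:
        if not line.startswith(">"):
            continue                      # sequence line
        fields = line.split("|")
        if len(fields) < 3:
            continue                      # not in db|accession|name form
        accession = fields[1]
        if accession not in seen:
            seen.add(accession)
            ordered.append(accession)
    return ordered


demo = [
    ">sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476",
    "MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL",
    ">tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1",
    "MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC",
    ">sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476",  # repeated on purpose
]
ids = unique_accessions(demo)
print("\n".join(ids))
```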
 
What you need is cut -d '|'

sed -e 's/^>.\|//' -e 's/ .//' file

Thanks, but this approach gives me:
p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

Sorry, my bad, I forgot the wildcard.
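
The exchange above comes down to the quantifier: a bare "." matches exactly one character, which is why only single characters were removed from the header, whereas ".*" or "[^|]*" matches a whole run. As an illustration of the regular-expression approach the question asked about, here is a sketch using Python's re module rather than sed. HEADER_RE, to_accession and to_entry_name are names invented for this example, and the pattern assumes the db|accession|entry-name layout from the question.

```python
import re

# '>' + database, '|', accession (group 1), '|', entry name (group 2), rest.
HEADER_RE = re.compile(r"^>[^|]*\|([^|]*)\|(\S*).*$")


def to_accession(line):
    """Rewrite a header to '>accession'; other lines are returned unchanged."""
    return HEADER_RE.sub(r">\1", line)


def to_entry_name(line):
    """Rewrite a header to '>entry_name'; other lines are returned unchanged."""
    return HEADER_RE.sub(r">\2", line)


header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")
print(to_accession(header))    # >Q9Y5Q8
print(to_entry_name(header))   # >TF3C5_HUMAN
```

Sequence lines do not start with '>', so the pattern never matches them and they pass through untouched.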