Question

format uniprot fasta headers

1

8.2 years ago

jfertaj ▴ 110

Hi,

I have a multi-fasta file with a header in the following format:

>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

I would like to reformat the headers to keep only the UniProt accession (ACC) or the entry name, to get the following:

>Q9Y5Q8

or

>TF3C5_HUMAN

I think sed can do it, but I don't know the exact regular expression.

Thanks

sequence fasta-header • 5.5k views

ADD COMMENT • link updated 3.3 years ago by GenoMax 152k • written 8.2 years ago by jfertaj ▴ 110

0

What you need is cut -d '|' -f 2 (on a header line, field 2 is the accession; the leading > is not part of it).

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

sed -e 's/^>.\|//' -e 's/ .//' file

ADD REPLY • link 8.2 years ago by Rohit ★ 1.5k

0

thanks but this approach gives me p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

ADD REPLY • link 8.2 years ago by jfertaj ▴ 110

0

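Sorry, my bad, I forgot the wildcard. The pipe also has to stay unescaped, because \| means alternation in GNU sed:

sed -e 's/^>.*|/>/' -e 's/ .*//' file

This keeps the entry name (>TF3C5_HUMAN).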

ADD REPLY • link 8.2 years ago by Rohit ★ 1.5k

2

8.2 years ago

Pierre Lindenbaum 166k

awk -F '|' '/^>/ {printf(">%s\n",$2);next;} {print;}' input.fasta

ADD COMMENT • link 8.2 years ago by Pierre Lindenbaum 166k

0

Thanks @Pierre, even more concise!! +1. Could you please explain what the first part of the awk command, right after the field separator option, does?

ADD REPLY • link 8.2 years ago by jfertaj ▴ 110

0

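/^>/ is a pattern: if the line begins with >, awk runs the action in the braces that follow it (print > plus the second |-separated field, then next moves on to the following line). Every other line falls through to {print;} and is printed unchanged.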
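
For readers more at home in Python, the same pattern-then-action logic can be sketched roughly as below. This is only an illustration, not part of Pierre's answer: the function name rewrite_headers is made up here, and it assumes every header line has at least two |-separated fields.

```python
def rewrite_headers(lines):
    # Rough Python counterpart of the awk one-liner with -F '|'
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):          # the /^>/ pattern
            fields = line.split("|")      # -F '|' splits on the pipe
            out.append(">" + fields[1])   # awk's $2 is fields[1] in Python
            continue                      # awk's `next`
        out.append(line)                  # the default {print;} action
    return out


example = [
    ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2",
    "LASJDQSMLASKDNAL",
]
print("\n".join(rewrite_headers(example)))
```

The header becomes >Q9Y5Q8 and the sequence line is passed through untouched.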

ADD REPLY • link 8.2 years ago by GenoMax 152k

0

8.2 years ago

Buffo ★ 2.4k

Save the script as script.py and run it as:

python script.py file.fasta

It writes the result to file_extracted.fasta, which will contain:

>Q9Y5Q8
LASJDQSMLASKDNAL

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
# Rewrite UniProt FASTA headers so that only the accession remains.

import sys


##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 2:
    print(syntax)
    sys.exit(1)

##########################################################################################

records = {}   # accession -> sequence (dicts keep insertion order in Python 3.7+)
name = None
seq = ""
# Output name: input name up to the first dot, plus "_extracted.fasta"
prefix = sys.argv[1].split('.')[0]

with open(sys.argv[1], 'r') as fasta_seqs:
    for line in fasta_seqs:
        line = line.rstrip('\n')

        if line.startswith('>'):
            # Store the previous record before starting a new one
            if name is not None:
                records[name] = seq
            # >sp|Q9Y5Q8|TF3C5_HUMAN ... -> Q9Y5Q8
            name = line.split('|')[1]
            seq = ""
        else:
            seq = seq + line

# Store the last record (must be seq, not line, or a multi-line sequence is truncated)
if name is not None:
    records[name] = seq

with open(prefix + '_' + 'extracted.fasta', 'w') as outfile:
    for key, value in records.items():
        outfile.write('>' + key + '\n' + value + '\n')

Feel free to modify it as you need

ADD COMMENT • link 8.2 years ago by Buffo ★ 2.4k

1

In this case, I would make it a little easier for the user by using the Biopython module:

from Bio import SeqIO
for seq_record in SeqIO.parse('sample.fasta', 'fasta'):
  # seq_record.id is the header up to the first space, e.g. sp|Q9Y5Q8|TF3C5_HUMAN
  header = seq_record.id
  UniprotID = '>' + header.split('|')[1]
  ProteinName = '>' + header.split('|')[-1]
  seqs = str(seq_record.seq)
  print(UniprotID)   # use print(ProteinName) instead for the entry name
  print(seqs)

ADD REPLY • link 8.2 years ago by Pallab Bhowmick ▴ 20

0

Yes, I know, but personally I don't like to use Biopython, and even less to use print for FASTA files. I think there is a function called write_fasta or something like that in the SeqIO module, isn't there?

ADD REPLY • link 8.2 years ago by Buffo ★ 2.4k

0

Yes, SeqIO.write() exists, or you can use print with the format() function to get properly formatted output.
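
As a rough sketch of the print-with-format() route, using only the standard library: the helper name format_fasta and the 60-column line width are assumptions made for this example, not something taken from Biopython or from the posts above.

```python
def format_fasta(name, sequence, width=60):
    """Return one FASTA record with the sequence wrapped to `width` columns."""
    # Slice the sequence into chunks of at most `width` characters
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">{}\n{}".format(name, "\n".join(chunks))


print(format_fasta("Q9Y5Q8", "LASJDQSMLASKDNAL"))
```

Wrapping the sequence keeps long proteins readable; pass a different width if a downstream tool expects another line length.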

ADD REPLY • link 8.2 years ago by WouterDeCoster 48k

0

3.3 years ago

katieostrouchov ▴ 30

If you only want the unique identifiers and not the sequences:

awk -F '|' '/^>/ {print $2}' proteome.fasta > identifiers.txt

Example input:

>sp|O67453|Y1476_AQUAE Uncharacterized protein aq_1476 OS=Aquifex aeolicus (strain VF5) OX=224324 GN=aq_1476 PE=4 SV=1
MLKSLTMENVKVVTGEIEKLRERIEKVKETLDLIPKEIEELERELERVRQEIAKKEDEL
AVAREIRHKEHEFTEVKQKIAYHRKYLERADSPREYERLLQERQKLIERAYKLSEEIYE
RRKYEALREEEEKLHQKEDEIEEKIHKLKKEYRALLNELKGLIEELNRKAREIIEKYGL
>tr|A0A384D5E1|A0A384D5E1_URSMA Prokineticin-1 OS=Ursus maritimus OX=29073 GN=PROK1 PE=3 SV=1
MRGAMRVSIMFLLVTVSDCAVITGACERDVQCGAGTCCAISLWLRGLRMCTPLGREGEEC
HPGSHKVPFFRRRQHHTCPCLPSLLCSRCLDGRYRCSTDLKNINF

Example output:

O67453
A0A384D5E1

ADD COMMENT • link updated 3.3 years ago by GenoMax 152k • written 3.3 years ago by katieostrouchov ▴ 30

score 4 · Accepted Answer · 2017-04-24

4

8.2 years ago

GenoMax 152k

awk '{if ($0 ~ /^>/)  {split($0,a,"|"); print ">"a[2]} else { print;}}' your_file > new_file

If you want the entry names (TF3C5_HUMAN and so on) instead:

awk '{if ($0 ~ /^>/)  {split($0,a,"|"); split(a[3],b," "); print ">"b[1]} else { print;}}' your_file > new_file
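
Since the question asked about a regular expression, here is a minimal Python sketch that captures both pieces at once. It assumes headers of the form >db|ACCESSION|ENTRY_NAME followed by an optional description, as in the example from the question; the pattern and the function name parse_uniprot_header are illustrative, not an official UniProt parser.

```python
import re

# >db|accession|entry_name, optionally followed by a description
HEADER_RE = re.compile(r"^>(\w+)\|([^|\s]+)\|(\S+)")


def parse_uniprot_header(line):
    """Return (accession, entry_name), or None if the line does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match.group(2), match.group(3)


header = ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2"
accession, entry_name = parse_uniprot_header(header)
print(">" + accession)
print(">" + entry_name)
```

Returning None for non-matching lines makes it easy to pass sequence lines, or headers in another format, through unchanged.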

ADD COMMENT • link 8.2 years ago by GenoMax 152k

0

Thanks a lot @genomax2. If you write your comment as an answer, I will accept it.

ADD REPLY • link 8.2 years ago by jfertaj ▴ 110