Question: format uniprot fasta headers
jfertaj90 wrote:

Hi,

I have a multi-fasta file with a header in the following format:

>sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

I would like to reformat the headers to keep only the UniProt accession or the entry name, to get the following:

>Q9Y5Q8

or

>TF3C5_HUMAN

I think sed can do it, but I don't know the exact regular expression to use.

Thanks

sequence fasta-header • 1.4k views
ADD COMMENTlink modified 2.8 years ago by Pierre Lindenbaum125k • written 2.8 years ago by jfertaj90

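What you need is cut -d '|' -f 2 (the second |-separated field is the accession). Note that cut drops the leading >, so it has to be added back.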

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by WouterDeCoster42k

sed -e 's/^>.\|//' -e 's/ .//' file

ADD REPLYlink written 2.8 years ago by Rohit1.4k

Thanks, but this approach gives me:

p|Q9Y5Q8|TF3C5_HUMANeneral transcription factor 3C polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by jfertaj90

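Sorry, my bad, I forgot the wildcard. Note that the | must stay unescaped, because GNU sed treats \| as alternation. This keeps the entry name (>TF3C5_HUMAN):

sed -e 's/^>.*|/>/' -e 's/ .*//' file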
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Rohit1.4k
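
For readers who want to see what the two substitutions do step by step, here is a minimal Python sketch of the same idea using the standard `re` module. It is an illustration under stated assumptions, not part of the original reply: the variable names are made up, and the single-pattern variant at the end is an alternative that assumes the header follows the `>db|accession|entry_name description` layout shown in the question. Note that Python's `re` syntax differs from sed's: here `\|` is a literal pipe.

```python
import re

# Example header taken from the question
header = (">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C "
          "polypeptide 5 OS=Homo sapiens GN=GTF3C5 PE=1 SV=2")

# Step 1: the greedy '.*' runs to the LAST '|', so everything up to and
# including it is replaced by '>'.
step1 = re.sub(r"^>.*\|", ">", header)

# Step 2: drop everything from the first space onwards.
entry_name = re.sub(r" .*", "", step1)

# Alternative (assumed layout): capture both identifiers with one pattern.
match = re.match(r"^>[^|]*\|([^|]+)\|(\S+)", header)
accession, name = match.group(1), match.group(2)

print(entry_name)
print(">" + accession)
```

Because the first pattern is greedy, it always leaves the text after the last pipe, which is why this approach yields the entry name rather than the accession.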
genomax76k wrote:
awk '{if ($0 ~ /^>/)  {split($0,a,"|"); print ">"a[2]} else { print;}}' your_file > new_file

If you want the TF* names then

awk '{if ($0 ~ /^>/)  {split($0,a,"|"); split(a[3],b," "); print ">"b[1]} else { print;}}' your_file > new_file
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by genomax76k

Thanks a lot @genomax2, if you write your comment as an answer I will accept it.

ADD REPLYlink written 2.8 years ago by jfertaj90
Pierre Lindenbaum125k wrote:
awk -F '|' '/^>/ {printf(">%s\n",$2);next;} {print;}' input.fasta
ADD COMMENTlink written 2.8 years ago by Pierre Lindenbaum125k

Thanks @Pierre, even more concise! +1. Could you please explain what the first part of the awk command, right after the field separator option, does?

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by jfertaj90

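/^>/ is a pattern: if the line begins with >, the action in the braces that follow is run (print > plus the second |-separated field, then next skips to the next input line). All other lines fall through to {print;} and are printed unchanged.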

ADD REPLYlink written 2.8 years ago by genomax76k
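
The pattern/action logic of the awk one-liner can also be written out in Python, which may make the control flow easier to follow. This is only a sketch: the function name `rewrite_headers` and its `field` parameter are hypothetical and not taken from any answer in this thread.

```python
def rewrite_headers(lines, field=1):
    """Yield FASTA lines, replacing each header by '>' plus one '|'-separated field.

    field=1 corresponds to awk's $2 (the accession). The name and the
    parameter are illustrative assumptions.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):                  # awk pattern: /^>/
            yield ">" + line.split("|")[field]    # awk action: printf(">%s\n", $2); next
        else:
            yield line                            # awk default branch: {print;}


example = [
    ">sp|Q9Y5Q8|TF3C5_HUMAN General transcription factor 3C polypeptide 5 "
    "OS=Homo sapiens GN=GTF3C5 PE=1 SV=2\n",
    "LASJDQSMLASKDNAL\n",
]
result = list(rewrite_headers(example))
print("\n".join(result))
```

To process a real file, the same generator can be fed an open file handle instead of the example list.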
Buffo1.8k wrote:

Save the script as script.py and run it as

python script.py file.fasta

The output is written to file_extracted.fasta and looks like this:

>Q9Y5Q8
LASJDQSMLASKDNAL

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""Rewrite UniProt FASTA headers so that only the accession remains."""

import sys

##########################################################################################
syntax = '''
------------------------------------------------------------------------------------
Usage: python script.py file.fasta
------------------------------------------------------------------------------------
'''
##########################################################################################

if len(sys.argv) != 2:
    print(syntax)
    sys.exit(1)

##########################################################################################

sequences = {}  # accession -> sequence (insertion order is kept in Python 3.7+)
name = None
seq = ""
# Output file name: <input prefix>_extracted.fasta
prefix = sys.argv[1].split('.')[0]

with open(sys.argv[1], 'r') as fasta_seqs:
    for line in fasta_seqs:
        line = line.rstrip('\n')

        if line.startswith('>'):
            # Store the previous record before starting a new one
            if name is not None:
                sequences[name] = seq
            seq = ""
            # >sp|Q9Y5Q8|TF3C5_HUMAN ... -> Q9Y5Q8
            name = line.split('|')[1]
        else:
            seq = seq + line

# Store the last record
if name is not None:
    sequences[name] = seq

with open(prefix + '_' + 'extracted.fasta', 'w') as outfile:
    for key, value in sequences.items():
        outfile.write('>' + key + '\n' + value + '\n')

Feel free to modify it as you need

ADD COMMENTlink written 2.8 years ago by Buffo1.8k

In this case, I would make it a little easier for the user by using the Biopython module:

from Bio import SeqIO

for seq_record in SeqIO.parse('sample.fasta', 'fasta'):
    # seq_record.id is the first word of the header, e.g. sp|Q9Y5Q8|TF3C5_HUMAN
    header = seq_record.id
    UniprotID = '>' + header.split('|')[1]
    ProteinName = '>' + header.split('|')[-1].split(' ')[0]
    seqs = str(seq_record.seq)
    print(UniprotID)  # use ProteinName here to get the entry name instead
    print(seqs)
ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Pallab Bhowmick20

Yes, I know, but personally I don't like to use Biopython, and even less to use print for FASTA files. I think there is a function called write_fasta or something like that in the SeqIO module, isn't there?

ADD REPLYlink written 2.8 years ago by Buffo1.8k

Yes SeqIO.write() exists, or you can use print with the format() function for proper output.

ADD REPLYlink written 2.8 years ago by WouterDeCoster42k
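
As a rough illustration of the `format()` route mentioned above, the following standard-library sketch builds a FASTA record as a string and wraps the sequence at a fixed width. The function name `format_fasta` and the default width of 60 are assumptions chosen for the example; this is not Biopython's `SeqIO.write()`, which handles writing records for you.

```python
def format_fasta(identifier, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` characters.

    Hypothetical helper for illustration; the width of 60 is an assumed default.
    """
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">{}\n{}\n".format(identifier, "\n".join(chunks))


record = format_fasta("Q9Y5Q8", "LASJDQSMLASKDNAL")
print(record, end="")
```

Returning a string rather than printing inside the function keeps the helper usable both with `print` and with a file handle's `write()` method.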