Question: Extracting Gene Ids From Protein Ids
Assa Yeroslaviz (Munich) wrote, 6.8 years ago:

hi,

I have a tab-delimited table of protein IDs that looks like this:

45    FBpp0070037    
46    FBpp0070039;FBpp0070040    
47    FBpp0070041;FBpp0070042;FBpp0070043    
48    FBpp0070044;FBpp0110571    
...

For each of these protein IDs I would like to add the corresponding gene ID (FBgn...) in a third column. The output table should look like this:

45    FBpp0070037                          FBgn001234  
46    FBpp0070039;FBpp0070040              FBgn00094432;FBgn002345   
47    FBpp0070041;FBpp0070042;FBpp0070043  FBgn0001936;FBgn000102;FBgn004527   
48    FBpp0070044;FBpp0110571              FBgn0097234;FBgn00183   
...

I was thinking of using biomaRt, but I couldn't find a way to automate it for all the protein IDs on a line.

I would appreciate your ideas.

Thanks, A.

Rm (Danville, PA) wrote, 6.8 years ago:

Flybase: FBgn <=> FBtr <=> FBpp IDs (fbgn_fbtr_fbpp_*.tsv)

input_file.txt:

FBpp0070037    
FBpp0070039;FBpp0070040    
FBpp0070041;FBpp0070042;FBpp0070043    
FBpp0070044;FBpp0110571

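while read -r LINE; do
    # write the original line followed by a tab
    printf '%s\t' "$LINE" >> out_fbpp2fbgn.txt
    # take the first protein ID on the line (IDs are separated by ";")
    fbpp="$(echo "$LINE" | cut -d";" -f1)"
    # look it up in the mapping file and append the gene ID (column 1)
    grep "$fbpp" fbgn_fbtr_fbpp_fb_2011_10.tsv | awk 'BEGIN {OFS = FS = "\t"} {print $1}' >> out_fbpp2fbgn.txt
done < input_file.txt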

out_fbpp2fbgn.txt:

FBpp0070037     FBgn0010215
FBpp0070039;FBpp0070040 FBgn0052230
FBpp0070041;FBpp0070042;FBpp0070043     FBgn0000258
FBpp0070044;FBpp0110571 FBgn0053217

ftp://ftp.flybase.net/releases/current/precomputed_files/genes/fbgn_fbtr_fbpp_fb_2011_10.tsv.gz
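
The mapping file is distributed gzip-compressed. Here is a minimal Python sketch that reads it without decompressing it first, assuming the layout described in this thread (column 1 is the FBgn gene ID, column 3 is the FBpp protein ID, header lines start with `#`). The function name `load_fbpp_to_fbgn` is hypothetical.

```python
import gzip


def load_fbpp_to_fbgn(path):
    """Return a dict mapping protein ID (column 3) to gene ID (column 1)."""
    mapping = {}
    # "rt" opens the gzip stream in text mode, so lines are str rather than bytes
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            cols = line.rstrip("\n").split("\t")
            # rows without a protein ID in column 3 are skipped
            if len(cols) > 2 and cols[2]:
                mapping[cols[2]] = cols[0]
    return mapping
```

Calling `load_fbpp_to_fbgn("fbgn_fbtr_fbpp_fb_2011_10.tsv.gz")` on the downloaded file should give a dictionary that can be queried once per protein ID, which avoids running `grep` over the whole file for every input line.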


The file is good, but not exactly what I was looking for. Thanks.

(reply by Assa Yeroslaviz, 6.8 years ago)

Can you be more specific?

(reply by Rm, 6.8 years ago)

My problem is not getting the data from biomaRt, but getting it while keeping the structure of the table. If I run the column as one ID per line, it will be difficult to match the gene IDs back to the right protein IDs.

(reply by Assa Yeroslaviz, 6.8 years ago)

See the edited answer, which now includes your request.

(reply by Rm, 6.8 years ago)

scapella (Barcelona, Spain) wrote, 6.8 years ago:

Hi,

I'd suggest using a Python script (see below) to do the parsing. The code works according to what you have said, and it removes duplicate IDs within the same line.

Hope this can help you.

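import sys

# Input file name is taken from the command line
inFile = sys.argv[1]

for line in open(inFile):
  fields = line.split()
  if len(fields) < 2:
    continue  # skip lines without an ID column

  # Split the ID column and separate gene IDs (FBgn...) from the rest
  ids = [singleID.strip() for singleID in fields[1].split(";")]
  genes = set(singleID for singleID in ids if singleID.startswith("FBgn"))
  others = set(ids) - genes

  print("%s\t%s\t%s" % (fields[0], ";".join(sorted(others)), ";".join(sorted(genes))))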

Damian Kao (USA) wrote, 6.8 years ago:

You can download the index file that Rm suggested. It has three columns: the first column is the gene ID and the third column is the protein ID. Then use a Python script that reads the index file and translates your protein ID list. Something like this:

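import sys

# Build a protein ID -> gene ID lookup from the index file
index = {}
with open(sys.argv[1]) as indexFile:
    for line in indexFile:
        if not line.startswith("#"):
            data = line.strip().split('\t')
            if len(data) > 2:
                index[data[2]] = data[0]

# Translate every semicolon-separated protein ID on each line
with open(sys.argv[2]) as pidFile:
    for line in pidFile:
        if not line.strip():
            continue  # skip blank lines
        data = line.strip().split(';')
        # look up each ID on the line, not only the first one
        output = ';'.join(index[item] for item in data)
        print(line.strip() + "\t" + output)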

Save it as myScript.py and run it with: python myScript.py indexFile proteinIDFile
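
The table in the question has a leading row-number column in front of the protein IDs, which the script above does not expect. Here is a minimal sketch for that layout, assuming a tab-separated row number followed by a semicolon-separated ID list, and an `index` dict (protein ID to gene ID) like the one built above. The function name `annotate_row` and the `NA` placeholder for unmapped IDs are illustrative choices, not part of the original answer.

```python
def annotate_row(line, index, missing="NA"):
    """Append a gene-ID column to one 'number<TAB>id;id;...' row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return line.rstrip("\n")  # nothing to annotate on this row
    number, protein_field = fields[0], fields[1].strip()
    proteins = [p for p in protein_field.split(";") if p]
    # one gene ID per protein ID, in the same order, so the pairing is kept
    genes = [index.get(p, missing) for p in proteins]
    return "\t".join([number, ";".join(proteins), ";".join(genes)])


# Toy index with two pairs taken from the example output earlier in this thread
index = {"FBpp0070037": "FBgn0010215", "FBpp0070039": "FBgn0052230"}
print(annotate_row("46\tFBpp0070039;FBpp0070040\t\n", index))
```

Because each row is split, mapped and re-joined on its own, the gene IDs stay on the same line and in the same order as their protein IDs, which is the table structure asked for in the question.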
