Question: scripting problem to use ProtComp output to identify secretome proteins
0
gravatar for rob234king
6.2 years ago by
rob234king600
UK/Harpenden/Rothamsted Research
rob234king600 wrote:

To identify secretome proteins, one of the end stages of the pipeline requires a scripting solution.

 

I have a file with protein sequence names (one per line) of the same order as the protein sequence analysis tables in another file, shown below. What I need to do is if the sequence name from the file2 has a number above 0 in either the LocDB or PotLocDB column excluding the "Extracellular" row to delete that sequence ID from file1 and if possible produce a file like file3. I then have a secretome protein id file list.

 

file3

14024    M            3.0          En           2.0         

 

File1:

14024

13025

 

File2:

Seq name: 14024, Length=261
Nuclear 0.0     0.0     0.00    0.00    0.26
Plasma membrane 0.0     0.0     0.76    0.39    3.70
Extracellular   0.0     0.0     0.32    0.13    1.06
Cytoplasmic     0.0     0.0     0.03    0.10    0.61
Mitochondrial   3.0     0.0     1.08    0.43    2.90
Endoplasm. retic.       2.0     0.0     0.22    0.37    0.43
Peroxisomal     0.0     0.0     0.06    0.10    0.22
Lysosomal       0.0     0.0     0.08    0.11    0.00
Golgi   0.0     0.0     0.18    1.12    0.57
Vacuolar        0.0     0.0     0.28    0.39    0.25

Seq name: 13025, Length=247
Nuclear 0.0     0.0     1.63    0.00    0.77
Plasma membrane 0.0     0.0     0.06    0.48    0.93
Extracellular   0.0     0.0     0.19    0.35    0.87
Cytoplasmic     0.0     0.0     0.26    0.04    0.18
Mitochondrial   0.0     0.0     1.16    1.34    6.67
Endoplasm. retic.       0.0     0.0     0.09    0.27    0.00
Peroxisomal     0.0     0.0     0.01    0.04    0.07
Lysosomal       0.0     0.0     0.00    0.13    0.03
Golgi   0.0     0.0     0.04    0.21    0.48
Vacuolar        0.0     0.0     0.00    0.21    0.00

 

 

This is the beginning of some bad code but not finished yet which why trying on here for some people who can script better than me

#!/usr/bin/perl -w

use strict;

my $idsfile = "RRESID.txt";
my $seqfile = "ProtComp_RRES_ALL_correct2.txt";
my %ids  = ();



open FILE2, $idsfile;
while(<FILE2>) {
open FILE, $seqfile;
while(<FILE>) {

for (my $i=0; $i <= 10; $i++) {

  chomp;
my @col = split /\t/;
if($col[1]>0){
if(i=0){

}elsif(i=1){

}elsif(i=2){

}elsif(i=3){

}elsif(i=4){

}elsif(i=5){

}elsif(i=6){

}elsif(i=7){

}elsif(i=8){

}elsif(i=9){

}


if($col[2]>0){
if(i=0){

}elsif(i=1){

}elsif(i=2){

}elsif(i=3){

}elsif(i=4){

}elsif(i=5){

}elsif(i=6){

}elsif(i=7){

}elsif(i=8){

}elsif(i=9){

}

}
$counter1=$counter1+1;
}

}
close FILE;
$counter1=$counter1+3;
}
close FILE2;
unix python perl • 2.3k views
ADD COMMENTlink modified 6.2 years ago by russhh5.5k • written 6.2 years ago by rob234king600
1

could you indicate what you have tried, please

ADD REPLYlink written 6.2 years ago by russhh5.5k

updated what I have so far but I'm just beginning and would take me a week as I'm slow at this :( I'll update what I have Friday and hopefully made more progress with it.I should say im posting to see if there is a simple script, command line code that I am not aware. I could only think of two loops using perl.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by rob234king600
2
gravatar for russhh
6.2 years ago by
russhh5.5k
UK, U. Glasgow
russhh5.5k wrote:

I'm not giving you the code, however, in perl (though I probably wouldn't do it this way anymore):

You don't really need the file1 mentioned in your original post (I noticed it has the handle FILE2 in your code, which is a bit confusing..). I'd probably reconsider using all those ifelses, you could split each dataline into the cellular space id and an array of numbers. Sum the numbers you're interested in and compare to 0.

set up two variables $current_seq and $skip_flag. Determine what the current sequences name is from the first line of a given block in file2 and store it; if you hit a line that says 'I ought to disregard the current_seq id because it's present in the wrong part of the cell' set skip_flag to 1 - you can use this indicator to tell you that you don't need to consider anymore of the lines for the current_seq (so you don't print the same id multiple times) and that you don't need to ... Print the current_seq id when you get to a blank line, (and also reset current_seq and skip_flag).

Make sure you don't lose the final entry of your file

ADD COMMENTlink written 6.2 years ago by russhh5.5k

I went for a bit of a hack for the moment but I am looking to follow your suggestions to implement better code. Thanks

ADD REPLYlink written 6.2 years ago by rob234king600
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1310 users visited in the last hour