How can I convert the blast results into a matrix?
5
2
Entering edit mode
8.2 years ago
kandoigaurav ▴ 150

I've a table of blast results with ~5hits/query protein:

Protein     Class
ProtA       1
ProtA       1
ProtA       1
ProtA       0
ProtA       1
ProtB       1
ProtB       1
ProtB       0
ProtB       0
ProtB       1


I would like to convert this into a feature vector matrix like this:

Protein     Class1     Class2     Class3     Class4     Class5
ProtA       1          1          1          0          1
ProtB       1          1          0          0          1


Can someone suggest me an efficient way to do this, since I've ~2300k hits in the file.

perl linux python blast • 4.2k views
0
Entering edit mode

What is "Class"?

0
Entering edit mode

I've converted the hits to 1's and 0's to represent whether or not they are enzyme/non-enzymes. For each query protein, I've ~5hits and therefore, I can use these 5hits as 5 training features.

5
Entering edit mode
8.2 years ago
komal.rathi ★ 4.0k

For any number of hits per protein:

> tmp
protein class
ProtA     1
ProtA     1
ProtA     1
ProtA     0
ProtA     1
ProtB     1
ProtB     1
ProtB     0
ProtB     0
ProtC     1
ProtD     1
ProtD     0

library(reshape2)
library(plyr)
library(doMC)
registerDoMC(8) # assign 8 cores for ddply to use

# define function myfunc
myfunc <- function(x)
{
dim(x)
x$name = paste("Class",1:nrow(x)) dcast(x,protein~name,value.var = "class") } # call myfunc in ddply. run ddply in parallel mode to use >1 cores res = ddply(.data = tmp,.variables = "protein",.fun = function(x) myfunc(x),.parallel=T) # output > res protein Class 1 Class 2 Class 3 Class 4 Class 5 ProtA 1 1 1 0 1 ProtB 1 1 0 0 NA ProtC 1 NA NA NA NA ProtD 1 0 NA NA NA  ADD COMMENT 0 Entering edit mode Thanks a lot for that. All I need to do is feed in my table file to R as dat$name and it'll take care of the rest?

1
Entering edit mode

No you need to read your table in dat and then run the above commands. I will update my answer.

0
Entering edit mode

Hey, it gives me the following error:

Error in '$<-.data.frame'(*tmp*, "name", value = c("Class1", "Class2", : replacement has 5 rows, data has 2278916  ADD REPLY 0 Entering edit mode The R code is incorrect. This line: dat$name = paste("Class",1:5,sep="")


tries to create a column with the header "name" containing 5 rows with values "Class1", "Class2"..."Class 5". The error occurs because you have more than 5 existing rows.

1
Entering edit mode

This code worked using the example that he has provided. I thought each protein has 5 hits. If each protein has 5 hits, then this code should work. I misunderstood the question because of the example.

0
Entering edit mode

Hi Komal,

As I checked through the blast results, there are few queries with more or less than 5 hits as well. I don't know how that happened, since I've used the max_target_seqs option while using BLAST.

1
Entering edit mode

I am trying to rectify it.

1
Entering edit mode

I have updated my answer. Please try that and let me know if its ok.

0
Entering edit mode

Thanks for the update. It now gives me the following error:

Error in eval(expr, envir, enclos) : could not find functional "myfunc"

0
Entering edit mode
Did you define the function before calling it in ddply?
0
Entering edit mode

Here is what I did:

> tmp = read.table('all-vs-all.txt',header = T)
> myfunc <- function(x)
+ {
+   dim(x)
+   x$name = paste("Class",1:nrow(x)) + dcast(x,Protein~name,value.var="Class") + } > > res = ddply(.data = tmp,.variables = "Protein",.fun = function(x) myfunc(x)) Error: could not find function "dcast"  ADD REPLY 1 Entering edit mode dcast is part of the reshape2 package. Install that package first. ADD REPLY 0 Entering edit mode Yes, I forgot to load the reshape2 package. I tried the script on the sample data which you've used and it seemed to work fine. It is running on my data. SInce the data is pretty large, it's too memory intensive as well. I'll let you know of the output, when it completes executing. Thanks a lot for your continuous help. Really appreciate that :) ADD REPLY 1 Entering edit mode I will update my answer with a more reasonable solution I missed that your data has >2300K hits. ADD REPLY 0 Entering edit mode That would be very helpful of you. Thanks! ADD REPLY 1 Entering edit mode I have updated my answer, you can use ddply in a parallel mode. I have added the parallel argument in the ddply function call. If you are running this on a cluster, you can give a reasonable number of cores. In the example, ddply uses 8 processors. You can change the number of cores in the registerDoMC() call. ADD REPLY 0 Entering edit mode I'm on a windows system, and doMC works only in Unix based systems. Is there any substitute for windows system? ADD REPLY 0 Entering edit mode Sorry, I am not really familiar with Windows based systems. You can search for an alternative, or use a single core. ADD REPLY 0 Entering edit mode That's okay. I left it running on my system. Will check back in the morning when I go lab. Thanks again. ADD REPLY 0 Entering edit mode Sorry, I seemed to have missed to load reshape2 package. It's running as of now. I'll let you know the output when it finishes. ADD REPLY 0 Entering edit mode Apologies, you are correct; the 1:5 is recycled and should fill all rows if nrow() is a multiple of 5. I guess the error comes from the dcast(). ADD REPLY 2 Entering edit mode 8.2 years ago Siva ★ 1.8k Using awk grouping awk -F"\t" '{if(a[$1])a[$1]=a[$1]"\t"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="\t" YOUR_FILE outputs ProtB 1 1 0 0 1 ProtA 1 1 1 0 1 I excluded the first/header line in the input. You can sort the rows if you want in any particular order. ADD COMMENT 0 Entering edit mode Hi Siva It seem to do the work, but with some errors. It misses out few entries every now or then, and thus doesn't combine all the hits from the table to the matrix. For example: There are 5 entries for sp|Q9UKW4|VAV3_HUMAN in the table, but in the output, there is only 1. ADD REPLY 0 Entering edit mode I checked the awk script with some examples I created. I noticed that whenever the class value of the first entry (of the 5 entries) is zero, it skips that entry. So, I modified the script to treat the class values as strings and it seems to work. Can you check the modified script? awk -F"\t" '{if(a[$1])a[$1]=a[$1]"\t"$2; else a[$1]=toupper($2);}END{for (i in a)print i, a[i];}' OFS="\t" YOUR_FILE ADD REPLY 0 Entering edit mode Thanks Siva. It seems to work fine now. ADD REPLY 2 Entering edit mode 8.2 years ago Prakki Rama ★ 2.6k Lets do it in Perl; Just change the filename in the first line, you should get the required output. open FH,"makeintomatrix.txt"; my$dummy=<FH>; ##skip header
while(<FH>)
{
chomp($_); # remove new line #print "$_";
@array=split;
#print "$array[0]:$array[1]\n";
push @{ $HoA{$array[0]} }, "$array[1]"; ##hashes ofarray } print "Protein\tClass1\tClass2\tClass3\tClass4\tClass5\n"; foreach$k (sort keys %HoA) ##print the hash, with all the values
{
print "$k\t"; if("$#{$HoA{$k}}"==4)
{
foreach $tmp (@{$HoA{$k} }) { print "\t$tmp";
}
print "\n";
}
else
{
$fill=4-"$#{$HoA{$k}}";
#print "should be filled $fill times\n"; foreach (@{$HoA{$k} }) { print "\t$_";
}
for($i=1;$i<=$fill;$i++)
{
print "\t0";
}
print "\n";

}
}

close(FH);


_____Data_______

Protein    Class
ProtA    1
ProtA    1
ProtA    1
ProtA    0
ProtA    1
ProtB    1
ProtB    1
ProtC    1
ProtC    1
ProtC    1
ProtD    0
ProtD    1
ProtD    1
ProtD    1


_____Output_______

perl makematrix.pl
Protein    Class1    Class2    Class3    Class4    Class5
ProtA        1    1    1    0    1
ProtB        1    1    0    0    0
ProtC        1    1    1    0    0
ProtD        0    1    1    1    0

0
Entering edit mode

Thanks Prakki. It does seem to have worked.

1
Entering edit mode
8.2 years ago
Chris S. ▴ 320

Using dcast.data.table should only take a couple of seconds in R.

library(data.table)
library(reshape2)
n <-460000
tmp <- data.frame(Protein=rep(paste("Prot", sprintf("%06d", 1:n), sep=""), each=5), Hit=sample(c(0,1),n*5, TRUE))

dt <- data.table(tmp)
dt[ , Class := 1:.N , by = Protein ]

Protein Hit Class
1: Prot000001   0     1
2: Prot000001   0     2
3: Prot000001   1     3
4: Prot000001   0     4
5: Prot000001   1     5
6: Prot000002   1     1

nrow(dt)
[1] 2300000

system.time( x <- dcast(dt, Protein~Class, value.var="Hit") )

user  system elapsed
3.173   0.025   3.200

Protein 1 2 3 4 5
1 Prot000001 0 0 1 0 1
2 Prot000002 1 0 1 1 0
3 Prot000003 1 0 0 1 0
4 Prot000004 1 1 0 0 1
5 Prot000005 0 0 1 1 1
6 Prot000006 0 0 0 0 0

# And to summarize strings using a data.frame

system.time( y <- table(apply(x[, 2:6], 1, paste, collapse="")) )
user  system elapsed
10.863   0.018  10.882

#or convert back to data.table if ~10 seconds is too long ...

x <- data.table(x)
x[, all:=do.call(paste0,.SD), .SDcols=-1]
table(x\$all)

00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100
14254 14392 14428 14219 14259 14178 14430 14422 14435 14465 14385 14403 14350
01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001
14397 14300 14333 14265 14515 14197 14638 14166 14336 14501 14361 14395 14462
11010 11011 11100 11101 11110 11111
14251 14530 14326 14464 14475 14468
0
Entering edit mode

+1 for using data.table. I am not very comfortable with it but should start using it more often now.

0
Entering edit mode
8.2 years ago

If you have saved your result as a XML file, I wrote a XSLT stylesheet transforming the XML to a HTML matrix:

xsltproc --novalid \
blast2matrix.xsl \
blastn.xml > blast.html

0
Entering edit mode

Thank you, but I've saved my data in a tabular form.