Question

How can I convert the blast results into a matrix?

2

Entering edit mode

9.4 years ago

kandoigaurav ▴ 150

I've a table of blast results with ~5hits/query protein:

Protein     Class
ProtA       1
ProtA       1
ProtA       1
ProtA       0
ProtA       1
ProtB       1
ProtB       1
ProtB       0
ProtB       0
ProtB       1

I would like to convert this into a feature vector matrix like this:

Protein     Class1     Class2     Class3     Class4     Class5
ProtA       1          1          1          0          1
ProtB       1          1          0          0          1

Can someone suggest me an efficient way to do this, since I've ~2300k hits in the file.

perl linux python blast • 5.7k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

What is "Class"?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I've converted the hits to 1's and 0's to represent whether or not they are enzyme/non-enzymes. For each query protein, I've ~5hits and therefore, I can use these 5hits as 5 training features.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

9.4 years ago

Chris S. ▴ 320

Using dcast.data.table should only take a couple of seconds in R.

library(data.table)
library(reshape2)
n <-460000
tmp <- data.frame(Protein=rep(paste("Prot", sprintf("%06d", 1:n), sep=""), each=5), Hit=sample(c(0,1),n*5, TRUE))

dt <- data.table(tmp)
# add counter
dt[ , Class := 1:.N , by = Protein ]

head(dt)

      Protein Hit Class
1: Prot000001   0     1
2: Prot000001   0     2
3: Prot000001   1     3
4: Prot000001   0     4
5: Prot000001   1     5
6: Prot000002   1     1

 nrow(dt)
[1] 2300000

system.time( x <- dcast(dt, Protein~Class, value.var="Hit") )

   user  system elapsed
  3.173   0.025   3.200

head(x)
     Protein 1 2 3 4 5
1 Prot000001 0 0 1 0 1
2 Prot000002 1 0 1 1 0
3 Prot000003 1 0 0 1 0
4 Prot000004 1 1 0 0 1
5 Prot000005 0 0 1 1 1
6 Prot000006 0 0 0 0 0

# And to summarize strings using a data.frame

system.time( y <- table(apply(x[, 2:6], 1, paste, collapse="")) )
   user  system elapsed
10.863   0.018  10.882

#or convert back to data.table if ~10 seconds is too long ...

x <- data.table(x)
x[, all:=do.call(paste0,.SD), .SDcols=-1]
table(x$all)

00000 00001 00010 00011 00100 00101 00110 00111 01000 01001 01010 01011 01100
14254 14392 14428 14219 14259 14178 14430 14422 14435 14465 14385 14403 14350
01101 01110 01111 10000 10001 10010 10011 10100 10101 10110 10111 11000 11001
14397 14300 14333 14265 14515 14197 14638 14166 14336 14501 14361 14395 14462
11010 11011 11100 11101 11110 11111
14251 14530 14326 14464 14475 14468

ADD COMMENT • link 9.4 years ago by Chris S. ▴ 320

0

Entering edit mode

+1 for using data.table. I am not very comfortable with it but should start using it more often now.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

9.4 years ago

Pierre Lindenbaum 161k

If you have saved your result as a XML file, I wrote a XSLT stylesheet transforming the XML to a HTML matrix:

xsltproc --novalid \
   blast2matrix.xsl \
    blastn.xml > blast.html

ADD COMMENT • link 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, but I've saved my data in a tabular form.

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

Ram · Accepted Answer · 2014-11-24

5

Entering edit mode

9.4 years ago

komal.rathi ★ 4.1k

For any number of hits per protein:

> tmp
protein class
ProtA     1
ProtA     1
ProtA     1
ProtA     0
ProtA     1
ProtB     1
ProtB     1
ProtB     0
ProtB     0
ProtC     1
ProtD     1
ProtD     0

# install & load packages
library(reshape2)
library(plyr)
library(doMC)
registerDoMC(8) # assign 8 cores for ddply to use

# define function myfunc
myfunc <- function(x)
{
  dim(x)
  x$name = paste("Class",1:nrow(x))
  dcast(x,protein~name,value.var = "class")
}

# call myfunc in ddply. run ddply in parallel mode to use >1 cores
res = ddply(.data = tmp,.variables = "protein",.fun = function(x) myfunc(x),.parallel=T)

# output
> res
protein Class 1 Class 2 Class 3 Class 4 Class 5
ProtA       1       1       1       0       1
ProtB       1       1       0       0      NA
ProtC       1      NA      NA      NA      NA
ProtD       1       0      NA      NA      NA

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Thanks a lot for that. All I need to do is feed in my table file to R as dat$name and it'll take care of the rest?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

No you need to read your table in dat and then run the above commands. I will update my answer.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Hey, it gives me the following error:

Error in '$<-.data.frame'(`*tmp*`, "name", value = c("Class1", "Class2", :
replacement has 5 rows, data has 2278916

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

The R code is incorrect. This line:

dat$name = paste("Class",1:5,sep="")

tries to create a column with the header "name" containing 5 rows with values "Class1", "Class2"..."Class 5". The error occurs because you have more than 5 existing rows.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by Neilfws 49k

1

Entering edit mode

This code worked using the example that he has provided. I thought each protein has 5 hits. If each protein has 5 hits, then this code should work. I misunderstood the question because of the example.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Hi Komal,

As I checked through the blast results, there are few queries with more or less than 5 hits as well. I don't know how that happened, since I've used the max_target_seqs option while using BLAST.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

I am trying to rectify it.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

1

Entering edit mode

I have updated my answer. Please try that and let me know if its ok.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Thanks for the update. It now gives me the following error:

Error in eval(expr, envir, enclos) : could not find functional "myfunc"

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

Did you define the function before calling it in ddply?

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Here is what I did:

> tmp = read.table('all-vs-all.txt',header = T)
> myfunc <- function(x)
+ {
+   dim(x)
+   x$name = paste("Class",1:nrow(x))
+   dcast(x,Protein~name,value.var="Class")
+ }
>
> res = ddply(.data = tmp,.variables = "Protein",.fun = function(x) myfunc(x))

Error: could not find function "dcast"

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

dcast is part of the reshape2 package. Install that package first.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Yes, I forgot to load the reshape2 package. I tried the script on the sample data which you've used and it seemed to work fine.

It is running on my data. SInce the data is pretty large, it's too memory intensive as well. I'll let you know of the output, when it completes executing.

Thanks a lot for your continuous help. Really appreciate that :)

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

I will update my answer with a more reasonable solution I missed that your data has >2300K hits.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

That would be very helpful of you. Thanks!

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

1

Entering edit mode

I have updated my answer, you can use ddply in a parallel mode. I have added the parallel argument in the ddply function call. If you are running this on a cluster, you can give a reasonable number of cores. In the example, ddply uses 8 processors. You can change the number of cores in the registerDoMC() call.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

I'm on a windows system, and doMC works only in Unix based systems. Is there any substitute for windows system?

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

Sorry, I am not really familiar with Windows based systems. You can search for an alternative, or use a single core.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

That's okay. I left it running on my system. Will check back in the morning when I go lab.

Thanks again.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

Sorry, I seemed to have missed to load reshape2 package. It's running as of now. I'll let you know the output when it finishes.

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

Apologies, you are correct; the 1:5 is recycled and should fill all rows if nrow() is a multiple of 5. I guess the error comes from the dcast().

ADD REPLY • link 9.4 years ago by Neilfws 49k

Ram · Accepted Answer · 2014-11-24

2

Entering edit mode

9.4 years ago

Siva ★ 1.9k

Using awk grouping

awk -F"\t" '{if(a[$1])a[$1]=a[$1]"\t"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="\t" YOUR_FILE

outputs

ProtB     1    1    0    0    1
ProtA     1    1    1    0    1

I excluded the first/header line in the input. You can sort the rows if you want in any particular order.

Modified from: http://www.theunixschool.com/2012/06/awk-10-examples-to-group-data-in-csv-or.html

ADD COMMENT • link 9.4 years ago by Siva ★ 1.9k

0

Entering edit mode

Hi Siva

It seem to do the work, but with some errors. It misses out few entries every now or then, and thus doesn't combine all the hits from the table to the matrix.

For example:

There are 5 entries for sp|Q9UKW4|VAV3_HUMAN in the table, but in the output, there is only 1.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by kandoigaurav ▴ 150

0

Entering edit mode

I checked the awk script with some examples I created. I noticed that whenever the class value of the first entry (of the 5 entries) is zero, it skips that entry. So, I modified the script to treat the class values as strings and it seems to work. Can you check the modified script?

awk -F"\t" '{if(a[$1])a[$1]=a[$1]"\t"$2; else a[$1]=toupper($2);}END{for (i in a)print i, a[i];}' OFS="\t" YOUR_FILE

ADD REPLY • link 9.4 years ago by Siva ★ 1.9k

0

Entering edit mode

Thanks Siva. It seems to work fine now.

ADD REPLY • link 9.4 years ago by kandoigaurav ▴ 150

Ram · Accepted Answer · 2014-11-25

Lets do it in Perl; Just change the filename in the first line, you should get the required output.

open FH,"makeintomatrix.txt";
my $dummy=<FH>; ##skip header
while(<FH>)
{
    chomp($_); # remove new line
    #print "$_";
        @array=split;
    #print "$array[0]:$array[1]\n";    
    push @{ $HoA{$array[0]} }, "$array[1]"; ##hashes ofarray

}
print "Protein\tClass1\tClass2\tClass3\tClass4\tClass5\n";
foreach $k (sort keys %HoA) ##print the hash, with all the values
    {
    print "$k\t";

    if("$#{$HoA{$k}}"==4)
    {
        foreach $tmp (@{ $HoA{$k} })
        {
        print "\t$tmp";
        }
        print "\n";
    }
    else
    {
        $fill=4-"$#{$HoA{$k}}";
        #print "should be filled $fill times\n";
        foreach (@{ $HoA{$k} })
        {
        print "\t$_";
        }
        for($i=1;$i<=$fill;$i++)
        {
        print "\t0";
        }
        print "\n";

    }
    }

close(FH);

_____Data_______

Protein    Class
ProtA    1
ProtA    1
ProtA    1
ProtA    0
ProtA    1
ProtB    1
ProtB    1
ProtC    1
ProtC    1
ProtC    1
ProtD    0
ProtD    1
ProtD    1
ProtD    1

_____Output_______

perl makematrix.pl
Protein    Class1    Class2    Class3    Class4    Class5
ProtA        1    1    1    0    1
ProtB        1    1    0    0    0
ProtC        1    1    1    0    0
ProtD        0    1    1    1    0