Question

Need help with R-script for data sorting of tBLASTN result

8.9 years ago

mjoyraj ▴ 80

I got Table A from tBLASTN; a small segment of it is given below. I want to filter the data: wherever several rows share the same values in column 2 (subject id), column 9 (s. start) and column 10 (s. end), i.e. the rows are redundant, keep only the row with the lowest e-value. Can anybody help me with an R script for this?

query id            subject id      % identity  alignment length  mismatches  gap opens  q. start  q. end  s. start  s. end  evalue    bit score
Chr1_FK1            ADDD02134481.1  89.77       88                9           0          11        98      1         264     7.00E-23  92.4
Chr2_FK1            ADDD02134481.1  75          88                22          0          11        98      1         264     3.00E-20  85.5
Chr2_FK3            ADDD02134481.1  76.14       88                21          0          11        98      1         264     6.00E-21  87.4
ENSGALG00000028120  ADDD02134481.1  76.14       88                21          0          11        98      1         264     5.00E-21  87.4
Chr2_FK1            ADDD02198275.1  78.41       88                19          0          11        98      1         264     3.00E-21  87.4
Chr2_FK3            ADDD02198275.1  79.55       88                18          0          11        98      1         264     5.00E-22  89.7
ENSGALG00000028120  ADDD02198275.1  78.41       88                19          0          11        98      1         264     4.00E-22  89.7
ChrUn2_FK2          ADDD02198275.1  78.41       88                19          0          11        98      1         264     2.00E-21  87.8
ChrUn2_FK3          ADDD02198275.1  78.41       88                19          0          11        98      1         264     3.00E-21  87.4
ChrUn2_FK4          ADDD02198275.1  79.55       88                18          0          11        98      1         264     5.00E-22  89.7
ENSGALG00000027303  ADDD02271118.1  89.69       97                10          0          4         100     1         291     3.00E-41  139
Chr27_FK34          ADDD02271118.1  88.66       97                11          0          4         100     1         291     5.00E-40  136
Chr27_FK35          ADDD02271118.1  88.66       97                11          0          4         100     1         291     1.00E-40  137
Chr27_FK36          ADDD02271118.1  88.66       97                11          0          4         100     1         291     1.00E-40  137

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by mjoyraj ▴ 80

Ram · Answer 1 · 2015-06-06

8.9 years ago

Ram 43k

I'd do this with a combination of doBy's summaryBy() and then a merge across data frames.

install.packages('doBy')
library('doBy')

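# Group with "+" (not commas); keep.names=TRUE keeps the column name "evalue".
min.dataset<-summaryBy(evalue ~ subject.id + s.start + s.end,data=tblastn.results,FUN=min,keep.names=TRUE)
# Merging on evalue too keeps only the rows holding each group's minimum.
results<-merge(min.dataset,tblastn.results,by=c("subject.id","s.start","s.end","evalue"))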

Explanation:

Step 1 (summaryBy) groups the data by the 3 factors and picks the minimum e-value. For singleton groups, this operation makes no difference.

Step 2 (merge) joins the original dataset with this minimal set to filter and pick relevant rows.
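The two steps can also be sketched in base R, without doBy. This is only an illustration: it swaps summaryBy() for aggregate(), and the toy data frame and its column names (query.id, subject.id, s.start, s.end, evalue) are assumptions, not the names in the real file.

```r
# Toy subset of the table in the question; column names are assumed.
tblastn.results <- data.frame(
  query.id   = c("Chr1_FK1", "Chr2_FK1", "Chr2_FK3",
                 "ENSGALG00000027303", "Chr27_FK34"),
  subject.id = c("ADDD02134481.1", "ADDD02134481.1", "ADDD02134481.1",
                 "ADDD02271118.1", "ADDD02271118.1"),
  s.start    = c(1, 1, 1, 1, 1),
  s.end      = c(264, 264, 264, 291, 291),
  evalue     = c(7e-23, 3e-20, 6e-21, 3e-41, 5e-40),
  stringsAsFactors = FALSE
)

# Step 1: minimum e-value per (subject.id, s.start, s.end) group.
min.dataset <- aggregate(evalue ~ subject.id + s.start + s.end,
                         data = tblastn.results, FUN = min)

# Step 2: merge on the grouping columns plus evalue, so only rows
# carrying their group's minimum e-value survive.
results <- merge(min.dataset, tblastn.results,
                 by = c("subject.id", "s.start", "s.end", "evalue"))
print(results)
```

If two rows in a group tie for the lowest e-value, both are kept; break the tie on another column (for example bit score) if exactly one row per group is needed.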

ADD COMMENT • link 15 months ago by Ram 43k

I used the following script, and it gives the error shown below:

tblastn.results <- read.table(file.choose(), header=TRUE, sep=",")
install.packages("doBy")
library('doBy')
min.dataset <- summaryBy(evalue ~ subject.id,s.start,s.end,data=tblastn.results,FUN=min)


Error in .get_variables(formula, data, id, debug.info) :
  object 's.start' not found

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by mjoyraj ▴ 80

Please check the headers once you import the data and tweak the query, substituting the header names in my query with the ones actually found in the data frame.
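A minimal sketch of that check, assuming the file is comma-separated and has the headers shown in the question. By default read.table() runs headers through make.names(), so a header such as "s. start" may not come out as s.start. The two-line csv.text stand-in and the replacement names are assumptions for illustration only.

```r
# Two-line stand-in for the real file; header text copied from the question.
csv.text <- "query id,subject id,s. start,s. end,evalue
Chr1_FK1,ADDD02134481.1,1,264,7.00E-23"

tblastn.results <- read.table(text = csv.text, header = TRUE, sep = ",")

# Inspect the names R actually assigned on import.
imported.names <- names(tblastn.results)
print(imported.names)

# Rename to the names used in the formula (target names are assumed).
names(tblastn.results) <- c("query.id", "subject.id", "s.start", "s.end", "evalue")
```

Note also that the grouping variables in a summaryBy() formula are joined with +, as in evalue ~ subject.id + s.start + s.end. Separated by commas, they are passed as extra arguments rather than as part of the formula.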

ADD REPLY • link 15 months ago by Ram 43k