Need help with R-script for data sorting of tBLASTN result
8.9 years ago
mjoyraj ▴ 80

I have a table (Table A) of tBLASTN results; a small segment of it is given below. I want to filter it: where rows share the same values in column 2 (subject id), column 9 (s. start) and column 10 (s. end), i.e., where the rows are redundant, I want to keep only the row with the lowest e-value. Can anybody help me with an R script for this?

query id            subject id       % identity    alignment length   mismatches   gap opens   q. start   q. end   s. start   s. end   evalue     bit score
Chr1_FK1            ADDD02134481.1  89.77       88                9           0          11        98      1         264     7.00E-23  92.4
Chr2_FK1            ADDD02134481.1  75          88                22          0          11        98      1         264     3.00E-20  85.5
Chr2_FK3            ADDD02134481.1  76.14       88                21          0          11        98      1         264     6.00E-21  87.4
ENSGALG00000028120  ADDD02134481.1  76.14       88                21          0          11        98      1         264     5.00E-21  87.4
Chr2_FK1            ADDD02198275.1  78.41       88                19          0          11        98      1         264     3.00E-21  87.4
Chr2_FK3            ADDD02198275.1  79.55       88                18          0          11        98      1         264     5.00E-22  89.7
ENSGALG00000028120  ADDD02198275.1  78.41       88                19          0          11        98      1         264     4.00E-22  89.7
ChrUn2_FK2          ADDD02198275.1  78.41       88                19          0          11        98      1         264     2.00E-21  87.8
ChrUn2_FK3          ADDD02198275.1  78.41       88                19          0          11        98      1         264     3.00E-21  87.4
ChrUn2_FK4          ADDD02198275.1  79.55       88                18          0          11        98      1         264     5.00E-22  89.7
ENSGALG00000027303  ADDD02271118.1  89.69       97                10          0          4         100     1         291     3.00E-41  139
Chr27_FK34          ADDD02271118.1  88.66       97                11          0          4         100     1         291     5.00E-40  136
Chr27_FK35          ADDD02271118.1  88.66       97                11          0          4         100     1         291     1.00E-40  137
Chr27_FK36          ADDD02271118.1  88.66       97                11          0          4         100     1         291     1.00E-40  137
8.9 years ago
Ram 43k

I'd do this with a combination of doBy's summaryBy() and then a merge across data frames.

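install.packages('doBy')
library('doBy')

# Minimum e-value per (subject.id, s.start, s.end) group; grouping variables are joined with "+".
# keep.names=TRUE keeps the result column named "evalue" so it can be used as a merge key.
min.dataset<-summaryBy(evalue ~ subject.id + s.start + s.end,data=tblastn.results,FUN=min,keep.names=TRUE)
# Keep only the original rows whose e-value equals the minimum of their group.
results<-merge(min.dataset,tblastn.results,by=c("subject.id","s.start","s.end","evalue"))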

Explanation:

Step 1 (summaryBy) groups the data by the 3 factors and picks the minimum e-value. For singleton groups, this operation makes no difference.

Step 2 (merge) joins the original dataset with this minimal set to filter and pick relevant rows.
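
For anyone who wants to try the two steps without installing doBy, here is a minimal sketch of the same idea in base R. It is not the doBy approach itself: it swaps in ave() for the grouping step and a logical filter for the merge. The toy data frame is a hand-typed subset of the table in the question, and the column names (query.id, subject.id, s.start, s.end, evalue) are assumptions about how the headers look after import.

```r
# Toy subset of the question's table; column names are assumed, not read from a file.
tblastn.results <- data.frame(
  query.id   = c("Chr1_FK1", "Chr2_FK1", "Chr2_FK3", "Chr2_FK1", "Chr2_FK3"),
  subject.id = c(rep("ADDD02134481.1", 3), rep("ADDD02198275.1", 2)),
  s.start    = c(1, 1, 1, 1, 1),
  s.end      = c(264, 264, 264, 264, 264),
  evalue     = c(7e-23, 3e-20, 6e-21, 3e-21, 5e-22),
  stringsAsFactors = FALSE
)

# One key per (subject.id, s.start, s.end) combination.
group.key <- paste(tblastn.results$subject.id,
                   tblastn.results$s.start,
                   tblastn.results$s.end,
                   sep = "|")

# Step 1: minimum e-value of each group, repeated on every row of that group.
group.min <- ave(tblastn.results$evalue, group.key, FUN = min)

# Step 2: keep only the rows whose e-value equals their group's minimum.
results <- tblastn.results[tblastn.results$evalue == group.min, ]
print(results)
```

If two rows in a group tie for the lowest e-value, both are kept; dropping one of them would need an extra rule (for example, highest bit score).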


I used the following script, but it gives the error shown below it:

tblastn.results <- read.table(file.choose(), header=TRUE, sep=",")
install.packages("doBy")
library('doBy')
min.dataset <- summaryBy(evalue ~ subject.id,s.start,s.end,data=tblastn.results,FUN=min)


Error in .get_variables(formula, data, id, debug.info) :
  object 's.start' not found

Please check the headers once you import the data and tweak the query, substituting the header names in my query with the ones actually found in the data frame.
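
A quick way to do that check is sketched below. The inline CSV text is a made-up two-row stand-in for the real file (assumed comma-separated, as in the read.table() call above, and using a few of the headers from the question's table). With its default settings read.table() turns headers into syntactically valid R names, so a header such as "s. start" comes out as "s..start" rather than "s.start".

```r
# Hypothetical stand-in for the real file: a few columns, headers as in the question.
csv.text <- "query id,subject id,s. start,s. end,evalue
Chr1_FK1,ADDD02134481.1,1,264,7.00E-23
Chr2_FK1,ADDD02134481.1,1,264,3.00E-20"

tblastn.results <- read.table(text = csv.text, header = TRUE, sep = ",")

# Inspect the names R actually assigned on import.
imported.names <- names(tblastn.results)
print(imported.names)

# Either use those names in the formula, or rename the columns to the ones the query expects.
names(tblastn.results) <- c("query.id", "subject.id", "s.start", "s.end", "evalue")
print(names(tblastn.results))
```

Renaming once after import keeps the rest of the script readable; using the imported names directly in the formula works just as well.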
