Trouble filtering Blastn output CSVs by pident, qcov, and stitle using RStudio
8 months ago
pawhitesell ▴ 30

Hi,

I am trying to filter multiple blastn result files in csv format, such that each csv file will only include 3 columns: pident, qcoverage, and stitle. I want to keep only rows with pident and qcoverage above 90, and I want to remove any duplicate stitle rows.

To start with I used this:

data <- lapply(out, "[", 3:5)

to reduce my data down to the required 3 columns:

data     list [3]                            List of length 3
[[1]]    list [2652 x 3] (S3: data.frame)    A data.frame with 2652 rows and 3 columns
[[2]]    list [2646 x 3] (S3: data.frame)    A data.frame with 2646 rows and 3 columns
[[3]]    list [1460 x 3] (S3: data.frame)    A data.frame with 1460 rows and 3 columns

The data in each file now looks like this:

gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] 96.522  46
gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii] 87.273 22
gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium] 98.387 100
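Editor's sketch: selecting the three columns by name rather than by position (3:5) keeps the step working even if the column order differs between files. The toy table and the column names qseqid, sseqid, and qcovs below are assumptions for illustration, not taken from the real files.

```r
# Toy stand-in for the list of BLAST tables; names and values are assumed.
out <- list(
  data.frame(
    qseqid = c("q1", "q2"),
    sseqid = c("s1", "s2"),
    stitle = c("adeJ [Acinetobacter baumannii]", "mdtK [Salmonella enterica]"),
    pident = c(87.273, 98.387),
    qcovs  = c(22, 100),
    stringsAsFactors = FALSE
  )
)

keep <- c("stitle", "pident", "qcovs")

# Select by name so the result does not depend on column order.
# drop = FALSE guarantees a data.frame comes back.
data <- lapply(out, function(x) x[, keep, drop = FALSE])
```

Each element of data is then a three-column data frame, as with the positional version.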

However, I am running into trouble with my next step, which is to exclude all rows with a pident below 90. I have tried this:

mydata1 <- lapply[data, function (x) x[data$pident > 90]]

but it does not seem to do this. Could anyone suggest how I could accomplish this better? I would also like to remove rows with duplicate stitles, for which I am planning on using the "distinct()" function of dplyr as I have seen in another post, something like this:

distinct(dat, stitle, .keep_all = TRUE)

but if this looks foolish, please let me know.

Thanks!

(P.S. This is reposted on behalf of someone else, who might also respond.)

Blastn CSV R • 256 views

Instead of deleting and reposting, next time consider editing the original post.

Link to original deleted post

8 months ago
zx8754 10k

Try:

mydata1 <- lapply(data, function (x) x[x$pident > 90, ])
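Editor's sketch of why this form works where the original attempt did not: lapply is a function, so it takes parentheses rather than square brackets; the condition has to be computed on x (one data frame) rather than on data (the whole list); and the trailing comma makes the logical vector index rows instead of columns. The toy values and the qcovs column name below are assumptions, and applying both thresholds at once is one possible way to do it.

```r
# Toy list with one table; values are illustrative only.
data <- list(
  data.frame(
    stitle = c("AAC(6')-Iaa", "adeJ", "mdtK"),
    pident = c(96.522, 87.273, 98.387),
    qcovs  = c(46, 22, 100),
    stringsAsFactors = FALSE
  )
)

filtered <- lapply(data, function(x) {
  # Row-wise condition computed on this data frame, not on the list.
  keep <- x$pident > 90 & x$qcovs > 90
  # which() drops rows where the condition is NA instead of returning NA rows.
  x[which(keep), , drop = FALSE]
})
```

With these toy values only the mdtK row passes both thresholds.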
mydata1 <- lapply(data, function (x) x[(x$qcovs > 90),])

The code you sent didn't work, so I modified it a little and now it works. Thank you so much; it saved me a lot of time. I was planning on removing rows with duplicate stitle values after filtering, and your code gave me an idea of how to do that too.

data2 <- lapply(mydata1, function (x) x[!duplicated(x$stitle),])

That worked too. Thank you.
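Editor's sketch combining the two steps in one helper. Note that !duplicated() keeps the first row it meets for each stitle (dplyr's distinct(x, stitle, .keep_all = TRUE) behaves the same way), so this version sorts by pident first to keep the highest-identity hit. The helper name filter_hits, its default thresholds, the sorting step, and the toy values are all assumptions added for illustration.

```r
# Hypothetical helper; thresholds and column names are assumptions.
filter_hits <- function(x, min_pident = 90, min_qcovs = 90) {
  # Keep rows above both thresholds; which() ignores NA comparisons.
  keep <- which(x$pident > min_pident & x$qcovs > min_qcovs)
  x <- x[keep, , drop = FALSE]
  # Sort so the highest-identity hit per stitle comes first.
  x <- x[order(-x$pident), , drop = FALSE]
  # Keep only the first row for each stitle.
  x[!duplicated(x$stitle), , drop = FALSE]
}

# Toy list with one table; values are illustrative only.
data <- list(
  data.frame(
    stitle = c("mdtK", "mdtK", "adeJ", "AAC(6')-Iaa"),
    pident = c(95.1, 98.387, 87.273, 96.522),
    qcovs  = c(100, 100, 22, 46),
    stringsAsFactors = FALSE
  )
)

result <- lapply(data, filter_hits)
```

If the order of hits in the BLAST output should decide which duplicate survives, the sorting line can simply be dropped.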


Yeah, there was a typo: I missed the comma after 90. Fixed now.
