I am trying to filter multiple blastn result files in CSV format so that each file includes only 3 columns: pident, qcoverage, and stitle. I want to keep only rows where both pident and qcoverage are above 90, and I want to remove any rows with a duplicate stitle.
To start with I used this:
data <- lapply(out, "[", 3:5)
to reduce my data down to the required 3 columns:
data: list of length 3
  data.frame with 2652 rows and 3 columns
  data.frame with 2646 rows and 3 columns
  data.frame with 1460 rows and 3 columns
The data in each file now looks like this:
gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2]    96.522    46
gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii]    87.273    22
gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium]    98.387    100
However, I am running into trouble with my next step, which is to exclude all rows where pident is below 90. I have tried this:
mydata1 <- lapply[data, function (x) x[data$pident > 90]]
but it does not filter the rows as intended. Could anyone suggest how I could accomplish this? I would also like to remove rows with duplicate stitle values, for which I am planning to use the distinct() function from dplyr, as I have seen in another post, something like this:
distinct(dat, stitle, .keep_all = TRUE)
but if this looks foolish, please let me know.
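For reference, here is a minimal sketch of what the whole filtering step might look like in base R. Three things in the attempt above stand out: lapply is called with square brackets instead of parentheses, data$pident refers to the whole list rather than to the data frame x being processed, and the row index is missing the trailing comma that selects rows rather than columns. The toy data frames, the gene names in them, and the helper name filter_hits are made up for illustration; the column names stitle, pident, and qcoverage are assumed to match the real files.

```r
# Toy stand-in for the list of BLAST tables (values are made up for illustration).
data <- list(
  data.frame(
    stitle    = c("geneA", "geneB", "geneA", "geneC"),
    pident    = c(96.5, 87.3, 99.0, 98.4),
    qcoverage = c(95, 22, 97, 100),
    stringsAsFactors = FALSE
  ),
  data.frame(
    stitle    = c("geneD", "geneD"),
    pident    = c(91.0, 92.0),
    qcoverage = c(46, 93),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: keep rows above both thresholds, then drop repeated stitles.
filter_hits <- function(df, min_pident = 90, min_qcov = 90) {
  # which() also discards rows where either value is NA
  keep <- which(df$pident > min_pident & df$qcoverage > min_qcov)
  df <- df[keep, , drop = FALSE]
  # keep the first row seen for each stitle
  df[!duplicated(df$stitle), , drop = FALSE]
}

# Apply the same filter to every data frame in the list.
filtered <- lapply(data, filter_hits)
```

The !duplicated() line keeps the first row for each stitle, which is also what distinct(dat, stitle, .keep_all = TRUE) does, so with dplyr loaded the helper body could presumably be replaced by filter(df, pident > 90, qcoverage > 90) followed by distinct(). If the best hit per stitle should be kept rather than the first, the rows would need to be sorted by pident before removing duplicates.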
(P.S. This is reposted on behalf of someone else, who might also respond.)