3.6 years ago
pramach1
library(purrr)
library(tidyverse)

# all csv files in the working directory are collected here
fnames <- list.files(pattern = "\\.csv$")
myfiles <- lapply(fnames, read.delim)

# column 1 of each file is split into 5 fields on the separator
strings <- lapply(myfiles, function(x) str_split_fixed(x$col1, " ", 5))

# all 5 columns are now named
col_names <- c("qseqid", "sseqid", "stitle", "pident", "qcovs")
out <- lapply(myfiles, setNames, col_names)
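Since purrr is already loaded, the naming step can also be written with `map()` and `set_names()`. This is only a sketch on made-up toy data (the two `data.frame`s below are invented stand-ins for the imported files, not the poster's actual data):

```r
library(purrr)

# toy stand-ins for the imported files (made-up values)
myfiles <- list(
  data.frame(V1 = "a", V2 = "b", V3 = "c", V4 = 96.5, V5 = 46),
  data.frame(V1 = "d", V2 = "e", V3 = "f", V4 = 87.2, V5 = 22)
)

col_names <- c("qseqid", "sseqid", "stitle", "pident", "qcovs")

# set_names() applies the same column names to every data.frame in the list
out <- map(myfiles, set_names, col_names)

names(out[[1]])  # "qseqid" "sseqid" "stitle" "pident" "qcovs"
```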
This is how the list looks now (RStudio environment pane):

data : list of 3
  [[1]] data.frame: 2652 rows x 3 columns
  [[2]] data.frame: 2646 rows x 3 columns
  [[3]] data.frame: 1460 rows x 3 columns
# retain only columns 3 to 5, so every csv file in the list
# now has just the 3 columns stitle, pident and qcovs
data <- lapply(out, "[", 3:5)
Each csv file in that list looks like this:

gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2] 96.522 46
gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii] 87.273 22
gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium] 98.387 100
mydata1 <- lapply[data, function (x) x[data$pident > 90]]
This is not working. I want to filter each file to rows where the percent identity (pident) is above 90.

After that I want to filter all the csv files in the list to query coverage (qcovs) above 90, and then remove duplicated rows in each csv file based on the column stitle. One thing at a time: how do I filter the csv files in the list based on qcovs?
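For what it's worth, all three steps can be done with `lapply()` over the list. This is only a sketch on made-up rows, and it assumes `pident` and `qcovs` are already numeric; since `str_split_fixed()` returns character columns, they may need `as.numeric()` first:

```r
# made-up example standing in for one data.frame in the list
d <- data.frame(
  stitle = c("AAC(6')-Iaa", "adeJ", "mdtK", "mdtK"),
  pident = c(96.5, 87.3, 98.4, 98.4),
  qcovs  = c(46, 22, 100, 100),
  stringsAsFactors = FALSE
)
data <- list(d)

# keep, in every file, only rows with identity AND coverage above 90
filtered <- lapply(data, function(x) x[x$pident > 90 & x$qcovs > 90, ])

# then drop rows whose stitle has already been seen in that file
deduped <- lapply(filtered, function(x) x[!duplicated(x$stitle), ])

deduped[[1]]  # a single mdtK row remains
```

`x[condition, ]` needs the trailing comma (rows, not columns), which is what the `lapply[...]` attempt above was missing, along with using `(` rather than `[` to call `lapply`.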