Question

Separating a column in a list of files using purrr in R

2.2 years ago

pramach1 ▴ 40

I have 108 files of BLAST output. I read them into a list of data frames and filter each one by %identity and query coverage.

fnames <- list.files()

# Read each BLAST table and record which file it came from
data4 <- lapply(fnames, function(x) {
  res <- read.table(x, header = TRUE, sep = "\t", quote = "", fill = FALSE)
  res$sample <- x
  res
})

colnames <- c("qseqid", "sseqid", "stitle", "pident", "qcovs", "Sample")

out <- lapply(data4, setNames, colnames)
# Keep only stitle, pident, qcovs and Sample
data <- lapply(out, "[", 3:6)

data1 <- lapply(data, function (x) x[(x$qcovs > 90),])
data2 <- lapply(data1, function (x) x[(x$pident > 90),])

After this, I want to split the stitle column at the brackets and at each "|". How do I do that for every data frame in the list?

Here is the example of the stitle column.

gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] 
gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] 
gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] 
gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] 
gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] 
gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] 
gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa]

I want the column split based on "|" and tab. Thank you for the help.

score 0 · Answer 1 · 2022-02-15

2.2 years ago

rpolicastro 13k

Example data.

df <- structure(list(V1 = c("gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] ", 
"gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] ", 
"gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] ", 
"gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] ", 
"gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] ", 
"gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] ", 
"gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa] "
)), class = "data.frame", row.names = c(NA, -7L))

Tidyverse answer. Since you have a list of data frames, wrap this call in a function and apply it to each element with lapply or purrr::map.

library("tidyr")

separate(df, 1, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|")

  A     B           C     D               E           F               G         
  <chr> <chr>       <chr> <chr>           <chr>       <chr>           <chr>     
1 gb    AM260957.1  +     4186-5086       ARO:3003071 mphF            "[uncultu…
2 gb    NC_008618.1 -     1667063-1670624 ARO:3004480 Bifidobacteriu… "[Bifidob…
3 gb    AP006618.1  +     4835199-4838688 ARO:3000501 Nocardia rifam… "[Nocardi…
4 gb    AY043299.1  -     3984-5175       ARO:3000167 tet(C)          "[Aeromon…
5 gb    AB571865.1  -     144312-145536   ARO:3003745 mefC            "[Photoba…
6 gb    AE004091.2  +     2810008-2813197 ARO:3000804 MexF            "[Pseudom…
7 gb    AB219524.1  +     1176-4338       ARO:3003699 mexQ            "[Pseudom…

ADD COMMENT • link 2.2 years ago by rpolicastro 13k

I have a different number of rows but the same number of columns in each of the 108 files. The number of rows ranges from 4000 to 12000. If I have to use the above code, doesn't that mean all 108 files need the same number of rows and exactly the same information? I don't have that. So how would I split column 1 (stitle) in all 108 files? Thank you. I apologize for not being clear previously.

ADD REPLY • link 2.2 years ago by pramach1 ▴ 40

I think your confusion might be coming from into=LETTERS[1:7], since the example also happens to have 7 rows. You're splitting the stitle column into 7 separate columns, and that argument just tells the function to name the 7 new columns A-G. The function works for any number of rows.
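To make that concrete, here is a minimal sketch using two toy data frames with different row counts. The names small, large, and pattern are made up for illustration; the stitle values are taken from the question.

```r
library(tidyr)

# Two toy data frames with different numbers of rows
# (hypothetical stand-ins for two of the BLAST tables)
small <- data.frame(stitle = c(
  "gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium]",
  "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida]"
))
large <- data.frame(stitle = c(
  "gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae]",
  "gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1]",
  "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa]"
))

# Split at every "|" and at the whitespace just before "["
pattern <- "\\s(?=\\[)|\\|"

small_split <- separate(small, stitle, into = LETTERS[1:7], sep = pattern)
large_split <- separate(large, stitle, into = LETTERS[1:7], sep = pattern)

dim(small_split)  # 2 rows, 7 columns
dim(large_split)  # 3 rows, 7 columns
```

The length of into only has to match the number of pieces each string is split into, not the number of rows.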

ADD REPLY • link 2.2 years ago by rpolicastro 13k

I think I am doing something wrong. The first thing I did was

df <- structure(list(V1 = c("gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium] ", 
                        "gb|NC_008618.1|-|1667063-1670624|ARO:3004480|Bifidobacterium adolescentis rpoB conferring resistance to rifampicin [Bifidobacterium adolescentis] ", 
                        "gb|AP006618.1|+|4835199-4838688|ARO:3000501|Nocardia rifampin resistant beta-subunit of RNA polymerase (rpoB2) [Nocardia farcinica IFM 10152] ", 
                        "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida] ", 
                        "gb|AB571865.1|-|144312-145536|ARO:3003745|mefC [Photobacterium damselae subsp. damselae] ", 
                        "gb|AE004091.2|+|2810008-2813197|ARO:3000804|MexF [Pseudomonas aeruginosa PAO1] ", 
                        "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa] "
)), class = "data.frame", row.names = c(NA, -7L))

purrr::map

data3 <- separate(df, 1, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|")

I ended up with a single data frame containing only this example column split into 7 columns. It does not separate the stitle column in the list of all 108 files.

The list of files is shown here

ADD REPLY • link 2.2 years ago by pramach1 ▴ 40

If you want to get into data analysis in R, I would suggest reading R for Data Science by Hadley Wickham. It's going to be difficult to write R code without investing the time to learn it.

With that being said, the code I provided was an example and was not meant to be copied and pasted directly into your code. In your code it should look something like this.

data3 <- lapply(data2, \(x) separate(x, stitle, into=LETTERS[1:7], sep="\\s(?=\\[)|\\|"))
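For reference, a self-contained sketch of the same idea with purrr::map is shown below. The toy data2 list, its values, and the column names in new_cols are assumptions made for illustration; only the stitle strings and the separator pattern come from this thread. An ordinary function(x) is used instead of \(x) so the sketch also runs on R versions older than 4.1.

```r
library(tidyr)
library(purrr)

# Toy stand-in for data2: a list of filtered BLAST tables (numeric values are made up)
data2 <- list(
  data.frame(
    stitle = c("gb|AM260957.1|+|4186-5086|ARO:3003071|mphF [uncultured bacterium]",
               "gb|AY043299.1|-|3984-5175|ARO:3000167|tet(C) [Aeromonas salmonicida]"),
    pident = c(99.1, 95.4),
    qcovs  = c(100, 97),
    Sample = "sampleA.txt"
  ),
  data.frame(
    stitle = "gb|AB219524.1|+|1176-4338|ARO:3003699|mexQ [Pseudomonas aeruginosa]",
    pident = 98.2,
    qcovs  = 93,
    Sample = "sampleB.txt"
  )
)

# Hypothetical descriptive names for the 7 pieces (any 7 names would work)
new_cols <- c("db", "accession", "strand", "coords", "aro", "gene", "organism")

# Apply separate() to every data frame in the list; the other columns are kept
data3 <- map(data2, function(x) {
  separate(x, stitle, into = new_cols, sep = "\\s(?=\\[)|\\|")
})

# Optionally stack everything into one table; Sample still identifies the source file
combined <- do.call(rbind, data3)
```

The result data3 is again a list with one data frame per file, so later steps can keep using lapply or map in the same way.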

ADD REPLY • link 2.2 years ago by rpolicastro 13k