Question

Using RStudio to bulk split CSV tabs into columns.

3.6 years ago

pawhitesell ▴ 30

Hi,

I have output from blastn in tabular format 6 (-outfmt "6 qseqid sseqid stitle pident qcovs"), which I have saved as a CSV file.

I have multiple CSV BLAST output files in a directory, and I want to clean them all up at once using RStudio. Each CSV file has the same number of columns (1 column) and a different number of rows.

To read all the CSV files at once, I used:

fnames <- list.files()
csv <- lapply(fnames, read.csv)

This creates a list of data frames. Now I want to split the one column into multiple columns at the tab characters. I tried to use:

strings <- str_split_fixed(csv$col1, " ", 5)

However, this is not working, as it creates no data at all.

Is there another way to split the column in all the CSV files at once?

P.S. I am posting on behalf of someone, so they may respond here to any questions for clarification.

Thanks!

R csv blastn • 2.6k views

ADD COMMENT • link updated 3.6 years ago by pramach1 ▴ 40 • written 3.6 years ago by pawhitesell ▴ 30

Can you post some of your data using dput(head(csv[[1]]))?

ADD REPLY • link 3.6 years ago by rpolicastro 13k

Add some example data to your post, and an example of the desired output.

ADD REPLY • link 3.6 years ago by rpolicastro 13k

"1932:@M05996:27:000000000-CFP53:1:1101:19611:2120  gb|AF144880|+|3541-3979|ARO:3002569|AAC(6')-Iy  gb|AF144880|+|3541-3979|ARO:3002569|AAC(6')-Iy [Salmonella enterica subsp. enterica serovar Enteritidis]    98.261  46"
"1932:@M05996:27:000000000-CFP53:1:1101:19611:2120  gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa gb|AE006468.2|+|1707351-1707789|ARO:3002571|AAC(6')-Iaa [Salmonella enterica subsp. enterica serovar Typhimurium str. LT2]  96.522  46"
"1932:@M05996:27:000000000-CFP53:1:1101:14997:2171  gb|AY769962|+|2434-5611|ARO:3000781|adeJ    gb|AY769962|+|2434-5611|ARO:3000781|adeJ [Acinetobacter baumannii]  87.273  22"
"1928:@M05996:27:000000000-CFP53:1:1101:15032:4757  gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK    gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium]  98.387  100"

This is what the single column looks like in each of the CSV files.

I would like to split it at the tab characters into 5 columns.

gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK    gb|CP014358.1|-|2161325-2162750|ARO:3001327|mdtK [Salmonella enterica subsp. enterica serovar Typhimurium]  98.387  100

ADD REPLY • link updated 3.6 years ago by zx8754 11k • written 3.6 years ago by pramach1 ▴ 40

I don't see a tab separator, but I do potentially see a pipe separator |.

When you refer to this as the single column in your files, are you saying that this is the only column in the data.frame after importing it, or are you rather saying that this is the one column out of many that you have that you want to split?

It's always better to post an example of your data using dput(head(csv[[1]])) to avoid confusion like this.

ADD REPLY • link 3.6 years ago by rpolicastro 13k

Even if I have to separate using a pipe separator, how would I do that for a list of files?

I have tried

strings <- str_split_fixed(csv$col1, "|", 8)

It doesn't work.

I am unable to attach an image or run dput(head(csv[[1]])). I apologize. My computer's security settings are not letting me do it.

ADD REPLY • link 3.6 years ago by pramach1 ▴ 40
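
One likely reason the pipe split fails: stringr patterns are regular expressions by default, and an unescaped | is the alternation operator rather than a literal pipe. A minimal base R sketch of a literal split, using one sseqid value from the example rows (in stringr, the corresponding options would be fixed("|") or the escaped pattern "\\|"):

```r
# One sseqid value copied from the example data in this thread.
x <- "gb|AF144880|+|3541-3979|ARO:3002569|AAC(6')-Iy"

# fixed = TRUE treats "|" as a literal character, not as regex alternation.
parts_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

# Equivalent regex form: the pipe has to be escaped.
parts_regex <- strsplit(x, "\\|")[[1]]

print(parts_fixed)
```

Note that csv in the question is a list of data frames, so a split like this would still need to be applied to each element, for example inside lapply().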

Copy and paste the output from it into a comment. Then select all the code and press the button with 0's and 1's just above the post to format it as code.

ADD REPLY • link 3.6 years ago by rpolicastro 13k

Thank you. Actually it worked.

strings <- str_split_fixed(csv$col1, " ", 5)

Even though strings showed "no data available", the actual csv list has been separated into 5 columns. I apologize for not noticing this earlier.

ADD REPLY • link 3.6 years ago by pramach1 ▴ 40

Provide example data as plain text. Your links to images do not work.

ADD REPLY • link 3.6 years ago by zx8754 11k

Ram · Answer 1 · 2020-10-01

3.6 years ago

zx8754 11k

Row-bind the list of CSVs into one data frame; then we can work on the column.

csv <- do.call(rbind, lapply(fnames, read.csv))

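Or, if you wish to keep them as a list:

library(stringr)  # provides str_split_fixed()

csv <- lapply(fnames, function(i) {
  d <- read.csv(i)
  # split the column into 5 pieces; assumes it is named col1, as in the question
  x <- str_split_fixed(d$col1, " ", 5)
  # return the original column alongside the split pieces
  cbind(d, x)
})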

ADD COMMENT • link 3.6 years ago by zx8754 11k
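
Since -outfmt 6 output is tab-delimited, another option is to skip the split step and parse the tabs while reading. This is only a sketch, assuming the files really are tab-separated and have no header row. The column names come from the -outfmt string in the question; demo_dir, read_blast, and the shortened demo rows are hypothetical stand-ins for the real files:

```r
# Column names taken from the -outfmt string in the question.
blast_cols <- c("qseqid", "sseqid", "stitle", "pident", "qcovs")

# Demo input (made up, shortened): two small tab-separated files in a temp folder.
demo_dir <- file.path(tempdir(), "blast_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("read1\tgb|X1|geneA\tgb|X1|geneA [Salmonella enterica]\t98.261\t46",
             "read1\tgb|X2|geneB\tgb|X2|geneB [Salmonella enterica]\t96.522\t46"),
           file.path(demo_dir, "sample1.csv"))
writeLines("read2\tgb|X3|geneC\tgb|X3|geneC [Acinetobacter baumannii]\t87.273\t22",
           file.path(demo_dir, "sample2.csv"))

fnames <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file as tab-delimited text with no header line.
# quote = "" turns off quote handling so titles are read literally.
read_blast <- function(path) {
  read.delim(path, header = FALSE, sep = "\t", quote = "",
             col.names = blast_cols, stringsAsFactors = FALSE)
}

blast_list <- lapply(fnames, read_blast)   # one data frame per file
names(blast_list) <- basename(fnames)
blast_all <- do.call(rbind, blast_list)    # works because column names match
```

Because every data frame gets the same five column names, the list can be kept as is or row-bound into a single data frame.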

Thank you. Since I have a differing number of rows, this is not working. The error I get is:

Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 2652, 0

ADD REPLY • link updated 3.6 years ago by Ram 43k • written 3.6 years ago by pramach1 ▴ 40
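
The 0 in this error is consistent with d$col1 being NULL, which is what $ returns when a data frame has no column of that name. The split then has zero rows, and cbind() cannot combine it with the 2652-row data frame. A hedged base R sketch, assuming a single tab-delimited column, that selects the column by position instead of by name (the demo rows are made up):

```r
# Demo data frame with a single column whose name is not "col1".
d <- data.frame(V1 = c("q1\ts1\ttitle one\t98.261\t46",
                       "q2\ts2\ttitle two\t87.273\t22"),
                stringsAsFactors = FALSE)

no_such_column <- is.null(d$col1)  # TRUE: `$` gives NULL for a missing column

# Split the first column by position, on literal tab characters.
parts <- strsplit(d[[1]], "\t", fixed = TRUE)
stopifnot(all(lengths(parts) == 5))  # every row should yield 5 fields

# Stack the pieces into a matrix and bind it next to the original column.
x <- do.call(rbind, parts)
colnames(x) <- c("qseqid", "sseqid", "stitle", "pident", "qcovs")
out <- cbind(d, x)
```

Checking names(d) on one of the real data frames would show what the column is actually called.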

If I use

csv <- do.call(rbind, lapply(fnames, read.csv))

this is the error I get:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

ADD REPLY • link updated 3.6 years ago by Ram 43k • written 3.6 years ago by pramach1 ▴ 40
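
rbind() on data frames requires matching column names. One possible cause here, assuming the files have no header row: read.csv() defaults to header = TRUE, so the first data line of each file becomes that file's column name, and those names differ between files. A small sketch with made-up two-line files:

```r
# Two made-up files, each with two tab-separated lines and no header row.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("a1\tb1", "a2\tb2"), f1)
writeLines(c("c1\td1", "c2\td2"), f2)
demo_files <- c(f1, f2)

# Default header = TRUE: the first line is consumed as the column name.
with_header <- lapply(demo_files, read.csv)
header_names <- sapply(with_header, names)  # one (different) name per file

# header = FALSE: every file gets the default name V1, so rbind() can match them.
no_header <- lapply(demo_files, read.csv, header = FALSE,
                    stringsAsFactors = FALSE)
combined <- do.call(rbind, no_header)
```

With header = FALSE the first line of each file is also kept as data instead of being used as a name.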