Question: Loop R script with text modifying through different CNV files
0
gravatar for brunobsouzaa
15 months ago by
brunobsouzaa370
Brazil
brunobsouzaa370 wrote:

Hi guys,

I'm using R to modify cnvkit output for my analysis. The thing is that I have multiple region files (*.cnr) and it's time-consuming (and not necessary) to run the script in each file. So, does anyone know what can I do? Basically what I need is to loop through "file", "file_1", "file_2", "file_n". So, the output always needs to be the exact filename!

The script:

# Clear workspace

rm(list=(ls()))

# Load File and Reference
file <- read.csv("file.cnr", header=T, sep="\t")

ref <- read.csv("ref.cnn", header=T, sep="\t")

# Merge both files by "start" position
merged <- merge(file, ref, by="start", suffixes=c(".file", ".ref"))

# Round "log2" column
merged$log2file <- round(merged$log2.file, digits=1)

# re-calculate "cn" based on log2 correction
merged$cn <- round(2*(2^(merged$log2.file)))

# Subset file with all "cn" values that are not 2
alt.cn <- subset(merged, merged$cn !=2)

# Create new data with columns of interest
alt.cns <- as.data.framealt.cn[, c(1:8,13)])

# Re-order columns for better view
alt.cns <- alt.cns[c(2,1,3,4,6,5,8,7,9)]

# Calculate ratio between coverages
alt.cns$depth.ratio <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

alt.cns$depth.ratio.1 <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

## Function to call for DUP or DEL.  
alt.cns$SV_type <- ifelse(alt.cns$cn < 2, "DEL", "DUP")

# Convert "alt.cns" to .bed file

full <- alt.cns[c(1,2,3,12,5,4,6,7,8,9,10,11)]

names(full)[1] <- "#Chrom"

names(full)[2] <- "Start"

names(full)[3] <- "End"

names(full)[4] <- "SV_type"

names(full)[6] <- "gene"

names(full)[7] <- "log2"

# Save "alt.cns" as .bed file
write.table(full, file="/path/to/output/file.bed", row.names=F, col.names=T, sep="\t")

# Filter "alt.cns" file

filtered <- subset(alt.cns, alt.cns$depth.ratio > 0.6 & alt.cns$depth.ratio.1 > 1.4 & alt.cns$weight > 0.3)

filtered <- filtered[c(1,2,3,12,5,4,6,7,8,9,10)]

names(filtered)[1] <- "#Chrom"

names(filtered)[2] <- "Start"

names(filtered)[3] <- "End"

names(filtered)[4] <- "SV_type"

names(filtered)[6] <- "gene"

names(filtered)[7] <- "log2"

#Save file
write.table(filtered, file="/path/to/output/file.filtered.bed", row.names=F, col.names=T, sep="\t")

Thanks!

cnv R • 424 views
ADD COMMENTlink modified 15 months ago by zx87549.7k • written 15 months ago by brunobsouzaa370
1

What have you tried? Please consider doing some for loop tutorials first and see how far you can come (there are many available, search google). If you get stuck somewhere show us what you have tried and we will try to find a solution for your obstacle. I think that just asking us to rewrite your current script into a loop is not the way to learn bioinformatics.

ADD REPLYlink modified 15 months ago • written 15 months ago by Benn8.0k

You're absolutely right... I've tried some for loops, the last one:

files <- Sys.glob("path/to/*.cnr") For ( i in files ) { my_script_here

} But didn't worked! I've also tried to split the variables inside my "files" but it also didn't work. Actually, this is only a small part of my pipeline, all other parts (in bash) are inside a for loop and works like a charm! To be more specific, I've just posted the script so one can see what is going on... I think it's easier to see and maybe the script can help someone else!

ADD REPLYlink written 15 months ago by brunobsouzaa370

Does something like this work?

files <- Sys.glob("path/to/*.cnr") 

for ( i in 1:length(files) ) { 
  file <- read.csv(files[i], header=T, sep="\t")
  ...
}
ADD REPLYlink written 15 months ago by Benn8.0k

Tried this approach, it breaks on the "merge" step. For some reason, it doesn't recognize my $start column from files! "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column"

ADD REPLYlink modified 15 months ago • written 15 months ago by brunobsouzaa370

see what colnames you have:

colnames(file)

NB. Maybe file is not the smartest name since it is a function to in R.

ADD REPLYlink modified 15 months ago • written 15 months ago by Benn8.0k

colnames are ok for "sample" and "s_ref" variables

> colnames(sample)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "cn"         "depth"      "weight"    
> colnames(s_ref)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "depth"      "gc"         "rmask"      "spread"

Note: if I don't try to apply "for", the script works well for each sample!

ADD REPLYlink modified 15 months ago • written 15 months ago by brunobsouzaa370

All right, this is the first time you use sample. In your initial script nowhere sample. It is impossible to help you this way, I can't just guess what's going, if you know what I mean...

ADD REPLYlink written 15 months ago by Benn8.0k

Sorry for that, just changed "file" by "sample" since it's an R function... But thanks for your help, I'm trying to figure out what is going on with the loop... If I find the solution, I'll post it here!

ADD REPLYlink written 15 months ago by brunobsouzaa370

N.B. sample is a function too. But that is probably not the problem.

The error you get now, is that the columns are not unique, so are there not unique values in one of these columns?

ADD REPLYlink written 15 months ago by Benn8.0k

Those columns contains start positions from my CNVs, so yes, they are unique values. I'm just matching the two files by those values... For single samples, this works well, the problem is inside the loop!

ADD REPLYlink written 15 months ago by brunobsouzaa370

Okay, if you say so. Good luck!

It might be useful just to test it with unique function. Coordinates on different chromosomes can be the same...

ADD REPLYlink modified 15 months ago • written 15 months ago by Benn8.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1654 users visited in the last hour