Question: Loop R script with text modifying through different CNV files
0
gravatar for brunobsouzaa
3 months ago by
brunobsouzaa50
Brazil
brunobsouzaa50 wrote:

Hi guys,

I'm using R to modify cnvkit output for my analysis. The thing is that I have multiple region files (*.cnr) and it's time-consuming (and not necessary) to run the script in each file. So, does anyone know what can I do? Basically what I need is to loop through "file", "file_1", "file_2", "file_n". So, the output always needs to be the exact filename!

The script:

# Clear workspace

rm(list=(ls()))

# Load File and Reference
file <- read.csv("file.cnr", header=T, sep="\t")

ref <- read.csv("ref.cnn", header=T, sep="\t")

# Merge both files by "start" position
merged <- merge(file, ref, by="start", suffixes=c(".file", ".ref"))

# Round "log2" column
merged$log2file <- round(merged$log2.file, digits=1)

# re-calculate "cn" based on log2 correction
merged$cn <- round(2*(2^(merged$log2.file)))

# Subset file with all "cn" values that are not 2
alt.cn <- subset(merged, merged$cn !=2)

# Create new data with columns of interest
alt.cns <- as.data.framealt.cn[, c(1:8,13)])

# Re-order columns for better view
alt.cns <- alt.cns[c(2,1,3,4,6,5,8,7,9)]

# Calculate ratio between coverages
alt.cns$depth.ratio <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

alt.cns$depth.ratio.1 <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

## Function to call for DUP or DEL.  
alt.cns$SV_type <- ifelse(alt.cns$cn < 2, "DEL", "DUP")

# Convert "alt.cns" to .bed file

full <- alt.cns[c(1,2,3,12,5,4,6,7,8,9,10,11)]

names(full)[1] <- "#Chrom"

names(full)[2] <- "Start"

names(full)[3] <- "End"

names(full)[4] <- "SV_type"

names(full)[6] <- "gene"

names(full)[7] <- "log2"

# Save "alt.cns" as .bed file
write.table(full, file="/path/to/output/file.bed", row.names=F, col.names=T, sep="\t")

# Filter "alt.cns" file

filtered <- subset(alt.cns, alt.cns$depth.ratio > 0.6 & alt.cns$depth.ratio.1 > 1.4 & alt.cns$weight > 0.3)

filtered <- filtered[c(1,2,3,12,5,4,6,7,8,9,10)]

names(filtered)[1] <- "#Chrom"

names(filtered)[2] <- "Start"

names(filtered)[3] <- "End"

names(filtered)[4] <- "SV_type"

names(filtered)[6] <- "gene"

names(filtered)[7] <- "log2"

#Save file
write.table(filtered, file="/path/to/output/file.filtered.bed", row.names=F, col.names=T, sep="\t")

Thanks!

cnv R • 248 views
ADD COMMENTlink modified 3 months ago by zx87548.4k • written 3 months ago by brunobsouzaa50
1

What have you tried? Please consider doing some for loop tutorials first and see how far you can come (there are many available, search google). If you get stuck somewhere show us what you have tried and we will try to find a solution for your obstacle. I think that just asking us to rewrite your current script into a loop is not the way to learn bioinformatics.

ADD REPLYlink modified 3 months ago • written 3 months ago by Benn7.9k

You're absolutely right... I've tried some for loops, the last one:

files <- Sys.glob("path/to/*.cnr") For ( i in files ) { my_script_here

} But didn't worked! I've also tried to split the variables inside my "files" but it also didn't work. Actually, this is only a small part of my pipeline, all other parts (in bash) are inside a for loop and works like a charm! To be more specific, I've just posted the script so one can see what is going on... I think it's easier to see and maybe the script can help someone else!

ADD REPLYlink written 3 months ago by brunobsouzaa50

Does something like this work?

files <- Sys.glob("path/to/*.cnr") 

for ( i in 1:length(files) ) { 
  file <- read.csv(files[i], header=T, sep="\t")
  ...
}
ADD REPLYlink written 3 months ago by Benn7.9k

Tried this approach, it breaks on the "merge" step. For some reason, it doesn't recognize my $start column from files! "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column"

ADD REPLYlink modified 3 months ago • written 3 months ago by brunobsouzaa50

see what colnames you have:

colnames(file)

NB. Maybe file is not the smartest name since it is a function to in R.

ADD REPLYlink modified 3 months ago • written 3 months ago by Benn7.9k

colnames are ok for "sample" and "s_ref" variables

> colnames(sample)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "cn"         "depth"      "weight"    
> colnames(s_ref)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "depth"      "gc"         "rmask"      "spread"

Note: if I don't try to apply "for", the script works well for each sample!

ADD REPLYlink modified 3 months ago • written 3 months ago by brunobsouzaa50

All right, this is the first time you use sample. In your initial script nowhere sample. It is impossible to help you this way, I can't just guess what's going, if you know what I mean...

ADD REPLYlink written 3 months ago by Benn7.9k

Sorry for that, just changed "file" by "sample" since it's an R function... But thanks for your help, I'm trying to figure out what is going on with the loop... If I find the solution, I'll post it here!

ADD REPLYlink written 3 months ago by brunobsouzaa50

N.B. sample is a function too. But that is probably not the problem.

The error you get now, is that the columns are not unique, so are there not unique values in one of these columns?

ADD REPLYlink written 3 months ago by Benn7.9k

Those columns contains start positions from my CNVs, so yes, they are unique values. I'm just matching the two files by those values... For single samples, this works well, the problem is inside the loop!

ADD REPLYlink written 3 months ago by brunobsouzaa50

Okay, if you say so. Good luck!

It might be useful just to test it with unique function. Coordinates on different chromosomes can be the same...

ADD REPLYlink modified 3 months ago • written 3 months ago by Benn7.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1705 users visited in the last hour