Loop R script with text modifying through different CNV files
0
0
Entering edit mode
4.7 years ago
brunobsouzaa ▴ 830

Hi guys,

I'm using R to modify cnvkit output for my analysis. The thing is that I have multiple region files (*.cnr) and it's time-consuming (and not necessary) to run the script in each file. So, does anyone know what can I do? Basically what I need is to loop through "file", "file_1", "file_2", "file_n". So, the output always needs to be the exact filename!

The script:

# Clear workspace

rm(list=(ls()))

# Load File and Reference
file <- read.csv("file.cnr", header=T, sep="\t")

ref <- read.csv("ref.cnn", header=T, sep="\t")

# Merge both files by "start" position
merged <- merge(file, ref, by="start", suffixes=c(".file", ".ref"))

# Round "log2" column
merged$log2file <- round(merged$log2.file, digits=1)

# re-calculate "cn" based on log2 correction
merged$cn <- round(2*(2^(merged$log2.file)))

# Subset file with all "cn" values that are not 2
alt.cn <- subset(merged, merged$cn !=2)

# Create new data with columns of interest
alt.cns <- as.data.framealt.cn[, c(1:8,13)])

# Re-order columns for better view
alt.cns <- alt.cns[c(2,1,3,4,6,5,8,7,9)]

# Calculate ratio between coverages
alt.cns$depth.ratio <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

alt.cns$depth.ratio.1 <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)

## Function to call for DUP or DEL.  
alt.cns$SV_type <- ifelse(alt.cns$cn < 2, "DEL", "DUP")

# Convert "alt.cns" to .bed file

full <- alt.cns[c(1,2,3,12,5,4,6,7,8,9,10,11)]

names(full)[1] <- "#Chrom"

names(full)[2] <- "Start"

names(full)[3] <- "End"

names(full)[4] <- "SV_type"

names(full)[6] <- "gene"

names(full)[7] <- "log2"

# Save "alt.cns" as .bed file
write.table(full, file="/path/to/output/file.bed", row.names=F, col.names=T, sep="\t")

# Filter "alt.cns" file

filtered <- subset(alt.cns, alt.cns$depth.ratio > 0.6 & alt.cns$depth.ratio.1 > 1.4 & alt.cns$weight > 0.3)

filtered <- filtered[c(1,2,3,12,5,4,6,7,8,9,10)]

names(filtered)[1] <- "#Chrom"

names(filtered)[2] <- "Start"

names(filtered)[3] <- "End"

names(filtered)[4] <- "SV_type"

names(filtered)[6] <- "gene"

names(filtered)[7] <- "log2"

#Save file
write.table(filtered, file="/path/to/output/file.filtered.bed", row.names=F, col.names=T, sep="\t")

Thanks!

R CNV • 1.8k views
ADD COMMENT
1
Entering edit mode

What have you tried? Please consider doing some for loop tutorials first and see how far you can come (there are many available, search google). If you get stuck somewhere show us what you have tried and we will try to find a solution for your obstacle. I think that just asking us to rewrite your current script into a loop is not the way to learn bioinformatics.

ADD REPLY
0
Entering edit mode

You're absolutely right... I've tried some for loops, the last one:

files <- Sys.glob("path/to/*.cnr") For ( i in files ) { my_script_here

} But didn't worked! I've also tried to split the variables inside my "files" but it also didn't work. Actually, this is only a small part of my pipeline, all other parts (in bash) are inside a for loop and works like a charm! To be more specific, I've just posted the script so one can see what is going on... I think it's easier to see and maybe the script can help someone else!

ADD REPLY
0
Entering edit mode

Does something like this work?

files <- Sys.glob("path/to/*.cnr") 

for ( i in 1:length(files) ) { 
  file <- read.csv(files[i], header=T, sep="\t")
  ...
}
ADD REPLY
0
Entering edit mode

Tried this approach, it breaks on the "merge" step. For some reason, it doesn't recognize my $start column from files! "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column"

ADD REPLY
0
Entering edit mode

see what colnames you have:

colnames(file)

NB. Maybe file is not the smartest name since it is a function to in R.

ADD REPLY
0
Entering edit mode

colnames are ok for "sample" and "s_ref" variables

> colnames(sample)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "cn"         "depth"      "weight"    
> colnames(s_ref)
[1] "chromosome" "start"      "end"        "gene"       "log2"      
[6] "depth"      "gc"         "rmask"      "spread"

Note: if I don't try to apply "for", the script works well for each sample!

ADD REPLY
0
Entering edit mode

All right, this is the first time you use sample. In your initial script nowhere sample. It is impossible to help you this way, I can't just guess what's going, if you know what I mean...

ADD REPLY
0
Entering edit mode

Sorry for that, just changed "file" by "sample" since it's an R function... But thanks for your help, I'm trying to figure out what is going on with the loop... If I find the solution, I'll post it here!

ADD REPLY
0
Entering edit mode

N.B. sample is a function too. But that is probably not the problem.

The error you get now, is that the columns are not unique, so are there not unique values in one of these columns?

ADD REPLY
0
Entering edit mode

Those columns contains start positions from my CNVs, so yes, they are unique values. I'm just matching the two files by those values... For single samples, this works well, the problem is inside the loop!

ADD REPLY
0
Entering edit mode

Okay, if you say so. Good luck!

It might be useful just to test it with unique function. Coordinates on different chromosomes can be the same...

ADD REPLY

Login before adding your answer.

Traffic: 2693 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6