Loop for interval range based on chromosome sizes
1
0
Entering edit mode
2.3 years ago
selplat21 ▴ 20
  1. I currently have a series of vcf files for 500 individuals.
  2. I have a separate vcf file for each chromosome in each individual.
  3. The software I am using requires 2 Mb intervals to run and takes the argument -range [int_min] [int_max]

Is there a loop I could do that would generate a separate job for each 2Mb range based on chromosome sizes. This would mean a job for each 2Mb interval across each chromosome for each individual.

chromosomes genomics • 725 views
ADD COMMENT
0
Entering edit mode
2.3 years ago

a nextfow workflow would look like this (not tested)

params.reference=""
params.bed=""
params.vcfs=""

Channel.fromPath(params.vcfs).
   splitCsv(header:false,sep:'\t',strip:true).
   set{each_vcf}

process makeWindows {
output:
    path("windows.bed")  into windows_bed
script:
"""
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed
"""
}

windows_bed.
   splitCsv(header:false,sep:'\t',strip:true).
   map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */
   set{each_interval}


process processVcf {
input:
     tuple vcf,chrom,start,end from each_vcf.combine(windows_bed)
output:
    path("ouput.txt")  into vcf_out
script:
"""
yourtool -chromosome "${chrom}"   -range ${start}-${end} ${vcf} > ouput.txt
"""
}

process zipIt {
input:
  val L from vcf_out.collect()
output:
  path("final.zip")
script:
"""
cat << EOF > tmp.list
${L.join("\n")}
EOF

zip -@ -9 final.zip < tmp.list
"""
}
ADD COMMENT
0
Entering edit mode

Hi Pierre, I will test this out and familiarize myself with the syntax and get back to you. Very much appreciated.

ADD REPLY
0
Entering edit mode

Okay the following is what I was able to trim it down to, but I have some followup questions which are commented out:

params.reference="Sept22Assembly.fasta" 

I assume we keep quotes here?

params.vcfs="vcf.list" 

This would be a list of my gzipped vcf files (the input for my software of interest. I'm a little confused on how to structure this list. Should it be a list of vcf.gz files per sample, or all vcf.gz files together. If it is per sample, I will need to run this whole script as a loop and keep params.vcf as a variable.

Channel.fromPath(params.vcfs).
   splitCsv(header:false,sep:'\t',strip:true).
   set{each_vcf}

process makeWindows {
output:
    path("windows.bed")  into windows_bed
script:
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed
}

windows_bed.
   splitCsv(header:false,sep:'\t',strip:true).
   map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */
   set{each_interval}


process processVcf {
input:
     tuple vcf,chrom,start,end from each_vcf.combine(windows_bed)
output:
    path("ouput.txt")  into vcf_out
script:
./loimpute -i ${vcf} -ne 80000 -range ${start}-${end} -h refpanel.vcf.gz -o ??
}

This is essentially the structure of the command. Notice it does not include a chromosome argument. But range will apply only to the chromosome being analyzed I think. In terms of the output, I'm not sure if this should be different for each vcf.gz file.

Lastly, I'm unsure about how this will parallelize. The software can only handle 2Mb at a time because of limits on memory.

ADD REPLY

Login before adding your answer.

Traffic: 2958 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6