I assume we keep quotes here?

Question

Loop for interval range based on chromosome sizes

0

Entering edit mode

2.3 years ago

selplat21 ▴ 20

I currently have a series of vcf files for 500 individuals.
I have a separate vcf file for each chromosome in each individual.
The software I am using requires 2 Mb intervals to run and takes the argument -range [int_min] [int_max]

Is there a loop I could do that would generate a separate job for each 2Mb range based on chromosome sizes. This would mean a job for each 2Mb interval across each chromosome for each individual.

chromosomes genomics • 725 views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 2.3 years ago by selplat21 ▴ 20

score 0 · Answer 1 · 2022-01-13

0

Entering edit mode

2.3 years ago

Pierre Lindenbaum 161k

a nextfow workflow would look like this (not tested)

params.reference=""
params.bed=""
params.vcfs=""

Channel.fromPath(params.vcfs).
   splitCsv(header:false,sep:'\t',strip:true).
   set{each_vcf}

process makeWindows {
output:
    path("windows.bed")  into windows_bed
script:
"""
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed
"""
}

windows_bed.
   splitCsv(header:false,sep:'\t',strip:true).
   map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */
   set{each_interval}


process processVcf {
input:
     tuple vcf,chrom,start,end from each_vcf.combine(windows_bed)
output:
    path("ouput.txt")  into vcf_out
script:
"""
yourtool -chromosome "${chrom}"   -range ${start}-${end} ${vcf} > ouput.txt
"""
}

process zipIt {
input:
  val L from vcf_out.collect()
output:
  path("final.zip")
script:
"""
cat << EOF > tmp.list
${L.join("\n")}
EOF

zip -@ -9 final.zip < tmp.list
"""
}

ADD COMMENT • link 2.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi Pierre, I will test this out and familiarize myself with the syntax and get back to you. Very much appreciated.

ADD REPLY • link 2.3 years ago by selplat21 ▴ 20

0

Entering edit mode

Okay the following is what I was able to trim it down to, but I have some followup questions which are commented out:

params.reference="Sept22Assembly.fasta"

I assume we keep quotes here?

params.vcfs="vcf.list"

This would be a list of my gzipped vcf files (the input for my software of interest. I'm a little confused on how to structure this list. Should it be a list of vcf.gz files per sample, or all vcf.gz files together. If it is per sample, I will need to run this whole script as a loop and keep params.vcf as a variable.

Channel.fromPath(params.vcfs).
   splitCsv(header:false,sep:'\t',strip:true).
   set{each_vcf}

process makeWindows {
output:
    path("windows.bed")  into windows_bed
script:
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed
}

windows_bed.
   splitCsv(header:false,sep:'\t',strip:true).
   map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */
   set{each_interval}


process processVcf {
input:
     tuple vcf,chrom,start,end from each_vcf.combine(windows_bed)
output:
    path("ouput.txt")  into vcf_out
script:
./loimpute -i ${vcf} -ne 80000 -range ${start}-${end} -h refpanel.vcf.gz -o ??
}

This is essentially the structure of the command. Notice it does not include a chromosome argument. But range will apply only to the chromosome being analyzed I think. In terms of the output, I'm not sure if this should be different for each vcf.gz file.

Lastly, I'm unsure about how this will parallelize. The software can only handle 2Mb at a time because of limits on memory.

ADD REPLY • link 2.3 years ago by selplat21 ▴ 20