Loop for interval range based on chromosome sizes
1
0
Entering edit mode
8 months ago
selplat21 ▴ 20
1. I currently have a series of vcf files for 500 individuals.
2. I have a separate vcf file for each chromosome in each individual.
3. The software I am using requires 2 Mb intervals to run and takes the argument -range [int_min] [int_max]

Is there a loop I could do that would generate a separate job for each 2Mb range based on chromosome sizes. This would mean a job for each 2Mb interval across each chromosome for each individual.

chromosomes bioinformatics genomics loops • 374 views
0
Entering edit mode
8 months ago

a nextfow workflow would look like this (not tested)

params.reference=""
params.bed=""
params.vcfs=""

Channel.fromPath(params.vcfs).
set{each_vcf}

process makeWindows {
output:
path("windows.bed")  into windows_bed
script:
"""
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed """ } windows_bed. splitCsv(header:false,sep:'\t',strip:true). map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */ set{each_interval} process processVcf { input: tuple vcf,chrom,start,end from each_vcf.combine(windows_bed) output: path("ouput.txt") into vcf_out script: """ yourtool -chromosome "${chrom}"   -range ${start}-${end} ${vcf} > ouput.txt """ } process zipIt { input: val L from vcf_out.collect() output: path("final.zip") script: """ cat << EOF > tmp.list${L.join("\n")}
EOF

zip -@ -9 final.zip < tmp.list
"""
}

0
Entering edit mode

Hi Pierre, I will test this out and familiarize myself with the syntax and get back to you. Very much appreciated.

0
Entering edit mode

Okay the following is what I was able to trim it down to, but I have some followup questions which are commented out:

params.reference="Sept22Assembly.fasta"


# I assume we keep quotes here?

params.vcfs="vcf.list"


# This would be a list of my gzipped vcf files (the input for my software of interest. I'm a little confused on how to structure this list. Should it be a list of vcf.gz files per sample, or all vcf.gz files together. If it is per sample, I will need to run this whole script as a loop and keep params.vcf as a variable.

Channel.fromPath(params.vcfs).
set{each_vcf}

process makeWindows {
output:
path("windows.bed")  into windows_bed
script:
bedtools makewindows -g "${params.reference}.fai" -w 2000000 > windows.bed } windows_bed. splitCsv(header:false,sep:'\t',strip:true). map{T->[T[0],1+(T[1] as Integer),T[2]]}. /* convert bed +0 to interval +1 */ set{each_interval} process processVcf { input: tuple vcf,chrom,start,end from each_vcf.combine(windows_bed) output: path("ouput.txt") into vcf_out script: ./loimpute -i${vcf} -ne 80000 -range ${start}-${end} -h refpanel.vcf.gz -o ??
}