Question

How to slice a CRAM file into the 50kb regions padded with 1kb?

0

Entering edit mode

5 months ago

Sd • 0

Hello,

I am working on whole genome sequencing CRAM files and I want to perform GATK best practice. Before that, I want to slice each CRAM into smaller chunks, 50kb regions with 1kb padding, and avoid losing the reads. I want this to paralelize my analysis and increase the speed. After that I created gVCF files I am going to merge them.

How can I do slicing?

GATK BED CRAM WGS • 1.4k views

ADD COMMENT • link updated 5 months ago by ATpoint 82k • written 5 months ago by Sd • 0

0

Entering edit mode

What have you tried? I ask because you have bed as a tag and are thus aware of bed-format based tools.

ADD REPLY • link 5 months ago by Ram 43k

0

Entering edit mode

I have tried to create a bed file for the hg38 reference genome with 50kb region length and 1kb padding. I used the following script to creat the bed file but I am not sure if the output is correct or not.

region_size <- 50000  # 50kb
padding <- 1000  # 1kb

# Initialize an empty list to store BED entries
bed_entries_list <- list()

# Generate BED entries for each chromosome
for (i in 1:nrow(chromosomes)) {
  chromosome <- chromosomes[i, 1]
  size <- as.numeric(chromosomes[i, 2])
  num_regions <- ceiling(size / region_size)

  # Generate BED entries for each region with padding
  for (j in 1:num_regions) {
    start <- (j - 1) * region_size + 1  # 1-based start position
    end <- min(j * region_size + padding, size)  # 1-based end position with padding, capped at chromosome size
    bed_entry <- c(chromosome, start, end)
    bed_entries_list[[length(bed_entries_list) + 1]] <- bed_entry
  }
}

bed_entries_matrix <- do.call(rbind, bed_entries_list)

The bed file I have created looks like this: But I am not sure if it is right way to do this. Then, I want to use this bed file to create CRAM chunks using samtools

chr1    1   51000
chr1    50001   101000
chr1    100001  151000
chr1    150001  201000
chr1    200001  251000
chr1    250001  301000
chr1    300001  351000

ADD REPLY • link 5 months ago by Sd • 0

0

Entering edit mode

But I am not sure if it is right way to do this.

you want bedtools makewindows

Then, I want to use this bed file to create CRAM chunks using samtools

go

ADD REPLY • link 5 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Your question has already been answered 5 months ago. Why are you asking it again?

ADD REPLY • link 5 months ago by Ram 43k

0

Entering edit mode

I want to add the 1kb padding, but I did not get an answer from the last question. How can I add 1kb padding?

ADD REPLY • link 5 months ago by Sd • 0

0

Entering edit mode

bedtools slop

ADD REPLY • link 5 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

how is it different from your previous question ? How to Split 3000 WGS CRAM files into 1Mbp length chunks

ADD REPLY • link 5 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Do it per chromosome, not per 50kb bin. 50kb creates thousands of file, that's a big IO burden.

ADD REPLY • link 5 months ago by ATpoint 82k

0

Entering edit mode

I am dealing with WGS data which computation time increases in multiple steps of GATK even per chromosome. Thus, I want very small regions to parallelize the computation.

ADD REPLY • link 5 months ago by Sd • 0

0

Entering edit mode

What infrastructure do you have available? Do you have a HPC that even allows to run thousands of processes and hundreds of nodes in parallel?

ADD REPLY • link 5 months ago by ATpoint 82k

0

Entering edit mode

Yes, I have a cluster allows to run many processes and hundreds of nodes in parallel.

ADD REPLY • link 5 months ago by Sd • 0

0

Entering edit mode

hundreds of nodes in parallel.

I kind of doubt that the scheduler gives them to you all at once. As I said, my recommendation is a per-chromosome splitting (if at all) and then just let it run. Use resources to run samples in parallel, not to split a single sample into thousands of chunks. The overhead to merge that all in the end is big and to some extend error-prone unless you have a bullet-proof pipeline which (with respect) I doubt given that you ask here for help and use R for even creating the intervals (no offense, I know it must sound like it, but it really isn't).

ADD REPLY • link 5 months ago by ATpoint 82k

score 0 · Answer 1 · 2023-11-09

I noticed something. I already have a cram for chr22 reads ( input_chr22.cram ). Then I sliced this cram file and the output was empty in the chr22:1-10510058 .

$ samtools view -b -o output.cram input_chr22.cram chr22:1-10510058
$ samtools view output.cram

The output was emply; however when added one coordinate:

$ samtools view -b -o output.cram input_chr22.cram chr22:1-10510059
$ samtools view output.cram 

A00393:49:H2JWTDSXX:3:2120:27661:13448  99      chr22   10510059        0       151M    =       10510230        322     GCAACTAATATGTATTTTCAAAGCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACATAATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAATGTTTCATCAGTAATTCA       FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF,FFFF:FFFFFFFF,FFFF:FFFFFFFFFFF:        PG:Z:MarkDuplicates     MQ:i:0  AS:i:146        XS:i:146        MD:Z:133G17    NM:i:1  RG:Z:novaseq01_2_180401_A00393_0049_BH2JWTDSXX.s_3_300_200.001

Since, it is a WGS file, why there is not any read in the 1-10510058 region?