Question: Splitting a .bed file into per-chromosome gzipped files with Python
danekhoffman0319 wrote:
import gzip

# Compress the existing BED file.
with open("example.bed", "rb") as input_file:
    data = input_file.read()

with gzip.open("example.bed.gz", "wb") as filez:
    filez.write(data)

import pandas as pd

# Convert the gzipped file back to a plain-text .bed file.
df = pd.read_csv("example.bed.gz", delimiter='\t', header=1)
df.to_csv('exampleziptotxt.bed', index=False)



import gzip
import os

file_name = "exampleziptotxt.bed"
out_file_root = "example_by_chrom"
file_handle_dict = {}

with open(file_name, "rb") as file_reader:
    for line in file_reader:
        ff = line.split()
        chrom_name = ff[0].decode("utf-8")

        if chrom_name not in file_handle_dict:
            out_file_chrom_name = out_file_root + "." + chrom_name + ".bed.gz"

            with gzip.open(out_file_chrom_name, "wb") as out_file_chrom_name_handle:
                file_handle_dict[chrom_name] = out_file_chrom_name_handle
                file_handle_dict[chrom_name].write(line)

            file_handle_dict[chrom_name].write(gzip.compress(line))

Desired behavior: the program takes a .bed file, compresses it, converts the gzipped file back to a plain-text .bed file, and then reads its contents and produces an individual gzipped .bed file for each chromosome, containing the genes belonging to that chromosome. Actual behavior: the script produces a gzipped file for every gene of every chromosome and then eventually throws this error:

FileNotFoundError: [Errno 2] No such file or directory: 'example_by_chrom.chr12,11733136,11733137,Cyp3a23/3a1,1,-.bed.gz'
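
To make the goal concrete, here is a rough, untested sketch (the file names are placeholders) of the kind of per-chromosome output I am trying to produce, going straight from the tab-separated .bed file to one gzipped file per chromosome:

import gzip

# Untested sketch: split a tab-separated BED file into one gzipped
# .bed file per chromosome. "example.bed" and "example_by_chrom" are
# placeholder names.
handles = {}
try:
    with open("example.bed", "rb") as bed:
        for line in bed:
            chrom = line.split(b"\t")[0].decode("utf-8")
            if chrom not in handles:
                handles[chrom] = gzip.open("example_by_chrom." + chrom + ".bed.gz", "wb")
            handles[chrom].write(line)
finally:
    for handle in handles.values():
        handle.close()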

Any help with solving this problem will be greatly appreciated. I have been stuck for days with this issue.

Tags: gzip, python, bed
Alex Reynolds wrote:

You could do this task much more easily and quickly using the shell instead of Python.

First, sort the starting file with sort-bed:

$ sort-bed in.unsorted.bed > in.bed

Then split the sorted file by chromosome with bedextract, writing each per-chromosome dataset to a separate compressed file:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | gzip -c > in.${chrom}.bed.gz; done

You could also write to Starch format, if you need to save more disk space:

$ for chrom in `bedextract --list-chr in.bed`; do echo ${chrom}; bedextract ${chrom} in.bed | starch --omit-signature - > in.${chrom}.bed.starch; done

References:

  1. https://bedops.readthedocs.io/en/latest/content/reference/file-management/sorting/sort-bed.html
  2. https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedextract.html
  3. https://bedops.readthedocs.io/en/latest/content/reference/file-management/compression/starch.html

danekhoffman0319 replied:

Unfortunately, I am limited to jupyter-lab at the moment, though I see your point.

Alex Reynolds replied:

Maybe use subprocess if you absolutely have to use Python. It's just not the right tool for this particular job, though.
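
If you do go the subprocess route, a minimal sketch might look like the following. It assumes the BEDOPS sort-bed and bedextract binaries are installed and on your PATH, and reuses the in.bed / in.<chrom>.bed.gz names from above:

import gzip
import subprocess

# Sort the starting file with sort-bed.
with open("in.bed", "wb") as sorted_bed:
    subprocess.run(["sort-bed", "in.unsorted.bed"], stdout=sorted_bed, check=True)

# List the chromosomes present in the sorted file.
chroms = subprocess.run(
    ["bedextract", "--list-chr", "in.bed"],
    capture_output=True, text=True, check=True
).stdout.split()

# Extract each chromosome and write it to its own gzipped file.
for chrom in chroms:
    extracted = subprocess.run(
        ["bedextract", chrom, "in.bed"],
        capture_output=True, check=True
    ).stdout
    with gzip.open("in." + chrom + ".bed.gz", "wb") as out:
        out.write(extracted)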