Question: trying to run python script to generate synthetic reads
1
gravatar for lien
4.6 years ago by
lien90
Belgium
lien90 wrote:

Hi all,

I've downloaded a Python script from the Supplementary data from a paper "Genome-wide copy number analysis of single cells', from Nature protocols 2012, Baslan et al. They provide several scripts to use to make your own pipeline to analyse copy numbers from sequencing data. I've ran some scripts from their supplementary files already, and these don't give any problems. However, I have some problems when I try to generate synthetic 50bp reads for each position in the hg19 genome. When I try to run this script 'hg19.generate.reads.k50.py' I don't get an output. Actually, I do get an output: one empty file, instead of several files. I have adapted the filepaths to correctly represent the paths for my case (as I did in the other scripts I ran). I don't know if I need to change some other parameters as well? (I'm not a pro with Python). I've copied the original script below, any help on how to run this would be very much appreciated.

#!/usr/bin/env python

import re
import sys
import random
import string

def main():

    CHROMLIST = open("/filepath/chromlist.txt", "r")

    chromlist = []
    for x in CHROMLIST:
        if x.rstrip().find("_") == -1:
            chromlist.append(x.rstrip())

    INFILE = False
    OUTFILE = open("/filepath/sequence.part.0.k50.txt", "w")
    outfilecount = 0
    counter = 0

    for chrFile in chromlist:
        a = chrFile.split("/")
        b = a[len(a) - 1]
        thisChrom = b.split(".")[0]

        if INFILE:
            INFILE.close()
        INFILE = open(chrFile, "r")
        chr = []
        x = ""
        y = ""
        INFILE.readline()
        for x in INFILE:
            chr.append(x.rstrip())
        x = "".join(chr)
        y = x.upper()
        print "after read " + thisChrom
        sys.stdout.flush()

        for i in range(len(y) - 50):
            counter += 1
            thisRead = y[i:(i+50)]

            if counter % 150000000 == 0:
                OUTFILE.close() 
                outfilecount += 1
                OUTFILE = open("/filepath/sequence.part." + str(outfilecount) + ".k50.txt", "w")

            OUTFILE.write("@" + thisChrom + "." + str(i))
            OUTFILE.write("\n")
            OUTFILE.write(thisRead)
            OUTFILE.write("\n")
            OUTFILE.write("+" + thisChrom + "." + str(i))
            OUTFILE.write("\n")
            OUTFILE.write("b" * len(thisRead))      
            OUTFILE.write("\n")

    INFILE.close()

    OUTFILE.close()

if __name__ == "__main__":
    main()
synthetic reads python • 1.5k views
ADD COMMENTlink modified 4.6 years ago by John12k • written 4.6 years ago by lien90
3

I've corrected the formatting as much as I could without actually going through the code. In general, it's often simpler to just link to a gist.

ADD REPLYlink written 4.6 years ago by Devon Ryan97k

I agree to an extent. Defeats our markdown move though :(

ADD REPLYlink written 4.6 years ago by _r_am31k
3
gravatar for Devon Ryan
4.6 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

What are you putting in for chromlist.txt? Looking at the code, it's expecting a file giving the path to single chromosome fasta files. No where in the path can a _ occur, which is presumably meant to get rid of *_random.fa contigs. Of course, if you happen to have a directory with a _ in the name then this will cause problems...

As an aside, the coding is terrible and rather inefficient, but that's a different post...

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Devon Ryan97k
2

The chromlist.txt file indeed contains a list of paths to single chromosome fasta files.

I was working on a cluster where directories have fixed names, indeed containing a _. When I tried running it locally, after changing all names of directories so that they don't contain an _, everything works. Many thanks!

ADD REPLYlink written 4.6 years ago by lien90
1

Man, that was some seriously impressive detective work. How did you guys figure that out without even seeing the file-structure!? This only further illustrates how Devon is actually a Hidden Markov Learning Machine, and where we all see questions with insufficient information to answer, he sees emissions. Feedback only makes him stronger. When he releases Methyl-Skynet or TermSNPator-1000, we'll know its too late :P

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by John12k
2

"Experience" means having already made all of the mistakes at least once :P

ADD REPLYlink written 4.6 years ago by Devon Ryan97k
1

b = a[len(a) - 1] :P hehe. Makes me feel better about my awful code ;)

ADD REPLYlink written 4.6 years ago by John12k
1

I know, right? My favorite is the fact that y is essentially a duplicate of x and likely a couple hundred megs in memory :(

ADD REPLYlink written 4.6 years ago by Devon Ryan97k
2
gravatar for John
4.6 years ago by
John12k
Germany
John12k wrote:

This code is really oddly written...

I've edited it a bit to be simpler but still do the same thing. I've also added some print lines (which end in #############...) which you can delete once your issue is solved - but if run they should give us some more information about what's going wrong.

Gist link :

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by John12k
2

OK mister "I have lab meeting in a couple days so I don't have time to write the README.md file for pybam"...

ADD REPLYlink written 4.6 years ago by Devon Ryan97k
1

hahah, ok ok, i'm on it -_-; :)

ADD REPLYlink written 4.6 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2206 users visited in the last hour