Question: Python script random act
0
gravatar for Bastien Hervé
13 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:

In this script I use genode GTF for mouse (last realase vM18)

At the beginning, I wanted to modify the mouse GTF annotation using a python script (modify some positions in chr12 annotation)

Of course I simplified my script to focus on my issue. I'll try to make it clear.

This is my script :

import os
import sys
import getopt
import os.path
import numpy as np
import pandas as pd

def usage():
    print('Usage:\n')
    print('\tpython ' +
          sys.argv[0] + ' -g <genome type> -y <gtf file>')
    print('\t\t-h or --help : display this help')
    print('\t\t-g or --genome : will add the genome to the cytoband file name')
    print('\t\t-y or --file_gtf : gtf file')


def main(argv):

    genome = ""
    file_gtf = ""

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'g:y:', ['genome=', 'file_gtf=', 'help'])
    except getopt.GetoptError:
        usage()
        sys.exit(2)

##############################OPTIONS##############################

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(2)
        elif opt in ('-g', '--genome'):
            genome = arg
        elif opt in ('-y', '--file_gtf'):
            file_gtf = arg
        else:
            print("Error : Bad option -> " + opt)
            usage()
            sys.exit(2)

##############################PROGRAM##############################

    new_vcf_file = open(genome + ".gtf", "a")
    with open(file_gtf, 'r') as vcf_f:
        for line in vcf_f:
            if line[0] != '#':
                array_line = line.split("\t")
                if array_line[0] == "chr12":
                    new_vcf_file.write(line)
                else:
                    new_vcf_file.write(line)
            else:
                new_vcf_file.write(line)
    new_vcf_file.close()

if __name__ == '__main__':
    main(sys.argv[1:])

I know it does nothing because I simplified it. Just read and write lines.

I run my script with this command line :

python test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf

At the end of this command a new GTF file is generated (testBiostar.gtf). To show my issue I count the number of line of this file and I delete it to re-run easily.

In this script I do not delete lines from the gencode GTF, so I expect to have the same number of lines

17:32:15 [ubuntu16@ubuntu16-VirtualBox Data]$ wc -l gencode.vM18.chr_patch_hapl_scaff.annotation.gtf 
1818071 gencode.vM18.chr_patch_hapl_scaff.annotation.gtf

Then I run my script multiple times :

17:31:21 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1754744 testBiostar.gtf
17:31:47 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1754735 testBiostar.gtf

Everytime I launch the script I got different results (sometimes I got a result I already got...)


EDIT : While typing I found the answer but I don't know why (computer science as its finest)

I opened the testBiostar.gtf result file and I found that some lines were cut and glue to others. My hints were on the read/write process... I do not know where exactly, but I replaced :

new_vcf_file = open(genome + ".gtf", "a")

To :

new_vcf_file = open(genome + ".gtf", "w")

17:54:10 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1818071 testBiostar.gtf
17:55:10 [ubuntu16@ubuntu16-VirtualBox Data]$ python /media/sf_raid/Projects/PTCB/test_biostar.py -g testBiostar -y gencode.vM18.chr_patch_hapl_scaff.annotation.gtf && wc -l testBiostar.gtf && rm testBiostar.gtf
1818071 testBiostar.gtf

My question is why it is working now ? I know the 'a' is for 'append to the end of the file' but after each run I delete this file

I bet it is a stupid answer... but I can not find it

Thanks

python • 510 views
ADD COMMENTlink modified 13 months ago • written 13 months ago by Bastien Hervé4.5k
2

This have no relation to the answer, why you do not use python argparse? this will save you the usage() function and help more in processing input args.

ADD REPLYlink modified 13 months ago • written 13 months ago by Medhat8.5k

Habits mostly, I learned python with getopt so I keep using it. I will give a try to argparse thanks

ADD REPLYlink written 13 months ago by Bastien Hervé4.5k

You will not regret :D

ADD REPLYlink written 13 months ago by Medhat8.5k
1

EDIT : While typing I found the answer [...]

I stopped counting how many time this happens to me ;)

ADD REPLYlink written 13 months ago by finswimmer12k
2
gravatar for Devon Ryan
13 months ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

What happens if you use open(genome + ".gtf", "w"), rather than open(genome + ".gtf", "a")? I wonder if there's some buffering going on.

ADD COMMENTlink written 13 months ago by Devon Ryan92k

I forgot to put the result of "a" to "w", I edited my post. It works, I got all the results I was expected. I agree for the buffering in the script with the "a" but the script end successfully, I delete the file and I build a new one using another command line... How can it still buffer something if the script ends ?

If this could have a relation, I am in a virtual environment

ADD REPLYlink modified 13 months ago • written 13 months ago by Bastien Hervé4.5k
1

My guess is that the file system inside the virtualbox is odd, in that it doesn't just delete the inode, but instead starts deleting the actual contents first. Do an immediate ls -l after the rm and see if the file still shows up.

ADD REPLYlink written 13 months ago by Devon Ryan92k

After the rm command the file is gone. But the first run works well I got all the results, the other runs failed. I'll try this on my host machine

ADD REPLYlink modified 13 months ago • written 13 months ago by Bastien Hervé4.5k

Works perfectly well on the host machine !

This is a dangerous behaviour from my virtual machine, took me 3 hours to find out the bug and I doubt about my results now... Did you see something similar on specialize forum ?

ADD REPLYlink modified 13 months ago • written 13 months ago by Bastien Hervé4.5k

I'm not sure what sort of "specialize forum" it is, but no. This was a pure guess on my part aided by knowing what the "a" and "w" bits are really doing under the hood.

ADD REPLYlink written 13 months ago by Devon Ryan92k

Mean StackOverflow discussion or python forum... How does a system manage the "a" option ?

ADD REPLYlink written 13 months ago by Bastien Hervé4.5k

a opens the file and, if it already exists. w opens a file and will truncate it if it already exists. As a general strategy, it's best to avoid using a unless you have a good reason to use it. My guess is that the system call used by ls and that used by the open() system call are seeing somewhat different things due to a shoddy file system implementation in virtual box.

ADD REPLYlink written 13 months ago by Devon Ryan92k

I think you missed some words at the beginning. w truncate ? w overwrite !

ADD REPLYlink written 13 months ago by Bastien Hervé4.5k

It overwrites by truncating in my understanding.

ADD REPLYlink written 13 months ago by Devon Ryan92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1742 users visited in the last hour