Question: Writing a CSV file in Python
elisheva (Israel) wrote, 3.5 years ago:

Hello everybody! I need some help with my Python script. I wrote a script that takes multiple FASTA sequences and does some counting on the nucleotides. Here is my script:

from Bio import SeqIO
# Initialize a list of all the possible dinucleotides.
# Note the commas at the end of each line: without them Python silently
# concatenates adjacent string literals into entries like 'AGTT'.
dinucleotides = ['AA', 'AT', 'AC', 'AG',
                 'TT', 'TA', 'TC', 'TG',
                 'CC', 'CA', 'CT', 'CG',
                 'GG', 'GA', 'GT', 'GC']

def Counting(seq):
    """This function gets a cDNA sequence and returns the frequency of each dinucleotide."""
    counting = {}  # A dictionary where the dinucleotides are the keys and the counts are the values.
    # Slide a window of width 2 along the sequence. range(len(seq) - 1)
    # includes the final dinucleotide, which the original len(seq) - 2 missed.
    for i in range(len(seq) - 1):
        dinuc = str(seq[i:i + 2])
        if dinuc in dinucleotides:
            counting[dinuc] = counting.get(dinuc, 0) + 1  # Add one occurrence, starting from 0.
    return counting

def cDNA(Id, seq):
    # This function still has to be written.
    pass


# File path to FASTA file
path_to_file = raw_input("file name: ")
with open(path_to_file, mode='r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        # Extract individual parts of the FASTA record
        identifier = record.id
        #description = record.description
        sequence = record.seq
        cDNA(identifier, sequence)

What I want as output is a file that contains something like this:

     Id     length      AA    AT    AC    AG         
CCE57618     2786       58   450    45   101         
CCE57619     1140       12    3     70    98

etc. for all the dinucleotides of all the sequences.

Thanks!!!

Tags: python • genome
modified 3.5 years ago by Alex Reynolds • written 3.5 years ago by elisheva

CSV is short for "Comma-Separated Values". The output you describe appears to be tab-delimited and right-justified. These are two different formats. Can you clarify your question?
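For illustration, csv.writer can produce either format just by changing the delimiter. A minimal Python 3 sketch (the filenames and values are hypothetical; in Python 2 you would open in 'wb' mode without the newline argument):

import csv

row = ["CCE57618", 2786, 58, 450]  # hypothetical values from the question

# comma-separated
with open("counts.csv", "w", newline="") as f:
    csv.writer(f).writerow(row)

# tab-delimited: same writer, different delimiter
with open("counts.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerow(row)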

modified 3.5 years ago • written 3.5 years ago by Alex Reynolds

I did. I don't think it matters whether the values are separated by commas, but I want it to be possible later to read the file back line by line.
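Reading the rows back line by line is straightforward with csv.reader; a minimal Python 3 sketch (the filename is an assumption):

import csv

with open("output.csv", newline="") as f:  # filename is hypothetical
    for row in csv.reader(f):
        print(row)  # each row comes back as a list of strings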

written 3.5 years ago by elisheva

I often just do it manually for simple CSVs without too many fields, e.g.:

with open(outfile, 'w') as ofh:
    ofh.write(var1 + ',' + var2 + ',' + var3 + ',' + var4 + ... + ',' + varN + '\n')

Nice and easy.

So in this case, if you have a number of FASTA files to analyse, I personally would write my script to handle one sequence and output a single row of data, then cat all the rows together after running the script for every FASTA file. A sketch of that workflow is below.
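A minimal Python 3 sketch of the one-row-per-file workflow (count_row, the output filename, and the glob pattern are hypothetical stand-ins):

import csv
import glob

def count_row(path):
    """Hypothetical: parse one FASTA file and return one row of counts."""
    # ... your per-sequence counting goes here ...
    return [path, 0]  # placeholder row

with open("all_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for fasta in glob.glob("*.fasta"):  # one row per input file
        writer.writerow(count_row(fasta))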

written 3.5 years ago by Joe
Alex Reynolds (Seattle, WA USA) wrote, 3.5 years ago:

Generally, to write CSV:

#!/usr/bin/env python

import csv

# open the output file and set up the CSV writer -- keep a separate handle
# to the file, because the writer object itself has no close() method
out = open("output.csv", "wb")
f = csv.writer(out)

# set up a couple of convenience lists
dimers = [ "AA", ... , "GC" ]
header_keys = ["Id", "length"] + dimers

# write out the header
f.writerow(header_keys)

# for each sequence, generate the dimer counts and write out to the CSV handle
# (sequences stands for your parsed records)
for seq in sequences:
    # example structure -- you populate this with your code
    counts = {'Id': ... , 'length': ... , 'dimers': { 'AA': 123, 'AT': 456, ... }}
    # build row
    row_of_elems_to_write = list()
    row_of_elems_to_write.append(counts['Id'])
    row_of_elems_to_write.append(counts['length'])
    for dimer in dimers:
        row_of_elems_to_write.append(counts['dimers'][dimer])
    # write row to CSV handle
    f.writerow(row_of_elems_to_write)

# close the underlying file handle
out.close()
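Hypothetical glue between this answer and the Counting function from the question (note that dinucleotides missing from a sequence need a default of 0):

dimer_counts = Counting(sequence)  # from the question's script
counts = {'Id': identifier, 'length': len(sequence), 'dimers': dimer_counts}
row = [counts['Id'], counts['length']] + [counts['dimers'].get(d, 0) for d in dimers]
f.writerow(row)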
modified 3.5 years ago • written 3.5 years ago by Alex Reynolds

Thank you so much!! It works!! But I get an error on "f.close()". What can I do about that?
And one more question: this script has to run on very large files. What can I do to make it run faster? It currently takes about 10 minutes for a half-gigabyte file.

modified 3.5 years ago • written 3.5 years ago by elisheva

If you get an error on f.close(), you might try wrapping it in a try..except block to catch and inspect the exception being thrown. (Here the cause is that a csv.writer object has no close() method; it is the underlying file object that must be closed.) The tradeoff of using libraries like SeqIO is that they tend to use a lot of memory and do not always handle IO very quickly. If you need speed, don't use CSV, don't use libraries to parse text, and mainly don't use Python. Just write a tab-delimited file via a shell like bash, or a language like Perl, which are very fast.
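A minimal Python 3 sketch that reproduces the error and catches it (the filename is hypothetical):

import csv

f = csv.writer(open("output.csv", "w", newline=""))
try:
    f.close()  # raises, because csv.writer objects have no close() method
except AttributeError as e:
    print("caught:", e)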

written 3.5 years ago by Alex Reynolds

I've fixed this problem by using:

table1 = open(name + ".csv", "wb+")
writer1 = csv.writer(table1)
table1.close()

And it worked quite well. But when I try to run the script again (when the CSV file is not empty), I get this error:

table1 = open(name + ".csv", "wb+")
IOError: [Errno 13] Permission denied: 'Escherichia_coli.csv'

Do you know how I can fix it?

modified 3.5 years ago • written 3.5 years ago by elisheva

I'd be inclined to enclose the whole block in a with open(...) statement, which will take care of properly closing the file etc. for you, e.g. the sketch below.
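A minimal Python 3 sketch (the base name is borrowed from the traceback above; in Python 2, open with 'wb' and drop the newline argument):

import csv

name = "Escherichia_coli"  # hypothetical, borrowed from the traceback above
with open(name + ".csv", "w", newline="") as table1:
    writer1 = csv.writer(table1)
    writer1.writerow(["Id", "length", "AA"])  # example row
# no explicit close needed; the with block closes the file on exit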

written 3.5 years ago by Joe

OK, I will try it! Thanks!!

written 3.5 years ago by elisheva

A permission-denied error usually means you don't have permission to write a file wherever you are writing it (most likely your current working directory). Specify another folder to put your file into.
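For example, a minimal sketch that writes under the home directory instead of the current working directory (the base name is borrowed from the traceback above):

import os

name = "Escherichia_coli"  # hypothetical, borrowed from the traceback above
out_path = os.path.join(os.path.expanduser("~"), name + ".csv")
table1 = open(out_path, "w")
table1.close()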

written 3.5 years ago by Alex Reynolds

Thanks for your tips. My script really does run for a very long time on larger files (hours). I am not so familiar with Perl or bash. Would it work properly in C++? Or maybe you can help me translate it to bash or Perl?

written 3.5 years ago by elisheva

How many/how big are the files you're processing?

Python isn't famed for its speed, but unless your dataset is enormous, I wouldn't expect it to take especially long for this task. String parsing is a fairly intensive task, however, so Perl or Awk might be better choices.

written 3.5 years ago by Joe

My files are about 500,000 KB (roughly half a gigabyte). This is not my whole script. The problem is that I'm not familiar with Perl or Awk, and doing all of this in C++ seems very complicated (especially compared with using Biopython).

written 3.5 years ago by elisheva

Unfortunately, you may have little choice. If performance suddenly becomes critical, Python probably just isn't the language to use. That said, there are lots of examples online of DNA string parsing to get GC% etc. in any number of languages these days - you might be able to find something else you can adapt.

written 3.5 years ago by Joe

OK, can you give me some links? And one more thing: I also have to calculate the ratio between some specific dinucleotides. Do those scripts output a table of all the results?

written 3.5 years ago by elisheva

You might have more luck searching for generalised solutions to the problem, which in your case is simply counting the occurrences of substrings within a string (it doesn't matter what the strings are, or that they are DNA); see the sketch just below.
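As an illustration of that general problem, a minimal Python sketch using a sliding window and collections.Counter (the sequence and window size are hypothetical):

from collections import Counter

seq = "ATGCGATATA"  # hypothetical sequence
k = 2  # window width: dinucleotides

# count every overlapping window of width k in one pass over the string
counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
print(counts)  # e.g. Counter({'AT': 3, 'TA': 2, ...})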

This is still a Python implementation, but this thread seems to be attempting a similar problem.

Similarly, there's a substring counting loop in C here. You could modify this to count each of your 16 strings easily enough, I would think (I know basically zero C though, so I might not be much use!).

Here are a couple of examples of GC calculators that might help (I have no idea how fast they will be, or whether they will help). You would need to modify the output slightly to get your CSV-like format.

#!/usr/bin/python

import sys

totalBP = 0
gcBP = 0
headerCount = 0

with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        if line[0] == ">":
            headerCount += 1
        else:
            seqLine = line.strip().lower()
            totalBP += len(seqLine)
            gcBP += seqLine.count('g')  # e.g. modify these lines to count the occurrences of each dinucleotide
            gcBP += seqLine.count('c')

print("Locus %s GC: %s" % (sys.argv[1], float(gcBP) / totalBP))

A Perl script I stole from SeqAnswers:

#!/usr/bin/perl -w

use strict;

my $i = 1;
my $infasta = $ARGV[0];

open(FH1, '<', $infasta) or die "$!";

while (<FH1>) {
    if ($i == 1) {
        chomp;
        $_ =~ tr/>//d;
        print "$_\t";
        $i++;
    }
    elsif ($i == 2) {
        my $UC = uc($_);
        my $length = ($UC =~ tr/ATGCN//);
        my $numsGCs = ($UC =~ tr/GC/gc/);  # e.g. add additional regexes here to find other dinucleotide patterns
        my $percentGC = ($numsGCs / $length) * 100;
        my $rounded = sprintf("%.2f", $percentGC);
        print "$rounded\n";
        $i++;
    }
    elsif ($i > 2) {
        $i = 1;
        redo;
    }
}
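One caveat worth noting: as written, this Perl script assumes each FASTA record spans exactly two lines (one header line followed by one sequence line), so wrapped multi-line FASTA would need its sequence lines joined first.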

They might be moderately more performant than your existing script, but without testing them I don't know.

modified 3.5 years ago • written 3.5 years ago by Joe

Thank you so much!!! Can I upload my whole script so you can see what else it requires?

written 3.5 years ago by elisheva

Yeah sure, it might just be one particular part of the script that is causing the slowdown. Out of interest, have you tried timing the function above in isolation from the rest of the script and comparing (see the sketch below)?

You may also need to provide us with a sample of your input data too.
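A minimal timing sketch (it assumes the Counting function from the question is already defined; the test sequence is synthetic):

import time

sequence = "ACGT" * 250000  # synthetic 1 Mb test sequence

start = time.time()
Counting(sequence)  # the function from the question's script
print("Counting took %.2f seconds" % (time.time() - start))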

written 3.5 years ago by Joe

I uploaded my script right here: A long run time problem

written 3.5 years ago by elisheva