Writing a CSV file in Python
7.0 years ago
elisheva ▴ 120

Hello everybody! I need some help with my Python script. I wrote a script that reads multiple FASTA sequences and does some counting on the nucleotides. Here is my script:

from Bio import SeqIO
#Initializes a list of all the possible dinucleotides
dinucleotides = ['AA','AT','AC','AG',
                 'TT','TA','TC','TG',
                 'CC','CA','CT','CG',
                 'GG','GA','GT','GC']

def Counting(seq):
    """This function gets a cDNA and returns the count of each dinucleotide"""
    counting = {} #A dictionary where the dinucleotides are the keys and the counts are the values.

    for i in range(len(seq) - 1): #Every overlapping window of length 2.
        dinuc = seq[i:i+2]
        if dinuc in dinucleotides:
            counting[dinuc] = counting.get(dinuc, 0) + 1 #Adds one occurrence, starting from 0.

    return counting

def cDNA(Id, seq): #This function has to be written.
    pass


# File path to FASTA file
path_to_file = raw_input("file name: ")
with open(path_to_file, mode='r') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        # Extract individual parts of the FASTA record
        identifier = record.id
        #description = record.description
        sequence = record.seq
        cDNA(identifier, sequence)

What I want the output to be is a file that contains something like this:

     Id     length      AA    AT    AC    AG         
CCE57618     2786       58   450    45   101         
CCE57619     1140       12    3     70    98

etc. for all the dinucleotides of all the sequences.

Thanks!!!

python genome • 5.9k views

CSV is short for "Comma Separated Values". What you specify as your desired output appears to be tab-delimited and right-justified. These are two different formats. Can you clarify your question?
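For illustration (not from the thread), the csv module can produce either format just by changing the delimiter; the file names and row values here are hypothetical:

```python
import csv

row = ["CCE57618", 2786, 58, 450]

# Comma-separated (CSV)
with open("counts.csv", "w") as handle:
    csv.writer(handle).writerow(row)

# Tab-delimited (TSV): same writer, different delimiter
with open("counts.tsv", "w") as handle:
    csv.writer(handle, delimiter="\t").writerow(row)
```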


I did. I don't think it matters whether the values are separated by commas, as long as it will be possible later to read the file back line by line.


I often just do it manually for simple CSVs without too many fields, e.g.:

with open(outfile, 'w') as ofh:
    ofh.write(var1 + ',' + var2 + ',' + var3 + ',' + var4 + ... + ',' + varN + '\n')

Nice and easy.

So in this case, if you have a number of fasta files to analyse, I personally would write the script to handle one sequence and output a single row of data, then cat all the rows together after running the script for every fasta file.
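The per-sequence idea above can be sketched like this (the function name and values are hypothetical, not from the thread):

```python
def row_for(identifier, length, counts, dimers):
    """Build one comma-separated row: Id, length, then each dimer count (0 if absent)."""
    fields = [identifier, str(length)] + [str(counts.get(d, 0)) for d in dimers]
    return ','.join(fields)

dimers = ['AA', 'AT', 'AC', 'AG']
print(row_for('CCE57618', 2786, {'AA': 58, 'AT': 450, 'AC': 45, 'AG': 101}, dimers))
# -> CCE57618,2786,58,450,45,101
```

Rows printed by separate runs can then be concatenated with cat, as described above.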

7.0 years ago

Generally, to write CSV:

#!/usr/bin/env python

import csv

# set up the CSV handle    
f = csv.writer(open("output.csv", "wb+"))

# set up a couple convenience lists
dimers = [ "AA", ... , "GC" ]
header_keys = ["Id", "length" ] + dimers

# write out the header
f.writerow(header_keys)

# for each sequence, generate the dimer counts and write out to the CSV handle
for seq in sequences:
    # example structure -- you populate this with your code
    counts = {'Id': ... , 'length': ... , 'dimers':{ 'AA' : 123 , 'AT' : 456 , ... }}
    # build row
    row_of_elems_to_write = list()
    row_of_elems_to_write.append(counts['Id'])
    row_of_elems_to_write.append(counts['length'])
    for dimer in dimers:
        row_of_elems_to_write.append(counts['dimers'][dimer])
    # write row to CSV handle
    f.writerow(row_of_elems_to_write)

# close CSV handle        
f.close()

Thank you so much!! It works!! But I get an error on f.close(). What can I do about that?
And one more question: this script should run on very large files. What can I do to make it run faster? Right now it takes about 10 minutes for a half-gigabyte file.


If you get an error on f.close(), you might try wrapping it in a try..except block to catch the exception (and error) being thrown. The tradeoff of using libraries like SeqIO is that they tend to use a lot of memory and do not always handle IO very quickly. If you need speed, don't use the csv module, don't use libraries to parse text, and ideally don't use Python: just write a tab-delimited file from a language like bash or Perl, which can be very fast.
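For reference, the close() error most likely occurs because csv.writer() returns a writer object, not a file handle, and writer objects have no close() method. Keeping a separate reference to the underlying file lets you close it explicitly (a minimal sketch; the file name is hypothetical):

```python
import csv

out = open("output.csv", "w")      # keep a reference to the underlying file
writer = csv.writer(out)           # the writer wraps the file but cannot close it
writer.writerow(["Id", "length"])
out.close()                        # close the file, not the writer
```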


I've fixed this problem by using:

table1 = (open(name + ".csv", "wb+"))
writer1 = csv.writer(table1)
table1.close()

And it worked quite well. But when I try to run the script again (when the csv file is not empty) I get this error:

table1 = (open(name + ".csv", "wb+"))
IOError: [Errno 13] Permission denied: 'Escherichia_coli.csv'

Do you know how I can fix it?


I'd be inclined to enclose the whole block in a with open(...): statement, which will take care of properly closing the file etc. for you.
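A minimal sketch of that suggestion (file name and rows hypothetical): the with block closes the file automatically, even if an error occurs mid-write.

```python
import csv

rows = [["CCE57618", 2786, 58], ["CCE57619", 1140, 12]]
with open("counts.csv", "w") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Id", "length", "AA"])  # header row
    writer.writerows(rows)                   # one row per sequence
# no explicit close() needed here
```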


OK, I will try it! Thanks!!


A permission-denied error usually means you don't have permissions to write a file to wherever you are writing it (most likely your current working directory). Specify another folder to put your file into.


Thanks for your tips. My script really ran for a very long time on larger files (hours). I'm not familiar with Perl or bash. Would it work properly in C++? Or maybe you can help me translate it to bash or Perl?


How many/how big are the files you're processing?

Python isn't famed for its speed, but unless your dataset is enormous, I wouldn't expect it to take especially long for this task. String parsing is a fairly intensive task, however, so Perl or Awk might be better choices.


My files are about 500,000 KB (roughly 500 MB). This is not my whole script. The problem is that I'm not familiar with Perl or Awk, and doing all of this in C++ is very complicated (especially anything that uses Biopython).


Unfortunately, you may have little choice. If performance suddenly becomes critical, Python probably just isn't the language to use. That said, there are lots of examples online of DNA string parsing to get GC% etc. in any number of languages these days; you might be able to find something else you can adapt.


OK, can you give me some links for it? And one more thing: I also have to calculate the ratio between some specific dinucleotides. Do those scripts output a table of all the results?


You might have more luck searching for generalised solutions to the problem, which in your case is simply counting the occurrences of substrings within a string (it doesn't matter what the strings are, or that they are DNA).

This is still a python implementation, but this thread seems to be attempting a similar problem?

Similarly, there's a substring counting loop in C here. You could modify this to count each of your 16 strings easily enough, I would think (I know basically zero C, though, so I might not be much use!).

Here are a couple of examples of GC calculators that might help (I have no idea how fast they will be, or whether they will help). You would need to modify the output slightly to get your csv-like format.

#!/usr/bin/python

import os
import sys

inFile = open(sys.argv[1],'r')

totalBP = 0
gcBP = 0
headerCount = 0


for line in inFile:
    if line[0] == ">":
        headerCount += 1
    else:
        seqLine = line.strip().lower()
        totalBP += len(seqLine)
        gcBP += seqLine.count('g') # e.g. modify these lines to count the number of occurrences of each dinucleotide
        gcBP += seqLine.count('c') # 

print ("Locus %s" % str(sys.argv[1])) + ' GC: ' + str(float(gcBP) / totalBP)
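One caveat if you adapt the count() calls above for dinucleotides: str.count() only finds non-overlapping matches ("AAA".count("AA") is 1, even though two overlapping windows contain "AA"). A sliding-window tally avoids this (a sketch, not from the thread):

```python
from collections import Counter

def dinucleotide_counts(seq):
    """Count every overlapping 2-mer in one pass over the sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

counts = dinucleotide_counts("AAATG")  # windows: AA, AA, AT, TG
# counts["AA"] == 2 even though "AAATG".count("AA") == 1
```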

A perl script I stole from seqanswers:

#!/usr/bin/perl -w

use strict;

my $i=1;
my $infasta=$ARGV[0];

open FH1 , "$infasta" or die "$!";

while(<FH1>){
        if ($i==1){
                chomp;
                $_ =~ tr/>//d;
                print "$_\t";
                $i++;}
        elsif ($i==2){
                my $UC = uc($_);
                my $length = ($UC =~ tr/ATGCN//);
                my $numsGCs = ($UC =~ tr/GC/gc/);   # e.g add additional regexes here to find other dinucleotide patterns,
                my $percentGC = ($numsGCs/$length)*100;
                my $rounded = sprintf("%.2f", $percentGC);
                print "$rounded\n";
                $i++;}
        elsif ($i>2){
                $i=1;
                redo;};
}

They might be moderately more performant than your existing script, but without testing it I don't know.


Thank you so much!!! Can I upload my whole script so you can see what else it requires?


Yeah, sure, it might just be one particular part of the script that is causing the slowdown. Out of interest, have you tried the above function in isolation from the rest of the functions in the script and compared the time?

You may also need to provide us with a sample of your input data too.


I uploaded my script right here: A long run time problem
