Question

Pythonic way of creating a matrix from multiple CSV files

0

7.0 years ago

novicebioinforesearcher ▴ 70

Hello all,

I am cross-posting this from Stack Overflow, as I did not receive a good response there or the question was not well received.

I have just started to learn programming with Python, so I am asking for a pythonic way (not pandas, as it is very advanced) to approach this problem. Any tips or comments are appreciated, as they will help me learn.

http://stackoverflow.com/questions/43482716/creating-a-matrix-using-python-for-biologist

Thank you.

python • 2.0k views

updated 7.0 years ago by Fuzzy D ▴ 10 • written 7.0 years ago by novicebioinforesearcher ▴ 70

1

What exactly couldn't you manage with the pandas solution in that thread? It looks pretty well broken down to me, and that's exactly how I'd go about doing it. If you're struggling to install pandas, try: python -m pip install pandas

If you don't have administrative access, consider using the Anaconda/Miniconda distributions.

While we sympathise with your novice status, there is no real substitute for just spending the time trying to figure it out - especially when you're asking for a kind of general solution to a specific task (only you know exactly what your data is going to look like each time).

7.0 years ago by Joe 21k

0

Hello @jrj.healey,

I am trying to learn as I go, and pandas seems far too advanced. If I could do it in a non-pythonic way it would make more sense to me. Moreover, the final output I get does not match the read counts from the individual files. Maybe that has to do with differences between the files: some have 222000 lines whereas others have 300000 lines.

7.0 years ago by novicebioinforesearcher ▴ 70

0

I'm not sure what you mean by a more 'pythonic' way? The ethos of python is this:

There should be one-- and preferably only one --obvious way to do it. https://www.python.org/dev/peps/pep-0020/#id3

I'd say in this case pandas is the aforementioned 'obvious way' (with the possible exception of Fuzzy's solution with csv). It might seem like an advanced step, but what you're basically doing is giving yourself access to many clever functions which have been prewritten to let you manipulate dataframes. I'd wager you'd save time overall if you spent it learning a bit of pandas instead of struggling through for, if & else statements.

You'll need to edit your question and provide reproducible input files and commands before we can address what problems you might have had with the various solutions - there isn't really enough to go on at the moment.

7.0 years ago by Joe 21k

0

Is there a reason this has to be in Python? In R, it's full_join() from the dplyr package.

BTW, in reality the genes in the files will all be present and in the same order, so you'll never have missing values.

7.0 years ago by Devon Ryan 104k

0

Dear Devon,

Thank you for your reply. That was my assumption too, but I am dealing with gene names that have been collapsed from a TopHat junction BED file. I have to make a matrix of these gene names, so there could be places where no reads map to a junction, e.g.

gene id             chr  junctionstart  junctionstop  strand  genename  readcount
ENSMUSG00000000001  3    107915006      107915391     (-)     Gnai3     20

I need to do this to build a dataset for a tool with which I am trying to detect differential junction usage.

7.0 years ago by novicebioinforesearcher ▴ 70

score 1 · Accepted Answer · 2017-04-25

You can read/write csv files using the csv module.

import csv

from collections import defaultdict

Make a list of files you want to read

myFiles = ['file1.csv', 'file2.csv']

Go through the files and get the genes, store them in a dictionary

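myDict = defaultdict(list)
for fileName in myFiles:
    # 'with' closes the file automatically once the block ends
    with open(fileName, 'r') as openF:
        csvIn = csv.reader(openF, delimiter=',')
        for line in csvIn:
            gene = line[0]        # first column: gene id
            expression = line[1]  # second column: expression value
            myDict[gene].append(expression)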

This gives you a dictionary whose keys are the gene ids. Looking up a key returns the list of expression values collected for that gene, one for each file it appeared in. You can then go through the dictionary key by key and write the output.
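As a minimal sketch of that last step, the dictionary can be written out with csv.writer. The helper name write_matrix, the output name matrix.csv and the toy values are made up for illustration, and this assumes Python 3 and that every gene appears in every file, so all lists have the same length.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_matrix(geneDict, outName):
    """Write one row per gene: the gene id followed by its expression values."""
    with open(outName, 'w', newline='') as outF:
        csvOut = csv.writer(outF, delimiter=',')
        for gene in sorted(geneDict):
            csvOut.writerow([gene] + geneDict[gene])


# Toy data standing in for myDict (the values are illustrative only)
exampleDict = defaultdict(list)
exampleDict['geneA'] += ['10', '12']
exampleDict['geneB'] += ['0', '7']

# Written to the system temp directory here; use any path you like
outPath = os.path.join(tempfile.gettempdir(), 'matrix.csv')
write_matrix(exampleDict, outPath)
```

Sorting the keys only makes the row order predictable; drop sorted() if you do not care about the order.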

You may not understand all this code right now, but it's fairly simple; look up any parts you don't understand. You may have problems if one file has a gene the other does not (the lists will then differ in length and the columns will no longer line up), but you can adapt the code to fix that in multiple ways.
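One possible adaptation, sketched here under some assumptions: remember which file each value came from, then fill the gaps when building the rows. The function name build_matrix, the fill value '0' and the toy files are hypothetical; it also assumes Python 3, two-column files with no header row, and that a missing gene really does mean zero reads in your data. If a gene occurs twice in one file, the later value wins.

```python
import csv
import os
import tempfile
from collections import defaultdict


def build_matrix(fileNames, missing='0'):
    """Return (header, rows) with one column per file; absent genes get `missing`."""
    counts = defaultdict(dict)  # gene -> {file name: expression}
    for fileName in fileNames:
        with open(fileName, 'r', newline='') as openF:
            for line in csv.reader(openF, delimiter=','):
                counts[line[0]][fileName] = line[1]
    header = ['gene'] + [os.path.basename(f) for f in fileNames]
    rows = [[gene] + [counts[gene].get(f, missing) for f in fileNames]
            for gene in sorted(counts)]
    return header, rows


# Two toy input files (the contents are illustrative only)
tmpDir = tempfile.mkdtemp()
toyFiles = []
for name, text in [('file1.csv', 'geneA,10\ngeneB,5\n'),
                   ('file2.csv', 'geneA,12\ngeneC,3\n')]:
    path = os.path.join(tmpDir, name)
    with open(path, 'w', newline='') as f:
        f.write(text)
    toyFiles.append(path)

header, rows = build_matrix(toyFiles)
for row in [header] + rows:
    print(','.join(row))
```

For junction-level data, the same idea should work with a tuple of the identifying columns (for example chromosome, start, stop and strand) as the key instead of line[0].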