Question: Matrix Building
9.5 years ago by
Mcdenzlix50 wrote:

I have 2 files which I need to parse and build a matrix from. The files are as follows:

file 1.
etc ##

A file listing my cleaned outputs from the analysis, all in the same directory. Each of these output files contains genes for certain species, taken from combinations of BLAST output files, i.e. in the format (one line from file2 \t another line from file2) if a gene in file 2 aligned with a gene in file 1.

file 2.

A file of genes from some species in the experiment. It has more than 4000 genes.

I want to make a matrix in which the first column is file 1 and the first row is file 2.

Then I will open the files listed in file 1, one at a time, and compare their contents with the list in file 2. If a gene matches, the corresponding cell in the matrix is filled with 1, otherwise with 0. That will give me a presence/absence matrix for my list in file 2 against the outputs in file 1.

Urgent help needed, since this forms the basis of my next step.



My script so far:

#!/usr/bin/env python
import os
path = "./xxxxxx"
mybk = []    # array of file2
listbk = []  # array of file1
mylist = open('file1.txt', 'r')
mychecklist = open('file2.txt', 'r')
for line in mychecklist:  # list of resistant genes
    mybk.append(line.strip())
for line in mylist:  # list of parsed files from blast output
    listbk.append(line.strip())
for i in listbk:  # open parsed files to read and analyze content
    fname = os.path.join(path, i)
    text = open(fname, 'r')
    for line in text:
        pass  ### stuck... since all lines from the files in file1 get read into the same place
matrix parsing file python • 2.0k views
ADD COMMENTlink modified 9.5 years ago by Istvan Albert ♦♦ 84k • written 9.5 years ago by Mcdenzlix50

Can you put your code in a code block? That will make it easier to debug.

ADD REPLYlink written 9.5 years ago by Thaman3.2k
9.5 years ago by
Istvan Albert ♦♦ 84k
University Park, USA
Istvan Albert ♦♦ 84k wrote:

I am assuming that by matrix you mean that columns always refer to the same gene id even if it is missing (in which case you would get an empty cell). The program below does that; if you don't need that, just remove the section that collects all genes:

# f1.txt is the file that contains the other filenames
# get all the filenames (as a list, because we loop over them twice)
fnames = list(map(str.strip, open('f1.txt')))

# now collect all gene ids
all_genes = set()
for fname in fnames:
    genes = map(str.strip, open(fname))
    all_genes.update(genes)

# let's give them a nicer order
all_genes = sorted(all_genes)

# now generate the matrix
for fname in fnames:
    genes = list(map(str.strip, open(fname)))
    store = dict(zip(genes, genes))
    row = [store.get(name, '') for name in all_genes]
    row = [fname] + row
    print('\t'.join(row))

An example test output is then:

f2.txt  A   B   C   D   E       
f3.txt  A   B   C   D       X   Y
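
If you would rather have the 1/0 presence/absence values described in the question instead of the gene ids, a minimal sketch could look like the one below. The function name `presence_matrix` and the toy data are made up for illustration; in practice the checklist would be read from file 2 and the gene lists from the files named in file 1.

```python
def presence_matrix(files_to_genes, checklist):
    """Build rows of [file name, 1/0, 1/0, ...], one row per file.

    files_to_genes: dict mapping a file name to the list of genes found in it
    checklist: ordered list of gene ids that become the matrix columns
    (both names are illustrative, not taken from the original post)
    """
    rows = []
    for fname in sorted(files_to_genes):
        present = set(files_to_genes[fname])  # set gives fast membership tests
        rows.append([fname] + [1 if gene in present else 0 for gene in checklist])
    return rows


# Toy data standing in for file 2 (checklist) and the files listed in file 1.
checklist = ['A', 'B', 'C', 'D', 'E', 'X', 'Y']
files_to_genes = {
    'f2.txt': ['A', 'B', 'C', 'D', 'E'],
    'f3.txt': ['A', 'B', 'C', 'D', 'X', 'Y'],
}

matrix = presence_matrix(files_to_genes, checklist)

# Header row with the gene ids, then one tab-separated row per file.
print('\t'.join([''] + checklist))
for row in matrix:
    print('\t'.join(str(cell) for cell in row))
```

Using a set for each file keeps the lookup fast even with a checklist of several thousand genes.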
ADD COMMENTlink modified 8 months ago by RamRS27k • written 9.5 years ago by Istvan Albert ♦♦ 84k

Hi Istvan. You use both 'map' and list comprehensions in this code. What criteria make you choose one or the other? Where is 'map' better? Is it needed at all, given list comprehensions? Cheers.

ADD REPLYlink written 9.5 years ago by Eric Normandeau10k

I guess I used what seemed more readable to me. I prefer map() when the data needs a simple transformation that can be achieved by applying a simple function over an iterable. In the other case, doing it with map would have required defining a closure over the store variable (adding at least three extra lines), so a list comprehension seemed simpler.
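
To illustrate the trade-off, here is a toy sketch of the same lookup written both ways. The sample data and the helper name `lookup` are invented for this example.

```python
all_genes = ['A', 'B', 'C']
store = {'A': 'A', 'C': 'C'}


# map() version: needs a helper function that closes over `store`.
def lookup(name):
    return store.get(name, '')


row_map = list(map(lookup, all_genes))

# List comprehension version: the lookup is written inline, no helper needed.
row_comp = [store.get(name, '') for name in all_genes]

# Both approaches produce the same row.
print(row_map == row_comp)
```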

ADD REPLYlink written 9.5 years ago by Istvan Albert ♦♦ 84k