Matrix Building
13.4 years ago
Mcdenzlix ▴ 50

I have two files that I need to parse in order to build a matrix.

The files are as follows:

file 1

NC_000964.parsed
NC_002570.parsed
NC_003909.parsed
NC_003997.parsed
NC_004721.parsed
NC_005945.parsed
NC_005957.parsed
NC_006274.parsed
NC_006322.parsed
NC_006510.parsed
NC_006582.parsed
..
..

A file listing the cleaned outputs from my analysis, all in the same directory. These files contain the genes for certain species as combinations from the BLAST output file, i.e. in the format (one line from file 2 \t another line from file 2), recorded if a gene in file 2 aligned with a gene in file 1.

file 2

gi|56418536|ref|YP_145854.1
gi|56418537|ref|YP_145855.1
gi|56418538|ref|YP_145856.1
gi|56418539|ref|YP_145857.1
gi|56418540|ref|YP_145858.1
gi|56418541|ref|YP_145859.1
..
..

A file of genes from some species in the experiment. It has more than 4000 genes.

I want to make a matrix in which the first column is file 1 and the first row is file 2.

Then I will open each of the files listed in file 1 and compare its contents with the list in file 2. If a gene matches, the corresponding cell in the matrix is filled with [1], otherwise [0]. That will give me a presence/absence matrix for my list in file 2 against the outputs in file 1.

Urgent help is needed, since this forms the basis of my next step.

Thanks

NB..

my script so far

#!/usr/bin/env python
import os

path = "./xxxxxx"  # directory holding the .parsed files

# file2: list of resistance genes
mybk = []
with open('file2.txt') as mychecklist:
    for line in mychecklist:
        mybk.append(line.strip())

# file1: list of parsed files from the BLAST output
listbk = []
with open('file1.txt') as mylist:
    for line in mylist:
        listbk.append(line.strip())

# open each parsed file to read and analyze its content
for name in listbk:
    fname = os.path.join(path, name)
    with open(fname) as text:
        for line in text:
            pass  # stuck here: the lines from all the files in file1 get read together

Can you put your code in a code block? That will make it easier to debug.

13.4 years ago

I am assuming that by matrix you mean that a column always refers to the same gene ID, even if that gene is missing from a file (in which case you get an empty cell). The program below does that; if you don't need it, just remove the section that collects all genes:

# f1.txt is the file that contains the other filenames
# get all the filenames
fnames = list(map(str.strip, open('f1.txt')))

# now collect all gene ids
all_genes = set()
for fname in fnames:
    genes = map(str.strip, open(fname))
    all_genes.update(genes)

# let's give them a nicer order
all_genes = sorted(all_genes)

# now generate the matrix, one tab-separated row per file
for fname in fnames:
    genes = list(map(str.strip, open(fname)))
    store = dict(zip(genes, genes))
    row = [ store.get(name, '') for name in all_genes ]
    row = [ fname ] + row
    print('\t'.join(row))

An example test output is then:

f2.txt  A   B   C   D   E       
f3.txt  A   B   C   D       X   Y
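
If the goal is instead a 0/1 presence/absence matrix with the genes from file 2 as columns, a variant might look like the sketch below (Python 3). It assumes each .parsed file holds tab-separated gene IDs on every line, as the question describes. The function names `presence_row` and `presence_matrix`, the placeholder IDs such as `other_gene_1`, and the in-memory sample data are made up for illustration; in practice the lines would come from the real files.

```python
def presence_row(genes, lines):
    """Return one 0/1 flag per gene in `genes` for a single parsed file."""
    seen = set()
    for line in lines:
        # Assumed format: tab-separated gene IDs on each line.
        seen.update(field.strip() for field in line.split('\t'))
    return [1 if gene in seen else 0 for gene in genes]


def presence_matrix(genes, parsed):
    """Build a header row plus one 0/1 row per parsed file.

    `parsed` maps a file name to an iterable of that file's lines.
    """
    rows = [['file'] + list(genes)]
    for name, lines in parsed.items():
        rows.append([name] + presence_row(genes, lines))
    return rows


# Hypothetical sample data standing in for file 2 and two .parsed files.
genes = ['gi|56418536|ref|YP_145854.1', 'gi|56418537|ref|YP_145855.1']
parsed = {
    'NC_000964.parsed': ['gi|56418536|ref|YP_145854.1\tother_gene_1\n'],
    'NC_002570.parsed': ['other_gene_2\tgi|56418537|ref|YP_145855.1\n'],
}

for row in presence_matrix(genes, parsed):
    print('\t'.join(str(cell) for cell in row))
```

Collecting each file's IDs into a set first keeps the lookup per gene cheap, which matters with more than 4000 genes in file 2.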

Hi Istvan. You use both 'map' and list comprehensions in this code. What criteria make you choose one or the other? Where is 'map' better? Is it needed at all, given list comprehensions? Cheers.


I guess I used what seemed more readable to me. I prefer map() when the operation is a simple transformation that can be achieved by applying a simple function over an iterable. In the other case, doing it with map() would have required defining a closure over the store variable (adding at least three extra lines), so a list comprehension seemed simpler.
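
To make the comparison concrete, here is a small sketch (Python 3) of the row-building step written both ways. The sample `store` and `all_genes` values and the helper name `lookup` are invented for illustration.

```python
store = {'A': 'A', 'C': 'C'}
all_genes = ['A', 'B', 'C']

# List comprehension: `store` is used directly inside the expression.
row_lc = [store.get(name, '') for name in all_genes]


# map(): needs a helper function that closes over `store`.
def lookup(name):
    return store.get(name, '')


row_map = list(map(lookup, all_genes))

# Both forms build the same row.
print(row_lc == row_map)
```

For the stripping step, by contrast, an existing one-argument function (`str.strip`) already does the job, so map() needs no extra helper there.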
