Question

Extracting Specific Columns from Multiple Files & Writing to File Python


8.4 years ago

BioICoder ▴ 40

I am currently learning Python and would really like your help solving the following problem. I have seven tab-delimited files; each file has the same number of columns with the same names, but different data. Below is a sample of what any of the seven files looks like:

 test_id gene_id gene    locus   sample_1        sample_2        status  value_1 value_2 log2(fold_change)
  000001     000001     ZZ 1:1   01  01   NOTEST  0       0       0       0       1       1       no

I am trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:

gene name   locus   log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)
ZZ  1:1         0     0     0     0

All the log2(fold_change) values are obtained from the tenth column of each of the seven files.
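As a rough sketch of the extraction step only: the snippet below reads one tab-delimited file and keeps columns 3, 4 and 10 (indexes 2, 3 and 9). It assumes Python 3, and the function name `extract_columns` and the in-memory sample are made up for illustration; they are not part of the original code.

```python
import csv
import io

def extract_columns(handle, indexes=(2, 3, 9)):
    """Yield the selected columns of each data row, skipping the header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        # ignore rows that are too short to contain every requested column
        if len(row) > max(indexes):
            yield tuple(row[i] for i in indexes)

# Hypothetical one-row sample standing in for a real file opened with open(path)
sample = io.StringIO(
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\t"
    "value_1\tvalue_2\tlog2(fold_change)\n"
    "000001\t000001\tZZ\t1:1\t01\t01\tNOTEST\t0\t0\t0\n"
)
rows = list(extract_columns(sample))
print(rows)
```

With a real file, the same function could be called on the handle returned by `open(path)` inside a `with` block.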

This is what I have so far. I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish the task explained above and still needs some work:

 import csv
 import glob
 from collections import defaultdict

 dicti = defaultdict(list)  # gene -> list of (file tag, row) pairs
 filetag = []

 def read_data(file, base):
     with open(file, 'r') as f:
         reader = csv.reader(f, delimiter='\t')
         for row in reader:
             # skip the header line
             if 'test_id' not in row[0]:
                 dicti[row[2]].append((base, row))

 # raw_input is Python 2; use input() on Python 3
 name_of_fold = raw_input("Folder name to store output files in: ")
 for file in glob.glob("*.txt"):
     base = file[0:3] + "-log2(fold_change)"
     filetag.append(base)
     read_data(file, base)

 with open("output.txt", "w") as out:
     out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
     # .items() is needed to iterate over key/value pairs
     for k, v in dicti.items():
         out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")

So, the code above runs, but it does not produce what I am looking for, and here is why. The problem is in the output step: I am writing a tab-delimited output file with the gene in the first column (k), followed by v[1][1][3], which is the locus of that particular gene, and finally the part I am having a tough time coding:

 "".join([ int(z[0][0:3]) * "\t" + z[1][9]  for z in v ])

I am trying to collect the fold change from each of the seven files for that particular gene and locus and write each value to the correct column. To do that, I multiply "\t" by the file number, which should ensure that the value goes to the right column. The problem is that when the value from the next file comes along, writing continues from where the previous value left off, which I don't want; I want the tab count to start again from the beginning of the value columns.

Here is an example of what I mean:

 gene name   locus     log2(fold change) from file 1    .... log2(fold change) from file7 
 ZZ           1:3      0           
                             0

The first log2 value is placed according to its file number, for instance 2: I multiply "\t" by that number (2) and append the fold_change value, and it is recorded without a problem. But the last value might belong to the seventh column, for instance, and it does not land in column seven, because the tabs are counted from where the previous write ended rather than from the start of the row.
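One possible way around this alignment problem, sketched under assumptions rather than offered as the definitive fix, is to stop counting tabs altogether: give every gene a list with one slot per file, put each value at its file's index, and join the whole list with tabs once. The sketch assumes Python 3; the names `build_rows` and `per_file` and the `NA` placeholder for missing values are hypothetical and do not come from the code above.

```python
from collections import OrderedDict

def build_rows(per_file, placeholder="NA"):
    """per_file: list (one item per input file) of {(gene, locus): value} dicts.
    Returns tab-joined lines with exactly one value column per file."""
    table = OrderedDict()
    for col, values in enumerate(per_file):
        for key, value in values.items():
            # create a full-width row the first time a gene/locus pair is seen
            slots = table.setdefault(key, [placeholder] * len(per_file))
            slots[col] = value  # the column position comes from the file index
    return ["\t".join(list(key) + slots) for key, slots in table.items()]

# Made-up values for three files; a real run would have seven dicts
per_file = [
    {("ZZ", "1:1"): "0"},
    {("ZZ", "1:1"): "0.5", ("YY", "2:7"): "-1.2"},
    {("YY", "2:7"): "3"},
]
lines = build_rows(per_file)
for line in lines:
    print(line)
```

Because every row always has the same number of slots, a value from file 7 lands in column 7 regardless of which other files contained that gene.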

I highly appreciate the help, guys, and thanks in advance!

python file-handeling • 12k views

ADD COMMENT • link 8.4 years ago by BioICoder ▴ 40


First comment: use the csv module: https://docs.python.org/2/library/csv.html

Second: use '\n' at the end of a line to write to the next line with the next out.write statement

Third: use the sys module to pass arguments to the script more easily. Execute the script as:

python yourscript.py file1.txt file2.txt

You can then access the arguments in the script:

file1 = sys.argv[1]
file2 = sys.argv[2]

Notice how counting starts from 1 since sys.argv[0] contains the name of the script, yourscript.py

ADD REPLY • link 8.4 years ago by WouterDeCoster 47k


1) I used the csv module! 2) I already used '\n' in the out.write statement, so your point is not clear. 3) I can't use that because I am reading multiple files, not a single one.

ADD REPLY • link 8.4 years ago by BioICoder ▴ 40


Ah, excuse me, I apparently didn't read carefully enough; I was a bit in a hurry yesterday. You can use the sys module for reading multiple files, so there is no problem with that. I'm afraid I don't completely understand what output you get and what output you want :-/
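On the point about multiple files: slicing `sys.argv[1:]` gives every path passed on the command line, so the same loop handles seven files as easily as two. A small sketch, in which the helper name `input_files` is made up for illustration:

```python
import sys

def input_files(argv):
    """Return every command-line argument after the script name."""
    return argv[1:]

if __name__ == "__main__":
    # e.g. python yourscript.py file1.txt file2.txt ... file7.txt
    for number, path in enumerate(input_files(sys.argv), start=1):
        print(number, path)
```

The `number` from `enumerate` could serve as the column position of each file's values in the output.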

ADD REPLY • link 8.4 years ago by WouterDeCoster 47k