Read In Multiple Text Files to Calculate a Threshold Python
2
0
Entering edit mode
4.4 years ago

I have ~ 400 text files of BLASTp results.
P_1L_GRL111.out (contents)

NP_387917.1 ADZ06570.1 44.29 289 153 4 2 289 1 282 7e-77 236

I have a file that reads in files ALL my P_1 files, sorts the e values, and then calculates a threshold. I would like to convert this into a function. I have made some attempts to define a function that can read in all my files, but as of now, the files only reads ALL P_1 files, or ALL P_2 files. I also feel like the function is not sorting the values correctly.

def threshold_proteinsVStarget(protein_Files):
# List of Lactobacillus databases created from BLAST commands
Lactobacillus_DB = [ 'L_GRL1112', 'L_214','L_CTV-05','L_JV-V01','L_ST1','L_MV-1A', 'L_202-4','L_224-1','L_JV-V03',    'L_MV-22','L_DSM_13335', 'LactinV_03V1b', 'SPIN_1401G', 'UPII_143_D', 'L_1153','L_269-3', 'L_JV-V16','L_49540']

list_e_values = []
for prot in protein_Files:
for db in Lactobacillus_DB:
line = open (prot + db + '.out').readline()
print line
line2 = line.strip('\n')
fields = line2.split('\t')
e_val = float(fields[10])
list_e_values.append(e_val)
return list_e_values


file1 = ['P_3']

print threshold_proteinsVStarget(file1)

Python E-Value • 2.9k views
1
Entering edit mode
4.4 years ago
st.ph.n ★ 2.6k

You don't specify what the threshold is. You don't have to specify the prefixes of the blast output. Look into glob, to read all the files in the directory, that end with .out, similarly if you were to do ls *.out . The code below will get you to a sorted list of e-values, from all files.

#!/usr/bin/env python
import glob
for file in glob.glob('*.out'):
with open(file, 'r') as f:
for line in f:
if float(line.strip().split('\t')[10]) < *X*:
print line.strip().split('\t')[0], '\t', line.strip().split('\t')[1], '\t', line.strip().split('\t')[10])

0
Entering edit mode

This is very useful.... I was trying to work with the glob module, but I initially thought it would be best to combine ALL my files into one text file, then sort, then return the e-value...but that was not working properly.

import glob
list_files = []
for file in glob.glob('/Users/sueparks/BlastP_Results/*'):

myfile = open(file,'r')  #open each file

with open('merge_BLASTP_results.out', 'a') as f:
for line in lines:
line.strip('\n')
f.write(line)

1
Entering edit mode

Using python to merge your files is unnecessary. Just cat all your files into one, then run your code on it, then. cat *.out > allblast.out

Then you can use my above code, without the for loop with glob, and just open and read the file, append each evalue to a list, and sort it.

0
Entering edit mode

Thanks st.ph.n! My last question, if I write another function to check e value, how can I append the first two columns with the e value.

My last step:

if eval is less than threshold: # append fields [0], [1],[10] print NP_387917.1 ADZ06570.1 7e-77

0
Entering edit mode

See my updated answer. Replace X with the value you want. This will check the evalue as it loops through each line, instead of appending to list. If you need a list, use a tuple or dictionary:

evalues = []
if line.strip().split('\t')[10] < *X*:
evalues.append((line.strip().split('\t')[0], line.strip().split('\t')[1], line.strip().split('\t')[10]))

for e in evalues:
print e[0], '\t', e[1], '\t', e[2]

0
Entering edit mode

The updated answer seemed to do the trick! How could I transform the updated code into a function?

0
Entering edit mode

if all you're going to do is return a list from the function, you don't need to use a function. Just append everything to a list as int he comment. If you do need function, it should be trivial for you to do so. If this answer solves your problem, please accept the answer.

0
Entering edit mode
4.4 years ago

You are correct, I did not upload the entire code. I have a separate function for the threshold (the median calculation).

import subprocess, sys, math

def median(a):
a.sort()
if len(a)%2!=0:
median=a[len(a)/2]
else:
median=(float(a[len(a)/2]+a[(len(a)/2)-1]))/2
return median

# Define a function that reads in .out files to a list
def threshold_proteinsVStarget(protein_Files):

# List of Lactobacillus databases created from BLAST commands
Lactobacillus_DB = [ 'L_GRL1112', 'L_214','L_CTV-05','L_JV-V01','L_ST1','L_MV-1A', 'L_202-4','L_224-1','L_JV-V03', 'L_MV-22','L_DSM_13335', 'LactinV_03V1b', 'SPIN_1401G', 'UPII_143_D', 'L_1153','L_269-3', 'L_JV-V16','L_49540']

list_e_values = []
for prot in protein_Files:
#print prot
for db in Lactobacillus_DB:
line = open (prot + db + '.out').readline()
print line
line2 = line.strip('\n')

fields = line2.split('\t')
#print fields

e_val = float(fields[10])
list_e_values.append(e_val)
return list_e_values

print median (list_e_values)