How to loop through a file to get a value for the coverage in each line of the samtools pileup file?
5.5 years ago
M.O.L.S ▴ 100

Hi,

I have a question.

I am writing a program in Python, using a pileup file generated by samtools. I have loaded the file into Python and I want to loop through it to find the value in the 4th column of each line.

I have managed to do this for only one line so far using the code below...

How can I change the code so that it loops through the entire file and finds the values in the fourth column that are less than or equal to 10?

If the number in the fourth column is equal to or less than 10, then I want to print out this line.

This is my code so far:

# A different way to open the mpileup file.         
f = open("/Users/m.o.l.s/outputFile.mpileup","rt")

line_1 = f.readline()
print(line_1)

# Split up the strings based on tabs
individuals = line_1.split("\t")
print(individuals)

# The coverage value is at index 3 (the 4th column)
The_Coverage = individuals[3]
print(The_Coverage)

# If this number is less than or equal to 10, the row needs to be displayed.
# Convert to int first: comparing strings ("9" <= "10") compares them
# character by character, not numerically.
if int(The_Coverage) <= 10:
    print(The_Coverage)
else:
    print("The coverage is not less than or equal to 10")

... so this code works, but only for one line, and I have about 200 lines in the file.

Should I turn the file into a pandas data frame? I would appreciate any help possible to further this along. Best.

sequencing

Hello M.O.L.S,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

5.5 years ago
Joe 21k

You only get one line because that's all readline() does (you may have been looking for readlines(), though iterating over the file handle directly is usually better).

To iterate a file you want to do something like:

with open("myfile.pileup", "r") as handle:
    for line in handle:
        # int() is needed: split() gives strings, and "9" <= 10 raises a TypeError
        if int(line.split("\t")[3]) <= 10:
            print(line, end="")

I agree with your suggestion of using a pandas DataFrame instead, though (especially if the file is large)*

*assuming you have enough memory
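For the pandas route, here's a minimal sketch. The column names are my own labels (a standard single-sample mpileup has six columns: chrom, pos, ref base, coverage, read bases, base qualities), and I'm using an in-memory sample in place of your real file, which you'd read with pd.read_csv("outputFile.mpileup", ...) instead:

```python
import csv
import io
import pandas as pd

# A tiny stand-in for the real mpileup file
# (in practice: pd.read_csv("outputFile.mpileup", sep="\t", ...)).
sample = io.StringIO(
    "chr1\t100\tA\t12\t............\tIIIIIIIIIIII\n"
    "chr1\t101\tC\t8\t........\tIIIIIIII\n"
    "chr1\t102\tG\t10\t..........\tIIIIIIIIII\n"
)

# header=None because pileup files have no header row;
# QUOTE_NONE because base/quality strings can contain quote characters.
df = pd.read_csv(sample, sep="\t", header=None, quoting=csv.QUOTE_NONE,
                 names=["chrom", "pos", "ref", "coverage", "bases", "quals"])

# Keep only the rows where coverage is <= 10
low_cov = df[df["coverage"] <= 10]
print(low_cov)
```

This selects the two rows with coverage 8 and 10; pandas parses the coverage column as integers for you, so no int() conversion is needed.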

