Question

Trying To Understand A Simple Perl Vs Python Vs Sed Benchmark

4

13.2 years ago

Travis ★ 2.9k

Hi,

I was running a simple routine (read a file, replace spaces with forward slashes) in Perl, Python and sed to get an idea of which is fastest.

EDIT: Relevance to bioinformatics is that I want to write a quick file parser that will process paired-end read files from an Illumina MiSeq and replace the space between read names and read numbers with a forward slash for downstream compatibility with seqtk, i.e. @readA 1 and @readA 2 would become @readA/1 and @readA/2.
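The conversion described in the EDIT can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name add_read_suffix is hypothetical, the input is assumed to be strict four-line FASTQ records, and only the first space in each header line is replaced.

```python
import io


def add_read_suffix(src, dst):
    """Copy FASTQ records from src to dst, turning '@readA 1' into '@readA/1'."""
    for i, line in enumerate(src):
        # Assumption: strict 4-line records, so headers are lines 0, 4, 8, ...
        if i % 4 == 0:
            # Replace only the first space; the rest of the header is untouched.
            line = line.replace(" ", "/", 1)
        dst.write(line)


# Small in-memory demonstration instead of a real MiSeq file.
demo_in = io.StringIO("@readA 1\nACGT\n+\nIIII\n")
demo_out = io.StringIO()
add_read_suffix(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Restricting the change to header lines keeps sequence and quality lines byte-for-byte identical, which a blanket substitution over every line does not guarantee.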

I would have expected Python and Perl to perform similarly, but Python performed much worse and I am trying to understand why.

The sed version was simply sed -e 's/ /\//' in.txt > out.txt and took approximately 3 seconds for 3 million lines.

The perl version was perl -e 'open(INFILE, "./in.txt") or die; while(<INFILE>) { $_=~s/ /\//;print $_;}' > out.txt and took approximately 5 seconds for the same number of lines.

The Python version was

python3.2 -c 'import re
for line in open("./in.txt"):
    print(re.sub(r" ", "/", line), end="")' > out.txt

and took 67 seconds to process the same file.

I am new to Python but would have assumed the Python approach was pretty efficient - can anyone suggest what the issue is?

perl python • 14k views

ADD COMMENT • link updated 13.2 years ago by Marvin ▴ 900 • written 13.2 years ago by Travis ★ 2.9k

4

Perl is better for regular expressions than python. Can you try line.replace(" ", "/") instead?

ADD REPLY • link 13.2 years ago by Niek De Klein ★ 2.6k

0

That brought the Python time down to 11 seconds! I had thought about using a native Python substitution command but every tutorial I looked at used the re class!
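One caveat when comparing these commands: sed's s/ /\// and Perl's s/ /\// (without the g flag) replace only the first space on each line, while Python's re.sub and str.replace replace every space unless given a count. A small sketch of the difference, using a made-up header line:

```python
import re

# Hypothetical header with more than one space, to make the difference visible.
line = "@readA 1 extra field\n"

# Python replaces every match unless told otherwise.
all_spaces = line.replace(" ", "/")

# A count of 1 mirrors the sed and Perl commands: first match only.
first_only = line.replace(" ", "/", 1)
first_only_re = re.sub(" ", "/", line, count=1)

print(all_spaces, first_only, first_only_re, sep="")
```

For input with exactly one space per line the outputs are identical, so the benchmark results remain comparable in that case.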

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

0

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ","/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the python 3.* syntax.

ADD REPLY • link 13.2 years ago by Niek De Klein ★ 2.6k

0

9 seconds. I think I'll avoid Python in this case!

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

4

Perl has a faster regex engine. Even when you optimize both Perl and Python, Perl is usually faster, sometimes by a lot. BTW, the better way to use Perl is perl -pe 's/ /\//;' in.txt.

ADD REPLY • link 13.2 years ago by lh3 33k

0

Thanks Heng. Are there particular problems that Python performs better on?

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

0

That is a different question and should not be discussed in the comments. Look at my question: http://www.biostars.org/post/show/13972/when-to-choose-python-over-perl-and-vice-versa/#13984 and if that is not enough info, ask a new question.

ADD REPLY • link 13.2 years ago by Niek De Klein ★ 2.6k

0

Python is faster than Perl on most other tasks according to a couple of benchmarks, though only by a thin margin overall.

ADD REPLY • link 13.2 years ago by lh3 33k

2

A suggestion: do not use just this to make a choice. Sure, there are plenty of good reasons to choose Perl (or whatever language), but a difference that is within the same order of magnitude is probably not one of them.

ADD REPLY • link 13.2 years ago by tiagoantao ▴ 690

0

That depends on how you feel about time. To me, one week is unbearable if there are not-so-complex solutions to finish the task in one day.

ADD REPLY • link 13.2 years ago by lh3 33k

0

If something is taking a week in python that takes a day in perl, there is probably something wrong with the python script.

ADD REPLY • link 13.2 years ago by Niek De Klein ★ 2.6k

0

In the reply to tiagoantao, I just intended to say that a factor of 5 in speed is a huge difference. I am not arguing Perl vs. Python at all.

ADD REPLY • link 13.2 years ago by lh3 33k

1

For speed with python, this might speed up things (if you could test and feedback, it would be cool)

lines = open("./in.txt").readlines() #Note: readlineS not readline
print lines.replace(" ", "/"),

ADD REPLY • link 13.2 years ago by tiagoantao ▴ 690

0

lines is a list which doesn't have a replace method

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

0

Sorry

lines = open("./in.txt").readlines()

print "".join(map(lambda x:x.replace(" ", "/"),lines)),

This does a single read IO. One wonders if most of the time is spent there. Trading memory for time...

ADD REPLY • link 13.2 years ago by tiagoantao ▴ 690

0

I'm running on a virtual machine with limited memory, so this particular approach ground the machine to a halt!

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

0

This is a basic programming question more suited to stackexchange.com. Please state the relevance to a bioinformatics research problem.

ADD REPLY • link 13.2 years ago by Neilfws 49k

0

Question edited to state relevance

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k

0

Not that it matters, but you can simplify your Perl one-liner: perl -e 'while(<>) { $_=~s/ /\//;print $_;}' in.txt > out.txt. The <> operator reads from stdin or from the files named in @ARGV. In fact you can reduce the loop code to { s/ /\//; print; }, since both the substitution and the print statement default to the $_ variable if you don't give them an explicit argument.

ADD REPLY • link 13.2 years ago by glocke01 ▴ 190

0

More importantly, if this tool is going to be doing anything more complicated than a regex replace, you should do it in the language you're comfortable with so that you can debug, extend, etc more easily, especially since you're now down to 5 vs 11 seconds.

ADD REPLY • link 13.2 years ago by glocke01 ▴ 190

score 3 · Answer 1 · 2012-05-04

3

13.2 years ago

Niek De Klein ★ 2.6k

So you can accept the answer (if it was satisfactory enough):

Perl is better for regular expressions than python. line.replace(" ", "/") is faster than re.sub.

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ","/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the python 3.* syntax.

ADD COMMENT • link 13.2 years ago by Niek De Klein ★ 2.6k

2

You can slightly improve the Python script's performance by compiling the re object before the loop.

import re
space_re = re.compile(" ")
for line in open("./in.txt"):
    print(space_re.sub("/", line), end="")

ADD REPLY • link 13.2 years ago by Giovanni M Dall'Olio 28k

score 1 · Answer 2 · 2012-05-04

You can use the replace method of a string to speed things up considerably:

$ python -m timeit -s "import re" "re.sub(' ', '/', 'some string goes here')"
100000 loops, best of 3: 2.94 usec per loop

$ python -m timeit -s "import re; p=re.compile(' ')" "p.sub('/', 'some string goes here')"
1000000 loops, best of 3: 1.66 usec per loop

$ python -m timeit -s "import re"  "'some string goes here'.replace(' ', '/')"
1000000 loops, best of 3: 0.33 usec per loop
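The same three-way comparison can also be run from a script with the timeit module. The following is a rough sketch rather than a rigorous benchmark: the loop count NUMBER is an arbitrary choice, and absolute timings will vary by machine and Python version.

```python
import re
import timeit

text = "some string goes here"
pattern = re.compile(" ")  # compiled once, outside the timed code
NUMBER = 100_000  # arbitrary loop count chosen for this sketch

candidates = {
    "re.sub": lambda: re.sub(" ", "/", text),
    "compiled.sub": lambda: pattern.sub("/", text),
    "str.replace": lambda: text.replace(" ", "/"),
}

# All three approaches should produce the same string.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats, as the command-line tool reports.
timings = {
    name: min(timeit.repeat(fn, number=NUMBER, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:>12}: {seconds / NUMBER * 1e6:.2f} usec per loop")
```

The lambda wrappers add a small constant call overhead to every candidate, so the ratios are a little compressed compared with the command-line figures.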

score 1 · Answer 3 · 2012-05-06

1

13.2 years ago

Marvin ▴ 900

tr ' ' /

You're dragging in three enormous infrastructures for a simple problem, then ask why one of them has lower overhead for the simple problem. Moreover, most of the time is probably spent in I/O, not in actual processing. What's the point?

ADD COMMENT • link 13.2 years ago by Marvin ▴ 900

4

Amusingly (and unexpectedly), your solution is about twice as slow as the best performer so far, which goes to show that one can almost never tell where a program spends its time:

ialbert@porthos ~/work/test
$ time tr ' ' / < one.fq > tmp

real    0m6.437s
user    0m4.925s
sys     0m0.311s

ialbert@porthos ~/work/test
$ time sed -e 's/ /\//' one.fq > tmp

real    0m5.808s
user    0m0.911s
sys     0m0.331s

ialbert@porthos ~/work/test
$ time perl -e 'open(INFILE, "./one.fq") or die; while(<INFILE>) { $_=~s/ /\//;print $_;}' > tmp

real    0m3.305s
user    0m1.550s
sys     0m0.315s

ialbert@porthos ~/work/test
$ time python -c 'for line in file("one.fq"): print line.replace(" ", "/")' > tmp

real    0m3.815s
user    0m3.284s
sys     0m0.317s

ADD REPLY • link 13.2 years ago by Istvan Albert 102k

0

Here are the numbers on my machine (tr is the fastest):

tr ":" " " -- 1.5s
sed "s,:, ,g" -- 5.7s
perl -pe 'tr/:/ /' -- 4.1s
perl -pe 's/:/ /g' -- 5.7s
python -c 'for line in file("1.fq"): print line.replace(":", " ")' -- 6.0s

You may try to change "LC_ALL", though this has no effect on my side.

ADD REPLY • link 13.2 years ago by lh3 33k

0

I ran my benchmarks on Mac OS X, and even after rerunning them today it seems that tr runs a lot slower on that platform than on Linux. On Linux I can reproduce your observation.

ADD REPLY • link 13.2 years ago by Istvan Albert 102k

0

Yes, I was running on Linux. Perhaps Mac comes with a crappy "tr".

ADD REPLY • link 13.2 years ago by lh3 33k

0

You can probably improve the times in all languages by not reading the file line by line:

import sys

fh = open('t.fq')
while True:
    print (fh.read(16384).replace(" ", "/") or sys.exit()),
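A Python 3 rendering of the same chunked-read idea might look like the sketch below. The function name replace_in_chunks is hypothetical, and the in-memory streams stand in for real files. Like the snippet above, it replaces every space rather than only the first one on each line.

```python
import io


def replace_in_chunks(src, dst, chunk_size=16384):
    """Stream src to dst in fixed-size chunks, replacing every space with '/'."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty string signals end of file
            break
        # Safe across chunk boundaries only because the pattern is one character.
        dst.write(chunk.replace(" ", "/"))


# Tiny chunk size here just to exercise the chunk boundaries.
src = io.StringIO("@readA 1\nACGT\n" * 3)
dst = io.StringIO()
replace_in_chunks(src, dst, chunk_size=5)
print(dst.getvalue(), end="")
```

Memory use stays bounded by chunk_size, which avoids the problem reported earlier in the thread when the whole file was read at once on a memory-limited machine.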

ADD REPLY • link 13.2 years ago by brentp 24k

0

Firstly there is more to the problem than I stated in my original question. Secondly, the solution needs to be useable by people with no bioinformatics training.

ADD REPLY • link 13.2 years ago by Travis ★ 2.9k