Trying To Understand A Simple Perl Vs Python Vs Sed Benchmark
10.6 years ago
Travis ★ 2.8k

Hi,

I ran a simple routine (read a file, replace spaces with forward slashes) in Perl, Python and sed to get an idea of which is fastest.

EDIT: Relevance to bioinformatics is that I want to write a quick file parser that will process paired-end read files from an Illumina MiSeq and replace the space between read names and read numbers with a forward slash for downstream compatibility with seqtk, i.e. @readA 1 and @readA 2 would become @readA/1 and @readA/2.
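The renaming described in the edit can be sketched in Python as follows. This is only an illustration under stated assumptions: the input is taken to be strict four-line FASTQ records (so headers sit at line indices 0, 4, 8, ...), the function name `slash_read_names` is hypothetical, and the sample reads are made up.

```python
import io

def slash_read_names(lines):
    """Yield FASTQ lines with the first space in each header replaced by '/'.

    Assumes strict four-line records, so only every fourth line is a header.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            # Replace only the first space, like sed 's/ /\//' without the g flag.
            line = line.replace(" ", "/", 1)
        yield line

# Hypothetical two-record input, made up for illustration.
# The second quality line starts with '@' to show why headers are
# located by position rather than by their first character.
sample = io.StringIO(
    "@readA 1\nACGT\n+\nIIII\n"
    "@readB 2\nTTGA\n+\n@III\n"
)
result = "".join(slash_read_names(sample))
print(result, end="")
```

Locating headers by position rather than by a leading '@' is a deliberate choice here, since '@' is also a valid quality character.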

I probably would have expected Python and Perl to perform similarly but Python performed much worse and I am trying to understand why.

The sed version was simply sed -e 's/ /\//' in.txt > out.txt and took approximately 3 seconds for 3 million lines.

The perl version was perl -e 'open(INFILE, "./in.txt") or die; while(<INFILE>) { $_=~s/ /\//;print$_;}' > out.txt and took approximately 5 seconds for the same number of lines.

The Python version was

python3.2 -c 'import re
for line in open("./in.txt"):
    print(re.sub(r" ","/",line), end="")' > out.txt

and took 67 seconds to process the same file.

I am new to Python but would have assumed the Python approach was pretty efficient - can anyone suggest what the issue is?

Perl is better for regular expressions than python. Can you try line.replace(" ", "/") instead?

That brought the Python time down to 11 seconds! I had thought about using a native Python substitution command but every tutorial I looked at used the re class!

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ","/"); print(changedText, end="", file=open("out.txt", "w")) although I'm not sure about the python 3.* syntax.
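For reference, a minimal Python 3 sketch of this read-everything-at-once idea is shown here. The function name `replace_whole_file` and the temporary demo files are hypothetical; the sketch assumes the whole file fits in memory.

```python
import os
import tempfile

def replace_whole_file(src_path, dst_path):
    """Read src_path fully into memory, replace every space with '/', write dst_path."""
    with open(src_path) as src:
        text = src.read()
    with open(dst_path, "w") as dst:
        dst.write(text.replace(" ", "/"))

# Demo on a throwaway file with made-up content.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.txt")
out_path = os.path.join(workdir, "out.txt")
with open(in_path, "w") as handle:
    handle.write("@readA 1\n@readA 2\n")

replace_whole_file(in_path, out_path)

with open(out_path) as handle:
    converted = handle.read()
print(converted, end="")
```

Note that str.replace changes every space, whereas the sed and perl commands in the question (no g flag) change only the first space on each line.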

9 seconds. I think I'll avoid Python in this case!

Perl has a faster regex engine. Even when you optimize both perl and python, perl is usually faster and sometimes a lot. BTW, the better way to use perl is perl -pe 's/ /\//;' in.txt.

Thanks Heng. Are there particular problems that Python performs better on?

That is a different question and should not be discussed in the comments, look at my question: http://www.biostars.org/post/show/13972/when-to-choose-python-over-perl-and-vice-versa/#13984 and if that is not enough info, ask a new question.

Python is faster than perl on most other tasks according to a couple of benchmarks, though only by a thin margin overall.

A suggestion: do not use just this to make a choice. Sure, there are plenty of good reasons to choose Perl (or whatever language), but a difference that is within the same order of magnitude is probably not one of them.

That depends on how you feel about time. To me, one week is unbearable if there are not-so-complex solutions to finish the task in one day.

If something is taking a week in python that takes a day in perl, there is probably something wrong with the python script.

In the reply to tiagoantao, I just intended to say that a factor of 5 in speed is a huge difference. I am not arguing Perl vs. Python at all.

For speed with python, this might speed up things (if you could test and feedback, it would be cool)

lines = open("./in.txt").readlines() #Note: readlineS not readline
print lines.replace(" ", "/"),

lines is a list which doesn't have a replace method

Sorry

print "".join(map(lambda x:x.replace(" ", "/"),lines)),

This does a single read IO. One wonders if most of the time is spent there. Trading memory for time...

I'm running on a virtual machine with limited memory so this particular approach ground the machine to a halt!

This is a basic programming question more suited to stackexchange.com. Please state the relevance to a bioinformatics research problem.

Question edited to state relevance

Not that it matters, but you can simplify your perl one liner: perl -e 'while(<>) { $_=~s/ /\//;print$_;}' in.text > out.text. The <> operator will read either stdin or the file named in $ARGV[0]. In fact you can reduce the loop code to { s/ /\//; print; }, since both the regex and the print statement will use the $_ variable if you don't give them an explicit argument.

More importantly, if this tool is going to be doing anything more complicated than a regex replace, you should do it in the language you're comfortable with so that you can debug, extend, etc more easily, especially since you're now down to 5 vs 11 seconds.

10.6 years ago
Niek De Klein ★ 2.6k

So you can accept the answer (if it was satisfactory enough):

Perl is better for regular expressions than python. line.replace(" ", "/") is faster than re.sub.

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ","/"); print(changedText, end="", file=open("out.txt", "w")) although I'm not sure about the python 3.* syntax.

You can slightly improve the python script's performance by compiling the re object before the loop.

import re
space_re = re.compile(" ")
for line in open("./in.txt"):
    print(space_re.sub("/",line), end="")
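The three variants mentioned in this answer (re.sub, a precompiled pattern, and str.replace) can be compared from within Python using the standard timeit module. The sketch below is only a rough micro-benchmark on a single made-up line; the names `candidates`, `outputs` and `timings` are illustrative, and absolute numbers will vary by machine and Python version.

```python
import re
import timeit

line = "@readA 1\n"              # made-up sample line
space_re = re.compile(" ")       # pattern compiled once, outside the timed call

candidates = {
    "re.sub": lambda: re.sub(" ", "/", line),
    "compiled": lambda: space_re.sub("/", line),
    "str.replace": lambda: line.replace(" ", "/"),
}

# Check that all variants produce the same string before timing them.
outputs = {name: func() for name, func in candidates.items()}

# Time each variant over the same number of calls.
calls = 20000
timings = {
    name: timeit.timeit(func, number=calls)
    for name, func in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:12s} {seconds:.4f} s for {calls} calls")
```

A per-call benchmark like this ignores file I/O, so it says little about end-to-end run time on a 3 million line file.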

10.6 years ago

You can use the replace method of a string to speed things up considerably:

$ python -m timeit -s "import re" "re.sub(' ', '/', 'some string goes here')"
100000 loops, best of 3: 2.94 usec per loop

$ python -m timeit -s "import re; p=re.compile(' ')" "p.sub('/', 'some string goes here')"
1000000 loops, best of 3: 1.66 usec per loop

$ python -m timeit -s "import re" "'some string goes here'.replace(' ', '/')"
1000000 loops, best of 3: 0.33 usec per loop

10.6 years ago
Marvin ▴ 870

tr ' ' /

You're dragging in three enormous infrastructures for a simple problem, then ask why one of them has lower overhead for the simple problem. Moreover, most of the time is probably spent in I/O, not in actual processing. What's the point?

Amusingly (and unexpectedly) your solution is about twice as slow as the best performer so far... goes to show that one can almost never tell where a program spends its time:

ialbert@porthos ~/work/test
$ time tr ' ' / < one.fq > tmp

real    0m6.437s
user    0m4.925s
sys 0m0.311s

ialbert@porthos ~/work/test
$ time sed -e 's/ /\//' one.fq > tmp

real    0m5.808s
user    0m0.911s
sys 0m0.331s

ialbert@porthos ~/work/test
$ time perl -e 'open(INFILE, "./one.fq") or die; while(<INFILE>) { $_=~s/ /\//;print$_;}' > tmp

real    0m3.305s
user    0m1.550s
sys 0m0.315s

ialbert@porthos ~/work/test
$ time python -c 'for line in file("one.fq"): print line.replace(" ", "/")' > tmp

real    0m3.815s
user    0m3.284s
sys 0m0.317s

Here are the numbers on my machine (tr is the fastest):

tr ":" " " -- 1.5s
sed "s,:, ,g" -- 5.7s
perl -pe 'tr/:/ /' -- 4.1s
perl -pe 's/:/ /g' -- 5.7s
python -c 'for line in file("1.fq"): print line.replace(":", " ")' -- 6.0s


You may try to change "LC_ALL", though this has no effect on my side.

I ran my benchmarks on Mac OS X, and even after rerunning them today it seems that on that platform tr runs a lot slower than on Linux. On Linux I can reproduce your observation.

Yes, I was running on Linux. Perhaps Mac comes with a crappy "tr".

You can probably improve the times in all languages by not doing line-wise file reading:

import sys

fh = open('t.fq')
while True:
    print (fh.read(16384).replace(" ", "/") or sys.exit()),
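A Python 3 sketch of the same chunked-read idea might look like the following. The function name `replace_in_chunks` is hypothetical, and the demo uses in-memory streams with a deliberately tiny chunk size so that chunk boundaries fall inside lines. It writes with write() rather than print so that nothing extra is emitted between chunks.

```python
import io

def replace_in_chunks(src, dst, chunk_size=16384):
    """Copy src to dst in fixed-size chunks, replacing each space with '/'."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # An empty string means end of input.
            break
        dst.write(chunk.replace(" ", "/"))

# Made-up input; a 5-character chunk size forces splits in the middle of lines.
source = io.StringIO("@readA 1\nACGT\n" * 3)
sink = io.StringIO()
replace_in_chunks(source, sink, chunk_size=5)
chunked_result = sink.getvalue()
print(chunked_result, end="")
```

This is safe here because the pattern is a single character; a multi-character pattern could straddle a chunk boundary and would need extra handling. Memory use stays bounded by the chunk size, unlike reading the whole file at once.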

Firstly there is more to the problem than I stated in my original question. Secondly, the solution needs to be useable by people with no bioinformatics training.