Trying To Understand A Simple Perl Vs Python Vs Sed Benchmark
3
4
10.6 years ago
Travis ★ 2.8k

Hi,

I was performing a simple routine (read a file, replace spaces with forward slashes) in Perl, Python and sed to get an idea of which is fastest.

EDIT: The relevance to bioinformatics is that I want to write a quick file parser that will process paired-end read files from an Illumina MiSeq and replace the space between each read name and its number (with barcode) with a forward slash for downstream compatibility with seqtk, i.e. @readA 1 and @readA 2 would become @readA/1 and @readA/2.
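A minimal sketch of the header rewrite described in the EDIT, assuming strict 4-line FASTQ records so that only every fourth line (the read name line) is touched. The function names fix_header and convert are hypothetical, chosen for illustration only. Only the first space is replaced, matching the @readA 1 to @readA/1 example.

```python
def fix_header(line):
    """Replace the first space in a FASTQ header line with '/'."""
    # count=1: '@readA 1' becomes '@readA/1'; any later spaces are left alone.
    return line.replace(" ", "/", 1)


def convert(lines):
    """Yield lines unchanged except for the first line of each 4-line record."""
    for i, line in enumerate(lines):
        # Assumption: records are exactly 4 lines, so headers sit at 0, 4, 8, ...
        yield fix_header(line) if i % 4 == 0 else line


# Made-up sample record, for illustration only.
sample = ["@readA 1\n", "ACGT\n", "+\n", "IIII\n"]
converted = list(convert(sample))
print("".join(converted), end="")
```

Because convert is a generator, it can be fed an open file handle and written out with writelines without holding the whole file in memory.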

I probably would have expected Python and Perl to perform similarly but Python performed much worse and I am trying to understand why.

The sed version was simply sed -e 's/ /\//' in.txt > out.txt and took approximately 3 seconds for 3 million lines.

The perl version was perl -e 'open(INFILE, "./in.txt") or die; while(<INFILE>) { $_=~s/ /\//;print$_;}' > out.txt and took approximately 5 seconds for the same number of lines.

The Python version was

python3.2 -c 'import re
for line in open("./in.txt"):
    print(re.sub(r" ", "/", line), end="")' > out.txt

and took 67 seconds to process the same file.

I am new to Python but would have assumed the Python approach was pretty efficient - can anyone suggest what the issue is?

perl python • 10k views
4

Perl is better for regular expressions than python. Can you try line.replace(" ", "/") instead?

0

That brought the Python time down to 11 seconds! I had thought about using a native Python substitution command but every tutorial I looked at used the re module!

0

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ", "/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the Python 3.* syntax.

0

9 seconds. I think I'll avoid Python in this case!

4

Perl has a faster regex engine. Even when you optimize both perl and python, perl is usually faster and sometimes a lot. BTW, the better way to use perl is perl -pe 's/ /\//;' in.txt.

0

Thanks Heng. Are there particular problems that Python performs better on?

0

That is a different question and should not be discussed in the comments, look at my question: http://www.biostars.org/post/show/13972/when-to-choose-python-over-perl-and-vice-versa/#13984 and if that is not enough info, ask a new question.

0

Python is faster than Perl on most other tasks according to a couple of benchmarks, though only by a thin margin overall.

2

A suggestion: do not use just this to make a choice. Sure, there are plenty of good reasons to choose Perl (or whatever language), but a difference that is within the same order of magnitude is probably not one of them.

0

That depends on how you feel about time. To me, one week is unbearable if there are not-so-complex solutions to finish the task in one day.

0

If something is taking a week in python that takes a day in perl, there is probably something wrong with the python script.

0

In the reply to tiagoantao, I just intended to say that a factor of 5 in speed is a huge difference. I am not arguing Perl vs. Python at all.

1

For speed with python, this might speed up things (if you could test and feedback, it would be cool)

lines = open("./in.txt").readlines() #Note: readlineS not readline
print lines.replace(" ", "/"),

0

lines is a list which doesn't have a replace method

0

Sorry

print "".join(map(lambda x:x.replace(" ", "/"),lines)),

This does a single read IO. One wonders if most of the time is spent there. Trading memory for time...

0

I'm running on a virtual machine with limited memory so this particular approach ground the machine to a halt!

0

This is a basic programming question more suited to stackexchange.com. Please state the relevance to a bioinformatics research problem.

0

Question edited to state relevance

0

Not that it matters, but you can simplify your Perl one-liner: perl -e 'while(<>) { $_=~s/ /\//;print$_;}' in.text > out.text. The <> operator will read either stdin or the files named on the command line (@ARGV). In fact you can reduce the loop code to { s/ /\//; print; }, since both the substitution and the print statement will use the $_ variable if you don't give them an explicit argument.
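For comparison, Python's standard fileinput module behaves much like Perl's <>: with no arguments, fileinput.input() reads the files named on the command line, or stdin if there are none. A minimal sketch; the names slash_first_space and main are hypothetical.

```python
import fileinput


def slash_first_space(line):
    # Same effect as s/ /\// in Perl: only the first space on the line changes.
    return line.replace(" ", "/", 1)


def main(files=None):
    # files=None makes fileinput fall back to sys.argv[1:], then to stdin,
    # which is the closest Python analogue of Perl's <> operator.
    for line in fileinput.input(files=files):
        print(slash_first_space(line), end="")
```

Calling main() at the bottom of a script would let it be run as either "script.py in.txt > out.txt" or "script.py < in.txt > out.txt".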

0

More importantly, if this tool is going to be doing anything more complicated than a regex replace, you should do it in the language you're comfortable with so that you can debug, extend, etc more easily, especially since you're now down to 5 vs 11 seconds.

3
10.6 years ago
Niek De Klein ★ 2.6k

So you can accept the answer (if it was satisfactory enough):

Perl is better for regular expressions than python. line.replace(" ", "/") is faster than re.sub.

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ", "/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the Python 3.* syntax.

2

You can slightly improve the Python script's performance by compiling the re object before the loop.

import re
space_re = re.compile(" ")
for line in open("./in.txt"):
    print(space_re.sub("/", line), end="")
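To see how much compiling actually helps on a given machine, the three variants mentioned in this thread can be timed with the standard timeit module. This is only a sketch: the sample string and the repeat count of 100000 are arbitrary assumptions, and absolute numbers will differ between machines and Python versions.

```python
import re
import timeit

text = "some string goes here"
space_re = re.compile(" ")  # compiled once, outside the timed calls

candidates = {
    "re.sub": lambda: re.sub(" ", "/", text),
    "compiled re": lambda: space_re.sub("/", text),
    "str.replace": lambda: text.replace(" ", "/"),
}

# Check that all three produce the same output before comparing speed.
results = {name: fn() for name, fn in candidates.items()}

# timeit.timeit accepts a callable and returns total seconds for `number` calls.
timings = {name: timeit.timeit(fn, number=100000) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name:12s} {seconds:.3f} s per 100000 calls")
```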

1
10.6 years ago

You can use the replace method of a string to speed things up considerably:

$ python -m timeit -s "import re" "re.sub(' ', '/', 'some string goes here')"
100000 loops, best of 3: 2.94 usec per loop

$ python -m timeit -s "import re; p=re.compile(' ')" "p.sub('/', 'some string goes here')"
1000000 loops, best of 3: 1.66 usec per loop

$ python -m timeit -s "import re" "'some string goes here'.replace(' ', '/')"
1000000 loops, best of 3: 0.33 usec per loop

1
10.6 years ago
Marvin ▴ 870

tr ' ' /

You're dragging in three enormous infrastructures for a simple problem, then ask why one of them has lower overhead for the simple problem. Moreover, most of the time is probably spent in I/O, not in actual processing. What's the point?

4

Amusingly (and unexpectedly) your solution is about twice as slow as the best performer so far... goes to show that one can almost never tell where a program spends its time:

ialbert@porthos ~/work/test
$ time tr ' ' / < one.fq > tmp

real    0m6.437s
user    0m4.925s
sys     0m0.311s

ialbert@porthos ~/work/test
$ time sed -e 's/ /\//' one.fq > tmp

real    0m5.808s
user    0m0.911s
sys     0m0.331s

ialbert@porthos ~/work/test
$ time perl -e 'open(INFILE, "./one.fq") or die; while(<INFILE>) { $_=~s/ /\//;print$_;}' > tmp

real    0m3.305s
user    0m1.550s
sys     0m0.315s

ialbert@porthos ~/work/test
$ time python -c 'for line in file("one.fq"): print line.replace(" ", "/")' > tmp

real    0m3.815s
user    0m3.284s
sys     0m0.317s

0

Here are the numbers on my machine (tr is the fastest):

tr ":" " " -- 1.5s
sed "s,:, ,g" -- 5.7s
perl -pe 'tr/:/ /' -- 4.1s
perl -pe 's/:/ /g' -- 5.7s
python -c 'for line in file("1.fq"): print line.replace(":", " ")' -- 6.0s


You may try to change "LC_ALL", though this has no effect on my side.

0

I ran my benchmarks on Mac OS X, and even after rerunning them today it seems that on that platform tr runs a lot slower than on Linux. On Linux I can reproduce your observation.

0

Yes, I was running on Linux. Perhaps Mac comes with a crappy "tr".

0

You can probably improve the times in all languages by not doing line-wise file reading:

import sys

fh = open('t.fq')
while True:
    print (fh.read(16384).replace(" ", "/") or sys.exit()),
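The same chunked idea can be sketched for Python 3 roughly as follows. The function name replace_in_chunks is hypothetical, and the 16384 default is simply taken from the snippet above. Using write() instead of print avoids any separator being added between chunks. Note that this only works across chunk boundaries because the pattern is a single character; a multi-character pattern could be split between two chunks.

```python
import io


def replace_in_chunks(src, dst, chunk_size=16384):
    """Copy src to dst in fixed-size chunks, replacing spaces with '/'."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty string signals end of file
            break
        # write() adds nothing between chunks, so boundaries leave no trace.
        dst.write(chunk.replace(" ", "/"))


# In-memory demonstration with a made-up record and a tiny chunk size.
src = io.StringIO("@readA 1\nACGT\n+\nIIII\n")
dst = io.StringIO()
replace_in_chunks(src, dst, chunk_size=5)
print(dst.getvalue(), end="")
```

With real files, src and dst would be handles from open("t.fq") and open("out.fq", "w"); memory use stays bounded by the chunk size.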

0

Firstly there is more to the problem than I stated in my original question. Secondly, the solution needs to be useable by people with no bioinformatics training.