Hi,
I was performing a simple routine (read a file, replace spaces with forward slashes) in Perl, Python and sed to get an idea of which is fastest.
EDIT: Relevance to bioinformatics is that I want to write a quick file parser that will process paired-end read files from an Illumina MiSeq and replace the spaces between read names and read numbers with forward slashes for downstream compatibility with seqtk, i.e. @readA 1 and @readA 2 would become @readA/1 and @readA/2.
I would have expected Python and Perl to perform similarly, but Python performed much worse and I am trying to understand why.
The sed version was simply
sed -e 's/ /\//' in.txt > out.txt
and took approximately 3 seconds for 3 million lines.
The perl version was
perl -e 'open(INFILE, "./in.txt") or die; while(<INFILE>) { $_=~s/ /\//;print $_;}' > out.txt
and took approximately 5 seconds for the same number of lines.
The Python version was
python3.2 -c '
import re
for line in open("./in.txt"):
    print(re.sub(r" ","/",line), end="")' > out.txt
and took 67 seconds to process the same file.
I am new to Python but would have assumed the Python approach was pretty efficient - can anyone suggest what the issue is?
Perl is better for regular expressions than python. Can you try
line.replace(" ", "/")
instead?
That brought the Python time down to 11 seconds! I had thought about using a native Python string method, but every tutorial I looked at used the re module!
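If you want to check the re.sub versus str.replace difference yourself, timeit is a rough way to do it. This is only a sketch: the sample line is made up (it is not the original in.txt), the function names are just for illustration, and the numbers will vary by machine and Python version. It also shows the count argument of str.replace, which replaces only the first space, as the sed and Perl commands above do without the g flag.

```python
import re
import timeit

# Hypothetical sample line; the real input was a MiSeq read file.
LINE = "@readA 1\n"
PATTERN = re.compile(r" ")


def with_re_sub(s):
    # Module-level re.sub: the pattern is looked up in re's cache on every call.
    return re.sub(r" ", "/", s)


def with_compiled(s):
    # Precompiled pattern: skips the per-call cache lookup.
    return PATTERN.sub("/", s)


def with_replace(s):
    # Plain string method, no regex engine involved; replaces every space.
    return s.replace(" ", "/")


def first_space_only(s):
    # count=1 replaces only the first space on the line.
    return s.replace(" ", "/", 1)


if __name__ == "__main__":
    for fn in (with_re_sub, with_compiled, with_replace, first_space_only):
        seconds = timeit.timeit(lambda: fn(LINE), number=100000)
        print("{}: {:.3f}s".format(fn.__name__, seconds))
```

Note that the first three variants replace every space on a line, so they only match the sed output when each line has at most one space.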
You could possibly speed it up more by doing
changedText = open("./in.txt").read().replace(" ","/"); print(changedText, end="", file=open("out.txt", "w"))
although I'm not sure about the python 3.* syntax.
9 seconds. I think I'll avoid Python in this case!
Perl has a faster regex engine. Even when you optimize both perl and python, perl is usually faster and sometimes a lot. BTW, the better way to use perl is
perl -pe 's/ /\//;' in.txt
Thanks Heng. Are there particular problems that Python performs better on?
That is a different question and should not be discussed in the comments, look at my question: http://www.biostars.org/post/show/13972/when-to-choose-python-over-perl-and-vice-versa/#13984 and if that is not enough info, ask a new question.
Python is faster than perl on most other tasks according to a couple of benchmarks, though only by a thin margin overall.
A suggestion: do not use just this to make a choice. Sure, there are plenty of good reasons to choose Perl (or whatever language), but a difference that is within the same order of magnitude is probably not one of them.
That depends on how you feel about time. To me, one week is unbearable if there are not-so-complex solutions to finish the task in one day.
If something is taking a week in python that takes a day in perl, there is probably something wrong with the python script.
In the reply to tiagoantao, I just intended to say that a factor of 5 in speed is a huge difference. I am not arguing Perl vs. Python at all.
For speed with python, this might speed things up (if you could test it and give feedback, that would be cool)
lines is a list which doesn't have a replace method
Sorry
lines = open("./in.txt").readlines()
print("".join(map(lambda x: x.replace(" ", "/"), lines)), end="")
This does a single read IO. One wonders if most of the time is spent there. Trading memory for time...
I'm running on a virtual machine with limited memory so this particular approach ground the machine to a halt!
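On a memory-limited machine, a streaming version may be the safer option, since it only ever holds one line in memory. A minimal sketch, assuming the input and output paths are passed on the command line; slash_lines is just an illustrative name, and I have not timed it against the versions above.

```python
import sys


def slash_lines(lines):
    """Yield each line with its first space replaced by '/'."""
    for line in lines:
        yield line.replace(" ", "/", 1)


# Only runs as a script with two arguments: input path and output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        # writelines consumes the generator lazily, one line at a time.
        dst.writelines(slash_lines(src))
```

Writing through a file object's writelines instead of calling print once per line also avoids some per-call overhead, though how much that matters here is a guess.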
This is a basic programming question more suited to stackexchange.com. Please state the relevance to a bioinformatics research problem.
Question edited to state relevance
Not that it matters, but you can simplify your perl one liner:
perl -e 'while(<>) { $_=~s/ /\//;print $_;}' in.text > out.text
The <> operator will read either stdin or the file named in $ARGV[0]. In fact you can reduce the loop code to { s/ /\//; print; }, since both the regex and the print statement will use the $_ variable if you don't give them an explicit argument.
More importantly, if this tool is going to be doing anything more complicated than a regex replace, you should do it in the language you're comfortable with so that you can debug, extend, etc. more easily, especially since you're now down to 5 vs 11 seconds.
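As an example of "more complicated than a regex replace": for the read-file use case in the question, you may only want to touch the header line of each record. The sketch below assumes strict four-line records (header, sequence, '+', quality) with no wrapped sequences; fix_headers is a hypothetical name, not part of seqtk or any library.

```python
def fix_headers(lines):
    """Yield lines, replacing the first space in each record header with '/'.

    Assumes strict four-line records: header, sequence, '+', quality.
    """
    for index, line in enumerate(lines):
        # Every fourth line, starting from the first, is a header.
        if index % 4 == 0 and line.startswith("@"):
            yield line.replace(" ", "/", 1)
        else:
            # Sequence, '+' and quality lines pass through unchanged.
            yield line
```

Unlike the one-liners above, this leaves a '+' line that repeats the read name untouched, which may or may not be what the downstream tool expects.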