After compiling Ruby 1.8.6 with '-O3 -mtune=K8 -march=K8' on an AMD
4800+, I decided to run Antonio Cangiano's benchmark suite to see what
performance gain, if any, the new interpreter realized. Needless to say,
I was impressed with the results. The specifics:
control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux] (apt-get
install ruby)
test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux] (source
compiled with '-O3 -mtune=K8 -march=K8')
kernel: 2.6.24-19-server
test-suite: git://github.com/acangiano/ruby-benchmark-suite.git
Notes:
The timeout for each test was left at the default of 30 seconds.
Twenty-four tests exceeded the timeout, so their ratio is unknown. In
two of the tests, bm_regex_dna.rb and bm_hilbert_matrix.rb, the
optimized version of ruby was actually *slower*. The patch levels of
the two interpreters differ, so this is not exactly an apples-to-apples
comparison. Two tests reported a 'stack too deep' error. In the table,
T/O marks a timeout, ? an unknown ratio, and * a slower optimized run.
Test                                          Default  Optim(O3,native)  Optim/Default
/core-features/bm_app_answer.rb:              1.29     0.8               0.62
/core-features/bm_app_factorial.rb:           Error    Error             ?
/core-features/bm_app_factorial2.rb:          Error    Error             ?
/core-features/bm_app_fib.rb:                 T/O      T/O               ?
/core-features/bm_app_raise.rb:               6.63     6.01              0.91
/core-features/bm_app_tak.rb:                 12.4     8.38              0.68
/core-features/bm_app_tarai.rb:               9.91     6.88              0.69
/core-features/bm_loop_times.rb:              7.97     3.69              0.46
/core-features/bm_loop_whileloop.rb:          T/O      10.2              ?
/core-features/bm_loop_whileloop2.rb:         T/O      21.77             ?
/core-features/bm_so_ackermann.rb:            T/O      T/O               ?
/core-features/bm_so_nested_loop.rb:          9.31     5.33              0.57
/core-features/bm_so_object.rb:               11.74    9.26              0.79
/core-features/bm_so_random.rb:               T/O      2.35              -2.35
/core-features/bm_startup.rb:                 0        0                 0.71
/core-features/bm_vm1_block.rb:               T/O      23.67             ?
/core-features/bm_vm1_const.rb:               T/O      T/O               ?
/core-features/bm_vm1_ensure.rb:              27.68    15.82             0.57
/core-features/bm_vm1_length.rb:              22.99    19.91             0.87
/core-features/bm_vm1_rescue.rb:              T/O      12.86             ?
/core-features/bm_vm1_simplereturn.rb:        T/O      18.3              ?
/core-features/bm_vm1_swap.rb:                T/O      T/O               ?
/core-features/bm_vm2_method.rb:              21.01    11.65             0.55
/core-features/bm_vm2_poly_method.rb:         T/O      15.72             ?
/core-features/bm_vm2_poly_method_ov.rb:      5.59     4.88              0.87
/core-features/bm_vm2_proc.rb:                8.92     6.2               0.69
/core-features/bm_vm2_send.rb:                5.69     4.67              0.82
/core-features/bm_vm2_super.rb:               6.97     4.46              0.64
/core-features/bm_vm2_unif1.rb:               5.11     3.65              0.71
/core-features/bm_vm2_zsuper.rb:              7.47     4.93              0.66
/core-library/bm_app_strconcat.rb:            1.44     1.13              0.78
/core-library/bm_pathname.rb:                 T/O      T/O               ?
/core-library/bm_so_array.rb:                 9.1      5.6               0.62
/core-library/bm_so_concatenate.rb:           3.42     1.84              0.54
/core-library/bm_so_count_words.rb:           0.03     0.03              ?
/core-library/bm_so_exception.rb:             7.58     5.28              0.7
/core-library/bm_so_lists.rb:                 T/O      T/O               ?
/core-library/bm_so_matrix.rb:                2.62     1.79              0.68
/core-library/bm_vm2_array.rb:                9.55     6.18              0.65
/core-library/bm_vm2_regexp.rb:               4.72     6.4               *1.35
/core-library/bm_vm3_thread_create_join.rb:   0.08     0.03              0.34
/micro-benchmarks/bm_app_pentomino.rb:        T/O      T/O               ?
/micro-benchmarks/bm_binary_trees.rb:         T/O      T/O               ?
/micro-benchmarks/bm_fannkuch.rb:             T/O      T/O               ?
/micro-benchmarks/bm_fasta.rb:                T/O      T/O               ?
/micro-benchmarks/bm_fractal.rb:              T/O      T/O               ?
/micro-benchmarks/bm_knucleotide.rb:          2.21     1.55              0.70
/micro-benchmarks/bm_lucas_lehmer.rb:         7.32     6.44              0.88
/micro-benchmarks/bm_mandelbrot.rb:           T/O      T/O               ?
/micro-benchmarks/bm_mergesort.rb:            2.91     2.62              0.9
/micro-benchmarks/bm_meteor_contest.rb:       T/O      T/O               ?
/micro-benchmarks/bm_monte_carlo_pi.rb:       24.83    19.52             0.79
/micro-benchmarks/bm_nbody.rb:                T/O      T/O               ?
/micro-benchmarks/bm_nsieve.rb:               24.55    21.47             0.87
/micro-benchmarks/bm_nsieve_bits.rb:          T/O      T/O               ?
/micro-benchmarks/bm_partial_sums.rb:         27.83    25.13             0.9
/micro-benchmarks/bm_quicksort.rb:            10.76    6.06              0.56
/micro-benchmarks/bm_recursive.rb:            T/O      28.06             ?
/micro-benchmarks/bm_regex_dna.rb:            1.54     2.05              *1.33
/micro-benchmarks/bm_reverse_compliment.rb:   T/O      T/O               ?
/micro-benchmarks/bm_so_sieve.rb:             T/O      T/O               ?
/micro-benchmarks/bm_spectral_norm.rb:        T/O      T/O               ?
/micro-benchmarks/bm_sum_file.rb:             20.89    16.44             0.79
/micro-benchmarks/bm_thread_ring.rb:          T/O      T/O               ?
/micro-benchmarks/bm_word_anagrams.rb:        12.24    8.79              0.72
/real-world/bm_hilbert_matrix.rb:             24.74    T/O               *?
/standard-library/bm_app_mandelbrot.rb:       0.81     0.61              0.75
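
The Optim/Default column is the optimized time divided by the default
time, so values below 1 mean the optimized build was faster. A minimal
Ruby sketch of that arithmetic, assuming the cells are read as strings
exactly as printed in the table; parse_time and ratio are hypothetical
helper names, not part of the benchmark suite:

```ruby
# Hypothetical helpers, not part of ruby-benchmark-suite.

# Turn a table cell into a Float, or nil for "T/O", "Error" and the like.
def parse_time(cell)
  Float(cell)
rescue ArgumentError, TypeError
  nil
end

# Optim/Default ratio as a two-decimal string, or "?" when either
# timing is missing or the default time is zero.
def ratio(default_cell, optim_cell)
  default_time = parse_time(default_cell)
  optim_time   = parse_time(optim_cell)
  return "?" if default_time.nil? || optim_time.nil? || default_time.zero?
  format("%.2f", optim_time / default_time)
end

puts ratio("12.4", "8.38")  # bm_app_tak.rb        -> 0.68
puts ratio("1.54", "2.05")  # bm_regex_dna.rb      -> 1.33
puts ratio("T/O", "10.2")   # bm_loop_whileloop.rb -> ?
```

Recomputing from the rounded cells will not always reproduce the table:
bm_startup.rb, for example, shows 0 and 0 yet a ratio of 0.71, which was
presumably computed from unrounded timings.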
on 14.08.2008 19:22
on 15.08.2008 04:48
On Fri, 2008-08-15 at 02:19 +0900, kevin nolan wrote:
> test-suite: git://github.com/acangiano/ruby-benchmark-suite.git
>
> Notes:
>
> The default timeout for any given test was set at the default of 30
> seconds. Twenty-four tests exceeded the timeout therefore the ratio is
> unknown.

I think that can be changed easily.

> In two of the tests: bm_regex_dna.rb and bm_hilbert_matrix.rb the
> optimized version of ruby was actually *slower*.

Thanks for letting me know -- I wrote "bm_hilbert_matrix", so I think
I'll check this out over the weekend with my "oprofile" setup. BTW, I
usually compile "-O3 -march=athlon64" and I have been using gcc 4.3.1
for a couple of months. Do you expect a fundamental difference between
"-march=athlon64" and "-march=k8 -mtune=k8"?

> The patch level of the two interpreters is different so this is not
> exactly apples-to-apples comparison. Two tests which reported a 'stack
> too deep' error.

Try "ulimit -a" and look at the stack size. Then type "ulimit -s <4x>"
where <4x> is four times the number you got from "ulimit -a". This made
those stack errors go away when I ran these.

By the way -- the Ruby Benchmark Suite has its own mailing list --
http://groups.google.com/group/ruby-benchmark-suite to be precise.
--
M. Edward (Ed) Borasky
ruby-perspectives.blogspot.com
"A mathematician is a machine for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős
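
The "four times the current stack size" step can be worked out from
inside Ruby as well. The sketch below only reads the limit and prints
the matching ulimit command; it assumes a Unix-like system where
Process::RLIMIT_STACK is defined and a shell whose "ulimit -s" counts in
1024-byte units. suggested_stack_kb is a hypothetical helper name.

```ruby
# Hypothetical helper: the value to pass to `ulimit -s` (in kB) for a
# stack that is `factor` times the current soft limit, or nil when the
# limit is already unlimited.
def suggested_stack_kb(soft_limit_bytes, factor = 4)
  return nil if soft_limit_bytes == Process::RLIM_INFINITY
  (soft_limit_bytes / 1024) * factor
end

# getrlimit returns the soft and hard limits in bytes.
soft, _hard = Process.getrlimit(Process::RLIMIT_STACK)
kb = suggested_stack_kb(soft)

if kb
  puts "current soft stack limit: #{soft / 1024} kB"
  puts "try: ulimit -s #{kb}"
else
  puts "stack size is already unlimited"
end
```

The ulimit command has to be run in the shell that later starts the
benchmarks, and it cannot raise the soft limit above the hard limit.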
on 15.08.2008 20:43
kevin nolan:
> control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
> (apt-get install ruby)
> test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
> (source compiled with '-O3 -mtune=K8 -march=K8')

Are you sure the differences are not because of the --with-pthreads
flag in Ubuntu’s build? In my case, the difference between Ubuntu’s
Ruby and `configure; make; make install` 1.8.6.p111 was about 40%
(without touching -O, -mtune and -march).
--
Shot
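
One way to check how two interpreters differ is to ask each one how it
was built. A small sketch using the standard rbconfig library; run it
once with each ruby binary and compare the output. build_summary is a
hypothetical helper name, and on older 1.8 releases the same data is
also exposed as Config::CONFIG.

```ruby
require 'rbconfig'

# Hypothetical helper: pick out the build settings most likely to explain
# a speed difference between two interpreters.
def build_summary(config = RbConfig::CONFIG)
  {
    'configure_args' => config['configure_args'],  # options given to ./configure
    'CFLAGS'         => config['CFLAGS'],          # compiler flags used for the build
    'arch'           => config['arch']             # e.g. x86_64-linux
  }
end

build_summary.each { |key, value| puts "#{key}: #{value}" }
```

If the distribution package lists a threading option in configure_args
and the source build does not, the two runs differ in more than the
optimization flags.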
on 16.08.2008 06:08
On Sat, 2008-08-16 at 03:39 +0900, Shot (Piotr Szotkowski) wrote:
> (without touching -O, -mtune and -march).
Last time I looked, the difference between no optimization whatsoever
and "-O3 -march=<your chip here>" was about 30 percent. But yes,
pthreads makes a big difference. And I think pthreads is mandatory for
1.9.
BTW ... 64-bit compiled is slower than 32-bit compiled on a 64-bit chip,
too ... cache sizes, alignments, and such, I suspect, though I haven't
taken the time to profile it.
--
M. Edward (Ed) Borasky
ruby-perspectives.blogspot.com
"A mathematician is a machine for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős
on 17.08.2008 00:23
M. Edward (Ed) Borasky wrote:
>> Are you sure the differences are not because of the --with-pthreads
>> flag in Ubuntu’s build? In my case, the difference between Ubuntu’s
>> Ruby and `configure; make; make install` 1.8.6.p111 was about 40%
>> (without touching -O, -mtune and -march).
>
> Last time I looked, the difference between no optimization whatsoever
> and "-O3 -march=<your chip here>" was about 30 percent. But yes,
> pthreads makes a big difference.

I've seen huge performance improvements from not using
"--enable-pthreads" too. I initially began to use it after the Ruby
build admonished me for trying to link to the Tk library (which on
Solaris is built with threading support) without using --enable-pthreads
for Ruby. But it turned out to be a bad idea for performance, since it
makes the interpreter invoke a *lot* of getcontext calls that pull
performance down by about 50% in some cases.

I'm not sure why --enable-pthreads uses the implementation based on the
*context calls. The Ruby build messages talk about frequent crashes if a
pthreads-based Tcl/Tk is linked into a non-pthreaded Ruby. I thought
that perhaps having extensions that invoked threads would change the
context from beneath the Ruby interpreter and leave it in an
inconsistent state if Ruby didn't store it away first. So I built an
extension that created threads and did some trivial computations (I did
check to make sure that these weren't optimized away by the compiler).
But that didn't cause ruby 1.8.6 to crash (I haven't tried on 1.9).

What advantages are obtained by using --enable-pthreads in Ruby 1.8? I'm
also curious whether anyone has gotten Ruby (built without
--enable-pthreads) to work successfully (without crashes) with Tcl/Tk
libraries that have threading support built in.

thanks,
-ps
on 20.10.2008 02:44
M. Edward (Ed) Borasky:
> Last time I looked, the difference between no optimization
> whatsoever and "-O3 -march=<your chip here>" was about 30 percent.

What benchmarks did you use? In my code’s case, the difference between
empty CFLAGS and CFLAGS='-O3 -march=native' is minimal (Athlon 64 X2).
(gcc’s man page says -march implies the same -mtune, and that ‘native’
is intelligently handled to mean whatever arch is the best in my case.)

> BTW ... 64-bit compiled is slower than 32-bit compiled on a 64-bit
> chip, too ... cache sizes, alignments, and such, I suspect, though
> I haven't taken the time to profile it.

That’s interesting. Can I build 32-bit Ruby and use it inside my x86_64
system? If so, how? (Sorry, I’m a total novice when it comes to this.)
--
Shot
on 20.10.2008 15:13
On 20/10/2008, Shot (Piotr Szotkowski) <shot@hot.pl> wrote:
> M. Edward (Ed) Borasky:
> > Last time I looked, the difference between no optimization
> > whatsoever and "-O3 -march=<your chip here>" was about 30 percent.
>
> What benchmarks did you use? In my code's case, the difference between
> empty CFLAGS and CFLAGS='-O3 -march=native' is minimal (Athlon 64 X2).

It also depends on your chip. Recent AMD chips tend to have a sane
design with respect to the balance of execution units, cache sizes,
decoder, etc. The parts fit well together, so the CPU can handle any
code without much trouble.

On the other hand, Pentium4 chips (before Core2, which are sometimes
also called P4 for some reason) were very poorly designed, with a slow
decoder and an imbalanced number of execution units. The compiler can
reorder instructions so that they reach the execution units faster on
this chip and saturate the CPU better, hence improving performance
considerably.

> That's interesting. Can I build 32-bit Ruby and use it inside my x86_64
> system? If so, how? (Sorry, I'm a total novice when it comes to this.)

You probably do that by passing some parameter to gcc. Obviously you
would need 32bit versions of all the libraries you use in your
extensions. And you would not be able to use as much memory. The 32bit
address space is very limited (normally only 1-2GB on Linux).

Thanks

Michal
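
On x86_64 GCC, the parameter that selects 32-bit code generation is
-m32; how best to feed it through Ruby's build is not covered in this
thread. Whichever way the interpreter gets built, a quick check from
Ruby itself shows whether the result is a 32-bit or a 64-bit binary. A
minimal sketch, with pointer_bits as a hypothetical helper name:

```ruby
# Hypothetical helper: width of a native pointer in bits.
# Array#pack with the 'p' directive emits a raw pointer to the string,
# so the packed result is exactly sizeof(char *) bytes long.
def pointer_bits
  ['x'].pack('p').size * 8
end

puts "this interpreter uses #{pointer_bits}-bit pointers"
```

Running the same one-liner under both builds confirms that the
comparison really is 32-bit against 64-bit.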