strcoll: improve performance by removing the cache (#15884)
authorLeonhard Holz <leonhard.holz@web.de>
Fri, 17 Oct 2014 10:17:23 +0000 (15:47 +0530)
committerSiddhesh Poyarekar <siddhesh@redhat.com>
Fri, 17 Oct 2014 10:17:23 +0000 (15:47 +0530)
commit0742aef6e52a935f9ccd69594831b56d807feef3
treead38b0391baea7c79db50e1f9bbc31c50e0e5b88
parentee54ce44cb734f18fec4f6ccdfbe997d2574321e
strcoll: improve performance by removing the cache (#15884)

this is a path that should solve bug 15884. It complains about the performance
of strcoll(). It was found out that the runtime of strcoll() is actually bound
to strlen which is needed for calculating the size of a cache that was
installed to improve the comparison performance.

The idea for this patch was that the cache is only useful in rare cases
(strings of same length and same first-level-chars) and that it would be
better to avoid memory allocation at all. To prove this I wrote a performance
test bench-strcoll.c with test data in benchtests-strcoll.tar.gz. Also
modifications in benchtests/Makefile and localedata/Makefile are necessary to
make it work.

After removing the cache the strcoll method showed the predicted behavior
(getting slightly faster) in all but the test case for hindi word sorting.
This was due the hindi text having much more equal words than the other ones.
For equal strings the performance was worse since all comparison levels were
run through and from the second level on the cache improved the comparison
performance of the original version.

Therefore I added a bytewise test via strcmp iff the first level comparison
found that both strings did match because in this case it is very likely that
equal strings are compared. This solved the problem with the hindi test case
and improved the performance of the others.

Performance comparison:

glibc files     -33.77%
vi_VN.UTF-8     -34.12%
en_US.UTF-8     -42.42%
ar_SA.UTF-8     -27.49%
zh_CN.UTF-8     +07.90%
cs_CZ.UTF-8     -29.67%
en_GB.UTF-8     -28.50%
da_DK.UTF-8     -36.57%
pl_PL.UTF-8     -39.31%
fr_FR.UTF-8     -28.57%
pt_PT.UTF-8     -22.82%
el_GR.UTF-8     -26.77%
ru_RU.UTF-8     -35.81%
iw_IL.UTF-8     -35.34%
es_ES.UTF-8     -34.46%
hi_IN.UTF-8     -00.38%
sv_SE.UTF-8     -36.99%
hu_HU.UTF-8     -16.35%
tr_TR.UTF-8     -27.80%
is_IS.UTF-8     -33.24%
it_IT.UTF-8     -24.39%
sr_RS.UTF-8     -37.55%
ja_JP.UTF-8     +02.84%
ChangeLog
NEWS
string/strcoll_l.c