review.tizen.org Git - platform/upstream/pixman.git/log

test: Added more demos and tests to .gitignore file

Uses a wildcard to handle the majority which end in "-test".

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>

test: Add a new benchmarker targeting affine operations

Affine-bench is written by following the example of lowlevel-blt-bench.

Affine-bench differs from lowlevel-blt-bench in the following:
- does not test different sized operations fitting to specific caches,
destination is always 1920x1080
- allows defining the affine transformation parameters
- carefully computes operation extents to hit the COVER_CLIP fast paths

Original version by Ben Avison. Changes by Pekka in v3:
- commit message
- style fixes
- more comments
- refactoring (e.g. bench_info_t)
- help output tweak

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

lowlevel-blt-bench: use a8r8g8b8 for CA solid masks

When doing component alpha with a solid mask, use a mask format that has
all the color channels instead of just a8. As Ben Avison explains it:

"Lowlevel-blt-bench initialises all its images using memset(0xCC) so an
a8 solid image would be converted by _pixman_image_get_solid() to
0xCC000000 whereas an a8r8g8b8 would be 0xCCCCCCCC. When you're not in
component alpha mode, only the alpha byte matters for the mask image,
but in the case of component alpha operations, a fast path might decide
that it can save itself a lot of multiplications if it spots that 3
constant mask components are already 0."

No (default) test so far has a solid mask with CA. This is just
future-proofing lowlevel-blt-bench to do what one would expect.
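
A minimal sketch of the difference described in the quote, assuming a 4-byte
buffer filled with memset(0xCC). The helper names are hypothetical and only
model how the solid color comes out for each mask format; they are not
_pixman_image_get_solid() itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: expand the first a8 pixel of a buffer to a8r8g8b8.
 * An a8 pixel carries alpha only, so the color channels come out as zero. */
static uint32_t
solid_from_a8 (const uint8_t *bits)
{
    return (uint32_t) bits[0] << 24;
}

/* Hypothetical helper: read the first a8r8g8b8 pixel of a buffer. */
static uint32_t
solid_from_a8r8g8b8 (const uint8_t *bits)
{
    uint32_t pixel;

    memcpy (&pixel, bits, sizeof pixel);
    return pixel;
}

/* Fill a buffer the way the quote describes and report both readings. */
static void
solid_mask_demo (uint32_t *as_a8, uint32_t *as_a8r8g8b8)
{
    uint8_t bits[4];

    memset (bits, 0xCC, sizeof bits);
    *as_a8 = solid_from_a8 (bits);
    *as_a8r8g8b8 = solid_from_a8r8g8b8 (bits);
}
```

With the a8 reading, the three color components are zero, which is the case
a component-alpha fast path could special-case.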

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

lowlevel-blt-bench: use the test pattern parser

Let lowlevel-blt-bench parse the test name string from the command line,
allowing it to run almost infinitely more tests. One is no longer limited
to the tests listed in the big table.

While you can use the old short-hand names like src_8888_8888, you can
also use all possible operators now, and specify pixel formats exactly
rather than just x888, for instance.

This even allows running crazy patterns like
conjoint_over_reverse_a8b8g8r8_n_r8g8b8x8.

All individual patterns are now interpreted through the parser. The
pattern "all" runs the same old default test set as before but through
the parser instead of the hard-coded parameters.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

lowlevel-blt-bench: add test name parser and self-test

This patch is inspired by "lowlevel-blt-bench: Parse test name strings in
general case" by Ben Avison. From Ben's commit message:

"There are many types of composite operation that are useful to benchmark
but which are omitted from the table. Continually having to add extra
entries to the table is a nuisance and is prone to human error, so this
patch adds the ability to break down unknown strings of the format
<operation>_<src>[_<mask>]_<dst>[_ca]
where bitmap formats are specified by number of bits of each component
(assumed in ARGB order) or 'n' to indicate a solid source or mask."

Add the parser to lowlevel-blt-bench.c, but do not hook it up to the
command line just yet. Instead, make it run a self-test.
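
The name format quoted above could be split roughly as follows. This is only
a sketch under stated assumptions: test_pattern_t, known_ops and
parse_test_pattern() are hypothetical names, the operator list is abbreviated,
and format tokens are kept as strings rather than mapped to pixman formats.
It is not the parser added by this patch.

```c
#include <string.h>

/* Hypothetical result type; format tokens are kept as plain strings. */
typedef struct
{
    char op[24];
    char src[16];
    char mask[16];        /* empty when the pattern has no mask */
    char dst[16];
    int  component_alpha; /* 1 when the name ends in "_ca" */
} test_pattern_t;

/* Abbreviated, hypothetical operator list. Longer names come first so that
 * "over_reverse" is not read as "over" followed by a format. */
static const char *const known_ops[] =
{
    "over_reverse", "in_reverse", "over", "src", "add", "in"
};

/* Returns 1 on success, 0 when the name does not fit the format. */
static int
parse_test_pattern (const char *name, test_pattern_t *out)
{
    const size_t n_ops = sizeof known_ops / sizeof known_ops[0];
    char rest[64];
    char *tok[3];
    char *p;
    size_t i, len = 0, n_tok = 0;

    memset (out, 0, sizeof *out);

    /* The operator is the first known name followed by an underscore. */
    for (i = 0; i < n_ops; i++)
    {
        len = strlen (known_ops[i]);
        if (strncmp (name, known_ops[i], len) == 0 && name[len] == '_')
            break;
    }
    if (i == n_ops || strlen (name + len + 1) >= sizeof rest)
        return 0;
    strcpy (out->op, known_ops[i]);
    strcpy (rest, name + len + 1);

    /* An optional "_ca" suffix selects component alpha. */
    len = strlen (rest);
    if (len > 3 && strcmp (rest + len - 3, "_ca") == 0)
    {
        out->component_alpha = 1;
        rest[len - 3] = '\0';
    }

    /* What remains is <src>[_<mask>]_<dst>. */
    for (p = strtok (rest, "_"); p != NULL; p = strtok (NULL, "_"))
    {
        if (n_tok == 3 || strlen (p) >= sizeof out->src)
            return 0;
        tok[n_tok++] = p;
    }
    if (n_tok < 2)
        return 0;

    strcpy (out->src, tok[0]);
    if (n_tok == 3)
        strcpy (out->mask, tok[1]);
    strcpy (out->dst, tok[n_tok - 1]);
    return 1;
}
```

For example, "over_n_8888_8888_ca" would split into operator "over", solid
source "n", mask "8888", destination "8888", with component alpha set.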

As we now dynamically parse strings similar to the test names in the
huge table 'tests_tbl', we should make sure we can parse the old
well-known test names and produce exactly the same test parameters. The
self-test goes through this old table and verifies the parsing results.

Unfortunately the old table is not exactly consistent: it contains some
special cases that cannot be produced by the parsing rules. Whether
these special cases are intentional or just an oversight is not always
clear. Anyway, add a small table to reproduce the special cases
verbatim.

If we wanted, we could remove the big old table in a follow-up commit,
but then we would also lose the parser self-test.

The point of this whole exercise is to let lowlevel-blt-bench recognize
novel test patterns in the future, following exactly the conventions
used in the old table.

Ben, from what I see, this parser has one major difference to what you
wrote. For a solid mask, your parser uses a8r8g8b8 format, while mine
uses a8 which comes from the old table.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

test/utils: add format aliases used by lowlevel-blt-bench

Lowlevel-blt-bench uses several pixel format shorthands. Pick them from
the great table in lowlevel-blt-bench.c and add them here so that
format_from_string() can recognize them.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

test/utils: add operator aliases for lowlevel-blt-bench

Lowlevel-blt-bench uses the operator alias "outrev". Add an alias for it
in the operator-name table.

Also add aliases for overrev, inrev and atoprev, so that
lowlevel-blt-bench can later recognize them for new test cases.

The aliases are added such that an operator-to-name lookup will never
return them; it returns the proper names instead.
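
One way such a table can behave is sketched below. The enum, the table layout
and the name strings are hypothetical stand-ins chosen for illustration, not
the definitions in test/utils.c: alias rows are flagged so that the
operator-to-name direction skips them, while the name-to-operator direction
accepts every row.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a few pixman_op_t values. */
typedef enum { OP_OVER_REVERSE, OP_IN_REVERSE, OP_OUT_REVERSE, OP_NONE } op_t;

typedef struct
{
    op_t        op;
    const char *name;
    int         is_alias;
} op_entry_t;

static const op_entry_t op_table[] =
{
    { OP_OVER_REVERSE, "over_reverse", 0 },
    { OP_OVER_REVERSE, "overrev",      1 },
    { OP_IN_REVERSE,   "in_reverse",   0 },
    { OP_IN_REVERSE,   "inrev",        1 },
    { OP_OUT_REVERSE,  "out_reverse",  0 },
    { OP_OUT_REVERSE,  "outrev",       1 },
};

#define N_OPS (sizeof op_table / sizeof op_table[0])

/* Operator to name: alias rows are skipped, so only proper names come back. */
static const char *
name_from_op (op_t op)
{
    size_t i;

    for (i = 0; i < N_OPS; i++)
    {
        if (op_table[i].op == op && !op_table[i].is_alias)
            return op_table[i].name;
    }
    return NULL;
}

/* Name to operator: proper names and aliases are both accepted. */
static op_t
op_from_name (const char *name)
{
    size_t i;

    for (i = 0; i < N_OPS; i++)
    {
        if (strcmp (op_table[i].name, name) == 0)
            return op_table[i].op;
    }
    return OP_NONE;
}
```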

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

test/utils: support format name aliases

Previously there was a flat list of formats, used to iterate over all
formats when looking up a format from name or listing them. This cannot
support name aliases.

To support name aliases (multiple name strings mapping to the same
format), create a format-name mapping table. Functions format_name(),
format_from_string(), and list_formats() should keep on working exactly
like before, except format_from_string() now recognizes the additional
formats that format_name() already supported.

Only the formats from the old format list are added with ENTRY, so
that list_formats() works as before. The whole list is verified against
the authoritative list in pixman.h; entries missing from the old list
are commented out.

The extra formats supported by the old format_name() are added as
ALIASes. A side-effect of that is that now also format_from_string()
recognizes the following new names: x4c4 / c8, x4g4 / g8, c4, g4, g1,
yuy2, yv12, null, solid, pixbuf, rpixbuf, unknown.

Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse short-hand format names from
strings.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

test/utils: support operator name aliases

Previously there was a flat list of operators (pixman_op_t), used to
iterate over all operators when looking up an operator from name or
listing them. This cannot support name aliases.

To support name aliases (multiple name strings mapping to the same
operator), create an operator-name mapping table. Functions
operator_name, operator_from_string, and list_operators should keep on
working exactly like before, except operator_from_string now recognizes
a few aliases too.

Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse operator names from strings.
Lowlevel-blt-bench uses shorthand names instead of the usual names. This
change allows lowlevel-blt-bench.c to use operator_from_string in the
future.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>

test: Move format and operator string functions to utils.[ch]

This permits format_from_string(), list_formats(), list_operators() and
operator_from_string() to be used from tests other than check-formats.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>

pixman.c: Coding style

A few violations of coding style were identified in code copied from here
into affine-bench.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>

armv6: Fix typo in preload macro

Missing "lsl" meant that, in cases with a 32-bit source and/or mask and an
8-bit destination, the code would not assemble.

mmx: Fix _mm_empty problems for over_8888_8888/over_8888_n_8888

Using "--disable-sse2 --disable-ssse3" configure options and
CFLAGS="-m32 -O2 -g" on an x86 system results in pixman "make check"
failures:

    ../test-driver: line 95: 29874 Aborted
    FAIL: affine-test
    ../test-driver: line 95: 29887 Aborted
    FAIL: scaling-test

One _mm_empty () was missing and another one is needed to work around
an old GCC bug https://gcc.gnu.org/PR47759 (GCC may move MMX instructions
around and cause test suite failures).

Reviewed-by: Matt Turner <mattst88@gmail.com>

Fix comment about BILINEAR_INTERPOLATION_BITS to say < 8 rather than <= 8

Since a4c79d695d52c94647b1aff7 the constant
BILINEAR_INTERPOLATION_BITS must be strictly less than 8, so fix the
comment to say this, and also add a COMPILE_TIME_ASSERT in the
bilinear fetcher in pixman-fast-path.c
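
A compile-time check of this kind can be sketched as below. The value 7 and
the STATIC_CHECK macro are assumptions made for illustration; the macro only
imitates the idea behind COMPILE_TIME_ASSERT and is not pixman's definition.

```c
/* Assumed value, for illustration only. */
#define BILINEAR_INTERPOLATION_BITS 7

/* Hypothetical macro: the array size becomes negative, and compilation
 * fails, when the condition is false. */
#define STATIC_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

STATIC_CHECK (bilinear_bits_below_8, BILINEAR_INTERPOLATION_BITS < 8);

/* Number of distinct interpolation weights for the configured bit count. */
static int
bilinear_weight_range (void)
{
    return 1 << BILINEAR_INTERPOLATION_BITS;
}
```

Changing the define to 8 would make the typedef invalid, so the mistake is
caught at build time rather than at run time.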

mmx: Add nearest over_8888_8888

lowlevel-blt-bench -n, over_8888_8888, 15 iterations on Loongson 2f:

           Before          After
          Mean StdDev     Mean StdDev   Change
    L1    15.8   0.02     24.0   0.06   +52.0%
    L2    14.8   0.15     23.3   0.13   +56.9%
    M     10.3   0.01     13.8   0.03   +33.6%
    HT    10.0   0.02     14.5   0.05   +44.7%
    VT     9.7   0.02     13.5   0.04   +39.2%
    R      9.1   0.01     12.2   0.04   +34.4%
    RT     7.1   0.06      8.9   0.09   +25.2%

mmx: Add nearest over_8888_n_8888

lowlevel-blt-bench -n, over_8888_n_8888, 15 iterations on Loongson 2f:

           Before          After
          Mean StdDev     Mean StdDev   Change
    L1     9.7   0.01     19.2   0.02   +98.2%
    L2     9.6   0.11     19.2   0.16   +99.5%
    M      7.3   0.02     12.5   0.01   +72.0%
    HT     6.6   0.01     13.4   0.02  +103.2%
    VT     6.4   0.01     12.6   0.03   +96.1%
    R      6.3   0.01     11.2   0.01   +76.5%
    RT     4.4   0.01      8.1   0.03   +82.6%

MIPS: Fix exported symbols in public API.

test: Rearrange tests in order of increasing runtime

Making short tests run first is convenient to catch obvious bugs
early.

pixman-gradient-walker: Make left_x and right_x 64 bit variables

The variables left_x, and right_x in gradient_walker_reset() are
computed from pos, which is a 64 bit quantity, so to avoid overflows,
these variables must be 64 bit as well.

Similarly, the left_x and right_x that are stored in
pixman_gradient_walker_t need to be 64 bit as well; otherwise,
pixman_gradient_walker_pixel() will call reset too often.

This fixes the radial-invalid test, which was generating 'invalid'
floating point exceptions when the overflows caused color values to be
outside of [0, 255].
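
The effect of storing the bounds in 32 bits can be sketched as follows. The
struct and function names are hypothetical and the range test is a
simplification, not the actual pixman_gradient_walker_t logic: once the
bounds lose their upper bits, a 64-bit position that is really inside the
range no longer compares as inside, so a reset is requested every time.

```c
#include <stdint.h>

/* Hypothetical range records with 64-bit and 32-bit bounds. */
typedef struct { int64_t left_x, right_x; } wide_range_t;
typedef struct { int32_t left_x, right_x; } narrow_range_t;

/* Converting to uint32_t is well defined and keeps the low 32 bits. The
 * values used with this sketch stay within the positive int32_t range. */
static int32_t
low_32_bits (int64_t v)
{
    return (int32_t) (uint32_t) v;
}

/* With 64-bit bounds, a position inside [left, right) needs no reset. */
static int
needs_reset_wide (int64_t left, int64_t right, int64_t pos)
{
    wide_range_t r = { left, right };

    return pos < r.left_x || pos >= r.right_x;
}

/* With 32-bit bounds, the stored values no longer bracket the position. */
static int
needs_reset_narrow (int64_t left, int64_t right, int64_t pos)
{
    narrow_range_t r = { low_32_bits (left), low_32_bits (right) };

    return pos < r.left_x || pos >= r.right_x;
}
```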

test: Add radial-invalid test program

This program demonstrates a bug in gradient walker, where some integer
overflows cause colors outside the range [0, 255] to be generated,
which in turn cause 'invalid' floating point exceptions when those
colors are converted to uint8_t.

The bug was first reported by Owen Taylor on the #cairo IRC channel.

ARMv6: Add fast path for src_x888_0565

Benchmark results, "before" is upstream/master
5f661ee719be25c3aa0eb0d45e0db23a37e76468, and "after" contains this
patch on top.

lowlevel-blt-bench, src_8888_0565, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    25.9   0.20    115.6   0.70    100.00%    +347.1%
L2    14.4   0.23     52.7   3.48    100.00%    +265.0%
M     14.1   0.01     79.8   0.17    100.00%    +465.9%
HT    10.2   0.03     32.9   0.31    100.00%    +221.2%
VT     9.8   0.03     29.8   0.25    100.00%    +203.4%
R      9.4   0.03     27.8   0.18    100.00%    +194.7%
RT     4.6   0.04     10.9   0.29    100.00%    +135.9%

At most 19 outliers rejected per test per set.

cairo-perf-trace results with trimmed traces showed no significant change.

A system-wide perf_3.10 profile on Raspbian shows significant
differences in the X server CPU usage. The following were measured from
a 130x62 char lxterminal running 'dmesg' every 0.5 seconds for roughly
30 seconds. These profiles are libpixman.so symbols only.

Before:

Samples: 63K of event 'cpu-clock', Event count (approx.): 2941348112, DSO: libpixman-1.so.0.33.1
37.77%  Xorg  [.] fast_fetch_r5g6b5
14.39%  Xorg  [.] pixman_composite_over_n_8_8888_asm_armv6
  8.51%  Xorg  [.] fast_write_back_r5g6b5
  7.38%  Xorg  [.] pixman_composite_src_8888_8888_asm_armv6
  4.39%  Xorg  [.] pixman_composite_add_8_8_asm_armv6
  3.69%  Xorg  [.] pixman_composite_src_n_8888_asm_armv6
  2.53%  Xorg  [.] _pixman_image_validate
  2.35%  Xorg  [.] pixman_image_composite32

After:

Samples: 31K of event 'cpu-clock', Event count (approx.): 3619782704, DSO: libpixman-1.so.0.33.1
22.36%  Xorg  [.] pixman_composite_over_n_8_8888_asm_armv6
13.59%  Xorg  [.] pixman_composite_src_x888_0565_asm_armv6
12.75%  Xorg  [.] pixman_composite_src_8888_8888_asm_armv6
  6.79%  Xorg  [.] pixman_composite_add_8_8_asm_armv6
  5.95%  Xorg  [.] pixman_composite_src_n_8888_asm_armv6
  4.12%  Xorg  [.] pixman_image_composite32
  3.69%  Xorg  [.] _pixman_image_validate
  3.65%  Xorg  [.] _pixman_bits_image_setup_accessors

Before, fast_fetch_r5g6b5 + fast_write_back_r5g6b5 took 46% of the
samples in libpixman, and probably incurred some memcpy() load, too.
After, pixman_composite_src_x888_0565_asm_armv6 takes 14%. Note that
the sample counts are very different before/after, as less time is spent
in Pixman and running time is not exactly the same.

Furthermore, in the above test, the CPU idle function was sampled 9%
before, and 15% after.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Re-benchmarked on Raspberry Pi, commit message.

ARM: use pixman_asm_function in internal headers

The two ARM headers contained open-coded copies of pixman_asm_function;
replace these.

Since it seems customary that ARM headers do not use CPP include guards,
rely on the .S files to #include "pixman-arm-asm.h" first. They all
do now.

v2: Fix a build failure on rpi by adding one #include.

ARMv6: Add fast path for in_reverse_8888_8888

Benchmark results, "before" is the patch
* upstream/master 4b76bbfda670f9ede67d0449f3640605e1fc4df0
+ ARMv6: Support for very variable-hungry composite operations
+ ARMv6: Add fast path for over_n_8888_8888_ca
and "after" contains the additional patches on top:
+ ARMv6: Add fast path flag to force no preload of destination buffer
+ ARMv6: Add fast path for in_reverse_8888_8888 (this patch)

lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    21.1   0.07     32.3   0.08    100.00%     +52.9%
L2    11.6   0.29     18.0   0.52    100.00%     +54.4%
M     10.5   0.01     16.1   0.03    100.00%     +54.1%
HT     8.2   0.02     12.0   0.04    100.00%     +45.9%
VT     8.1   0.02     11.7   0.04    100.00%     +44.5%
R      8.1   0.02     11.3   0.04    100.00%     +39.7%
RT     4.8   0.04      6.1   0.09    100.00%     +27.3%

At most 12 outliers rejected per test per set.

cairo-perf-trace with trimmed traces, 30 iterations:

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-firefox-paintball.trace          18.0   0.01     14.1   0.01    100.00%     +27.4%
t-firefox-chalkboard.trace         36.7   0.03     36.0   0.02    100.00%      +1.9%
t-firefox-canvas-alpha.trace       20.7   0.22     20.3   0.22    100.00%      +1.9%
t-swfdec-youtube.trace              7.8   0.03      7.8   0.03    100.00%      +0.9%
t-firefox-talos-gfx.trace          25.8   0.44     25.6   0.29     93.87%      +0.7%  (insignificant)
t-firefox-talos-svg.trace          20.6   0.04     20.6   0.03    100.00%      +0.2%
t-firefox-fishbowl.trace           21.2   0.04     21.1   0.02    100.00%      +0.2%
t-xfce4-terminal-a1.trace           4.8   0.01      4.8   0.01     98.85%      +0.2%  (insignificant)
t-swfdec-giant-steps.trace         14.9   0.03     14.9   0.02     99.99%      +0.2%
t-poppler-reseau.trace             22.4   0.11     22.4   0.08     86.52%      +0.2%  (insignificant)
t-gnome-system-monitor.trace       17.3   0.03     17.2   0.03     99.74%      +0.2%
t-firefox-scrolling.trace          24.8   0.12     24.8   0.11     70.15%      +0.1%  (insignificant)
t-firefox-particles.trace          27.5   0.18     27.5   0.21     48.33%      +0.1%  (insignificant)
t-grads-heat-map.trace              4.4   0.04      4.4   0.04     16.61%      +0.0%  (insignificant)
t-firefox-fishtank.trace           13.2   0.01     13.2   0.01      7.64%      +0.0%  (insignificant)
t-firefox-canvas.trace             18.0   0.05     18.0   0.05      1.31%      -0.0%  (insignificant)
t-midori-zoomed.trace               8.0   0.01      8.0   0.01     78.22%      -0.0%  (insignificant)
t-firefox-planet-gnome.trace       10.9   0.02     10.9   0.02     64.81%      -0.0%  (insignificant)
t-gvim.trace                       33.2   0.21     33.2   0.18     38.61%      -0.1%  (insignificant)
t-firefox-canvas-swscroll.trace    32.2   0.09     32.2   0.11     73.17%      -0.1%  (insignificant)
t-firefox-asteroids.trace          11.1   0.01     11.1   0.01    100.00%      -0.2%
t-evolution.trace                  13.0   0.05     13.0   0.05     91.99%      -0.2%  (insignificant)
t-gnome-terminal-vim.trace         19.9   0.14     20.0   0.14     97.38%      -0.4%  (insignificant)
t-poppler.trace                     9.8   0.06      9.8   0.04     99.91%      -0.5%
t-chromium-tabs.trace               4.9   0.02      4.9   0.02    100.00%      -0.6%

At most 6 outliers rejected per test per set.

Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
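
As a sketch of that computation (the function name is hypothetical): the rate
is the inverse of the running time, so the relative change in rate reduces to
time_before / time_after - 1.

```c
/* Percentage change in operations per second, given two running times.
 * rate = 1 / time, so rate_after / rate_before = time_before / time_after. */
static double
ops_per_second_change (double time_before, double time_after)
{
    return (time_before / time_after - 1.0) * 100.0;
}
```

Applied to the rounded means of the first trace row (18.0 and 14.1), this
gives roughly +27.7%, close to the listed +27.4%. The gap is presumably due
to the table being computed from unrounded means.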

Confidence is based on Welch's t-test. Absolute changes of less than 1%
can be attributed to measurement error, even if statistically
significant.

There was a question of why FLAG_NO_PRELOAD_DST is used. It makes
lowlevel-blt-bench results worse except for L1, but improves some
Cairo trace benchmarks.

"Ben Avison" <bavison@riscosopen.org> wrote:

> The thing with the lowlevel-blt-bench benchmarks for the more
> sophisticated composite types (as a general rule, anything that involves
> branches at the per-pixel level) is that they are only profiling the case
> where you have mid-level alpha values in the source/mask/destination.
> Real-world images typically have a disproportionate number of fully
> opaque and fully transparent pixels, which is why when there's a
> discrepancy between which implementation performs best with cairo-perf
> trace versus lowlevel-blt-bench, I usually favour the Cairo winner.
>
> The results of removing FLAG_NO_PRELOAD_DST (in other words, adding
> preload of the destination buffer) are easy to explain in the
> lowlevel-blt-bench results. In the L1 case, the destination buffer is
> already in the L1 cache, so adding the preloads is simply adding extra
> instruction cycles that have no effect on memory operations. The "in"
> compositing operator depends upon the alpha of both source and
> destination, so if you use uniform mid-alpha, then you actually do need
> to read your destination pixels, so you benefit from preloading them. But
> for fully opaque or fully transparent source pixels, you don't need to
> read the corresponding destination pixel - it'll either be left alone or
> overwritten. Since the ARM11 doesn't use write-allocate cacheing, both of
> these cases avoid both the time taken to load the extra cachelines, as
> well as increasing the efficiency of the cache for other data. If you
> examine the source images being used by the Cairo test, you'll probably
> find they mostly use transparent or opaque pixels.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, commit message.

v5, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi due to a fix to
"ARMv6: Add fast path for over_n_8888_8888_ca" patch.

ARMv6: Add fast path flag to force no preload of destination buffer

ARMv6: Add fast path for over_n_8888_8888_ca

Benchmark results, "before" is
* upstream/master 4b76bbfda670f9ede67d0449f3640605e1fc4df0
"after" contains the additional patches on top:
+ ARMv6: Support for very variable-hungry composite operations
+ ARMv6: Add fast path for over_n_8888_8888_ca (this patch)

lowlevel-blt-bench, over_n_8888_8888_ca, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1     2.7   0.00     16.1   0.06    100.00%    +500.7%
L2     2.4   0.01     14.1   0.15    100.00%    +489.9%
M      2.3   0.00     14.3   0.01    100.00%    +510.2%
HT     2.2   0.00      9.7   0.03    100.00%    +345.0%
VT     2.2   0.00      9.4   0.02    100.00%    +333.4%
R      2.2   0.01      9.5   0.03    100.00%    +331.6%
RT     1.9   0.01      5.5   0.07    100.00%    +192.7%

At most 1 outlier rejected per test per set.

cairo-perf-trace with trimmed traces, 30 iterations:

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-firefox-talos-gfx.trace          33.1   0.42     25.8   0.44    100.00%     +28.6%
t-firefox-scrolling.trace          31.4   0.11     24.8   0.12    100.00%     +26.3%
t-gnome-terminal-vim.trace         22.4   0.10     19.9   0.14    100.00%     +12.5%
t-evolution.trace                  13.9   0.07     13.0   0.05    100.00%      +6.5%
t-firefox-planet-gnome.trace       11.6   0.02     10.9   0.02    100.00%      +6.5%
t-gvim.trace                       34.0   0.21     33.2   0.21    100.00%      +2.4%
t-chromium-tabs.trace               4.9   0.02      4.9   0.02    100.00%      +1.0%
t-poppler.trace                     9.8   0.05      9.8   0.06    100.00%      +0.7%
t-firefox-canvas-swscroll.trace    32.3   0.10     32.2   0.09    100.00%      +0.4%
t-firefox-paintball.trace          18.1   0.01     18.0   0.01    100.00%      +0.3%
t-poppler-reseau.trace             22.5   0.09     22.4   0.11     99.29%      +0.3%
t-firefox-canvas.trace             18.1   0.06     18.0   0.05     99.29%      +0.2%
t-xfce4-terminal-a1.trace           4.8   0.01      4.8   0.01     99.77%      +0.2%
t-firefox-fishbowl.trace           21.2   0.03     21.2   0.04    100.00%      +0.2%
t-gnome-system-monitor.trace       17.3   0.03     17.3   0.03     99.54%      +0.1%
t-firefox-asteroids.trace          11.1   0.01     11.1   0.01    100.00%      +0.1%
t-midori-zoomed.trace               8.0   0.01      8.0   0.01     99.98%      +0.1%
t-grads-heat-map.trace              4.4   0.04      4.4   0.04     34.08%      +0.1%  (insignificant)
t-firefox-talos-svg.trace          20.6   0.03     20.6   0.04     54.06%      +0.0%  (insignificant)
t-firefox-fishtank.trace           13.2   0.01     13.2   0.01     52.81%      -0.0%  (insignificant)
t-swfdec-giant-steps.trace         14.9   0.02     14.9   0.03     85.50%      -0.1%  (insignificant)
t-firefox-chalkboard.trace         36.6   0.02     36.7   0.03    100.00%      -0.2%
t-firefox-canvas-alpha.trace       20.7   0.32     20.7   0.22     55.76%      -0.3%  (insignificant)
t-swfdec-youtube.trace              7.8   0.02      7.8   0.03    100.00%      -0.5%
t-firefox-particles.trace          27.4   0.16     27.5   0.18     99.94%      -0.6%

At most 4 outliers rejected per test per set.

Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).

Confidence is based on Welch's t-test. Absolute changes of less than 1%
can be attributed to measurement error, even if statistically
significant.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Use pixman_asm_function instead of startfunc.
Rebased. Re-benchmarked on Raspberry Pi.
Commit message.

v5, Ben Avison <bavison@riscosopen.org> :
Fixed the bug exposed in blitters-test 4928372.
15 hours of testing, compared to the 45 minutes to hit
the bug originally.
    Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Squash the fix, re-benchmark on Raspberry Pi.

ARMv6: Support for very variable-hungry composite operations

Previously, the variable ARGS_STACK_OFFSET was available to extract values
from function arguments during the init macro. Now this changes dynamically
around stack operations in the function as a whole so that arguments can be
accessed at any point. It is also joined by LOCALS_STACK_OFFSET, which
allows access to space reserved on the stack during the init macro.

On top of this, composite macros now have the option of using all of WK0-WK3
registers rather than just the subset it was told to use; this requires the
pixel count to be spilled to the stack over the leading pixels at the start
of each line. Thus, at best, each composite operation can use 11 registers,
plus any pointer registers not required for the composite type, plus as much
stack space as it needs, divided up into constants and variables as necessary.

create_bits(): Cast the result of height * stride to size_t

In create_bits() both height and stride are ints, so the result is
also an int, which will overflow if height or stride are big enough
and size_t is bigger than int.

This patch simply casts height to size_t to prevent these overflows,
which prevents the crash in:

https://bugzilla.redhat.com/show_bug.cgi?id=972647

It's not even close to fixing the full problem of supporting big
images in pixman.
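
A minimal sketch of the cast, assuming non-negative height and stride. The
function name is hypothetical and this is not the create_bits() code itself.

```c
#include <stddef.h>

/* Casting one operand first makes the multiplication happen in size_t.
 * Without the cast, height * stride is evaluated in int and can overflow
 * before the result is widened. */
static size_t
buffer_size (int height, int stride)
{
    return (size_t) height * stride;
}
```

On a platform where size_t is 64 bits wide, buffer_size (65536, 65536) yields
4294967296, a value that does not fit in a 32-bit int.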

See also

https://bugs.freedesktop.org/show_bug.cgi?id=69014

ARM: share pixman_asm_function definition

Several files define the asm macro pixman_asm_function identically.
Merge all these definitions into a new asm header.

The original definition is taken from pixman-arm-simd-asm-scaled.S with
the copyright/licence/author blurb verbatim.

ARMv6: Add fast path for over_reverse_n_8888

Benchmark results, "before" is upstream commit
c343846 lowlevel-blt-bench: add in_reverse_8888_8888 test
and "after" is with this patch only added on top.

lowlevel-blt-bench, over_reverse_n_8888, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    15.1    0.1    274.5    2.3    100.00%   +1718.9%
L2    12.8    0.3    181.8    0.7    100.00%   +1315.5%
M     10.8    0.0     77.9    0.0    100.00%    +621.2%
HT     9.7    0.0     29.4    0.2    100.00%    +204.9%
VT     9.5    0.0     26.7    0.1    100.00%    +179.3%
R      9.3    0.0     25.3    0.1    100.00%    +173.6%
RT     6.0    0.1     11.0    0.2    100.00%     +82.9%

At most 16 outliers rejected per case per set.

cairo-perf-trace with trimmed traces, 30 iterations:

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-poppler.trace                    12.9    0.1      9.7    0.0    100.00%     +32.6%
t-firefox-talos-gfx.trace          33.2    0.7     32.9    0.4     95.23%      +0.9%  (insignificant)
t-firefox-particles.trace          27.4    0.1     27.3    0.2     99.65%      +0.4%
t-firefox-canvas-alpha.trace       20.5    0.3     20.5    0.3     57.51%      +0.3%  (insignificant)
t-poppler-reseau.trace             22.4    0.1     22.4    0.1     95.69%      +0.3%  (insignificant)
t-firefox-fishtank.trace           13.2    0.0     13.2    0.0     99.84%      +0.1%
t-swfdec-giant-steps.trace         14.9    0.0     14.9    0.0     87.68%      +0.1%  (insignificant)
t-swfdec-youtube.trace              7.8    0.0      7.8    0.0     35.22%      +0.1%  (insignificant)
t-firefox-planet-gnome.trace       11.5    0.0     11.5    0.0     29.37%      +0.0%  (insignificant)
t-firefox-fishbowl.trace           21.2    0.0     21.2    0.0     18.09%      +0.0%  (insignificant)
t-grads-heat-map.trace              4.4    0.0      4.4    0.0      1.84%      +0.0%  (insignificant)
t-firefox-paintball.trace          18.0    0.0     18.0    0.0     33.43%      -0.0%  (insignificant)
t-firefox-talos-svg.trace          20.5    0.0     20.5    0.1     68.56%      -0.1%  (insignificant)
t-midori-zoomed.trace               8.0    0.0      8.0    0.0     99.98%      -0.1%
t-firefox-canvas-swscroll.trace    32.1    0.1     32.1    0.1     85.27%      -0.1%  (insignificant)
t-gnome-system-monitor.trace       17.2    0.0     17.2    0.0     99.97%      -0.2%
t-firefox-chalkboard.trace         36.5    0.0     36.6    0.0    100.00%      -0.2%
t-firefox-asteroids.trace          11.1    0.0     11.1    0.0    100.00%      -0.2%
t-firefox-canvas.trace             17.9    0.0     18.0    0.0    100.00%      -0.3%
t-chromium-tabs.trace               4.9    0.0      4.9    0.0     97.95%      -0.3%  (insignificant)
t-xfce4-terminal-a1.trace           4.8    0.0      4.8    0.0    100.00%      -0.4%
t-firefox-scrolling.trace          31.1    0.1     31.2    0.1    100.00%      -0.5%
t-evolution.trace                  13.7    0.1     13.8    0.1     99.99%      -0.6%
t-gnome-terminal-vim.trace         22.0    0.2     22.2    0.1     99.99%      -0.7%
t-gvim.trace                       33.2    0.2     33.5    0.2    100.00%      -0.8%

At most 6 outliers rejected per case per set.

Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).

Changes in the order of +/- 1% can be attributed to measurement error,
even if they are deemed to be statistically significant. This claim is
based on comparing two 30-iteration identical "before" runs using the
exact same binaries, and observing changes from -0.4% to +0.5% with
>=99% confidence.

Confidence is based on Welch's t-test.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, commit message.

test: Fix OpenMP clauses for the tolerance-test

Compiling with the Intel Compiler reveals a problem:

tolerance-test.c(350): error: index variable "i" of for statement following an OpenMP for pragma must be private
# pragma omp parallel for default(none) shared(i) private (result)
^

In addition to this, the 'result' variable also should not be private
(otherwise its value does not survive after the end of the loop). It
needs to be either shared or use the reduction clause to describe how
the results from multiple threads are combined together. Reduction
seems to be more appropriate here.
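
A sketch of the reduction approach, using hypothetical function names and a
logical-AND reduction as an assumption; the operator actually chosen in
tolerance-test.c may differ. Each thread works on its own copy of 'result',
and the copies are combined when the loop ends, so the value survives.

```c
/* Hypothetical per-item check standing in for the real test body:
 * item 42 is made to fail. */
static int
check_item (int i)
{
    return i != 42;
}

/* Returns 1 when all n items pass, 0 otherwise. */
static int
run_checks (int n)
{
    int result = 1;
    int i;

#   pragma omp parallel for default(none) shared(n) private(i) reduction(&&:result)
    for (i = 0; i < n; i++)
        result = result && check_item (i);

    return result;
}
```

When compiled without OpenMP support, the pragma is ignored and the loop
runs serially with the same result.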

configure.ac: Check if the compiler supports GCC vector extensions

The Intel Compiler 14.0.0 claims GCC 4.7.3 compatibility
via the __GNUC__/__GNUC_MINOR__ macros, but does not provide the same
level of GCC vector extensions support as the original GCC compiler:
    http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

Which results in the following compilation failure:

In file included from ../test/utils.h(7),
                 from ../test/utils.c(3):
../test/utils-prng.h(138): error: expression must have integral type
      uint32x4 e = x->a - ((x->b << 27) + (x->b >> (32 - 27)));
                            ^

The problem is fixed by doing a special check in configure for
this feature.
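
A probe for this feature might look like the sketch below, which exercises
the same kind of expression as the failing line. The function names are
hypothetical and this is not the test program used in configure.ac; it
assumes a compiler that accepts the GCC vector_size attribute, vector shifts
by a scalar, and vector subscripting.

```c
#include <stdint.h>

/* A 16-byte vector of four uint32_t, using the GCC vector_size attribute. */
typedef uint32_t uint32x4 __attribute__ ((vector_size (16)));

/* Element-wise rotate left by 27 bits, as in the failing expression. */
static uint32x4
rotate_left_27 (uint32x4 x)
{
    return (x << 27) + (x >> (32 - 27));
}

/* Returns the first lane of the rotated vector { 1, 2, 3, 4 }. */
static uint32_t
probe_vector_extensions (void)
{
    uint32x4 v = { 1, 2, 3, 4 };
    uint32x4 r = rotate_left_27 (v);

    return r[0];
}
```

A compiler lacking this level of support would reject the shift, which is
the situation the configure check is meant to detect.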

lowlevel-blt-bench: add in_reverse_8888_8888 test

in_reverse_8888_8888 is one of the more commonly used operations in the
cairo-perf-trace suite that hasn't been in lowlevel-blt-bench until now.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Split from "Add extra test to lowlevel-blt-bench and fix an
existing one", new summary.

lowlevel-blt-bench: over_reverse_n_8888 needs solid source

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Split from "Add extra test to lowlevel-blt-bench and fix an
existing one", new summary.

ARMv6: remove 1 instr per row in generate_composite_function

This knocks off one instruction per row. The effect is probably too small to
be measurable, but might as well be included. The second occurrence of this
sequence doesn't actually benefit at all, but is changed for consistency.

The saved instruction comes from combining the "and" inside the .if
statement with an earlier "tst". The "and" was normally needed, except
in one special case, where bits 4-31 were all shifted off the top of
the register later on in preload_leading_step2, so we didn't care about
their values.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Remove "bits 0-3" from the comments, update patch summary, and
augment message with Ben's suggestion.

ARMv6: Fix indentation in the composite macros

Remove all the operators that use division from pixman-combine32.c

These are now handled by floating point combiners.

Copy the comments from pixman-combine32.c to pixman-combine-float.c

An upcoming commit will delete many of the operators from
pixman-combine32.c and rely on the ones in pixman-combine-float.c. The
comments about how the operators were derived are still useful though,
so copy them into pixman-combine-float.c before the deletion.

utils.c: Set DEVIATION to 0.0128

Consider a HARD_LIGHT operation with the following pixels:

- source:           15      (6 bits)
- source alpha:     255     (8 bits)
- mask alpha:       223     (8 bits)
- dest              255     (8 bits)
- dest alpha:       0       (8 bits)

Since 2 times the source is less than source alpha, the first branch
of the hard light blend mode is taken:

        (1 - sa) * d + (1 - da) * s + 2 * s * d

Since da is 0 and d is 1, this degenerates to:

        (1 - sa) + 3 * s

Taking (src IN mask) into account along with the fact that sa is 1,
this becomes:

        (1 - ma) + 3 * s * ma

      = (1 - 223/255.0) + 3 * (15/63.0) * (223/255.0)

      = 0.7501400560224089

When computed with the source converted by bit replication to eight
bits, and additionally with the (src IN mask) part rounded to eight
bits, we get:

        ma = 223/255.0

        s * ma = (60 / 255.0) * (223/255.0) which rounds to 52 / 255

and the result is

        (1 - ma) + 3 * s * ma

      = (1 - 223/255.0) + 3 * 52/255.0

      = 0.7372549019607844

so now we have an error of 0.012885.

Without making changes to the way pixman does integer
rounding/arithmetic, this error must then be considered
acceptable. Due to conservative computations in the test suite we can
however get away with 0.0128 as the acceptable deviation.

This fixes the remaining failures in pixel-test.
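
The arithmetic above can be reproduced with a short sketch. The
function names are hypothetical; the 6-to-8 bit expansion and the
round-to-nearest step follow the description in this message rather
than any particular pixman routine:

```c
#include <assert.h>

/* Full precision: (1 - ma) + 3 * s * ma with s = 15/63, ma = 223/255. */
double hard_light_exact (void)
{
    double s = 15 / 63.0;
    double ma = 223 / 255.0;

    return (1 - ma) + 3 * s * ma;
}

/* Same expression with the source expanded to 8 bits by bit
 * replication and (src IN mask) rounded to 8 bits. */
double hard_light_8bit (void)
{
    int s8 = (15 << 2) | (15 >> 4);            /* 6 -> 8 bits: 60 */
    int s_in_mask = (s8 * 223 + 127) / 255;    /* round to nearest */
    double ma = 223 / 255.0;

    assert (s8 == 60);
    assert (s_in_mask == 52);

    return (1 - ma) + 3 * s_in_mask / 255.0;
}
```

The difference hard_light_exact () - hard_light_8bit () is the
0.012885 error quoted above.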

Use floating point combiners for all operators that involve divisions

Consider a DISJOINT_ATOP operation with the following pixels:

- source: 0xff (8 bits)
- source alpha: 0x01 (8 bits)
- mask alpha: 0x7b (8 bits)
- dest: 0x00 (8 bits)
- dest alpha: 0xff (8 bits)

When (src IN mask) is computed in 8 bits, the resulting alpha channel
is 0 due to rounding:

     floor ((0x01 * 0x7b) / 255.0 + 0.5) = floor (0.9823) = 0

which means that since Render defines any division by zero as
infinity, the Fa and Fb for this operator end up as follows:

     Fa = max (1 - (1 - 1) / 0, 0) = 0

     Fb = min (1, (1 - 0) / 1) = 1

and so since dest is 0x00, the overall result is 0.

However, when computed in full precision, the alpha value no longer
rounds to 0, and so Fa ends up being

     Fa = max (1 - (1 - 1) / 0.0001, 0) = 1

and so the result is now

     s * ma * Fa + d * Fb

   = (1.0 * (0x7b / 255.0) * 1) + 0 * 1

   = 0x7b / 255.0

   = 0.4823

so the error in this case ends up being 0.48235294, which is clearly
not something that can be considered acceptable.

In order to avoid this problem, we need to do all arithmetic in such a
way that a multiplication of two tiny numbers can never end up being
zero unless one of the input numbers is itself zero.

This patch makes all computations that involve divisions take place in
floating point, which is sufficient to fix the test cases.

This brings the number of failures in pixel-test down to 14.
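
A small sketch of the effect, using the pixel values from the example
above. The helper names are hypothetical, and disjoint_fa() only
implements the Fa expression as written in this message, with
division by zero treated as infinity. The d * Fb term is left out
because dest is 0:

```c
#include <assert.h>

/* Multiply two 8-bit values as fractions of 255, rounding to nearest. */
static int mul_un8 (int a, int b)
{
    return (a * b + 127) / 255;
}

/* Fa = max (1 - (1 - da) / sa, 0); a zero sa makes the quotient
 * infinite, so Fa becomes 0. */
static double disjoint_fa (double sa, double da)
{
    double f;

    if (sa == 0.0)
        return 0.0;

    f = 1.0 - (1.0 - da) / sa;
    return f > 0.0 ? f : 0.0;
}

/* s * ma * Fa with the (src IN mask) alpha rounded to 8 bits. */
double source_term_8bit (void)
{
    int sa_in_mask = mul_un8 (0x01, 0x7b);

    assert (sa_in_mask == 0);   /* the product is rounded away */
    return 1.0 * (0x7b / 255.0) * disjoint_fa (sa_in_mask / 255.0, 1.0);
}

/* s * ma * Fa with the (src IN mask) alpha kept in floating point. */
double source_term_float (void)
{
    double sa_in_mask = (0x01 / 255.0) * (0x7b / 255.0);

    return 1.0 * (0x7b / 255.0) * disjoint_fa (sa_in_mask, 1.0);
}
```

The 8-bit version returns 0 while the floating point version returns
0x7b / 255.0, which is the discrepancy described above.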

Soft Light: Consistent approach to division by zero

The Soft Light operator has several branches. One of them is decided
based on whether 2 * s is less than or equal to sa. In floating
point implementations, when those two values are very close to each
other, it may not be completely predictable which branch we hit.

This is a problem because in one branch, when destination alpha is
zero, we get the result

r = d * as

and in the other we get

r = 0

So when d and as are not 0, this causes two different results to be
returned from essentially identical input values. In other words,
there is a discontinuity in the current implementation.

This patch arbitrarily changes the second branch such that it now returns
d * sa instead. There is no deep meaning behind this, because
essentially this is an attempt to assign meaning to division by zero,
and all that is required is that that meaning doesn't depend on minute
differences in input values.

This makes the number of failed pixels in pixel-test go down to 347.

pixman-combine32.c: Fix bugs related to integer promotion

In the component alpha part of the PDF_SEPARABLE_BLEND_MODE macro, the
expression ~RED_8 (m) is used. Because RED_8(m) gets promoted to int
before ~ is applied, the whole expression typically becomes some
negative value rather than (255 - RED_8(m)) as desired.

Fix this by using unsigned temporary variables.

This reduces the number of failures in pixel-test to 363.
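
The promotion issue can be seen in isolation with a sketch like the
following. The function names are hypothetical and the argument
stands in for an 8-bit channel value such as RED_8 (m):

```c
#include <assert.h>
#include <stdint.h>

/* The operand is promoted to int before ~ is applied, so the result
 * is the negative value -1 - m. */
int complement_promoted (uint8_t m)
{
    return ~m;
}

/* Storing the complement in an unsigned 8-bit temporary truncates it
 * back to the intended 255 - m. */
int complement_unsigned (uint8_t m)
{
    uint8_t t = (uint8_t) ~m;

    assert (t == 255 - m);
    return t;
}
```

For m = 0x10 the first function returns -17 and the second returns
239.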

pixman/pixman-combine32.c: Bug fixes for separable blend modes

This commit fixes four separate bugs:

1. In the computation

      (1 - sa) * d + (1 - da) * s + sa * da * B(s, d)

   we were using regular addition for all four channels, but for
   superluminescent pixels, the addition could overflow causing
   nonsensical results.

2. The variables and return types used for the results of the blend
   mode calculations were unsigned, but for various blend modes (and
   especially with superluminescent pixels), the blend mode
   calculations could be negative, resulting in underflows.

3. The blend mode computations were returned as 8-bit values, which is
   not sufficient precision (especially considering that we need
   signed results).

4. The value before the final division by 255 was not properly clamped
   to [0, 255].

This patch fixes all those bugs. The blend mode computations are now
returned as signed 16 bit values with 1 represented as 255 * 255.

With these fixes, the number of failing pixels in pixel-test goes down
from 431 to 384.

pixel-test.c: Add a number of pixels that have failed at some point

This commit adds a large number of pixel regressions to
pixel-test. All of these have at some point been failing in
blend-mode-test, and most of them do fail currently.

To be specific, with this commit, pixel-test reports 431 failed tests.

test/tolerance-test: New test program

This new test program is similar to test/composite in that it relies
on the pixel_checker_t API to do tolerance based verification. But
unlike the composite test, which verifies combinations of a fixed set
of pixels, this one generates random images and verifies that those
composite correctly.

Also unlike composite, tolerance-test supports all the separable blend
mode operators in addition to the original Render operators.

When tests fail, a C struct is printed that can be pasted into
pixel-test for regression purposes.

There is an option "--forever" which causes the random seed to be set
to the current time, and then the test runs until interrupted. This is
useful for overnight runs.

This test currently fails badly due to various bugs in the blend mode
operators. Later commits will fix those.

pixel-test: Command line argument to specify the regression to run

A new command line argument allows the user to specify which one of
the regressions should be run.

pixel-test: Add support for mask pixels

Support is added to pixel-test for verifying operations involving
masks. If a regression includes a mask, it is verified with the
pixel_checker API in both unified and component alpha modes.

test/check-formats.c: Add support for separable blend modes

test/utils.c: Add support for separable blend mode ops to do_composite()

The implementations are copied from the floating point pipeline, but
use double precision instead of single precision.

configure.ac: Check and use -Wno-unused-local-typedefs GCC option

With GCC 4.8.2 the COMPILE_TIME_ASSERT macro produces a spurious
warning about an unused local typedef:

    In file included from pixman.c:29:0:
    pixman.c: In function 'optimize_operator':
    pixman-private.h:1019:22: warning: typedef 'compile_time_assertion' locally defined but not used [-Wunused-local-typedefs]

The flag -Wno-unused-local-typedefs suppresses that warning.

Soft Light: The first comparison should be <=, not <

According to the definition of soft light, the first comparison is
less-than-or-equal, not less-than.

general: Support component alpha for all image types

Currently, if you attempt to use component alpha on source images or
images without RGB channels, Pixman will silently just use unified
alpha instead. This patch makes such images supported for component
alpha.

There is no particularly compelling use case at the moment, but this
patch does get rid of a bit of special-case code both in
pixman-general.c and in test/composite.c.

test/utils.c: Make the stack unaligned only on 32 bit Windows

The call_test_function() function contains some assembly that deliberately
causes the stack to be aligned to 32 bits rather than 128 bits on
x86-32. The intention is to catch bugs that surface when pixman is
called from code that only uses a 32 bit alignment.

However, recent versions of GCC apparently make the assumption (either
accidentally or deliberately) that the incoming stack is aligned
to 128 bits, where older versions only seemed to make this assumption
when compiling with -msse2. This causes the vector code in the PRNG to
now segfault when called from call_test_function() on x86-32.

This patch fixes that by only making the stack unaligned on 32 bit
Windows, where it would definitely be incorrect for GCC to assume that
the incoming stack is aligned to 128 bits.

V2: Put "defined(...)" around __GNUC__

Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com>
Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=491110

Fix the SSSE3 CPUID detection.

SSSE3 is detected by bit 9 of ECX, but we were checking bit 9 of EDX,
which is APIC, leading to SSSE3 routines being called on CPUs without
SSSE3.

Reviewed-by: Matt Turner <mattst88@gmail.com>
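
A sketch of the corrected check, written as a pure function over the
register values returned by CPUID leaf 1 so that it does not depend
on any particular cpuid wrapper. The macro and function names are
hypothetical, not the ones used in pixman-x86.c:

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_SSSE3 (1u << 9)   /* leaf 1, ECX bit 9: SSSE3 */
#define CPUID1_EDX_APIC  (1u << 9)   /* leaf 1, EDX bit 9: on-chip APIC */

/* Returns nonzero if the leaf 1 register values report SSSE3. */
int has_ssse3 (uint32_t ecx, uint32_t edx)
{
    /* EDX bit 9 has the same mask value but reports the APIC, so it
     * must not be consulted here. */
    (void) edx;

    return (ecx & CPUID1_ECX_SSSE3) != 0;
}
```

A CPU that reports an APIC but no SSSE3 (EDX bit 9 set, ECX bit 9
clear) is now correctly rejected.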

demos/Makefile.am: Move EXTRA_DIST outside "if HAVE_GTK"

Without this, if tarballs are generated on a system that doesn't have
GTK+ 2 development headers available, the files in EXTRA_DIST will not
be included, which then causes builds from the tarball to fail on
systems that do have GTK+ 2 headers available.

Fixes https://bugs.freedesktop.org/show_bug.cgi?id=71465

test: Fix the win32 build

The win32 build has no config.h, so HAVE_CONFIG_H should be checked
before including it, as in utils.h.

Post-release version bump to 0.33.1

Pre-release version bump to 0.32.0

Post-release version bump to 0.31.3

Pre-release version bump to 0.31.2

pixman_trapezoid_valid(): Fix underflow when bottom is close to MIN_INT

If t->bottom is close to MIN_INT (probably an invalid value), subtracting
top can lead to underflow, which causes crashes. This patch fixes
the issue.

This fixes bug 67484.
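
To illustrate why a subtraction-based check is fragile, here is a
sketch that contrasts it with a direct comparison. It is not
necessarily the change this patch makes; fixed_t and both function
names are hypothetical, and the wrap-around is emulated with unsigned
arithmetic so that the sketch itself has no undefined behaviour:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;   /* stand-in for a 32-bit fixed point type */

/* Decides "bottom is below top" by looking at the sign of the
 * wrapped difference. */
int valid_by_subtract (fixed_t top, fixed_t bottom)
{
    int32_t diff = (int32_t) ((uint32_t) bottom - (uint32_t) top);

    return diff > 0;
}

/* Decides the same thing by comparing the two values directly. */
int valid_by_compare (fixed_t top, fixed_t bottom)
{
    return bottom > top;
}
```

With top = 1 and bottom = INT32_MIN the difference wraps around to a
large positive number, so valid_by_subtract() accepts the trapezoid
while valid_by_compare() rejects it.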

test/trap-crasher.c: Add trapezoid that demonstrates a crash

This trapezoid causes a crash due to an underflow in
pixman_trapezoid_valid().

Test case from Ritesh Khadgaray.

Fix pixman build with older GCC releases

This patch fixes building pixman with older GCC releases such as
GCC 3.3 and older (OpenBSD; some older archs use GCC 3.3.6). Instead
of the previous method of detecting __builtin_clz, it uses an
autoconf check to determine its presence. Compilers that pretend to
be GCC, implement __builtin_clz and already use the intrinsic
include LLVM/Clang, Open64, EKOPath and PCC.

pixman-glyph.c: Add __force_align_arg_pointer to composite functions

The functions pixman_composite_glyphs_no_mask() and
pixman_composite_glyphs() can call into code compiled with -msse2,
which requires the stack to be aligned to 16 bytes. Since the ABIs on
Windows and Linux for x86-32 don't provide this guarantee, we need to
use this attribute to make GCC generate a prologue that realigns the
stack.

This fixes the crash introduced in the previous commit and also

https://bugs.freedesktop.org/show_bug.cgi?id=70348

and

https://bugs.freedesktop.org/show_bug.cgi?id=68300

utils.c: On x86-32 unalign the stack before calling test_function

GCC when compiling with -msse2 and -mssse3 will assume that the stack
is aligned to 16 bytes even on x86-32 and accordingly issue movdqa
instructions for stack allocated variables.

But despite what GCC thinks, the standard ABI on x86-32 only requires
a 4-byte aligned stack. This is true at least on Windows, but there
also was (and maybe still is) Linux code in the wild that assumed
this. When such code calls into pixman and hits something compiled
with -msse2, we get a segfault from the unaligned movdqas.

Pixman has worked around this issue in the past with the gcc attribute
"force_align_arg_pointer" but the problem has resurfaced now in

https://bugs.freedesktop.org/show_bug.cgi?id=68300

because pixman_composite_glyphs() is missing this attribute.

This patch makes fuzzer_test_main() call the test_function through a
trampoline, which, on x86-32, has a bit of assembly that deliberately
avoids aligning the stack to 16 bytes as GCC normally expects. The
result is that glyph-test now crashes.

V2: Mark caller-save registers as clobbered, rather than using
noinline on the trampoline.

configure.ac: check and use -Wdeclaration-after-statement GCC option

The accidental use of declaration after statement breaks compilation
with C89 compilers such as MSVC. Assuming that MSVC is one of the
supported compilers, it makes sense to ask GCC to at least report
warnings for such problematic code.

sse2: bilinear fast path for src_x888_8888

Running cairo-perf-trace benchmark on Intel Core2 T7300:

Before:
[  0]    image    t-firefox-canvas-swscroll    1.989    2.008   0.43%    8/8
[  1]    image        firefox-canvas-scroll    4.574    4.609   0.50%    8/8

After:
[  0]    image    t-firefox-canvas-swscroll    1.404    1.418   0.51%    8/8
[  1]    image        firefox-canvas-scroll    4.228    4.259   0.36%    8/8

configure.ac: Add check for pmulhuw assembly

Clang 3.0 chokes on the following bit of assembly

    asm ("pmulhuw %1, %0\n\t"
        : "+y" (__A)
        : "y" (__B)
    );

from pixman-mmx.c with this error message:

    fatal error: error in backend: Unsupported asm: input constraint
        with a matching output constraint of incompatible type!

So add a check in configure to only enable MMX when the compiler can
deal with it.

scale.c: Use int instead of kernel_t for values in named_int_t

The 'value' field in the 'named_int_t' struct is used for both
pixman_repeat_t and pixman_kernel_t values, so the type should be int,
not pixman_kernel_t.

Fixes some warnings like this

scale.c:124:33: warning: implicit conversion from enumeration
      type 'pixman_repeat_t' to different enumeration type
      'pixman_kernel_t' [-Wconversion]
    { "None",                   PIXMAN_REPEAT_NONE },
    ~                           ^~~~~~~~~~~~~~~~~~

when compiled with clang.

pixman-combine32.c: Make Color Burn routine follow the math more closely

For superluminescent destinations, the old code could underflow in

uint32_t r = (ad - d) * as / s;

when (ad - d) was negative. The new code avoids this problem (and
therefore causes changes in the checksums of thread-test and
blitters-test), but it is likely still buggy due to the use of
unsigned variables and other issues in the blend mode code.
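
The underflow in the old expression can be reproduced with a sketch
that evaluates it once with unsigned and once with signed variables.
The function names are hypothetical; only the expression itself is
taken from the message above:

```c
#include <assert.h>
#include <stdint.h>

/* Old form: (ad - d) wraps around to a huge value when d > ad. */
uint32_t burn_term_unsigned (uint32_t d, uint32_t ad, uint32_t s, uint32_t as)
{
    assert (s != 0);
    return (ad - d) * as / s;
}

/* Same expression with signed variables: the result may be negative. */
int32_t burn_term_signed (int32_t d, int32_t ad, int32_t s, int32_t as)
{
    assert (s != 0);
    return (ad - d) * as / s;
}
```

For a superluminescent destination such as d = 200, ad = 100 with
s = as = 255, the signed version yields -100 while the unsigned
version yields a value far outside [0, 255].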

pixman-combine32: Make Color Dodge routine follow the math more closely

Change blend_color_dodge() to follow the math in the comment more
closely.

Note, the new code here is in some sense worse than the old code
because it can now underflow the unsigned variables when the source is
superluminescent and (as - s) is therefore negative. The old code was
careful to clamp to 0.

But for superluminescent variables we really need the ability for the
blend function to become negative, and so the solution to the underflow
problem is to just use signed variables. The use of unsigned variables
is a general problem in all of the blend mode code that will have to
be solved later.

The CRC32 values in thread-test and blitters-test are updated to
account for the changes in output.

pixman-combine32: Rename a number of variables from sa/sca to as/s

There are no semantic changes, just variable renames. The motivation
for these renames is so that the names are shorter and better match
the ones used in the comments.

pixman-combine32: Improve documentation for blend mode operators

This commit overhauls the comments in pixman-combine32.c regarding
blend modes:

- Add a link to the PDF supplement that clarifies the specification of
ColorBurn and ColorDodge

- Clarify how the formulas for premultiplied colors are derived from
the ones in the PDF specifications

- Write out the derivation of the formulas in each blend routine

pixman-combine32.c: Formatting fixes

Fix a bunch of spacing issues.

V2: More spacing issues, in the _ca combiners

Fix thread-test on non-OpenMP systems

The non-reentrant versions of prng_* functions are thread-safe only in
OpenMP-enabled builds.

Fixes thread-test failing when compiled with Clang (both on Linux and
on MacOS).

Add support for SSSE3 to the MSVC build system

Handle SSSE3 just like MMX and SSE2.

Fix build of check-formats on MSVC

Fixes

check-formats.obj : error LNK2019: unresolved external symbol
_strcasecmp referenced in function _format_from_string

check-formats.obj : error LNK2019: unresolved external symbol
_snprintf referenced in function _list_operators

Fix building of "other" programs on MSVC

In d1434d112ca5cd325e4fb85fc60afd1b9e902786 the benchmarks have been
extended to include other programs as well and the variable names have
been updated accordingly in the autotools-based build system, but not
in the MSVC one.

Fix build on MSVC

After a4c79d695d52c94647b1aff78548e5892d616b70 the MMX and SSE2 code
has some declarations after the beginning of a block, which is not
allowed by MSVC.

Fixes multiple errors like:

pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type
as an expression

pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this
type as an expression

fast: Swap image and iter flags in generated fast paths

The generated fast paths that were moved into the 'fast'
implementation in ec0e38cbb746a673f8e989ab8eae356c8c77dac7 had their
image and iter flag arguments swapped; as a result, none of the fast
paths were ever called.

vmx: there is no need to handle unaligned destination anymore

So the redundant variables, memory reads/writes and reshuffles
can be safely removed. For example, this makes the inner loop
of the 'vmx_combine_add_u_no_mask' function much simpler.

Before:

    7a20:7d a8 48 ce lvx     v13,r8,r9
    7a24:7d 80 48 ce lvx     v12,r0,r9
    7a28:7d 28 50 ce lvx     v9,r8,r10
    7a2c:7c 20 50 ce lvx     v1,r0,r10
    7a30:39 4a 00 10 addi    r10,r10,16
    7a34:10 0d 62 eb vperm   v0,v13,v12,v11
    7a38:10 21 4a 2b vperm   v1,v1,v9,v8
    7a3c:11 2c 6a eb vperm   v9,v12,v13,v11
    7a40:10 21 4a 00 vaddubs v1,v1,v9
    7a44:11 a1 02 ab vperm   v13,v1,v0,v10
    7a48:10 00 0a ab vperm   v0,v0,v1,v10
    7a4c:7d a8 49 ce stvx    v13,r8,r9
    7a50:7c 00 49 ce stvx    v0,r0,r9
    7a54:39 29 00 10 addi    r9,r9,16
    7a58:42 00 ff c8 bdnz+   7a20 <.vmx_combine_add_u_no_mask+0x120>

After:

    76c0:7c 00 48 ce lvx     v0,r0,r9
    76c4:7d a8 48 ce lvx     v13,r8,r9
    76c8:39 29 00 10 addi    r9,r9,16
    76cc:7c 20 50 ce lvx     v1,r0,r10
    76d0:10 00 6b 2b vperm   v0,v0,v13,v12
    76d4:10 00 0a 00 vaddubs v0,v0,v1
    76d8:7c 00 51 ce stvx    v0,r0,r10
    76dc:39 4a 00 10 addi    r10,r10,16
    76e0:42 00 ff e0 bdnz+   76c0 <.vmx_combine_add_u_no_mask+0x120>

vmx: align destination to fix valgrind invalid memory writes

The SIMD optimized inner loops in the VMX/Altivec code are trying
to emulate unaligned accesses to the destination buffer. For each
4 pixels (which fit into a 128-bit register) the current
implementation:
  1. first performs two aligned reads, which cover the needed data
  2. reshuffles bytes to get the needed data in a single vector register
  3. does all the necessary calculations
  4. reshuffles bytes back to their original location in two registers
  5. performs two aligned writes back to the destination buffer

Unfortunately, if the destination buffer is unaligned and
the width is a perfect multiple of 4 pixels, we may have some writes
crossing the boundaries of the destination buffer. In a multithreaded
environment this may potentially corrupt the data outside of the
destination buffer if it is concurrently read and written by some
other thread.

The valgrind report for blitters-test is full of:

==23085== Invalid write of size 8
==23085==    at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
==23085==    by 0x100446EF: general_composite_rect (pixman-general.c:214)
==23085==    by 0x10002537: test_composite (blitters-test.c:363)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)
==23085==  Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
==23085==    at 0x4051DA0: memalign (vg_replace_malloc.c:581)
==23085==    by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
==23085==    by 0x10004CFF: aligned_malloc (utils.c:833)
==23085==    by 0x10001DCB: create_random_image (blitters-test.c:47)
==23085==    by 0x10002263: test_composite (blitters-test.c:283)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)

This patch addresses the problem by first aligning the destination
buffer at a 16 byte boundary in each combiner function. This trick
is borrowed from the pixman SSE2 code.

This allows the new thread-test to pass on PowerPC VMX/Altivec systems and
also resolves the "make check" failure reported for POWER7 hardware:
    http://lists.freedesktop.org/archives/pixman/2013-August/002871.html
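
The loop structure described above can be sketched in scalar C. This
is only an illustration of the head / aligned body / tail split; the
function names are hypothetical and the per-byte saturating add
stands in for the vector instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Saturating per-byte add of two packed 32-bit pixels. */
static uint32_t add_saturate_8888 (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((a >> shift) & 0xff) + ((b >> shift) & 0xff);

        r |= (c > 0xff ? 0xff : c) << shift;
    }
    return r;
}

/* Returns the number of leading pixels handled one at a time. */
int combine_add_aligned (uint32_t *dest, const uint32_t *src, int width)
{
    int head = 0;
    int i;

    assert (width >= 0);

    /* Head: single pixels until dest reaches a 16-byte boundary. */
    while (width > 0 && ((uintptr_t) dest & 15))
    {
        *dest = add_saturate_8888 (*dest, *src);
        dest++; src++; width--; head++;
    }

    /* Body: whole groups of 4 pixels; dest is aligned here, so a
     * vector implementation can use a single aligned store. */
    for (; width >= 4; width -= 4)
    {
        for (i = 0; i < 4; ++i)
            dest[i] = add_saturate_8888 (dest[i], src[i]);
        dest += 4; src += 4;
    }

    /* Tail: remaining pixels, again one at a time. */
    while (width-- > 0)
    {
        *dest = add_saturate_8888 (*dest, *src);
        dest++; src++;
    }

    return head;
}
```

Because every store covers only pixels inside [dest, dest + width),
nothing outside the destination buffer is read and written back.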

test: Add new thread-test program

This test program allocates an array of 16 * 7 uint32_ts and spawns 16
threads that each use 7 of the allocated uint32_ts as a destination
image for a large number of composite operations. Each thread then
computes and returns a checksum for the image. Finally, the main
thread computes a checksum of the checksums and verifies that it
matches expectations.

The purpose of this test is to catch errors where memory outside images
is read and then written back. Such out-of-bounds accesses are broken
when multiple threads are involved, because the threads will race to
read and write the shared memory.

V2:
- Incorporate fixes from Siarhei for endianness and undefined behavior
  regarding argument evaluation
- Make the images 7 pixels wide since the bug only happens when the
  composite width is greater than 4.
- Compute a checksum of the checksums so that you don't have to
  update 16 values if something changes.

V3: Remove stray dollar sign

Rename HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS

The test for pthread_setspecific() can be used as a general test for
whether pthreads are available, so rename the variable from
HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when
better support for thread local variables is available.

However, the pthread arguments are still only added to CFLAGS and
LDFLAGS when pthread_setspecific() is used for thread local variables.

V2: AC_SUBST(PTHREAD_CFLAGS)

blitters-test: Remove unused variable

utils.c: Make image_endian_swap() deal with negative strides

Use a temporary variable s containing the absolute value of the stride
as the upper bound in the inner loops.

V2: Do this for the bpp == 16 case as well

utils.c: Make print_image actually cope with negative strides

Commit 4312f077365bf9f59423b1694136089c6da6216b claimed to have made
print_image() work with negative strides, but it didn't actually
work. When the stride was negative, the image buffer would be accessed
as if the stride were positive.

Fix the bug by not changing the stride variable and instead using a
temporary, s, that contains the absolute value of stride.
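
A sketch of the pattern, assuming a byte-addressed image whose first
logical row is passed in and whose stride may be negative. The
function is hypothetical and sums bytes instead of printing them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Visits every byte of every row. The signed stride is used to step
 * from row to row; its absolute value s bounds the inner loop. */
uint32_t sum_image_bytes (const uint8_t *first_row, int stride, int height)
{
    int s = stride < 0 ? -stride : stride;
    uint32_t sum = 0;
    int i, j;

    assert (height >= 0);

    for (i = 0; i < height; ++i)
    {
        const uint8_t *line = first_row + (ptrdiff_t) i * stride;

        for (j = 0; j < s; ++j)
            sum += line[j];
    }

    return sum;
}
```

For a 3-row image with 4 bytes per row, passing the last row in
memory together with a stride of -4 visits the same bytes as passing
the first row with a stride of 4.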

Move generated affine fetchers into pixman-fast-path.c

The generated fetchers for NEAREST, BILINEAR, and
SEPARABLE_CONVOLUTION filters are fast paths and so they belong in
pixman-fast-path.c

Move bits_image_fetch_bilinear_no_repeat_8888 into pixman-fast-path.c

This iterator is really a fast path, so it belongs in the fast path
implementation.

fast, ssse3: Simplify logic to fetch lines in the bilinear iterators

Instead of having logic to swap the lines around when one of them
doesn't match, store the two lines in an array and use the least
significant bit of the y coordinate as the index into that
array. Since the two lines always have different least significant
bits, they will never collide.

The effect is that lines corresponding to even y coordinates are
stored in info->lines[0] and lines corresponding to odd y coordinates
are stored in info->lines[1].
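
The indexing scheme can be sketched as follows. The types and
function names are hypothetical simplifications: the real iterators
keep a pixel buffer per line, whereas this sketch only tracks which
y coordinate each slot holds and how often a fetch was needed:

```c
#include <assert.h>

typedef struct
{
    int y;              /* y coordinate cached in this slot, -1 if none */
} line_t;

typedef struct
{
    line_t lines[2];
    int    n_fetches;   /* number of (simulated) scanline fetches */
} bilinear_info_t;

void info_init (bilinear_info_t *info)
{
    info->lines[0].y = -1;
    info->lines[1].y = -1;
    info->n_fetches = 0;
}

/* Returns the slot for scanline y, fetching it only if the slot
 * currently holds a different scanline. */
line_t *get_line (bilinear_info_t *info, int y)
{
    line_t *line = &info->lines[y & 1];

    assert (y >= 0);

    if (line->y != y)
    {
        /* A real iterator would fetch and interpolate the line here. */
        line->y = y;
        info->n_fetches++;
    }

    return line;
}
```

Requesting lines 4 and 5 and then 5 and 6 costs three fetches in
total: line 5 stays in slot 1 while line 6 replaces line 4 in slot 0.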

test: Test negative strides

Pixman supports negative strides, but up until now they haven't been
tested outside of stress-test. This commit adds testing of negative
strides to blitters-test, scaling-test, affine-test, rotate-test, and
composite-traps-test.

test: Share the image printing code

The affine-test, blitters-test, and scaling-test all have the ability
to print out the bytes of the destination image. Share this code by
moving it to utils.c.

At the same time make the code work correctly with negative strides.

{scaling,affine,composite-traps}-test: Use compute_crc32_for_image()

By using this function instead of compute_crc32() the alpha masking
code and the call to image_endian_swap() are not duplicated.

pixman-filter.c: Use 65536, not 65535, for fixed point conversion

Converting a double precision number to 16.16 fixed point should be
done by multiplying with 65536.0, not 65535.0.

The bug could potentially cause certain filters that would otherwise
leave the image bit-for-bit unchanged under an identity
transformation, to not do so, but the numbers are close enough that
there weren't any visual differences.
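
The difference between the two scale factors is easy to see in a
sketch. The type and function names are hypothetical; in 16.16 fixed
point the value 1.0 is represented as 1 << 16 = 65536:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16_t;   /* 16 integer bits, 16 fraction bits */

/* Correct conversion: scale by 2^16. */
fixed_16_16_t double_to_fixed (double d)
{
    assert (d > -32768.0 && d < 32768.0);
    return (fixed_16_16_t) (d * 65536.0);
}

/* Conversion with the wrong scale factor, for comparison. */
fixed_16_16_t double_to_fixed_65535 (double d)
{
    assert (d > -32768.0 && d < 32768.0);
    return (fixed_16_16_t) (d * 65535.0);
}
```

With the wrong factor a filter weight of exactly 1.0 becomes 65535,
one unit short of fixed point 1.0, which is why an identity
transformation could fail to be bit-exact.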

demos/scale.ui: Allow subsample_bits to be 0

The separable convolution filter supports a subsample_bits of 0 which
corresponds to no subsampling at all, so allow this value to be used
in the scale demo.

ssse3: Add iterator for separable bilinear scaling

This new iterator uses the SSSE3 instructions pmaddubsw and pabsw to
implement a fast iterator for bilinear scaling.

There is a graph here recording the per-pixel time for various
bilinear scaling algorithms as reported by scaling-bench:

    http://people.freedesktop.org/~sandmann/ssse3.v2/ssse3.v2.png

As the graph shows, this new iterator is clearly faster than the
existing C iterator, and when used with an SSE2 combiner, it is also
faster than the existing SSE2 fast paths for upscaling, though not for
downscaling.

Another graph:

    http://people.freedesktop.org/~sandmann/ssse3.v2/movdqu.png

shows the difference between writing to iter->buffer with movdqa,
movdqu on an aligned buffer, and movdqu on a deliberately unaligned
buffer. Since the differences are very small, the patch here avoids
using movdqa because imposing alignment restrictions on iter->buffer
may interfere with other optimizations, such as writing directly to
the destination image.

The data was measured with scaling-bench on a Sandy Bridge Core
i3-2350M @ 2.3GHz and is available in this directory:

    http://people.freedesktop.org/~sandmann/ssse3.v2/

where there is also a Gnumeric spreadsheet ssse3.v2.gnumeric
containing the per-pixel values and the graph.

V2:
- Use uintptr_t instead of unsigned long in the ALIGN macro
- Use _mm_storel_epi64 instead of _mm_cvtsi128_si64 as the latter form
  is not available on x86-32.
- Use _mm_storeu_si128() instead of _mm_store_si128() to avoid
  imposing alignment requirements on iter->buffer

Add empty SSSE3 implementation

This commit adds a new, empty SSSE3 implementation and the associated
build system support.

configure.ac:   detect whether the compiler understands SSSE3
                intrinsics and set up the required CFLAGS

Makefile.am:    Add libpixman-ssse3.la

pixman-x86.c:   Add X86_SSSE3 feature flag and detect it in
                detect_cpu_features().

pixman-ssse3.c: New file with an empty SSSE3 implementation

V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris
support is added.

general: Ensure that iter buffers are aligned to 16 bytes

At the moment iter buffers are only guaranteed to be aligned to a 4
byte boundary. SIMD implementations benefit from the buffers being
aligned to 16 bytes, so ensure this is the case.

V2:
- Use uintptr_t instead of unsigned long
- allocate 3 * SCANLINE_BUFFER_LENGTH bytes on stack rather than just
SCANLINE_BUFFER_LENGTH
- use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH
to determine overflow
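
A sketch of rounding a pointer up to the next 16-byte boundary with
uintptr_t arithmetic. The function name is hypothetical; the point is
that the rounding can skip up to 15 bytes, so the underlying
allocation needs some slack:

```c
#include <assert.h>
#include <stdint.h>

/* Returns the first address >= addr that is a multiple of 16. */
uint8_t *align_up_16 (uint8_t *addr)
{
    uintptr_t p = (uintptr_t) addr;
    uint8_t *aligned = (uint8_t *) ((p + 15) & ~(uintptr_t) 15);

    assert (((uintptr_t) aligned & 15) == 0);
    return aligned;
}
```

uintptr_t is used rather than unsigned long because the latter is
not guaranteed to be wide enough to hold a pointer on every platform.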

sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA)

The loops are already unrolled, so it was just a matter of packing
4 pixels into a single XMM register and doing aligned 128-bit
writes to memory via MOVDQA instructions for the SRC compositing
operator fast path. For the other fast paths, this XMM register
is also directly routed to further processing instead of doing
extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
which results in a clear performance improvement.

There are also some other (less important) tweaks:

1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
   index for addressing memory. The problem is that 'pixman_fixed_t'
   is a 32-bit data type and it has to be extended to 64-bit
   offsets, which needs extra instructions on 64-bit systems.

2. Recalculate the horizontal interpolation weights only
   once per 4 pixels by treating the XMM register as four pairs
   of 16-bit values. Each of these 16-bit/16-bit pairs can be
   replicated to fill the whole 128-bit register by using PSHUFD
   instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions
   per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels
   (or "3 PADDW/PSRLW" per each pixel).

   Now a good question is whether replacing "9 PADDW/PSRLW" with
   "4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD
   instructions are very fast on new Intel processors (including
   Atoms), but are rather slow on the first generation of Core2
   (Merom) and on the other processors from that time or older.
   A good instruction latency/throughput table, covering all the
   relevant processors, can be found at:
        http://www.agner.org/optimize/instruction_tables.pdf

   Enabling this optimization is controlled by the PSHUFD_IS_FAST
   define in "pixman-sse2.c".

3. One use of the PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in
   the older code has also been replaced by its PUNPCKLQDQ equivalent
   (_mm_unpacklo_epi64 intrinsic) in the PSHUFD_IS_FAST=0 configuration.
   The PUNPCKLQDQ instruction is usually faster on older processors,
   but has some side effects (instead of fully overwriting the
   destination register like PSHUFD does, it retains half of the
   original value, which may inhibit some compiler optimizations).

Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on
an x86-64 system and default optimizations. The results are in MPix/s:

====== Intel Core2 T7300 (2GHz) ======

old:                     src_8888_8888 =  L1: 128.69  L2: 125.07  M:124.86
                        over_8888_8888 =  L1:  83.19  L2:  81.73  M: 80.63
                      over_8888_n_8888 =  L1:  79.56  L2:  78.61  M: 77.85
                      over_8888_8_8888 =  L1:  77.15  L2:  75.79  M: 74.63

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 168.67  L2: 163.26  M:162.44
                        over_8888_8888 =  L1: 102.91  L2: 100.43  M: 99.01
                      over_8888_n_8888 =  L1:  97.40  L2:  95.64  M: 94.24
                      over_8888_8_8888 =  L1:  98.04  L2:  95.83  M: 94.33

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 154.67  L2: 149.16  M:148.48
                        over_8888_8888 =  L1:  95.97  L2:  93.90  M: 91.85
                      over_8888_n_8888 =  L1:  93.18  L2:  91.47  M: 90.15
                      over_8888_8_8888 =  L1:  95.33  L2:  93.32  M: 91.42

====== Intel Core i7 860 (2.8GHz) ======

old:                     src_8888_8888 =  L1: 323.48  L2: 318.86  M:314.81
                        over_8888_8888 =  L1: 187.38  L2: 186.74  M:182.46

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 373.06  L2: 370.94  M:368.32
                        over_8888_8888 =  L1: 217.28  L2: 215.57  M:211.32

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 401.98  L2: 397.65  M:395.61
                        over_8888_8888 =  L1: 218.89  L2: 217.56  M:213.48

The most interesting benchmark is "src_8888_8888" (because this code can
be reused for a generic non-separable SSE2 bilinear fetch iterator).

The results show that PSHUFD instructions are bad for Intel Core2 T7300
(Merom core) and good for Intel Core i7 860 (Nehalem core). Both of these
processors support SSSE3 instructions though, so they are not the primary
targets for SSE2 code. But without any more relevant hardware to test
on, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code
and old processors (until runtime CPU feature detection becomes
clever enough to recognize different microarchitectures).

(Rebased on top of the patch that removes support for 8-bit bilinear
filtering -ssp)

test: safeguard the scaling-bench test against COW

The calloc call in pixman_image_create_bits may still rely on
copy-on-write (http://en.wikipedia.org/wiki/Copy-on-write).
Explicitly initializing the destination image results in
more predictable behaviour.

V2:
- allocate a 16-byte aligned buffer with an aligned stride instead
of delegating this to pixman_image_create_bits
- use memset for the allocated buffer instead of a pixman solid fill
- repeat tests 3 times and select the best results in order to filter
out even more measurement noise
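
The V2 approach can be sketched roughly as follows. This is not the
scaling-bench code itself: the helper names (aligned_stride,
alloc_touched_buffer, best_of) are hypothetical, the 0xCC fill value
and 4 bytes per pixel are assumptions, and C11 aligned_alloc plus
clock() stand in for whatever allocator and timer the test really uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Round a row size in bytes up to a multiple of 16. */
static size_t aligned_stride(size_t width, size_t bytes_per_pixel)
{
    return (width * bytes_per_pixel + 15) & ~(size_t)15;
}

/* Allocate a 16-byte aligned 32 bpp buffer and write to every byte,
 * so that all pages are really backed by memory before any timing
 * starts (no lazily mapped copy-on-write zero pages). */
static uint32_t *alloc_touched_buffer(size_t width, size_t height,
                                      size_t *stride_out)
{
    size_t stride = aligned_stride(width, 4);
    uint32_t *bits = aligned_alloc(16, stride * height);

    if (!bits)
        return NULL;
    memset(bits, 0xCC, stride * height);  /* fill value is an assumption */
    *stride_out = stride;
    return bits;
}

/* Run op(arg) several times and return the shortest CPU time in
 * seconds, which filters out part of the measurement noise. */
static double best_of(int repeats, void (*op)(void *), void *arg)
{
    double best = -1.0;
    int i;

    assert(repeats > 0);
    for (i = 0; i < repeats; i++)
    {
        clock_t t0 = clock();
        double dt;

        op(arg);
        dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

The returned buffer and stride could then be handed to
pixman_image_create_bits as caller-owned bits, and the operation under
test timed with best_of(3, ...).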

Drop support for 8-bit precision in bilinear filtering

The default has been 7-bit for a while now, and the quality
improvement with 8-bit precision is not enough to justify keeping the
code around as a compile-time option.