platform/upstream/pixman.git
8 years ago  More general BILINEAR=>NEAREST reduction
Søren Sandmann Pedersen [Wed, 31 Aug 2016 05:03:03 +0000 (22:03 -0700)]
More general BILINEAR=>NEAREST reduction

Generalize and simplify the code that reduces BILINEAR to NEAREST so
that the reduction happens for all affine transformations where
t00...t12 are integers and (t00 + t01) and (t10 + t11) are both
odd. This is a sufficient condition for the resulting transformed
coordinates to be exactly at the center of a pixel so that BILINEAR
becomes identical to NEAREST.
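
A minimal sketch of that reduction test, using a stand-in 16.16 fixed-point
type rather than pixman's own pixman_fixed_t/pixman_transform_t (the helper
name and macros are illustrative, not the code added by this commit):

    #include <stdint.h>

    typedef int32_t fixed_16_16;                  /* stand-in for pixman_fixed_t */
    #define FIXED_FRAC(f)    ((f) & 0xffff)       /* fractional part             */
    #define FIXED_TO_INT(f)  ((int) ((f) >> 16))  /* integer part                */

    /* Returns non-zero when an affine matrix m (row-major, 16.16 fixed point)
     * meets the condition above: t00..t12 are integers and (t00 + t01) and
     * (t10 + t11) are both odd, so every sample lands on a pixel center. */
    static int
    bilinear_reduces_to_nearest (const fixed_16_16 m[3][3])
    {
        int i, j;

        for (i = 0; i < 2; i++)
            for (j = 0; j < 3; j++)
                if (FIXED_FRAC (m[i][j]) != 0)
                    return 0;

        return (FIXED_TO_INT (m[0][0] + m[0][1]) & 1) &&
               (FIXED_TO_INT (m[1][0] + m[1][1]) & 1);
    }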

V2: Address some comments by Bill Spitzak

Signed-off-by: Søren Sandmann <soren.sandmann@gmail.com>
Reviewed-by: Bill Spitzak <spitzak@gmail.com>
8 years ago  Add new test of filter reduction from BILINEAR to NEAREST
Søren Sandmann Pedersen [Wed, 31 Aug 2016 05:03:02 +0000 (22:03 -0700)]
Add new test of filter reduction from BILINEAR to NEAREST

This new test exercises a set of bilinear downscalings, many of which
have a transformation such that the BILINEAR filter can be reduced to
NEAREST (and many of which don't).

A CRC32 is computed for all the resulting images and compared to a
known-good value for both 4-bit and 7-bit interpolation.

V2: Remove leftover comment, some minor formatting fixes, use a
timestamp as the PRNG seed.

Signed-off-by: Søren Sandmann <soren.sandmann@gmail.com>
Reviewed-by: Bill Spitzak <spitzak@gmail.com>
8 years ago  pixman-fast-path.c: Pick NEAREST affine fast paths before BILINEAR ones
Søren Sandmann Pedersen [Wed, 31 Aug 2016 05:03:01 +0000 (22:03 -0700)]
pixman-fast-path.c: Pick NEAREST affine fast paths before BILINEAR ones

When a BILINEAR filter is reduced to NEAREST, it is possible for both
types of fast paths to run; in this case, the NEAREST ones should be
preferred as that is the simpler filter.

Signed-off-by: Soren Sandmann <soren.sandmann@gmail.com>
Reviewed-by: Bill Spitzak <spitzak@gmail.com>
8 years ago  pixman-private: include <float.h> only in C code
Thomas Petazzoni [Sun, 17 Jan 2016 14:22:50 +0000 (15:22 +0100)]
pixman-private: include <float.h> only in C code

<float.h> is included unconditionally by pixman-private.h, which in
turn gets included by assembler files. Unfortunately, with certain C
libraries (like the musl C library), <float.h> cannot be included in
assembler files:

  CCLD     libpixman-arm-simd.la
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h: Assembler messages:
/home/test/buildroot/output/host/usr/arm-buildroot-linux-musleabihf/sysroot/usr/include/float.h:8: Error: bad instruction `int __flt_rounds(void)'

It turns out, however, that <float.h> is not needed by assembly files,
so we move its inclusion inside the #ifndef __ASSEMBLER__ condition,
which solves the problem.
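
The resulting shape of pixman-private.h is roughly (a sketch, not the
verbatim hunk):

    /* pixman-private.h (sketch) */
    #ifndef __ASSEMBLER__

    #include <float.h>

    /* ... the rest of the C-only declarations ... */

    #endif /* __ASSEMBLER__ */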

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
8 years ago  build: Distinguish SKIP and FAIL on Win32
Andrea Canciani [Wed, 23 Dec 2015 22:22:02 +0000 (23:22 +0100)]
build: Distinguish SKIP and FAIL on Win32

The `check` target in test/Makefile.win32 assumed that any non-0 exit
code from the tests was an error, but the testsuite is currently using
77 as a SKIP exit code (based on the convention used in autotools).

Fixes fence-image-self-test and cover-test (now reported as SKIP).
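
For reference, the test-side half of that convention looks like this (a
hedged sketch, not pixman's actual test code): exit code 77 means SKIP and
must not be counted as FAIL by the `check` target.

    #include <stdio.h>

    int
    main (void)
    {
        int supported = 0;   /* e.g. no fenced memory on this platform */

        if (!supported)
        {
            printf ("prerequisite missing, skipping\n");
            return 77;       /* SKIP, per the autotools convention */
        }

        /* ... the real test would run here ... */
        return 0;            /* PASS; any other non-zero value is FAIL */
    }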

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  build: Use `del` instead of `rm` on `cmd.exe` shells
Simon Richter [Tue, 22 Dec 2015 21:45:33 +0000 (22:45 +0100)]
build: Use `del` instead of `rm` on `cmd.exe` shells

The `rm` command is not usually available when running on Win32 in a
`cmd.exe` shell. Instead, the shell provides the `del` builtin, which
has somewhat more limited wildcard expansion and error handling.

This makes all of the Makefile targets work on Win32 both using
`cmd.exe` and using the MSYS environment.

Signed-off-by: Simon Richter <Simon.Richter@hogyros.de>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  build: Do not use `mkdir -p` on Windows
Andrea Canciani [Tue, 22 Dec 2015 21:46:05 +0000 (22:46 +0100)]
build: Do not use `mkdir -p` on Windows

When the build is performed using `cmd.exe` as the shell, the `mkdir`
command does not support the `-p` flag. The ability to create multiple
nested folders is not used, hence it can easily be replaced by creating
the directory only if it does not exist.

This makes the build work on the `cmd.exe` shell, except for the
`clean` targets.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  build: Avoid phony `pixman` target in test/Makefile.win32
Andrea Canciani [Wed, 23 Dec 2015 10:15:59 +0000 (11:15 +0100)]
build: Avoid phony `pixman` target in test/Makefile.win32

Instead of explicitly depending on "pixman" for the "all" and "check"
targets, rely on the dependency on the .lib file.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  build: Remove use of BUILT_SOURCES from Makefile.win32
Andrea Canciani [Tue, 22 Dec 2015 20:53:14 +0000 (21:53 +0100)]
build: Remove use of BUILT_SOURCES from Makefile.win32

Since commit 3d81d89c292058522cce91338028d9b4c4a23c24, BUILT_SOURCES is
no longer used, but it was unintentionally left in the Win32 Makefiles.

Signed-off-by: Andrea Canciani <ranma42@gmail.com>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  Post 0.34 branch creation version bump to 0.35.1
Oded Gabbay [Wed, 23 Dec 2015 08:46:40 +0000 (10:46 +0200)]
Post 0.34 branch creation version bump to 0.35.1

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  Post-release version bump to 0.33.7
Oded Gabbay [Tue, 22 Dec 2015 13:55:32 +0000 (15:55 +0200)]
Post-release version bump to 0.33.7

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  Pre-release version bump to 0.33.6 (tag: pixman-0.33.6)
Oded Gabbay [Tue, 22 Dec 2015 13:30:10 +0000 (15:30 +0200)]
Pre-release version bump to 0.33.6

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  configure.ac: fix test for SSE2 & SSSE3 assembler support
Oded Gabbay [Tue, 15 Dec 2015 12:53:18 +0000 (14:53 +0200)]
configure.ac: fix test for SSE2 & SSSE3 assembler support

This patch modifies the SSE2 & SSSE3 tests in configure.ac to use a
global variable to initialize vector variables. In addition, we now
return the value of the computation instead of 0.

This is done so that gcc 4.9 (and lower) won't optimize away the SSE
instructions (when using -O1 and higher); otherwise the configure test
might incorrectly pass even though the assembler doesn't support the
SSE instructions (the test would pass because the compiler does support
the intrinsics).

v2: instead of using volatile, use a global variable as input
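
The compiled test program has roughly this shape (an approximation, not the
exact configure.ac fragment): the global 'param' prevents constant folding,
and returning the computed value forces the compiler to emit, and the
assembler to accept, the SSE2 instructions.

    #include <emmintrin.h>

    int param;   /* global input: the compiler cannot precompute the vectors */

    int
    main (void)
    {
        __m128i a = _mm_set1_epi32 (param);
        __m128i b = _mm_set1_epi32 (param + 1);
        __m128i c = _mm_xor_si128 (a, b);

        /* return the result of the computation instead of a constant 0 */
        return _mm_cvtsi128_si32 (c);
    }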

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
8 years ago  mmx: Improve detection of support for "K" constraint
Andrea Canciani [Sun, 11 Oct 2015 07:45:57 +0000 (09:45 +0200)]
mmx: Improve detection of support for "K" constraint

Older versions of clang emitted an error on the "K" constraint, but at
least since version 3.7 it is supported. Just like gcc, this
constraint is only allowed for constants, but apparently clang
requires them to be known before inlining.

Using the macro definition _mm_shuffle_pi16(A, N) ensures that the "K"
constraint is always applied to a literal constant, independently of
compiler optimizations, and allows building pixman-mmx on modern clang.
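
An approximate sketch of such a macro (the name and exact casts are
illustrative, not the verbatim pixman-mmx.c definition): because the
immediate is substituted textually, the "K" constraint always sees a literal
constant, whatever the optimizer or inliner does.

    #include <stdint.h>
    #include <mmintrin.h>

    #define my_shuffle_pi16(A, N)                              \
        (__extension__                                         \
         ({                                                    \
             __m64 ret;                                        \
             __asm__ ("pshufw %2, %1, %0"                      \
                      : "=y" (ret)                             \
                      : "y" (A), "K" ((const int8_t) (N)));    \
             ret;                                              \
         }))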

Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrea Canciani <ranma42@gmail.com>
8 years ago  Revert "mmx: Use MMX2 intrinsics from xmmintrin.h directly."
Matt Turner [Wed, 18 Nov 2015 22:16:24 +0000 (14:16 -0800)]
Revert "mmx: Use MMX2 intrinsics from xmmintrin.h directly."

This reverts commit 7de61d8d14e84623b6fa46506eb74f938287f536.

Newer versions of gcc allow inclusion of xmmintrin.h without -msse, but
still won't allow usage of the intrinsics.

Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=564024

9 years ago  Post-release version bump to 0.33.5
Oded Gabbay [Fri, 23 Oct 2015 15:33:55 +0000 (18:33 +0300)]
Post-release version bump to 0.33.5

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  Pre-release version bump to 0.33.4 (tag: pixman-0.33.4)
Oded Gabbay [Fri, 23 Oct 2015 14:58:49 +0000 (17:58 +0300)]
Pre-release version bump to 0.33.4

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  test: Fix fence-image-self-test on Mac
Andrea Canciani [Tue, 13 Oct 2015 11:35:59 +0000 (13:35 +0200)]
test: Fix fence-image-self-test on Mac

On Mac OS X, according to the mprotect() manpage, "When a program
violates the protections of a page, it gets a SIGBUS or SIGSEGV
signal." However, fence-image-self-test was only accepting SIGSEGV as
notification of an invalid access.
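
A sketch of the corresponding fix on the test side (names and structure are
illustrative): install the same handler for both signals so either one
counts as the expected fault.

    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>

    static void
    on_fault (int sig)
    {
        (void) sig;
        _Exit (0);   /* the deliberately provoked fault was detected */
    }

    static void
    install_fault_handlers (void)
    {
        struct sigaction sa;

        memset (&sa, 0, sizeof sa);
        sa.sa_handler = on_fault;

        sigaction (SIGSEGV, &sa, NULL);   /* what Linux delivers               */
        sigaction (SIGBUS,  &sa, NULL);   /* what Mac OS X may deliver instead */
    }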

Fixes fence-image-self-test

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  mmx: Use MMX2 intrinsics from xmmintrin.h directly.
Matt Turner [Sun, 11 Oct 2015 21:44:46 +0000 (14:44 -0700)]
mmx: Use MMX2 intrinsics from xmmintrin.h directly.

We had lots of hacks to handle the inability to include xmmintrin.h
without compiling with -msse (lest SSE instructions be used in
pixman-mmx.c). Some recent version of gcc relaxed this restriction.

Change configure.ac to test that xmmintrin.h can be included and that we
can use some intrinsics from it, and remove the work-around code from
pixman-mmx.c.

Evidently this allows gcc 4.9.3 to optimize better as well:

   text    data     bss     dec     hex filename
 657078   30848     680  688606   a81de libpixman-1.so.0.33.3 before
 656710   30848     680  688238   a806e libpixman-1.so.0.33.3 after

Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Tested-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Signed-off-by: Matt Turner <mattst88@gmail.com>
9 years ago  vmx: implement fast path vmx_composite_over_n_8888
Siarhei Siamashka [Fri, 4 Sep 2015 12:39:00 +0000 (15:39 +0300)]
vmx: implement fast path vmx_composite_over_n_8888

Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
Gentoo ppc (32-bit userland) gave the following results:

before:  over_n_8888 =  L1: 147.47  L2: 205.86  M:121.07
after:   over_n_8888 =  L1: 287.27  L2: 261.09  M:133.48

Cairo non-trimmed benchmarks on POWER8, 3.4GHz 8 Cores:

ocitysmap          659.69  -> 611.71   :  1.08x speedup
xfce4-terminal-a1  2725.22 -> 2547.47  :  1.07x speedup

Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  affine-bench: remove 8e margin from COVER area
Ben Avison [Fri, 4 Sep 2015 02:09:20 +0000 (03:09 +0100)]
affine-bench: remove 8e margin from COVER area

Patch "Remove the 8e extra safety margin in COVER_CLIP analysis" reduced
the required image area for setting the COVER flags in
pixman.c:analyze_extent(). Do the same reduction in affine-bench.

Leaving the old calculations in place would be very confusing for anyone
reading the code.

Also add a comment that explains how affine-bench wants to hit the COVER
paths. This explains why the intricate extent calculations are copied
from pixman.c.

[Pekka: split patch, change comments, write commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  Remove the 8e extra safety margin in COVER_CLIP analysis
Ben Avison [Fri, 4 Sep 2015 02:09:20 +0000 (03:09 +0100)]
Remove the 8e extra safety margin in COVER_CLIP analysis

As discussed in
http://lists.freedesktop.org/archives/pixman/2015-August/003905.html

the 8 * pixman_fixed_e (8e) adjustment which was applied to the transformed
coordinates is a legacy of rounding errors which used to occur in old
versions of Pixman, but which no longer apply. For any affine transform,
you are now guaranteed to get the same result by transforming the upper
coordinate as by transforming the lower coordinate and adding (size-1)
steps of the increment in source coordinate space. No projective
transform routines use the COVER_CLIP flags, so they cannot be affected.

Proof by Siarhei Siamashka:

Let's take a look at the following affine transformation matrix (with 16.16
fixed point values) and two vectors:

         | a   b     c    |
M      = | d   e     f    |
         | 0   0  0x10000 |

         |  x_dst  |
P     =  |  y_dst  |
         | 0x10000 |

         | 0x10000 |
ONE_X  = |    0    |
         |    0    |

The current matrix multiplication code does the following calculations:

             | (a * x_dst + b * y_dst + 0x8000) / 0x10000 + c |
    M * P =  | (d * x_dst + e * y_dst + 0x8000) / 0x10000 + f |
             |                   0x10000                      |

These calculations are not perfectly exact and we may get rounding errors
because the integer coordinates are adjusted by 0.5 (or 0x8000 in the
16.16 fixed point format) before doing matrix multiplication. For
example, if the 'a' coefficient is an odd number and 'b' is zero,
then we are losing some of the least significant bits when dividing by
0x10000.

So we need to strictly prove that the following expression is always
true even though we have to deal with rounding:

                                          | a |
    M * (P + ONE_X) - M * P = M * ONE_X = | d |
                                          | 0 |

or

   ((a * (x_dst + 0x10000) + b * y_dst + 0x8000) / 0x10000 + c)
  -
   ((a * x_dst             + b * y_dst + 0x8000) / 0x10000 + c)
  =
    a

It's easy to see that this is equivalent to

    a + ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
      - ((a * x_dst + b * y_dst + 0x8000) / 0x10000 + c)
  =
    a

Which means that stepping exactly by one pixel horizontally in the
destination image space (advancing 'x_dst' by 0x10000) is the same as
changing the transformed 'x_src' coordinate in the source image space
exactly by 'a'. The same applies to the vertical direction too.
Repeating these steps, we can reach any pixel in the source image
space and get exactly the same fixed point coordinates as doing
matrix multiplications per each pixel.
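
The equivalence is easy to check numerically; here is a small self-contained
program using the same 16.16 fixed-point arithmetic (the coefficient values
are arbitrary examples, not taken from pixman):

    #include <stdint.h>
    #include <stdio.h>

    /* x component of M * P, with the same +0x8000 rounding as above */
    static int64_t
    transform_x (int64_t a, int64_t b, int64_t c, int64_t x_dst, int64_t y_dst)
    {
        return (a * x_dst + b * y_dst + 0x8000) / 0x10000 + c;
    }

    int
    main (void)
    {
        int64_t a = 0x18001, b = 0x00005, c = 0x12345;  /* arbitrary 16.16 values */
        int64_t x = 7 * 0x10000, y = 3 * 0x10000;

        int64_t lo = transform_x (a, b, c, x, y);
        int64_t hi = transform_x (a, b, c, x + 0x10000, y);

        /* one destination pixel step changes the source coordinate by exactly 'a' */
        printf ("%s\n", hi - lo == a ? "exact" : "rounding error");
        return 0;
    }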

By the way, the older matrix multiplication implementation, which was
relying on less accurate calculations with three intermediate roundings
"((a + 0x8000) >> 16) + ((b + 0x8000) >> 16) + ((c + 0x8000) >> 16)",
also has the same properties. However, reverting
    http://cgit.freedesktop.org/pixman/commit/?id=ed39992564beefe6b12f81e842caba11aff98a9c
and applying this "Remove the 8e extra safety margin in COVER_CLIP
analysis" patch makes the cover test fail. The real reason why it fails
is that the old pixman code was using "pixman_transform_point_3d()"
function
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n49
for getting the transformed coordinate of the top left corner pixel
in the image scaling code, but at the same time using a different
"pixman_transform_point()" function
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-matrix.c?id=pixman-0.28.2#n82
in the extents calculation code for setting the cover flag. And these
functions did the intermediate rounding differently. That's why the 8e
safety margin was needed.

** proof ends

However, for COVER_CLIP_NEAREST, the actual margins added were not 8e.
Because the half-way cases round down, that is, coordinate 0 hits pixel
index -1 while coordinate e hits pixel index 0, the extra safety margins
were actually 7e to the left and up, and 9e to the right and down. This
patch removes the 7e and 9e margins and restores the -e adjustment
required for NEAREST sampling in Pixman. For reference, see
pixman/rounding.txt.

For COVER_CLIP_BILINEAR, the margins were exactly 8e as there are no
additional offsets to be restored, so simply removing the 8e additions
is enough.

Proof:

All implementations must give the same numerical results as
bits_image_fetch_pixel_nearest() / bits_image_fetch_pixel_bilinear().

The former does
    int x0 = pixman_fixed_to_int (x - pixman_fixed_e);
which maps directly to the new test for the nearest flag, when you consider
that x0 must fall in the interval [0,width).

The latter does
    x1 = x - pixman_fixed_1 / 2;
    x1 = pixman_fixed_to_int (x1);
    x2 = x1 + 1;
When you write a COVER path, you take advantage of the assumption that
both x1 and x2 fall in the interval [0, width).

As samplers are allowed to fetch the pixel at x2 unconditionally, we
require
    x1 >= 0
    x2 < width
so
    x - pixman_fixed_1 / 2 >= 0
    x - pixman_fixed_1 / 2 + pixman_fixed_1 < width * pixman_fixed_1
so
    pixman_fixed_to_int (x - pixman_fixed_1 / 2) >= 0
    pixman_fixed_to_int (x + pixman_fixed_1 / 2) < width
which matches the source code lines for the bilinear case, once you delete
the lines that add the 8e margin.
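
Putting the two conditions side by side as code (a self-contained sketch
with a stand-in fixed-point type and illustrative names, not pixman.c's
internal identifiers); the transformed extents are [x1, x2] x [y1, y2] in
source space and the image is width x height:

    #include <stdint.h>

    typedef int32_t fixed_16_16;
    #define FIXED_E          ((fixed_16_16) 1)        /* pixman_fixed_e     */
    #define FIXED_HALF       ((fixed_16_16) 0x8000)   /* pixman_fixed_1 / 2 */
    #define FIXED_TO_INT(f)  ((int) ((f) >> 16))

    static int
    covers_nearest (fixed_16_16 x1, fixed_16_16 y1,
                    fixed_16_16 x2, fixed_16_16 y2, int width, int height)
    {
        return FIXED_TO_INT (x1 - FIXED_E) >= 0 &&
               FIXED_TO_INT (y1 - FIXED_E) >= 0 &&
               FIXED_TO_INT (x2 - FIXED_E) < width &&
               FIXED_TO_INT (y2 - FIXED_E) < height;
    }

    static int
    covers_bilinear (fixed_16_16 x1, fixed_16_16 y1,
                     fixed_16_16 x2, fixed_16_16 y2, int width, int height)
    {
        return FIXED_TO_INT (x1 - FIXED_HALF) >= 0 &&
               FIXED_TO_INT (y1 - FIXED_HALF) >= 0 &&
               FIXED_TO_INT (x2 + FIXED_HALF) < width &&
               FIXED_TO_INT (y2 + FIXED_HALF) < height;
    }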

Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: adjusted commit message, left affine-bench changes for another patch]
[Pekka: add commit message parts from Siarhei]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  pixman-general: Tighten up calculation of temporary buffer sizes
Ben Avison [Tue, 22 Sep 2015 11:43:25 +0000 (12:43 +0100)]
pixman-general: Tighten up calculation of temporary buffer sizes

Each of the alignment adjustments can only add a maximum of 15 bytes to
the space requirement. This permits some edge cases to use the stack
buffer where previously the code would have deduced that a heap buffer
was required.
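
A sketch of the sizing logic this implies (the names and stack-buffer size
are illustrative, not pixman-general.c verbatim): three scanline buffers,
each padded by at most 15 bytes to reach 16-byte alignment.

    #include <stdint.h>
    #include <stdlib.h>

    #define ALIGN_PAD 15   /* worst case to reach 16-byte alignment */

    static uint8_t *
    get_scanline_buffer (uint8_t *stack_buffer, size_t stack_size,
                         size_t width, size_t bytes_per_pixel)
    {
        /* src, mask and dest scanlines, each individually aligned */
        size_t needed = width * bytes_per_pixel * 3 + ALIGN_PAD * 3;

        if (needed <= stack_size)
            return stack_buffer;

        return malloc (needed);  /* caller frees it if it is not stack_buffer */
    }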

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  pixman-general: Fix stack related pointer arithmetic overflow
Siarhei Siamashka [Tue, 22 Sep 2015 01:25:40 +0000 (04:25 +0300)]
pixman-general: Fix stack related pointer arithmetic overflow

As https://bugs.freedesktop.org/show_bug.cgi?id=92027#c6 explains,
the stack is allocated at the very top of the process address space
in some configurations (32-bit x86 systems with ASLR disabled).
Careless computations done with the 'dest_buffer' pointer may then
overflow, defeating the buffer upper-limit check.

The problem can be reproduced using the 'stress-test' program,
which segfaults when executed via setarch:

    export CFLAGS="-O2 -m32" && ./autogen.sh
    ./configure --disable-libpng --disable-gtk && make
    setarch i686 -R test/stress-test

This patch introduces the required corrections. The extra check
for negative 'width' may be redundant (the invalid 'width' value
is not supposed to reach here), but it's better to play safe
when dealing with the buffers allocated on stack.
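
The shape of the safer check (a sketch using the names introduced here, not
the verbatim pixman-general.c code): compare the requested width against the
space remaining instead of forming a pointer that may wrap around.

    #include <stddef.h>
    #include <stdint.h>

    /* 'pos' and 'end' delimit the remaining buffer space, with pos <= end. */
    static int
    fits_in_buffer (const uint8_t *pos, const uint8_t *end,
                    int width, size_t bytes_per_pixel)
    {
        if (width <= 0)   /* defensive: reject invalid widths */
            return 0;

        /* no 'pos + width * bytes_per_pixel', so nothing can overflow */
        return (size_t) (end - pos) / bytes_per_pixel >= (size_t) width;
    }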

Reported-by: Ludovic Courtès <ludo@gnu.org>
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: soren.sandmann@gmail.com
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  test: add a check for FE_DIVBYZERO
Thomas Petazzoni [Thu, 17 Sep 2015 13:43:27 +0000 (15:43 +0200)]
test: add a check for FE_DIVBYZERO

Some architectures, such as Microblaze and Nios2, currently do not
implement FE_DIVBYZERO, even though they have <fenv.h> and
feenableexcept(). This commit adds a configure.ac check to verify
whether FE_DIVBYZERO is defined or not, and if not, disables the
problematic code in test/utils.c.
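
The guarded call then looks roughly like this (HAVE_FEDIVBYZERO stands in
for whatever symbol the configure check actually defines; treat it as an
assumption):

    #define _GNU_SOURCE   /* for feenableexcept() on glibc */
    #include <fenv.h>

    static void
    enable_divbyzero (void)
    {
    #ifdef HAVE_FEDIVBYZERO
        /* only reference FE_DIVBYZERO on platforms that define it */
        feenableexcept (FE_DIVBYZERO);
    #endif
    }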

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  vmx: Remove unused expensive functions
Oded Gabbay [Sun, 6 Sep 2015 08:45:20 +0000 (11:45 +0300)]
vmx: Remove unused expensive functions

Now that we have replaced the expensive functions with better-performing
alternatives, remove them so they will not be used again.

Running Cairo benchmark on trimmed traces gave the following results:

POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.

Speedups
========
t-firefox-scrolling     1232.30 -> 1096.55 :  1.12x
t-gnome-terminal-vim    613.86  -> 553.10  :  1.11x
t-evolution             405.54  -> 371.02  :  1.09x
t-firefox-talos-gfx     919.31  -> 862.27  :  1.07x
t-gvim                  653.02  -> 616.85  :  1.06x
t-firefox-canvas-alpha  941.29  -> 890.42  :  1.06x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path vmx_composite_over_n_8_8888
Oded Gabbay [Sun, 28 Jun 2015 10:17:41 +0000 (13:17 +0300)]
vmx: implement fast path vmx_composite_over_n_8_8888

POWER8, 8 cores, 3.4GHz, RHEL 7.2 ppc64le.

reference memcpy speed = 25008.9MB/s (6252.2MP/s for 32bpp fills)

                Before         After           Change
              ---------------------------------------------
L1              91.32          182.84         +100.22%
L2              94.94          182.83         +92.57%
M               95.55          181.51         +89.96%
HT              88.96          162.09         +82.21%
VT              87.4           168.35         +92.62%
R               83.37          146.23         +75.40%
RT              66.4           91.5           +37.80%
Kops/s          683            859            +25.77%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: optimize vmx_composite_over_n_8888_8888_ca
Oded Gabbay [Sun, 6 Sep 2015 08:46:15 +0000 (11:46 +0300)]
vmx: optimize vmx_composite_over_n_8888_8888_ca

This patch optimizes vmx_composite_over_n_8888_8888_ca by removing use
of expand_alpha_1x128, unpack/pack and in_over_2x128 in favor of
splat_alpha, in_over and MUL/ADD macros from pixman_combine32.h.

Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.2 ppc64le gave the following results:

reference memcpy speed = 23475.4MB/s (5868.8MP/s for 32bpp fills)

                Before          After           Change
              --------------------------------------------
L1              244.97          474.05         +93.51%
L2              243.74          473.05         +94.08%
M               243.29          467.16         +92.02%
HT              144.03          252.79         +75.51%
VT              174.24          279.03         +60.14%
R               109.86          149.98         +36.52%
RT              47.96           53.18          +10.88%
Kops/s          524             576            +9.92%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: optimize scaled_nearest_scanline_vmx_8888_8888_OVER
Oded Gabbay [Sun, 6 Sep 2015 07:58:30 +0000 (10:58 +0300)]
vmx: optimize scaled_nearest_scanline_vmx_8888_8888_OVER

This patch optimizes scaled_nearest_scanline_vmx_8888_8888_OVER and all
the functions it calls (combine1, combine4 and
core_combine_over_u_pixel_vmx).

The optimization is done by removing use of expand_alpha_1x128 and
expand_alpha_2x128 in favor of splat_alpha and MUL/ADD macros from
pixman_combine32.h.

Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.2 ppc64le gave the following results:

reference memcpy speed = 24847.3MB/s (6211.8MP/s for 32bpp fills)

                Before          After           Change
              --------------------------------------------
L1              182.05          210.22         +15.47%
L2              180.6           208.92         +15.68%
M               180.52          208.22         +15.34%
HT              130.17          178.97         +37.49%
VT              145.82          184.22         +26.33%
R               104.51          129.38         +23.80%
RT              48.3            61.54          +27.41%
Kops/s          430             504            +17.21%

v2: Check *pm is not NULL before dereferencing it in combine1()

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  armv6: enable over_n_8888
Pekka Paalanen [Mon, 7 Sep 2015 11:40:49 +0000 (14:40 +0300)]
armv6: enable over_n_8888

Enable the fast path added in the previous patch by moving the lookup
table entries to their proper locations.

Lowlevel-blt-bench benchmark statistics with 30 iterations, showing the
effect of adding this one patch on top of
"armv6: Add over_n_8888 fast path (disabled)", which was applied on
fd595692941f3d9ddea8934462bd1d18aed07c65.

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    12.5   0.04     45.2   0.10    100.00%    +263.1%
L2    11.1   0.02     43.2   0.03    100.00%    +289.3%
M      9.4   0.00     42.4   0.02    100.00%    +351.7%
HT     8.5   0.02     25.4   0.10    100.00%    +198.8%
VT     8.4   0.02     22.3   0.07    100.00%    +167.0%
R      8.2   0.02     23.1   0.09    100.00%    +183.6%
RT     5.4   0.05     11.4   0.21    100.00%    +110.3%

At most 3 outliers rejected per test per set.

Iterating here means that lowlevel-blt-bench was executed 30 times, and
the statistics above were computed from the output.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  armv6: Add over_n_8888 fast path (disabled)
Ben Avison [Mon, 7 Sep 2015 11:40:48 +0000 (14:40 +0300)]
armv6: Add over_n_8888 fast path (disabled)

This new fast path is initially disabled by putting the entries in the
lookup table after the sentinel. The compiler cannot tell the new code
is not used, so it cannot eliminate the code. Also the lookup table size
will include the new fast path. When the follow-up patch then enables
the new fast path, the binary layout (alignments, size, etc.) will stay
the same compared to the disabled case.
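
A simplified, self-contained stand-in for the mechanism (not pixman's real
fast-path types): the lookup stops at the first sentinel, so an entry placed
after it is compiled and counted in the table size but can never be selected.

    #include <stdio.h>

    typedef struct { int op; const char *name; } fast_path_t;

    enum { OP_NONE = 0, OP_OVER, OP_ADD };

    static const fast_path_t fast_paths[] =
    {
        { OP_ADD,  "add_8_8"     },
        { OP_NONE, NULL          },   /* sentinel: the search ends here       */
        { OP_OVER, "over_n_8888" },   /* present in the binary, never matched */
    };

    static const fast_path_t *
    lookup (int op)
    {
        const fast_path_t *p;

        for (p = fast_paths; p->op != OP_NONE; p++)
            if (p->op == op)
                return p;
        return NULL;
    }

    int
    main (void)
    {
        printf ("over_n_8888 is %s\n", lookup (OP_OVER) ? "enabled" : "disabled");
        return 0;
    }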

Keeping the binary layout identical is important for benchmarking on
Raspberry Pi 1. The addresses at which functions are loaded will have a
significant impact on benchmark results, causing unexpected performance
changes. Keeping all function addresses the same across the patch
enabling a new fast path improves the reliability of benchmarks.

Benchmark results are included in the patch enabling this fast path.

[Pekka: disabled the fast path, commit message]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  test: Add cover-test v5
Ben Avison [Wed, 2 Sep 2015 19:35:59 +0000 (20:35 +0100)]
test: Add cover-test v5

This test aims to verify both numerical correctness and the honouring of
array bounds for scaled plots (both nearest-neighbour and bilinear) at or
close to the boundary conditions for applicability of "cover" type fast paths
and iter fetch routines.

It has a secondary purpose: by setting the env var EXACT (to any value) it
will only test plots that are exactly on the boundary condition. This makes
it possible to ensure that "cover" routines are being used to the maximum,
although this requires the use of a debugger or code instrumentation to
verify.

Changes in v4:

  Check the fence page size and skip the test if it is too large. Since
  we need to deal with pixman_fixed_t coordinates that go beyond the
  real image width, make the page size limit 16 kB. A 32 kB or larger
  page size would cause an a8 image width to be 32k or more, which is no
  longer representable in pixman_fixed_t.

  Use a shorthand variable 'filter' in test_cover().

  Whitespace adjustments.

Changes in v5:

  Skip if fenced memory is not supported. Do you know of any such
  platform?

Signed-off-by: Ben Avison <bavison@riscosopen.org>
[Pekka: changes in v4 and v5]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Acked-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  implementation: add PIXMAN_DISABLE=wholeops
Pekka Paalanen [Tue, 8 Sep 2015 10:35:33 +0000 (13:35 +0300)]
implementation: add PIXMAN_DISABLE=wholeops

Add a new option to PIXMAN_DISABLE: "wholeops". This option disables all
whole-operation fast paths regardless of implementation level, except
the general path (general_composite_rect).

The purpose is to add a debug option that allows us to test optimized
iterator paths specifically. With this, it is possible to see if:
- fast paths mask bugs in iterators
- compare fast paths with iterator paths for performance

The effect was tested on x86_64 by running:
$ PIXMAN_DISABLE='' ./test/lowlevel-blt-bench over_8888_8888
$ PIXMAN_DISABLE='wholeops' ./test/lowlevel-blt-bench over_8888_8888

In the first case time is spent in sse2_composite_over_8888_8888(), and
in the latter in sse2_combine_over_u().

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  utils.[ch]: add fence_get_page_size()
Pekka Paalanen [Tue, 8 Sep 2015 06:36:48 +0000 (09:36 +0300)]
utils.[ch]: add fence_get_page_size()

Add a function to get the page size used for memory fence purposes, and
use it everywhere where getpagesize() was used.

This offers a single point in code to override the page size, in case
one wants to experiment how the tests work with a higher page size than
what the developer's machine has.

This also offers a clean API, without adding #ifdefs, to tests for
checking the page size.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  utils.c: fix fallback code for fence_image_create_bits()
Pekka Paalanen [Tue, 8 Sep 2015 06:20:46 +0000 (09:20 +0300)]
utils.c: fix fallback code for fence_image_create_bits()

Used a wrong variable name, causing:
/home/pq/git/pixman/demos/../test/utils.c: In function ‘fence_image_create_bits’:
/home/pq/git/pixman/demos/../test/utils.c:562:46: error: ‘width’ undeclared (first use in this function)

Use the correct variable.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test: add fence-image-self-test
Pekka Paalanen [Thu, 7 May 2015 14:16:05 +0000 (17:16 +0300)]
test: add fence-image-self-test

Tests that fence_malloc and fence_image_create_bits actually work: that
out-of-bounds and out-of-row (unused stride area) accesses trigger
SIGSEGV.

If fence_malloc is a dummy (FENCE_MALLOC_ACTIVE not defined), this test
is skipped.

Changes in v2:

- check FENCE_MALLOC_ACTIVE value, not whether it is defined
- test that reading bytes near the fence pages does not cause a
  segmentation fault

Changes in v3:

- Do not print progress messages unless VERBOSE environment variable is
  set. Avoid spamming the terminal output of 'make check' on some
  versions of autotools.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  utils.[ch]: add fence_image_create_bits ()
Pekka Paalanen [Thu, 7 May 2015 13:46:01 +0000 (16:46 +0300)]
utils.[ch]: add fence_image_create_bits ()

Useful for detecting out-of-bounds accesses in composite operations.

This will be used by follow-up patches adding new tests.

Changes in v2:

- fix style on fence_image_create_bits args
- add page to stride only if stride_fence
- add comment on the fallback definition about freeing storage

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  utils.[ch]: add FENCE_MALLOC_ACTIVE
Pekka Paalanen [Thu, 7 May 2015 11:21:30 +0000 (14:21 +0300)]
utils.[ch]: add FENCE_MALLOC_ACTIVE

Define a new token to simplify checking whether fence_malloc() actually
can catch out-of-bounds access.

This will be used in the future to skip tests that rely on fence_malloc
checking functionality.

Changes in v2:

- #define FENCE_MALLOC_ACTIVE always, but change its value to help catch
  use of it without including utils.h
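
A hedged sketch of what that looks like (the configure symbols are
stand-ins, and the real definition lives in test/utils.h): the token is
always defined, only its value changes, so it can be tested both by the
preprocessor and as an ordinary C expression.

    #include <stdio.h>

    #if defined (HAVE_MPROTECT) && defined (HAVE_GETPAGESIZE)   /* stand-in checks */
    #define FENCE_MALLOC_ACTIVE 1
    #else
    #define FENCE_MALLOC_ACTIVE 0
    #endif

    int
    main (void)
    {
        if (!FENCE_MALLOC_ACTIVE)
        {
            printf ("fence_malloc cannot catch out-of-bounds accesses here\n");
            return 77;   /* SKIP */
        }

        /* ... fence-based checks would run here ... */
        return 0;
    }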

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  scaling-test: list more details when verbose
Ben Avison [Thu, 20 Aug 2015 12:07:48 +0000 (13:07 +0100)]
scaling-test: list more details when verbose

Add mask details to the output.

[Pekka: redo whitespace and print src,dst,mask x and y.]
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: make extra arguments an error
Pekka Paalanen [Tue, 7 Jul 2015 08:31:20 +0000 (11:31 +0300)]
lowlevel-blt-bench: make extra arguments an error

If a user gives multiple patterns or extra arguments, only the last one
is used as the pattern while the others are silently ignored. This is a
user error silently converted into something possibly unexpected.

In the presence of extra arguments, complain and quit.

Cc: Ben Avison <bavison@riscosopen.org>
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  Post-release version bump to 0.33.3
Oded Gabbay [Sat, 1 Aug 2015 20:01:43 +0000 (23:01 +0300)]
Post-release version bump to 0.33.3

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  Pre-release version bump to 0.33.2 (tag: pixman-0.33.2)
Oded Gabbay [Sat, 1 Aug 2015 19:34:53 +0000 (22:34 +0300)]
Pre-release version bump to 0.33.2

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  vmx: implement fast path iterator vmx_fetch_a8
Oded Gabbay [Wed, 1 Jul 2015 11:34:07 +0000 (14:34 +0300)]
vmx: implement fast path iterator vmx_fetch_a8

No changes were observed when running the Cairo trimmed benchmarks.

Running "lowlevel-blt-bench src_8_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.1 ppc64le gave the following results:

reference memcpy speed = 25197.2MB/s (6299.3MP/s for 32bpp fills)

                Before          After           Change
              --------------------------------------------
L1              965.34          3936           +307.73%
L2              942.99          3436.29        +264.40%
M               902.24          2757.77        +205.66%
HT              448.46          784.99         +75.04%
VT              430.05          819.78         +90.62%
R               412.9           717.04         +73.66%
RT              168.93          220.63         +30.60%
Kops/s          1025            1303           +27.12%

It was benchmarked against commit id e2d211a from pixman/master.

Siarhei Siamashka reported that on PlayStation 3 it shows the following
results:

== before ==

              src_8_8888 =  L1: 194.37  L2: 198.46  M:155.90 (148.35%)
              HT: 59.18  VT: 36.71  R: 38.93  RT: 12.79 ( 106Kops/s)

== after ==

              src_8_8888 =  L1: 373.96  L2: 391.10  M:245.81 (233.88%)
              HT: 80.81  VT: 44.33  R: 48.10  RT: 14.79 ( 122Kops/s)

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path iterator vmx_fetch_x8r8g8b8
Oded Gabbay [Mon, 29 Jun 2015 12:31:02 +0000 (15:31 +0300)]
vmx: implement fast path iterator vmx_fetch_x8r8g8b8

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

cairo trimmed benchmarks :

Speedups
========
t-firefox-asteroids  533.92  -> 489.94 :  1.09x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path scaled nearest vmx_8888_8888_OVER
Oded Gabbay [Sun, 28 Jun 2015 20:25:24 +0000 (23:25 +0300)]
vmx: implement fast path scaled nearest vmx_8888_8888_OVER

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              134.36          181.68          +35.22%
L2              135.07          180.67          +33.76%
M               134.6           180.51          +34.11%
HT              121.77          128.79          +5.76%
VT              120.49          145.07          +20.40%
R               93.83           102.3           +9.03%
RT              50.82           46.93           -7.65%
Kops/s          448             422             -5.80%

cairo trimmed benchmarks :

Speedups
========
t-firefox-asteroids  533.92 -> 497.92 :  1.07x
    t-midori-zoomed  692.98 -> 651.24 :  1.06x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path vmx_composite_src_x888_8888
Oded Gabbay [Sun, 28 Jun 2015 19:23:44 +0000 (22:23 +0300)]
vmx: implement fast path vmx_composite_src_x888_8888

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.
reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              1115.4          5006.49         +348.85%
L2              1112.26         4338.01         +290.02%
M               1110.54         2524.15         +127.29%
HT              745.41          1140.03         +52.94%
VT              749.03          1287.13         +71.84%
R               423.91          547.6           +29.18%
RT              205.79          194.98          -5.25%
Kops/s          1414            1361            -3.75%

cairo trimmed benchmarks :

Speedups
========
t-gnome-system-monitor  1402.62  -> 1212.75 :  1.16x
   t-firefox-asteroids   533.92  ->  474.50 :  1.13x

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path vmx_composite_over_n_8888_8888_ca
Oded Gabbay [Sun, 28 Jun 2015 07:14:20 +0000 (10:14 +0300)]
vmx: implement fast path vmx_composite_over_n_8888_8888_ca

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 8 cores, 3.4GHz, RHEL 7.1 ppc64le.

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              61.92            244.91          +295.53%
L2              62.74            243.3           +287.79%
M               63.03            241.94          +283.85%
HT              59.91            144.22          +140.73%
VT              59.4             174.39          +193.59%
R               53.6             111.37          +107.78%
RT              37.99            46.38           +22.08%
Kops/s          436              506             +16.06%

cairo trimmed benchmarks :

Speedups
========
t-xfce4-terminal-a1  1540.37 -> 1226.14 :  1.26x
t-firefox-talos-gfx  1488.59 -> 1209.19 :  1.23x

Slowdowns
=========
        t-evolution  553.88  -> 581.63  :  1.05x
          t-poppler  364.99  -> 383.79  :  1.05x
t-firefox-scrolling  1223.65 -> 1304.34 :  1.07x

The slowdowns can be explained by cases where the images are small and
not aligned to a 16-byte boundary. In that case, the function first
works on the unaligned area, even with operations as small as 1 byte.
For small images, the overhead of such operations can exceed the
savings we get from using the vmx instructions on the aligned part of
the image.

In the C fast-path implementation, there is no special treatment of the
unaligned part, as it works in 4-byte quantities on the entire image.

Because llbb (lowlevel-blt-bench) is a synthetic test, I would assume it
has far fewer alignment issues than a "real-world" scenario, such as the
Cairo benchmarks, which are basically recorded traces of real
application activity.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path composite_add_8888_8888
Oded Gabbay [Thu, 18 Jun 2015 12:05:49 +0000 (15:05 +0300)]
vmx: implement fast path composite_add_8888_8888

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 16 cores, 3.4GHz, ppc64le :

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              248.76          3284.48         +1220.34%
L2              264.09          2826.47         +970.27%
M               261.24          2405.06         +820.63%
HT              217.27          857.3           +294.58%
VT              213.78          980.09          +358.46%
R               176.61          442.95          +150.81%
RT              107.54          150.08          +39.56%
Kops/s          917             1125            +22.68%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path composite_add_8_8
Oded Gabbay [Thu, 18 Jun 2015 11:56:47 +0000 (14:56 +0300)]
vmx: implement fast path composite_add_8_8

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 16 cores, 3.4GHz, ppc64le :

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              687.63          9140.84         +1229.33%
L2              715             7495.78         +948.36%
M               717.39          8460.14         +1079.29%
HT              569.56          1020.12         +79.11%
VT              520.3           1215.56         +133.63%
R               514.81          874.35          +69.84%
RT              341.28          305.42          -10.51%
Kops/s          1621            1579            -2.59%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path composite_over_8888_8888
Oded Gabbay [Thu, 18 Jun 2015 11:12:05 +0000 (14:12 +0300)]
vmx: implement fast path composite_over_8888_8888

Copied the implementation from the sse2 file and edited it to use vmx functions.

It was benchmarked against commit id 2be523b from pixman/master

POWER8, 16 cores, 3.4GHz, ppc64le :

reference memcpy speed = 27036.4MB/s (6759.1MP/s for 32bpp fills)

                Before           After           Change
              ---------------------------------------------
L1              129.47          1054.62         +714.57%
L2              138.31          1011.02         +630.98%
M               139.99          1008.65         +620.52%
HT              122.11          468.45          +283.63%
VT              121.06          532.21          +339.62%
R               108.48          240.5           +121.70%
RT              77.87           116.7           +49.87%
Kops/s          758             981             +29.42%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: implement fast path vmx_fill
Oded Gabbay [Sun, 28 Jun 2015 06:42:19 +0000 (09:42 +0300)]
vmx: implement fast path vmx_fill

Based on the sse2 implementation.

It was benchmarked against commit id e2d211a from pixman/master

Tested cairo trimmed benchmarks on POWER8, 8 cores, 3.4GHz,
RHEL 7.1 ppc64le :

speedups
========
     t-swfdec-giant-steps  1383.09 ->  718.63  :  1.92x speedup
   t-gnome-system-monitor  1403.53 ->  918.77  :  1.53x speedup
              t-evolution  552.34  ->  415.24  :  1.33x speedup
      t-xfce4-terminal-a1  1573.97 ->  1351.46 :  1.16x speedup
      t-firefox-paintball  847.87  ->  734.50  :  1.15x speedup
      t-firefox-asteroids  565.99  ->  492.77  :  1.15x speedup
t-firefox-canvas-swscroll  1656.87 ->  1447.48 :  1.14x speedup
          t-midori-zoomed  724.73  ->  642.16  :  1.13x speedup
   t-firefox-planet-gnome  975.78  ->  911.92  :  1.07x speedup
          t-chromium-tabs  292.12  ->  274.74  :  1.06x speedup
     t-firefox-chalkboard  690.78  ->  653.93  :  1.06x speedup
      t-firefox-talos-gfx  1375.30 ->  1303.74 :  1.05x speedup
   t-firefox-canvas-alpha  1016.79 ->  967.24  :  1.05x speedup

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: add helper functions
Oded Gabbay [Sun, 28 Jun 2015 06:42:08 +0000 (09:42 +0300)]
vmx: add helper functions

This patch adds the following helper functions for code reuse, hiding
BE/LE differences, and maintainability.

All of the functions were defined as static force_inline.

Names were copied from pixman-sse2.c so that converting fast paths
between sse2 and vmx will be easier from now on. Therefore, I tried to
keep the inputs and outputs of the functions as close as possible to the
sse2 definitions.

The functions are:

- load_128_aligned       : load 128-bit from a 16-byte aligned memory
                           address into a vector

- load_128_unaligned     : load 128-bit from memory into a vector,
                           without guarantee of alignment for the
                           source pointer

- save_128_aligned       : save 128-bit vector into a 16-byte aligned
                           memory address

- create_mask_16_128     : take a 16-bit value and fill with it
                           a new vector

- create_mask_1x32_128   : take a 32-bit pointer and fill a new
                           vector with the 32-bit value from that pointer

- create_mask_32_128     : take a 32-bit value and fill with it
                           a new vector

- unpack_32_1x128        : unpack 32-bit value into a vector

- unpacklo_128_16x8      : unpack the eight low 8-bit values of a vector

- unpackhi_128_16x8      : unpack the eight high 8-bit values of a vector

- unpacklo_128_8x16      : unpack the four low 16-bit values of a vector

- unpackhi_128_8x16      : unpack the four high 16-bit values of a vector

- unpack_128_2x128       : unpack the eight low 8-bit values of a vector
                           into one vector and the eight high 8-bit
                           values into another vector

- unpack_128_2x128_16    : unpack the four low 16-bit values of a vector
                           into one vector and the four high 16-bit
                           values into another vector

- unpack_565_to_8888     : unpack an RGB_565 vector to 8888 vector

- pack_1x128_32          : pack a vector and return the LSB 32-bit of it

- pack_2x128_128         : pack two vectors into one and return it

- negate_2x128           : xor two vectors with mask_00ff (separately)

- is_opaque              : returns whether all the pixels contained in
                           the vector are opaque

- is_zero                : returns whether the vector equals 0

- is_transparent         : returns whether all the pixels
                           contained in the vector are transparent

- expand_pixel_8_1x128   : expand an 8-bit pixel into lower 8 bytes of a
                           vector

- expand_alpha_1x128     : expand alpha from vector and return the new
                           vector

- expand_alpha_2x128     : expand alpha from one vector and another alpha
                           from a second vector

- expand_alpha_rev_2x128 : expand a reversed alpha from one vector and
                           another reversed alpha from a second vector

- pix_multiply_2x128     : do pix_multiply for two vectors (separately)

- over_2x128             : perform over op. on two vectors

- in_over_2x128          : perform in-over op. on two vectors

v2: removed expand_pixel_32_1x128 as it was not used by any function and
its implementation was erroneous

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  vmx: add LOAD_VECTOR macro
Oded Gabbay [Thu, 2 Jul 2015 08:04:20 +0000 (11:04 +0300)]
vmx: add LOAD_VECTOR macro

This patch adds a macro for loading a single vector.
It also make the other LOAD_VECTORx macros use this macro as a base so
code would be re-used.

In addition, I fixed minor coding style issues.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  MIPS: update author's e-mail address
Nemanja Lukic [Fri, 27 Jun 2014 16:05:39 +0000 (18:05 +0200)]
MIPS: update author's e-mail address

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
9 years ago  lowlevel-blt-bench: add option to skip memcpy measurement
Pekka Paalanen [Wed, 10 Jun 2015 10:54:01 +0000 (13:54 +0300)]
lowlevel-blt-bench: add option to skip memcpy measurement

The memcpy speed measurement takes several seconds. When you are running
single tests in a harness that iterates dozens or hundreds of times, the
repeated measurements are redundant and take a lot of time. It is also
an open question whether the measured speed changes over long test runs
due to unidentified platform reasons (Raspberry Pi).

Add a command line option to set the reference memcpy speed, skipping
the measuring.

The speed is mainly used to compute how many iterations run inside
the bench_*() functions, so for repeated testing on the same hardware,
it makes sense to lock that number to a constant.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: add CSV output mode
Pekka Paalanen [Wed, 10 Jun 2015 10:20:47 +0000 (13:20 +0300)]
lowlevel-blt-bench: add CSV output mode

Add a command line option for choosing CSV output mode.

In CSV mode, only the results in Mpixels/s are printed in an easily
machine-parseable format. All user-friendly printing is suppressed.

This is intended for cases where you benchmark one particular operation
at a time. Running the "all" set of benchmarks will print just fine, but
you may have trouble matching rows to operations as you have to look at
the tests_tbl[] to see what row is which.

Reviewed-by: Ben Avison <bavison@riscosopen.org>
v2: don't add a space after comma in CSV.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  lowlevel-blt-bench: refactor to Mpx_per_sec()
Pekka Paalanen [Wed, 10 Jun 2015 09:41:57 +0000 (12:41 +0300)]
lowlevel-blt-bench: refactor to Mpx_per_sec()

Refactor the Mpixels/s computations into a function. Easier to read and
better documents what is being computed.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: all bench funcs to return pix_cnt
Pekka Paalanen [Wed, 10 Jun 2015 09:53:09 +0000 (12:53 +0300)]
lowlevel-blt-bench: all bench funcs to return pix_cnt

The bench_* functions, that did not already do it, are modified to
return the number of pixels processed during the benchmark. This moves
the computation to the site that actually determines the number, and
simplifies bench_composite() a bit.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: move speed and scaling printing
Pekka Paalanen [Wed, 10 Jun 2015 09:02:17 +0000 (12:02 +0300)]
lowlevel-blt-bench: move speed and scaling printing

Move the printing of the memory speed and scaling mode into a new
function. This will help with implementing a machine-readable output
option.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: print single pattern details
Pekka Paalanen [Wed, 10 Jun 2015 08:56:39 +0000 (11:56 +0300)]
lowlevel-blt-bench: print single pattern details

When given just a single test pattern instead of "all", print the test
details. This can be used to verify the pattern parser agrees with the
user, just like scaling settings are printed.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: make test_entry::testname const
Pekka Paalanen [Wed, 10 Jun 2015 08:34:45 +0000 (11:34 +0300)]
lowlevel-blt-bench: make test_entry::testname const

We assign string literals to it, so it better be const.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: move explanation printing
Pekka Paalanen [Wed, 10 Jun 2015 08:21:14 +0000 (11:21 +0300)]
lowlevel-blt-bench: move explanation printing

Move explanation printing to a new function. This will help with
implementing a machine-readable output option.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: move usage to a function
Pekka Paalanen [Wed, 10 Jun 2015 08:14:38 +0000 (11:14 +0300)]
lowlevel-blt-bench: move usage to a function

Move printing of usage into a new function and use argv[0] as the
program name. This will help printing usage from multiple places.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  vmx: fix pix_multiply for ppc64le
Oded Gabbay [Thu, 25 Jun 2015 12:59:57 +0000 (15:59 +0300)]
vmx: fix pix_multiply for ppc64le

vec_mergeh/l operate differently for BE and LE because of the order of
the vector elements (l->r in BE and r->l in LE).
To fix that, we simply need to swap the input parameters when we are
working in LE.
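
A sketch of the pattern (the macro names are illustrative and
WORDS_BIGENDIAN is assumed to come from the build system): the operands of
vec_mergeh/vec_mergel are simply swapped on little-endian so both
endiannesses see the same element ordering.

    #include <altivec.h>

    #ifdef WORDS_BIGENDIAN
    #define merge_hi(a, b) vec_mergeh ((a), (b))
    #define merge_lo(a, b) vec_mergel ((a), (b))
    #else
    #define merge_hi(a, b) vec_mergeh ((b), (a))
    #define merge_lo(a, b) vec_mergel ((b), (a))
    #endif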

v2:

- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- fixed whitespaces and indentation issues

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  vmx: fix unused var warnings
Oded Gabbay [Thu, 25 Jun 2015 12:59:56 +0000 (15:59 +0300)]
vmx: fix unused var warnings

v2: don't put ';' at the end of macro definition. Instead, move it to
    each line the macro is used.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  vmx: encapsulate the temporary variables inside the macros
Oded Gabbay [Thu, 25 Jun 2015 12:59:55 +0000 (15:59 +0300)]
vmx: encapsulate the temporary variables inside the macros

v2: fixed whitespaces and indentation issues

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  vmx: adjust macros when loading vectors on ppc64le
Fernando Seiti Furusato [Thu, 25 Jun 2015 12:59:54 +0000 (15:59 +0300)]
vmx: adjust macros when loading vectors on ppc64le

Replaced usage of vec_lvsl with the direct unaligned assignment
operation (=). That is because, according to the Power ABI Specification,
the usage of lvsl is deprecated on ppc64le.

Changed the COMPUTE_SHIFT_{MASK,MASKS,MASKC} macros to no-ops for
little-endian powerpc, since unaligned access is supported on ppc64le.

v2:

- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- fixed whitespaces and indentation issues

Signed-off-by: Fernando Seiti Furusato <ferseiti@linux.vnet.ibm.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  vmx: fix splat_alpha for ppc64le
Oded Gabbay [Thu, 25 Jun 2015 12:59:53 +0000 (15:59 +0300)]
vmx: fix splat_alpha for ppc64le

The permutation vector isn't correct for LE, so correct its values
in case we are in LE mode.

v2:

- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- change #ifndef to #ifdef for readability

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH for NORMAL repeat
Ben Avison [Tue, 26 May 2015 22:58:29 +0000 (23:58 +0100)]
mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH for NORMAL repeat

These two architectures were the only places where
SIMPLE_NEAREST_SOLID_MASK_FAST_PATH was used, and in both cases the
equivalent SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL macro was used
immediately afterwards, so including the NORMAL case in the main macro
simplifies the fast path table.

[Pekka: removed extra comma from the end of
 SIMPLE_NEAREST_SOLID_MASK_FAST_PATH]

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  mmx/sse2: Use SIMPLE_NEAREST_FAST_PATH macro
Ben Avison [Tue, 26 May 2015 22:58:28 +0000 (23:58 +0100)]
mmx/sse2: Use SIMPLE_NEAREST_FAST_PATH macro

There is some reordering, but the only requirement for ensuring that the
same routine is chosen is that a COVER fast path for a given combination
of operator and source/destination pixel formats must precede all the
repeat variants of fast paths for the same combination. This patch (and
the other mmx/sse2 one) still follows that rule.

I believe that in every other case, the sets of operations matched by any
pair of fast paths that are reordered in these patches are mutually
exclusive. While there will be a very subtle timing difference due to the
distance we have to search through the table to find a match (sometimes
faster, sometimes slower), there is no evidence that the tables have been
carefully ordered by frequency of occurrence - only for ease of
copy-and-pasting.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  mips: Retire PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH
Ben Avison [Tue, 26 May 2015 22:58:27 +0000 (23:58 +0100)]
mips: Retire PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH

This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  arm: Simplify PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH
Ben Avison [Tue, 26 May 2015 22:58:26 +0000 (23:58 +0100)]
arm: Simplify PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH

This macro is a superset of the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH. In other words, in addition to the
_COVER, _NONE and _PAD suffixes, its expansion includes the _NORMAL suffix.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  arm: Retire PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH
Ben Avison [Wed, 27 May 2015 11:45:25 +0000 (12:45 +0100)]
arm: Retire PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH

This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_FAST_PATH.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
9 years ago  test: Fix solid-test for big-endian targets
Ben Avison [Fri, 29 May 2015 15:20:43 +0000 (16:20 +0100)]
test: Fix solid-test for big-endian targets

When generating test data, we need to make sure the interpretation of
the data is the same regardless of endianness. That is, the pixel value
for each channel must be the same on both little- and big-endian systems.
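
For instance (a generic illustration, not the solid-test code itself),
building pixels from channel values with shifts yields the same
a8r8g8b8 value on either endianness, whereas reinterpreting raw bytes
would not:

    #include <stdint.h>

    static uint32_t
    make_a8r8g8b8 (uint8_t a, uint8_t r, uint8_t g, uint8_t b)
    {
        /* same numeric pixel value on little- and big-endian hosts */
        return ((uint32_t) a << 24) | ((uint32_t) r << 16) |
               ((uint32_t) g << 8)  |  (uint32_t) b;
    }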

This fixes a test failure on ppc64 (big-endian).

Tested-by: Fernando Seiti Furusato <ferseiti@linux.vnet.ibm.com> (ppc64le, ppc64, powerpc)
Tested-by: Ben Avison <bavison@riscosopen.org> (armv6l, armv7l, i686)
[Pekka: added commit message]
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Tested-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk> (x86_64)
9 years ago  test: Add new fuzz tester targeting solid images
Ben Avison [Thu, 7 May 2015 18:32:46 +0000 (19:32 +0100)]
test: Add new fuzz tester targeting solid images

This places a heavier emphasis on solid images than the other fuzz
testers, and tests both single-pixel repeating bitmap images and images
created using pixman_image_create_solid_fill(). In the former case, it
also exercises writing to the bitmap contents after the image's first
use, a use case that no other test has previously covered.
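
The two kinds of solid source look roughly like this (hypothetical
illustration using the public pixman API, error handling omitted):

    #include <pixman.h>

    static void
    make_solid_sources (void)
    {
        /* a "real" solid image */
        pixman_color_t red = { 0xffff, 0x0000, 0x0000, 0xffff };
        pixman_image_t *solid = pixman_image_create_solid_fill (&red);

        /* a 1x1 repeating bitmap, which pixman also treats as solid */
        static uint32_t pixel = 0xffff0000;     /* opaque red, a8r8g8b8 */
        pixman_image_t *bits =
            pixman_image_create_bits (PIXMAN_a8r8g8b8, 1, 1, &pixel, 4);
        pixman_image_set_repeat (bits, PIXMAN_REPEAT_NORMAL);

        pixman_image_unref (solid);
        pixman_image_unref (bits);
    }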

[Pekka: added the default case to the switch in test_solid ().]

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  MIPS: Drop #ifdef __ELF__ in definition of LEAF_MIPS32R2
James Cowgill [Tue, 5 May 2015 15:39:38 +0000 (16:39 +0100)]
MIPS: Drop #ifdef __ELF__ in definition of LEAF_MIPS32R2

Commit 6d2cf40166d8 ("MIPS: Fix exported symbols in public API") attempted to
add a .hidden assembly directive, conditional on the code being compiled for an
ELF target. Unfortunately the #ifdef added was already inside a macro and
wasn't expanded properly by the preprocessor.
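
For reference, preprocessor conditionals cannot be expanded from within
a macro body; a working pattern (sketch only, not what this patch does)
keeps the conditional around alternative definitions instead:

    #ifdef __ELF__
    #define HIDDEN_DIRECTIVE(symbol)  .hidden symbol
    #else
    #define HIDDEN_DIRECTIVE(symbol)  /* nothing */
    #endif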

Fix by removing the check. It's unlikely there are many non-ELF MIPS systems
around anyway.

Fixes: Bug 83358 (https://bugs.freedesktop.org/83358)
Fixes: 6d2cf40166d8 ("MIPS: Fix exported symbols in public API")
Signed-off-by: James Cowgill <james410@cowgill.org.uk>
Cc: Vicente Olivert Riera <Vincent.Riera@imgtec.com>
Cc: Nemanja Lukic <nemanja.lukic@rt-rk.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  test: Added more demos and tests to .gitignore file
Bill Spitzak [Wed, 29 Apr 2015 18:44:17 +0000 (11:44 -0700)]
test: Added more demos and tests to .gitignore file

Uses a wildcard to handle the majority, which end in "-test".

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  test: Add a new benchmarker targeting affine operations
Ben Avison [Wed, 8 Apr 2015 13:20:09 +0000 (14:20 +0100)]
test: Add a new benchmarker targeting affine operations

Affine-bench is written by following the example of lowlevel-blt-bench.

Affine-bench differs from lowlevel-blt-bench in the following:
- does not test different-sized operations fitted to specific caches;
  the destination is always 1920x1080
- allows defining the affine transformation parameters
- carefully computes operation extents to hit the COVER_CLIP fast paths

Original version by Ben Avison. Changes by Pekka in v3:
- commit message
- style fixes
- more comments
- refactoring (e.g. bench_info_t)
- help output tweak

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: use a8r8g8b8 for CA solid masks
Pekka Paalanen [Tue, 14 Apr 2015 08:42:00 +0000 (11:42 +0300)]
lowlevel-blt-bench: use a8r8g8b8 for CA solid masks

When doing component alpha with a solid mask, use a mask format that has
all the color channels instead of just a8. As Ben Avison explains it:

"Lowlevel-blt-bench initialises all its images using memset(0xCC) so an
a8 solid image would be converted by _pixman_image_get_solid() to
0xCC000000 whereas an a8r8g8b8 would be 0xCCCCCCCC. When you're not in
component alpha mode, only the alpha byte matters for the mask image,
but in the case of component alpha operations, a fast path might decide
that it can save itself a lot of multiplications if it spots that 3
constant mask components are already 0."

No (default) test so far has a solid mask with CA. This is just
future-proofing lowlevel-blt-bench to do what one would expect.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: use the test pattern parser
Pekka Paalanen [Fri, 10 Apr 2015 13:42:49 +0000 (16:42 +0300)]
lowlevel-blt-bench: use the test pattern parser

Let lowlevel-blt-bench parse the test name string from the command line,
allowing far more tests to be run. One is no longer limited to the tests
listed in the big table.

While you can use the old short-hand names like src_8888_8888, you can
also use all possible operators now, and specify pixel formats exactly
rather than just x888, for instance.

This even allows running crazy patterns like
conjoint_over_reverse_a8b8g8r8_n_r8g8b8x8.

All individual patterns are now interpreted through the parser. The
pattern "all" runs the same old default test set as before but through
the parser instead of the hard-coded parameters.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  lowlevel-blt-bench: add test name parser and self-test
Pekka Paalanen [Fri, 10 Apr 2015 11:39:13 +0000 (14:39 +0300)]
lowlevel-blt-bench: add test name parser and self-test

This patch is inspired by "lowlevel-blt-bench: Parse test name strings in
general case" by Ben Avison. From Ben's commit message:

"There are many types of composite operation that are useful to benchmark
but which are omitted from the table. Continually having to add extra
entries to the table is a nuisance and is prone to human error, so this
patch adds the ability to break down unknown strings of the format
  <operation>_<src>[_<mask>]_<dst>[_ca]
where bitmap formats are specified by number of bits of each component
(assumed in ARGB order) or 'n' to indicate a solid source or mask."
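
For example, under this convention over_n_8888_8888_ca means the OVER
operator with a solid ('n') source, an a8r8g8b8 mask used in
component-alpha mode and an a8r8g8b8 destination, while src_8888_0565
is a plain SRC from a8r8g8b8 to r5g6b5 with no mask.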

Add the parser to lowlevel-blt-bench.c, but do not hook it up to the
command line just yet. Instead, make it run a self-test.

As we now dynamically parse strings similar to the test names in the
huge table 'tests_tbl', we should make sure we can parse the old
well-known test names and produce exactly the same test parameters. The
self-test goes through this old table and verifies the parsing results.

Unfortunately the old table is not exactly consistent; it contains some
special cases that cannot be produced by the parsing rules. Whether
these special cases are intentional or just an oversight is not always
clear. Anyway, add a small table to reproduce the special cases
verbatim.

If we wanted, we could remove the big old table in a follow-up commit,
but then we would also lose the parser self-test.

The point of this whole exercise is to let lowlevel-blt-bench recognize
novel test patterns in the future, following exactly the conventions
used in the old table.

Ben, from what I see, this parser has one major difference to what you
wrote. For a solid mask, your parser uses a8r8g8b8 format, while mine
uses a8 which comes from the old table.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test/utils: add format aliases used by lowlevel-blt-bench
Pekka Paalanen [Fri, 10 Apr 2015 09:50:23 +0000 (12:50 +0300)]
test/utils: add format aliases used by lowlevel-blt-bench

Lowlevel-blt-bench uses several pixel format shorthands. Pick them from
the great table in lowlevel-blt-bench.c and add them here so that
format_from_string() can recognize them.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test/utils: add operator aliases for lowlevel-blt-bench
Pekka Paalanen [Mon, 13 Apr 2015 07:06:31 +0000 (10:06 +0300)]
test/utils: add operator aliases for lowlevel-blt-bench

Lowlevel-blt-bench uses the operator alias "outrev". Add an alias for it
in the operator-name table.

Also add aliases for overrev, inrev and atoprev, so that
lowlevel-blt-bench can later recognize them for new test cases.

The aliases are added such that an operator-to-name lookup will never
return them; it returns the proper names instead.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test/utils: support format name aliases
Pekka Paalanen [Thu, 9 Apr 2015 11:17:54 +0000 (14:17 +0300)]
test/utils: support format name aliases

Previously there was a flat list of formats, used to iterate over all
formats when looking up a format from name or listing them. This cannot
support name aliases.

To support name aliases (multiple name strings mapping to the same
format), create a format-name mapping table. Functions format_name(),
format_from_string(), and list_formats() should keep on working exactly
like before, except format_from_string() now recognizes the additional
formats that format_name() already supported.

Only the formats from the old format list are added with ENTRY, so that
list_formats() works as before. The whole list is verified against the
authoritative list in pixman.h; entries missing from the old list are
commented out.

The extra formats supported by the old format_name() are added as
ALIASes. As a side effect, format_from_string() now also recognizes the
following new names: x4c4 / c8, x4g4 / g8, c4, g4, g1, yuy2, yv12, null,
solid, pixbuf, rpixbuf, unknown.
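
The shape of the table is roughly the following (a sketch; the real
entries live in test/utils.c and differ in detail):

    #include <pixman.h>

    typedef struct
    {
        const char *         name;
        pixman_format_code_t format;
    } format_entry_t;

    #define ENTRY(f)        { #f, PIXMAN_##f }
    #define ALIAS(name, f)  { name, PIXMAN_##f }

    static const format_entry_t format_list[] =
    {
        ENTRY (a8r8g8b8),          /* listed by list_formats ()            */
        ALIAS ("8888", a8r8g8b8),  /* extra name for format_from_string () */
        ENTRY (r5g6b5),
        ALIAS ("0565", r5g6b5),
        ENTRY (a8),
    };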

Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse short-hand format names from
strings.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test/utils: support operator name aliases
Pekka Paalanen [Thu, 9 Apr 2015 10:13:09 +0000 (13:13 +0300)]
test/utils: support operator name aliases

Previously there was a flat list of operators (pixman_op_t), used to
iterate over all operators when looking up an operator from name or
listing them. This cannot support name aliases.

To support name aliases (multiple name strings mapping to the same
operator), create an operator-name mapping table. Functions
operator_name, operator_from_string, and list_operators should keep on
working exactly like before, except operator_from_string now recognizes
a few aliases too.

Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse operator names from strings.
Lowlevel-blt-bench uses shorthand names instead of the usual names. This
change allows lowlevel-blt-bench.c to use operator_from_string in the
future.

Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
9 years ago  test: Move format and operator string functions to utils.[ch]
Ben Avison [Tue, 3 Mar 2015 15:24:17 +0000 (15:24 +0000)]
test: Move format and operator string functions to utils.[ch]

This permits format_from_string(), list_formats(), list_operators() and
operator_from_string() to be used from tests other than check-formats.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  pixman.c: Coding style
Ben Avison [Wed, 8 Apr 2015 13:20:30 +0000 (14:20 +0100)]
pixman.c: Coding style

A few violations of coding style were identified in code copied from here
into affine-bench.

Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
9 years ago  armv6: Fix typo in preload macro
Ben Avison [Tue, 3 Mar 2015 15:24:16 +0000 (15:24 +0000)]
armv6: Fix typo in preload macro

Missing "lsl" meant that cases with a 32-bit source and/or mask, and an
8-bit destination, the code would not assemble.

10 years ago  mmx: Fix _mm_empty problems for over_8888_8888/over_8888_n_8888
Siarhei Siamashka [Mon, 22 Sep 2014 03:30:51 +0000 (06:30 +0300)]
mmx: Fix _mm_empty problems for over_8888_8888/over_8888_n_8888

Using "--disable-sse2 --disable-ssse3" configure options and
CFLAGS="-m32 -O2 -g" on an x86 system results in pixman "make check"
failures:

    ../test-driver: line 95: 29874 Aborted
    FAIL: affine-test
    ../test-driver: line 95: 29887 Aborted
    FAIL: scaling-test

One _mm_empty () was missing and another one is needed to work around
an old GCC bug https://gcc.gnu.org/PR47759 (GCC may move MMX instructions
around and cause test suite failures).
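
For illustration (a generic sketch, not the pixman fast paths): any
function that touches MMX registers must issue emms via _mm_empty ()
before other code may use x87 floating point.

    #include <mmintrin.h>
    #include <stdint.h>

    static void
    or_row_mmx (uint32_t *dst, const uint32_t *src, int w)
    {
        while (w--)
        {
            __m64 s = _mm_cvtsi32_si64 ((int) *src++);
            __m64 d = _mm_cvtsi32_si64 ((int) *dst);
            *dst++ = (uint32_t) _mm_cvtsi64_si32 (_mm_or_si64 (s, d));
        }

        _mm_empty ();   /* leave MMX state before any x87 code runs */
    }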

Reviewed-by: Matt Turner <mattst88@gmail.com>
10 years ago  Fix comment about BILINEAR_INTERPOLATION_BITS to say < 8 rather than <= 8
Søren Sandmann Pedersen [Sat, 20 Sep 2014 04:51:56 +0000 (21:51 -0700)]
Fix comment about BILINEAR_INTERPOLATION_BITS to say < 8 rather than <= 8

Since a4c79d695d52c94647b1aff7 the constant
BILINEAR_INTERPOLATION_BITS must be strictly less than 8, so fix the
comment to say this, and also add a COMPILE_TIME_ASSERT in the
bilinear fetcher in pixman-fast-path.c
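
A typical definition of such an assertion (sketch only; pixman's own
macro may differ in detail) makes compilation fail via a negative array
size when the condition does not hold:

    #define COMPILE_TIME_ASSERT(x) \
        do { typedef int compile_time_assertion [(x) ? 1 : -1]; } while (0)

    /* e.g., inside the bilinear fetcher:
     *     COMPILE_TIME_ASSERT (BILINEAR_INTERPOLATION_BITS < 8);
     */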

10 years ago  mmx: Add nearest over_8888_8888
Matt Turner [Wed, 2 Jan 2013 19:16:12 +0000 (11:16 -0800)]
mmx: Add nearest over_8888_8888

lowlevel-blt-bench -n, over_8888_8888, 15 iterations on Loongson 2f:

           Before          After
          Mean StdDev     Mean StdDev   Change
    L1    15.8   0.02     24.0   0.06   +52.0%
    L2    14.8   0.15     23.3   0.13   +56.9%
    M     10.3   0.01     13.8   0.03   +33.6%
    HT    10.0   0.02     14.5   0.05   +44.7%
    VT     9.7   0.02     13.5   0.04   +39.2%
    R      9.1   0.01     12.2   0.04   +34.4%
    RT     7.1   0.06      8.9   0.09   +25.2%

10 years ago  mmx: Add nearest over_8888_n_8888
Matt Turner [Wed, 2 Jan 2013 05:18:09 +0000 (21:18 -0800)]
mmx: Add nearest over_8888_n_8888

lowlevel-blt-bench -n, over_8888_n_8888, 15 iterations on Loongson 2f:

           Before          After
          Mean StdDev     Mean StdDev   Change
    L1     9.7   0.01     19.2   0.02   +98.2%
    L2     9.6   0.11     19.2   0.16   +99.5%
    M      7.3   0.02     12.5   0.01   +72.0%
    HT     6.6   0.01     13.4   0.02  +103.2%
    VT     6.4   0.01     12.6   0.03   +96.1%
    R      6.3   0.01     11.2   0.01   +76.5%
    RT     4.4   0.01      8.1   0.03   +82.6%

10 years ago  MIPS: Fix exported symbols in public API.
Nemanja Lukic [Fri, 27 Jun 2014 16:05:38 +0000 (18:05 +0200)]
MIPS: Fix exported symbols in public API.

10 years ago  test: Rearrange tests in order of increasing runtime
Søren Sandmann Pedersen [Sun, 1 Jun 2014 22:50:23 +0000 (18:50 -0400)]
test: Rearrange tests in order of increasing runtime

Making short tests run first is convenient to catch obvious bugs
early.

10 years ago  pixman-gradient-walker: Make left_x and right_x 64 bit variables
Søren Sandmann Pedersen [Thu, 24 Apr 2014 00:25:40 +0000 (20:25 -0400)]
pixman-gradient-walker: Make left_x and right_x 64 bit variables

The variables left_x and right_x in gradient_walker_reset() are
computed from pos, which is a 64 bit quantity, so to avoid overflows,
these variables must be 64 bit as well.
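
A self-contained illustration of the failure mode (not the gradient
walker code itself): a 48.16 fixed-point position that exceeds the
32-bit range is silently truncated when stored in a 32-bit variable.

    #include <stdint.h>
    #include <stdio.h>

    int
    main (void)
    {
        int64_t pos       = (int64_t) 70000 << 16;  /* 48.16 fixed point */
        int32_t left_x_32 = (int32_t) pos;          /* truncated         */
        int64_t left_x_64 = pos;                    /* keeps full range  */

        printf ("%d vs %lld\n", left_x_32, (long long) left_x_64);
        return 0;
    }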

Similarly, the left_x and right_x that are stored in
pixman_gradient_walker_t need to be 64 bit as well; otherwise,
pixman_gradient_walker_pixel() will call reset too often.

This fixes the radial-invalid test, which was generating 'invalid'
floating point exceptions when the overflows caused color values to be
outside of [0, 255].

10 years ago  test: Add radial-invalid test program
Søren Sandmann Pedersen [Thu, 24 Apr 2014 00:07:37 +0000 (20:07 -0400)]
test: Add radial-invalid test program

This program demonstrates a bug in gradient walker, where some integer
overflows cause colors outside the range [0, 255] to be generated,
which in turns cause 'invalid' floating point exceptions when those
colors are converted to uint8_t.

The bug was first reported by Owen Taylor on the #cairo IRC channel.

10 years ago  ARMv6: Add fast path for src_x888_0565
Ben Avison [Thu, 24 Apr 2014 10:39:06 +0000 (13:39 +0300)]
ARMv6: Add fast path for src_x888_0565

Benchmark results, "before" is upstream/master
5f661ee719be25c3aa0eb0d45e0db23a37e76468, and "after" contains this
patch on top.

lowlevel-blt-bench, src_8888_0565, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    25.9   0.20    115.6   0.70    100.00%    +347.1%
L2    14.4   0.23     52.7   3.48    100.00%    +265.0%
M     14.1   0.01     79.8   0.17    100.00%    +465.9%
HT    10.2   0.03     32.9   0.31    100.00%    +221.2%
VT     9.8   0.03     29.8   0.25    100.00%    +203.4%
R      9.4   0.03     27.8   0.18    100.00%    +194.7%
RT     4.6   0.04     10.9   0.29    100.00%    +135.9%

At most 19 outliers rejected per test per set.

cairo-perf-trace results with trimmed traces showed no significant difference.

A system-wide perf_3.10 profile on Raspbian shows significant
differences in the X server CPU usage. The following were measured from
a 130x62 char lxterminal running 'dmesg' every 0.5 seconds for roughly
30 seconds. These profiles are libpixman.so symbols only.

Before:

Samples: 63K of event 'cpu-clock', Event count (approx.): 2941348112, DSO: libpixman-1.so.0.33.1
 37.77%  Xorg  [.] fast_fetch_r5g6b5
 14.39%  Xorg  [.] pixman_composite_over_n_8_8888_asm_armv6
  8.51%  Xorg  [.] fast_write_back_r5g6b5
  7.38%  Xorg  [.] pixman_composite_src_8888_8888_asm_armv6
  4.39%  Xorg  [.] pixman_composite_add_8_8_asm_armv6
  3.69%  Xorg  [.] pixman_composite_src_n_8888_asm_armv6
  2.53%  Xorg  [.] _pixman_image_validate
  2.35%  Xorg  [.] pixman_image_composite32

After:

Samples: 31K of event 'cpu-clock', Event count (approx.): 3619782704, DSO: libpixman-1.so.0.33.1
 22.36%  Xorg  [.] pixman_composite_over_n_8_8888_asm_armv6
 13.59%  Xorg  [.] pixman_composite_src_x888_0565_asm_armv6
 12.75%  Xorg  [.] pixman_composite_src_8888_8888_asm_armv6
  6.79%  Xorg  [.] pixman_composite_add_8_8_asm_armv6
  5.95%  Xorg  [.] pixman_composite_src_n_8888_asm_armv6
  4.12%  Xorg  [.] pixman_image_composite32
  3.69%  Xorg  [.] _pixman_image_validate
  3.65%  Xorg  [.] _pixman_bits_image_setup_accessors

Before, fast_fetch_r5g6b5 + fast_write_back_r5g6b5 took 46% of the
samples in libpixman, and probably incurred some memcpy() load, too.
After, pixman_composite_src_x888_0565_asm_armv6 takes 14%. Note that
the sample counts are very different before/after, as less time is spent
in Pixman and the running time is not exactly the same.

Furthermore, in the above test, the CPU idle function was sampled 9%
before, and 15% after.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Re-benchmarked on Raspberry Pi, commit message.

10 years ago  ARM: use pixman_asm_function in internal headers
Pekka Paalanen [Thu, 10 Apr 2014 06:41:38 +0000 (09:41 +0300)]
ARM: use pixman_asm_function in internal headers

The two ARM headers contained open-coded copies of pixman_asm_function;
replace these.

Since it seems customary that ARM headers do not use CPP include guards,
rely on the .S files to #include "pixman-arm-asm.h" first. They all
do now.

v2: Fix a build failure on rpi by adding one #include.

10 years ago  ARMv6: Add fast path for in_reverse_8888_8888
Ben Avison [Wed, 9 Apr 2014 13:25:32 +0000 (16:25 +0300)]
ARMv6: Add fast path for in_reverse_8888_8888

Benchmark results, "before" is the patch
* upstream/master 4b76bbfda670f9ede67d0449f3640605e1fc4df0
+ ARMv6: Support for very variable-hungry composite operations
+ ARMv6: Add fast path for over_n_8888_8888_ca
and "after" contains the additional patches on top:
+ ARMv6: Add fast path flag to force no preload of destination buffer
+ ARMv6: Add fast path for in_reverse_8888_8888 (this patch)

lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:

       Before          After
      Mean StdDev     Mean StdDev   Confidence   Change
L1    21.1   0.07     32.3   0.08    100.00%     +52.9%
L2    11.6   0.29     18.0   0.52    100.00%     +54.4%
M     10.5   0.01     16.1   0.03    100.00%     +54.1%
HT     8.2   0.02     12.0   0.04    100.00%     +45.9%
VT     8.1   0.02     11.7   0.04    100.00%     +44.5%
R      8.1   0.02     11.3   0.04    100.00%     +39.7%
RT     4.8   0.04      6.1   0.09    100.00%     +27.3%

At most 12 outliers rejected per test per set.

cairo-perf-trace with trimmed traces, 30 iterations:

                                    Before          After
                                   Mean StdDev     Mean StdDev   Confidence   Change
t-firefox-paintball.trace          18.0   0.01     14.1   0.01    100.00%     +27.4%
t-firefox-chalkboard.trace         36.7   0.03     36.0   0.02    100.00%      +1.9%
t-firefox-canvas-alpha.trace       20.7   0.22     20.3   0.22    100.00%      +1.9%
t-swfdec-youtube.trace              7.8   0.03      7.8   0.03    100.00%      +0.9%
t-firefox-talos-gfx.trace          25.8   0.44     25.6   0.29     93.87%      +0.7%  (insignificant)
t-firefox-talos-svg.trace          20.6   0.04     20.6   0.03    100.00%      +0.2%
t-firefox-fishbowl.trace           21.2   0.04     21.1   0.02    100.00%      +0.2%
t-xfce4-terminal-a1.trace           4.8   0.01      4.8   0.01     98.85%      +0.2%  (insignificant)
t-swfdec-giant-steps.trace         14.9   0.03     14.9   0.02     99.99%      +0.2%
t-poppler-reseau.trace             22.4   0.11     22.4   0.08     86.52%      +0.2%  (insignificant)
t-gnome-system-monitor.trace       17.3   0.03     17.2   0.03     99.74%      +0.2%
t-firefox-scrolling.trace          24.8   0.12     24.8   0.11     70.15%      +0.1%  (insignificant)
t-firefox-particles.trace          27.5   0.18     27.5   0.21     48.33%      +0.1%  (insignificant)
t-grads-heat-map.trace              4.4   0.04      4.4   0.04     16.61%      +0.0%  (insignificant)
t-firefox-fishtank.trace           13.2   0.01     13.2   0.01      7.64%      +0.0%  (insignificant)
t-firefox-canvas.trace             18.0   0.05     18.0   0.05      1.31%      -0.0%  (insignificant)
t-midori-zoomed.trace               8.0   0.01      8.0   0.01     78.22%      -0.0%  (insignificant)
t-firefox-planet-gnome.trace       10.9   0.02     10.9   0.02     64.81%      -0.0%  (insignificant)
t-gvim.trace                       33.2   0.21     33.2   0.18     38.61%      -0.1%  (insignificant)
t-firefox-canvas-swscroll.trace    32.2   0.09     32.2   0.11     73.17%      -0.1%  (insignificant)
t-firefox-asteroids.trace          11.1   0.01     11.1   0.01    100.00%      -0.2%
t-evolution.trace                  13.0   0.05     13.0   0.05     91.99%      -0.2%  (insignificant)
t-gnome-terminal-vim.trace         19.9   0.14     20.0   0.14     97.38%      -0.4%  (insignificant)
t-poppler.trace                     9.8   0.06      9.8   0.04     99.91%      -0.5%
t-chromium-tabs.trace               4.9   0.02      4.9   0.02    100.00%      -0.6%

At most 6 outliers rejected per test per set.

Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).

Confidence is based on Welch's t-test. Absolute changes less than 1%
can be accounted as measurement errors, even if statistically
significant.

There was a question of why FLAG_NO_PRELOAD_DST is used. It makes
lowlevel-blt-bench results worse except for L1, but improves some
Cairo trace benchmarks.

"Ben Avison" <bavison@riscosopen.org> wrote:

> The thing with the lowlevel-blt-bench benchmarks for the more
> sophisticated composite types (as a general rule, anything that involves
> branches at the per-pixel level) is that they are only profiling the case
> where you have mid-level alpha values in the source/mask/destination.
> Real-world images typically have a disproportionate number of fully
> opaque and fully transparent pixels, which is why when there's a
> discrepancy between which implementation performs best with cairo-perf
> trace versus lowlevel-blt-bench, I usually favour the Cairo winner.
>
> The results of removing FLAG_NO_PRELOAD_DST (in other words, adding
> preload of the destination buffer) are easy to explain in the
> lowlevel-blt-bench results. In the L1 case, the destination buffer is
> already in the L1 cache, so adding the preloads is simply adding extra
> instruction cycles that have no effect on memory operations. The "in"
> compositing operator depends upon the alpha of both source and
> destination, so if you use uniform mid-alpha, then you actually do need
> to read your destination pixels, so you benefit from preloading them. But
> for fully opaque or fully transparent source pixels, you don't need to
> read the corresponding destination pixel - it'll either be left alone or
> overwritten. Since the ARM11 doesn't use write-allocate cacheing, both of
> these cases avoid both the time taken to load the extra cachelines, as
> well as increasing the efficiency of the cache for other data. If you
> examine the source images being used by the Cairo test, you'll probably
> find they mostly use transparent or opaque pixels.

v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, commit message.

v5, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi due to a fix to
"ARMv6: Add fast path for over_n_8888_8888_ca" patch.

10 years ago  ARMv6: Add fast path flag to force no preload of destination buffer
Ben Avison [Wed, 9 Apr 2014 13:25:31 +0000 (16:25 +0300)]
ARMv6: Add fast path flag to force no preload of destination buffer