review.tizen.org Git - platform/upstream/pixman.git/log

general: Support component alpha for all image types

Currently, if you attempt to use component alpha on source images or
images without RGB channels, Pixman will silently just use unified
alpha instead. This patch makes such images supported for component
alpha.

There is no particularly compelling usecase at the moment, but this
patch does get rid of a bit of special-case code both in
pixman-general.c and in test/composite.c.

test/utils.c: Make the stack unaligned only on 32 bit Windows

call_test_function() contains some assembly that deliberately
causes the stack to be aligned to 32 bits rather than 128 bits on
x86-32. The intention is to catch bugs that surface when pixman is
called from code that only uses a 32 bit alignment.

However, recent versions of GCC apparently make the assumption (either
accidentally or deliberately) that the incoming stack is aligned
to 128 bits, where older versions only seemed to make this assumption
when compiling with -msse2. This causes the vector code in the PRNG to
now segfault when called from call_test_function() on x86-32.

This patch fixes that by only making the stack unaligned on 32 bit
Windows, where it would definitely be incorrect for GCC to assume that
the incoming stack is aligned to 128 bits.

V2: Put "defined(...)" around __GNUC__

Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com>
Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=491110

Fix the SSSE3 CPUID detection.

SSSE3 is detected by bit 9 of ECX, but we were checking bit 9 of EDX,
which is APIC. This led to SSSE3 routines being called on CPUs without
SSSE3.
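
As an illustration of the register mix-up, here is a minimal sketch
that assumes the ECX and EDX values have already been read from CPUID
leaf 1. The helper names are hypothetical and are not pixman's.

```c
#include <stdint.h>

/* CPUID leaf 1 feature bits. */
#define CPUID_ECX_SSSE3 (1u << 9)   /* ECX bit 9: SSSE3 */
#define CPUID_EDX_APIC  (1u << 9)   /* EDX bit 9: on-chip APIC */

/* Hypothetical helper: returns 1 if the ECX value reports SSSE3. */
static int
has_ssse3 (uint32_t ecx)
{
    return (ecx & CPUID_ECX_SSSE3) != 0;
}

/* The buggy variant tested the same bit position in EDX, so it
 * really reported whether the CPU has an APIC. */
static int
has_ssse3_buggy (uint32_t edx)
{
    return (edx & CPUID_EDX_APIC) != 0;
}
```

A CPU with an APIC but without SSSE3 passes the buggy check, which is
how SSSE3 routines ended up being selected on such machines.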

Reviewed-by: Matt Turner <mattst88@gmail.com>

demos/Makefile.am: Move EXTRA_DIST outside "if HAVE_GTK"

Without this, if tarballs are generated on a system that doesn't have
GTK+ 2 development headers available, the files in EXTRA_DIST will not
be included, which then causes builds from the tarball to fail on
systems that do have GTK+ 2 headers available.

Fixes https://bugs.freedesktop.org/show_bug.cgi?id=71465

test: Fix the win32 build

The win32 build has no config.h, so HAVE_CONFIG_H should be checked
before including it, as in utils.h.

Post-release version bump to 0.33.1

Pre-release version bump to 0.32.0

Post-release version bump to 0.31.3

Pre-release version bump to 0.31.2

pixman_trapezoid_valid(): Fix underflow when bottom is close to MIN_INT

If t->bottom is close to MIN_INT (probably an invalid value), subtracting
top can lead to underflow, which causes crashes. This patch fixes the
issue.

This fixes bug 67484.
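
The following sketch shows why a subtraction-based height check can go
wrong. It is an illustration under assumptions, not the patch itself:
the struct and function names are hypothetical, and the wraparound is
modelled with unsigned arithmetic because signed overflow is undefined
behavior in C.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two fields involved. */
typedef struct
{
    int32_t top;
    int32_t bottom;
} trap_t;

/* "bottom - top > 0", with the two's complement wraparound made
 * explicit through unsigned arithmetic. */
static int
height_positive_by_subtraction (const trap_t *t)
{
    uint32_t diff = (uint32_t) t->bottom - (uint32_t) t->top;

    return diff != 0 && diff < 0x80000000u;
}

/* Comparing the values directly cannot wrap. */
static int
height_positive_by_comparison (const trap_t *t)
{
    return t->bottom > t->top;
}
```

With top = 2 and bottom = INT32_MIN + 1, the subtraction wraps to a
large positive value and the first check accepts the trapezoid, while
the direct comparison rejects it.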

test/trap-crasher.c: Add trapezoid that demonstrates a crash

This trapezoid causes a crash due to an underflow in
pixman_trapezoid_valid().

Test case from Ritesh Khadgaray.

Fix pixman build with older GCC releases

The following patch fixes building pixman with older GCC releases
such as GCC 3.3 and older (OpenBSD; some older archs use GCC 3.3.6)
by changing the method of detecting the presence of __builtin_clz
to utilizing an autoconf check to determine its presence. Compilers
that pretend to be GCC, implement __builtin_clz and are already
utilizing the intrinsic include LLVM/Clang, Open64, EKOPath and
PCC.
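
A sketch of how code might consume the result of such a configure
check. The macro name HAVE_BUILTIN_CLZ and the helper name are
assumptions made for this example.

```c
#include <stdint.h>

/* HAVE_BUILTIN_CLZ is assumed to be defined by the configure check
 * when the compiler provides __builtin_clz. */
static int
count_leading_zeros (uint32_t x)
{
    if (x == 0)
        return 32;              /* __builtin_clz (0) is undefined */
#ifdef HAVE_BUILTIN_CLZ
    return __builtin_clz (x);
#else
    {
        /* Portable fallback: shift left until the top bit is set. */
        int n = 0;

        while ((x & 0x80000000u) == 0)
        {
            x <<= 1;
            n++;
        }
        return n;
    }
#endif
}
```

Keying on a feature test rather than on the compiler version means
that a compiler which lacks the builtin simply takes the fallback path.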

pixman-glyph.c: Add __force_align_arg_pointer to composite functions

The functions pixman_composite_glyphs_no_mask() and
pixman_composite_glyphs() can call into code compiled with -msse2,
which requires the stack to be aligned to 16 bytes. Since the ABIs on
Windows and Linux for x86-32 don't provide this guarantee, we need to
use this attribute to make GCC generate a prologue that realigns the
stack.

This fixes the crash introduced in the previous commit and also

https://bugs.freedesktop.org/show_bug.cgi?id=70348

and

https://bugs.freedesktop.org/show_bug.cgi?id=68300

utils.c: On x86-32 unalign the stack before calling test_function

GCC when compiling with -msse2 and -mssse3 will assume that the stack
is aligned to 16 bytes even on x86-32 and accordingly issue movdqa
instructions for stack allocated variables.

But despite what GCC thinks, the standard ABI on x86-32 only requires
a 4-byte aligned stack. This is true at least on Windows, but there
also was (and maybe still is) Linux code in the wild that assumed
this. When such code calls into pixman and hits something compiled
with -msse2, we get a segfault from the unaligned movdqas.

Pixman has worked around this issue in the past with the gcc attribute
"force_align_arg_pointer" but the problem has resurfaced now in

https://bugs.freedesktop.org/show_bug.cgi?id=68300

because pixman_composite_glyphs() is missing this attribute.

This patch makes fuzzer_test_main() call the test_function through a
trampoline, which, on x86-32, has a bit of assembly that deliberately
avoids aligning the stack to 16 bytes as GCC normally expects. The
result is that glyph-test now crashes.

V2: Mark caller-save registers as clobbered, rather than using
noinline on the trampoline.

configure.ac: check and use -Wdeclaration-after-statement GCC option

The accidental use of declaration after statement breaks compilation
with C89 compilers such as MSVC. Assuming that MSVC is one of the
supported compilers, it makes sense to ask GCC to at least report
warnings for such problematic code.
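
For illustration, a small function written in the C89-compatible style
that the warning enforces, with the rejected form shown in a comment.
The function itself is made up for this example.

```c
/*
 * Rejected by strict C89 compilers and flagged by
 * -Wdeclaration-after-statement:
 *
 *     int total;
 *     total = 0;
 *     int i;          <- declaration after a statement
 */

/* C89-compatible: every declaration precedes the first statement
 * of the block. */
static int
sum_to (int n)
{
    int i;
    int total;

    total = 0;
    for (i = 1; i <= n; i++)
        total += i;

    return total;
}
```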

sse2: bilinear fast path for src_x888_8888

Running cairo-perf-trace benchmark on Intel Core2 T7300:

Before:
[  0]    image    t-firefox-canvas-swscroll    1.989    2.008   0.43%    8/8
[  1]    image        firefox-canvas-scroll    4.574    4.609   0.50%    8/8

After:
[  0]    image    t-firefox-canvas-swscroll    1.404    1.418   0.51%    8/8
[  1]    image        firefox-canvas-scroll    4.228    4.259   0.36%    8/8

configure.ac: Add check for pmulhuw assembly

Clang 3.0 chokes on the following bit of assembly

    asm ("pmulhuw %1, %0\n\t"
        : "+y" (__A)
        : "y" (__B)
    );

from pixman-mmx.c with this error message:

    fatal error: error in backend: Unsupported asm: input constraint
        with a matching output constraint of incompatible type!

So add a check in configure to only enable MMX when the compiler can
deal with it.

scale.c: Use int instead of kernel_t for values in named_int_t

The 'value' field in the 'named_int_t' struct is used for both
pixman_repeat_t and pixman_kernel_t values, so the type should be int,
not pixman_kernel_t.

Fixes some warnings like this

scale.c:124:33: warning: implicit conversion from enumeration
      type 'pixman_repeat_t' to different enumeration type
      'pixman_kernel_t' [-Wconversion]
    { "None",                   PIXMAN_REPEAT_NONE },
    ~                           ^~~~~~~~~~~~~~~~~~

when compiled with clang.

pixman-combine32.c: Make Color Burn routine follow the math more closely

For superluminescent destinations, the old code could underflow in

uint32_t r = (ad - d) * as / s;

when (ad - d) was negative. The new code avoids this problem (and
therefore causes changes in the checksums of thread-test and
blitters-test), but it is likely still buggy due to the use of
unsigned variables and other issues in the blend mode code.
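
The underflow in the quoted expression can be reproduced in isolation.
This sketch evaluates the same expression with unsigned and with
signed operands. The function names are hypothetical, and the sketch
only demonstrates the arithmetic, not the code in the patch.

```c
#include <stdint.h>

/* The expression as quoted above, with unsigned operands:
 * a negative (ad - d) wraps around to a huge value. */
static uint32_t
term_unsigned (uint32_t ad, uint32_t d, uint32_t as, uint32_t s)
{
    return (ad - d) * as / s;
}

/* The same expression with signed operands keeps its sign. */
static int32_t
term_signed (int32_t ad, int32_t d, int32_t as, int32_t s)
{
    return (ad - d) * as / s;
}
```

For ad = 100, d = 110, as = 255 and s = 128, the signed version yields
-19, while the unsigned version yields a value far above the 8-bit
range.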

pixman-combine32: Make Color Dodge routine follow the math more closely

Change blend_color_dodge() to follow the math in the comment more
closely.

Note, the new code here is in some sense worse than the old code
because it can now underflow the unsigned variables when the source is
superluminescent and (as - s) is therefore negative. The old code was
careful to clamp to 0.

But for superluminescent variables we really need the ability for the
blend function to become negative, and so the solution to the underflow
problem is to just use signed variables. The use of unsigned variables
is a general problem in all of the blend mode code that will have to
be solved later.

The CRC32 values in thread-test and blitters-test are updated to
account for the changes in output.

pixman-combine32: Rename a number of variables from sa/sca to as/s

There are no semantic changes, just variable renames. The motivation
for these renames is so that the names are shorter and better match
the ones used in the comments.

pixman-combine32: Improve documentation for blend mode operators

This commit overhauls the comments in pixman-combine32.c regarding
blend modes:

- Add a link to the PDF supplement that clarifies the specification of
ColorBurn and ColorDodge

- Clarify how the formulas for premultiplied colors are derived from
the ones in the PDF specifications

- Write out the derivation of the formulas in each blend routine

pixman-combine32.c: Formatting fixes

Fix a bunch of spacing issues.

V2: More spacing issues, in the _ca combiners

Fix thread-test on non-OpenMP systems

The non-reentrant versions of prng_* functions are thread-safe only in
OpenMP-enabled builds.

Fixes thread-test failing when compiled with Clang (both on Linux and
on MacOS).

Add support for SSSE3 to the MSVC build system

Handle SSSE3 just like MMX and SSE2.

Fix build of check-formats on MSVC

Fixes

check-formats.obj : error LNK2019: unresolved external symbol
_strcasecmp referenced in function _format_from_string

check-formats.obj : error LNK2019: unresolved external symbol
_snprintf referenced in function _list_operators

Fix building of "other" programs on MSVC

In d1434d112ca5cd325e4fb85fc60afd1b9e902786 the benchmarks have been
extended to include other programs as well and the variable names have
been updated accordingly in the autotools-based build system, but not
in the MSVC one.

Fix build on MSVC

After a4c79d695d52c94647b1aff78548e5892d616b70 the MMX and SSE2 code
has some declarations after the beginning of a block, which is not
allowed by MSVC.

Fixes multiple errors like:

pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type
as an expression

pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this
type as an expression

fast: Swap image and iter flags in generated fast paths

The generated fast paths that were moved into the 'fast'
implementation in ec0e38cbb746a673f8e989ab8eae356c8c77dac7 had their
image and iter flag arguments swapped; as a result, none of the fast
paths were ever called.

vmx: there is no need to handle unaligned destination anymore

So the redundant variables, memory reads/writes and reshuffles
can be safely removed. For example, this makes the inner loop
of the 'vmx_combine_add_u_no_mask' function much simpler.

Before:

    7a20:7d a8 48 ce lvx     v13,r8,r9
    7a24:7d 80 48 ce lvx     v12,r0,r9
    7a28:7d 28 50 ce lvx     v9,r8,r10
    7a2c:7c 20 50 ce lvx     v1,r0,r10
    7a30:39 4a 00 10 addi    r10,r10,16
    7a34:10 0d 62 eb vperm   v0,v13,v12,v11
    7a38:10 21 4a 2b vperm   v1,v1,v9,v8
    7a3c:11 2c 6a eb vperm   v9,v12,v13,v11
    7a40:10 21 4a 00 vaddubs v1,v1,v9
    7a44:11 a1 02 ab vperm   v13,v1,v0,v10
    7a48:10 00 0a ab vperm   v0,v0,v1,v10
    7a4c:7d a8 49 ce stvx    v13,r8,r9
    7a50:7c 00 49 ce stvx    v0,r0,r9
    7a54:39 29 00 10 addi    r9,r9,16
    7a58:42 00 ff c8 bdnz+   7a20 <.vmx_combine_add_u_no_mask+0x120>

After:

    76c0:7c 00 48 ce lvx     v0,r0,r9
    76c4:7d a8 48 ce lvx     v13,r8,r9
    76c8:39 29 00 10 addi    r9,r9,16
    76cc:7c 20 50 ce lvx     v1,r0,r10
    76d0:10 00 6b 2b vperm   v0,v0,v13,v12
    76d4:10 00 0a 00 vaddubs v0,v0,v1
    76d8:7c 00 51 ce stvx    v0,r0,r10
    76dc:39 4a 00 10 addi    r10,r10,16
    76e0:42 00 ff e0 bdnz+   76c0 <.vmx_combine_add_u_no_mask+0x120>

vmx: align destination to fix valgrind invalid memory writes

The SIMD optimized inner loops in the VMX/Altivec code are trying
to emulate unaligned accesses to the destination buffer. For each
4 pixels (which fit into a 128-bit register) the current
implementation:
  1. first performs two aligned reads, which cover the needed data
  2. reshuffles bytes to get the needed data in a single vector register
  3. does all the necessary calculations
  4. reshuffles bytes back to their original location in two registers
  5. performs two aligned writes back to the destination buffer

Unfortunately, if the destination buffer is unaligned and
the width is a perfect multiple of 4 pixels, we may have some writes
crossing the boundaries of the destination buffer. In a multithreaded
environment this may potentially corrupt the data outside of the
destination buffer if it is concurrently read and written by some
other thread.

The valgrind report for blitters-test is full of:

==23085== Invalid write of size 8
==23085==    at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
==23085==    by 0x100446EF: general_composite_rect (pixman-general.c:214)
==23085==    by 0x10002537: test_composite (blitters-test.c:363)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)
==23085==  Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
==23085==    at 0x4051DA0: memalign (vg_replace_malloc.c:581)
==23085==    by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
==23085==    by 0x10004CFF: aligned_malloc (utils.c:833)
==23085==    by 0x10001DCB: create_random_image (blitters-test.c:47)
==23085==    by 0x10002263: test_composite (blitters-test.c:283)
==23085==    by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085==    by 0x10004943: fuzzer_test_main (utils.c:728)
==23085==    by 0x10002C17: main (blitters-test.c:397)

This patch addresses the problem by first aligning the destination
buffer at a 16 byte boundary in each combiner function. This trick
is borrowed from the pixman SSE2 code.
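
The alignment trick can be sketched in scalar C. This is a simplified
illustration with hypothetical names. The middle loop only stands in
for the vector code, which would use aligned 128-bit loads and stores
on the destination.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-channel saturating add of two 8888 pixels. */
static uint32_t
add_saturate_8888 (uint32_t d, uint32_t s)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((d >> shift) & 0xff) + ((s >> shift) & 0xff);

        if (c > 0xff)
            c = 0xff;
        result |= c << shift;
    }
    return result;
}

/* Number of pixels to process one at a time before dest reaches
 * a 16-byte boundary (or the scanline ends). */
static size_t
leading_pixels (const uint32_t *dest, size_t width)
{
    size_t n = 0;

    while (n < width && ((uintptr_t) (dest + n) & 15) != 0)
        n++;
    return n;
}

static void
combine_add (uint32_t *dest, const uint32_t *src, size_t width)
{
    size_t head = leading_pixels (dest, width);
    size_t i = 0, j;

    for (; i < head; i++)               /* unaligned head */
        dest[i] = add_saturate_8888 (dest[i], src[i]);

    for (; i + 4 <= width; i += 4)      /* dest + i is 16-byte aligned */
        for (j = 0; j < 4; j++)
            dest[i + j] = add_saturate_8888 (dest[i + j], src[i + j]);

    for (; i < width; i++)              /* tail of fewer than 4 pixels */
        dest[i] = add_saturate_8888 (dest[i], src[i]);
}
```

Because the head and the tail are handled pixel by pixel, no store
touches memory outside the destination scanline.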

This allows the new thread-test to pass on PowerPC VMX/Altivec systems and
also resolves the "make check" failure reported for POWER7 hardware:
    http://lists.freedesktop.org/archives/pixman/2013-August/002871.html

test: Add new thread-test program

This test program allocates an array of 16 * 7 uint32_ts and spawns 16
threads that each use 7 of the allocated uint32_ts as a destination
image for a large number of composite operations. Each thread then
computes and returns a checksum for the image. Finally, the main
thread computes a checksum of the checksums and verifies that it
matches expectations.

The purpose of this test is to catch errors where memory outside images
is read and then written back. Such out-of-bounds accesses are broken
when multiple threads are involved, because the threads will race to
read and write the shared memory.

V2:
- Incorporate fixes from Siarhei for endianness and undefined behavior
  regarding argument evaluation
- Make the images 7 pixels wide since the bug only happens when the
  composite width is greater than 4.
- Compute a checksum of the checksums so that you don't have to
  update 16 values if something changes.

V3: Remove stray dollar sign

Rename HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS

The test for pthread_setspecific() can be used as a general test for
whether pthreads are available, so rename the variable from
HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when
better support for thread local variables is available.

However, the pthread arguments are still only added to CFLAGS and
LDFLAGS when pthread_setspecific() is used for thread local variables.

V2: AC_SUBST(PTHREAD_CFLAGS)

blitters-test: Remove unused variable

utils.c: Make image_endian_swap() deal with negative strides

Use a temporary variable s containing the absolute value of the stride
as the upper bound in the inner loops.

V2: Do this for the bpp == 16 case as well

utils.c: Make print_image actually cope with negative strides

Commit 4312f077365bf9f59423b1694136089c6da6216b claimed to have made
print_image() work with negative strides, but it didn't actually
work. When the stride was negative, the image buffer would be accessed
as if the stride were positive.

Fix the bug by not changing the stride variable and instead using a
temporary, s, that contains the absolute value of stride.
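
A minimal sketch of the pattern, using a hypothetical byte-summing
function rather than print_image() itself. The signed stride is kept
for row addressing, and its absolute value s bounds the inner loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum every byte of an image whose stride (in bytes) may be negative.
 * first_row points at row 0, which is the highest-addressed row when
 * the stride is negative. */
static unsigned long
sum_image_bytes (const uint8_t *first_row, int height, int stride)
{
    int s = abs (stride);       /* bytes per row, always positive */
    unsigned long total = 0;
    int x, y;

    for (y = 0; y < height; y++)
    {
        /* The signed stride is used for addressing only. */
        const uint8_t *row = first_row + (ptrdiff_t) y * stride;

        for (x = 0; x < s; x++)
            total += row[x];
    }
    return total;
}
```

Walking the same 3 x 4 byte buffer top-down with a stride of 4, or
bottom-up from its last row with a stride of -4, visits the same bytes.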

Move generated affine fetchers into pixman-fast-path.c

The generated fetchers for NEAREST, BILINEAR, and
SEPARABLE_CONVOLUTION filters are fast paths and so they belong in
pixman-fast-path.c

Move bits_image_fetch_bilinear_no_repeat_8888 into pixman-fast-path.c

This iterator is really a fast path, so it belongs in the fast path
implementation.

fast, ssse3: Simplify logic to fetch lines in the bilinear iterators

Instead of having logic to swap the lines around when one of them
doesn't match, store the two lines in an array and use the least
significant bit of the y coordinate as the index into that
array. Since the two lines always have different least significant
bits, they will never collide.

The effect is that lines corresponding to even y coordinates are
stored in info->lines[0] and lines corresponding to odd y coordinates
are stored in info->lines[1].
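
The indexing scheme can be sketched as follows. The types and the
fetch counter are hypothetical and exist only to make the caching
behaviour visible. The real iterators store scaled pixel data per line.

```c
/* Hypothetical cache entry: which source line is currently stored. */
typedef struct
{
    int y;                      /* cached source line, or -1 if empty */
} line_t;

typedef struct
{
    line_t lines[2];            /* [0]: even y, [1]: odd y */
    int    fetch_count;         /* how often a line had to be fetched */
} bilinear_info_t;

static line_t *
get_line (bilinear_info_t *info, int y)
{
    line_t *line = &info->lines[y & 1];

    if (line->y != y)
    {
        /* A real iterator would scale source line y into the
         * cache entry here. */
        line->y = y;
        info->fetch_count++;
    }
    return line;
}
```

Fetching lines 4 and 5 and then lines 5 and 6 performs three fetches
in total: line 5 is still cached in lines[1], and line 6 replaces
line 4 in lines[0].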

test: Test negative strides

Pixman supports negative strides, but up until now they haven't been
tested outside of stress-test. This commit adds testing of negative
strides to blitters-test, scaling-test, affine-test, rotate-test, and
composite-traps-test.

test: Share the image printing code

The affine-test, blitters-test, and scaling-test all have the ability
to print out the bytes of the destination image. Share this code by
moving it to utils.c.

At the same time make the code work correctly with negative strides.

{scaling,affine,composite-traps}-test: Use compute_crc32_for_image()

By using this function instead of compute_crc32() the alpha masking
code and the call to image_endian_swap() are not duplicated.

pixman-filter.c: Use 65536, not 65535, for fixed point conversion

Converting a double precision number to 16.16 fixed point should be
done by multiplying with 65536.0, not 65535.0.

The bug could potentially cause certain filters that would otherwise
leave the image bit-for-bit unchanged under an identity
transformation, to not do so, but the numbers are close enough that
there weren't any visual differences.
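
The difference between the two scale factors is easy to see in
isolation. The type and function names below are hypothetical.

```c
#include <stdint.h>

typedef int32_t fixed_16_16_t;  /* 16 integer bits, 16 fraction bits */

/* Correct: 1.0 corresponds to 1 << 16 = 65536. */
static fixed_16_16_t
double_to_fixed (double d)
{
    return (fixed_16_16_t) (d * 65536.0);
}

/* The old scale factor is off by one part in 65536. */
static fixed_16_16_t
double_to_fixed_buggy (double d)
{
    return (fixed_16_16_t) (d * 65535.0);
}
```

With the old factor, 1.0 converts to 65535, which is slightly less
than one in 16.16 fixed point. This is why a filter weight meant to
be exactly 1 could fail to reproduce the image bit for bit.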

demos/scale.ui: Allow subsample_bits to be 0

The separable convolution filter supports a subsample_bits of 0 which
corresponds to no subsampling at all, so allow this value to be used
in the scale demo.

ssse3: Add iterator for separable bilinear scaling

This new iterator uses the SSSE3 instructions pmaddubsw and pabsw to
implement a fast iterator for bilinear scaling.

There is a graph here recording the per-pixel time for various
bilinear scaling algorithms as reported by scaling-bench:

    http://people.freedesktop.org/~sandmann/ssse3.v2/ssse3.v2.png

As the graph shows, this new iterator is clearly faster than the
existing C iterator, and when used with an SSE2 combiner, it is also
faster than the existing SSE2 fast paths for upscaling, though not for
downscaling.

Another graph:

    http://people.freedesktop.org/~sandmann/ssse3.v2/movdqu.png

shows the difference between writing to iter->buffer with movdqa,
movdqu on an aligned buffer, and movdqu on a deliberately unaligned
buffer. Since the differences are very small, the patch here avoids
using movdqa because imposing alignment restrictions on iter->buffer
may interfere with other optimizations, such as writing directly to
the destination image.

The data was measured with scaling-bench on a Sandy Bridge Core
i3-2350M @ 2.3GHz and is available in this directory:

    http://people.freedesktop.org/~sandmann/ssse3.v2/

where there is also a Gnumeric spreadsheet ssse3.v2.gnumeric
containing the per-pixel values and the graph.

V2:
- Use uintptr_t instead of unsigned long in the ALIGN macro
- Use _mm_storel_epi64 instead of _mm_cvtsi128_si64 as the latter form
  is not available on x86-32.
- Use _mm_storeu_si128() instead of _mm_store_si128() to avoid
  imposing alignment requirements on iter->buffer

Add empty SSSE3 implementation

This commit adds a new, empty SSSE3 implementation and the associated
build system support.

configure.ac:   detect whether the compiler understands SSSE3
                intrinsics and set up the required CFLAGS

Makefile.am:    Add libpixman-ssse3.la

pixman-x86.c:   Add X86_SSSE3 feature flag and detect it in
                detect_cpu_features().

pixman-ssse3.c: New file with an empty SSSE3 implementation

V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris
support is added.

general: Ensure that iter buffers are aligned to 16 bytes

At the moment iter buffers are only guaranteed to be aligned to a 4
byte boundary. SIMD implementations benefit from the buffers being
aligned to 16 bytes, so ensure this is the case.

V2:
- Use uintptr_t instead of unsigned long
- allocate 3 * SCANLINE_BUFFER_LENGTH bytes on the stack rather than just
SCANLINE_BUFFER_LENGTH
- use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH
to determine overflow
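
Rounding a pointer up to the next 16-byte boundary via uintptr_t can
be sketched like this. The helper names are hypothetical and the code
is not taken from the patch.

```c
#include <stdint.h>

/* Round p up to the next multiple of 16 (p itself if already aligned).
 * The caller must have allocated at least 15 spare bytes. */
static uint8_t *
align_up_16 (uint8_t *p)
{
    return (uint8_t *) (((uintptr_t) p + 15) & ~(uintptr_t) 15);
}

static int
is_aligned_16 (const void *p)
{
    return ((uintptr_t) p & 15) == 0;
}
```

The arithmetic uses uintptr_t rather than unsigned long because
unsigned long is not guaranteed to be as wide as a pointer, for
example on 64-bit Windows.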

sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA)

The loops are already unrolled, so it was just a matter of packing
4 pixels into a single XMM register and doing aligned 128-bit
writes to memory via MOVDQA instructions for the SRC compositing
operator fast path. For the other fast paths, this XMM register
is also directly routed to further processing instead of doing
extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
which results in a clear performance improvement.

There are also some other (less important) tweaks:

1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
   index for addressing memory. The problem is that 'pixman_fixed_t'
   is a 32-bit data type and it has to be extended to 64-bit
   offsets, which needs extra instructions on 64-bit systems.

2. Recalculate the horizontal interpolation weights only
   once per 4 pixels by treating the XMM register as four pairs
   of 16-bit values. Each of these 16-bit/16-bit pairs can be
   replicated to fill the whole 128-bit register by using PSHUFD
   instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions
   per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels
   (or "3 PADDW/PSRLW" per each pixel).

   Now a good question is whether replacing "9 PADDW/PSRLW" with
   "4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD
   instructions are very fast on new Intel processors (including
   Atoms), but are rather slow on the first generation of Core2
   (Merom) and on the other processors from that time or older.
   A good instructions latency/throughput table, covering all the
   relevant processors, can be found at:
        http://www.agner.org/optimize/instruction_tables.pdf

   Enabling this optimization is controlled by the PSHUFD_IS_FAST
   define in "pixman-sse2.c".

3. One use of PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in
   the older code has been also replaced by PUNPCKLQDQ equivalent
   (_mm_unpacklo_epi64 intrinsic) in PSHUFD_IS_FAST=0 configuration.
   The PUNPCKLQDQ instruction is usually faster on older processors,
   but has some side effects (instead of fully overwriting the
   destination register like PSHUFD does, it retains half of the
   original value, which may inhibit some compiler optimizations).

Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on
x86-64 system and default optimizations. The results are in MPix/s:

====== Intel Core2 T7300 (2GHz) ======

old:                     src_8888_8888 =  L1: 128.69  L2: 125.07  M:124.86
                        over_8888_8888 =  L1:  83.19  L2:  81.73  M: 80.63
                      over_8888_n_8888 =  L1:  79.56  L2:  78.61  M: 77.85
                      over_8888_8_8888 =  L1:  77.15  L2:  75.79  M: 74.63

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 168.67  L2: 163.26  M:162.44
                        over_8888_8888 =  L1: 102.91  L2: 100.43  M: 99.01
                      over_8888_n_8888 =  L1:  97.40  L2:  95.64  M: 94.24
                      over_8888_8_8888 =  L1:  98.04  L2:  95.83  M: 94.33

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 154.67  L2: 149.16  M:148.48
                        over_8888_8888 =  L1:  95.97  L2:  93.90  M: 91.85
                      over_8888_n_8888 =  L1:  93.18  L2:  91.47  M: 90.15
                      over_8888_8_8888 =  L1:  95.33  L2:  93.32  M: 91.42

====== Intel Core i7 860 (2.8GHz) ======

old:                     src_8888_8888 =  L1: 323.48  L2: 318.86  M:314.81
                        over_8888_8888 =  L1: 187.38  L2: 186.74  M:182.46

new (PSHUFD_IS_FAST=0):  src_8888_8888 =  L1: 373.06  L2: 370.94  M:368.32
                        over_8888_8888 =  L1: 217.28  L2: 215.57  M:211.32

new (PSHUFD_IS_FAST=1):  src_8888_8888 =  L1: 401.98  L2: 397.65  M:395.61
                        over_8888_8888 =  L1: 218.89  L2: 217.56  M:213.48

The most interesting benchmark is "src_8888_8888" (because this code can
be reused for a generic non-separable SSE2 bilinear fetch iterator).

The results show that PSHUFD instructions are bad for Intel Core2 T7300
(Merom core) and good for Intel Core i7 860 (Nehalem core). Both of these
processors support SSSE3 instructions though, so they are not the primary
targets for SSE2 code. But without having any other more relevant hardware
to test, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code
and old processors (until the runtime CPU features detection becomes
clever enough to recognize different microarchitectures).

(Rebased on top of patch that removes support for 8-bit bilinear
filtering -ssp)

test: safeguard the scaling-bench test against COW

The calloc call from pixman_image_create_bits may still
rely on http://en.wikipedia.org/wiki/Copy-on-write
Explicitly initializing the destination image results in
a more predictable behaviour.

V2:
- allocate 16 bytes aligned buffer with aligned stride instead
of delegating this to pixman_image_create_bits
- use memset for the allocated buffer instead of pixman solid fill
- repeat tests 3 times and select best results in order to filter
out even more measurement noise

Drop support for 8-bit precision in bilinear filtering

The default has been 7-bit for a while now, and the quality
improvement with 8-bit precision is not enough to justify keeping the
code around as a compile-time option.

Make the first argument to scanline fetchers have type bits_image_t *

Scanline fetchers haven't been used for images other than bits for a
long time, so by making the type reflect this fact, a bit of casting
can be saved in various places.

iwmmxt: Disallow if gcc version is < 4.8.

Later versions of gcc-4.7.x are capable of generating iwMMXt
instructions properly, but gcc-4.8 contains better support and other
fixes, including iwMMXt in conjunction with hardfp. The existing 4.5
requirement was based on attempts to have OLPC use a patched gcc to
build pixman. Let's just require gcc-4.8.

fast_bilinear_cover_init: Don't install a finalizer on the error path

No memory is allocated in the error case, so a finalizer is not
necessary, and will cause problems if the data pointer is not
initialized to NULL.

Add an iterator that can fetch bilinearly scaled images

This new iterator works in a separable way; that is, for a destination
scanline, it scales the two involved source scanlines and then caches
them so that they can be reused for the next destination scanlines.

There are two versions of the code, one that uses 64 bit arithmetic,
and one that uses 32 bit arithmetic only. The latter version is
used on 32 bit systems, where it is expected to be faster.

This scheme saves a substantial amount of arithmetic for larger
scalings; the per-pixel times for various configurations as reported
by scaling-bench are graphed here:

http://people.freedesktop.org/~sandmann/separable.v2/v2.png

The "sse2" graph is current default on x86, "mmx" is with sse2
disabled, "old c" is with sse2 and mmx disabled. The "new 32" and "new
64" graphs show times for the new code. As the graphs show, the 64 bit
version of the new code beats the "old c" for all scaling ratios.

The data was taken on a Sandy Bridge Core i3-2350M CPU @ 2.0 GHz
running in 64 bit mode.

The data used to generate the graph is available in this directory:

http://people.freedesktop.org/~sandmann/separable.v2/

There is also a Gnumeric spreadsheet v2.gnumeric containing the
per-pixel values and the graph.

V2:
- Add error message in the OOM/bad matrix case
- Save some shifts by storing the cached scanlines in AGBR order
- Special cased version that uses 32 bit arithmetic when sizeof(long) <= 4

Add support for iter finalizers

Iterators may sometimes need to allocate auxiliary memory. In order to
be able to free this memory, optional iterator finalizers are
required.
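
A minimal sketch of the idea, with hypothetical type and function
names rather than pixman's actual iterator struct. The iterator
carries an optional finalizer that is installed only when auxiliary
memory was actually allocated.

```c
#include <stdlib.h>

typedef struct iter_t iter_t;
typedef void (*iter_fini_t) (iter_t *iter);

struct iter_t
{
    void        *data;  /* auxiliary memory owned by the iterator */
    iter_fini_t  fini;  /* optional; NULL when nothing must be freed */
};

static void
free_data (iter_t *iter)
{
    free (iter->data);
    iter->data = NULL;
}

/* Returns 1 on success, 0 on allocation failure. */
static int
iter_init (iter_t *iter, size_t aux_size)
{
    iter->data = NULL;
    iter->fini = NULL;

    if (aux_size == 0)
        return 1;

    iter->data = malloc (aux_size);
    if (!iter->data)
        return 0;       /* error path: no finalizer is installed */

    iter->fini = free_data;
    return 1;
}

/* Called by the owner when it is done with the iterator. */
static void
iter_fini (iter_t *iter)
{
    if (iter->fini)
        iter->fini (iter);
}
```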

test/scaling-bench.c: New benchmark for bilinear scaling

This new benchmark scales a 320 x 240 test a8r8g8b8 image by all
ratios from 0.1, 0.2, ... up to 10.0 and reports the time it took
to do each of the scaling operations, and the time spent per
destination pixel.

The times reported for the scaling operations are given in
milliseconds, the times-per-pixel are in nanoseconds.

V2: Format output better

RELEASING: Add note about changing the topic of the #cairo IRC channel

test: fix matrix-test on big endian systems

test: Fix build on MSVC

The MSVC compiler is very strict about variable declarations after
statements.

Move all the declarations of each block before any statement in the
same block to fix multiple instances of:

alpha-loop.c(XX) : error C2275: 'pixman_image_t' : illegal use of this
type as an expression

Require GTK+ version >= 2.16

I got a build failure on my system:

lcc: "scale.c", line 374: warning: function "gtk_scale_add_mark" declared
          implicitly [-Wimplicit-function-declaration]
      gtk_scale_add_mark (GTK_SCALE (widget), 0.0, GTK_POS_LEFT, NULL);
      ^

  CCLD   scale
scale.o: In function `app_new':
(.text+0x23e4): undefined reference to `gtk_scale_add_mark'
scale.o: In function `app_new':
(.text+0x250c): undefined reference to `gtk_scale_add_mark'
scale.o: In function `app_new':
(.text+0x2634): undefined reference to `gtk_scale_add_mark'
make[2]: *** [scale] Error 1
make[2]: Target `all' not remade because of errors.

$ pkg-config --modversion gtk+-2.0
2.12.1

demos/scale.c calls gtk_scale_add_mark(), which is only available in
GTK+ 2.16 and later. Either scale.c has to be rewritten to support
older GTK+, or a newer version of GTK+ has to be required, as this
patch does.

configure.ac: Don't use '+=' since it's not POSIX

Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Matthieu Herrb <matthieu.herrb@laas.fr>

Consolidate all the iter_init_bits_stride functions

The SSE2, MMX, and fast implementations all have a copy of the
function iter_init_bits_stride that computes an image buffer and
stride.

Move that function to pixman-utils.c and share it among all the
implementations.

Delete the old src/dest_iter_init() functions

Now that we are using the new _pixman_implementation_iter_init(), the
old _src/_dest_iter_init() functions are no longer needed, so they can
be deleted, and the corresponding fields in pixman_implementation_t
can be removed.

Add _pixman_implementation_iter_init() and use instead of _src/_dest_init()

A new field, 'iter_info', is added to the implementation struct, and
all the implementations store a pointer to their iterator tables in
it. A new function, _pixman_implementation_iter_init(), is then added
that searches those tables, and the new function is called in
pixman-general.c and pixman-image.c instead of the old
_pixman_implementation_src_init() and _pixman_implementation_dest_init().

general: Store the iter initializer in a one-entry pixman_iter_info_t table

In preparation for sharing all iterator initialization code from all
the implementations, move the general implementation to use a table of
pixman_iter_info_t.

The existing src_iter_init and dest_iter_init functions are
consolidated into one general_iter_init() function that checks the
iter_flags for whether it is dealing with a source or destination
iterator.

Unlike in the other implementations, the general_iter_init() function
stores its own get_scanline() and write_back() functions in the
iterator, so it relies on the initializer being called after
get_scanline and write_back have been copied from the struct to the
iterator.

fast: Replace the fetcher_info_t table with a pixman_iter_info_t table

Similar to the SSE2 and MMX patches, this commit replaces a table of
fetcher_info_t with a table of pixman_iter_info_t, and similar to the
noop patch, both fast_src_iter_init() and fast_dest_iter_init() are
now doing exactly the same thing, so their code can be shared in a new
function called fast_iter_init_common().

mmx: Replace the fetcher_info_t table with a pixman_iter_info_t table

Similar to the SSE2 commit, information about the iterators is stored
in a table of pixman_iter_info_t.

sse2: Replace the fetcher_info_t table with a pixman_iter_info_t table

Similar to the changes to noop, put all the iterators into a table of
pixman_iter_info_t and then do a generic search of that table during
iterator initialization.

noop: Keep information about iterators in an array of pixman_iter_info_t

Instead of having a nest of if statements, store the information about
iterators in a table of a new struct type, pixman_iter_info_t, and
then walk that table when initializing iterators.

The new struct contains a format, a set of image flags, and a set of
iter flags, plus a pixman_iter_get_scanline_t, a
pixman_iter_write_back_t, and a new function type
pixman_iter_initializer_t.

If the iterator matches an entry, it is first initialized with the
given get_scanline and write_back functions, and then the provided
iter_initializer (if present) is run. Running the iter_initializer
after setting get_scanline and write_back allows the initializer to
override those fields if it wishes.

The table contains both source and destination iterators,
distinguished based on the recently-added ITER_SRC and ITER_DEST;
similarly, wide iterators are recognized with the ITER_WIDE
flag. Having both source and destination iterators in the table means
the noop_src_iter_init() and noop_dest_iter_init() functions become
identical, so this patch factors out their code in a new function
noop_iter_init_common() that both call.
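
The table walk described above could look roughly like the following
sketch. Everything prefixed with sk_ or SK_ is a hypothetical, simplified
stand-in for the real pixman types and flags, and the matching rule (format
equal or "any", all required flags present) is an assumption made for the
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real iter flags and formats. */
#define SK_ITER_NARROW (1u << 0)
#define SK_ITER_WIDE   (1u << 1)
#define SK_ITER_SRC    (1u << 2)
#define SK_ITER_DEST   (1u << 3)
#define SK_FORMAT_ANY  0u
#define SK_FORMAT_A8   1u

typedef struct sk_iter sk_iter_t;
typedef struct sk_iter_info sk_iter_info_t;
typedef uint32_t *(*sk_get_scanline_t) (sk_iter_t *iter);
typedef void (*sk_write_back_t) (sk_iter_t *iter);
typedef void (*sk_initializer_t) (sk_iter_t *iter, const sk_iter_info_t *info);

struct sk_iter
{
    uint32_t format, image_flags, iter_flags;
    sk_get_scanline_t get_scanline;
    sk_write_back_t write_back;
};

struct sk_iter_info
{
    uint32_t format, image_flags, iter_flags;
    sk_get_scanline_t get_scanline;
    sk_write_back_t write_back;
    sk_initializer_t initializer;   /* optional, may be NULL */
};

static uint32_t sk_line_a[1], sk_line_b[1];
static uint32_t *sk_fetch_plain (sk_iter_t *iter) { (void) iter; return sk_line_a; }
static uint32_t *sk_fetch_special (sk_iter_t *iter) { (void) iter; return sk_line_b; }

/* An initializer that overrides the get_scanline copied from the table. */
static void
sk_init_special (sk_iter_t *iter, const sk_iter_info_t *info)
{
    (void) info;
    iter->get_scanline = sk_fetch_special;
}

/* Source and destination entries live in the same table. */
static const sk_iter_info_t sk_table[] =
{
    { SK_FORMAT_A8,  0, SK_ITER_SRC | SK_ITER_NARROW, sk_fetch_plain, NULL, NULL },
    { SK_FORMAT_ANY, 0, SK_ITER_DEST, sk_fetch_plain, NULL, sk_init_special },
};

/* Walk the table; return the index of the first matching entry, or -1. */
static int
sk_iter_init (sk_iter_t *iter, const sk_iter_info_t *table, size_t n)
{
    size_t i;

    assert (iter != NULL && table != NULL);
    for (i = 0; i < n; i++)
    {
        const sk_iter_info_t *info = &table[i];

        if ((info->format == SK_FORMAT_ANY || info->format == iter->format) &&
            (info->image_flags & iter->image_flags) == info->image_flags &&
            (info->iter_flags & iter->iter_flags) == info->iter_flags)
        {
            /* Copy the functions first, then let the initializer override. */
            iter->get_scanline = info->get_scanline;
            iter->write_back = info->write_back;
            if (info->initializer)
                info->initializer (iter, info);
            return (int) i;
        }
    }
    return -1;
}

/* Helpers for trying the sketch out. */
static int
sk_match (uint32_t format, uint32_t iter_flags)
{
    sk_iter_t iter = { 0 };

    iter.format = format;
    iter.iter_flags = iter_flags;
    return sk_iter_init (&iter, sk_table, sizeof sk_table / sizeof sk_table[0]);
}

static int
sk_was_overridden (uint32_t format, uint32_t iter_flags)
{
    sk_iter_t iter = { 0 };

    iter.format = format;
    iter.iter_flags = iter_flags;
    sk_iter_init (&iter, sk_table, sizeof sk_table / sizeof sk_table[0]);
    return iter.get_scanline == sk_fetch_special;
}
```

In this sketch a narrow a8 source iterator matches the first entry, any
destination iterator matches the second entry and ends up with the
overriding get_scanline, and a wide source iterator matches nothing.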

The following patches in this series will change all the
implementations to use an iterator table, and then move the table
search code to pixman-implementation.c.

Always set the FAST_PATH_NO_ALPHA_MAP flag for non-BITS images

We only support alpha maps for BITS images, so it's always safe to ignore
the alpha map for non-BITS images. This makes it possible to get rid of
the check for SOLID images since it will now be subsumed by the check
for FAST_PATH_NO_ALPHA_MAP.

Opaque masks are reduced to NULL images in pixman.c, and those can
also safely be treated as not having an alpha map, so set the
FAST_PATH_NO_ALPHA_MAP bit for those as well.

Add ITER_WIDE iter flag

This will be useful for putting iterators into tables where they can
be looked up by iterator flags. Without this flag, wide iterators can
only be recognized by the absence of ITER_NARROW, which makes testing
for a match difficult.
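
To see why a positive flag helps, consider a lookup that only checks that
all flags required by a table entry are set. That matching rule and the
SK_/sk_ names below are assumptions made for this sketch, not pixman's
actual definitions:

```c
#include <assert.h>
#include <stdint.h>

#define SK_ITER_NARROW (1u << 0)
#define SK_ITER_WIDE   (1u << 1)   /* stand-in for the newly added flag */

/* Assumed rule: every flag the table entry requires must be set. */
static int
sk_flags_match (uint32_t required, uint32_t actual)
{
    return (required & actual) == required;
}

/* Small self-check of the cases discussed above. */
static void
sk_check_wide_flag (void)
{
    /* Without a WIDE flag, a "wide" entry can only require nothing,
     * so it also matches narrow iterators. */
    assert (sk_flags_match (0, SK_ITER_NARROW));

    /* With a WIDE flag, the entry matches wide iterators only. */
    assert (sk_flags_match (SK_ITER_WIDE, SK_ITER_WIDE));
    assert (!sk_flags_match (SK_ITER_WIDE, SK_ITER_NARROW));
}
```

A subset test of this kind cannot express "flag must be absent", which is
why the absence of ITER_NARROW is awkward to match on.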

Add ITER_SRC and ITER_DEST iter flags

These indicate whether the iterator is for a source or a destination
image. Note iterator initializers are allowed to rely on one of these
being set, so they can't be left out the way it's generally harmless
(aside from potential performance degradation) to leave out a
particular fast path flag.

Make use of image flag in noop iterators

Similar to c2230fe2aff, simply check against SAMPLES_COVER_CLIP_NEAREST
instead of comparing all the x/y/width/height parameters.

Use AC_LINK_IFELSE to check if the Loongson MMI code can link

The Loongson code is compiled with -march=loongson2f to enable the MMI
instructions, but binutils refuses to link object code compiled with
different -march settings, leading to link failures later in the
compile. This avoids that problem by checking if we can link code
compiled for Loongson.

Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>

mmx: Document implementation(s) of pix_multiply().

I look at that function and can never remember what it does or how it
manages to do it.

Fix broken build when HAVE_CONFIG_H is undefined, e.g. on Win32.

Build fix for platforms without a generated config.h, for example Win32.

Post-release version bump to 0.31.1

Pre-release version bump to 0.30.0

Post-release version bump to 0.29.5

Pre-release version bump to 0.29.4

pixman/refactor: Delete this file

Essentially all of it is obsolete by now.

MIPS: DSPr2: Added rpixbuf fast path.

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
rpixbuf = L1: 14.63 L2: 13.55 M: 9.91 ( 79.53%) HT: 8.47 VT: 8.32 R: 8.17 RT: 4.90 ( 33Kops/s)

Optimized:
rpixbuf = L1: 45.69 L2: 37.30 M: 17.24 (138.31%) HT: 15.66 VT: 14.88 R: 13.97 RT: 8.38 ( 44Kops/s)

MIPS: DSPr2: Added pixbuf fast path.

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
pixbuf = L1: 18.18 L2: 16.47 M: 13.36 (107.27%) HT: 10.16 VT: 10.07 R: 9.84 RT: 5.54 ( 35Kops/s)

Optimized:
pixbuf = L1: 43.54 L2: 36.02 M: 17.08 (137.09%) HT: 15.58 VT: 14.85 R: 13.87 RT: 8.38 ( 44Kops/s)

test: add "pixbuf" and "rpixbuf" to lowlevel-blt-bench

Add the necessary support to lowlevel-blt-bench for benchmarking the pixbuf and
rpixbuf fast paths. The bench_composite function now checks for the string
"pixbuf" in testname and, if it is found, uses the same bits for the src and mask images.
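
A check of this kind might look like the sketch below. The function name
sk_uses_shared_bits() is hypothetical, and using a plain substring search
is an assumption; the real bench_composite may be structured differently:

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the test name refers to a pixbuf-style operation,
 * in which case src and mask would share the same bits. */
static int
sk_uses_shared_bits (const char *testname)
{
    assert (testname != NULL);
    return strstr (testname, "pixbuf") != NULL;
}
```

Because "rpixbuf" contains "pixbuf", a single substring test covers both
names.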

test: add "src_0888_8888_rev" and "src_0888_0565_rev" to lowlevel-blt-bench

MIPS: DSPr2: Fix for bug in in_n_8 routine.

The rounding logic was not implemented correctly: logical shifts were used
instead of the rounding version of the 8-bit shift.
The code also used unnecessary multiplications, which can be avoided by packing
4 destination (a8) pixels into one 32-bit register, and there were unnecessary
spills to the stack. The code has been rewritten to address these issues.
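
The difference between a truncating and a rounding 8-bit multiply can be
illustrated in plain C. This is not the DSPr2 assembly; the sk_ function
names are hypothetical, and the rounding variant is the usual shift-and-add
way of dividing by 255 with rounding:

```c
#include <assert.h>
#include <stdint.h>

/* Truncating: a plain logical shift drops the fractional part. */
static uint32_t
sk_mul_un8_truncating (uint32_t a, uint32_t b)
{
    assert (a <= 0xFF && b <= 0xFF);
    return (a * b) >> 8;
}

/* Rounding: add half before shifting, then fold the high byte back in
 * so that the result approximates a * b / 255 rounded to nearest. */
static uint32_t
sk_mul_un8_rounding (uint32_t a, uint32_t b)
{
    uint32_t t;

    assert (a <= 0xFF && b <= 0xFF);
    t = a * b + 0x80;
    return ((t >> 8) + t) >> 8;
}
```

With both inputs at 0xFF the truncating version yields 0xFE, whereas the
rounding version yields the expected 0xFF.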

The bug was revealed by increasing the number of iterations in blitters-test.

Performance numbers on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
                   in_n_8 =  L1:  21.20  L2:  22.86  M: 21.42 ( 14.21%)  HT: 15.97  VT: 15.69  R: 15.47  RT:  8.00 (  48Kops/s)
Optimized (first implementation, with bug):
                   in_n_8 =  L1:  89.38  L2:  86.07  M: 65.48 ( 43.44%)  HT: 44.64  VT: 41.50  R: 40.77  RT: 16.94 (  66Kops/s)
Optimized (with bug fix, and code revisited):
                   in_n_8 =  L1: 102.33  L2:  95.65  M: 70.54 ( 46.84%)  HT: 48.35  VT: 45.06  R: 43.20  RT: 17.60 (  66Kops/s)

MIPS: DSPr2: Added src_0565_8888 nearest neighbor fast path.

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
src_0565_8888 = L1: 20.70 L2: 19.22 M: 12.50 ( 49.79%) HT: 10.45 VT: 10.18 R: 9.99 RT: 5.31 ( 31Kops/s)

Optimized:
src_0565_8888 = L1: 62.98 L2: 53.44 M: 23.07 ( 91.87%) HT: 19.85 VT: 19.15 R: 17.70 RT: 9.68 ( 43Kops/s)

MIPS: DSPr2: Added over_8888_0565 nearest neighbor fast path.

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
over_8888_0565 = L1: 13.22 L2: 12.02 M: 9.77 ( 38.92%) HT: 8.58 VT: 8.35 R: 8.38 RT: 5.78 ( 35Kops/s)

Optimized:
over_8888_0565 = L1: 26.20 L2: 22.97 M: 15.92 ( 63.40%) HT: 13.33 VT: 13.13 R: 12.72 RT: 7.65 ( 39Kops/s)

MIPS: DSPr2: Added over_8888_8888 nearest neighbor fast path.

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
over_8888_8888 = L1: 19.47 L2: 16.30 M: 11.24 ( 59.69%) HT: 9.54 VT: 9.29 R: 9.47 RT: 6.24 ( 37Kops/s)

Optimized:
over_8888_8888 = L1: 43.67 L2: 33.30 M: 16.32 ( 86.65%) HT: 14.10 VT: 13.78 R: 12.96 RT: 7.85 ( 39Kops/s)

MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines

After introducing the new PRNG (pseudorandom number generator), a bug in two DSPr2
routines was revealed. The bug manifested as wrong calculations in the composite and
glyph tests, which caused make check to fail for the MIPS DSPr2 optimizations.

The bug was in the calculation of:
*dst = over (src, *dst) when ma == 0xffffffff

In this case src was not negated and shifted right by 24 bits; it was only
negated. When implementing this routine in the first place, I misplaced those
shifts, which allowed me to combine the code for the over operation and:
    UN8x4_MUL_UN8x4 (s, ma);
    UN8x4_MUL_UN8 (ma, srca);
    ma = ~ma;
    UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
So I decided to rewrite that piece of code from scratch. I changed the logic so
that the assembly code now mimics the code from pixman-fast-path.c but processes
two pixels at a time. This code should be easier to debug and maintain.
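
For reference, the over operation in the ma == 0xffffffff case can be
sketched in plain C as follows. This is an illustration only, not the DSPr2
routine: it assumes premultiplied a8r8g8b8 pixels, and the sk_ helper names
are hypothetical. Note that the inverse source alpha is obtained by
negating src and then shifting right by 24 bits:

```c
#include <assert.h>
#include <stdint.h>

/* a * b / 255 with rounding, for 8-bit inputs. */
static uint32_t
sk_mul_un8 (uint32_t a, uint32_t b)
{
    uint32_t t;

    assert (a <= 0xFF && b <= 0xFF);
    t = a * b + 0x80;
    return ((t >> 8) + t) >> 8;
}

/* dst = src + dst * (255 - src_alpha) / 255, per channel, saturating. */
static uint32_t
sk_over (uint32_t src, uint32_t dst)
{
    uint32_t ia = (~src) >> 24;   /* negate, then shift right by 24 */
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        uint32_t c = sk_mul_un8 (d, ia) + s;

        if (c > 0xFF)
            c = 0xFF;
        result |= c << shift;
    }
    return result;
}
```

An opaque source replaces the destination, and a fully transparent source
leaves it unchanged, which makes for a quick sanity check of any optimized
version.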

The bug was revealed in commit b31a6962. Errors were detected by composite
and glyph tests.

sse2: faster bilinear interpolation (get rid of XOR instruction)

The old code was calculating horizontal weights for right pixels
in the following way (for simplicity assume 8-bit interpolation
precision):

  Start with "x = vx" and do increment "x += ux" after each pixel.
  In this case right pixel weight for interpolation can be calculated
  as "((x >> 8) ^ 0xFF) + 1", which is the same as "256 - (x >> 8)".

The new code instead:

  Starts with "x = -(vx + 1)", performs increment "x += -ux" after
  each pixel and calculates right weights as just "(x >> 8) + 1",
  eliminating the need for XOR operation in the inner loop.
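
The equivalence of the two schemes can be checked with a small plain C
sketch. It is not the SSE2 code: it assumes 8-bit precision as above,
masks the shifted value to its low 8 bits, and uses unsigned arithmetic so
that wraparound is well defined. The sk_ names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: x counts up from vx. */
static unsigned
sk_old_weight (uint32_t x)
{
    return (((x >> 8) & 0xFF) ^ 0xFF) + 1;
}

/* New scheme: x counts down from -(vx + 1), no XOR needed. */
static unsigned
sk_new_weight (uint32_t x)
{
    return ((x >> 8) & 0xFF) + 1;
}

/* Return 1 if both schemes produce the same weights for n pixels. */
static int
sk_weights_agree (uint32_t vx, uint32_t ux, int n)
{
    uint32_t x_old = vx;
    uint32_t x_new = (uint32_t) 0 - (vx + 1);   /* equals ~vx */
    int i;

    assert (n >= 0);
    for (i = 0; i < n; i++)
    {
        if (sk_old_weight (x_old) != sk_new_weight (x_new))
            return 0;
        x_old += ux;
        x_new -= ux;    /* "x += -ux" */
    }
    return 1;
}
```

The two counters stay bitwise complements of each other after every
step, so the XOR with 0xFF in the old scheme is already baked into the new
counter.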

So we have one instruction less on the critical path. Benchmarks
with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.7.2 on
x86-64 system and default optimizations:

Intel Core i7 860 (2.8GHz):
    before: src_8888_8888 =  L1: 291.37  L2: 288.58  M:285.38
    after:  src_8888_8888 =  L1: 319.66  L2: 316.47  M:312.06

Intel Core2 T7300 (2GHz):
    before: src_8888_8888 =  L1: 121.95  L2: 118.38  M:118.52
    after:  src_8888_8888 =  L1: 128.82  L2: 125.12  M:124.88

Intel Atom N450 (1.67GHz):
    before: src_8888_8888 =  L1:  64.25  L2:  62.37  M: 61.80
    after:  src_8888_8888 =  L1:  64.23  L2:  62.37  M: 61.82

Inspired by the "sse2_bilinear_interpolation" function (single
pixel interpolation) from:
    http://lists.freedesktop.org/archives/pixman/2013-January/002575.html

test: larger 0xFF/0x00 filled clusters in random images for blitters-test

The current blitters-test program had difficulty detecting a bug in
the over_n_8888_8888_ca implementation for MIPS DSPr2:

http://lists.freedesktop.org/archives/pixman/2013-March/002645.html

In order to hit the buggy code path, two consecutive mask values had
to be equal to 0xFFFFFFFF because of loop unrolling. The current
blitters-test generates random images in such a way that each byte
has a 25% probability of being 0xFF. Hence each 32-bit mask
value has a ~0.4% probability of being 0xFFFFFFFF. Because we are testing
many compositing operations with many pixels, encountering at least
one 0xFFFFFFFF mask value reasonably fast is not a problem. If a
bug related to the 0xFFFFFFFF mask value is artificially introduced into
the over_n_8888_8888_ca generic C function, it gets detected on iteration
675591 of blitters-test (out of 2000000).

However two consecutive 0xFFFFFFFF mask values are much less likely
to be generated, so the bug was missed by blitters-test.
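
The arithmetic behind these numbers can be sketched as follows, under the
assumption that the bytes are generated independently and that one looks at
one particular pair of adjacent mask values. The function name
sk_prob_all() is hypothetical:

```c
#include <assert.h>

/* Probability that `count` independent values all hit, each with
 * probability p. */
static double
sk_prob_all (double p, int count)
{
    double r = 1.0;
    int i;

    assert (p >= 0.0 && p <= 1.0 && count >= 0);
    for (i = 0; i < count; i++)
        r *= p;
    return r;
}
```

Here sk_prob_all (0.25, 4) is 1/256, or about 0.39%, for a single
0xFFFFFFFF value, while sk_prob_all (0.25, 8) is 1/65536, or about
0.0015%, for two such values in a row.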

This patch addresses the problem by also randomly setting the 32-bit
values in images to either 0xFFFFFFFF or 0x00000000 (also with 25%
probability). This allows larger clusters of consecutive 0x00
or 0xFF bytes in images, for which unrolled or SIMD-optimized code
may have special shortcuts.

Trivial spelling fixes in comments

They were found by codespell.

Signed-off-by: Stefan Weil <sw@weilnetz.de>

Check for missing sqrtf() as, e.g., for Solaris 9

Signed-off-by: Peter Breitenlohner <peb@mppmu.mpg.de>

Improve precision of calculations in pixman-gradient-walker.c

The computations in pixman-gradient-walker.c currently take place at
very limited 8 bit precision which results in quite visible artefacts
in gradients. An example is the one produced by demos/linear-gradient
which currently looks like this:

    http://i.imgur.com/kQbX8nd.png

With the changes in this commit, the gradient looks like this:

    http://i.imgur.com/nUlyuKI.png

The images are also available here:

    http://people.freedesktop.org/~sandmann/gradients/before.png
    http://people.freedesktop.org/~sandmann/gradients/after.png

This patch computes pixels using floating point, but uses a faster
algorithm, which makes up for the loss of performance.

== Theory:

In both the new and the old algorithm, the various gradient
implementations compute a parameter x that indicates how far along the
gradient the current scanline is. The current algorithm has a cache of
the two color stops surrounding the last parameter; those are used in
a SIMD-within-register fashion in this way:

    t1 = walker->left_rb * idist + walker->right_rb * dist;

where dist and idist are the distances to the left and right color
stops respectively normalized to the distance between the left and
right stops. The normalization (which involves a division) is captured
in another cached variable "stepper". The cached values are recomputed
whenever the parameter moves in between two different stops (called
"reset" in the implementation).

Because idist and dist are computed in 8 bits only, a lot of
information is lost, which is quite visible as the image linked above
shows.

The new algorithm caches more information in the following way. When
interpolating between stops, the formula to be used is this:

     t = ((x - left) / (right - left));

     result = lc * (1 - t) + rc * t;

where

    - x is the parameter as computed by the main gradient code,
    - left is the position of the left color stop,
    - right is the position of the right color stop
    - lc is the color of the left color stop
    - rc is the color of the right color stop

That formula can also be written like this:

    result
      = lc * (1 - t) + rc * t;
      = lc + (rc - lc) * t
      = lc + (rc - lc) * ((x - left) / (right - left))
      = (rc - lc) / (right - left) * x +
              lc - (left * (rc - lc)) / (right - left)
      = s * x + b

where

    s = (rc - lc) / (right - left)

and

    b = lc - left * (rc - lc) / (right - left)
      = (lc * (right - left) - left * (rc - lc)) / (right - left)
      = (lc * right - rc * left) / (right - left)

To summarize, setting w = (right - left):

    s = (rc - lc) / w
    b = (lc * right - rc * left) / w

    r = s * x + b

Since s and b only depend on the two active stops, both can be cached
so that the computation only needs to do one multiplication and one
addition per pixel (followed by premultiplication of the alpha
channel). That is, seven multiplications in total, which is the same
number as the old SIMD-within-register implementation had.
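
For a single channel, the cached form can be sketched in C as below. The
type and function names (sk_coeffs_t, sk_cache_coeffs, sk_eval,
sk_lerp_direct) are hypothetical and the sketch ignores premultiplication;
it only shows that s * x + b reproduces the direct interpolation formula:

```c
#include <assert.h>

/* Cached per-channel coefficients for the active pair of stops. */
typedef struct
{
    float s;    /* slope:  (rc - lc) / w */
    float b;    /* offset: (lc * right - rc * left) / w */
} sk_coeffs_t;

/* Recompute the coefficients; done once per "reset", not per pixel. */
static sk_coeffs_t
sk_cache_coeffs (float left, float lc, float right, float rc)
{
    sk_coeffs_t c;
    float w_rec;

    assert (right != left);
    w_rec = 1.0f / (right - left);   /* multiply by reciprocal of w */
    c.s = (rc - lc) * w_rec;
    c.b = (lc * right - rc * left) * w_rec;
    return c;
}

/* Per-pixel work: one multiplication and one addition. */
static float
sk_eval (sk_coeffs_t c, float x)
{
    return c.s * x + c.b;
}

/* The direct formula, for comparison. */
static float
sk_lerp_direct (float left, float lc, float right, float rc, float x)
{
    float t = (x - left) / (right - left);

    return lc * (1.0f - t) + rc * t;
}
```

For stops at 0.25 (color 0) and 0.75 (color 1), both sk_eval() and
sk_lerp_direct() give 0.5 at x = 0.5, and the cached form reproduces the
stop colors at x = left and x = right.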

== Implementation notes:

The new formula described above is implemented in single precision
floating point, and the eight divisions necessary to compute the
cached values are done by multiplication with the reciprocal of the
distance between the color stops.

The alpha values used in the cached computation are scaled by 255.0,
whereas the RGB values are kept in the [0, 1] interval. This ensures
that after premultiplication, all values will be in the [0, 255]
interval.

This scaling is done by first dividing all the channels by
257, and then later on dividing the r, g, b channels by 255. It would
be more natural to do all this scaling in only one place, but
inexplicably, that results in a (substantial) slowdown on Sandy Bridge
with GCC v 4.7.

== Performance impact (median of three runs of radial-perf-test):

   == Intel Sandy Bridge, Core i3 @ 1.2GHz

   Before: 0.014553
   After:  0.014410
   Change: 1.0% faster

   == AMD Barcelona @ 1.2 GHz

   Before: 0.021735
   After:  0.021328
   Change: 1.9% faster

I.e., slightly faster, though conceivably there could be a negative
impact on machines with a bigger difference between integer and
floating point performance.

V2:

- Use 's' and 'b' in the variable names instead of 'm' and 'd'. This
  way they match the explanation above

- Move variable declarations to the top of the function

- Remove unused stepper field

- Some formatting fixes

- Don't pointlessly include pixman-combine32.h

- Don't offset x for each pixel; go back to offsetting left_x and
  right_x at reset time. The offsets cancel out in the formula above,
  so there is no impact on the calculations.

Move the IS_ZERO() to pixman-private.h and rename to FLOAT_IS_ZERO()

Some upcoming changes to pixman-gradient-walker.c will need this
macro.

test: Add radial-perf-test, a microbenchmark for radial gradients

This benchmark renders one of the radial gradients used in the
swfdec-youtube cairo trace 500 times and reports the average time it
took.

V2: Update .gitignore

demos: Add linear-gradient demo program

This program displays a linear gradient from blue to yellow. Due to
limited precision in pixman-gradient-walker.c, it currently has some
ugly artefacts that give it a 'brushed metal' appearance.

V2: Update .gitignore

Remove unused macro

MIPS: DSPr2: Added more fast-paths for SRC operation:
- src_0888_8888_rev
- src_0888_0565_rev

Performance numbers before/after on MIPS-74kc @ 1GHz:

lowlevel-blt-bench results

Referent (before):
        src_0888_8888_rev =  L1:  51.88  L2:  42.00  M: 19.04 ( 88.50%)  HT: 15.27  VT: 14.62  R: 14.13  RT:  7.12 (  45Kops/s)
        src_0888_0565_rev =  L1:  31.96  L2:  30.90  M: 22.60 ( 75.03%)  HT: 15.32  VT: 15.11  R: 14.49  RT:  6.64 (  43Kops/s)

Optimized:
        src_0888_8888_rev =  L1: 222.73  L2: 113.70  M: 20.97 ( 97.35%)  HT: 18.31  VT: 17.14  R: 16.71  RT:  9.74 (  54Kops/s)
        src_0888_0565_rev =  L1: 100.37  L2:  74.27  M: 29.43 ( 97.63%)  HT: 22.92  VT: 21.59  R: 20.52  RT: 10.56 (  56Kops/s)