Pekka Paalanen [Wed, 10 Jun 2015 08:21:14 +0000 (11:21 +0300)]
lowlevel-blt-bench: move explanation printing
Move explanation printing to a new function. This will help with
implementing a machine-readable output option.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Pekka Paalanen [Wed, 10 Jun 2015 08:14:38 +0000 (11:14 +0300)]
lowlevel-blt-bench: move usage to a function
Move printing of usage into a new function and use argv[0] as the
program name. This will help with printing usage from multiple places.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Oded Gabbay [Thu, 25 Jun 2015 12:59:57 +0000 (15:59 +0300)]
vmx: fix pix_multiply for ppc64le
vec_mergeh/l operate differently for BE and LE, because of the order of
the vector elements (l->r in BE and r->l in LE).
To fix that, we simply need to swap the input parameters when
working in LE.
v2:
- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- fixed whitespace and indentation issues
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Oded Gabbay [Thu, 25 Jun 2015 12:59:56 +0000 (15:59 +0300)]
vmx: fix unused var warnings
v2: don't put ';' at the end of macro definition. Instead, move it to
each line the macro is used.
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Oded Gabbay [Thu, 25 Jun 2015 12:59:55 +0000 (15:59 +0300)]
vmx: encapsulate the temporary variables inside the macros
v2: fixed whitespace and indentation issues
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Fernando Seiti Furusato [Thu, 25 Jun 2015 12:59:54 +0000 (15:59 +0300)]
vmx: adjust macros when loading vectors on ppc64le
Replaced usage of vec_lvsl with a direct unaligned assignment
operation (=). That is because, according to the Power ABI Specification,
the usage of lvsl is deprecated on ppc64le.
Changed COMPUTE_SHIFT_{MASK,MASKS,MASKC} macro usage to no-op for powerpc
little endian since unaligned access is supported on ppc64le.
v2:
- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- fixed whitespace and indentation issues
Signed-off-by: Fernando Seiti Furusato <ferseiti@linux.vnet.ibm.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Oded Gabbay [Thu, 25 Jun 2015 12:59:53 +0000 (15:59 +0300)]
vmx: fix splat_alpha for ppc64le
The permutation vector isn't correct for LE, so correct its values
in case we are in LE mode.
v2:
- replace _LITTLE_ENDIAN with WORDS_BIGENDIAN for consistency
- change #ifndef to #ifdef for readability
Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Reviewed-by: Adam Jackson <ajax@redhat.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Ben Avison [Tue, 26 May 2015 22:58:29 +0000 (23:58 +0100)]
mmx/sse2: Use SIMPLE_NEAREST_SOLID_MASK_FAST_PATH for NORMAL repeat
These two architectures were the only place where
SIMPLE_NEAREST_SOLID_MASK_FAST_PATH was used, and in both cases the
equivalent SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL macro was used
immediately afterwards, so including the NORMAL case in the main macro
simplifies the fast path table.
[Pekka: removed extra comma from the end of
SIMPLE_NEAREST_SOLID_MASK_FAST_PATH]
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Ben Avison [Tue, 26 May 2015 22:58:28 +0000 (23:58 +0100)]
mmx/sse2: Use SIMPLE_NEAREST_FAST_PATH macro
There is some reordering, but the only significant thing to ensure that
the same routine is chosen is that a COVER fast path for a given
combination of operator and source/destination pixel formats must
precede all the variants of repeated fast paths for the same
combination. This patch (and the other mmx/sse2 one) still follows that
rule.
I believe that in every other case, the set of operations that match any
pair of fast paths that are reordered in these patches are mutually
exclusive. While there will be a very subtle timing difference due to
the distance through the table we have to search to find a match
(sometimes faster, sometimes slower) there is no evidence that the tables
have been carefully ordered by frequency of occurrence - just for ease
of copy-and-pasting.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
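
The ordering rule in the commit above follows from the fast path table being searched for the first match. The Python sketch below illustrates that idea only; the entry names and the flag model are hypothetical simplifications, not pixman's actual table or flags. It assumes the repeat variant accepts a superset of the operations the COVER variant accepts.

```python
# Hypothetical, simplified model of a first-match fast path table.
# Each entry is (name, required_flags); an entry matches when the
# operation provides every flag the entry requires.
def lookup(table, op_flags):
    """Return the name of the first entry whose required flags are all set."""
    for name, required in table:
        if required <= op_flags:
            return name
    return None

# Assumption for illustration: the repeat (PAD) variant accepts a superset
# of the operations that the COVER variant accepts.
COVER = ("nearest_cover", {"nearest", "covers_clip"})
PAD = ("nearest_pad", {"nearest"})

op = {"nearest", "covers_clip"}
good_order = [COVER, PAD]   # COVER precedes the repeat variant
bad_order = [PAD, COVER]    # the repeat variant shadows COVER
```

With `good_order` the operation reaches the COVER entry; with `bad_order` the more general entry is found first and the COVER entry can never be selected.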
Ben Avison [Tue, 26 May 2015 22:58:27 +0000 (23:58 +0100)]
mips: Retire PIXMAN_MIPS_SIMPLE_NEAREST_A8_MASK_FAST_PATH
This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Ben Avison [Tue, 26 May 2015 22:58:26 +0000 (23:58 +0100)]
arm: Simplify PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH
This macro is a superset of the platform-neutral macro
SIMPLE_NEAREST_A8_MASK_FAST_PATH. In other words, in addition to the
_COVER, _NONE and _PAD suffixes, its expansion includes the _NORMAL suffix.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Ben Avison [Wed, 27 May 2015 11:45:25 +0000 (12:45 +0100)]
arm: Retire PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH
This macro does exactly the same thing as the platform-neutral macro
SIMPLE_NEAREST_FAST_PATH.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Ben Avison [Fri, 29 May 2015 15:20:43 +0000 (16:20 +0100)]
test: Fix solid-test for big-endian targets
When generating test data, we need to make sure the interpretation of
the data is the same regardless of endianness. That is, the pixel value
for each channel is the same on both little- and big-endian targets.
This fixes a test failure on ppc64 (big-endian).
Tested-by: Fernando Seiti Furusato <ferseiti@linux.vnet.ibm.com> (ppc64le, ppc64, powerpc)
Tested-by: Ben Avison <bavison@riscosopen.org> (armv6l, armv7l, i686)
[Pekka: added commit message]
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Tested-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk> (x86_64)
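
The endianness issue described in the commit above can be illustrated in general terms. The Python sketch below is not the actual change made to solid-test; the helper names are hypothetical. It shows that the same bytes in memory read back as different 32-bit pixel values depending on byte order, whereas storing and loading a pixel as a whole 32-bit value preserves it on either kind of target.

```python
import struct

# The same four bytes in memory, as a byte-wise fill might produce them.
raw = bytes([0x11, 0x22, 0x33, 0x44])

# Read as one 32-bit pixel, the value depends on the byte order.
as_little = struct.unpack("<I", raw)[0]
as_big = struct.unpack(">I", raw)[0]

def store_pixel(value, byte_order):
    """Store a 32-bit pixel value using the given byte order ('<' or '>')."""
    return struct.pack(byte_order + "I", value)

def load_pixel(buf, byte_order):
    """Load a 32-bit pixel value using the given byte order ('<' or '>')."""
    return struct.unpack(byte_order + "I", buf)[0]
```

Here `as_little` and `as_big` differ, while `load_pixel(store_pixel(v, order), order)` returns `v` for both byte orders.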
Ben Avison [Thu, 7 May 2015 18:32:46 +0000 (19:32 +0100)]
test: Add new fuzz tester targeting solid images
This places a heavier emphasis on solid images than the other fuzz testers,
and tests both single-pixel repeating bitmap images as well as those created
using pixman_image_create_solid_fill(). In the former case, it also
exercises the case where the bitmap contents are written to after the
image's first use, which is not a use-case that any other test has
previously covered.
[Pekka: added the default case to the switch in test_solid ().]
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
James Cowgill [Tue, 5 May 2015 15:39:38 +0000 (16:39 +0100)]
MIPS: Drop #ifdef __ELF__ in definition of LEAF_MIPS32R2
Commit 6d2cf40166d8 ("MIPS: Fix exported symbols in public API") attempted to
add a .hidden assembly directive, conditional on the code being compiled for an
ELF target. Unfortunately the #ifdef added was already inside a macro and
wasn't expanded properly by the preprocessor.
Fix by removing the check. It's unlikely there are many non-ELF MIPS systems
around anyway.
Fixes: Bug 83358 (https://bugs.freedesktop.org/83358)
Fixes: 6d2cf40166d8 ("MIPS: Fix exported symbols in public API")
Signed-off-by: James Cowgill <james410@cowgill.org.uk>
Cc: Vicente Olivert Riera <Vincent.Riera@imgtec.com>
Cc: Nemanja Lukic <nemanja.lukic@rt-rk.com>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Bill Spitzak [Wed, 29 Apr 2015 18:44:17 +0000 (11:44 -0700)]
test: Added more demos and tests to .gitignore file
Uses a wildcard to handle the majority which end in "-test".
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Ben Avison [Wed, 8 Apr 2015 13:20:09 +0000 (14:20 +0100)]
test: Add a new benchmarker targeting affine operations
Affine-bench is written by following the example of lowlevel-blt-bench.
Affine-bench differs from lowlevel-blt-bench in the following:
- does not test different sized operations fitting to specific caches,
destination is always 1920x1080
- allows defining the affine transformation parameters
- carefully computes operation extents to hit the COVER_CLIP fast paths
Original version by Ben Avison. Changes by Pekka in v3:
- commit message
- style fixes
- more comments
- refactoring (e.g. bench_info_t)
- help output tweak
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Pekka Paalanen [Tue, 14 Apr 2015 08:42:00 +0000 (11:42 +0300)]
lowlevel-blt-bench: use a8r8g8b8 for CA solid masks
When doing component alpha with a solid mask, use a mask format that has
all the color channels instead of just a8. As Ben Avison explains it:
"Lowlevel-blt-bench initialises all its images using memset(0xCC) so an
a8 solid image would be converted by _pixman_image_get_solid() to
0xCC000000 whereas an a8r8g8b8 would be 0xCCCCCCCC. When you're not in
component alpha mode, only the alpha byte matters for the mask image,
but in the case of component alpha operations, a fast path might decide
that it can save itself a lot of multiplications if it spots that 3
constant mask components are already 0."
No (default) test so far has a solid mask with CA. This is just
future-proofing lowlevel-blt-bench to do what one would expect.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
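
The values quoted by Ben Avison in the commit above can be reproduced with a small Python sketch. The function names are hypothetical and this is not the code of _pixman_image_get_solid(); it only models expanding an a8 pixel to 32-bit ARGB versus reading an a8r8g8b8 pixel directly, for a buffer filled with 0xCC.

```python
def solid_from_a8(byte):
    """Expand an a8 pixel to 32-bit ARGB: alpha only, colour channels zero."""
    return (byte & 0xFF) << 24

def solid_from_a8r8g8b8(pixel_bytes):
    """Read an a8r8g8b8 pixel whose four bytes are given as a bytes object.

    All four bytes are equal in this example, so byte order does not matter.
    """
    return int.from_bytes(pixel_bytes, "little")

fill = 0xCC  # the value lowlevel-blt-bench passes to memset()
a8_solid = solid_from_a8(fill)
argb_solid = solid_from_a8r8g8b8(bytes([fill] * 4))
```

The a8 case leaves all three colour components at zero, which is the situation a component-alpha fast path could special-case.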
Pekka Paalanen [Fri, 10 Apr 2015 13:42:49 +0000 (16:42 +0300)]
lowlevel-blt-bench: use the test pattern parser
Let lowlevel-blt-bench parse the test name string from the command line,
allowing it to run almost infinitely more tests. One is no longer limited
to the tests listed in the big table.
While you can use the old short-hand names like src_8888_8888, you can
also use all possible operators now, and specify pixel formats exactly
rather than just x888, for instance.
This even allows running crazy patterns like
conjoint_over_reverse_a8b8g8r8_n_r8g8b8x8.
All individual patterns are now interpreted through the parser. The
pattern "all" runs the same old default test set as before but through
the parser instead of the hard-coded parameters.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Pekka Paalanen [Fri, 10 Apr 2015 11:39:13 +0000 (14:39 +0300)]
lowlevel-blt-bench: add test name parser and self-test
This patch is inspired by "lowlevel-blt-bench: Parse test name strings in
general case" by Ben Avison. From Ben's commit message:
"There are many types of composite operation that are useful to benchmark
but which are omitted from the table. Continually having to add extra
entries to the table is a nuisance and is prone to human error, so this
patch adds the ability to break down unknown strings of the format
<operation>_<src>[_<mask>]_<dst>[_ca]
where bitmap formats are specified by number of bits of each component
(assumed in ARGB order) or 'n' to indicate a solid source or mask."
Add the parser to lowlevel-blt-bench.c, but do not hook it up to the
command line just yet. Instead, make it run a self-test.
As we now dynamically parse strings similar to the test names in the
huge table 'tests_tbl', we should make sure we can parse the old
well-known test names and produce exactly the same test parameters. The
self-test goes through this old table and verifies the parsing results.
Unfortunately the old table is not exactly consistent; it contains some
special cases that cannot be produced by the parsing rules. Whether
these special cases are intentional or just an oversight is not always
clear. Anyway, add a small table to reproduce the special cases
verbatim.
If we wanted, we could remove the big old table in a follow-up commit,
but then we would also lose the parser self-test.
The point of this whole exercise is to let lowlevel-blt-bench recognize
novel test patterns in the future, following exactly the conventions
used in the old table.
Ben, from what I see, this parser has one major difference to what you
wrote. For a solid mask, your parser uses a8r8g8b8 format, while mine
uses a8 which comes from the old table.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
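
The test name format quoted in the commit above, <operation>_<src>[_<mask>]_<dst>[_ca], can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser in lowlevel-blt-bench.c: the operator list is a small hypothetical subset, format strings are returned unparsed, and the special cases from the old table are ignored.

```python
# Hypothetical operator names; the real tool knows many more.
OPERATORS = ["conjoint_over_reverse", "over_reverse", "in_reverse",
             "over", "src", "add", "in"]

def parse_test_name(name):
    """Split '<operation>_<src>[_<mask>]_<dst>[_ca]' into its parts.

    Returns (operation, src, mask, dst, component_alpha); mask is None
    when the pattern has no mask. Raises ValueError on unknown input.
    """
    # Try longer operator names first, since they may contain underscores
    # and shorter names can be prefixes of longer ones.
    for op in sorted(OPERATORS, key=len, reverse=True):
        if name.startswith(op + "_"):
            rest = name[len(op) + 1:].split("_")
            break
    else:
        raise ValueError("unknown operator in %r" % name)

    # An optional trailing "ca" selects component alpha.
    component_alpha = rest[-1] == "ca"
    if component_alpha:
        rest = rest[:-1]

    if len(rest) == 2:
        src, dst = rest
        mask = None
    elif len(rest) == 3:
        src, mask, dst = rest
    else:
        raise ValueError("expected src[_mask]_dst in %r" % name)
    return op, src, mask, dst, component_alpha
```

Matching the longest operator name first is one way to cope with operator names that themselves contain underscores, such as in_reverse.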
Pekka Paalanen [Fri, 10 Apr 2015 09:50:23 +0000 (12:50 +0300)]
test/utils: add format aliases used by lowlevel-blt-bench
Lowlevel-blt-bench uses several pixel format shorthands. Pick them from
the great table in lowlevel-blt-bench.c and add them here so that
format_from_string() can recognize them.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Pekka Paalanen [Mon, 13 Apr 2015 07:06:31 +0000 (10:06 +0300)]
test/utils: add operator aliases for lowlevel-blt-bench
Lowlevel-blt-bench uses the operator alias "outrev". Add an alias for it
in the operator-name table.
Also add aliases for overrev, inrev and atoprev, so that
lowlevel-blt-bench can later recognize them for new test cases.
The aliases are added such that an operator-to-name lookup will never
return them; it returns the proper names instead.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
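
The alias behaviour described in the commit above (several names map to one operator, but the reverse lookup only ever returns the proper name) can be sketched in Python. The table layout, the numeric values, the proper-name spellings and the function names are assumptions made for this illustration; only the alias names overrev, inrev, outrev and atoprev come from the commit message.

```python
# Each entry: (operator value, name, is_alias). The values and the
# non-alias names are placeholders chosen for this sketch.
OPERATOR_NAMES = [
    (1, "over_reverse", False),
    (1, "overrev", True),
    (2, "in_reverse", False),
    (2, "inrev", True),
    (3, "out_reverse", False),
    (3, "outrev", True),
    (4, "atop_reverse", False),
    (4, "atoprev", True),
]

def name_to_op(name):
    """Look up an operator by any of its names, aliases included."""
    for value, entry_name, _is_alias in OPERATOR_NAMES:
        if entry_name == name:
            return value
    return None

def op_to_name(value):
    """Return the proper name of an operator, skipping alias entries."""
    for entry_value, entry_name, is_alias in OPERATOR_NAMES:
        if entry_value == value and not is_alias:
            return entry_name
    return None
```

Both `name_to_op("outrev")` and `name_to_op("out_reverse")` give the same operator, while `op_to_name()` never yields an alias.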
Pekka Paalanen [Thu, 9 Apr 2015 11:17:54 +0000 (14:17 +0300)]
test/utils: support format name aliases
Previously there was a flat list of formats, used to iterate over all
formats when looking up a format from name or listing them. This cannot
support name aliases.
To support name aliases (multiple name strings mapping to the same
format), create a format-name mapping table. Functions format_name(),
format_from_string(), and list_formats() should keep on working exactly
like before, except format_from_string() now recognizes the additional
formats that format_name() already supported.
Only the formats from the old format list are added with ENTRY, so
that list_formats() works as before. The whole list is verified against
the authoritative list in pixman.h; entries missing from the old list
are commented out.
The extra formats supported by the old format_name() are added as
ALIASes. A side-effect of that is that now also format_from_string()
recognizes the following new names: x4c4 / c8, x4g4 / g8, c4, g4, g1,
yuy2, yv12, null, solid, pixbuf, rpixbuf, unknown.
Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse short-hand format names from
strings.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Pekka Paalanen [Thu, 9 Apr 2015 10:13:09 +0000 (13:13 +0300)]
test/utils: support operator name aliases
Previously there was a flat list of operators (pixman_op_t), used to
iterate over all operators when looking up an operator from name or
listing them. This cannot support name aliases.
To support name aliases (multiple name strings mapping to the same
operator), create an operator-name mapping table. Functions
operator_name, operator_from_string, and list_operators should keep on
working exactly like before, except operator_from_string now recognizes
a few aliases too.
Name aliases will be useful in follow-up patches, where
lowlevel-blt-bench.c is converted to parse operator names from strings.
Lowlevel-blt-bench uses shorthand names instead of the usual names. This
change allows lowlevel-blt-bench to use operator_from_string in the
future.
Signed-off-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Reviewed-by: Ben Avison <bavison@riscosopen.org>
Ben Avison [Tue, 3 Mar 2015 15:24:17 +0000 (15:24 +0000)]
test: Move format and operator string functions to utils.[ch]
This permits format_from_string(), list_formats(), list_operators() and
operator_from_string() to be used from tests other than check-formats.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Ben Avison [Wed, 8 Apr 2015 13:20:30 +0000 (14:20 +0100)]
pixman.c: Coding style
A few violations of coding style were identified in code copied from here
into affine-bench.
Reviewed-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Ben Avison [Tue, 3 Mar 2015 15:24:16 +0000 (15:24 +0000)]
armv6: Fix typo in preload macro
A missing "lsl" meant that, in cases with a 32-bit source and/or mask and an
8-bit destination, the code would not assemble.
Siarhei Siamashka [Mon, 22 Sep 2014 03:30:51 +0000 (06:30 +0300)]
mmx: Fix _mm_empty problems for over_8888_8888/over_8888_n_8888
Using "--disable-sse2 --disable-ssse3" configure options and
CFLAGS="-m32 -O2 -g" on an x86 system results in pixman "make check"
failures:
../test-driver: line 95: 29874 Aborted
FAIL: affine-test
../test-driver: line 95: 29887 Aborted
FAIL: scaling-test
One _mm_empty () was missing and another one is needed to work around
an old GCC bug https://gcc.gnu.org/PR47759 (GCC may move MMX instructions
around and cause test suite failures).
Reviewed-by: Matt Turner <mattst88@gmail.com>
Søren Sandmann Pedersen [Sat, 20 Sep 2014 04:51:56 +0000 (21:51 -0700)]
Fix comment about BILINEAR_INTERPOLATION_BITS to say < 8 rather than <= 8
Since a4c79d695d52c94647b1aff7 the constant
BILINEAR_INTERPOLATION_BITS must be strictly less than 8, so fix the
comment to say this, and also add a COMPILE_TIME_ASSERT in the
bilinear fetcher in pixman-fast-path.c
Matt Turner [Wed, 2 Jan 2013 19:16:12 +0000 (11:16 -0800)]
mmx: Add nearest over_8888_8888
lowlevel-blt-bench -n, over_8888_8888, 15 iterations on Loongson 2f:
        Before          After
      Mean  StdDev    Mean  StdDev    Change
L1    15.8    0.02    24.0    0.06    +52.0%
L2    14.8    0.15    23.3    0.13    +56.9%
M     10.3    0.01    13.8    0.03    +33.6%
HT    10.0    0.02    14.5    0.05    +44.7%
VT     9.7    0.02    13.5    0.04    +39.2%
R      9.1    0.01    12.2    0.04    +34.4%
RT     7.1    0.06     8.9    0.09    +25.2%
Matt Turner [Wed, 2 Jan 2013 05:18:09 +0000 (21:18 -0800)]
mmx: Add nearest over_8888_n_8888
lowlevel-blt-bench -n, over_8888_n_8888, 15 iterations on Loongson 2f:
        Before          After
      Mean  StdDev    Mean  StdDev    Change
L1     9.7    0.01    19.2    0.02    +98.2%
L2     9.6    0.11    19.2    0.16    +99.5%
M      7.3    0.02    12.5    0.01    +72.0%
HT     6.6    0.01    13.4    0.02   +103.2%
VT     6.4    0.01    12.6    0.03    +96.1%
R      6.3    0.01    11.2    0.01    +76.5%
RT     4.4    0.01     8.1    0.03    +82.6%
Nemanja Lukic [Fri, 27 Jun 2014 16:05:38 +0000 (18:05 +0200)]
MIPS: Fix exported symbols in public API.
Søren Sandmann Pedersen [Sun, 1 Jun 2014 22:50:23 +0000 (18:50 -0400)]
test: Rearrange tests in order of increasing runtime
Making short tests run first is convenient to catch obvious bugs
early.
Søren Sandmann Pedersen [Thu, 24 Apr 2014 00:25:40 +0000 (20:25 -0400)]
pixman-gradient-walker: Make left_x and right_x 64 bit variables
The variables left_x and right_x in gradient_walker_reset() are
computed from pos, which is a 64 bit quantity, so to avoid overflows,
these variables must be 64 bit as well.
Similarly, the left_x and right_x that are stored in
pixman_gradient_walker_t need to be 64 bit as well; otherwise,
pixman_gradient_walker_pixel() will call reset too often.
This fixes the radial-invalid test, which was generating 'invalid'
floating point exceptions when the overflows caused color values to be
outside of [0, 255].
Søren Sandmann Pedersen [Thu, 24 Apr 2014 00:07:37 +0000 (20:07 -0400)]
test: Add radial-invalid test program
This program demonstrates a bug in gradient walker, where some integer
overflows cause colors outside the range [0, 255] to be generated,
which in turn cause 'invalid' floating point exceptions when those
colors are converted to uint8_t.
The bug was first reported by Owen Taylor on the #cairo IRC channel.
Ben Avison [Thu, 24 Apr 2014 10:39:06 +0000 (13:39 +0300)]
ARMv6: Add fast path for src_x888_0565
Benchmark results, "before" is upstream/master
5f661ee719be25c3aa0eb0d45e0db23a37e76468, and "after" contains this
patch on top.
lowlevel-blt-bench, src_8888_0565, 100 iterations:
        Before          After
      Mean  StdDev    Mean  StdDev  Confidence    Change
L1    25.9    0.20   115.6    0.70     100.00%   +347.1%
L2    14.4    0.23    52.7    3.48     100.00%   +265.0%
M     14.1    0.01    79.8    0.17     100.00%   +465.9%
HT    10.2    0.03    32.9    0.31     100.00%   +221.2%
VT     9.8    0.03    29.8    0.25     100.00%   +203.4%
R      9.4    0.03    27.8    0.18     100.00%   +194.7%
RT     4.6    0.04    10.9    0.29     100.00%   +135.9%
At most 19 outliers rejected per test per set.
cairo-perf-trace results with trimmed traces showed no significant difference.
A system-wide perf_3.10 profile on Raspbian shows significant
differences in the X server CPU usage. The following were measured from
a 130x62 char lxterminal running 'dmesg' every 0.5 seconds for roughly
30 seconds. These profiles are libpixman.so symbols only.
Before:
Samples: 63K of event 'cpu-clock', Event count (approx.):
2941348112, DSO: libpixman-1.so.0.33.1
37.77% Xorg [.] fast_fetch_r5g6b5
14.39% Xorg [.] pixman_composite_over_n_8_8888_asm_armv6
8.51% Xorg [.] fast_write_back_r5g6b5
7.38% Xorg [.] pixman_composite_src_8888_8888_asm_armv6
4.39% Xorg [.] pixman_composite_add_8_8_asm_armv6
3.69% Xorg [.] pixman_composite_src_n_8888_asm_armv6
2.53% Xorg [.] _pixman_image_validate
2.35% Xorg [.] pixman_image_composite32
After:
Samples: 31K of event 'cpu-clock', Event count (approx.):
3619782704, DSO: libpixman-1.so.0.33.1
22.36% Xorg [.] pixman_composite_over_n_8_8888_asm_armv6
13.59% Xorg [.] pixman_composite_src_x888_0565_asm_armv6
12.75% Xorg [.] pixman_composite_src_8888_8888_asm_armv6
6.79% Xorg [.] pixman_composite_add_8_8_asm_armv6
5.95% Xorg [.] pixman_composite_src_n_8888_asm_armv6
4.12% Xorg [.] pixman_image_composite32
3.69% Xorg [.] _pixman_image_validate
3.65% Xorg [.] _pixman_bits_image_setup_accessors
Before, fast_fetch_r5g6b5 + fast_write_back_r5g6b5 took 46% of the
samples in libpixman, and probably incurred some memcpy() load, too.
After, pixman_composite_src_x888_0565_asm_armv6 takes 14%. Note that
the sample counts are very different before/after, as less time is spent
in Pixman and running time is not exactly the same.
Furthermore, in the above test, the CPU idle function was sampled 9%
before, and 15% after.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Re-benchmarked on Raspberry Pi, commit message.
Pekka Paalanen [Thu, 10 Apr 2014 06:41:38 +0000 (09:41 +0300)]
ARM: use pixman_asm_function in internal headers
The two ARM headers contained open-coded copies of pixman_asm_function;
replace these.
Since it seems customary that ARM headers do not use CPP include guards,
rely on the .S files to #include "pixman-arm-asm.h" first. They all
do now.
v2: Fix a build failure on rpi by adding one #include.
Ben Avison [Wed, 9 Apr 2014 13:25:32 +0000 (16:25 +0300)]
ARMv6: Add fast path for in_reverse_8888_8888
Benchmark results, "before" is the patch
* upstream/master
4b76bbfda670f9ede67d0449f3640605e1fc4df0
+ ARMv6: Support for very variable-hungry composite operations
+ ARMv6: Add fast path for over_n_8888_8888_ca
and "after" contains the additional patches on top:
+ ARMv6: Add fast path flag to force no preload of destination buffer
+ ARMv6: Add fast path for in_reverse_8888_8888 (this patch)
lowlevel-blt-bench, in_reverse_8888_8888, 100 iterations:
        Before          After
      Mean  StdDev    Mean  StdDev  Confidence    Change
L1    21.1    0.07    32.3    0.08     100.00%    +52.9%
L2    11.6    0.29    18.0    0.52     100.00%    +54.4%
M     10.5    0.01    16.1    0.03     100.00%    +54.1%
HT     8.2    0.02    12.0    0.04     100.00%    +45.9%
VT     8.1    0.02    11.7    0.04     100.00%    +44.5%
R      8.1    0.02    11.3    0.04     100.00%    +39.7%
RT     4.8    0.04     6.1    0.09     100.00%    +27.3%
At most 12 outliers rejected per test per set.
cairo-perf-trace with trimmed traces, 30 iterations:
                                    Before        After
                                  Mean StdDev   Mean StdDev  Confidence   Change
t-firefox-paintball.trace         18.0   0.01   14.1   0.01     100.00%   +27.4%
t-firefox-chalkboard.trace        36.7   0.03   36.0   0.02     100.00%    +1.9%
t-firefox-canvas-alpha.trace      20.7   0.22   20.3   0.22     100.00%    +1.9%
t-swfdec-youtube.trace             7.8   0.03    7.8   0.03     100.00%    +0.9%
t-firefox-talos-gfx.trace         25.8   0.44   25.6   0.29      93.87%    +0.7%  (insignificant)
t-firefox-talos-svg.trace         20.6   0.04   20.6   0.03     100.00%    +0.2%
t-firefox-fishbowl.trace          21.2   0.04   21.1   0.02     100.00%    +0.2%
t-xfce4-terminal-a1.trace          4.8   0.01    4.8   0.01      98.85%    +0.2%  (insignificant)
t-swfdec-giant-steps.trace        14.9   0.03   14.9   0.02      99.99%    +0.2%
t-poppler-reseau.trace            22.4   0.11   22.4   0.08      86.52%    +0.2%  (insignificant)
t-gnome-system-monitor.trace      17.3   0.03   17.2   0.03      99.74%    +0.2%
t-firefox-scrolling.trace         24.8   0.12   24.8   0.11      70.15%    +0.1%  (insignificant)
t-firefox-particles.trace         27.5   0.18   27.5   0.21      48.33%    +0.1%  (insignificant)
t-grads-heat-map.trace             4.4   0.04    4.4   0.04      16.61%    +0.0%  (insignificant)
t-firefox-fishtank.trace          13.2   0.01   13.2   0.01       7.64%    +0.0%  (insignificant)
t-firefox-canvas.trace            18.0   0.05   18.0   0.05       1.31%    -0.0%  (insignificant)
t-midori-zoomed.trace              8.0   0.01    8.0   0.01      78.22%    -0.0%  (insignificant)
t-firefox-planet-gnome.trace      10.9   0.02   10.9   0.02      64.81%    -0.0%  (insignificant)
t-gvim.trace                      33.2   0.21   33.2   0.18      38.61%    -0.1%  (insignificant)
t-firefox-canvas-swscroll.trace   32.2   0.09   32.2   0.11      73.17%    -0.1%  (insignificant)
t-firefox-asteroids.trace         11.1   0.01   11.1   0.01     100.00%    -0.2%
t-evolution.trace                 13.0   0.05   13.0   0.05      91.99%    -0.2%  (insignificant)
t-gnome-terminal-vim.trace        19.9   0.14   20.0   0.14      97.38%    -0.4%  (insignificant)
t-poppler.trace                    9.8   0.06    9.8   0.04      99.91%    -0.5%
t-chromium-tabs.trace              4.9   0.02    4.9   0.02     100.00%    -0.6%
At most 6 outliers rejected per test per set.
Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
Confidence is based on Welch's t-test. Absolute changes less than 1%
can be accounted as measurement errors, even if statistically
significant.
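
How the "Change" column relates to the running times, and what statistic Welch's t-test is built on, can be sketched in Python. The function names are hypothetical and this is not the script used to produce the tables; it only applies the stated rule (change in operations per second, the inverse of running time) and the textbook Welch formulas.

```python
import math

def change_percent(time_before, time_after):
    """Change in operations per second, given two running times."""
    rate_before = 1.0 / time_before
    rate_after = 1.0 / time_after
    return (rate_after / rate_before - 1.0) * 100.0

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

Because the tables show rounded means, feeding them into `change_percent()` may give a value slightly different from the reported change. Turning the t statistic into a confidence figure additionally needs the t-distribution, which is left out of this sketch.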
There was a question of why FLAG_NO_PRELOAD_DST is used. It makes
lowlevel-blt-bench results worse except for L1, but improves some
Cairo trace benchmarks.
"Ben Avison" <bavison@riscosopen.org> wrote:
> The thing with the lowlevel-blt-bench benchmarks for the more
> sophisticated composite types (as a general rule, anything that involves
> branches at the per-pixel level) is that they are only profiling the case
> where you have mid-level alpha values in the source/mask/destination.
> Real-world images typically have a disproportionate number of fully
> opaque and fully transparent pixels, which is why when there's a
> discrepancy between which implementation performs best with cairo-perf
> trace versus lowlevel-blt-bench, I usually favour the Cairo winner.
>
> The results of removing FLAG_NO_PRELOAD_DST (in other words, adding
> preload of the destination buffer) are easy to explain in the
> lowlevel-blt-bench results. In the L1 case, the destination buffer is
> already in the L1 cache, so adding the preloads is simply adding extra
> instruction cycles that have no effect on memory operations. The "in"
> compositing operator depends upon the alpha of both source and
> destination, so if you use uniform mid-alpha, then you actually do need
> to read your destination pixels, so you benefit from preloading them. But
> for fully opaque or fully transparent source pixels, you don't need to
> read the corresponding destination pixel - it'll either be left alone or
> overwritten. Since the ARM11 doesn't use write-allocate cacheing, both of
> these cases avoid both the time taken to load the extra cachelines, as
> well as increasing the efficiency of the cache for other data. If you
> examine the source images being used by the Cairo test, you'll probably
> find they mostly use transparent or opaque pixels.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, commit message.
v5, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi due to a fix to
"ARMv6: Add fast path for over_n_8888_8888_ca" patch.
Ben Avison [Wed, 9 Apr 2014 13:25:31 +0000 (16:25 +0300)]
ARMv6: Add fast path flag to force no preload of destination buffer
Ben Avison [Wed, 9 Apr 2014 13:25:30 +0000 (16:25 +0300)]
ARMv6: Add fast path for over_n_8888_8888_ca
Benchmark results, "before" is
* upstream/master
4b76bbfda670f9ede67d0449f3640605e1fc4df0
"after" contains the additional patches on top:
+ ARMv6: Support for very variable-hungry composite operations
+ ARMv6: Add fast path for over_n_8888_8888_ca (this patch)
lowlevel-blt-bench, over_n_8888_8888_ca, 100 iterations:
        Before          After
      Mean  StdDev    Mean  StdDev  Confidence    Change
L1     2.7    0.00    16.1    0.06     100.00%   +500.7%
L2     2.4    0.01    14.1    0.15     100.00%   +489.9%
M      2.3    0.00    14.3    0.01     100.00%   +510.2%
HT     2.2    0.00     9.7    0.03     100.00%   +345.0%
VT     2.2    0.00     9.4    0.02     100.00%   +333.4%
R      2.2    0.01     9.5    0.03     100.00%   +331.6%
RT     1.9    0.01     5.5    0.07     100.00%   +192.7%
At most 1 outlier rejected per test per set.
cairo-perf-trace with trimmed traces, 30 iterations:
                                    Before        After
                                  Mean StdDev   Mean StdDev  Confidence   Change
t-firefox-talos-gfx.trace         33.1   0.42   25.8   0.44     100.00%   +28.6%
t-firefox-scrolling.trace         31.4   0.11   24.8   0.12     100.00%   +26.3%
t-gnome-terminal-vim.trace        22.4   0.10   19.9   0.14     100.00%   +12.5%
t-evolution.trace                 13.9   0.07   13.0   0.05     100.00%    +6.5%
t-firefox-planet-gnome.trace      11.6   0.02   10.9   0.02     100.00%    +6.5%
t-gvim.trace                      34.0   0.21   33.2   0.21     100.00%    +2.4%
t-chromium-tabs.trace              4.9   0.02    4.9   0.02     100.00%    +1.0%
t-poppler.trace                    9.8   0.05    9.8   0.06     100.00%    +0.7%
t-firefox-canvas-swscroll.trace   32.3   0.10   32.2   0.09     100.00%    +0.4%
t-firefox-paintball.trace         18.1   0.01   18.0   0.01     100.00%    +0.3%
t-poppler-reseau.trace            22.5   0.09   22.4   0.11      99.29%    +0.3%
t-firefox-canvas.trace            18.1   0.06   18.0   0.05      99.29%    +0.2%
t-xfce4-terminal-a1.trace          4.8   0.01    4.8   0.01      99.77%    +0.2%
t-firefox-fishbowl.trace          21.2   0.03   21.2   0.04     100.00%    +0.2%
t-gnome-system-monitor.trace      17.3   0.03   17.3   0.03      99.54%    +0.1%
t-firefox-asteroids.trace         11.1   0.01   11.1   0.01     100.00%    +0.1%
t-midori-zoomed.trace              8.0   0.01    8.0   0.01      99.98%    +0.1%
t-grads-heat-map.trace             4.4   0.04    4.4   0.04      34.08%    +0.1%  (insignificant)
t-firefox-talos-svg.trace         20.6   0.03   20.6   0.04      54.06%    +0.0%  (insignificant)
t-firefox-fishtank.trace          13.2   0.01   13.2   0.01      52.81%    -0.0%  (insignificant)
t-swfdec-giant-steps.trace        14.9   0.02   14.9   0.03      85.50%    -0.1%  (insignificant)
t-firefox-chalkboard.trace        36.6   0.02   36.7   0.03     100.00%    -0.2%
t-firefox-canvas-alpha.trace      20.7   0.32   20.7   0.22      55.76%    -0.3%  (insignificant)
t-swfdec-youtube.trace             7.8   0.02    7.8   0.03     100.00%    -0.5%
t-firefox-particles.trace         27.4   0.16   27.5   0.18      99.94%    -0.6%
At most 4 outliers rejected per test per set.
Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
Confidence is based on Welch's t-test. Absolute changes less than 1%
can be accounted as measurement errors, even if statistically
significant.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Use pixman_asm_function instead of startfunc.
Rebased. Re-benchmarked on Raspberry Pi.
Commit message.
v5, Ben Avison <bavison@riscosopen.org> :
Fixed the bug exposed in blitters-test 4928372.
15 hours of testing, compared to the 45 minutes to hit
the bug originally.
Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Squash the fix, re-benchmark on Raspberry Pi.
Ben Avison [Wed, 9 Apr 2014 13:25:29 +0000 (16:25 +0300)]
ARMv6: Support for very variable-hungry composite operations
Previously, the variable ARGS_STACK_OFFSET was available to extract values
from function arguments during the init macro. Now this changes dynamically
around stack operations in the function as a whole so that arguments can be
accessed at any point. It is also joined by LOCALS_STACK_OFFSET, which
allows access to space reserved on the stack during the init macro.
On top of this, composite macros now have the option of using all of WK0-WK3
registers rather than just the subset it was told to use; this requires the
pixel count to be spilled to the stack over the leading pixels at the start
of each line. Thus, at best, each composite operation can use 11 registers,
plus any pointer registers not required for the composite type, plus as much
stack space as it needs, divided up into constants and variables as necessary.
Søren Sandmann [Wed, 9 Apr 2014 18:14:12 +0000 (14:14 -0400)]
create_bits(): Cast the result of height * stride to size_t
In create_bits() both height and stride are ints, so the result is
also an int, which will overflow if height or stride are big enough
and size_t is bigger than int.
This patch simply casts height to size_t to prevent these overflows,
which prevents the crash in:
https://bugzilla.redhat.com/show_bug.cgi?id=972647
It's not even close to fixing the full problem of supporting big
images in pixman.
See also
https://bugs.freedesktop.org/show_bug.cgi?id=69014
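As a rough illustration of the overflow described above, the following sketch contrasts an int-range check with a multiplication that is widened to size_t first. The function names are hypothetical and are not taken from pixman; both arguments are assumed to be non-negative.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 when height * stride does not fit in an
 * int. Both arguments are assumed to be non-negative. */
int int_product_overflows (int height, int stride)
{
    return height != 0 && stride > INT_MAX / height;
}

/* Casting one operand to size_t makes the whole multiplication happen in
 * size_t, so the product is only limited by the width of size_t. */
size_t buffer_size (int height, int stride)
{
    return (size_t) height * stride;
}
```

With a 32-bit int and a 64-bit size_t, a 65536 x 65536 request overflows the int product but is representable after the cast.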
Pekka Paalanen [Mon, 31 Mar 2014 12:03:43 +0000 (15:03 +0300)]
ARM: share pixman_asm_function definition
Several files define identically the asm macro pixman_asm_function.
Merge all these definitions into a new asm header.
The original definition is taken from pixman-arm-simd-asm-scaled.S with
the copyright/licence/author blurb verbatim.
Ben Avison [Fri, 28 Mar 2014 09:13:21 +0000 (11:13 +0200)]
ARMv6: Add fast path for over_reverse_n_8888
Benchmark results, "before" is upstream commit
c343846 lowlevel-blt-bench: add in_reverse_8888_8888 test
and "after" is with this patch only added on top.
lowlevel-blt-bench, over_reverse_n_8888, 100 iterations:
          Before            After
       Mean  StdDev     Mean  StdDev   Confidence     Change
L1     15.1     0.1    274.5     2.3      100.00%   +1718.9%
L2     12.8     0.3    181.8     0.7      100.00%   +1315.5%
M      10.8     0.0     77.9     0.0      100.00%    +621.2%
HT      9.7     0.0     29.4     0.2      100.00%    +204.9%
VT      9.5     0.0     26.7     0.1      100.00%    +179.3%
R       9.3     0.0     25.3     0.1      100.00%    +173.6%
RT      6.0     0.1     11.0     0.2      100.00%     +82.9%
At most 16 outliers rejected per case per set.
cairo-perf-trace with trimmed traces, 30 iterations:
Before After
Mean StdDev Mean StdDev Confidence Change
t-poppler.trace 12.9 0.1 9.7 0.0 100.00% +32.6%
t-firefox-talos-gfx.trace 33.2 0.7 32.9 0.4 95.23% +0.9% (insignificant)
t-firefox-particles.trace 27.4 0.1 27.3 0.2 99.65% +0.4%
t-firefox-canvas-alpha.trace 20.5 0.3 20.5 0.3 57.51% +0.3% (insignificant)
t-poppler-reseau.trace 22.4 0.1 22.4 0.1 95.69% +0.3% (insignificant)
t-firefox-fishtank.trace 13.2 0.0 13.2 0.0 99.84% +0.1%
t-swfdec-giant-steps.trace 14.9 0.0 14.9 0.0 87.68% +0.1% (insignificant)
t-swfdec-youtube.trace 7.8 0.0 7.8 0.0 35.22% +0.1% (insignificant)
t-firefox-planet-gnome.trace 11.5 0.0 11.5 0.0 29.37% +0.0% (insignificant)
t-firefox-fishbowl.trace 21.2 0.0 21.2 0.0 18.09% +0.0% (insignificant)
t-grads-heat-map.trace 4.4 0.0 4.4 0.0 1.84% +0.0% (insignificant)
t-firefox-paintball.trace 18.0 0.0 18.0 0.0 33.43% -0.0% (insignificant)
t-firefox-talos-svg.trace 20.5 0.0 20.5 0.1 68.56% -0.1% (insignificant)
t-midori-zoomed.trace 8.0 0.0 8.0 0.0 99.98% -0.1%
t-firefox-canvas-swscroll.trace 32.1 0.1 32.1 0.1 85.27% -0.1% (insignificant)
t-gnome-system-monitor.trace 17.2 0.0 17.2 0.0 99.97% -0.2%
t-firefox-chalkboard.trace 36.5 0.0 36.6 0.0 100.00% -0.2%
t-firefox-asteroids.trace 11.1 0.0 11.1 0.0 100.00% -0.2%
t-firefox-canvas.trace 17.9 0.0 18.0 0.0 100.00% -0.3%
t-chromium-tabs.trace 4.9 0.0 4.9 0.0 97.95% -0.3% (insignificant)
t-xfce4-terminal-a1.trace 4.8 0.0 4.8 0.0 100.00% -0.4%
t-firefox-scrolling.trace 31.1 0.1 31.2 0.1 100.00% -0.5%
t-evolution.trace 13.7 0.1 13.8 0.1 99.99% -0.6%
t-gnome-terminal-vim.trace 22.0 0.2 22.2 0.1 99.99% -0.7%
t-gvim.trace 33.2 0.2 33.5 0.2 100.00% -0.8%
At most 6 outliers rejected per case per set.
Cairo perf reports the running time, but the change is computed for
operations per second instead (inverse of running time).
Changes in the order of +/- 1% can be accounted for measurement errors,
even if they are deemed to be statistically significant. This claim is
based on comparing two 30-iteration identical "before" runs using the
exact same binaries, and observing changes from -0.4% to +0.5% with
>=99% confidence.
Confidence is based on Welch's t-test.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Rebased, re-benchmarked on Raspberry Pi, commit message.
Siarhei Siamashka [Fri, 7 Mar 2014 06:23:10 +0000 (08:23 +0200)]
test: Fix OpenMP clauses for the tolerance-test
Compiling with the Intel Compiler reveals a problem:
tolerance-test.c(350): error: index variable "i" of for statement following an OpenMP for pragma must be private
# pragma omp parallel for default(none) shared(i) private (result)
^
In addition to this, the 'result' variable also should not be private
(otherwise its value does not survive after the end of the loop). It
needs to be either shared or use the reduction clause to describe how
the results from multiple threads are combined together. Reduction
seems to be more appropriate here.
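A minimal sketch of the reduction approach, assuming a hypothetical run_case() that returns 1 on failure. This is not the code from tolerance-test.c; it only shows the clause layout: the loop index is private and the accumulated result is combined across threads by the reduction clause.

```c
/* Hypothetical test body: pretend every case with i % 7 == 3 fails. */
static int run_case (int i)
{
    return (i % 7) == 3;
}

/* Counts failures among n cases. Each thread accumulates into its own copy
 * of 'result'; the copies are summed when the loop ends. Without OpenMP
 * support the pragma is ignored and the loop simply runs serially. */
int count_failures (int n)
{
    int result = 0;
    int i;

#   pragma omp parallel for default(none) shared(n) private(i) reduction(+:result)
    for (i = 0; i < n; i++)
        result += run_case (i);

    return result;
}
```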
Siarhei Siamashka [Fri, 7 Mar 2014 04:39:42 +0000 (06:39 +0200)]
configure.ac: Check if the compiler supports GCC vector extensions
The Intel Compiler 14.0.0 claims version GCC 4.7.3 compatibility
via __GNUC__/__GNUC_MINOR__ macros, but does not provide the same
level of GCC vector extensions support as the original GCC compiler:
http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
Which results in the following compilation failure:
In file included from ../test/utils.h(7),
from ../test/utils.c(3):
../test/utils-prng.h(138): error: expression must have integral type
uint32x4 e = x->a - ((x->b << 27) + (x->b >> (32 - 27)));
^
The problem is fixed by doing a special check in configure for
this feature.
Ben Avison [Thu, 20 Mar 2014 08:30:28 +0000 (10:30 +0200)]
lowlevel-blt-bench: add in_reverse_8888_8888 test
in_reverse_8888_8888 is one of the more commonly used operations in the
cairo-perf-trace suite that hasn't been in lowlevel-blt-bench until now.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Split from "Add extra test to lowlevel-blt-bench and fix an
existing one", new summary.
Ben Avison [Thu, 20 Mar 2014 08:30:27 +0000 (10:30 +0200)]
lowlevel-blt-bench: over_reverse_n_8888 needs solid source
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Split from "Add extra test to lowlevel-blt-bench and fix an
existing one", new summary.
Ben Avison [Thu, 20 Mar 2014 08:30:26 +0000 (10:30 +0200)]
ARMv6: remove 1 instr per row in generate_composite_function
This knocks off one instruction per row. The effect is probably too small to
be measurable, but might as well be included. The second occurrence of this
sequence doesn't actually benefit at all, but is changed for consistency.
The saved instruction comes from combining the "and" inside the .if
statement with an earlier "tst". The "and" was normally needed, except
for in one special case, where bits 4-31 were all shifted off the top of
the register later on in preload_leading_step2, so we didn't care about
their values.
v4, Pekka Paalanen <pekka.paalanen@collabora.co.uk> :
Remove "bits 0-3" from the comments, update patch summary, and
augment message with Ben's suggestion.
Ben Avison [Thu, 20 Mar 2014 08:30:25 +0000 (10:30 +0200)]
ARMv6: Fix indentation in the composite macros
Søren Sandmann [Sun, 8 Dec 2013 14:08:45 +0000 (09:08 -0500)]
Remove all the operators that use division from pixman-combine32.c
These are now handled by floating point combiners.
Søren Sandmann [Sun, 8 Dec 2013 13:51:31 +0000 (08:51 -0500)]
Copy the comments from pixman-combine32.c to pixman-combine-float.c
An upcoming commit will delete many of the operators from
pixman-combine32.c and rely on the ones in pixman-combine-float.c. The
comments about how the operators were derived are still useful though,
so copy them into pixman-combine-float.c before the deletion.
Søren Sandmann Pedersen [Mon, 30 Sep 2013 23:22:20 +0000 (19:22 -0400)]
utils.c: Set DEVIATION to 0.0128
Consider a HARD_LIGHT operation with the following pixels:
- source: 15 (6 bits)
- source alpha: 255 (8 bits)
- mask alpha: 223 (8 bits)
- dest: 255 (8 bits)
- dest alpha: 0 (8 bits)
Since 2 times the source is less than source alpha, the first branch
of the hard light blend mode is taken:
(1 - sa) * d + (1 - da) * s + 2 * s * d
Since da is 0 and d is 1, this degenerates to:
(1 - sa) + 3 * s
Taking (src IN mask) into account along with the fact that sa is 1,
this becomes:
(1 - ma) + 3 * s * ma
= (1 - 223/255.0) + 3 * (15/63.0) * (223/255.0)
= 0.7501400560224089
When computed with the source converted by bit replication to eight
bits, and additionally with the (src IN mask) part rounded to eight
bits, we get:
ma = 223/255.0
s * ma = (60 / 255.0) * (223/255.0) which rounds to 52 / 255
and the result is
(1 - ma) + 3 * s * ma
= (1 - 223/255.0) + 3 * 52/255.0
= 0.7372549019607844
so now we have an error of 0.012885.
Without making changes to the way pixman does integer
rounding/arithmetic, this error must then be considered
acceptable. Due to conservative computations in the test suite we can
however get away with 0.0128 as the acceptable deviation.
This fixes the remaining failures in pixel-test.
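The arithmetic above can be reproduced with a short sketch. The helper names are illustrative only and the rounding helper is a plain round-to-nearest, not necessarily the exact macro pixman uses.

```c
#include <stdint.h>

/* Expand a 6-bit channel to 8 bits by bit replication (15 becomes 60). */
static uint8_t expand_6_to_8 (uint8_t v6)
{
    return (uint8_t) ((v6 << 2) | (v6 >> 4));
}

/* Multiply two 8-bit fractions and round to the nearest 8-bit value. */
static uint8_t mul_un8 (uint8_t a, uint8_t b)
{
    return (uint8_t) ((a * b + 127) / 255);
}

/* (1 - ma) + 3 * s * ma with s and ma kept as real numbers. */
double hard_light_exact (void)
{
    double s = 15 / 63.0, ma = 223 / 255.0;

    return (1 - ma) + 3 * s * ma;
}

/* Same expression, but with the source expanded to 8 bits and the
 * (src IN mask) product rounded to 8 bits before use. */
double hard_light_rounded (void)
{
    double ma = 223 / 255.0;
    double s_ma = mul_un8 (expand_6_to_8 (15), 223) / 255.0;

    return (1 - ma) + 3 * s_ma;
}
```

The difference between the two functions is the 0.012885 error discussed above.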
Søren Sandmann [Sun, 24 Nov 2013 00:38:50 +0000 (19:38 -0500)]
Use floating point combiners for all operators that involve divisions
Consider a DISJOINT_ATOP operation with the following pixels:
- source: 0xff (8 bits)
- source alpha: 0x01 (8 bits)
- mask alpha: 0x7b (8 bits)
- dest: 0x00 (8 bits)
- dest alpha: 0xff (8 bits)
When (src IN mask) is computed in 8 bits, the resulting alpha channel
is 0 due to rounding:
floor ((0x01 * 0x7b) / 255.0 + 0.5) = floor (0.9823) = 0
which means that since Render defines any division by zero as
infinity, the Fa and Fb for this operator end up as follows:
Fa = max (1 - (1 - 1) / 0, 0) = 0
Fb = min (1, (1 - 0) / 1) = 1
and so since dest is 0x00, the overall result is 0.
However, when computed in full precision, the alpha value no longer
rounds to 0, and so Fa ends up being
Fa = max (1 - (1 - 1) / 0.0001, 0) = 1
and so the result is now
s * ma * Fa + d * Fb
= (1.0 * (0x7b / 255.0) * 1) + d * 0
= 0x7b / 255.0
= 0.4823
so the error in this case ends up being 0.48235294, which is clearly
not something that can be considered acceptable.
In order to avoid this problem, we need to do all arithmetic in such a
way that a multiplication of two tiny numbers can never end up being
zero unless one of the input numbers is itself zero.
This patch makes all computations that involve divisions take place in
floating point, which is sufficient to fix the test cases.
This brings the number of failures in pixel-test down to 14.
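A small sketch of the example above, under the assumption that Fa is computed exactly as written in this message and that division by zero yields infinity. The function names are hypothetical. Since dest is 0x00, the d * Fb term is omitted.

```c
#include <stdint.h>

/* Fa = max (1 - (1 - da) / sa, 0), with division by zero treated as
 * infinity as in the message above. */
static double disjoint_atop_fa (double sa, double da)
{
    double fa;

    if (sa == 0.0)
        return 0.0;     /* 1 - infinity, clamped to 0 */

    fa = 1.0 - (1.0 - da) / sa;
    return fa > 0.0 ? fa : 0.0;
}

/* (src IN mask) rounded to 8 bits before Fa is computed. */
double result_8bit_intermediate (void)
{
    uint8_t s = 0xff, sa = 0x01, ma = 0x7b;
    uint8_t s_in = (uint8_t) ((s * ma + 127) / 255);    /* 0x7b */
    uint8_t sa_in = (uint8_t) ((sa * ma + 127) / 255);  /* rounds to 0 */

    return (s_in / 255.0) * disjoint_atop_fa (sa_in / 255.0, 1.0);
}

/* Same computation with no intermediate rounding. */
double result_full_precision (void)
{
    double s = 1.0, sa = 0x01 / 255.0, ma = 0x7b / 255.0;

    return (s * ma) * disjoint_atop_fa (sa * ma, 1.0);
}
```

The first function returns 0 and the second roughly 0.4823, which is the discrepancy described above.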
Søren Sandmann [Mon, 18 Nov 2013 18:26:33 +0000 (13:26 -0500)]
Soft Light: Consistent approach to division by zero
The Soft Light operator has several branches. One of them is decided
based on whether 2 * s is less than or equal to 2 * sa. In floating
point implementations, when those two values are very close to each
other, it may not be completely predictable which branch we hit.
This is a problem because in one branch, when destination alpha is
zero, we get the result
r = d * as
and in the other we get
r = 0
So when d and as are not 0, this causes two different results to be
returned from essentially identical input values. In other words,
there is a discontinuity in the current implementation.
This patch arbitrarily changes the second branch such that it now returns
d * sa instead. There is no deep meaning behind this, because
essentially this is an attempt to assign meaning to division by zero,
and all that is required is that that meaning doesn't depend on minute
differences in input values.
This makes the number of failed pixels in pixel-test go down to 347.
Søren Sandmann Pedersen [Fri, 18 Oct 2013 20:39:38 +0000 (16:39 -0400)]
pixman-combine32.c: Fix bugs related to integer promotion
In the component alpha part of the PDF_SEPARABLE_BLEND_MODE macro, the
expression ~RED_8 (m) is used. Because RED_8(m) gets promoted to int
before ~ is applied, the whole expression typically becomes some
negative value rather than (255 - RED_8(m)) as desired.
Fix this by using unsigned temporary variables.
This reduces the number of failures in pixel-test to 363.
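The promotion problem can be shown in isolation. The RED_8 macro below is a hypothetical stand-in that yields an 8-bit value; it is not pixman's definition.

```c
#include <stdint.h>

/* Hypothetical stand-in for RED_8: extracts an 8-bit channel. */
#define RED_8(x) ((uint8_t) (((x) >> 16) & 0xff))

/* The uint8_t operand is promoted to int before ~ is applied, so the
 * result is a negative int rather than 255 - RED_8 (m). */
int inverted_promoted (uint32_t m)
{
    return ~RED_8 (m);
}

/* Storing the complement in an unsigned 8-bit temporary truncates it back
 * to the intended 255 - RED_8 (m). */
unsigned inverted_fixed (uint32_t m)
{
    uint8_t r = (uint8_t) ~RED_8 (m);

    return r;
}
```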
Søren Sandmann Pedersen [Sat, 19 Jan 2013 03:25:36 +0000 (22:25 -0500)]
pixman/pixman-combine32.c: Bug fixes for separable blend modes
This commit fixes four separate bugs:
1. In the computation
(1 - sa) * d + (1 - da) * s + sa * da * B(s, d)
we were using regular addition for all four channels, but for
superluminescent pixels, the addition could overflow causing
nonsensical results.
2. The variables and return types used for the results of the blend
mode calculations were unsigned, but for various blend modes (and
especially with superluminescent pixels), the blend mode
calculations could be negative, resulting in underflows.
3. The blend mode computations were returned as 8-bit values, which is
not sufficient precision (especially considering that we need
signed results).
4. The value before the final division by 255 was not properly clamped
to [0, 255].
This patch fixes all those bugs. The blend mode computations are now
returned as signed 16 bit values with 1 represented as 255 * 255.
With these fixes, the number of failing pixels in pixel-test goes down
from 431 to 384.
Søren Sandmann [Wed, 4 Dec 2013 15:06:06 +0000 (10:06 -0500)]
pixel-test.c: Add a number of pixels that have failed at some point
This commit adds a large number of pixel regressions to
pixel-test. All of these have at some point been failing in
blend-mode-test, and most of them do fail currently.
To be specific, with this commit, pixel-test reports 431 failed tests.
Søren Sandmann Pedersen [Thu, 17 Jan 2013 11:36:51 +0000 (06:36 -0500)]
test/tolerance-test: New test program
This new test program is similar to test/composite in that it relies
on the pixel_checker_t API to do tolerance based verification. But
unlike the composite test, which verifies combinations of a fixed set
of pixels, this one generates random images and verifies that those
composite correctly.
Also unlike composite, tolerance-test supports all the separable blend
mode operators in addition to the original Render operators.
When tests fail, a C struct is printed that can be pasted into
pixel-test for regression purposes.
There is an option "--forever" which causes the random seed to be set
to the current time, and then the test runs until interrupted. This is
useful for overnight runs.
This test currently fails badly due to various bugs in the blend mode
operators. Later commits will fix those.
Søren Sandmann [Wed, 4 Dec 2013 15:32:29 +0000 (10:32 -0500)]
pixel-test: Command line argument to specify the regression to run
A new command line argument allows the user to specify which one of
the regressions should be run.
Søren Sandmann [Wed, 4 Dec 2013 15:05:44 +0000 (10:05 -0500)]
pixel-test: Add support for mask pixels
Support is added to pixel-test for verifying operations involving
masks. If a regression includes a mask, it is verified with the
pixel_checker API in both unified and component alpha modes.
Søren Sandmann Pedersen [Mon, 30 Sep 2013 23:22:11 +0000 (19:22 -0400)]
test/check-formats.c: Add support for separable blend modes
Søren Sandmann Pedersen [Sat, 19 Jan 2013 17:24:07 +0000 (12:24 -0500)]
test/utils.c: Add support for separable blend mode ops to do_composite()
The implementations are copied from the floating point pipeline, but
use double precision instead of single precision.
Søren Sandmann [Thu, 26 Dec 2013 14:41:53 +0000 (09:41 -0500)]
configure.ac: Check and use -Wno-unused-local-typedefs GCC option
With GCC 4.8.2 the COMPILE_TIME_ASSERT macro produces a spurious
warning about an unused local typedef:
In file included from pixman.c:29:0:
pixman.c: In function 'optimize_operator':
pixman-private.h:1019:22: warning: typedef 'compile_time_assertion' locally defined but not used [-Wunused-local-typedefs]
The flag -Wno-unused-local-typedefs suppresses that warning.
Søren Sandmann [Tue, 3 Dec 2013 22:59:42 +0000 (17:59 -0500)]
Soft Light: The first comparison should be <=, not <
According to the definition of soft light, the first comparison is
less-than-or-equal, not less-than.
Søren Sandmann [Sun, 24 Nov 2013 01:30:33 +0000 (20:30 -0500)]
general: Support component alpha for all image types
Currently, if you attempt to use component alpha on source images or
images without RGB channels, Pixman will silently just use unified
alpha instead. This patch makes such images supported for component
alpha.
There is no particularly compelling usecase at the moment, but this
patch does get rid of a bit of special-case code both in
pixman-general.c and in test/composite.c.
Søren Sandmann [Sat, 16 Nov 2013 23:57:01 +0000 (18:57 -0500)]
test/utils.c: Make the stack unaligned only on 32 bit Windows
The call_test_function() contains some assembly that deliberately
causes the stack to be aligned to 32 bits rather than 128 bits on
x86-32. The intention is to catch bugs that surface when pixman is
called from code that only uses a 32 bit alignment.
However, recent versions of GCC apparently make the assumption (either
accidentally or deliberately) that the incoming stack is aligned
to 128 bits, where older versions only seemed to make this assumption
when compiling with -msse2. This causes the vector code in the PRNG to
now segfault when called from call_test_function() on x86-32.
This patch fixes that by only making the stack unaligned on 32 bit
Windows, where it would definitely be incorrect for GCC to assume that
the incoming stack is aligned to 128 bits.
V2: Put "defined(...)" around __GNUC__
Reviewed-and-Tested-by: Matt Turner <mattst88@gmail.com>
Bugzilla: https://bugs.gentoo.org/show_bug.cgi?id=491110
Jakub Bogusz [Tue, 12 Nov 2013 20:59:42 +0000 (12:59 -0800)]
Fix the SSSE3 CPUID detection.
SSSE3 is detected by bit 9 of ECX, but we were checking bit 9 of EDX,
which is APIC, leading to SSSE3 routines being called on CPUs without
SSSE3.
Reviewed-by: Matt Turner <mattst88@gmail.com>
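A sketch of the corrected check, assuming ecx and edx hold the register values returned by CPUID leaf 1. The function and macro names are illustrative and do not come from pixman.

```c
#include <stdint.h>

#define SSSE3_BIT (1u << 9)     /* CPUID leaf 1, ECX bit 9 */

/* Returns nonzero when the CPU reports SSSE3. The edx argument is accepted
 * only to make the contrast explicit: bit 9 of EDX reports the APIC and
 * says nothing about SSSE3. */
int has_ssse3 (uint32_t ecx, uint32_t edx)
{
    (void) edx;
    return (ecx & SSSE3_BIT) != 0;
}
```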
Søren Sandmann [Tue, 12 Nov 2013 00:13:31 +0000 (19:13 -0500)]
demos/Makefile.am: Move EXTRA_DIST outside "if HAVE_GTK"
Without this, if tarballs are generated on a system that doesn't have
GTK+ 2 development headers available, the files in EXTRA_DIST will not
be included, which then causes builds from the tarball to fail on
systems that do have GTK+ 2 headers available.
Fixes https://bugs.freedesktop.org/show_bug.cgi?id=71465
Andrea Canciani [Mon, 11 Nov 2013 10:21:23 +0000 (11:21 +0100)]
test: Fix the win32 build
The win32 build has no config.h, so HAVE_CONFIG_H should be checked
before including it, as in utils.h.
Søren Sandmann [Sun, 10 Nov 2013 23:17:12 +0000 (18:17 -0500)]
Post-release version bump to 0.33.1
Søren Sandmann [Sun, 10 Nov 2013 23:05:47 +0000 (18:05 -0500)]
Pre-release version bump to 0.32.0
Søren Sandmann Pedersen [Sat, 2 Nov 2013 00:52:00 +0000 (20:52 -0400)]
Post-release version bump to 0.31.3
Søren Sandmann Pedersen [Sat, 2 Nov 2013 00:39:46 +0000 (20:39 -0400)]
Pre-release version bump to 0.31.2
Ritesh Khadgaray [Wed, 23 Oct 2013 21:29:07 +0000 (17:29 -0400)]
pixman_trapezoid_valid(): Fix underflow when bottom is close to MIN_INT
If t->bottom is close to MIN_INT (probably invalid value), subtracting
top can lead to underflow which causes crashes. Attached patch will
fix the issue.
This fixes bug 67484.
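One way to avoid the underflow is to compare the two values instead of subtracting them. The sketch below is an assumption about the general technique, not a copy of the patch; top and bottom stand in for the 32-bit fixed-point fields of the trapezoid.

```c
#include <stdint.h>

/* Returns 1 when bottom - top is representable in 32 bits. The subtraction
 * is done in 64 bits so that the check itself cannot overflow. */
int difference_fits_int32 (int32_t top, int32_t bottom)
{
    int64_t diff = (int64_t) bottom - (int64_t) top;

    return diff >= INT32_MIN && diff <= INT32_MAX;
}

/* A direct comparison gives the same answer as "bottom - top > 0" whenever
 * the difference is representable, and stays well defined otherwise. */
int height_is_positive (int32_t top, int32_t bottom)
{
    return bottom > top;
}
```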
Søren Sandmann Pedersen [Wed, 23 Oct 2013 21:28:11 +0000 (17:28 -0400)]
test/trap-crasher.c: Add trapezoid that demonstrates a crash
This trapezoid causes a crash due to an underflow in the
pixman_trapezoid_valid().
Test case from Ritesh Khadgaray.
Brad Smith [Fri, 18 Oct 2013 03:22:02 +0000 (23:22 -0400)]
Fix pixman build with older GCC releases
The following patch fixes building pixman with older GCC releases
such as GCC 3.3 and older (OpenBSD; some older archs use GCC 3.3.6)
by changing the method of detecting the presence of __builtin_clz
to utilizing an autoconf check to determine its presence. Compilers
that pretend to be GCC, implement __builtin_clz and are already
utilizing the intrinsic include LLVM/Clang, Open64, EKOPath and
PCC.
Søren Sandmann Pedersen [Fri, 11 Oct 2013 04:49:44 +0000 (00:49 -0400)]
pixman-glyph.c: Add __force_align_arg_pointer to composite functions
The functions pixman_composite_glyphs_no_mask() and
pixman_composite_glyphs() can call into code compiled with -msse2,
which requires the stack to be aligned to 16 bytes. Since the ABIs on
Windows and Linux for x86-32 don't provide this guarantee, we need to
use this attribute to make GCC generate a prologue that realigns the
stack.
This fixes the crash introduced in the previous commit and also
https://bugs.freedesktop.org/show_bug.cgi?id=70348
and
https://bugs.freedesktop.org/show_bug.cgi?id=68300
Søren Sandmann Pedersen [Wed, 2 Oct 2013 18:38:16 +0000 (14:38 -0400)]
utils.c: On x86-32 unalign the stack before calling test_function
GCC when compiling with -msse2 and -mssse3 will assume that the stack
is aligned to 16 bytes even on x86-32 and accordingly issue movdqa
instructions for stack allocated variables.
But despite what GCC thinks, the standard ABI on x86-32 only requires
a 4-byte aligned stack. This is true at least on Windows, but there
also was (and maybe still is) Linux code in the wild that assumed
this. When such code calls into pixman and hits something compiled
with -msse2, we get a segfault from the unaligned movdqas.
Pixman has worked around this issue in the past with the gcc attribute
"force_align_arg_pointer" but the problem has resurfaced now in
https://bugs.freedesktop.org/show_bug.cgi?id=68300
because pixman_composite_glyphs() is missing this attribute.
This patch makes fuzzer_test_main() call the test_function through a
trampoline, which, on x86-32, has a bit of assembly that deliberately
avoids aligning the stack to 16 bytes as GCC normally expects. The
result is that glyph-test now crashes.
V2: Mark caller-save registers as clobbered, rather than using
noinline on the trampoline.
Siarhei Siamashka [Sat, 5 Oct 2013 19:00:26 +0000 (22:00 +0300)]
configure.ac: check and use -Wdeclaration-after-statement GCC option
The accidental use of declaration after statement breaks compilation
with C89 compilers such as MSVC. Assuming that MSVC is one of the
supported compilers, it makes sense to ask GCC to at least report
warnings for such problematic code.
Siarhei Siamashka [Wed, 2 Oct 2013 00:54:30 +0000 (00:54 +0000)]
sse2: bilinear fast path for src_x888_8888
Running cairo-perf-trace benchmark on Intel Core2 T7300:
Before:
[ 0] image t-firefox-canvas-swscroll 1.989 2.008 0.43% 8/8
[ 1] image firefox-canvas-scroll 4.574 4.609 0.50% 8/8
After:
[ 0] image t-firefox-canvas-swscroll 1.404 1.418 0.51% 8/8
[ 1] image firefox-canvas-scroll 4.228 4.259 0.36% 8/8
Søren Sandmann Pedersen [Thu, 10 Oct 2013 02:12:23 +0000 (22:12 -0400)]
configure.ac: Add check for pmulhuw assembly
Clang 3.0 chokes on the following bit of assembly
asm ("pmulhuw %1, %0\n\t"
: "+y" (__A)
: "y" (__B)
);
from pixman-mmx.c with this error message:
fatal error: error in backend: Unsupported asm: input constraint
with a matching output constraint of incompatible type!
So add a check in configure to only enable MMX when the compiler can
deal with it.
Søren Sandmann Pedersen [Thu, 10 Oct 2013 02:05:59 +0000 (22:05 -0400)]
scale.c: Use int instead of kernel_t for values in named_int_t
The 'value' field in the 'named_int_t' struct is used for both
pixman_repeat_t and pixman_kernel_t values, so the type should be int,
not pixman_kernel_t.
Fixes some warnings like this
scale.c:124:33: warning: implicit conversion from enumeration
type 'pixman_repeat_t' to different enumeration type
'pixman_kernel_t' [-Wconversion]
{ "None", PIXMAN_REPEAT_NONE },
~ ^~~~~~~~~~~~~~~~~~
when compiled with clang.
Søren Sandmann Pedersen [Fri, 4 Oct 2013 20:45:21 +0000 (16:45 -0400)]
pixman-combine32.c: Make Color Burn routine follow the math more closely
For superluminescent destinations, the old code could underflow in
uint32_t r = (ad - d) * as / s;
when (ad - d) was negative. The new code avoids this problem (and
therefore causes changes in the checksums of thread-test and
blitters-test), but it is likely still buggy due to the use of
unsigned variables and other issues in the blend mode code.
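The underflow in the quoted expression can be reproduced in isolation. Both functions below are hypothetical and assume s is nonzero; they differ only in the signedness of the operands.

```c
#include <stdint.h>

/* With unsigned operands, a negative (ad - d) wraps around to a huge value
 * before the multiplication and division. */
uint32_t burn_term_unsigned (uint32_t ad, uint32_t d, uint32_t as, uint32_t s)
{
    return (ad - d) * as / s;
}

/* With signed operands the same expression stays negative, which a later
 * clamp can then deal with. */
int32_t burn_term_signed (int32_t ad, int32_t d, int32_t as, int32_t s)
{
    return (ad - d) * as / s;
}
```

For a superluminescent destination such as ad = 100, d = 120, the signed version yields a small negative number while the unsigned version yields a very large positive one.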
Søren Sandmann Pedersen [Fri, 4 Oct 2013 20:40:17 +0000 (16:40 -0400)]
pixman-combine32: Make Color Dodge routine follow the math more closely
Change blend_color_dodge() to follow the math in the comment more
closely.
Note, the new code here is in some sense worse than the old code
because it can now underflow the unsigned variables when the source is
superluminescent and (as - s) is therefore negative. The old code was
careful to clamp to 0.
But for superluminescent variables we really need the ability for the
blend function to become negative, and so the solution to the underflow
problem is to just use signed variables. The use of unsigned variables
is a general problem in all of the blend mode code that will have to
be solved later.
The CRC32 values in thread-test and blitters-test are updated to
account for the changes in output.
Søren Sandmann Pedersen [Fri, 4 Oct 2013 20:35:35 +0000 (16:35 -0400)]
pixman-combine32: Rename a number of variables from sa/sca to as/s
There are no semantic changes, just variable renames. The motivation
for these renames is so that the names are shorter and better match
the ones used in the comments.
Søren Sandmann Pedersen [Fri, 4 Oct 2013 20:27:39 +0000 (16:27 -0400)]
pixman-combine32: Improve documentation for blend mode operators
This commit overhauls the comments in pixman-combine32.c regarding
blend modes:
- Add a link to the PDF supplement that clarifies the specification of
ColorBurn and ColorDodge
- Clarify how the formulas for premultiplied colors are derived from
the ones in the PDF specifications
- Write out the derivation of the formulas in each blend routine
Søren Sandmann Pedersen [Fri, 4 Oct 2013 20:40:23 +0000 (16:40 -0400)]
pixman-combine32.c: Formatting fixes
Fix a bunch of spacing issues.
V2: More spacing issues, in the _ca combiners
Andrea Canciani [Wed, 9 Oct 2013 16:23:27 +0000 (18:23 +0200)]
Fix thread-test on non-OpenMP systems
The non-reentrant versions of prng_* functions are thread-safe only in
OpenMP-enabled builds.
Fixes thread-test failing when compiled with Clang (both on Linux and
on MacOS).
Andrea Canciani [Thu, 26 Sep 2013 07:23:41 +0000 (09:23 +0200)]
Add support for SSSE3 to the MSVC build system
Handle SSSE3 just like MMX and SSE2.
Andrea Canciani [Thu, 26 Sep 2013 07:26:17 +0000 (09:26 +0200)]
Fix build of check-formats on MSVC
Fixes
check-formats.obj : error LNK2019: unresolved external symbol
_strcasecmp referenced in function _format_from_string
check-formats.obj : error LNK2019: unresolved external symbol
_snprintf referenced in function _list_operators
Andrea Canciani [Thu, 26 Sep 2013 07:12:31 +0000 (09:12 +0200)]
Fix building of "other" programs on MSVC
In d1434d112ca5cd325e4fb85fc60afd1b9e902786 the benchmarks have been
extended to include other programs as well and the variable names have
been updated accordingly in the autotools-based build system, but not
in the MSVC one.
Andrea Canciani [Thu, 26 Sep 2013 07:16:41 +0000 (09:16 +0200)]
Fix build on MSVC
After a4c79d695d52c94647b1aff78548e5892d616b70 the MMX and SSE2 code
has some declarations after the beginning of a block, which is not
allowed by MSVC.
Fixes multiple errors like:
pixman-mmx.c(3625) : error C2275: '__m64' : illegal use of this type
as an expression
pixman-sse2.c(5708) : error C2275: '__m128i' : illegal use of this
type as an expression
Søren Sandmann Pedersen [Wed, 2 Oct 2013 21:51:36 +0000 (17:51 -0400)]
fast: Swap image and iter flags in generated fast paths
The generated fast paths that were moved into the 'fast'
implementation in ec0e38cbb746a673f8e989ab8eae356c8c77dac7 had their
image and iter flag arguments swapped; as a result, none of the fast
paths were ever called.
Siarhei Siamashka [Sat, 28 Sep 2013 01:51:21 +0000 (04:51 +0300)]
vmx: there is no need to handle unaligned destination anymore
So the redundant variables, memory reads/writes and reshuffles
can be safely removed. For example, this makes the inner loop
of the 'vmx_combine_add_u_no_mask' function much simpler.
Before:
7a20:7d a8 48 ce lvx v13,r8,r9
7a24:7d 80 48 ce lvx v12,r0,r9
7a28:7d 28 50 ce lvx v9,r8,r10
7a2c:7c 20 50 ce lvx v1,r0,r10
7a30:39 4a 00 10 addi r10,r10,16
7a34:10 0d 62 eb vperm v0,v13,v12,v11
7a38:10 21 4a 2b vperm v1,v1,v9,v8
7a3c:11 2c 6a eb vperm v9,v12,v13,v11
7a40:10 21 4a 00 vaddubs v1,v1,v9
7a44:11 a1 02 ab vperm v13,v1,v0,v10
7a48:10 00 0a ab vperm v0,v0,v1,v10
7a4c:7d a8 49 ce stvx v13,r8,r9
7a50:7c 00 49 ce stvx v0,r0,r9
7a54:39 29 00 10 addi r9,r9,16
7a58:42 00 ff c8 bdnz+ 7a20 <.vmx_combine_add_u_no_mask+0x120>
After:
76c0:7c 00 48 ce lvx v0,r0,r9
76c4:7d a8 48 ce lvx v13,r8,r9
76c8:39 29 00 10 addi r9,r9,16
76cc:7c 20 50 ce lvx v1,r0,r10
76d0:10 00 6b 2b vperm v0,v0,v13,v12
76d4:10 00 0a 00 vaddubs v0,v0,v1
76d8:7c 00 51 ce stvx v0,r0,r10
76dc:39 4a 00 10 addi r10,r10,16
76e0:42 00 ff e0 bdnz+ 76c0 <.vmx_combine_add_u_no_mask+0x120>
Siarhei Siamashka [Sat, 28 Sep 2013 00:48:07 +0000 (03:48 +0300)]
vmx: align destination to fix valgrind invalid memory writes
The SIMD optimized inner loops in the VMX/Altivec code are trying
to emulate unaligned accesses to the destination buffer. For each
4 pixels (which fit into a 128-bit register) the current
implementation:
1. first performs two aligned reads, which cover the needed data
2. reshuffles bytes to get the needed data in a single vector register
3. does all the necessary calculations
4. reshuffles bytes back to their original location in two registers
5. performs two aligned writes back to the destination buffer
Unfortunately, if the destination buffer is unaligned and
the width is a perfect multiple of 4 pixels, we may have some writes
crossing the boundaries of the destination buffer. In a multithreaded
environment this may potentially corrupt the data outside of the
destination buffer if it is concurrently read and written by some
other thread.
The valgrind report for blitters-test is full of:
==23085== Invalid write of size 8
==23085== at 0x1004B0B4: vmx_combine_add_u (pixman-vmx.c:1089)
==23085== by 0x100446EF: general_composite_rect (pixman-general.c:214)
==23085== by 0x10002537: test_composite (blitters-test.c:363)
==23085== by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085== by 0x10004943: fuzzer_test_main (utils.c:728)
==23085== by 0x10002C17: main (blitters-test.c:397)
==23085== Address 0x5188218 is 0 bytes after a block of size 88 alloc'd
==23085== at 0x4051DA0: memalign (vg_replace_malloc.c:581)
==23085== by 0x4051E7B: posix_memalign (vg_replace_malloc.c:709)
==23085== by 0x10004CFF: aligned_malloc (utils.c:833)
==23085== by 0x10001DCB: create_random_image (blitters-test.c:47)
==23085== by 0x10002263: test_composite (blitters-test.c:283)
==23085== by 0x1000369B: fuzzer_test_main._omp_fn.0 (utils.c:733)
==23085== by 0x10004943: fuzzer_test_main (utils.c:728)
==23085== by 0x10002C17: main (blitters-test.c:397)
This patch addresses the problem by first aligning the destination
buffer at a 16 byte boundary in each combiner function. This trick
is borrowed from the pixman SSE2 code.
It allows the new thread-test to pass on PowerPC VMX/Altivec systems and
also resolves the "make check" failure reported for POWER7 hardware:
http://lists.freedesktop.org/archives/pixman/2013-August/002871.html
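The alignment trick can be sketched in portable C, with a scalar four-pixel loop standing in for the aligned vector load and store. This is an illustration of the loop structure only, not the VMX code; all names are hypothetical.

```c
#include <stdint.h>

/* Per-channel saturating add of two 8888 pixels. */
static uint32_t add_saturate_8888 (uint32_t d, uint32_t s)
{
    uint32_t r = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((d >> shift) & 0xff) + ((s >> shift) & 0xff);

        if (c > 0xff)
            c = 0xff;
        r |= c << shift;
    }
    return r;
}

void combine_add_u_sketch (uint32_t *dest, const uint32_t *src, int width)
{
    int i;

    /* Leading pixels: one at a time until dest is 16-byte aligned. */
    while (width > 0 && ((uintptr_t) dest & 15) != 0)
    {
        *dest = add_saturate_8888 (*dest, *src);
        dest++; src++; width--;
    }

    /* Aligned body: four pixels per step, standing in for a single aligned
     * vector store that never touches memory outside the destination. */
    for (; width >= 4; width -= 4, dest += 4, src += 4)
        for (i = 0; i < 4; i++)
            dest[i] = add_saturate_8888 (dest[i], src[i]);

    /* Trailing pixels that do not fill a whole vector. */
    while (width-- > 0)
    {
        *dest = add_saturate_8888 (*dest, *src);
        dest++; src++;
    }
}
```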
Søren Sandmann Pedersen [Sat, 28 Sep 2013 05:10:24 +0000 (01:10 -0400)]
test: Add new thread-test program
This test program allocates an array of 16 * 7 uint32_ts and spawns 16
threads that each use 7 of the allocated uint32_ts as a destination
image for a large number of composite operations. Each thread then
computes and returns a checksum for the image. Finally, the main
thread computes a checksum of the checksums and verifies that it
matches expectations.
The purpose of this test is to catch errors where memory outside images
is read and then written back. Such out-of-bounds accesses are broken
when multiple threads are involved, because the threads will race to
read and write the shared memory.
V2:
- Incorporate fixes from Siarhei for endianness and undefined behavior
regarding argument evaluation
- Make the images 7 pixels wide since the bug only happens when the
composite width is greater than 4.
- Compute a checksum of the checksums so that you don't have to
update 16 values if something changes.
V3: Remove stray dollar sign
Søren Sandmann Pedersen [Sat, 28 Sep 2013 05:03:55 +0000 (01:03 -0400)]
Rename HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS
The test for pthread_setspecific() can be used as a general test for
whether pthreads are available, so rename the variable from
HAVE_PTHREAD_SETSPECIFIC to HAVE_PTHREADS and run the test even when
better support for thread local variables is available.
However, the pthread arguments are still only added to CFLAGS and
LDFLAGS when pthread_setspecific() is used for thread local variables.
V2: AC_SUBST(PTHREAD_CFLAGS)
Søren Sandmann Pedersen [Sun, 29 Sep 2013 20:47:53 +0000 (16:47 -0400)]
blitters-test: Remove unused variable
Søren Sandmann Pedersen [Thu, 26 Sep 2013 22:56:07 +0000 (18:56 -0400)]
utils.c: Make image_endian_swap() deal with negative strides
Use a temporary variable s containing the absolute value of the stride
as the upper bound in the inner loops.
V2: Do this for the bpp == 16 case as well
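A simplified sketch of the idea for the 32 bpp case only. It assumes data points at row 0 and stride is given in bytes; the function name and signature are hypothetical and do not match image_endian_swap().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reverse the byte order of every 32-bit pixel. Rows are addressed with the
 * signed stride, but the inner loop is bounded by its absolute value, so
 * images stored bottom-up (negative stride) are handled as well. */
void swap_rows_32bpp (uint8_t *data, int height, int stride)
{
    int s = abs (stride);       /* bytes per row */
    int i, j;

    for (i = 0; i < height; i++)
    {
        uint8_t *row = data + (ptrdiff_t) i * stride;

        for (j = 0; j + 4 <= s; j += 4)
        {
            uint8_t t0 = row[j], t1 = row[j + 1];

            row[j] = row[j + 3];
            row[j + 1] = row[j + 2];
            row[j + 2] = t1;
            row[j + 3] = t0;
        }
    }
}
```

If the inner loop were bounded by the raw stride instead, a negative stride would make the loop body never run.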