Søren Sandmann Pedersen [Mon, 28 Jan 2013 01:08:06 +0000 (20:08 -0500)]
Change default GPGKEY to 3892336E, which is soren.sandmann@gmail.com
The old one belongs to the email address sandmann@daimi.au.dk, which
doesn't work anymore.
Also use gpg to get the name and address for the "(Signed by ...)"
line since that works more reliably for me than using git.
Ben Avison [Thu, 24 Jan 2013 18:19:48 +0000 (18:19 +0000)]
Improve L1 and L2 benchmark tests for caches that don't use allocate-on-write
In particular this affects single-core ARMs (e.g. ARM11, Cortex-A8), which
are usually configured this way. For other CPUs, this should only add a
constant time, which will be cancelled out by the EXCLUDE_OVERHEAD runs.
The problems were caused by cachelines becoming permanently evicted from
the cache, because the code that was intended to pull them back in again on
each iteration assumed too long a cache line (for the L1 test) or failed to
read memory beyond the first pixel row (for the L2 test). Also, the reloading
of the source buffer was unnecessary.
These issues were identified by Siarhei in this post:
http://lists.freedesktop.org/archives/pixman/2013-January/002543.html
Søren Sandmann Pedersen [Fri, 18 Jan 2013 19:13:21 +0000 (14:13 -0500)]
pixman-combine-float.c: Use IS_ZERO() in clip_color() and set_sat()
The clip_color() function has some checks to avoid division by zero,
but they are done by comparing the value to 4 * FLT_EPSILON, where a
better choice is the IS_ZERO() macro that compares to +/- FLT_MIN.
In set_sat(), the check is that *max > *min before dividing by *max -
*min, but that has the potential problem that interactions between GCC
optimizations and 80 bit x87 registers could mean that (*max > *min) is
true in 80 bits, but (*max - *min) is 0 in 32 bits, so that the
division by zero is not prevented. Using IS_ZERO() here as well
prevents this.
Ben Avison [Sat, 19 Jan 2013 16:16:53 +0000 (16:16 +0000)]
ARMv6: Replacement add_8_8, over_8888_8888, over_8888_n_8888 and over_n_8_8888 routines
Improved by adding preloads, combining writes and using the SEL
instruction.
add_8_8
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  62.1    0.2      543.4   12.4     100.0%       +774.9%
L2  38.7    0.4      116.8   1.7      100.0%       +201.8%
M   40.0    0.1      110.1   0.5      100.0%       +175.3%
HT  30.9    0.2      43.4    0.5      100.0%       +40.4%
VT  30.6    0.3      39.2    0.5      100.0%       +28.0%
R   21.3    0.2      35.4    0.4      100.0%       +66.6%
RT  8.6     0.2      10.2    0.3      100.0%       +19.4%
over_8888_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  32.3    0.1      38.0    0.2      100.0%       +17.7%
L2  15.9    0.4      30.6    0.5      100.0%       +92.8%
M   13.3    0.0      25.6    0.0      100.0%       +92.9%
HT  10.5    0.1      15.5    0.1      100.0%       +47.1%
VT  10.4    0.1      14.6    0.1      100.0%       +40.8%
R   10.3    0.1      15.8    0.1      100.0%       +53.3%
RT  6.0     0.1      7.6     0.1      100.0%       +25.9%
over_8888_n_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  17.6    0.1      21.0    0.1      100.0%       +19.2%
L2  11.2    0.2      19.2    0.1      100.0%       +71.2%
M   10.2    0.0      19.6    0.0      100.0%       +92.6%
HT  8.4     0.0      11.9    0.1      100.0%       +41.7%
VT  8.3     0.0      11.3    0.1      100.0%       +36.4%
R   8.3     0.0      11.8    0.1      100.0%       +43.1%
RT  5.1     0.1      6.2     0.1      100.0%       +21.3%
over_n_8_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  17.5    0.1      22.8    0.8      100.0%       +30.1%
L2  14.2    0.3      21.7    0.2      100.0%       +52.6%
M   12.0    0.0      22.3    0.0      100.0%       +84.8%
HT  10.5    0.1      14.1    0.1      100.0%       +34.5%
VT  10.0    0.1      13.5    0.1      100.0%       +35.3%
R   9.4     0.0      12.9    0.2      100.0%       +37.7%
RT  5.5     0.1      6.5     0.2      100.0%       +19.2%
Ben Avison [Sat, 19 Jan 2013 16:16:52 +0000 (16:16 +0000)]
ARMv6: New conversion routines
There was no previous attempt at accelerating these specifically for
ARMv6.
src_x888_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  96.7    0.5      270.4   2.6      100.0%       +179.5%
L2  44.6    2.7      110.6   9.7      100.0%       +148.0%
M   26.9    0.1      87.6    0.5      100.0%       +226.1%
HT  19.3    0.2      37.5    0.4      100.0%       +93.7%
VT  18.6    0.1      33.7    0.4      100.0%       +81.6%
R   18.4    0.1      32.2    0.3      100.0%       +75.2%
RT  9.2     0.2      12.1    0.3      100.0%       +31.4%
src_0565_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  37.0    0.3      66.9    0.2      100.0%       +80.8%
L2  30.3    0.2      55.9    0.3      100.0%       +84.4%
M   25.9    0.0      62.3    0.2      100.0%       +140.3%
HT  15.2    0.1      33.1    0.3      100.0%       +116.9%
VT  15.1    0.1      30.7    0.3      100.0%       +103.6%
R   14.2    0.1      27.6    0.3      100.0%       +94.0%
RT  6.0     0.1      11.2    0.3      100.0%       +87.2%
Ben Avison [Sat, 19 Jan 2013 16:16:51 +0000 (16:16 +0000)]
ARMv6: New blit routines
These are usable either as various composite operations, or via the
top-level function pixman_blt() which now does some blitting for the
first time on an ARMv6 platform (previously it just returned FALSE).
src_8888_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  414.5   9.4      445.8   3.6      100.0%       +7.6%
L2  93.3    20.7     114.5   12.9     100.0%       +22.7%
M   57.0    0.2      89.2    0.5      100.0%       +56.4%
HT  28.7    0.3      39.6    0.4      100.0%       +37.9%
VT  25.5    0.2      35.3    0.4      100.0%       +38.4%
R   20.1    0.1      33.8    0.3      100.0%       +67.8%
RT  7.8     0.2      12.7    0.4      100.0%       +62.7%
src_0565_0565
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  397.4   6.1      412.5   5.2      100.0%       +3.8%
L2  143.2   10.9     141.9   6.5      68.9%        -0.9% (insignificant)
M   90.7    0.4      133.5   0.7      100.0%       +47.1%
HT  38.6    0.3      53.7    0.7      100.0%       +39.0%
VT  33.0    0.3      47.3    0.6      100.0%       +43.3%
R   25.7    0.2      42.1    0.5      100.0%       +64.1%
RT  8.0     0.2      13.3    0.3      100.0%       +65.6%
src_8_8
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  716.5   9.8      768.2   20.4     100.0%       +7.2%
L2  246.2   12.7     260.5   8.8      100.0%       +5.8%
M   146.8   0.7      227.9   0.7      100.0%       +55.2%
HT  44.9    0.6      62.1    1.0      100.0%       +38.2%
VT  35.6    0.4      53.4    0.7      100.0%       +50.0%
R   29.7    0.3      48.2    0.6      100.0%       +62.2%
RT  8.6     0.2      12.9    0.4      100.0%       +49.3%
Ben Avison [Sat, 19 Jan 2013 16:16:50 +0000 (16:16 +0000)]
ARMv6: New fill routines
Note that this also effectively accelerates src_n_8888, src_n_0565 and
src_n_8 composite types, because of the fast paths in
pixman-fast-path.c implemented by fast_composite_solid_fill(), which
end up dispatching these platform-specific fill routines.
src_n_8888
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  157.3   1.1      574.2   8.7      100.0%       +265.0%
L2  94.2    0.5      364.8   4.2      100.0%       +287.3%
M   92.7    0.4      358.7   1.1      100.0%       +287.1%
HT  68.5    0.9      133.6   4.0      100.0%       +95.2%
VT  61.3    0.8      111.8   2.6      100.0%       +82.4%
R   61.1    0.9      108.7   2.8      100.0%       +78.1%
RT  24.6    1.0      28.6    1.6      100.0%       +16.0%
src_n_0565
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  157.4   1.0      983.1   38.5     100.0%       +524.6%
L2  93.6    0.5      696.0   14.3     100.0%       +643.4%
M   92.7    0.4      680.5   1.0      100.0%       +634.0%
HT  68.3    0.9      160.3   6.6      100.0%       +134.6%
VT  61.1    0.8      130.1   3.4      100.0%       +112.9%
R   61.0    0.8      125.4   4.1      100.0%       +105.7%
RT  24.9    1.3      29.5    1.5      100.0%       +18.2%
src_n_8
    Before           After
    Mean    StdDev   Mean    StdDev   Confidence   Change
L1  154.7   1.0      1324.4  48.5     100.0%       +756.3%
L2  92.4    0.4      1178.4  10.9     100.0%       +1175.6%
M   92.9    0.4      1275.7  2.1      100.0%       +1273.5%
HT  68.2    1.0      169.8   5.5      100.0%       +149.0%
VT  61.2    1.0      138.5   3.6      100.0%       +126.3%
R   61.3    0.9      130.1   3.8      100.0%       +112.4%
RT  25.5    1.3      29.2    1.9      100.0%       +14.6%
Ben Avison [Mon, 28 Jan 2013 17:03:50 +0000 (17:03 +0000)]
ARMv6: Lay the groundwork for later patches in the series
Move the entire contents of pixman-arm-simd-asm.S to a new file;
ultimately this will only retain the scaled operations, so it is
named pixman-arm-simd-asm-scaled.S. Added new header file
pixman-arm-simd-asm.h, containing the macros which are the basis of
all the new ARMv6 implementations, although at this point in the
series, nothing uses them and the library should be binary-identical.
Søren Sandmann Pedersen [Sat, 26 Jan 2013 05:34:53 +0000 (00:34 -0500)]
demo/scale: Add a spin button to set the number of subsample bits
For large upscalings the level of subsampling for the filter has a
quite visible effect, so make it settable in the UI so that people can
experiment with various values.
Siarhei Siamashka [Sat, 15 Dec 2012 05:18:53 +0000 (07:18 +0200)]
Use pixman_transform_point_31_16() from pixman_transform_point()
Old functions pixman_transform_point() and pixman_transform_point_3d()
now become just wrappers for pixman_transform_point_31_16() and
pixman_transform_point_31_16_3d(). Eventually their uses should be
completely eliminated in the pixman code and replaced with their
extended range counterparts. This is needed in order to be able
to correctly handle any matrices and parameters that may come
to pixman from the code responsible for XRender implementation.
Siarhei Siamashka [Sat, 15 Dec 2012 04:19:21 +0000 (06:19 +0200)]
test: Added matrix-test for testing projective transform accuracy
This test uses __float128 data type when it is available
for implementing a "perfect" reference implementation. The
output from pixman_transform_point_31_16() and
pixman_transform_point_31_16_affine() is compared with the
reference implementation to make sure that the rounding
errors may only show up in a single least significant bit.
Platforms and compilers that do not support the __float128 data type
can rely on a crc32 checksum of the pseudorandom transform results
instead.
Siarhei Siamashka [Wed, 12 Dec 2012 00:41:55 +0000 (02:41 +0200)]
configure.ac: Added detection for __float128 support
GCC supports 128-bit floating point data type on some platforms (including
but not limited to x86 and x86-64). This may be useful for tests that
need perfectly accurate reference implementations of certain algorithms.
Siarhei Siamashka [Fri, 14 Dec 2012 16:43:57 +0000 (18:43 +0200)]
Add higher precision "pixman_transform_point_*" functions
The following new functions are added:
pixman_transform_point_31_16_3d() -
Calculates the product of a matrix and a vector.
pixman_transform_point_31_16() -
Calculates the product of a matrix and a vector, then converts the
resulting homogeneous vector [x, y, z] to its
cartesian [x', y', 1] variant, where x' = x / z, and y' = y / z.
pixman_transform_point_31_16_affine() -
A faster sibling of the other two functions, which assumes affine
transformation, where the bottom row of the matrix is [0, 0, 1] and
the last element of the input vector is set to 1.
These functions transform a point with 31.16 fixed point coordinates from
the destination space to a point with 48.16 fixed point coordinates in
the source space.
The results are accurate and the rounding errors may only show up in
the least significant bit. No overflows are possible for the affine
transformations as long as the input data is provided in 31.16 format.
In the case of projective transformations, some output values may not be
representable using 48.16 fixed point format. In this case the results
are clamped to return maximum or minimum 48.16 values (so that the caller
can at least handle NONE and PAD repeats correctly).
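The homogeneous-to-cartesian step can be sketched as follows. This is only a conceptual illustration in double precision: the real functions work on 31.16 and 48.16 fixed point values and clamp on overflow, none of which is modelled here. The names matrix_t, vector_t and transform_point are hypothetical and are not part of the pixman API.

```c
/* Conceptual sketch only: hypothetical types, double precision instead of
 * the fixed point formats used by the real functions. */
typedef struct { double m[3][3]; } matrix_t;
typedef struct { double v[3]; } vector_t;

/* Multiplies the matrix by the vector, then converts the homogeneous
 * result [x, y, z] to [x / z, y / z, 1]. Returns 0 if z is zero. */
static int
transform_point (const matrix_t *t, vector_t *p)
{
    double r[3];
    int i, j;

    for (i = 0; i < 3; i++)
    {
        r[i] = 0.0;
        for (j = 0; j < 3; j++)
            r[i] += t->m[i][j] * p->v[j];
    }

    if (r[2] == 0.0)
        return 0;

    p->v[0] = r[0] / r[2];
    p->v[1] = r[1] / r[2];
    p->v[2] = 1.0;
    return 1;
}
```

For an affine matrix the bottom row is [0, 0, 1], so z stays 1 and the division can be skipped, which is what makes the affine variant cheaper.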
Siarhei Siamashka [Mon, 3 Dec 2012 15:42:21 +0000 (17:42 +0200)]
Faster fetch for the C variant of r5g6b5 src/dest iterator
Processing two pixels at once is used to reduce the number of
arithmetic operations.
The speedup relative to the generic fetch_scanline_r5g6b5() from
"pixman-access.c" (pixman was compiled with gcc 4.7.2):
MIPS 74K 480MHz : 20.32 MPix/s -> 26.47 MPix/s
ARM11 700MHz : 34.95 MPix/s -> 38.22 MPix/s
ARM Cortex-A8 1000MHz : 87.44 MPix/s -> 100.92 MPix/s
ARM Cortex-A9 1700MHz : 150.95 MPix/s -> 158.13 MPix/s
ARM Cortex-A15 1700MHz : 148.91 MPix/s -> 155.42 MPix/s
IBM Cell PPU 3200MHz : 75.29 MPix/s -> 98.33 MPix/s
Intel Core i7 2800MHz : 257.02 MPix/s -> 376.93 MPix/s
That's the performance for C code (SIMD and assembly optimizations
are disabled via PIXMAN_DISABLE environment variable).
Siarhei Siamashka [Mon, 3 Dec 2012 15:07:31 +0000 (17:07 +0200)]
Faster write-back for the C variant of r5g6b5 dest iterator
Unrolling loops improves performance, so just use it here.
Also GCC can't properly optimize this code for RISC processors and
allocate 0x1F001F constant in a register. Because this constant is
too large to be represented as an immediate operand in instructions,
GCC inserts some redundant arithmetic. This problem can be worked around
by explicitly using a variable for 0x1F001F constant and also initializing
it by a read from another volatile variable. In this case GCC is forced
to allocate a register for it, because it is not seen as a constant anymore.
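A minimal sketch of the volatile workaround described above, assuming a plain (not unrolled) loop for brevity. The names volatile_x1F001F and store_scanline_0565_sketch are made up for this illustration and need not match the actual patch.

```c
#include <stdint.h>

/* Reading the mask from a volatile variable hides its value from the
 * optimizer, so the compiler has to keep it in a register instead of
 * rebuilding the constant with extra instructions. */
static volatile uint32_t volatile_x1F001F = 0x1F001F;

static void
store_scanline_0565_sketch (uint16_t *dst, const uint32_t *src, int width)
{
    uint32_t x1F001F = volatile_x1F001F; /* read once, outside the loop */
    int i;

    for (i = 0; i < width; i++)
    {
        uint32_t s = src[i];
        uint32_t a = (s >> 3) & x1F001F; /* top 5 bits of red and blue */
        uint32_t b = s & 0xFC00;         /* top 6 bits of green */

        a |= a >> 5;                     /* move red down to bits 11-15 */
        a |= b >> 5;                     /* move green down to bits 5-10 */
        dst[i] = (uint16_t) a;           /* truncation drops the garbage */
    }
}
```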
The speedup relative to the generic store_scanline_r5g6b5() from
"pixman-access.c" (pixman was compiled with gcc 4.7.2):
MIPS 74K 480MHz : 33.22 MPix/s -> 43.42 MPix/s
ARM11 700MHz : 50.16 MPix/s -> 78.23 MPix/s
ARM Cortex-A8 1000MHz : 117.75 MPix/s -> 196.34 MPix/s
ARM Cortex-A9 1700MHz : 177.04 MPix/s -> 320.32 MPix/s
ARM Cortex-A15 1700MHz : 231.44 MPix/s -> 261.64 MPix/s
IBM Cell PPU 3200MHz : 130.25 MPix/s -> 145.61 MPix/s
Intel Core i7 2800MHz : 502.21 MPix/s -> 721.73 MPix/s
That's the performance for C code (SIMD and assembly optimizations
are disabled via PIXMAN_DISABLE environment variable).
Siarhei Siamashka [Mon, 3 Dec 2012 04:32:46 +0000 (06:32 +0200)]
Added C variants of r5g6b5 fetch/write-back iterators
Adding specialized iterators for r5g6b5 color format allows us to work
on fine tuning performance of r5g6b5 fetch/write-back operations in the
pixman general "fetch -> combine -> store" pipeline.
These iterators also make "src_x888_0565" fast path redundant, so it can
be removed.
Chris Wilson [Wed, 23 Jan 2013 10:27:22 +0000 (10:27 +0000)]
Eliminate duplicate copies of channel flags for pixman_image_composite32()
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Sat, 12 Jan 2013 16:52:47 +0000 (16:52 +0000)]
Always return a valid function from lookup_combiner()
We should always have at least a C combiner available, so we never
expect the search to fail. If it does, emit an error and return a
dummy function.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
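The "log an error and return a dummy" pattern can be sketched like this. The table, the combine_func_t type and all function names are hypothetical; the real lookup in pixman searches the implementation chain rather than a flat array.

```c
#include <stdint.h>
#include <stdio.h>

typedef void (*combine_func_t) (uint32_t *dest, const uint32_t *src, int width);

/* Example combiner: plain copy (the SRC operator). */
static void
combine_src_sketch (uint32_t *dest, const uint32_t *src, int width)
{
    int i;
    for (i = 0; i < width; i++)
        dest[i] = src[i];
}

/* Fallback that does nothing, so callers never get a NULL pointer. */
static void
dummy_combine_sketch (uint32_t *dest, const uint32_t *src, int width)
{
    (void) dest; (void) src; (void) width;
}

static combine_func_t
lookup_combiner_sketch (int op)
{
    static const combine_func_t table[] = { combine_src_sketch };

    if (op >= 0 && op < (int) (sizeof table / sizeof table[0]))
        return table[op];

    /* Should never happen; report it but keep the caller safe. */
    fprintf (stderr, "No known combine function for operator %d\n", op);
    return dummy_combine_sketch;
}
```

Callers can then invoke the returned pointer unconditionally, which is what removes the conditionals mentioned in these commits.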
Chris Wilson [Sat, 12 Jan 2013 08:28:32 +0000 (08:28 +0000)]
Always return a valid function from lookup_composite()
We never expect to fail to find the appropriate function as the
general_composite_rect should always match. So if somehow we fallthrough
the search, emit a _pixman_log_error() and return a dummy function.
Note that we remove some conditionals and a level of indentation hence a
large amount of code movement. This also reveals that in a few places we
are duplicating stack variables that can be eliminated later.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 8 Jan 2013 18:39:03 +0000 (18:39 +0000)]
sse2: Add fast paths for bilinear source with a solid mask
Based on the existing sse2_8888_n_8888 nearest scaling routines.
fishbowl on an i5-2500: 60.9s -> 56.9s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 1 Jan 2013 19:41:54 +0000 (19:41 +0000)]
sse2: Add a fast path for add_n_8_8888
This path is being exercised by compositing of trapezoids for clipmasks, for
instance as used in the firefox-asteroids cairo-trace.
IVB i7-3720qm ./tests/lowlevel-blt-bench add_n_8_8888:
reference memcpy speed = 14846.7MB/s (3711.7MP/s for 32bpp fills)
before: L1: 681.10 L2: 735.14 M:701.44 ( 28.35%) HT:283.32 VT:213.23 R:208.93 RT: 77.89 ( 793Kops/s)
after: L1: 992.91 L2:1017.33 M:982.58 ( 39.88%) HT:458.93 VT:332.32 R:326.13 RT:136.66 (1287Kops/s)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 1 Jan 2013 19:41:54 +0000 (19:41 +0000)]
sse2: Add a fast path for add_n_8888
This path is being exercised by inplace compositing of trapezoids, for
instance as used in the firefox-asteroids cairo-trace.
IVB i7-3720qm ./tests/lowlevel-blt-bench add_n_8888:
reference memcpy speed = 14918.3MB/s (3729.6MP/s for 32bpp fills)
before: L1:1752.44 L2:2259.48 M:2215.73 ( 58.80%) HT:589.49 VT:404.04 R:424.69 RT:134.68 (1182Kops/s)
after: L1:3931.21 L2:6132.78 M:3440.17 ( 92.24%) HT:1337.70 VT:1357.64 R:1270.27 RT:359.78 (2161Kops/s)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Jeff Muizelaar [Thu, 24 Jan 2013 19:49:41 +0000 (14:49 -0500)]
Add a version of bilinear_interpolation for precision <=4
Having 4 or fewer bits means we can do two components at
a time in a single 32 bit register.
Here are the results for firefox-fishtank on a Pandaboard with
4.6.3 and PIXMAN_DISABLE="arm-neon"
Before:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image t-firefox-fishtank 7.841 7.910 0.70% 6/6
After:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image t-firefox-fishtank 6.951 6.995 1.11% 6/6
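The idea of handling two components per 32 bit register can be sketched as below, assuming 4 bit weights (distx and disty in the range 0..15). The four weights always sum to 256, so each 8 bit component multiplied and accumulated stays below 65536 and cannot spill into the neighbouring component. The function name is hypothetical and the code is an illustration, not necessarily identical to the committed version.

```c
#include <stdint.h>

static uint32_t
bilinear_4bit_sketch (uint32_t tl, uint32_t tr,
                      uint32_t bl, uint32_t br,
                      uint32_t distx, uint32_t disty)
{
    uint32_t w_br = distx * disty;               /* distx * disty */
    uint32_t w_tr = (distx << 4) - w_br;         /* distx * (16 - disty) */
    uint32_t w_bl = (disty << 4) - w_br;         /* (16 - distx) * disty */
    uint32_t w_tl = 256 - (distx << 4) - (disty << 4) + w_br;
    uint32_t lo, hi;

    /* lo holds blue and red, hi holds green and alpha, 16 bits each. */
    lo  = (tl & 0xff00ff) * w_tl;
    hi  = ((tl >> 8) & 0xff00ff) * w_tl;
    lo += (tr & 0xff00ff) * w_tr;
    hi += ((tr >> 8) & 0xff00ff) * w_tr;
    lo += (bl & 0xff00ff) * w_bl;
    hi += ((bl >> 8) & 0xff00ff) * w_bl;
    lo += (br & 0xff00ff) * w_br;
    hi += ((br >> 8) & 0xff00ff) * w_br;

    /* Divide by 256 and put the components back in place. */
    return ((lo >> 8) & 0xff00ff) | (hi & 0xff00ff00);
}
```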
Ben Avison [Sat, 19 Jan 2013 16:36:22 +0000 (16:36 +0000)]
Tweaks to lowlevel-blt-bench
This adds two extra tests, src_n_8 and src_8_8, which I have been
using to benchmark my ARMv6 changes.
I'd also like to propose that it requires an exact test name as the
executable's argument, as achieved by this strstr to strcmp change.
Without this, it is impossible to only benchmark (for example)
add_8_8, add_n_8 or src_n_8, due to those also being substrings of
many other test names.
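The difference between the two matching strategies is easy to see in isolation. The helper names below are hypothetical; only strstr() and strcmp() come from the description above.

```c
#include <string.h>

/* Old behaviour: a test runs if the argument occurs anywhere in its name. */
static int
matches_substring (const char *test_name, const char *pattern)
{
    return strstr (test_name, pattern) != NULL;
}

/* New behaviour: a test runs only if the names are identical. */
static int
matches_exact (const char *test_name, const char *pattern)
{
    return strcmp (test_name, pattern) == 0;
}
```

With substring matching, asking for src_n_8 also selects src_n_8888; with exact matching it does not.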
Søren Sandmann Pedersen [Sat, 19 Jan 2013 17:29:48 +0000 (12:29 -0500)]
test: Use operator_name() and format_name() in composite.c
With the operator_name() and format_name() functions there is no
longer any reason for composite.c to have its own table of format and
operator names.
Søren Sandmann Pedersen [Sat, 19 Jan 2013 14:36:50 +0000 (09:36 -0500)]
utils.[ch]: Add new format_name() function
This function returns the name of the given format code, which is
useful for printing out debug information. The function is written as
a switch without a default value so that the compiler will warn if new
formats are added in the future. The fake formats used in the fast
path tables are also recognized.
The function is used in alpha_map.c, where it replaces an existing
format_name() function, and in blitters-test.c, affine-test.c, and
scaling-test.c.
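The "switch without a default" technique looks roughly like this. The enum and its values are invented for the sketch; the real function switches over pixman format codes.

```c
#include <string.h>

/* Hypothetical enum standing in for the real format codes. */
typedef enum
{
    SKETCH_FORMAT_A8R8G8B8,
    SKETCH_FORMAT_R5G6B5,
    SKETCH_FORMAT_A8
} sketch_format_t;

static const char *
sketch_format_name (sketch_format_t format)
{
    /* No default label: if a new enumerator is added and not handled
     * here, gcc -Wall (via -Wswitch) reports the missing case. */
    switch (format)
    {
    case SKETCH_FORMAT_A8R8G8B8: return "a8r8g8b8";
    case SKETCH_FORMAT_R5G6B5:   return "r5g6b5";
    case SKETCH_FORMAT_A8:       return "a8";
    }

    /* Only reached for values outside the enum. */
    return "<unknown format>";
}
```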
Søren Sandmann Pedersen [Sat, 19 Jan 2013 13:55:27 +0000 (08:55 -0500)]
test/utils.[ch]: Add new function operator_name()
This function returns the name of the given operator, which is useful
for printing out debug information. The function is done as a switch
without a default value so that the compiler will warn if new
operators are added in the future.
The function is used in affine-test.c, scaling-test.c, and
blitters-test.c.
Søren Sandmann Pedersen [Sat, 12 Jan 2013 13:03:35 +0000 (08:03 -0500)]
README: Add guidelines on how to contribute patches
Ben Avison pointed out here:
http://lists.freedesktop.org/archives/pixman/2013-January/002485.html
that there isn't really any documentation about how to submit patches
to pixman. This patch adds some information to the README file.
v2: Incorporate some comments from Ben Avison
v3: Change gitweb URL to cgit
Matt Turner [Sat, 19 Jan 2013 00:53:32 +0000 (16:53 -0800)]
Convert INCLUDES to AM_CPPFLAGS
INCLUDES has been deprecated starting with automake 1.13. Convert all
occurrences to the recommended AM_CPPFLAGS replacement.
Matt Turner [Sat, 19 Jan 2013 00:49:00 +0000 (16:49 -0800)]
Add new demos and tests to .gitignore
Nemanja Lukic [Tue, 22 Jan 2013 02:01:05 +0000 (03:01 +0100)]
MIPS: DSPr2: Added more fast-paths:
- over_reverse_n_8888
- in_n_8_8
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
over_reverse_n_8888 = L1: 19.42 L2: 19.07 M: 15.38 ( 40.80%) HT: 13.35 VT: 13.10 R: 12.92 RT: 8.27 ( 49Kops/s)
in_n_8_8 = L1: 21.20 L2: 22.86 M: 21.42 ( 14.21%) HT: 15.97 VT: 15.69 R: 15.47 RT: 8.00 ( 48Kops/s)
Optimized:
over_reverse_n_8888 = L1: 60.09 L2: 47.87 M: 28.65 ( 76.02%) HT: 23.58 VT: 22.51 R: 21.99 RT: 12.28 ( 60Kops/s)
in_n_8_8 = L1: 89.38 L2: 86.07 M: 65.48 ( 43.44%) HT: 44.64 VT: 41.50 R: 40.77 RT: 16.94 ( 66Kops/s)
Nemanja Lukic [Tue, 22 Jan 2013 01:59:44 +0000 (02:59 +0100)]
MIPS: DSPr2: Added more fast-paths for REVERSE operation:
- out_reverse_8_0565
- out_reverse_8_8888
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
out_reverse_8_0565 = L1: 14.29 L2: 13.58 M: 12.14 ( 24.16%) HT: 9.23 VT: 9.12 R: 8.84 RT: 4.75 ( 36Kops/s)
out_reverse_8_8888 = L1: 27.46 L2: 23.24 M: 17.41 ( 57.73%) HT: 12.61 VT: 12.47 R: 11.79 RT: 5.86 ( 41Kops/s)
Optimized:
out_reverse_8_0565 = L1: 28.24 L2: 25.64 M: 20.63 ( 41.05%) HT: 16.69 VT: 16.14 R: 15.50 RT: 8.69 ( 52Kops/s)
out_reverse_8_8888 = L1: 52.78 L2: 41.44 M: 23.50 ( 77.94%) HT: 18.79 VT: 18.16 R: 16.90 RT: 9.11 ( 53Kops/s)
Søren Sandmann Pedersen [Thu, 20 Dec 2012 16:28:25 +0000 (11:28 -0500)]
pixman-filter.c: Cope with NULL returns from malloc()
v2: Don't return a pointer to uninitialized memory when the allocation
of horz and vert fails, but allocation of params doesn't.
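A sketch of the kind of handling the v2 note describes: if any of the three allocations fails, everything is released and no pointer escapes. Only the names horz, vert and params come from the message above; the struct and function are hypothetical, and size overflow checks are left out.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    int32_t *horz;
    int32_t *vert;
    int32_t *params;
} filter_buffers_t;

/* Returns 1 on success. On failure returns 0 and leaves all three
 * pointers NULL, so the caller never sees uninitialized memory. */
static int
filter_buffers_init (filter_buffers_t *b,
                     size_t n_horz, size_t n_vert, size_t n_params)
{
    b->horz = malloc (n_horz * sizeof *b->horz);
    b->vert = malloc (n_vert * sizeof *b->vert);
    b->params = malloc (n_params * sizeof *b->params);

    if (!b->horz || !b->vert || !b->params)
    {
        free (b->horz);   /* free (NULL) is a no-op */
        free (b->vert);
        free (b->params);
        b->horz = b->vert = b->params = NULL;
        return 0;
    }

    return 1;
}
```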
Søren Sandmann Pedersen [Mon, 27 Aug 2012 02:06:27 +0000 (22:06 -0400)]
Handle solid images in the noop iterator
The noop src iterator already has code to handle solid images, but
that code never actually runs currently because it is not possible for
an image to have both a format code of PIXMAN_solid and a flag of
FAST_PATH_BITS_IMAGE.
If these two were to be set at the same time, the
fast_composite_tiled_repeat() fast path would trigger for solid images
(because it triggers for PIXMAN_any formats, which includes
PIXMAN_solid), but for solid images we can usually do better than that
fast path.
So this patch removes _pixman_solid_fill_iter_init() and instead
handles such images (along with repeating 1x1 bits images without an
alpha map) in pixman-noop.c.
When a 1x1R image is involved in the general composite path, before
this patch, it would hit this code in repeat() in pixman-inlines.h:
while (*c >= size)
*c -= size;
while (*c < 0)
*c += size;
and those loops could run for a huge number of iterations (proportional
to the composite width). For such cases, the performance improvement
is really big:
./test/lowlevel-blt-bench -n add_n_8888:
Before:
add_n_8888 = L1: 3.86 L2: 3.78 M: 1.40 ( 0.06%) HT: 1.43 VT: 1.41 R: 1.41 RT: 1.38 ( 19Kops/s)
After:
add_n_8888 = L1:1236.86 L2:2468.49 M:1097.88 ( 49.04%) HT:476.49 VT:429.05 R:417.04 RT:155.12 ( 817Kops/s)
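To see why the loops quoted above are expensive for a 1x1 image, the sketch below counts their iterations. The modulo variant is shown only for comparison; the patch itself avoids this code path for such images rather than rewriting repeat(). Both function names are hypothetical.

```c
/* Same loops as quoted above, instrumented to count iterations. */
static long
repeat_by_loop (int *c, int size)
{
    long iterations = 0;

    while (*c >= size)
    {
        *c -= size;
        iterations++;
    }
    while (*c < 0)
    {
        *c += size;
        iterations++;
    }
    return iterations;
}

/* Constant-time equivalent, for comparison only. */
static void
repeat_by_modulo (int *c, int size)
{
    *c %= size;
    if (*c < 0)
        *c += size;
}
```

With size 1, a coordinate of 1000 takes 1000 loop iterations to reduce to 0, and that cost is paid again for each pixel of the scanline.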
Marko Lindqvist [Thu, 3 Jan 2013 04:38:01 +0000 (06:38 +0200)]
Fix build with automake-1.13
Automake-1.13 has removed the long-obsolete AM_CONFIG_HEADER macro (
http://lists.gnu.org/archive/html/automake/2012-12/msg00038.html )
and autoreconf errors out upon seeing it.
The attached patch replaces the obsolete AM_CONFIG_HEADER with the
proper AC_CONFIG_HEADERS.
Siarhei Siamashka [Thu, 20 Dec 2012 03:14:39 +0000 (05:14 +0200)]
Use more appropriate types and remove a magic constant
Siarhei Siamashka [Thu, 20 Dec 2012 03:00:46 +0000 (05:00 +0200)]
Define SIZE_MAX if it is not provided by the standard C headers
C++ compilers do not define SIZE_MAX. It is also not available
if the code is compiled by some C compilers:
http://lists.freedesktop.org/archives/pixman/2012-August/002196.html
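A fallback definition along these lines is the usual approach; the overflow helper is a hypothetical example of where SIZE_MAX is typically needed and is not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* size_t is unsigned, so converting -1 yields its maximum value. */
#ifndef SIZE_MAX
# define SIZE_MAX ((size_t) -1)
#endif

/* Returns nonzero if a * b would not fit in a size_t. */
static int
mul_would_overflow (size_t a, size_t b)
{
    return b != 0 && a > SIZE_MAX / b;
}
```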
Siarhei Siamashka [Sun, 16 Dec 2012 02:03:58 +0000 (04:03 +0200)]
Rename 'xor' variable to 'filler' (because 'xor' is a C++ keyword)
Søren Sandmann Pedersen [Sat, 15 Dec 2012 02:53:34 +0000 (21:53 -0500)]
float-combiner.c: Change tests for x == 0.0 to -FLT_MIN < x < FLT_MIN
pixman-float-combiner.c currently uses checks like these:
if (x == 0.0f)
...
else
... / x;
to prevent division by 0. In theory this is correct: a division-by-zero
exception is only supposed to happen when the floating point divisor is
exactly equal to a positive or negative zero.
However, in practice, the combination of x87 and gcc optimizations
causes issues. The x87 registers are 80 bits wide, which means the
initial test:
if (x == 0.0f)
may be false when x is an 80 bit floating point number, but when x is
rounded to a 32 bit single precision number, it becomes equal to
0.0. In principle, gcc should compensate for this quirk of x87, and
there are some options such as -ffloat-store, -fexcess-precision=standard,
and -std=c99 that will make it do so, but these all have a performance
cost. It is also possible to set the FPU to a mode that makes it do
all computation with single or double precision, but that would
require pixman to save the existing mode before doing anything with
floating point and restore it afterwards.
Instead, this patch side-steps the issue by replacing exact checks for
equality with zero with a new macro that checks whether the value is
between -FLT_MIN and FLT_MIN.
There is extensive reading material about this issue linked off the
infamous gcc bug 323:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
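A sketch of the range check described above. The macro follows the description (strictly between -FLT_MIN and FLT_MIN); the macro name and safe_div_sketch() with its fallback parameter are hypothetical and only show how such a check guards a division.

```c
#include <float.h>

/* True for zero and for anything smaller in magnitude than the
 * smallest normalized float, regardless of excess x87 precision. */
#define IS_ZERO_SKETCH(f) (-FLT_MIN < (f) && (f) < FLT_MIN)

static float
safe_div_sketch (float num, float den, float fallback)
{
    if (IS_ZERO_SKETCH (den))
        return fallback;

    return num / den;
}
```

Because the test no longer depends on the value being exactly 0.0, it gives the same answer whether the operand is held in an 80 bit register or has been rounded to 32 bits.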
Siarhei Siamashka [Thu, 6 Dec 2012 15:13:16 +0000 (17:13 +0200)]
ARM: make use of UQADD8 instruction even in generic C code paths
ARMv6 has UQADD8 instruction, which implements unsigned saturated
addition for 8-bit values packed in 32-bit registers. It is very useful
for UN8x4_ADD_UN8x4, UN8_rb_ADD_UN8_rb and ADD_UN8 macros (which would
otherwise need a lot of arithmetic operations to simulate this operation).
Since most of the major ARM linux distros are built for ARMv7, we are
much less dependent on runtime CPU detection and can get practical
benefits from conditional compilation here for a lot of users.
The results of cairo-perf-trace benchmark on ARM Cortex-A15 with pixman
compiled by gcc 4.7.2 and PIXMAN_DISABLE set to "arm-simd arm-neon":
Speedups
========
image firefox-talos-gfx (29938.22 0.12%) -> (27814.76 0.51%) : 1.08x speedup
image firefox-asteroids (23241.11 0.07%) -> (21795.19 0.07%) : 1.07x speedup
image firefox-canvas-alpha (174519.85 0.08%) -> (164788.64 0.20%) : 1.06x speedup
image poppler (9464.46 1.61%) -> (8991.53 0.14%) : 1.05x speedup
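For reference, this is what a portable C version of a packed saturating add has to do without UQADD8. It is a deliberately simple per-byte sketch with a hypothetical name, not the macro sequence pixman actually uses.

```c
#include <stdint.h>

/* Adds four unsigned 8 bit lanes packed in 32 bit words, clamping each
 * lane to 0xFF. UQADD8 does the same in a single instruction. */
static uint32_t
add_sat_un8x4_sketch (uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF);

        if (sum > 0xFF)
            sum = 0xFF;

        result |= sum << shift;
    }

    return result;
}
```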
Siarhei Siamashka [Mon, 3 Dec 2012 01:01:21 +0000 (03:01 +0200)]
Faster conversion from a8r8g8b8 to r5g6b5 in C code
This change reduces 3 shifts, 3 ANDs and 2 ORs (total 8 arithmetic
operations) to 3 shifts, 2 ANDs and 2 ORs (total 7 arithmetic
operations).
We get garbage in the high 16 bits of the result, which might need
to be cleared when casting to uint16_t (it would bring us back to
total 8 arithmetic operations). However, if the result of the
a8r8g8b8->r5g6b5 conversion is immediately stored to memory, no
extra instructions for clearing these garbage bits are needed.
This allows the a8r8g8b8->r5g6b5 conversion code to be compiled
into 4 instructions for ARM instead of 5 (assuming a good optimizing
compiler), which has no pipeline stalls on ARM11 as an additional
bonus.
The change in benchmark results for 'lowlevel-blt-bench src_8888_0565'
with PIXMAN_DISABLE="arm-simd arm-neon mips-dspr2 mmx sse2" and pixman
compiled by gcc-4.7.2:
MIPS 74K 480MHz : 40.44 MPix/s -> 40.13 MPix/s
ARM11 700MHz : 50.28 MPix/s -> 62.85 MPix/s
ARM Cortex-A8 1000MHz : 124.38 MPix/s -> 141.85 MPix/s
ARM Cortex-A15 1700MHz : 281.07 MPix/s -> 303.29 MPix/s
Intel Core i7 2800MHz : 515.92 MPix/s -> 531.16 MPix/s
The same trick was used in xomap (X server for Nokia N800/N810):
http://repository.maemo.org/pool/diablo/free/x/xorg-server/
xorg-server_1.3.99.0~git20070321-0osso20083801.tar.gz
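One way to reach the operation count quoted above (3 shifts, 2 ANDs, 2 ORs) is sketched below. It is written from the description and may differ in detail from the committed code; the function name is hypothetical.

```c
#include <stdint.h>

static uint16_t
convert_8888_to_0565_sketch (uint32_t s)
{
    uint32_t a, b;

    a = (s >> 3) & 0x1F001F; /* red in bits 16-20, blue in bits 0-4 */
    b = s & 0xFC00;          /* green in bits 10-15 */
    a |= a >> 5;             /* copy red to bits 11-15 */
    a |= b >> 5;             /* move green to bits 5-10 */

    /* Bits 16-20 still hold a stale copy of red; the cast (or a 16 bit
     * store) discards it. */
    return (uint16_t) a;
}
```

Red and blue share one shift and one mask, which is where the saved AND comes from.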
Siarhei Siamashka [Mon, 3 Dec 2012 00:50:20 +0000 (02:50 +0200)]
Change CONVERT_XXXX_TO_YYYY macros into inline functions
Inline functions are easier and safer to modify when the
calculations need temporary variables, and such temporary
variables will be needed soon.
Siarhei Siamashka [Mon, 3 Dec 2012 03:44:36 +0000 (05:44 +0200)]
test: add "src_0565_8888" to lowlevel-blt-bench
Søren Sandmann Pedersen [Thu, 13 Dec 2012 20:37:40 +0000 (15:37 -0500)]
pixman_composite_trapezoids(): Check for NULL return from create_bits()
A check is needed that the creation of the temporary image in
pixman_composite_trapezoids() succeeds.
Fixes crash in stress-test -s 0x313c on my system.
Søren Sandmann Pedersen [Thu, 13 Dec 2012 20:26:17 +0000 (15:26 -0500)]
pixman_composite_trapezoids: Return early if mask_format is not of TYPE_ALPHA
stress-test -s 0x17ee crashes because pixman_composite_trapezoids() is
given a mask_format of PIXMAN_c8, which causes it to create a
temporary image with that format but without a palette. This causes
crashes later.
The only mask_formats that we actually support are those of TYPE_ALPHA,
so this patch adds a return_if_fail() to ensure this.
Similarly, although currently it won't crash if given an invalid
format, alpha-only formats have always been the only thing that made
sense for the pixman_rasterize_edges() functions, so add a
return_if_fail() ensuring that the destination format is of type
PIXMAN_TYPE_ALPHA.
Søren Sandmann Pedersen [Thu, 13 Dec 2012 16:21:16 +0000 (11:21 -0500)]
Add testing of trapezoids to stress-test
The entry points add_trapezoids(), rasterize_trapezoid() and
composite_trapezoid() are exercised with random trapezoids.
This uncovers crashes with stress-test seeds 0x17ee and 0x313c.
Søren Sandmann Pedersen [Sat, 8 Dec 2012 11:06:34 +0000 (06:06 -0500)]
demos/radial-test: Add checkerboard to display the alpha channel
Søren Sandmann Pedersen [Sat, 8 Dec 2012 11:46:38 +0000 (06:46 -0500)]
demos/conical-test: Use the draw_checkerboard() utility function
Instead of having its own copy.
Søren Sandmann Pedersen [Sat, 8 Dec 2012 11:44:24 +0000 (06:44 -0500)]
test/utils.[ch]: Add utility function to draw a checkerboard
This is useful in demo programs to display the alpha channel.
Søren Sandmann Pedersen [Sat, 8 Dec 2012 00:51:19 +0000 (19:51 -0500)]
radial: When comparing t to mindr, use >= rather than >
Radial gradients are conceptually rendered as a sequence of circles
generated by linearly extrapolating from the two circles given by the
gradient specification. Any circles in that sequence that would end up
with a negative radius are not drawn, a condition that is enforced by
checking that t * dr is bigger than mindr:
if (t * dr > mindr)
However, it is legitimate for a circle to have radius exactly 0, so
the test should use >= rather than >.
This gets rid of the dots in demos/radial-test except for when the c2
circle has radius 0 and a repeat mode of either NONE or NORMAL. Both
those dots correspond to a t value of 1.0, which is outside the
defined interval of [0.0, 1.0) and therefore subject to the repeat
algorithm. As a result, in the NONE case, a value of 1.0 turns into
transparent black. In the NORMAL case, 1.0 wraps around and becomes
0.0 which is red, unlike 0.99 which is blue.
Cc: ranma42@gmail.com
Søren Sandmann Pedersen [Sat, 8 Dec 2012 00:43:53 +0000 (19:43 -0500)]
demos/radial-test: Add zero-radius circles to demonstrate rendering bugs
Add two new gradient columns, one where the start circle has radius
0 and one where the end circle has radius 0. All the new gradients
except for one are rendered with a bright dot in the middle. In most
but not all cases this is incorrect.
Cc: ranma42@gmail.com
Siarhei Siamashka [Sat, 8 Dec 2012 13:16:51 +0000 (15:16 +0200)]
test: Workaround unaligned MOVDQA bug (gcc.gnu.org/PR55614)
Just use SSE2 intrinsics to do unaligned memory accesses as
a workaround for this gcc bug related to vector extensions.
Siarhei Siamashka [Fri, 30 Nov 2012 10:00:47 +0000 (12:00 +0200)]
Improve performance of combine_over_u
The generic C over_u combiner can be a lot faster with the
addition of special shortcuts for 0xFF and 0x00 alpha/mask
values. This is already implemented in C and SSE2 fast paths.
Profiling the run of cairo-perf-trace benchmarks with PIXMAN_DISABLE
environment variable set to "fast mmx sse2" on Intel Core i7:
=== before ===
37.32% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_over_u
21.37% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_no_repeat_8888
13.51% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_affine_none_a8r8g8b8
2.96% cairo-perf-trac libpixman-1.so.0.29.1 [.] radial_compute_color
2.74% cairo-perf-trac libpixman-1.so.0.29.1 [.] fetch_scanline_a8
2.71% cairo-perf-trac libpixman-1.so.0.29.1 [.] fetch_scanline_x8r8g8b8
2.17% cairo-perf-trac libpixman-1.so.0.29.1 [.] _pixman_gradient_walker_pixel
1.86% cairo-perf-trac libcairo.so.2.11200.0 [.] _cairo_tor_scan_converter_generate
1.57% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_affine_pad_a8r8g8b8
0.97% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_in_reverse_u
0.96% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_over_ca
=== after ===
28.79% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_no_repeat_8888
18.44% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_affine_none_a8r8g8b8
15.54% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_over_u
3.94% cairo-perf-trac libpixman-1.so.0.29.1 [.] radial_compute_color
3.69% cairo-perf-trac libpixman-1.so.0.29.1 [.] fetch_scanline_a8
3.69% cairo-perf-trac libpixman-1.so.0.29.1 [.] fetch_scanline_x8r8g8b8
2.94% cairo-perf-trac libpixman-1.so.0.29.1 [.] _pixman_gradient_walker_pixel
2.52% cairo-perf-trac libcairo.so.2.11200.0 [.] _cairo_tor_scan_converter_generate
2.08% cairo-perf-trac libpixman-1.so.0.29.1 [.] bits_image_fetch_bilinear_affine_pad_a8r8g8b8
1.31% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_in_reverse_u
1.29% cairo-perf-trac libpixman-1.so.0.29.1 [.] combine_over_ca
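The shortcut idea can be sketched as follows. This is only a minimal illustration for premultiplied a8r8g8b8 pixels, not pixman's actual combiner; the helper names mul_un8(), add_sat_un8() and over_pixel() are made up for this example.

```c
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit addition. */
static uint8_t add_sat_un8(uint8_t x, uint8_t y)
{
    uint32_t t = (uint32_t)x + y;
    return (uint8_t)(t > 0xFF ? 0xFF : t);
}

/* OVER for one premultiplied pixel: result = src + dst * (1 - src_alpha). */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src >> 24);
    uint8_t ia;
    uint32_t result = 0;
    int shift;

    if (a == 0xFF)
        return src;   /* opaque source simply replaces the destination */
    if (src == 0)
        return dst;   /* fully transparent source leaves it untouched */

    /* General case: blend each of the four channels. */
    ia = (uint8_t)(0xFF - a);
    for (shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        result |= (uint32_t)add_sat_un8(s, mul_un8(d, ia)) << shift;
    }
    return result;
}
```

Both shortcuts return exactly what the general path would compute, so they only save work and do not change results.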
Søren Sandmann Pedersen [Mon, 26 Nov 2012 19:27:34 +0000 (14:27 -0500)]
Add fast paths for separable convolution
Similar to the fast paths for general affine access, add some fast
paths for the separable filter for all combinations of formats
x8r8g8b8, a8r8g8b8, r5g6b5, a8 with the four repeat modes.
It is easy to see the speedup in the demos/scale program.
Søren Sandmann Pedersen [Tue, 4 Dec 2012 18:17:49 +0000 (13:17 -0500)]
Add demo program for conical gradients
This new test is derived from radial-test.c and displays conical
gradients at various angles.
It also demonstrates how PIXMAN_REPEAT_NORMAL is supposed to work when
used with a gradient specification where the first stop is not at 0.0:
In this case the gradient is supposed to have a smooth transition from
the last stop back to the first stop with no sharp transitions. It
also shows that the repeat mode is not ignored for conical gradients
as one might be tempted to think.
Søren Sandmann Pedersen [Mon, 12 Nov 2012 17:27:39 +0000 (12:27 -0500)]
Add demos/zone_plate.png
The zone plate image is a useful test case for image scalers because
it contains all representable frequencies, so any imperfection in
resampling filters will show up as Moire patterns.
This version is symmetric around the midpoint of the image, so since
rotating it is supposed to be a noop, it can also be used to verify
that the resampling filters don't shift the image.
V2: Run the file through OptiPNG to cut the size in half, as suggested
by Siarhei.
Søren Sandmann Pedersen [Thu, 22 Nov 2012 15:18:26 +0000 (10:18 -0500)]
demos: Add new demo program, "scale"
This program allows interactively scaling and rotating images using
various filters and repeat modes. It uses
pixman_filter_create_separable_convolution() to generate the filters.
Søren Sandmann Pedersen [Thu, 22 Nov 2012 15:16:16 +0000 (10:16 -0500)]
demos/gtk-utils.[ch]: Add pixman_image_from_file()
This function uses GdkPixbuf to load various common formats such as
.png and .jpg into a pixman image.
Søren Sandmann Pedersen [Thu, 22 Nov 2012 15:15:06 +0000 (10:15 -0500)]
Add new pixman_filter_create_separable_convolution() API
This new API is a helper function to create filter parameters suitable
for use with PIXMAN_FILTER_SEPARABLE_CONVOLUTION.
For each dimension, given a scale factor, reconstruction and sample
filter kernels, and a subsampling resolution, this function will
compute a convolution of the two kernels scaled appropriately, then
sample that convolution and return the resulting vectors in a form
suitable for being used as parameters to
PIXMAN_FILTER_SEPARABLE_CONVOLUTION.
The filter kernels offered are the following:
- IMPULSE: Dirac delta function, i.e., point sampling
- BOX: Box filter
- LINEAR: Linear filter, a.k.a. "Tent" filter
- CUBIC: Cubic filter, currently Mitchell-Netravali
- GAUSSIAN: Gaussian function, sigma=1, support=3*sigma
- LANCZOS2: Two-lobed Lanczos filter
- LANCZOS3: Three-lobed Lanczos filter
- LANCZOS3_STRETCHED: Three-lobed Lanczos filter, stretched by 4/3.0.
This is the "Nice" filter from Dirty Pixels by
Jim Blinn.
The intended way to use this function is to extract scaling factors
from the transformation and then pass those to this function to get a
filter suitable for compositing with that transformation. The filter
kernels can be chosen according to quality and performance tradeoffs.
To get equivalent quality to GdkPixbuf for downscalings, use BOX for
both reconstruction and sampling. For upscalings, use LINEAR for
reconstruction and IMPULSE for sampling (though note that for
upscaling in both X and Y directions, simply using
PIXMAN_FILTER_BILINEAR will likely be a better choice).
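For the upscaling case mentioned above (LINEAR reconstruction, IMPULSE sampling) the convolution is just the tent kernel, so the sampling step can be sketched in a few lines. This is a standalone illustration, not the code in pixman: the function name, the two-tap width and the choice of (phase + 0.5) / n_phases as the sample position are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_ONE 65536  /* 1.0 in 16.16 fixed point */

/* Fill weights[2] with the two tent-filter taps for one subsample phase.
 * Assumes the sample position for `phase` lies (phase + 0.5) / n_phases
 * pixels past the left tap, where n_phases = 1 << subsample_bits. */
static void tent_phase_weights(int subsample_bits, int phase, int32_t weights[2])
{
    int n_phases;
    double frac;

    assert(subsample_bits >= 0 && subsample_bits < 16);
    n_phases = 1 << subsample_bits;
    assert(phase >= 0 && phase < n_phases);

    frac = (phase + 0.5) / n_phases;

    /* Round the first tap, then derive the second one so that the
     * pair sums to exactly 1.0 in fixed point. */
    weights[0] = (int32_t)((1.0 - frac) * FIXED_ONE + 0.5);
    weights[1] = FIXED_ONE - weights[0];
}
```

Deriving the last tap from the others, rather than rounding it independently, is one simple way to keep each sampled vector normalized.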
Søren Sandmann Pedersen [Thu, 22 Nov 2012 15:17:56 +0000 (10:17 -0500)]
rounding.txt: Describe how SEPARABLE_CONVOLUTION filter works
Add some notes on how to compute the convolution matrices to be used
with the SEPARABLE_CONVOLUTION filter.
Søren Sandmann Pedersen [Thu, 22 Nov 2012 15:14:06 +0000 (10:14 -0500)]
Add new filter PIXMAN_FILTER_SEPARABLE_CONVOLUTION
This filter is a new way to use a convolution matrix for filtering. In
contrast to the existing CONVOLUTION filter, this new variant is
different in two respects:
- It is subsampled: Instead of just one convolution matrix, this
filter chooses between a number of matrices based on the subpixel
sample location, allowing the convolution kernel to be sampled at a
higher resolution.
- It is separable: Each matrix is specified as the tensor product of
two vectors. This has the advantages that many fewer values have to
be stored, and that the filtering can be done separately in the x
and y dimensions (although the initial implementation doesn't
actually do that).
The motivation for this new filter is to improve image downsampling
quality. Currently, the best pixman can do is the regular convolution
filter which is limited to coarsely sampled convolution kernels.
With this new feature, any separable filter can be used at any desired
resolution.
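The separability property can be illustrated with a small standalone sketch. The function names and the fixed 3x3 size are assumptions for the example only. Because each matrix entry is x[i] * y[j], only W + H values need to be stored instead of W * H, and filtering rows first and then columns gives the same result as applying the full matrix.

```c
#include <stdint.h>

#define W 3
#define H 3

/* Full 2D convolution at one sample: matrix entry (j, i) is y[j] * x[i].
 * The pixel patch p is stored row by row. */
static int32_t convolve_2d(const int32_t *x, const int32_t *y, const uint8_t *p)
{
    int32_t sum = 0;
    int i, j;

    for (j = 0; j < H; j++)
        for (i = 0; i < W; i++)
            sum += y[j] * x[i] * p[j * W + i];
    return sum;
}

/* The same result computed as a horizontal pass followed by a vertical one. */
static int32_t convolve_separable(const int32_t *x, const int32_t *y, const uint8_t *p)
{
    int32_t sum = 0;
    int i, j;

    for (j = 0; j < H; j++) {
        int32_t row = 0;
        for (i = 0; i < W; i++)
            row += x[i] * p[j * W + i];
        sum += y[j] * row;
    }
    return sum;
}
```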
Benjamin Gilbert [Sun, 2 Dec 2012 04:55:31 +0000 (23:55 -0500)]
Fix thread safety on mingw-w64 and clang
After finding a working TLS storage class specifier, configure was
continuing to test other candidates. This caused it to prefer
__declspec(thread) over __thread. However, __declspec(thread) is
ignored with a warning by mingw-w64 [1] and silently ignored by clang [2].
The resulting binary behaved as if PIXMAN_NO_TLS was defined.
Bug introduced by a069da6c.
[1] https://bugs.freedesktop.org/show_bug.cgi?id=57591
[2] http://lists.freedesktop.org/archives/pixman/2012-October/002320.html
Siarhei Siamashka [Sun, 25 Nov 2012 00:59:25 +0000 (02:59 +0200)]
test: Get rid of the obsolete 'prng_rand_N' and 'prng_rand_u32'
They are the same as 'prng_rand_n' and 'prng_rand'
Siarhei Siamashka [Sun, 25 Nov 2012 00:50:35 +0000 (02:50 +0200)]
test: Switch to the new PRNG instead of old LCG
Wallclock time for running pixman "make check" (compile time not included):
----------------------------+----------------+-----------------------------+
                            | old PRNG (LCG) | new PRNG (Bob Jenkins)      |
Processor type              +----------------+------------+----------------+
                            | gcc 4.5        | gcc 4.5    | gcc 4.7 (simd) |
----------------------------+----------------+------------+----------------+
quad Intel Core i7 @2.8GHz  | 0m49.494s      | 0m43.722s  | 0m37.560s      |
dual ARM Cortex-A15 @1.7GHz | 5m8.465s       | 4m37.375s  | 3m45.819s      |
IBM Cell PPU @3.2GHz        | 23m0.821s      | 20m38.316s | 16m37.513s     |
----------------------------+----------------+------------+----------------+
But some tests got a particularly large boost. For example benchmarking and
profiling blitters-test on Core i7:
=== before ===
$ time ./blitters-test
real 0m10.907s
user 0m55.650s
sys 0m0.000s
70.45% blitters-test blitters-test [.] create_random_image
15.81% blitters-test blitters-test [.] compute_crc32_for_image_internal
2.26% blitters-test blitters-test [.] _pixman_implementation_lookup_composite
1.07% blitters-test libc-2.15.so [.] _int_free
0.89% blitters-test libc-2.15.so [.] malloc_consolidate
0.87% blitters-test libc-2.15.so [.] _int_malloc
0.75% blitters-test blitters-test [.] combine_conjoint_general_u
0.61% blitters-test blitters-test [.] combine_disjoint_general_u
0.40% blitters-test blitters-test [.] test_composite
0.31% blitters-test libc-2.15.so [.] _int_memalign
0.31% blitters-test blitters-test [.] _pixman_bits_image_setup_accessors
0.28% blitters-test libc-2.15.so [.] malloc
=== after ===
$ time ./blitters-test
real 0m3.655s
user 0m20.550s
sys 0m0.000s
41.77% blitters-test.n blitters-test.new [.] compute_crc32_for_image_internal
15.77% blitters-test.n blitters-test.new [.] prng_randmemset_r
6.15% blitters-test.n blitters-test.new [.] _pixman_implementation_lookup_composite
3.09% blitters-test.n libc-2.15.so [.] _int_free
2.68% blitters-test.n libc-2.15.so [.] malloc_consolidate
2.39% blitters-test.n libc-2.15.so [.] _int_malloc
2.27% blitters-test.n blitters-test.new [.] create_random_image
2.22% blitters-test.n blitters-test.new [.] combine_conjoint_general_u
1.52% blitters-test.n blitters-test.new [.] combine_disjoint_general_u
1.40% blitters-test.n blitters-test.new [.] test_composite
1.02% blitters-test.n blitters-test.new [.] prng_srand_r
1.00% blitters-test.n blitters-test.new [.] _pixman_image_validate
0.96% blitters-test.n blitters-test.new [.] _pixman_bits_image_setup_accessors
0.90% blitters-test.n libc-2.15.so [.] malloc
Siarhei Siamashka [Sat, 24 Nov 2012 21:22:48 +0000 (23:22 +0200)]
test: Search/replace 'lcg_*' -> 'prng_*'
The 'lcg' prefix is going to be misleading if we replace
the PRNG algorithm.
Siarhei Siamashka [Sat, 24 Nov 2012 17:43:41 +0000 (19:43 +0200)]
test: Added a better PRNG (pseudorandom number generator)
This adds a fast SIMD-optimized variant of a small noncryptographic
PRNG originally developed by Bob Jenkins:
http://www.burtleburtle.net/bob/rand/smallprng.html
The generated pseudorandom data is good enough to pass "Big Crush"
tests from TestU01 (http://en.wikipedia.org/wiki/TestU01).
SIMD code uses http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
which is a GCC specific extension. There is also a slower alternative
code path, which should work with any C compiler.
The performance of filling a buffer with random data:
Intel Core i7 @2.8GHz (SSE2) : ~5.9 GB/s
ARM Cortex-A15 @1.7GHz (NEON) : ~2.2 GB/s
IBM Cell PPU @3.2GHz (Altivec) : ~1.7 GB/s
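For reference, the scalar form of the generator described on the page linked above looks roughly like this. It is a plain C sketch, not the SIMD code added by this commit, and the type and function names are chosen for the example.

```c
#include <stdint.h>

typedef struct { uint32_t a, b, c, d; } smallprng_t;

#define ROT32(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

/* Advance the state by one step and return the next 32-bit value. */
static uint32_t smallprng_next(smallprng_t *x)
{
    uint32_t e = x->a - ROT32(x->b, 27);
    x->a = x->b ^ ROT32(x->c, 17);
    x->b = x->c + x->d;
    x->c = x->d + e;
    x->d = e + x->a;
    return x->d;
}

/* Seed the state and discard the first 20 outputs to mix it. */
static void smallprng_seed(smallprng_t *x, uint32_t seed)
{
    int i;

    x->a = 0xf1ea5eed;
    x->b = x->c = x->d = seed;
    for (i = 0; i < 20; i++)
        (void)smallprng_next(x);
}
```

The state is just four 32-bit words updated with adds, xors and rotates, which is presumably what makes it a good fit for running several independent streams in SIMD lanes.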
Siarhei Siamashka [Fri, 23 Nov 2012 07:07:23 +0000 (09:07 +0200)]
test: Change is_little_endian() into inline function
Also dropped redundant volatile keyword because any object
can be accessed via char* pointer without breaking aliasing
rules. The compilers are able to optimize this function to either
constant 0 or 1.
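A minimal sketch of such a function, assuming a 32-bit probe value (the exact type and spelling in the test suite may differ):

```c
#include <stdint.h>

/* Returns 1 on little-endian hosts and 0 otherwise. */
static inline int is_little_endian(void)
{
    uint32_t probe = 1;

    /* Any object may be read through an unsigned char pointer without
     * violating the aliasing rules, so no volatile is needed and the
     * compiler is free to fold this to a constant. */
    return *(const unsigned char *)&probe == 1;
}
```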
Søren Sandmann Pedersen [Wed, 21 Nov 2012 16:43:31 +0000 (11:43 -0500)]
Add text file rounding.txt describing how rounding works
It is not entirely obvious how pixman gets from "location in the
source image" to "pixel value stored in the destination". This file
describes how the filters work, and in particular how positions are
rounded to samples.
Søren Sandmann Pedersen [Wed, 21 Nov 2012 04:28:43 +0000 (23:28 -0500)]
Convolution filter: round color values instead of truncating
The pixel computed by the convolution filter should be rounded off,
not truncated. As a simple example consider a convolution matrix
consisting of five times 0x3333. If all five input pixels are
0xff, then the result of truncating will be
(5 * 0x3333 * 255) >> 16 = 254
But the real value of the computation is (5 * 0x3333 / 65536.0) * 255
= 254.9961, so the error is almost 1. If the user isn't very careful
about normalizing the convolution kernel so that it sums to one in
fixed point, such error might cause solid images to change color, or
opaque images to become translucent.
The fix is simply to round instead of truncate.
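The example above can be reproduced with a short standalone sketch. The function names are made up, and the code assumes a non-negative 16.16 kernel applied to a single 8-bit channel.

```c
#include <stdint.h>

/* Weighted sum of n pixels with 16.16 fixed-point weights. */
static int64_t weighted_sum(const int32_t *kernel, const uint8_t *pixels, int n)
{
    int64_t sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += (int64_t)kernel[i] * pixels[i];
    return sum;
}

/* Old behavior: drop the 16 fractional bits. */
static uint8_t convolve_truncate(const int32_t *kernel, const uint8_t *pixels, int n)
{
    int64_t v = weighted_sum(kernel, pixels, n) >> 16;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* New behavior: add one half (0x8000) before shifting. */
static uint8_t convolve_round(const int32_t *kernel, const uint8_t *pixels, int n)
{
    int64_t v = (weighted_sum(kernel, pixels, n) + 0x8000) >> 16;
    return (uint8_t)(v > 255 ? 255 : v);
}
```

With five weights of 0x3333 and five 0xff pixels, the truncating version yields 254 and the rounding version yields 255.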
Søren Sandmann Pedersen [Tue, 20 Nov 2012 08:23:51 +0000 (03:23 -0500)]
Round fixed-point multiplication
After two fixed-point numbers are multiplied, the result is shifted
into place, but up until now pixman has simply discarded the low-order
bits instead of rounding to the closest number.
Fix that by adding 0x8000 (or 0x2 in one place) before shifting and
update the test checksums to match.
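As an illustration of the difference, here is a sketch of a 16.16 multiplication with and without rounding. The type and function names are assumptions for this example, and the operands are assumed non-negative, since right-shifting a negative value is implementation-defined in C.

```c
#include <stdint.h>

typedef int32_t fixed_16_16_t;

/* Old behavior: the low-order 16 bits of the product are discarded. */
static fixed_16_16_t fixed_mul_truncate(fixed_16_16_t a, fixed_16_16_t b)
{
    return (fixed_16_16_t)(((int64_t)a * b) >> 16);
}

/* New behavior: add half of the last kept bit before shifting. */
static fixed_16_16_t fixed_mul_round(fixed_16_16_t a, fixed_16_16_t b)
{
    return (fixed_16_16_t)(((int64_t)a * b + 0x8000) >> 16);
}
```

For example, 0.5 (0x8000) times the smallest positive value (0x1) truncates to 0 but rounds to 0x1.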
Stefan Weil [Tue, 13 Nov 2012 18:44:44 +0000 (19:44 +0100)]
test: Fix compiler warnings caused by unused code
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Stefan Weil [Tue, 13 Nov 2012 18:38:32 +0000 (19:38 +0100)]
pixman: Use uintptr_t in type casts from pointer to integral value
These modifications fix lots of compiler warnings for systems where
sizeof(unsigned long) != sizeof(void *).
This is especially true for MinGW-w64 (64 bit Windows).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
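A typical place where such a cast appears is an alignment check. The helper below is a hypothetical example, not code from this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is aligned to `alignment` (a power of two). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* uintptr_t can hold any object pointer, whereas unsigned long is
     * only 32 bits wide on 64-bit Windows. */
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```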
Stefan Weil [Tue, 13 Nov 2012 18:44:15 +0000 (19:44 +0100)]
Always use xmmintrin.h for 64 bit Windows
MinGW-w64 uses the GNU compiler and does not define _MSC_VER.
Nevertheless, it provides xmmintrin.h and must be handled
here like the MS compiler. Otherwise compilation fails due to
conflicting declarations.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Nemanja Lukic [Mon, 12 Nov 2012 21:48:51 +0000 (22:48 +0100)]
MIPS: DSPr2: Added several nearest neighbor fast paths with a8 mask:
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench -n
Referent (before):
over_8888_8_0565 = L1: 9.62 L2: 8.85 M: 7.40 ( 39.27%) HT: 5.67 VT: 5.61 R: 5.45 RT: 2.98 ( 22Kops/s)
over_0565_8_0565 = L1: 7.90 L2: 7.49 M: 6.72 ( 26.75%) HT: 5.24 VT: 5.20 R: 5.06 RT: 2.90 ( 22Kops/s)
Optimized:
over_8888_8_0565 = L1: 18.51 L2: 16.82 M: 12.13 ( 64.43%) HT: 10.06 VT: 9.88 R: 9.54 RT: 5.63 ( 31Kops/s)
over_0565_8_0565 = L1: 14.82 L2: 13.94 M: 11.34 ( 45.20%) HT: 9.45 VT: 9.35 R: 9.03 RT: 5.50 ( 31Kops/s)
Nemanja Lukic [Mon, 12 Nov 2012 21:48:53 +0000 (22:48 +0100)]
MIPS: DSPr2: Added more fast-paths for OVER operation:
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
over_n_0565 = L1: 14.48 L2: 21.36 M: 17.57 ( 23.30%) HT: 6.95 VT: 6.44 R: 6.39 RT: 2.16 ( 22Kops/s)
over_n_8888 = L1: 92.60 L2: 86.13 M: 24.41 ( 64.74%) HT: 8.94 VT: 8.06 R: 8.00 RT: 2.53 ( 25Kops/s)
Optimized:
over_n_0565 = L1: 27.65 L2: 189.22 M: 58.19 ( 77.12%) HT: 52.80 VT: 49.88 R: 47.53 RT: 23.67 ( 72Kops/s)
over_n_8888 = L1: 235.99 L2: 230.86 M: 29.09 ( 77.11%) HT: 27.95 VT: 27.24 R: 26.58 RT: 18.10 ( 67Kops/s)
Nemanja Lukic [Mon, 12 Nov 2012 21:48:52 +0000 (22:48 +0100)]
MIPS: DSPr2: Added more fast-paths for SRC operation:
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
src_n_8_8888 = L1: 13.79 L2: 22.47 M: 17.55 ( 58.28%) HT: 6.95 VT: 6.46 R: 6.34 RT: 2.07 ( 20Kops/s)
src_n_8_8 = L1: 20.22 L2: 20.21 M: 18.20 ( 24.17%) HT: 6.65 VT: 6.22 R: 6.11 RT: 2.03 ( 20Kops/s)
Optimized:
src_n_8_8888 = L1: 58.31 L2: 53.34 M: 25.69 ( 85.29%) HT: 22.55 VT: 21.44 R: 19.91 RT: 10.34 ( 48Kops/s)
src_n_8_8 = L1: 102.60 L2: 89.43 M: 65.01 ( 86.32%) HT: 37.87 VT: 37.02 R: 32.43 RT: 12.41 ( 51Kops/s)
Søren Sandmann Pedersen [Sun, 11 Nov 2012 19:05:54 +0000 (14:05 -0500)]
Allow src and dst to be identical in pixman_f_transform_invert()
It is useful to be able to invert a matrix in place, but currently
pixman_f_transform_invert() will produce wrong results if you pass the
same matrix as both source and destination.
Fix that by inverting into a temporary matrix and then copying that to
the destination.
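The approach can be sketched with a plain 3x3 matrix of doubles. The mat3_t type and mat3_invert() are stand-ins for this example and not the pixman types.

```c
typedef struct { double m[3][3]; } mat3_t;

/* Inverts *src into *dst and returns 1, or returns 0 if *src is
 * singular. dst and src may point to the same matrix. */
static int mat3_invert(mat3_t *dst, const mat3_t *src)
{
    mat3_t tmp;
    double det = 0.0;
    int i, j;

    /* Determinant by cofactor expansion along the first row. */
    for (j = 0; j < 3; j++) {
        int c = (j + 1) % 3, d = (j + 2) % 3;
        det += src->m[0][j] *
               (src->m[1][c] * src->m[2][d] - src->m[1][d] * src->m[2][c]);
    }
    if (det == 0.0)
        return 0;

    /* Write into a temporary so that src is still intact while the
     * remaining entries are computed. */
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {
            int a = (i + 1) % 3, b = (i + 2) % 3;
            int c = (j + 1) % 3, d = (j + 2) % 3;

            /* Cofactor (i, j) lands at position (j, i) of the inverse. */
            tmp.m[j][i] = (src->m[a][c] * src->m[b][d] -
                           src->m[a][d] * src->m[b][c]) / det;
        }
    }

    *dst = tmp;
    return 1;
}
```

Without the temporary, the first few entries written to dst would overwrite inputs that later cofactors still need.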
Søren Sandmann Pedersen [Thu, 8 Nov 2012 08:11:51 +0000 (03:11 -0500)]
pixman.h: Add typedefs for pixman_f_transform and pixman_f_vector
Joshua Root [Fri, 9 Nov 2012 03:39:14 +0000 (14:39 +1100)]
Fix undeclared variable use and sysctlbyname error handling on ppc
Fixes bug 56889.
Søren Sandmann Pedersen [Wed, 31 Oct 2012 17:14:07 +0000 (13:14 -0400)]
pixman_image_composite: Reduce opaque masks to NULL
When the mask is known to be opaque, we might as well reduce it to
NULL to take advantage of the various fast paths that operate on NULL
masks.
Søren Sandmann Pedersen [Wed, 7 Nov 2012 18:45:09 +0000 (13:45 -0500)]
Post-release version bump to 0.29.1
Søren Sandmann Pedersen [Wed, 7 Nov 2012 18:40:34 +0000 (13:40 -0500)]
Pre-release version bump to 0.28.0
Søren Sandmann Pedersen [Thu, 25 Oct 2012 14:42:26 +0000 (10:42 -0400)]
Post-release version bump to 0.27.5
Søren Sandmann Pedersen [Thu, 25 Oct 2012 14:35:27 +0000 (10:35 -0400)]
Pre-release version bump to 0.27.4
Nemanja Lukic [Sun, 14 Oct 2012 09:58:52 +0000 (11:58 +0200)]
MIPS: DSPr2: Added more fast-paths for ADD operation: - add_8888_8888_8888 - add_8_8 - add_8888_8888
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
add_8888_8888_8888 = L1: 17.55 L2: 13.35 M: 8.13 ( 93.95%) HT: 6.60 VT: 6.64 R: 6.45 RT: 3.47 ( 26Kops/s)
add_8_8 = L1: 86.07 L2: 84.89 M: 62.36 ( 90.11%) HT: 36.36 VT: 34.74 R: 29.56 RT: 11.56 ( 52Kops/s)
add_8888_8888 = L1: 95.59 L2: 73.05 M: 17.62 (101.84%) HT: 15.46 VT: 15.01 R: 13.94 RT: 6.71 ( 42Kops/s)
Optimized:
add_8888_8888_8888 = L1: 41.52 L2: 33.21 M: 11.97 (138.45%) HT: 10.47 VT: 10.19 R: 9.42 RT: 4.86 ( 32Kops/s)
add_8_8 = L1: 135.06 L2: 104.82 M: 57.13 ( 82.58%) HT: 34.79 VT: 36.60 R: 28.28 RT: 10.54 ( 51Kops/s)
add_8888_8888 = L1: 176.36 L2: 67.82 M: 17.48 (101.06%) HT: 15.16 VT: 14.62 R: 13.88 RT: 8.05 ( 45Kops/s)
Nemanja Lukic [Sun, 14 Oct 2012 09:58:51 +0000 (11:58 +0200)]
MIPS: DSPr2: Added more fast-paths for ADD operation: - add_0565_8_0565 - add_8888_8_8888 - add_8888_n_8888
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
add_0565_8_0565 = L1: 8.89 L2: 8.37 M: 7.35 ( 29.22%) HT: 5.90 VT: 5.85 R: 5.67 RT: 3.31 ( 26Kops/s)
add_8888_8_8888 = L1: 17.22 L2: 14.17 M: 9.89 ( 65.56%) HT: 7.57 VT: 7.50 R: 7.36 RT: 4.10 ( 30Kops/s)
add_8888_n_8888 = L1: 17.79 L2: 14.87 M: 10.35 ( 54.89%) HT: 5.19 VT: 4.93 R: 4.92 RT: 1.90 ( 19Kops/s)
Optimized:
add_0565_8_0565 = L1: 21.72 L2: 20.01 M: 14.96 ( 59.54%) HT: 12.03 VT: 11.81 R: 11.26 RT: 6.33 ( 37Kops/s)
add_8888_8_8888 = L1: 47.42 L2: 38.64 M: 15.90 (105.48%) HT: 13.34 VT: 13.03 R: 11.84 RT: 6.63 ( 38Kops/s)
add_8888_n_8888 = L1: 54.83 L2: 42.66 M: 17.36 ( 92.11%) HT: 15.20 VT: 14.82 R: 13.66 RT: 7.83 ( 41Kops/s)
Nemanja Lukic [Sun, 14 Oct 2012 09:58:50 +0000 (11:58 +0200)]
MIPS: DSPr2: Added fast-paths for ADD operation: - add_n_8_8 - add_n_8_8888 - add_8_8_8
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
add_n_8_8 = L1: 41.37 L2: 37.83 M: 30.38 ( 60.45%) HT: 23.70 VT: 22.85 R: 21.51 RT: 10.32 ( 45Kops/s)
add_n_8_8888 = L1: 16.01 L2: 14.46 M: 11.64 ( 46.32%) HT: 5.50 VT: 5.18 R: 5.06 RT: 1.89 ( 18Kops/s)
add_8_8_8 = L1: 13.26 L2: 12.47 M: 11.16 ( 29.61%) HT: 8.09 VT: 8.04 R: 7.68 RT: 3.90 ( 29Kops/s)
Optimized:
add_n_8_8 = L1: 96.03 L2: 79.37 M: 51.89 (103.31%) HT: 32.59 VT: 31.29 R: 28.52 RT: 11.08 ( 46Kops/s)
add_n_8_8888 = L1: 53.61 L2: 46.92 M: 23.78 ( 94.70%) HT: 19.06 VT: 18.64 R: 17.30 RT: 9.15 ( 43Kops/s)
add_8_8_8 = L1: 89.65 L2: 66.82 M: 37.10 ( 98.48%) HT: 22.10 VT: 21.74 R: 20.12 RT: 8.12 ( 41Kops/s)
Siarhei Siamashka [Thu, 18 Oct 2012 22:59:16 +0000 (01:59 +0300)]
Workaround for FTBFS with gcc 4.6 (gcc.gnu.org/PR54965)
GCC 4.6 has problems with force_inline, so just use normal inline instead.
Fixes: https://bugs.freedesktop.org/show_bug.cgi?id=55630
Søren Sandmann Pedersen [Fri, 12 Oct 2012 22:34:33 +0000 (18:34 -0400)]
pixman_composite_trapezoids(): don't clip to extents for some operators
pixman_composite_trapezoids() is supposed to composite across the
entire destination, but it actually only composites across the extent
of the trapezoids. For operators such as ADD or OVER this doesn't
matter since a zero source has no effect on the destination. But for
operators such as SRC or IN, it does matter.
So for such operators where a zero source has an effect, don't clip to
the trap extents.
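Why the operator matters can be seen by feeding a zero source through the usual Porter-Duff formulas. The sketch below works on a single premultiplied channel in floating point; the op_t enum and both functions are illustrative only, and probing with one destination value is a demonstration rather than a proof.

```c
typedef enum { OP_SRC, OP_OVER, OP_ADD, OP_IN } op_t;

/* One premultiplied channel. s/sa are the source value and alpha,
 * d/da the destination value and alpha, all in [0, 1]. */
static float composite(op_t op, float s, float sa, float d, float da)
{
    switch (op) {
    case OP_SRC:  return s;
    case OP_OVER: return s + d * (1.0f - sa);
    case OP_ADD:  return (s + d > 1.0f) ? 1.0f : s + d;
    case OP_IN:   return s * da;
    }
    return d;
}

/* Nonzero if a zero source changes the destination, i.e. the operator
 * has to be applied outside the trapezoid extents as well. */
static int zero_source_affects_dest(op_t op)
{
    const float d = 0.5f, da = 1.0f;  /* arbitrary nonzero destination */

    return composite(op, 0.0f, 0.0f, d, da) != d;
}
```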
Søren Sandmann Pedersen [Fri, 12 Oct 2012 22:29:56 +0000 (18:29 -0400)]
pixman_composite_trapezoids(): Factor out extents computation
The computation of the extents rectangle is moved to its own
function.
Søren Sandmann Pedersen [Fri, 12 Oct 2012 22:07:29 +0000 (18:07 -0400)]
Add new pixman_image_create_bits_no_clear() API
When the pixman_image_create_bits() function is given NULL for bits, it
will allocate a new buffer and initialize it to zero. However, in some
cases, only a small region of the image is actually used; in that case
it is wasteful to touch all of the memory.
The new pixman_image_create_bits_no_clear() works exactly like
_create_bits() except that it doesn't initialize any newly allocated
memory.
Benny Siegert [Sun, 14 Oct 2012 14:28:48 +0000 (16:28 +0200)]
configure.ac: PIXMAN_LINK_WITH_ENV fix
(fixes bug #52101)
On MirBSD, the compiler produces a (harmless) warning when the compiler
is called without the standard CFLAGS:
foo.c:0: note: someone does not honour COPTS correctly, passed 0 times
However, PIXMAN_LINK_WITH_ENV considers _any_ output on stderr as an
error, even if the exit status of the compiler is 0. Furthermore, it
resets CFLAGS and LDFLAGS at the start. On MirBSD, this will lead to a
warning in each test, making all such tests fail. In particular, the
pthread_setspecific test fails, thus pixman is compiled without thread
support. This leads to compile errors later on, or at least it did when
I tried this on pkgsrc. Re-adding the saved CFLAGS, LDFLAGS and LIBS
before the test makes it work.
The second hunk inverts the order of the pthread flag checks. On BSD
systems (this is true at least on OpenBSD and MirBSD), both -lpthread
and -pthread work but the latter is "preferred", whatever this means.
Siarhei Siamashka [Fri, 28 Sep 2012 23:29:22 +0000 (02:29 +0300)]
Add missing force_inline to in() function used for C fast paths
Siarhei Siamashka [Sun, 8 Jul 2012 20:10:00 +0000 (23:10 +0300)]
MIPS: skip runtime detection for DSPr2 if -mdspr2 option is in CFLAGS
This provides a way to enable MIPS DSP ASE optimizations if running
under qemu-user (where /proc/cpuinfo contains information about the
host processor instead of the emulated one). Can be used for running
pixman test suite in qemu-user when having no access to real MIPS
hardware.
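The idea can be sketched as below, assuming GCC's predefined __mips_dsp and __mips_dsp_rev macros are what -mdspr2 sets. The function name and the runtime_detected parameter are hypothetical.

```c
/* Returns nonzero if the DSPr2 code paths may be used. runtime_detected
 * is the result of a runtime probe such as parsing /proc/cpuinfo. */
static int can_use_dspr2(int runtime_detected)
{
#if defined(__mips_dsp) && defined(__mips_dsp_rev) && (__mips_dsp_rev >= 2)
    /* Everything is already compiled for DSPr2, so the binary cannot
     * run without it and the runtime probe can be skipped. */
    (void)runtime_detected;
    return 1;
#else
    return runtime_detected;
#endif
}
```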
Søren Sandmann Pedersen [Thu, 11 Oct 2012 08:04:04 +0000 (04:04 -0400)]
region: Remove overlap argument from pixman_op()
This is used to compute whether the regions in question overlap, but
nothing makes use of this information, so it can be removed.
Søren Sandmann Pedersen [Thu, 11 Oct 2012 08:07:00 +0000 (04:07 -0400)]
region: Formatting fix
The while part of a do/while loop was formatted as if it were a while
loop with an empty body. Probably some indent tool misinterpreted the
code at some point.
Søren Sandmann Pedersen [Sun, 7 Oct 2012 21:58:32 +0000 (17:58 -0400)]
Only regard images as pixbufs if they have identity transformations
In order for a src/mask pair to be considered a pixbuf, they have to
have identical transformations, but we don't check for that. Since the
only fast paths we have for pixbufs require identity transformations,
it suffices to check that both source and mask are
untransformed.
This is also the reason that this bug can't be triggered by any test
code - if the source and mask had different transformations, we would
consider them a pixbuf, but then wouldn't take the fast path because
at least one of the transformations would be different from the
identity.
Søren Sandmann Pedersen [Thu, 4 Oct 2012 16:41:08 +0000 (12:41 -0400)]
Remove BUILT_SOURCES
pixman-combine32.[ch] were the only built sources, so BUILT_SOURCES
can now be removed.
Søren Sandmann Pedersen [Sun, 23 Sep 2012 07:52:34 +0000 (03:52 -0400)]
Speed up pixman_expand_to_float()
GCC doesn't move the divisions out of the loop, so do it manually by
looking up the four (1.0f / mask) values in a table. Table lookups are
used under the theory that one L2 hit plus three L1 hits is preferable
to four floating point divisions.
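A simplified sketch of the table approach follows. The names, the 8-bit limit per channel and the a/r/g/b packing from the most significant bits down are assumptions for the example, not the pixman code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* multipliers[n] == 1 / (2^n - 1) maps an n-bit channel to [0, 1].
 * Entry 0 is unused. */
static const float multipliers[9] = {
    0.0f,
    1.0f / ((1 << 1) - 1), 1.0f / ((1 << 2) - 1),
    1.0f / ((1 << 3) - 1), 1.0f / ((1 << 4) - 1),
    1.0f / ((1 << 5) - 1), 1.0f / ((1 << 6) - 1),
    1.0f / ((1 << 7) - 1), 1.0f / ((1 << 8) - 1)
};

typedef struct { float a, r, g, b; } argb_float_t;

static void expand_to_float(argb_float_t *dst, const uint32_t *src, size_t n,
                            int a_bits, int r_bits, int g_bits, int b_bits)
{
    const int g_shift = b_bits;
    const int r_shift = g_shift + g_bits;
    const int a_shift = r_shift + r_bits;
    const uint32_t a_mask = (1u << a_bits) - 1, r_mask = (1u << r_bits) - 1;
    const uint32_t g_mask = (1u << g_bits) - 1, b_mask = (1u << b_bits) - 1;
    float a_mul, r_mul, g_mul, b_mul;
    size_t i;

    assert(a_bits >= 0 && a_bits <= 8 && r_bits >= 0 && r_bits <= 8);
    assert(g_bits >= 0 && g_bits <= 8 && b_bits >= 0 && b_bits <= 8);

    /* Four table lookups, done once; the loop itself only multiplies. */
    a_mul = multipliers[a_bits];
    r_mul = multipliers[r_bits];
    g_mul = multipliers[g_bits];
    b_mul = multipliers[b_bits];

    for (i = 0; i < n; i++) {
        uint32_t p = src[i];

        /* Formats without an alpha channel are treated as opaque. */
        dst[i].a = a_bits ? (float)((p >> a_shift) & a_mask) * a_mul : 1.0f;
        dst[i].r = (float)((p >> r_shift) & r_mask) * r_mul;
        dst[i].g = (float)((p >> g_shift) & g_mask) * g_mul;
        dst[i].b = (float)(p & b_mask) * b_mul;
    }
}
```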
Søren Sandmann Pedersen [Fri, 21 Sep 2012 22:36:16 +0000 (18:36 -0400)]
Don't auto-generate pixman-combine32.[ch] anymore
Since pixman-combine64.[ch] are not used anymore, there is no point
generating these files from pixman-combine.[ch].template.
Also get rid of dependency on perl in configure.ac.