Søren Sandmann Pedersen [Thu, 6 Jun 2013 20:32:59 +0000 (16:32 -0400)]
Add empty SSSE3 implementation
This commit adds a new, empty SSSE3 implementation and the associated
build system support.
configure.ac: detect whether the compiler understands SSSE3
intrinsics and set up the required CFLAGS
Makefile.am: Add libpixman-ssse3.la
pixman-x86.c: Add X86_SSSE3 feature flag and detect it in
detect_cpu_features().
pixman-ssse3.c: New file with an empty SSSE3 implementation
V2: Remove SSSE3_LDFLAGS since it isn't necessary unless Solaris
support is added.
Søren Sandmann Pedersen [Wed, 28 Aug 2013 19:36:13 +0000 (15:36 -0400)]
general: Ensure that iter buffers are aligned to 16 bytes
At the moment iter buffers are only guaranteed to be aligned to a 4
byte boundary. SIMD implementations benefit from the buffers being
aligned to 16 bytes, so ensure this is the case.
V2:
- Use uintptr_t instead of unsigned long
- allocate 3 * SCANLINE_BUFFER_LENGTH bytes on the stack rather than just
SCANLINE_BUFFER_LENGTH
- use sizeof (stack_scanline_buffer) instead of SCANLINE_BUFFER_LENGTH
to determine overflow
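As a rough illustration of the alignment step, the sketch below rounds a pointer up to the next 16-byte boundary using uintptr_t, and checks whether a request still fits in the buffer once the alignment padding is accounted for. The helper names are hypothetical and not taken from pixman-general.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round a pointer up to the next 16-byte boundary. */
static uint8_t *
align_to_16 (uint8_t *p)
{
    uintptr_t addr = (uintptr_t) p;

    return (uint8_t *) ((addr + 15) & ~(uintptr_t) 15);
}

/*
 * Hypothetical helper: returns nonzero if 'needed' bytes starting at the
 * aligned pointer still fit inside the 'size' bytes that begin at 'buffer'.
 */
static int
fits_after_alignment (uint8_t *buffer, size_t size, size_t needed)
{
    uint8_t *aligned = align_to_16 (buffer);
    size_t   lost = (size_t) (aligned - buffer);

    assert (lost < 16);

    return lost <= size && needed <= size - lost;
}
```

Up to 15 bytes can be lost to alignment, which is why an overflow check has to look at the real size of the buffer rather than at the nominal scanline length.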
Siarhei Siamashka [Tue, 3 Sep 2013 01:39:54 +0000 (04:39 +0300)]
sse2: faster bilinear scaling (pack 4 pixels to write with MOVDQA)
The loops are already unrolled, so it was just a matter of packing
4 pixels into a single XMM register and doing aligned 128-bit
writes to memory via MOVDQA instructions for the SRC compositing
operator fast path. For the other fast paths, this XMM register
is also directly routed to further processing instead of doing
extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
which results in a clear performance improvement.
There are also some other (less important) tweaks:
1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
index for addressing memory. The problem is that 'pixman_fixed_t'
is a 32-bit data type and it has to be extended to 64-bit
offsets, which needs extra instructions on 64-bit systems.
2. Allow recalculating the horizontal interpolation weights only
once per 4 pixels by treating the XMM register as four pairs
of 16-bit values. Each of these 16-bit/16-bit pairs can be
replicated to fill the whole 128-bit register by using PSHUFD
instructions. So we get "3 PADDW/PSRLW + 4 PSHUFD" instructions
per 4 pixels instead of "12 PADDW/PSRLW" per 4 pixels
(or "3 PADDW/PSRLW" per each pixel).
Now a good question is whether replacing "9 PADDW/PSRLW" with
"4 PSHUFD" is a favourable exchange. As it turns out, PSHUFD
instructions are very fast on new Intel processors (including
Atoms), but are rather slow on the first generation of Core2
(Merom) and on the other processors from that time or older.
A good instruction latency/throughput table, covering all the
relevant processors, can be found at:
http://www.agner.org/optimize/instruction_tables.pdf
Enabling this optimization is controlled by the PSHUFD_IS_FAST
define in "pixman-sse2.c".
3. One use of PSHUFD instruction (_mm_shuffle_epi32 intrinsic) in
the older code has been also replaced by PUNPCKLQDQ equivalent
(_mm_unpacklo_epi64 intrinsic) in PSHUFD_IS_FAST=0 configuration.
The PUNPCKLQDQ instruction is usually faster on older processors,
but has some side effects (instead of fully overwriting the
destination register like PSHUFD does, it retains half of the
original value, which may inhibit some compiler optimizations).
Benchmarks with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.8.1 on
x86-64 system and default optimizations. The results are in MPix/s:
====== Intel Core2 T7300 (2GHz) ======
old: src_8888_8888 = L1: 128.69 L2: 125.07 M:124.86
over_8888_8888 = L1: 83.19 L2: 81.73 M: 80.63
over_8888_n_8888 = L1: 79.56 L2: 78.61 M: 77.85
over_8888_8_8888 = L1: 77.15 L2: 75.79 M: 74.63
new (PSHUFD_IS_FAST=0): src_8888_8888 = L1: 168.67 L2: 163.26 M:162.44
over_8888_8888 = L1: 102.91 L2: 100.43 M: 99.01
over_8888_n_8888 = L1: 97.40 L2: 95.64 M: 94.24
over_8888_8_8888 = L1: 98.04 L2: 95.83 M: 94.33
new (PSHUFD_IS_FAST=1): src_8888_8888 = L1: 154.67 L2: 149.16 M:148.48
over_8888_8888 = L1: 95.97 L2: 93.90 M: 91.85
over_8888_n_8888 = L1: 93.18 L2: 91.47 M: 90.15
over_8888_8_8888 = L1: 95.33 L2: 93.32 M: 91.42
====== Intel Core i7 860 (2.8GHz) ======
old: src_8888_8888 = L1: 323.48 L2: 318.86 M:314.81
over_8888_8888 = L1: 187.38 L2: 186.74 M:182.46
new (PSHUFD_IS_FAST=0): src_8888_8888 = L1: 373.06 L2: 370.94 M:368.32
over_8888_8888 = L1: 217.28 L2: 215.57 M:211.32
new (PSHUFD_IS_FAST=1): src_8888_8888 = L1: 401.98 L2: 397.65 M:395.61
over_8888_8888 = L1: 218.89 L2: 217.56 M:213.48
The most interesting benchmark is "src_8888_8888" (because this code can
be reused for a generic non-separable SSE2 bilinear fetch iterator).
The results show that PSHUFD instructions are bad for Intel Core2 T7300
(Merom core) and good for Intel Core i7 860 (Nehalem core). Both of these
processors support SSSE3 instructions though, so they are not the primary
targets for SSE2 code. But without having any other more relevant hardware
to test, PSHUFD_IS_FAST=0 seems to be a reasonable default for SSE2 code
and old processors (until the runtime CPU features detection becomes
clever enough to recognize different microarchitectures).
(Rebased on top of patch that removes support for 8-bit bilinear
filtering -ssp)
Siarhei Siamashka [Thu, 5 Sep 2013 05:07:52 +0000 (08:07 +0300)]
test: safeguard the scaling-bench test against COW
The calloc call from pixman_image_create_bits may still
rely on copy-on-write (http://en.wikipedia.org/wiki/Copy-on-write).
Explicitly initializing the destination image results in
a more predictable behaviour.
V2:
- allocate 16 bytes aligned buffer with aligned stride instead
of delegating this to pixman_image_create_bits
- use memset for the allocated buffer instead of pixman solid fill
- repeat tests 3 times and select best results in order to filter
out even more measurement noise
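A minimal sketch of the allocation described in the V2 notes, assuming a 32 bpp image: the stride is rounded up to a multiple of 16 bytes, the buffer start is aligned by hand, and every byte is written with memset so that the pages are actually backed by memory before timing starts. The function names and the 0xCC fill value are illustrative assumptions, not the code in scaling-bench.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: bytes per row for 32 bpp pixels, rounded up to 16. */
static size_t
aligned_stride (int width)
{
    assert (width > 0);

    return ((size_t) width * 4 + 15) & ~(size_t) 15;
}

/*
 * Allocate a 16-byte aligned, explicitly initialized pixel buffer.
 * '*to_free' receives the pointer that must later be passed to free().
 */
static uint32_t *
alloc_touched_bits (int width, int height, void **to_free)
{
    size_t   size = aligned_stride (width) * (size_t) height;
    uint8_t *raw = malloc (size + 15);
    uint8_t *bits;

    if (!raw)
        return NULL;

    bits = (uint8_t *) (((uintptr_t) raw + 15) & ~(uintptr_t) 15);

    /* Write every byte so no page is left as an untouched COW mapping. */
    memset (bits, 0xCC, size);

    *to_free = raw;
    return (uint32_t *) bits;
}
```

The returned pointer and aligned_stride (width) could then be handed to pixman_image_create_bits, so that the library does not allocate the bits itself.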
Søren Sandmann Pedersen [Thu, 5 Sep 2013 02:32:15 +0000 (22:32 -0400)]
Drop support for 8-bit precision in bilinear filtering
The default has been 7-bit for a while now, and the quality
improvement with 8-bit precision is not enough to justify keeping the
code around as a compile-time option.
Søren Sandmann Pedersen [Sun, 1 Sep 2013 02:59:53 +0000 (22:59 -0400)]
Make the first argument to scanline fetchers have type bits_image_t *
Scanline fetchers haven't been used for images other than bits for a
long time, so by making the type reflect this fact, a bit of casting
can be saved in various places.
Matt Turner [Tue, 30 Jul 2013 20:22:29 +0000 (13:22 -0700)]
iwmmxt: Disallow if gcc version is < 4.8.
Later versions of gcc-4.7.x are capable of generating iwMMXt
instructions properly, but gcc-4.8 contains better support and other
fixes, including iwMMXt in conjunction with hardfp. The existing 4.5
requirement was based on attempts to have OLPC use a patched gcc to
build pixman. Let's just require gcc-4.8.
Søren Sandmann Pedersen [Wed, 28 Aug 2013 04:38:22 +0000 (00:38 -0400)]
fast_bilinear_cover_init: Don't install a finalizer on the error path
No memory is allocated in the error case, so a finalizer is not
necessary, and will cause problems if the data pointer is not
initialized to NULL.
Søren Sandmann Pedersen [Thu, 24 May 2012 06:49:05 +0000 (02:49 -0400)]
Add an iterator that can fetch bilinearly scaled images
This new iterator works in a separable way; that is, for a destination
scanline, it scales the two involved source scanlines and then caches
them so that they can be reused for the next destination scanlines.
There are two versions of the code, one that uses 64 bit arithmetic,
and one that uses 32 bit arithmetic only. The latter version is
used on 32 bit systems, where it is expected to be faster.
This scheme saves a substantial amount of arithmetic for larger
scalings; the per-pixel times for various configurations as reported
by scaling-bench are graphed here:
http://people.freedesktop.org/~sandmann/separable.v2/v2.png
The "sse2" graph is the current default on x86, "mmx" is with sse2
disabled, "old c" is with sse2 and mmx disabled. The "new 32" and "new
64" graphs show times for the new code. As the graphs show, the 64 bit
version of the new code beats the "old c" for all scaling ratios.
The data was taken on a Sandy Bridge Core i3-2350M CPU @ 2.0 GHz
running in 64 bit mode.
The data used to generate the graph is available in this directory:
http://people.freedesktop.org/~sandmann/separable.v2/
There is also a Gnumeric spreadsheet v2.gnumeric containing the
per-pixel values and the graph.
V2:
- Add error message in the OOM/bad matrix case
- Save some shifts by storing the cached scanlines in AGBR order
- Special cased version that uses 32 bit arithmetic when sizeof(long) <= 4
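The separable scheme can be sketched as follows. This is a deliberately simplified model, assuming a single 8-bit channel, fixed image sizes and 8 fractional bits; all names are hypothetical and the real iterator works on a8r8g8b8 pixels with different arithmetic. It shows the key idea: each source row is scaled horizontally once, kept in one of two cache lines, and reused for every destination row that needs it.

```c
#include <assert.h>
#include <stdint.h>

#define SRC_W 4
#define SRC_H 4
#define DST_W 8

typedef struct
{
    int     y;              /* source row held in this line, -1 if empty */
    uint8_t pix[DST_W];     /* that row after horizontal scaling */
} cached_line_t;

typedef struct
{
    const uint8_t *src;     /* SRC_W x SRC_H, one byte per pixel */
    cached_line_t  lines[2];
    int            n_hscaled;   /* number of horizontal passes run so far */
} scaler_t;

static void
scaler_init (scaler_t *s, const uint8_t *src)
{
    s->src = src;
    s->lines[0].y = s->lines[1].y = -1;
    s->n_hscaled = 0;
}

/* Horizontal pass: scale source row 'y' to DST_W pixels into 'line'. */
static void
hscale_row (scaler_t *s, cached_line_t *line, int y)
{
    const uint8_t *row = s->src + y * SRC_W;
    int x;

    for (x = 0; x < DST_W; x++)
    {
        int pos = x * (SRC_W - 1) * 256 / (DST_W - 1);
        int x0 = pos >> 8, w = pos & 0xff;
        int x1 = x0 + 1 < SRC_W ? x0 + 1 : x0;

        line->pix[x] = (uint8_t) ((row[x0] * (256 - w) + row[x1] * w) >> 8);
    }
    line->y = y;
    s->n_hscaled++;
}

/* Vertical pass: 'ypos' is the source y coordinate with 8 fractional bits. */
static void
fetch_dest_row (scaler_t *s, int ypos, uint8_t *out)
{
    int y0 = ypos >> 8, wy = ypos & 0xff;
    int y1 = y0 + 1 < SRC_H ? y0 + 1 : y0;
    cached_line_t *l0 = &s->lines[y0 & 1];
    cached_line_t *l1 = &s->lines[y1 & 1];
    int x;

    assert (y0 >= 0 && y0 < SRC_H);

    /* Only rescale a source row if it is not already cached. */
    if (l0->y != y0)
        hscale_row (s, l0, y0);
    if (l1->y != y1)
        hscale_row (s, l1, y1);

    for (x = 0; x < DST_W; x++)
        out[x] = (uint8_t) ((l0->pix[x] * (256 - wy) + l1->pix[x] * wy) >> 8);
}
```

When the image is scaled up by a factor of four vertically, four consecutive destination rows share the same pair of source rows, so only two horizontal passes are needed for all four of them.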
Søren Sandmann Pedersen [Fri, 25 May 2012 15:38:41 +0000 (11:38 -0400)]
Add support for iter finalizers
Iterators may sometimes need to allocate auxiliary memory. In order to
be able to free this memory, optional iterator finalizers are
required.
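A small sketch of what an optional finalizer could look like, using made-up type and function names rather than pixman's own. The initializer only installs the finalizer once the auxiliary memory has really been allocated, so the code that tears down the iterator can call it unconditionally when it is set.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct example_iter example_iter_t;

struct example_iter
{
    void  *data;                          /* auxiliary memory, if any */
    void (*fini) (example_iter_t *iter);  /* optional finalizer, may be NULL */
};

static int n_freed = 0;   /* only here so that the sketch can be checked */

static void
example_fini (example_iter_t *iter)
{
    free (iter->data);
    iter->data = NULL;
    n_freed++;
}

/* Returns 1 on success; on failure no finalizer is installed. */
static int
example_iter_init (example_iter_t *iter, size_t n_bytes)
{
    assert (iter != NULL);

    iter->data = NULL;
    iter->fini = NULL;

    if (n_bytes == 0)       /* stand-in for a setup error */
        return 0;

    iter->data = malloc (n_bytes);
    if (!iter->data)
        return 0;

    iter->fini = example_fini;
    return 1;
}

/* What the code driving the iterator would do when it is finished. */
static void
example_iter_done (example_iter_t *iter)
{
    if (iter->fini)
        iter->fini (iter);
}
```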
Søren Sandmann Pedersen [Wed, 22 May 2013 22:48:08 +0000 (18:48 -0400)]
test/scaling-bench.c: New benchmark for bilinear scaling
This new benchmark scales a 320 x 240 a8r8g8b8 test image by all
ratios from 0.1, 0.2, ... up to 10.0 and reports the time it took
to do each of the scaling operations, and the time spent per
destination pixel.
The times reported for the scaling operations are given in
milliseconds, the times-per-pixel are in nanoseconds.
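The relationship between the two reported numbers is simple arithmetic. A hedged sketch, with a hypothetical helper name and assuming the destination size is known in pixels:

```c
#include <assert.h>

/* Hypothetical helper: nanoseconds per destination pixel for one run. */
static double
ns_per_pixel (double total_ms, int dest_width, int dest_height)
{
    assert (dest_width > 0 && dest_height > 0);

    /* 1 ms = 1e6 ns */
    return total_ms * 1e6 / ((double) dest_width * dest_height);
}
```

For instance, a 640 x 480 destination (scale factor 2.0) that takes 7.68 ms works out to about 25 ns per pixel.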
V2: Format output better
Søren Sandmann Pedersen [Wed, 7 Aug 2013 14:21:20 +0000 (10:21 -0400)]
RELEASING: Add note about changing the topic of the #cairo IRC channel
Siarhei Siamashka [Sat, 27 Jul 2013 16:25:32 +0000 (19:25 +0300)]
test: fix matrix-test on big endian systems
Andrea Canciani [Tue, 17 Jul 2012 14:14:20 +0000 (16:14 +0200)]
test: Fix build on MSVC
The MSVC compiler is very strict about variable declarations after
statements.
Move all the declarations of each block before any statement in the
same block to fix multiple instances of:
alpha-loop.c(XX) : error C2275: 'pixman_image_t' : illegal use of this
type as an expression
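For illustration, a function laid out in the way such a compiler accepts: every declaration sits at the start of its block, before the first statement. The function itself is a made-up example and not code from the test suite.

```c
#include <assert.h>

/* Hypothetical function written so that a C89-only compiler accepts it. */
static int
sum_of_squares (const int *values, int n)
{
    /* All declarations come first in the block ... */
    int total = 0;
    int i;

    /* ... and statements only start afterwards. */
    assert (n >= 0);

    for (i = 0; i < n; i++)
    {
        int v = values[i];   /* fine: first thing in the inner block */

        total += v * v;
    }

    return total;
}
```

Declaring 'total' after the assert() call instead would be valid C99, but is the kind of construct that triggers error C2275 in the output quoted above.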
Alexander Troosh [Tue, 11 Jun 2013 11:55:34 +0000 (15:55 +0400)]
Require GTK+ version >= 2.16
I got the following build failure on my system:
lcc: "scale.c", line 374: warning: function "gtk_scale_add_mark" declared
implicitly [-Wimplicit-function-declaration]
gtk_scale_add_mark (GTK_SCALE (widget), 0.0, GTK_POS_LEFT, NULL);
^
CCLD scale
scale.o: In function `app_new':
(.text+0x23e4): undefined reference to `gtk_scale_add_mark'
scale.o: In function `app_new':
(.text+0x250c): undefined reference to `gtk_scale_add_mark'
scale.o: In function `app_new':
(.text+0x2634): undefined reference to `gtk_scale_add_mark'
make[2]: *** [scale] Error 1
make[2]: Target `all' not remade because of errors.
$ pkg-config --modversion gtk+-2.0
2.12.1
demos/scale.c calls gtk_scale_add_mark(), which is only available in
GTK+ 2.16 and later. Either old GTK+ versions need to be supported (by
rewriting scale.c) or a newer GTK+ version has to be required, like this:
Matthieu Herrb [Sat, 8 Jun 2013 16:07:20 +0000 (18:07 +0200)]
configure.ac: Don't use '+=' since it's not POSIX
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Matthieu Herrb <matthieu.herrb@laas.fr>
Søren Sandmann Pedersen [Wed, 22 May 2013 13:01:36 +0000 (09:01 -0400)]
Consolidate all the iter_init_bits_stride functions
The SSE2, MMX, and fast implementations all have a copy of the
function iter_init_bits_stride that computes an image buffer and
stride.
Move that function to pixman-utils.c and share it among all the
implementations.
Søren Sandmann Pedersen [Tue, 21 May 2013 10:40:59 +0000 (06:40 -0400)]
Delete the old src/dest_iter_init() functions
Now that we are using the new _pixman_implementation_iter_init(), the
old _src/_dest_iter_init() functions are no longer needed, so they can
be deleted, and the corresponding fields in pixman_implementation_t
can be removed.
Søren Sandmann Pedersen [Tue, 21 May 2013 12:15:41 +0000 (08:15 -0400)]
Add _pixman_implementation_iter_init() and use instead of _src/_dest_init()
A new field, 'iter_info', is added to the implementation struct, and
all the implementations store a pointer to their iterator tables in
it. A new function, _pixman_implementation_iter_init(), is then added
that searches those tables, and the new function is called in
pixman-general.c and pixman-image.c instead of the old
_pixman_implementation_src_init() and _pixman_implementation_dest_init().
Søren Sandmann Pedersen [Wed, 22 May 2013 12:05:55 +0000 (08:05 -0400)]
general: Store the iter initializer in a one-entry pixman_iter_info_t table
In preparation for sharing all iterator initialization code from all
the implementations, move the general implementation to use a table of
pixman_iter_info_t.
The existing src_iter_init and dest_iter_init functions are
consolidated into one general_iter_init() function that checks the
iter_flags for whether it is dealing with a source or destination
iterator.
Unlike in the other implementations, the general_iter_init() function
stores its own get_scanline() and write_back() functions in the
iterator, so it relies on the initializer being called after
get_scanline and write_back being copied from the struct to the
iterator.
Søren Sandmann Pedersen [Tue, 21 May 2013 07:59:06 +0000 (03:59 -0400)]
fast: Replace the fetcher_info_t table with a pixman_iter_info_t table
Similar to the SSE2 and MMX patches, this commit replaces a table of
fetcher_info_t with a table of pixman_iter_info_t, and similar to the
noop patch, both fast_src_iter_init() and fast_dest_iter_init() are
now doing exactly the same thing, so their code can be shared in a new
function called fast_iter_init_common().
Søren Sandmann Pedersen [Tue, 21 May 2013 07:32:32 +0000 (03:32 -0400)]
mmx: Replace the fetcher_info_t table with a pixman_iter_info_t table
Similar to the SSE2 commit, information about the iterators is stored
in a table of pixman_iter_info_t.
Søren Sandmann Pedersen [Tue, 21 May 2013 07:29:09 +0000 (03:29 -0400)]
sse2: Replace the fetcher_info_t table with a pixman_iter_info_t table
Similar to the changes to noop, put all the iterators into a table of
pixman_iter_info_t and then do a generic search of that table during
iterator initialization.
Søren Sandmann Pedersen [Tue, 21 May 2013 12:14:44 +0000 (08:14 -0400)]
noop: Keep information about iterators in an array of pixman_iter_info_t
Instead of having a nest of if statements, store the information about
iterators in a table of a new struct type, pixman_iter_info_t, and
then walk that table when initializing iterators.
The new struct contains a format, a set of image flags, and a set of
iter flags, plus a pixman_iter_get_scanline_t, a
pixman_iter_write_back_t, and a new function type
pixman_iter_initializer_t.
If the iterator matches an entry, it is first initialized with the
given get_scanline and write_back functions, and then the provided
iter_initializer (if present) is run. Running the iter_initializer
after setting get_scanline and write_back allows the initializer to
override those fields if it wishes.
The table contains both source and destination iterators,
distinguished based on the recently-added ITER_SRC and ITER_DEST;
similarly, wide iterators are recognized with the ITER_WIDE
flag. Having both source and destination iterators in the table means
the noop_src_iter_init() and noop_dest_iter_init() functions become
identical, so this patch factors out their code in a new function
noop_iter_init_common() that both call.
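The table walk described above might look roughly like the sketch below. All types, flag values and function names are simplified stand-ins prefixed with ex_ or EX_; they are assumptions for illustration and not pixman's actual definitions. An entry matches when its format fits and all of its required image and iter flags are present. The entry's get_scanline and write_back are copied first, and the initializer runs afterwards so that it can override them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum
{
    EX_ITER_NARROW = 1 << 0,
    EX_ITER_WIDE   = 1 << 1,
    EX_ITER_SRC    = 1 << 2,
    EX_ITER_DEST   = 1 << 3
};

#define EX_FORMAT_ANY  0u
#define EX_FORMAT_NULL 0xffffffffu   /* sentinel marking the end of a table */

typedef struct ex_iter ex_iter_t;
typedef uint32_t *(*ex_get_scanline_t) (ex_iter_t *iter);
typedef void (*ex_write_back_t) (ex_iter_t *iter);
typedef void (*ex_initializer_t) (ex_iter_t *iter);

struct ex_iter
{
    uint32_t          format, image_flags, iter_flags;
    ex_get_scanline_t get_scanline;
    ex_write_back_t   write_back;
};

typedef struct
{
    uint32_t          format, image_flags, iter_flags;
    ex_initializer_t  initializer;
    ex_get_scanline_t get_scanline;
    ex_write_back_t   write_back;
} ex_iter_info_t;

/* Returns 1 if a table entry matched and the iterator was set up. */
static int
ex_iter_init (ex_iter_t *iter, const ex_iter_info_t *table)
{
    const ex_iter_info_t *info;

    assert (iter != NULL && table != NULL);

    for (info = table; info->format != EX_FORMAT_NULL; info++)
    {
        if ((info->format == EX_FORMAT_ANY || info->format == iter->format) &&
            (info->image_flags & iter->image_flags) == info->image_flags &&
            (info->iter_flags & iter->iter_flags) == info->iter_flags)
        {
            iter->get_scanline = info->get_scanline;
            iter->write_back = info->write_back;

            /* Runs last, so it may override the two fields above. */
            if (info->initializer)
                info->initializer (iter);
            return 1;
        }
    }
    return 0;
}

/* Example entries: one source iterator, one destination iterator. */
static uint32_t ex_pixel = 0xff000000;
static uint32_t *ex_get_null (ex_iter_t *iter) { (void) iter; return NULL; }
static uint32_t *ex_get_solid (ex_iter_t *iter) { (void) iter; return &ex_pixel; }
static void ex_init_solid (ex_iter_t *iter) { iter->get_scanline = ex_get_solid; }

static const ex_iter_info_t ex_table[] =
{
    { EX_FORMAT_ANY, 0, EX_ITER_NARROW | EX_ITER_SRC, ex_init_solid, ex_get_null, NULL },
    { EX_FORMAT_ANY, 0, EX_ITER_WIDE | EX_ITER_DEST, NULL, ex_get_null, NULL },
    { EX_FORMAT_NULL, 0, 0, NULL, NULL, NULL }
};
```

Because source and destination entries live in the same table and are told apart purely by flags, a single lookup function is enough for both kinds of iterator.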
The following patches in this series will change all the
implementations to use an iterator table, and then move the table
search code to pixman-implementation.c.
Søren Sandmann Pedersen [Mon, 20 May 2013 13:44:05 +0000 (09:44 -0400)]
Always set the FAST_PATH_NO_ALPHA_MAP flag for non-BITS images
We only support alpha maps for BITS images, so it's always safe to ignore
the alpha map for non-BITS images. This makes it possible to get rid of
the check for SOLID images since it will now be subsumed by the check
for FAST_PATH_NO_ALPHA_MAP.
Opaque masks are reduced to NULL images in pixman.c, and those can
also safely be treated as not having an alpha map, so set the
FAST_PATH_NO_ALPHA_MAP bit for those as well.
Søren Sandmann Pedersen [Thu, 6 Dec 2012 07:25:35 +0000 (02:25 -0500)]
Add ITER_WIDE iter flag
This will be useful for putting iterators into tables where they can
be looked up by iterator flags. Without this flag, wide iterators can
only be recognized by the absence of ITER_NARROW, which makes testing
for a match difficult.
Søren Sandmann Pedersen [Mon, 20 May 2013 13:04:22 +0000 (09:04 -0400)]
Add ITER_SRC and ITER_DEST iter flags
These indicate whether the iterator is for a source or a destination
image. Note iterator initializers are allowed to rely on one of these
being set, so they can't be left out the way it's generally harmless
(aside from potential performance degradation) to leave out a
particular fast path flag.
Søren Sandmann Pedersen [Sat, 18 May 2013 15:39:34 +0000 (11:39 -0400)]
Make use of image flag in noop iterators
Similar to c2230fe2aff, simply check against SAMPLES_COVER_CLIP_NEAREST
instead of comparing all the x/y/width/height parameters.
Markos Chandras [Wed, 15 May 2013 16:51:20 +0000 (09:51 -0700)]
Use AC_LINK_IFELSE to check if the Loongson MMI code can link
The Loongson code is compiled with -march=loongson2f to enable the MMI
instructions, but binutils refuses to link object code compiled with
different -march settings, leading to link failures later in the
compile. This avoids that problem by checking if we can link code
compiled for Loongson.
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Matt Turner [Tue, 14 May 2013 19:40:50 +0000 (12:40 -0700)]
mmx: Document implementation(s) of pix_multiply().
I look at that function and can never remember what it does or how it
manages to do it.
ingmar@irsoft.de [Sat, 11 May 2013 09:55:04 +0000 (11:55 +0200)]
Fix broken build when HAVE_CONFIG_H is undefined, e.g. on Win32.
Build fix for platforms without a generated config.h, for example Win32.
Søren Sandmann Pedersen [Wed, 8 May 2013 23:40:12 +0000 (19:40 -0400)]
Post-release version bump to 0.31.1
Søren Sandmann Pedersen [Wed, 8 May 2013 23:31:22 +0000 (19:31 -0400)]
Pre-release version bump to 0.30.0
Søren Sandmann Pedersen [Tue, 30 Apr 2013 22:57:43 +0000 (18:57 -0400)]
Post-release version bump to 0.29.5
Søren Sandmann Pedersen [Tue, 30 Apr 2013 22:50:04 +0000 (18:50 -0400)]
Pre-release version bump to 0.29.4
Søren Sandmann Pedersen [Sat, 27 Apr 2013 08:27:39 +0000 (04:27 -0400)]
pixman/refactor: Delete this file
Essentially all of it is obsolete by now.
Nemanja Lukic [Mon, 15 Apr 2013 17:33:02 +0000 (19:33 +0200)]
MIPS: DSPr2: Added rpixbuf fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
rpixbuf = L1: 14.63 L2: 13.55 M: 9.91 ( 79.53%) HT: 8.47 VT: 8.32 R: 8.17 RT: 4.90 ( 33Kops/s)
Optimized:
rpixbuf = L1: 45.69 L2: 37.30 M: 17.24 (138.31%) HT: 15.66 VT: 14.88 R: 13.97 RT: 8.38 ( 44Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:33:01 +0000 (19:33 +0200)]
MIPS: DSPr2: Added pixbuf fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
pixbuf = L1: 18.18 L2: 16.47 M: 13.36 (107.27%) HT: 10.16 VT: 10.07 R: 9.84 RT: 5.54 ( 35Kops/s)
Optimized:
pixbuf = L1: 43.54 L2: 36.02 M: 17.08 (137.09%) HT: 15.58 VT: 14.85 R: 13.87 RT: 8.38 ( 44Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:33:00 +0000 (19:33 +0200)]
test: add "pixbuf" and "rpixbuf" to lowlevel-blt-bench
Add the necessary support to the lowlevel-blt benchmark for benchmarking the pixbuf
and rpixbuf fast paths. The bench_composite function now checks for the pixbuf string
in the testname and, if it is detected, uses the same bits for the src and mask images.
Nemanja Lukic [Mon, 15 Apr 2013 17:32:59 +0000 (19:32 +0200)]
test: add "src_0888_8888_rev" and "src_0888_0565_rev" to lowlevel-blt-bench
Nemanja Lukic [Mon, 15 Apr 2013 17:32:58 +0000 (19:32 +0200)]
MIPS: DSPr2: Fix for bug in in_n_8 routine.
The rounding logic was not implemented correctly.
Logical shifts were used instead of the rounding version of the 8-bit shift.
Also, the code used unnecessary multiplications, which could be avoided by packing
4 destination (a8) pixels into one 32-bit register. There were also unnecessary
spills to the stack. The code has been rewritten to address these issues.
The bug was revealed by increasing number of the iterations in blitters-test.
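To illustrate the kind of difference involved, the C sketch below compares a rounded 8-bit multiply (the common "add 0x80 and fold the high byte back in" form) with a version that uses a plain logical shift. It is a generic model of the arithmetic and not a transcription of the DSPr2 assembly; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded a * b / 255 for 8-bit values. */
static uint8_t
mul_un8_rounded (uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t) a * b + 0x80;

    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* What a plain logical shift gives instead: a * b / 256, truncated. */
static uint8_t
mul_un8_truncated (uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t) a * b;

    assert (t <= 255u * 255u);

    return (uint8_t) (t >> 8);
}
```

With the truncating form, 255 * 255 yields 254 rather than 255, so a fully opaque value no longer stays fully opaque. Errors of that size are easy to miss unless the test happens to generate the right inputs.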
Performance numbers on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
in_n_8 = L1: 21.20 L2: 22.86 M: 21.42 ( 14.21%) HT: 15.97 VT: 15.69 R: 15.47 RT: 8.00 ( 48Kops/s)
Optimized (first implementation, with bug):
in_n_8 = L1: 89.38 L2: 86.07 M: 65.48 ( 43.44%) HT: 44.64 VT: 41.50 R: 40.77 RT: 16.94 ( 66Kops/s)
Optimized (with bug fix, and code revisited):
in_n_8 = L1: 102.33 L2: 95.65 M: 70.54 ( 46.84%) HT: 48.35 VT: 45.06 R: 43.20 RT: 17.60 ( 66Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:32:57 +0000 (19:32 +0200)]
MIPS: DSPr2: Added src_0565_8888 nearest neighbor fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
src_0565_8888 = L1: 20.70 L2: 19.22 M: 12.50 ( 49.79%) HT: 10.45 VT: 10.18 R: 9.99 RT: 5.31 ( 31Kops/s)
Optimized:
src_0565_8888 = L1: 62.98 L2: 53.44 M: 23.07 ( 91.87%) HT: 19.85 VT: 19.15 R: 17.70 RT: 9.68 ( 43Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:32:56 +0000 (19:32 +0200)]
MIPS: DSPr2: Added over_8888_0565 nearest neighbor fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
over_8888_0565 = L1: 13.22 L2: 12.02 M: 9.77 ( 38.92%) HT: 8.58 VT: 8.35 R: 8.38 RT: 5.78 ( 35Kops/s)
Optimized:
over_8888_0565 = L1: 26.20 L2: 22.97 M: 15.92 ( 63.40%) HT: 13.33 VT: 13.13 R: 12.72 RT: 7.65 ( 39Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:32:55 +0000 (19:32 +0200)]
MIPS: DSPr2: Added over_8888_8888 nearest neighbor fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Referent (before):
over_8888_8888 = L1: 19.47 L2: 16.30 M: 11.24 ( 59.69%) HT: 9.54 VT: 9.29 R: 9.47 RT: 6.24 ( 37Kops/s)
Optimized:
over_8888_8888 = L1: 43.67 L2: 33.30 M: 16.32 ( 86.65%) HT: 14.10 VT: 13.78 R: 12.96 RT: 7.85 ( 39Kops/s)
Nemanja Lukic [Mon, 15 Apr 2013 17:32:54 +0000 (19:32 +0200)]
MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines
After the new PRNG (pseudorandom number generator) was introduced, a bug in two DSPr2
routines was revealed. The bug manifested as wrong results in the composite and
glyph tests, which caused make check to fail for the MIPS DSPr2 optimizations.
The bug was in the calculation of:
*dst = over (src, *dst) when ma == 0xffffffff
In this case src was not negated and shifted right by 24 bits, it was only
negated. When implementing this routine in the first place, I misplaced those
shifts, which allowed me to combine code for the over operation and:
UN8x4_MUL_UN8x4 (s, ma);
UN8x4_MUL_UN8 (ma, srca);
ma = ~ma;
UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
So I decided to rewrite that piece of code from scratch. I changed logic, so
now assembly code mimics code from pixman-fast-path.c but processes two pixels
at a time. This code should be easier to debug and maintain.
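As a plain C model of the two cases discussed above, the sketch below applies the four macro steps one channel at a time and compares the result with an ordinary OVER. It is written from the macro sequence quoted in this message and assumes premultiplied a8r8g8b8 pixels; the helper names are hypothetical and the code is not taken from pixman-fast-path.c.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded a * b / 255 for 8-bit values. */
static uint8_t
mul_un8 (uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t) a * b + 0x80;

    return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* Saturating 8-bit addition. */
static uint8_t
add_un8_sat (uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned) a + b;

    return (uint8_t) (sum > 255 ? 255 : sum);
}

/* Component-alpha OVER: the four macro steps, applied per channel. */
static uint32_t
over_ca (uint32_t src, uint32_t mask, uint32_t dst)
{
    uint8_t  srca = (uint8_t) (src >> 24);
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint8_t s = (uint8_t) (src >> shift);
        uint8_t m = (uint8_t) (mask >> shift);
        uint8_t d = (uint8_t) (dst >> shift);

        uint8_t sm = mul_un8 (s, m);       /* s = s * ma        */
        uint8_t am = mul_un8 (m, srca);    /* ma = ma * srca    */
        uint8_t r = add_un8_sat (mul_un8 (d, (uint8_t) ~am), sm);
                                           /* d = d * ~ma + s   */
        result |= (uint32_t) r << shift;
    }
    return result;
}

/* Ordinary OVER: dst = src + dst * (255 - srca) / 255, per channel. */
static uint32_t
over (uint32_t src, uint32_t dst)
{
    uint8_t  ia = (uint8_t) ~(src >> 24);
    uint32_t result = 0;
    int shift;

    assert (ia == 255 - (src >> 24));

    for (shift = 0; shift < 32; shift += 8)
    {
        uint8_t s = (uint8_t) (src >> shift);
        uint8_t d = (uint8_t) (dst >> shift);

        result |= (uint32_t) add_un8_sat (mul_un8 (d, ia), s) << shift;
    }
    return result;
}
```

With ma == 0xffffffff the general path reduces to the ordinary OVER, which is what makes the special case a pure shortcut. A mask of zero leaves the destination untouched.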
The bug was revealed in commit b31a6962. Errors were detected by composite
and glyph tests.
Siarhei Siamashka [Mon, 28 Jan 2013 05:00:12 +0000 (07:00 +0200)]
sse2: faster bilinear interpolation (get rid of XOR instruction)
The old code was calculating horizontal weights for right pixels
in the following way (for simplicity assume 8-bit interpolation
precision):
Start with "x = vx" and do increment "x += ux" after each pixel.
In this case right pixel weight for interpolation can be calculated
as "((x >> 8) ^ 0xFF) + 1", which is the same as "256 - (x >> 8)".
The new code instead:
Starts with "x = -(vx + 1)", performs increment "x += -ux" after
each pixel and calculates right weights as just "(x >> 8) + 1",
eliminating the need for XOR operation in the inner loop.
So we have one instruction less on the critical path. Benchmarks
with "lowlevel-blt-bench -b src_8888_8888" using GCC 4.7.2 on
x86-64 system and default optimizations:
Intel Core i7 860 (2.8GHz):
before: src_8888_8888 = L1: 291.37 L2: 288.58 M:285.38
after: src_8888_8888 = L1: 319.66 L2: 316.47 M:312.06
Intel Core2 T7300 (2GHz):
before: src_8888_8888 = L1: 121.95 L2: 118.38 M:118.52
after: src_8888_8888 = L1: 128.82 L2: 125.12 M:124.88
Intel Atom N450 (1.67GHz):
before: src_8888_8888 = L1: 64.25 L2: 62.37 M: 61.80
after: src_8888_8888 = L1: 64.23 L2: 62.37 M: 61.82
Inspired by the "sse2_bilinear_interpolation" function (single
pixel interpolation) from:
http://lists.freedesktop.org/archives/pixman/2013-January/002575.html
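The equivalence of the two schemes can be checked with a scalar model. The sketch assumes 16-bit accumulators that wrap around, as a 16-bit SIMD lane would, and 8-bit interpolation precision as in the description above. The function names are made up. It relies on the identity -(x + 1) == ~x in two's complement arithmetic, so the negated accumulator always holds the bitwise complement of the original one.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: x starts at vx and is advanced with x += ux. */
static unsigned
weight_old (uint16_t x)
{
    return (unsigned) (((x >> 8) ^ 0xFF) + 1);
}

/* New scheme: x starts at -(vx + 1) and is advanced with x += -ux. */
static unsigned
weight_new (uint16_t neg_x)
{
    return (unsigned) ((neg_x >> 8) + 1);
}

/* Walk 'n' pixels with both schemes; returns 1 if every weight agrees. */
static int
weights_agree (uint16_t vx, uint16_t ux, int n)
{
    uint16_t x = vx;
    uint16_t neg_x = (uint16_t) -(vx + 1);
    int i;

    assert (n >= 0);

    for (i = 0; i < n; i++)
    {
        if (weight_old (x) != weight_new (neg_x))
            return 0;

        x = (uint16_t) (x + ux);
        neg_x = (uint16_t) (neg_x - ux);
    }
    return 1;
}
```

Since the complement is maintained by the negated increment, the XOR with 0xFF is already folded into the accumulator and does not have to be repeated for every pixel.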
Siarhei Siamashka [Mon, 4 Mar 2013 22:59:13 +0000 (00:59 +0200)]
test: larger 0xFF/0x00 filled clusters in random images for blitters-test
Current blitters-test program had difficulties detecting a bug in
over_n_8888_8888_ca implementation for MIPS DSPr2:
http://lists.freedesktop.org/archives/pixman/2013-March/002645.html
In order to hit the buggy code path, two consecutive mask values had
to be equal to 0xFFFFFFFF because of loop unrolling. The current
blitters-test generates random images in such a way that each byte
has 25% probability for having 0xFF value. Hence each 32-bit mask
value has ~0.4% probability for 0xFFFFFFFF. Because we are testing
many compositing operations with many pixels, encountering at least
one 0xFFFFFFFF mask value reasonably fast is not a problem. If a
bug related to 0xFFFFFFFF mask value is artificially introduced into
over_n_8888_8888_ca generic C function, it gets detected on 675591
iteration in blitters-test (out of 2000000).
However two consecutive 0xFFFFFFFF mask values are much less likely
to be generated, so the bug was missed by blitters-test.
This patch addresses the problem by also randomly setting the 32-bit
values in images to either 0xFFFFFFFF or 0x00000000 (also with 25%
probability). This allows larger clusters of consecutive 0x00
or 0xFF bytes in images which may have special shortcuts for handling
them in unrolled or SIMD optimized code.
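The probabilities quoted for the old image generator follow directly from the 25% per-byte chance, assuming the bytes are independent. A tiny sketch with a hypothetical helper:

```c
#include <assert.h>

/*
 * Probability that 'n_bytes' independent bytes are all 0xFF, given the
 * per-byte probability 'p'.
 */
static double
prob_all_ff (double p, int n_bytes)
{
    double result = 1.0;
    int i;

    assert (n_bytes >= 0);

    for (i = 0; i < n_bytes; i++)
        result *= p;

    return result;
}
```

For p = 0.25, four bytes give 0.25^4 = 0.00390625, which is the "~0.4%" above. Eight bytes, that is two consecutive 32-bit mask values, give 0.25^8, or roughly 0.0015%, about one pair in 65536.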
Stefan Weil [Sat, 27 Apr 2013 06:00:38 +0000 (08:00 +0200)]
Trivial spelling fixes in comments
They were found by codespell.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Peter Breitenlohner [Mon, 8 Apr 2013 11:13:05 +0000 (13:13 +0200)]
Check for missing sqrtf() as, e.g., for Solaris 9
Signed-off-by: Peter Breitenlohner <peb@mppmu.mpg.de>
Søren Sandmann Pedersen [Thu, 14 Feb 2013 13:06:19 +0000 (08:06 -0500)]
Improve precision of calculations in pixman-gradient-walker.c
The computations in pixman-gradient-walker.c currently take place at
very limited 8 bit precision which results in quite visible artefacts
in gradients. An example is the one produced by demos/linear-gradient
which currently looks like this:
http://i.imgur.com/kQbX8nd.png
With the changes in this commit, the gradient looks like this:
http://i.imgur.com/nUlyuKI.png
The images are also available here:
http://people.freedesktop.org/~sandmann/gradients/before.png
http://people.freedesktop.org/~sandmann/gradients/after.png
This patch computes pixels using floating point, but uses a faster
algorithm, which makes up for the loss of performance.
== Theory:
In both the new and the old algorithm, the various gradient
implementations compute a parameter x that indicates how far along the
gradient the current scanline is. The current algorithm has a cache of
the two color stops surrounding the last parameter; those are used in
a SIMD-within-register fashion in this way:
t1 = walker->left_rb * idist + walker->right_rb * dist;
where dist and idist are the distances to the left and right color
stops respectively normalized to the distance between the left and
right stops. The normalization (which involves a division) is captured
in another cached variable "stepper". The cached values are recomputed
whenever the parameter moves in between two different stops (called
"reset" in the implementation).
Because idist and dist are computed in 8 bits only, a lot of
information is lost, which is quite visible as the image linked above
shows.
The new algorithm caches more information in the following way. When
interpolating between stops, the formula to be used is this:
t = ((x - left) / (right - left));
result = lc * (1 - t) + rc * t;
where
- x is the parameter as computed by the main gradient code,
- left is the position of the left color stop,
- right is the position of the right color stop
- lc is the color of the left color stop
- rc is the color of the right color stop
That formula can also be written like this:
result
= lc * (1 - t) + rc * t;
= lc + (rc - lc) * t
= lc + (rc - lc) * ((x - left) / (right - left))
= (rc - lc) / (right - left) * x +
lc - (left * (rc - lc)) / (right - left)
= s * x + b
where
s = (rc - lc) / (right - left)
and
b = lc - left * (rc - lc) / (right - left)
= (lc * (right - left) - left * (rc - lc)) / (right - left)
= (lc * right - rc * left) / (right - left)
To summarize, setting w = (right - left):
s = (rc - lc) / w
b = (lc * right - rc * left) / w
r = s * x + b
Since s and b only depend on the two active stops, both can be cached
so that the computation only needs to do one multiplication and one
addition per pixel (followed by premultiplication of the alpha
channel). That is, seven multiplications in total, which is the same
number as the old SIMD-within-register implementation had.
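The derivation above can be checked numerically with a small sketch for a single color channel. The struct and function names are illustrative and not those used in pixman-gradient-walker.c, and the degenerate case right == left is simply excluded by an assertion here.

```c
#include <assert.h>

typedef struct
{
    float s;   /* slope:     (rc - lc) / (right - left) */
    float b;   /* intercept: (lc * right - rc * left) / (right - left) */
} channel_coeffs_t;

/* Recomputed only when the parameter moves to a different pair of stops. */
static channel_coeffs_t
cache_channel (float left, float lc, float right, float rc)
{
    channel_coeffs_t c;
    float w_rec;

    assert (right != left);

    w_rec = 1.0f / (right - left);   /* one reciprocal, two multiplications */

    c.s = (rc - lc) * w_rec;
    c.b = (lc * right - rc * left) * w_rec;
    return c;
}

/* Per pixel: one multiplication and one addition. */
static float
eval_channel (const channel_coeffs_t *c, float x)
{
    return c->s * x + c->b;
}

/* Direct interpolation, used here only to check the cached form. */
static float
lerp_channel (float left, float lc, float right, float rc, float x)
{
    float t = (x - left) / (right - left);

    return lc * (1.0f - t) + rc * t;
}
```

With stops at 0.25 (color 0.2) and 0.75 (color 1.0), this gives s = 1.6 and b = -0.2, so the parameter 0.5 evaluates to 0.6, the same value the direct interpolation produces.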
== Implementation notes:
The new formula described above is implemented in single precision
floating point, and the eight divisions necessary to compute the
cached values are done by multiplication with the reciprocal of the
distance between the color stops.
The alpha values used in the cached computation are scaled by 255.0,
whereas the RGB values are kept in the [0, 1] interval. This ensures
that after premultiplication, all values will be in the [0, 255]
interval.
This scaling is done by first dividing all the channels by
257, and then later on dividing the r, g, b channels by 255. It would
be more natural to do all this scaling in only one place, but
inexplicably, that results in a (substantial) slowdown on Sandy Bridge
with GCC v 4.7.
== Performance impact (median of three runs of radial-perf-test):
== Intel Sandy Bridge, Core i3 @ 1.2GHz
Before: 0.014553
After: 0.014410
Change: 1.0% faster
== AMD Barcelona @ 1.2 GHz
Before: 0.021735
After: 0.021328
Change: 1.9% faster
I.e., slightly faster, though conceivably there could be a negative
impact on machines with a bigger difference between integer and
floating point performance.
V2:
- Use 's' and 'b' in the variable names instead of 'm' and 'd'. This
way they match the explanation above
- Move variable declarations to the top of the function
- Remove unused stepper field
- Some formatting fixes
- Don't pointlessly include pixman-combine32.h
- Don't offset x for each pixel; go back to offsetting left_x and
right_x at reset time. The offsets cancel out in the formula above,
so there is no impact on the calculations.
Søren Sandmann Pedersen [Fri, 8 Mar 2013 19:05:50 +0000 (14:05 -0500)]
Move the IS_ZERO() to pixman-private.h and rename to FLOAT_IS_ZERO()
Some upcoming changes to pixman-gradient-walker.c will need this
macro.
Søren Sandmann Pedersen [Mon, 25 Feb 2013 02:49:06 +0000 (21:49 -0500)]
test: Add radial-perf-test, a microbenchmark for radial gradients
This benchmark renders one of the radial gradients used in the
swfdec-youtube cairo trace 500 times and reports the average time it
took.
V2: Update .gitignore
Søren Sandmann Pedersen [Fri, 15 Feb 2013 01:32:31 +0000 (20:32 -0500)]
demos: Add linear-gradient demo program
This program displays a linear gradient from blue to yellow. Due to
limited precision in pixman-gradient-walker.c, it currently has some
ugly artefacts that give it a 'brushed metal' appearance.
V2: Update .gitignore
Behdad Esfahbod [Fri, 8 Mar 2013 11:00:00 +0000 (06:00 -0500)]
Remove unused macro
Nemanja Lukic [Wed, 27 Feb 2013 13:40:51 +0000 (14:40 +0100)]
MIPS: DSPr2: Added more fast-paths for SRC operation:
- src_0888_8888_rev
- src_0888_0565_rev
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Reference (before):
src_0888_8888_rev = L1: 51.88 L2: 42.00 M: 19.04 ( 88.50%) HT: 15.27 VT: 14.62 R: 14.13 RT: 7.12 ( 45Kops/s)
src_0888_0565_rev = L1: 31.96 L2: 30.90 M: 22.60 ( 75.03%) HT: 15.32 VT: 15.11 R: 14.49 RT: 6.64 ( 43Kops/s)
Optimized:
src_0888_8888_rev = L1: 222.73 L2: 113.70 M: 20.97 ( 97.35%) HT: 18.31 VT: 17.14 R: 16.71 RT: 9.74 ( 54Kops/s)
src_0888_0565_rev = L1: 100.37 L2: 74.27 M: 29.43 ( 97.63%) HT: 22.92 VT: 21.59 R: 20.52 RT: 10.56 ( 56Kops/s)
Nemanja Lukic [Wed, 27 Feb 2013 13:39:45 +0000 (14:39 +0100)]
MIPS: DSPr2: Added more fast-paths for OVER operation:
- over_8888_0565
- over_n_8_8
Performance numbers before/after on MIPS-74kc @ 1GHz:
lowlevel-blt-bench results
Reference (before):
over_8888_0565 = L1: 14.30 L2: 13.22 M: 10.43 ( 41.56%) HT: 12.51 VT: 12.95 R: 11.82 RT: 7.34 ( 49Kops/s)
over_n_8_8 = L1: 12.77 L2: 16.93 M: 15.03 ( 29.94%) HT: 10.78 VT: 10.72 R: 10.29 RT: 4.92 ( 33Kops/s)
Optimized:
over_8888_0565 = L1: 26.03 L2: 22.92 M: 15.68 ( 62.43%) HT: 16.19 VT: 16.27 R: 14.93 RT: 8.60 ( 52Kops/s)
over_n_8_8 = L1: 62.00 L2: 55.17 M: 40.29 ( 80.23%) HT: 26.77 VT: 25.64 R: 24.13 RT: 10.01 ( 47Kops/s)
Søren Sandmann Pedersen [Fri, 15 Feb 2013 23:34:46 +0000 (18:34 -0500)]
gtk-utils.c: Use cairo in show_image() rather than GdkPixbuf
GdkPixbufs are not premultiplied, so when using them to display pixman
images, there are some unnecessary conversions going on: first the image
is converted to non-premultiplied, and then GdkPixbuf premultiplies
before sending the result to the X server. These conversions may cause
the displayed image to not be exactly identical to the original.
This patch just uses a cairo image surface instead, which avoids these
conversions.
Also make the comment about sRGB a little more concise.
Ben Avison [Wed, 6 Feb 2013 00:39:12 +0000 (00:39 +0000)]
Fix to lowlevel-blt-bench
The source, mask and destination buffers are initialised to 0xCC just after
they are allocated. Between each benchmark, there is a pair of memcpys,
from the destination buffer to the source buffer and back again (there are
no explanatory comments, but presumably this is an effort to flush the
caches). However, it has an unintended consequence, which is to change the
contents of the buffers on entry to subsequent benchmarks. This means it is
not a fair test: for example, with over_n_8888 (featured in the following
patches) it reports L2 and even M tests as being faster than the L1 test,
because after the L1 test, the source buffer is filled with fully opaque
pixels, for which over_n_8888 has a shortcut.
The fix here is simply to reverse the order of the memcpys, so src and
destination are both filled with 0xCC on entry to all tests.
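The effect of the copy order can be illustrated with a small sketch.
The buffer size and the function names below are hypothetical and are
not taken from lowlevel-blt-bench.c; only the 0xCC fill value and the
two copy orders come from the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64 /* hypothetical; the real buffers are much larger */

/* Old order: dst -> src, then src -> dst. Whatever the previous
 * benchmark left in dst leaks into src. */
static void
flush_old_order (uint8_t *src, uint8_t *dst)
{
    assert (src != dst); /* memcpy requires non-overlapping buffers */
    memcpy (src, dst, BUF_SIZE);
    memcpy (dst, src, BUF_SIZE);
}

/* New order: src -> dst, then dst -> src. The benchmarks only read
 * from src, so both buffers end up with the initial fill value. */
static void
flush_new_order (uint8_t *src, uint8_t *dst)
{
    assert (src != dst);
    memcpy (dst, src, BUF_SIZE);
    memcpy (src, dst, BUF_SIZE);
}
```

If dst holds opaque pixels after a test and src still holds 0xCC, the
old order leaves both buffers opaque, while the new order restores
0xCC in both.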
Stefan Weil [Sat, 9 Feb 2013 11:40:16 +0000 (12:40 +0100)]
sse2: Use uintptr_t in type casts from pointer to integral value
Some recent code added new type casts from pointer to unsigned long.
These type casts result in compiler warnings for systems like
MinGW-w64 (64 bit Windows) where sizeof(unsigned long) != sizeof(void *).
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
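As an illustration of the kind of cast involved, here is a minimal
sketch of an alignment check. The helper name is_aligned() is an
assumption made for this example and is not a pixman function.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if ptr is aligned to 'alignment' bytes.
 * uintptr_t can hold a pointer value on every platform that provides
 * it, including 64 bit Windows where unsigned long is only 32 bits. */
static int
is_aligned (const void *ptr, uintptr_t alignment)
{
    /* alignment must be a nonzero power of two */
    assert (alignment != 0 && (alignment & (alignment - 1)) == 0);

    return ((uintptr_t) ptr & (alignment - 1)) == 0;
}
```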
Søren Sandmann Pedersen [Thu, 31 Jan 2013 19:54:49 +0000 (14:54 -0500)]
lookup_composite: Don't update cache in case of error
If we fail to find a composite function, don't update the fast path
cache with the dummy compositing function.
Also make the error message state that the bug is likely caused by
issues with thread local storage.
Søren Sandmann Pedersen [Thu, 31 Jan 2013 19:36:38 +0000 (14:36 -0500)]
Turn on error logging at all times
While releasing 0.29.2 the distcheck run produced a number of error
messages that had to be fixed in 349015e1fc5d912ba4253133b90e751d0b.
These were not caught before so nobody had actually run pixman with
debugging turned on. It's not the first time this has happened, see
5b0563f39eb29e4ae431717696174da5 for example.
So this patch makes the return_if_fail() macros use unlikely() around
the expressions and then turns on error logging at all times. The
performance hit should be negligible since we were already evaluating the
expressions.
The place where DEBUG actually does cause a performance hit is in the
region selfcheck code, and that will still only be enabled in
development snapshots.
Søren Sandmann Pedersen [Thu, 31 Jan 2013 19:31:26 +0000 (14:31 -0500)]
pixman-compiler.h: Add unlikely() macro
When compiling with GCC this macro expands to __builtin_expect((expr), 0).
On other compilers, it just expands to (expr).
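A minimal sketch of such a macro, together with a hypothetical
early-return helper in the spirit of return_if_fail(). The names
RETURN_VAL_IF_FAIL and safe_divide are assumptions made for this
example only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#  define unlikely(expr) (__builtin_expect ((expr), 0))
#else
#  define unlikely(expr) (expr)
#endif

/* Hypothetical helper: return early when a precondition fails. The
 * unlikely() hint tells GCC that the failure branch is the cold one. */
#define RETURN_VAL_IF_FAIL(expr, retval)        \
    do                                          \
    {                                           \
        if (unlikely (!(expr)))                 \
            return (retval);                    \
    } while (0)

/* Stores a / b in *result and returns 1, or returns 0 if b is zero */
static int
safe_divide (int a, int b, int *result)
{
    assert (result != NULL);

    RETURN_VAL_IF_FAIL (b != 0, 0);

    *result = a / b;
    return 1;
}
```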
Søren Sandmann Pedersen [Tue, 22 Jan 2013 13:29:57 +0000 (08:29 -0500)]
utils.c: Increase acceptable deviation to 0.0064 in pixel_checker_t
The check-formats program reveals that the 8 bit pipeline cannot meet
the current 0.004 acceptable deviation specified in utils.c, so we
have to increase it. Some of the failing pixels were captured in
pixel-test, which with this commit now passes.
== a4r4g4b4 DISJOINT_XOR a8r8g8b8 ==
The DISJOINT_XOR operator applied to an a4r4g4b4 source pixel of
0xd0c0 and a destination pixel of 0x5300ea00 results in the exact
value:
fa = (1 - da) / sa = (1 - 0x53 / 255.0) / (0xd / 15.0) = 0.7782
fb = (1 - sa) / da = (1 - 0xd / 15.0) / (0x53 / 255.0) = 0.4096
r = fa * (0xc / 15.0) + fb * (0xea / 255.0) = 0.99853
But when computing in 8 bits, we get:
fa8 = ((255 - 0x53) * 255 + 0xdd / 2) / 0xdd = 0xc6
fb8 = ((255 - 0xdd) * 255 + 0x53 / 2) / 0x53 = 0x68
r8 = (fa8 * 0xcc + 127) / 255 + (fb8 * 0xea + 127) / 255 = 0xfd
and
0xfd / 255.0 = 0.9921568627450981
for a deviation of 0.00637118610187, which we then have to consider
acceptable given the current implementation.
By switching to computing the result with
r = (fa * s + fb * d + 127) / 255
rather than
r = (fa * s + 127) / 255 + (fb * d + 127) / 255
the deviation would be only 0.00244961747442, so at some point it may
be worth doing either this, or switching to floating point for
operators that involve divisions.
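The 8 bit arithmetic in this example can be reproduced with a short
sketch. The function names are made up for illustration; the formulas
are the ones written out above.

```c
#include <assert.h>
#include <stdint.h>

/* n / d scaled to the 0..255 range, with rounding */
static uint32_t
ratio_to_8bit (uint32_t n, uint32_t d)
{
    assert (d != 0);
    return (n * 255 + d / 2) / d;
}

/* Each product is rounded separately, then the results are added */
static uint32_t
combine_split (uint32_t fa, uint32_t s, uint32_t fb, uint32_t d)
{
    return (fa * s + 127) / 255 + (fb * d + 127) / 255;
}

/* The products are added first and rounded only once */
static uint32_t
combine_joint (uint32_t fa, uint32_t s, uint32_t fb, uint32_t d)
{
    return (fa * s + fb * d + 127) / 255;
}
```

With fa8 = ratio_to_8bit (255 - 0x53, 0xdd) and
fb8 = ratio_to_8bit (255 - 0xdd, 0x53), combine_split (fa8, 0xcc, fb8,
0xea) gives 0xfd, while combine_joint() with the same arguments gives
0xfe, which is 0.99608 as a real value and hence about 0.00245 away
from the exact result.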
Note that the conversion from 4 bits to 8 bits does not cause any
error in this case because both rounding and bit replication produce
an exact result when the number of from-bits divides the number of
to-bits.
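For the 4 bit to 8 bit case this can be checked directly. The sketch
below uses hypothetical helper names; it compares bit replication with
rounding for one 4 bit channel.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 4 bit channel to 8 bits by repeating the bit pattern */
static uint8_t
expand_4_to_8_replicate (uint8_t x)
{
    assert (x <= 0xf);
    return (uint8_t) ((x << 4) | x);
}

/* Expand a 4 bit channel to 8 bits by rounding x / 15 * 255 */
static uint8_t
expand_4_to_8_round (uint8_t x)
{
    assert (x <= 0xf);
    return (uint8_t) (x / 15.0 * 255.0 + 0.5);
}
```

Both functions amount to multiplying by 17, so 0xd becomes 0xdd and
0xc becomes 0xcc either way.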
== a8r8g8b8 OVER r5g6b5 ==
When OVER compositing the a8r8g8b8 pixel 0x0f00c300 with the x14r6g6b6
pixel 0x03c0, the true floating point value of the resulting green
channel is:
0xc3 / 255.0 + (1.0 - 0x0f / 255.0) * (0x0f / 63.0) = 0.9887955
but when compositing 8 bit values, where the 6-bit green channel is
converted to 8 bit through bit replication, the 8-bit result is:
0xc3 + ((255 - 0x0f) * 0x3c + 127) / 255 = 251
which corresponds to a real value of 0.984314. The difference from the
true value is 0.004482 which is bigger than the acceptable deviation
of 0.004. So, if we were to compute all the CONJOINT/DISJOINT
operators in floating point, or otherwise make them more accurate, the
acceptable deviation could be set at 0.0045.
If we were doing the 6-bit conversion with rounding:
(x / 63.0 * 255.0 + 0.5)
instead of bit replication, the deviation in this particular case
would be only 0.0005, so we may want to consider this at some
point.
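The two ways of widening the 6 bit green channel can be compared with
the sketch below. The helper names are hypothetical; the formulas
follow the text above.

```c
#include <assert.h>
#include <stdint.h>

/* 6 -> 8 bits by bit replication */
static uint32_t
expand_6_to_8_replicate (uint32_t x)
{
    assert (x <= 0x3f);
    return (x << 2) | (x >> 4);
}

/* 6 -> 8 bits by rounding x / 63 * 255 */
static uint32_t
expand_6_to_8_round (uint32_t x)
{
    assert (x <= 0x3f);
    return (uint32_t) (x / 63.0 * 255.0 + 0.5);
}

/* 8 bit OVER for a single channel: s + (255 - sa) * d / 255 */
static uint32_t
over_channel (uint32_t s, uint32_t sa, uint32_t d)
{
    return s + ((255 - sa) * d + 127) / 255;
}
```

For the green value 0x0f, replication yields 0x3c and OVER produces
251, whereas rounding yields 61 and OVER produces 252. As a real value
252 is 0.988235, about 0.00056 below the true result.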
Søren Sandmann Pedersen [Sat, 19 Jan 2013 21:32:15 +0000 (16:32 -0500)]
test: Add new pixel-test regression test
This test program contains a table of individual operator/pixel
combinations. For each pixel combination, images of various sizes are
filled with the pixels and then composited. The result is then
verified against the output of do_composite(). If the result doesn't
match, detailed error information is printed.
The initial 14 pixel combinations currently all fail.
Søren Sandmann Pedersen [Mon, 21 Jan 2013 20:02:53 +0000 (15:02 -0500)]
a1-trap-test: Add tests for operator_name() and format_name()
The check-formats.c test depends on the exact format of the strings
returned from these functions, so add a test here.
a1-trap-test isn't the ideal place, but it seems like overkill to add
a new test just for these trivial checks.
Søren Sandmann Pedersen [Mon, 21 Jan 2013 20:54:05 +0000 (15:54 -0500)]
test: Add new check-formats utility
Given an operator and two formats, this program will composite and
check all pixels where the red and blue channels are 0. That is, if
the two formats are a8r8g8b8 and a4r4g4b4, all source pixels matching
the mask
0xff00ff00
are composited with the given operator against all destination pixels
matching the mask
0xf0f0
and the result is then verified against the do_composite() function
that was moved to utils.c earlier.
This program reveals that a number of operators and format
combinations are not computed to within the precision currently
accepted by pixel_checker_t. For example:
check-formats over a8r8g8b8 r5g6b5 | grep failed | wc -l
30
reveals that there are 30 pixel combinations where OVER produces
insufficiently precise results for the a8r8g8b8 and r5g6b5 formats.
Søren Sandmann Pedersen [Tue, 22 Jan 2013 12:36:19 +0000 (07:36 -0500)]
utils.[ch]: Add pixel_checker_get_masks()
This function returns the a, r, g, and b masks corresponding to the
pixel checker's format.
Søren Sandmann Pedersen [Tue, 22 Jan 2013 16:57:53 +0000 (11:57 -0500)]
test/utils.[ch]: Add pixel_checker_convert_pixel_to_color()
This function takes a pixel in the format corresponding to the pixel
checker, and converts to a color_t.
Søren Sandmann Pedersen [Sat, 19 Jan 2013 17:14:24 +0000 (12:14 -0500)]
test: Move do_composite() function from composite.c to utils.c
So that it can be used in other tests.
Søren Sandmann Pedersen [Wed, 30 Jan 2013 02:42:02 +0000 (21:42 -0500)]
Post-release version bump to 0.29.3
Søren Sandmann Pedersen [Wed, 30 Jan 2013 01:23:39 +0000 (20:23 -0500)]
Pre-release version bump to 0.29.2
Søren Sandmann Pedersen [Wed, 30 Jan 2013 01:23:31 +0000 (20:23 -0500)]
stresstest: Ensure that the rasterizer is only given alpha formats
In c2cb303d33ec11390b93cabd90f0f9, return_if_fail()s were added to
prevent the trapezoid rasterizers from being called with non-alpha
formats. However, stress-test actually does call the rasterizers with
non-alpha formats, but because _pixman_log_error() is disabled in
versions with an odd minor number, the errors never materialized.
Fix this by changing the argument to random format to an enum of three
values DONT_CARE, PREFER_ALPHA, or REQUIRE_ALPHA, and then in the
switch that calls the trapezoid rasterizers, pass the appropriate
value for the function in question.
Søren Sandmann Pedersen [Mon, 28 Jan 2013 01:08:06 +0000 (20:08 -0500)]
Change default GPGKEY to 3892336E, which is soren.sandmann@gmail.com
The old one belongs to the email address sandmann@daimi.au.dk, which
doesn't work anymore.
Also use gpg to get the name and address for the "(Signed by ...)"
line since that works more reliably for me than using git.
Ben Avison [Thu, 24 Jan 2013 18:19:48 +0000 (18:19 +0000)]
Improve L1 and L2 benchmark tests for caches that don't use allocate-on-write
In particular this affects single-core ARMs (e.g. ARM11, Cortex-A8), which
are usually configured this way. For other CPUs, this should only add a
constant time, which will be cancelled out by the EXCLUDE_OVERHEAD runs.
The problems were caused by cachelines becoming permanently evicted from
the cache, because the code that was intended to pull them back in again on
each iteration assumed too long a cache line (for the L1 test) or failed to
read memory beyond the first pixel row (for the L2 test). Also, the reloading
of the source buffer was unnecessary.
These issues were identified by Siarhei in this post:
http://lists.freedesktop.org/archives/pixman/2013-January/002543.html
Søren Sandmann Pedersen [Fri, 18 Jan 2013 19:13:21 +0000 (14:13 -0500)]
pixman-combine-float.c: Use IS_ZERO() in clip_color() and set_sat()
The clip_color() function has some checks to avoid division by zero,
but they are done by comparing the value to 4 * FLT_EPSILON, where a
better choice is the IS_ZERO() macro that compares to +/- FLT_MIN.
In set_sat(), the check is that *max > *min before dividing by *max -
*min, but that has the potential problem that interactions between GCC
optimizations and 80 bit x87 registers could mean that (*max > *min) is
true in 80 bits, but (*max - *min) is 0 in 32 bits, so that the
division by zero is not prevented. Using IS_ZERO() here as well
prevents this.
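A minimal sketch of the idea follows. The macro body is written from
the description above (a comparison against +/- FLT_MIN), and the
normalize() helper is hypothetical; it is not the code from
pixman-combine-float.c.

```c
#include <assert.h>
#include <float.h>

/* True when f lies strictly between -FLT_MIN and +FLT_MIN */
#define IS_ZERO(f) (-FLT_MIN < (f) && (f) < FLT_MIN)

/* Hypothetical helper: rescale v from [min, max] to [0, 1]. The
 * divisor itself is tested, rather than the comparison max > min, so
 * a difference that collapses to zero can never be divided by. */
static float
normalize (float v, float min, float max)
{
    float range = max - min;

    assert (max >= min);

    if (IS_ZERO (range))
        return 0.0f;

    return (v - min) / range;
}
```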
Ben Avison [Sat, 19 Jan 2013 16:16:53 +0000 (16:16 +0000)]
ARMv6: Replacement add_8_8, over_8888_8888, over_8888_n_8888 and over_n_8_8888 routines
Improved by adding preloads, combining writes and using the SEL
instruction.
add_8_8
Before After
Mean StdDev Mean StdDev Confidence Change
L1 62.1 0.2 543.4 12.4 100.0% +774.9%
L2 38.7 0.4 116.8 1.7 100.0% +201.8%
M 40.0 0.1 110.1 0.5 100.0% +175.3%
HT 30.9 0.2 43.4 0.5 100.0% +40.4%
VT 30.6 0.3 39.2 0.5 100.0% +28.0%
R 21.3 0.2 35.4 0.4 100.0% +66.6%
RT 8.6 0.2 10.2 0.3 100.0% +19.4%
over_8888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 32.3 0.1 38.0 0.2 100.0% +17.7%
L2 15.9 0.4 30.6 0.5 100.0% +92.8%
M 13.3 0.0 25.6 0.0 100.0% +92.9%
HT 10.5 0.1 15.5 0.1 100.0% +47.1%
VT 10.4 0.1 14.6 0.1 100.0% +40.8%
R 10.3 0.1 15.8 0.1 100.0% +53.3%
RT 6.0 0.1 7.6 0.1 100.0% +25.9%
over_8888_n_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 17.6 0.1 21.0 0.1 100.0% +19.2%
L2 11.2 0.2 19.2 0.1 100.0% +71.2%
M 10.2 0.0 19.6 0.0 100.0% +92.6%
HT 8.4 0.0 11.9 0.1 100.0% +41.7%
VT 8.3 0.0 11.3 0.1 100.0% +36.4%
R 8.3 0.0 11.8 0.1 100.0% +43.1%
RT 5.1 0.1 6.2 0.1 100.0% +21.3%
over_n_8_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 17.5 0.1 22.8 0.8 100.0% +30.1%
L2 14.2 0.3 21.7 0.2 100.0% +52.6%
M 12.0 0.0 22.3 0.0 100.0% +84.8%
HT 10.5 0.1 14.1 0.1 100.0% +34.5%
VT 10.0 0.1 13.5 0.1 100.0% +35.3%
R 9.4 0.0 12.9 0.2 100.0% +37.7%
RT 5.5 0.1 6.5 0.2 100.0% +19.2%
Ben Avison [Sat, 19 Jan 2013 16:16:52 +0000 (16:16 +0000)]
ARMv6: New conversion routines
There was no previous attempt at accelerating these specifically for
ARMv6.
src_x888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 96.7 0.5 270.4 2.6 100.0% +179.5%
L2 44.6 2.7 110.6 9.7 100.0% +148.0%
M 26.9 0.1 87.6 0.5 100.0% +226.1%
HT 19.3 0.2 37.5 0.4 100.0% +93.7%
VT 18.6 0.1 33.7 0.4 100.0% +81.6%
R 18.4 0.1 32.2 0.3 100.0% +75.2%
RT 9.2 0.2 12.1 0.3 100.0% +31.4%
src_0565_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 37.0 0.3 66.9 0.2 100.0% +80.8%
L2 30.3 0.2 55.9 0.3 100.0% +84.4%
M 25.9 0.0 62.3 0.2 100.0% +140.3%
HT 15.2 0.1 33.1 0.3 100.0% +116.9%
VT 15.1 0.1 30.7 0.3 100.0% +103.6%
R 14.2 0.1 27.6 0.3 100.0% +94.0%
RT 6.0 0.1 11.2 0.3 100.0% +87.2%
Ben Avison [Sat, 19 Jan 2013 16:16:51 +0000 (16:16 +0000)]
ARMv6: New blit routines
These are usable either as various composite operations, or via the
top-level function pixman_blt() which now does some blitting for the
first time on an ARMv6 platform (previously it just returned FALSE).
src_8888_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 414.5 9.4 445.8 3.6 100.0% +7.6%
L2 93.3 20.7 114.5 12.9 100.0% +22.7%
M 57.0 0.2 89.2 0.5 100.0% +56.4%
HT 28.7 0.3 39.6 0.4 100.0% +37.9%
VT 25.5 0.2 35.3 0.4 100.0% +38.4%
R 20.1 0.1 33.8 0.3 100.0% +67.8%
RT 7.8 0.2 12.7 0.4 100.0% +62.7%
src_0565_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 397.4 6.1 412.5 5.2 100.0% +3.8%
L2 143.2 10.9 141.9 6.5 68.9% -0.9% (insignificant)
M 90.7 0.4 133.5 0.7 100.0% +47.1%
HT 38.6 0.3 53.7 0.7 100.0% +39.0%
VT 33.0 0.3 47.3 0.6 100.0% +43.3%
R 25.7 0.2 42.1 0.5 100.0% +64.1%
RT 8.0 0.2 13.3 0.3 100.0% +65.6%
src_8_8
Before After
Mean StdDev Mean StdDev Confidence Change
L1 716.5 9.8 768.2 20.4 100.0% +7.2%
L2 246.2 12.7 260.5 8.8 100.0% +5.8%
M 146.8 0.7 227.9 0.7 100.0% +55.2%
HT 44.9 0.6 62.1 1.0 100.0% +38.2%
VT 35.6 0.4 53.4 0.7 100.0% +50.0%
R 29.7 0.3 48.2 0.6 100.0% +62.2%
RT 8.6 0.2 12.9 0.4 100.0% +49.3%
Ben Avison [Sat, 19 Jan 2013 16:16:50 +0000 (16:16 +0000)]
ARMv6: New fill routines
Note that this also effectively accelerates src_n_8888, src_n_0565 and
src_n_8 composite types, because of the fast paths in
pixman-fast-path.c implemented by fast_composite_solid_fill(), which
end up dispatching these platform-specific fill routines.
src_n_8888
Before After
Mean StdDev Mean StdDev Confidence Change
L1 157.3 1.1 574.2 8.7 100.0% +265.0%
L2 94.2 0.5 364.8 4.2 100.0% +287.3%
M 92.7 0.4 358.7 1.1 100.0% +287.1%
HT 68.5 0.9 133.6 4.0 100.0% +95.2%
VT 61.3 0.8 111.8 2.6 100.0% +82.4%
R 61.1 0.9 108.7 2.8 100.0% +78.1%
RT 24.6 1.0 28.6 1.6 100.0% +16.0%
src_n_0565
Before After
Mean StdDev Mean StdDev Confidence Change
L1 157.4 1.0 983.1 38.5 100.0% +524.6%
L2 93.6 0.5 696.0 14.3 100.0% +643.4%
M 92.7 0.4 680.5 1.0 100.0% +634.0%
HT 68.3 0.9 160.3 6.6 100.0% +134.6%
VT 61.1 0.8 130.1 3.4 100.0% +112.9%
R 61.0 0.8 125.4 4.1 100.0% +105.7%
RT 24.9 1.3 29.5 1.5 100.0% +18.2%
src_n_8
Before After
Mean StdDev Mean StdDev Confidence Change
L1 154.7 1.0 1324.4 48.5 100.0% +756.3%
L2 92.4 0.4 1178.4 10.9 100.0% +1175.6%
M 92.9 0.4 1275.7 2.1 100.0% +1273.5%
HT 68.2 1.0 169.8 5.5 100.0% +149.0%
VT 61.2 1.0 138.5 3.6 100.0% +126.3%
R 61.3 0.9 130.1 3.8 100.0% +112.4%
RT 25.5 1.3 29.2 1.9 100.0% +14.6%
Ben Avison [Mon, 28 Jan 2013 17:03:50 +0000 (17:03 +0000)]
ARMv6: Lay the groundwork for later patches in the series
Move the entire contents of pixman-arm-simd-asm.S to a new file;
ultimately this will only retain the scaled operations, so it is
named pixman-arm-simd-asm-scaled.S. Added new header file
pixman-arm-simd-asm.h, containing the macros which are the basis of
all the new ARMv6 implementations, although at this point in the
series, nothing uses them and the library should be binary-identical.
Søren Sandmann Pedersen [Sat, 26 Jan 2013 05:34:53 +0000 (00:34 -0500)]
demo/scale: Add a spin button to set the number of subsample bits
For large upscalings the level of subsampling for the filter has a
quite visible effect, so make it settable in the UI so that people can
experiment with various values.
Siarhei Siamashka [Sat, 15 Dec 2012 05:18:53 +0000 (07:18 +0200)]
Use pixman_transform_point_31_16() from pixman_transform_point()
Old functions pixman_transform_point() and pixman_transform_point_3d()
now become just wrappers for pixman_transform_point_31_16() and
pixman_transform_point_31_16_3d(). Eventually their uses should be
completely eliminated in the pixman code and replaced with their
extended range counterparts. This is needed in order to be able
to correctly handle any matrices and parameters that may come
to pixman from the code responsible for XRender implementation.
Siarhei Siamashka [Sat, 15 Dec 2012 04:19:21 +0000 (06:19 +0200)]
test: Added matrix-test for testing projective transform accuracy
This test uses __float128 data type when it is available
for implementing a "perfect" reference implementation. The
output from pixman_transform_point_31_16() and
pixman_transform_point_31_16_affine() is compared with the
reference implementation to make sure that the rounding
errors may only show up in a single least significant bit.
The platforms and compilers, which do not support __float128
data type, can rely on crc32 checksum for the pseudorandom
transform results.
Siarhei Siamashka [Wed, 12 Dec 2012 00:41:55 +0000 (02:41 +0200)]
configure.ac: Added detection for __float128 support
GCC supports 128-bit floating point data type on some platforms (including
but not limited to x86 and x86-64). This may be useful for tests, which
need perfectly accurate reference implementations of certain algorithms.
Siarhei Siamashka [Fri, 14 Dec 2012 16:43:57 +0000 (18:43 +0200)]
Add higher precision "pixman_transform_point_*" functions
The following new functions are added:
pixman_transform_point_31_16_3d() -
Calculates the product of a matrix and a vector.
pixman_transform_point_31_16() -
Calculates the product of a matrix and a vector.
Then converts the resulting homogeneous vector [x, y, z] to the
cartesian [x', y', 1] variant, where x' = x / z and y' = y / z.
pixman_transform_point_31_16_affine() -
A faster sibling of the other two functions, which assumes affine
transformation, where the bottom row of the matrix is [0, 0, 1] and
the last element of the input vector is set to 1.
These functions transform a point with 31.16 fixed point coordinates from
the destination space to a point with 48.16 fixed point coordinates in
the source space.
The results are accurate and the rounding errors may only show up in
the least significant bit. No overflows are possible for the affine
transformations as long as the input data is provided in 31.16 format.
In the case of projective transformations, some output values may not be
representable in the 48.16 fixed point format. In this case the results
are clamped to return maximum or minimum 48.16 values (so that the caller
can at least handle NONE and PAD repeats correctly).
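The homogeneous division and the clamping can be sketched as follows.
This is only an illustration in double precision with made-up names
(point_t, clamp_to_limit, homogeneous_to_cartesian); the functions
described above work in fixed point, and the sketch does not say how
they treat z == 0.

```c
#include <assert.h>

typedef struct
{
    double x, y;
} point_t; /* hypothetical type for this sketch */

/* Clamp v to the range [-limit, limit] */
static double
clamp_to_limit (double v, double limit)
{
    if (v > limit)
        return limit;
    if (v < -limit)
        return -limit;
    return v;
}

/* Convert [x, y, z] to [x', y', 1] with x' = x / z and y' = y / z,
 * clamping results that fall outside the representable range. */
static point_t
homogeneous_to_cartesian (double x, double y, double z, double limit)
{
    point_t p;

    assert (z != 0.0 && limit > 0.0); /* restriction of this sketch */

    p.x = clamp_to_limit (x / z, limit);
    p.y = clamp_to_limit (y / z, limit);
    return p;
}
```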
Siarhei Siamashka [Mon, 3 Dec 2012 15:42:21 +0000 (17:42 +0200)]
Faster fetch for the C variant of r5g6b5 src/dest iterator
Processing two pixels at once is used to reduce the number of
arithmetic operations.
The speedup relative to the generic fetch_scanline_r5g6b5() from
"pixman-access.c" (pixman was compiled with gcc 4.7.2):
MIPS 74K 480MHz : 20.32 MPix/s -> 26.47 MPix/s
ARM11 700MHz : 34.95 MPix/s -> 38.22 MPix/s
ARM Cortex-A8 1000MHz : 87.44 MPix/s -> 100.92 MPix/s
ARM Cortex-A9 1700MHz : 150.95 MPix/s -> 158.13 MPix/s
ARM Cortex-A15 1700MHz : 148.91 MPix/s -> 155.42 MPix/s
IBM Cell PPU 3200MHz : 75.29 MPix/s -> 98.33 MPix/s
Intel Core i7 2800MHz : 257.02 MPix/s -> 376.93 MPix/s
That's the performance for C code (SIMD and assembly optimizations
are disabled via PIXMAN_DISABLE environment variable).
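The idea of handling two pixels at once can be sketched like this: two
r5g6b5 pixels are packed into one 32 bit word, so that each shift and
mask operates on both of them. The function names are made up for the
example, and the code is a sketch of the technique, not necessarily
the exact code added by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference conversion of one r5g6b5 pixel to a8r8g8b8 */
static uint32_t
convert_0565_to_8888 (uint16_t s)
{
    uint32_t r = (s >> 11) & 0x1f, g = (s >> 5) & 0x3f, b = s & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);
    return 0xff000000 | (r << 16) | (g << 8) | b;
}

/* Convert two pixels with one set of shifts and masks. p0 sits in
 * the low half of the packed word and p1 in the high half. */
static void
convert_two_0565_to_8888 (uint16_t p0, uint16_t p1, uint32_t out[2])
{
    uint32_t s = (uint32_t) p0 | ((uint32_t) p1 << 16);
    uint32_t sr, sg, sb;

    assert (out != NULL);

    sr = (s >> 8) & 0x00F800F8; /* both reds,   top 5 bits of a byte */
    sb = (s << 3) & 0x00F800F8; /* both blues,  top 5 bits of a byte */
    sg = (s >> 3) & 0x00FC00FC; /* both greens, top 6 bits of a byte */
    sr |= sr >> 5;              /* bit replication; stray bits that  */
    sb |= sb >> 5;              /* cross into the other half are     */
    sg |= sg >> 6;              /* masked off or shifted out below   */

    out[0] = ((sr << 16) & 0x00FF0000) | ((sg << 8) & 0x0000FF00) |
             (sb & 0xFF) | 0xFF000000;
    out[1] = (sr & 0x00FF0000) | ((sg >> 8) & 0x0000FF00) |
             (sb >> 16) | 0xFF000000;
}
```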
Siarhei Siamashka [Mon, 3 Dec 2012 15:07:31 +0000 (17:07 +0200)]
Faster write-back for the C variant of r5g6b5 dest iterator
Unrolling loops improves performance, so just use it here.
Also GCC can't properly optimize this code for RISC processors and
allocate 0x1F001F constant in a register. Because this constant is
too large to be represented as an immediate operand in instructions,
GCC inserts some redundant arithmetic. This problem can be worked around
by explicitly using a variable for 0x1F001F constant and also initializing
it by a read from another volatile variable. In this case GCC is forced
to allocate a register for it, because it is not seen as a constant anymore.
The speedup relative to the generic store_scanline_r5g6b5() from
"pixman-access.c" (pixman was compiled with gcc 4.7.2):
MIPS 74K 480MHz : 33.22 MPix/s -> 43.42 MPix/s
ARM11 700MHz : 50.16 MPix/s -> 78.23 MPix/s
ARM Cortex-A8 1000MHz : 117.75 MPix/s -> 196.34 MPix/s
ARM Cortex-A9 1700MHz : 177.04 MPix/s -> 320.32 MPix/s
ARM Cortex-A15 1700MHz : 231.44 MPix/s -> 261.64 MPix/s
IBM Cell PPU 3200MHz : 130.25 MPix/s -> 145.61 MPix/s
Intel Core i7 2800MHz : 502.21 MPix/s -> 721.73 MPix/s
That's the performance for C code (SIMD and assembly optimizations
are disabled via PIXMAN_DISABLE environment variable).
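The workaround with the volatile variable can be sketched as follows.
The names are assumptions made for this example and the loop is not
unrolled, so this only illustrates the technique described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reading the mask through a volatile object prevents the compiler
 * from treating it as a compile-time constant. */
static const volatile uint32_t volatile_x1F001F = 0x1F001F;

static uint16_t
convert_8888_to_0565 (uint32_t s, uint32_t x1F001F)
{
    uint32_t a, b;

    a = (s >> 3) & x1F001F; /* r5 in bits 16-20, b5 in bits 0-4 */
    b = s & 0xFC00;         /* g6 in bits 10-15                 */
    a |= a >> 5;            /* copy r5 down to bits 11-15       */
    a |= b >> 5;            /* move g6 to bits 5-10             */
    return (uint16_t) a;    /* the truncation drops bits 16-20  */
}

static void
store_scanline_0565 (uint16_t *dst, const uint32_t *src, size_t n)
{
    /* A single volatile read; after this the mask is an ordinary
     * local variable that can stay in a register for the whole loop. */
    const uint32_t x1F001F = volatile_x1F001F;
    size_t i;

    assert (dst != NULL && src != NULL);

    for (i = 0; i < n; i++)
        dst[i] = convert_8888_to_0565 (src[i], x1F001F);
}
```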
Siarhei Siamashka [Mon, 3 Dec 2012 04:32:46 +0000 (06:32 +0200)]
Added C variants of r5g6b5 fetch/write-back iterators
Adding specialized iterators for r5g6b5 color format allows us to work
on fine tuning performance of r5g6b5 fetch/write-back operations in the
pixman general "fetch -> combine -> store" pipeline.
These iterators also make "src_x888_0565" fast path redundant, so it can
be removed.
Chris Wilson [Wed, 23 Jan 2013 10:27:22 +0000 (10:27 +0000)]
Eliminate duplicate copies of channel flags for pixman_image_composite32()
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Sat, 12 Jan 2013 16:52:47 +0000 (16:52 +0000)]
Always return a valid function from lookup_combiner()
We should always have at least a C combiner available, so we never
expect the search to fail. If it does, emit an error and return a
dummy function.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Sat, 12 Jan 2013 08:28:32 +0000 (08:28 +0000)]
Always return a valid function from lookup_composite()
We never expect to fail to find the appropriate function as the
general_composite_rect should always match. So if somehow we fall through
the search, emit a _pixman_log_error() and return a dummy function.
Note that we remove some conditionals and a level of indentation hence a
large amount of code movement. This also reveals that in a few places we
are duplicating stack variables that can be eliminated later.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 8 Jan 2013 18:39:03 +0000 (18:39 +0000)]
sse2: Add fast paths for bilinear source with a solid mask
Based on the existing sse2_8888_n_8888 nearest scaling routines.
fishbowl on an i5-2500: 60.9s -> 56.9s
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 1 Jan 2013 19:41:54 +0000 (19:41 +0000)]
sse2: Add a fast path for add_n_8_8888
This path is being exercised by compositing of trapezoids for clipmasks, for
instance as used in the firefox-asteroids cairo-trace.
IVB i7-3720qm ./tests/lowlevel-blt-bench add_n_8_8888:
reference memcpy speed = 14846.7MB/s (3711.7MP/s for 32bpp fills)
before: L1: 681.10 L2: 735.14 M:701.44 ( 28.35%) HT:283.32 VT:213.23 R:208.93 RT: 77.89 ( 793Kops/s)
after: L1: 992.91 L2:1017.33 M:982.58 ( 39.88%) HT:458.93 VT:332.32 R:326.13 RT:136.66 (1287Kops/s)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Chris Wilson [Tue, 1 Jan 2013 19:41:54 +0000 (19:41 +0000)]
sse2: Add a fast path for add_n_8888
This path is being exercised by inplace compositing of trapezoids, for
instance as used in the firefox-asteroids cairo-trace.
IVB i7-3720qm ./tests/lowlevel-blt-bench add_n_8888:
reference memcpy speed = 14918.3MB/s (3729.6MP/s for 32bpp fills)
before: L1:1752.44 L2:2259.48 M:2215.73 ( 58.80%) HT:589.49 VT:404.04 R:424.69 RT:134.68 (1182Kops/s)
after: L1:3931.21 L2:6132.78 M:3440.17 ( 92.24%) HT:1337.70 VT:1357.64 R:1270.27 RT:359.78 (2161Kops/s)
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Jeff Muizelaar [Thu, 24 Jan 2013 19:49:41 +0000 (14:49 -0500)]
Add a version of bilinear_interpolation for precision <=4
Having 4 or fewer bits means we can do two components at
a time in a single 32 bit register.
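A sketch of the technique, assuming 4 bit weights in the range 0..15
and a8r8g8b8 pixels. The function name is made up for the example.
With 4 bit weights the four factors sum to 256, so each 8 bit channel
times its weight fits in 16 bits and two channels can share one
32 bit word.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
bilinear_interpolation_4bit (uint32_t tl, uint32_t tr,
                             uint32_t bl, uint32_t br,
                             int distx, int disty)
{
    uint32_t w_tl, w_tr, w_bl, w_br;
    uint32_t lo, hi;

    assert (distx >= 0 && distx < 16 && disty >= 0 && disty < 16);

    /* The four weights always add up to 16 * 16 = 256 */
    w_tl = (uint32_t) ((16 - distx) * (16 - disty));
    w_tr = (uint32_t) (distx * (16 - disty));
    w_bl = (uint32_t) ((16 - distx) * disty);
    w_br = (uint32_t) (distx * disty);

    /* Red and blue share one word, alpha and green the other */
    lo  = (tl & 0xff00ff) * w_tl;
    hi  = ((tl >> 8) & 0xff00ff) * w_tl;
    lo += (tr & 0xff00ff) * w_tr;
    hi += ((tr >> 8) & 0xff00ff) * w_tr;
    lo += (bl & 0xff00ff) * w_bl;
    hi += ((bl >> 8) & 0xff00ff) * w_bl;
    lo += (br & 0xff00ff) * w_br;
    hi += ((br >> 8) & 0xff00ff) * w_br;

    /* Dividing by 256 means keeping the upper byte of each half */
    return ((lo >> 8) & 0xff00ff) | (hi & 0xff00ff00);
}
```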
Here are the results for firefox-fishtank on a Pandaboard with
4.6.3 and PIXMAN_DISABLE="arm-neon"
Before:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image t-firefox-fishtank 7.841 7.910 0.70% 6/6
After:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image t-firefox-fishtank 6.951 6.995 1.11% 6/6
Ben Avison [Sat, 19 Jan 2013 16:36:22 +0000 (16:36 +0000)]
Tweaks to lowlevel-blt-bench
This adds two extra tests, src_n_8 and src_8_8, which I have been
using to benchmark my ARMv6 changes.
I'd also like to propose that it requires an exact test name as the
executable's argument, as achieved by this strstr to strcmp change.
Without this, it is impossible to only benchmark (for example)
add_8_8, add_n_8 or src_n_8, due to those also being substrings of
many other test names.
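The difference between the two matching rules can be shown with a
small sketch. The function names are hypothetical and are not taken
from lowlevel-blt-bench.c.

```c
#include <assert.h>
#include <string.h>

/* Old rule: a test is selected if the argument occurs anywhere in
 * its name. */
static int
selected_by_substring (const char *test_name, const char *arg)
{
    assert (test_name != NULL && arg != NULL);
    return strstr (test_name, arg) != NULL;
}

/* New rule: a test is selected only if the names are identical */
static int
selected_by_exact_name (const char *test_name, const char *arg)
{
    assert (test_name != NULL && arg != NULL);
    return strcmp (test_name, arg) == 0;
}
```

With the argument "src_n_8", the substring rule also selects
"src_n_8888", while the exact rule selects only "src_n_8".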
Søren Sandmann Pedersen [Sat, 19 Jan 2013 17:29:48 +0000 (12:29 -0500)]
test: Use operator_name() and format_name() in composite.c
With the operator_name() and format_name() functions there is no
longer any reason for composite.c to have its own table of format and
operator names.
Søren Sandmann Pedersen [Sat, 19 Jan 2013 14:36:50 +0000 (09:36 -0500)]
utils.[ch]: Add new format_name() function
This function returns the name of the given format code, which is
useful for printing out debug information. The function is written as
a switch without a default value so that the compiler will warn if new
formats are added in the future. The fake formats used in the fast
path tables are also recognized.
The function is used in alpha_map.c, where it replaces an existing
format_name() function, and in blitters-test.c, affine-test.c, and
scaling-test.c.
Søren Sandmann Pedersen [Sat, 19 Jan 2013 13:55:27 +0000 (08:55 -0500)]
test/utils.[ch]: Add new function operator_name()
This function returns the name of the given operator, which is useful
for printing out debug information. The function is done as a switch
without a default value so that the compiler will warn if new
operators are added in the future.
The function is used in affine-test.c, scaling-test.c, and
blitters-test.c.
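The pattern of a switch without a default case can be sketched as
follows. The enum demo_op_t and the function demo_operator_name() are
hypothetical stand-ins; the real function covers the pixman
operators.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical operator enum used only for this sketch */
typedef enum
{
    DEMO_OP_CLEAR,
    DEMO_OP_SRC,
    DEMO_OP_OVER
} demo_op_t;

static const char *
demo_operator_name (demo_op_t op)
{
    /* There is deliberately no default case: with GCC's -Wswitch
     * (enabled by -Wall), adding an enumerator without adding a case
     * here produces a warning. */
    switch (op)
    {
    case DEMO_OP_CLEAR: return "CLEAR";
    case DEMO_OP_SRC:   return "SRC";
    case DEMO_OP_OVER:  return "OVER";
    }

    /* Only reached if op holds a value outside the enum */
    assert (!"invalid operator");
    return "<unknown operator>";
}
```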
Søren Sandmann Pedersen [Sat, 12 Jan 2013 13:03:35 +0000 (08:03 -0500)]
README: Add guidelines on how to contribute patches
Ben Avison pointed out here:
http://lists.freedesktop.org/archives/pixman/2013-January/002485.html
that there isn't really any documentation about how to submit patches
to pixman. This patch adds some information to the README file.
v2: Incorporate some comments from Ben Avison
v3: Change gitweb URL to cgit