Andrea Canciani [Sun, 4 Sep 2011 19:56:20 +0000 (21:56 +0200)]
build-win32: Add root Makefile.win32
Add Makefile.win32 to the pixman root. This makefile can recursively
run the other ones to compile the library or the test suite.
Andrea Canciani [Sun, 4 Sep 2011 16:00:38 +0000 (18:00 +0200)]
build-win32: Share targets and variables across win32 makefiles
The win32 build system repeatedly defines some basic variables
(notably program names and flags) and C sources compilation rules.
They can be factored out to a common Makefile, to be included in every
other Makefile.win32.
Andrea Canciani [Sun, 4 Sep 2011 18:07:42 +0000 (20:07 +0200)]
build: Reuse test sources
Makefile.am and Makefile.win32 should not duplicate content, as this
leads to breaking the build when they are not kept in sync.
This can be avoided by listing sources, headers and common build
variables/rules in a Makefile.sources file.
In order to further simplify the test makefiles, the utility functions
are now in a static library, which gets linked to all the tests and
benchmarks.
Andrea Canciani [Sun, 4 Sep 2011 16:41:41 +0000 (09:41 -0700)]
build: Reuse sources and pixman-combine build rules
Makefile.am and Makefile.win32 should not duplicate content, as this
leads to breaking the build when they are not kept in sync.
This can be avoided by listing sources, headers and common build
variables/rules in a Makefile.sources file.
Andrea Canciani [Sun, 4 Sep 2011 18:07:57 +0000 (20:07 +0200)]
test: Fix compilation on win32
Adding scaling-helpers-test to the testsuite on win32 makes MSVC
complain about int64_t being used as an expression:
scaling-helpers-test.c(27) : error C2275: 'int64_t' : illegal use of
this type as an expression
Søren Sandmann Pedersen [Sun, 11 Sep 2011 23:44:06 +0000 (19:44 -0400)]
Use pkg-config to determine the flags to use with libpng
Previously we would unconditionally link with -lpng leading to build
failures on systems without libpng.
Søren Sandmann Pedersen [Tue, 22 Feb 2011 10:20:36 +0000 (05:20 -0500)]
test: New function to save a pixman image to .png
When debugging it is often very useful to be able to save an image as
a png file. This commit adds a function "write_png()" that does that.
If libpng is not available, then the function becomes a noop.
Søren Sandmann Pedersen [Sat, 10 Sep 2011 03:59:20 +0000 (23:59 -0400)]
Post-release version bump to 0.23.5
Søren Sandmann Pedersen [Sat, 10 Sep 2011 03:51:11 +0000 (23:51 -0400)]
Pre-release version bump to 0.23.4
Chris Wilson [Mon, 22 Aug 2011 14:29:25 +0000 (15:29 +0100)]
bits: optimise fetching width==1 repeats
Profiling ign.com, 20% of the entire render time was absorbed in this
single operation:
<< /content //COLOR_ALPHA /width 480 /height 800 >> surface context
<< /width 1 /height 677 /format //ARGB32 /source <|!!!@jGb!m5gD']#$jFHGWtZcK&2i)Up=!TuR9`G<8;ZQp[FQk;emL9ibhbEL&NTh-j63LhHo$E=mSG,0p71`cRJHcget4%<S\X+~> >> image pattern
//EXTEND_REPEAT set-extend
set-source
n 0 0 480 677 rectangle
fill+
pop
which is a simple composition of a single pixel wide image. Sadly this
is a workaround for lack of independent repeat-x/y handling in cairo and
pixman. Worse still is that the worst-case behaviour of the general repeat
path is for width 1 images...
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
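A minimal sketch of the idea, not the actual pixman code: when a repeating image is one pixel wide, every pixel fetched from a given row equals that row's single pixel, so the scanline can be filled directly instead of going through the general per-pixel repeat path. The function name and the one-uint32_t-per-row layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch n pixels of row y from a repeating
 * a8r8g8b8 image that is exactly one pixel wide. bits holds one
 * uint32_t per row; height is the number of rows. */
static void
fetch_width1_repeat (const uint32_t *bits, int height, int y,
                     uint32_t *out, size_t n)
{
    int row;
    uint32_t pixel;
    size_t i;

    assert (height > 0);

    /* Wrap y into [0, height), also for negative coordinates. */
    row = y % height;
    if (row < 0)
        row += height;

    /* The whole scanline is this one pixel repeated. */
    pixel = bits[row];
    for (i = 0; i < n; i++)
        out[i] = pixel;
}
```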
Taekyun Kim [Fri, 19 Aug 2011 12:20:08 +0000 (21:20 +0900)]
ARM: NEON better instruction scheduling of over_n_8888
New head, tail, tail/head blocks are added and instructions
are reordered to eliminate pipeline stalls
Performance numbers of before/after
- cortex a8 -
before : L1: 375.39 L2: 391.93 M:114.39 ( 40.99%) HT: 99.37 VT: 98.20 R: 90.24 RT: 32.87 ( 240Kops/s)
after : L1: 481.90 L2: 483.46 M:114.29 ( 40.69%) HT:106.91 VT: 93.38 R: 90.74 RT: 29.51 ( 236Kops/s)
- cortex a9 -
before : L1: 324.50 L2: 332.79 M:155.55 ( 47.51%) HT:111.93 VT: 93.58 R: 71.92 RT: 28.21 ( 233Kops/s)
after : L1: 355.87 L2: 364.49 M:156.90 ( 47.59%) HT:111.52 VT: 91.76 R: 72.16 RT: 28.22 ( 234Kops/s)
Taekyun Kim [Tue, 23 Aug 2011 06:00:11 +0000 (15:00 +0900)]
ARM: NEON better instruction scheduling of over_n_8_8888
tail/head block is expanded and reordered to eliminate stalls
Performance numbers of before/after
- cortex a8 -
before : L1: 201.35 L2: 190.48 M:101.94 ( 54.85%) HT: 78.41 VT: 63.83 R: 58.25 RT: 21.74 ( 191Kops/s)
after : L1: 257.65 L2: 255.49 M:102.04 ( 55.33%) HT: 79.19 VT: 65.46 R: 59.23 RT: 21.12 ( 189Kops/s)
- cortex a9 -
before : L1: 157.35 L2: 159.81 M:133.00 ( 60.94%) HT: 82.44 VT: 63.64 R: 51.66 RT: 19.15 ( 179Kops/s)
after : L1: 216.83 L2: 219.40 M:135.83 ( 61.80%) HT: 85.60 VT: 64.80 R: 52.23 RT: 19.16 ( 179Kops/s)
Andrea Canciani [Sat, 13 Aug 2011 14:18:17 +0000 (16:18 +0200)]
Workaround bug in llvm-gcc
llvm-gcc (shipped in Apple XCode 4.1.1 as the default compiler or in
the 2.9 release of LLVM) performs an invalid optimization which
unifies the empty_region and the bad_region structures because they
have the same content.
A bug report has been filed against the Apple Developer Tools for this
issue. This commit works around this bug by making one of the two
structures volatile, so that it cannot be merged.
Fixes region-contains-test.
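A minimal sketch of the kind of workaround described, assuming two sentinel structures that are told apart only by address. The type and variable names are hypothetical, and which of the two structures the commit actually marks volatile is not stated above.

```c
#include <assert.h>

/* Hypothetical stand-in for the region data header. */
typedef struct
{
    long size;
    long num_rects;
} region_data_t;

/* Two sentinels with identical contents; callers distinguish them
 * by address only. */
static const region_data_t empty_data = { 0, 0 };

/* volatile is meant to stop an optimizer from folding this object
 * into empty_data just because the contents match. */
static const volatile region_data_t broken_data = { 0, 0 };

static int
is_broken (const volatile region_data_t *data)
{
    return data == &broken_data;
}
```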
Andrea Canciani [Wed, 29 Jun 2011 12:14:38 +0000 (14:14 +0200)]
win32: Build benchmarks
Add the makefile rules needed to compile lowlevel-blt-bench on win32
and fix the compilation errors.
Søren Sandmann Pedersen [Fri, 11 Mar 2011 22:09:34 +0000 (17:09 -0500)]
Move bilinear interpolation to pixman-inlines.h
Søren Sandmann Pedersen [Fri, 11 Mar 2011 21:09:21 +0000 (16:09 -0500)]
Use repeat() function from pixman-inlines.h in pixman-bits-image.c
The repeat() functionality was duplicated between pixman-bits-image.c
and pixman-inlines.h
Søren Sandmann Pedersen [Fri, 11 Mar 2011 21:07:24 +0000 (16:07 -0500)]
Rename pixman-fast-path.h to pixman-inlines.h
It is not really specific to pixman-fast-path.c.
Søren Sandmann Pedersen [Thu, 11 Aug 2011 10:30:43 +0000 (06:30 -0400)]
In pixman_image_create_bits() allow images larger than 2GB
There is no reason for pixman_image_create_bits() to check that the
image size fits in int32_t. The correct check is against size_t since
that is what the argument to calloc() is.
This patch fixes this by adding a new _pixman_multiply_overflows_size()
and using it in create_bits(). Also prepend an underscore to the names
of other similar functions since they are internal to pixman.
V2: Use int, not ssize_t for the arguments in create_bits() since
width/height are still limited to 32 bits, as pointed out by Chris
Wilson.
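A sketch of an overflow check against size_t, in the spirit of the _pixman_multiply_overflows_size() helper named above. The body is an assumption written for illustration, not the function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: returns nonzero if a * b does not fit in size_t.
 * Dividing instead of multiplying avoids overflowing while testing. */
static int
multiply_overflows_size (size_t a, size_t b)
{
    return a != 0 && b > SIZE_MAX / a;
}
```

A caller would run this on the row stride and height before passing their product to calloc().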
Søren Sandmann Pedersen [Mon, 8 Aug 2011 14:18:07 +0000 (10:18 -0400)]
Don't include stdint.h in lowlevel-blt-bench.c
Some systems don't have the file, and the types are already defined in
pixman.h.
https://bugs.freedesktop.org//show_bug.cgi?id=37422
Søren Sandmann Pedersen [Tue, 2 Aug 2011 07:03:48 +0000 (03:03 -0400)]
Use find_box_for_y() in pixman_region_contains_point() too
The same binary search from the previous commit can be used in this
function too.
V2: Remove check from loop that is not needed anymore, pointed out by
Andrea Canciani.
Søren Sandmann Pedersen [Tue, 2 Aug 2011 02:32:09 +0000 (22:32 -0400)]
Speed up pixman_region{,32}_contains_rectangle()
When someone selects some text in Firefox under a non-composited X
server and initiates a drag, a shaped window is created with a complex
shape corresponding to the outline of the text. Then, on every mouse
movement pixman_region_contains_rectangle() is called many times on
that complicated region. And pixman_region_contains_rectangle() is
doing a linear scan through the rectangles in the region, although the
scan does exit when it finds the first box that can't possibly
intersect the passed-in rectangle.
This patch changes the loop so that it uses a binary search to skip
boxes that don't overlap the current y position. The performance
improvement for the text dragging case is easily noticeable.
V2: Use the binary search for the "getting up to speed or skipping
remainder of band" as well.
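The binary search can be sketched as follows. This is an illustrative version, assuming the boxes are stored in band order so that y2 never decreases along the array; the box type and the signature are assumptions, only the name find_box_for_y() comes from this log.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    int x1, y1, x2, y2;
} box_t;

/* Return the first box in [begin, end) whose y2 is greater than y,
 * or end if no such box exists. */
static const box_t *
find_box_for_y (const box_t *begin, const box_t *end, int y)
{
    assert (begin <= end);

    while (begin < end)
    {
        const box_t *mid = begin + (end - begin) / 2;

        if (mid->y2 > y)
            end = mid;          /* mid may be the answer; look left */
        else
            begin = mid + 1;    /* mid ends at or above y; skip it */
    }

    return begin;
}
```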
Søren Sandmann Pedersen [Tue, 2 Aug 2011 05:32:15 +0000 (01:32 -0400)]
New test of pixman_region_contains_{rectangle,point}
This test generates random regions and checks whether random boxes and
points are contained within them. The results are combined and a CRC32
value is computed and compared to a known-correct one.
Søren Sandmann Pedersen [Wed, 3 Aug 2011 22:38:20 +0000 (18:38 -0400)]
Fix lcg_rand_u32() to return 32 random bits.
The lcg_rand() function only returns 15 random bits, so lcg_rand_u32()
would always have 0 in bit 31 and bit 15. Fix that by calling
lcg_rand() three times, to generate 15, 15, and 2 random bits
respectively.
V2: Use the 10/11 most significant bits from the 3 lcg results and mix
them with the low ones from the adjacent one, as suggested by Andrea
Canciani.
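A sketch of the first version of the fix (15 + 15 + 2 bits from three calls). The LCG constants are the textbook ANSI C example generator and are an assumption here; the V2 bit mixing is not shown.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t lcg_seed;

/* Returns 15 random bits per call (constants assumed, see above). */
static uint32_t
lcg_rand (void)
{
    lcg_seed = lcg_seed * 1103515245u + 12345u;
    return (lcg_seed / 65536u) % 32768u;
}

/* Combine three 15-bit results so that all 32 bits can be set. */
static uint32_t
lcg_rand_u32 (void)
{
    uint32_t lo  = lcg_rand ();         /* bits 0..14 */
    uint32_t mid = lcg_rand () << 15;   /* bits 15..29 */
    uint32_t hi  = lcg_rand () << 30;   /* bits 30..31, rest shifts out */

    return hi | mid | lo;
}
```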
Taekyun Kim [Thu, 4 Aug 2011 13:21:04 +0000 (22:21 +0900)]
ARM NEON: Standard fast path out_reverse_8_8888
This fast path is frequently used by cairo to do polygon rendering.
Existing NEON code generation framework is used.
Andrea Canciani [Mon, 18 Jul 2011 06:15:23 +0000 (08:15 +0200)]
radial: Fix typos and trailing whitespace
Correct a typo reported by James Cloos and some reported by automatic
spellchecking.
Remove trailing whitespace.
Siarhei Siamashka [Fri, 22 Jul 2011 21:27:34 +0000 (00:27 +0300)]
ARM: workaround binutils bug #12931 (code sections alignment)
More details in binutils bugtracker:
http://sourceware.org/bugzilla/show_bug.cgi?id=12931
The problem was encountered in the wild by Mozilla:
https://bugzilla.mozilla.org/show_bug.cgi?id=672787
Siarhei Siamashka [Fri, 15 Jul 2011 20:35:21 +0000 (23:35 +0300)]
C fast path for scaled src_x888_8888 with nearest filter
The necessity is justified by a message in the pixman mailing list:
http://lists.freedesktop.org/archives/pixman/2011-July/001330.html
NONE repeat is not supported, but could be added by tweaking
the interpretation and making use of 'fully_transparent_src'
scanline function argument.
Andrea Canciani [Fri, 15 Jul 2011 20:02:01 +0000 (22:02 +0200)]
radial: Improve documentation and naming
Add a comment to explain why the tests guarantee that the code always
computes the greatest valid root.
Rename "det" as "discr" to make it match the mathematical name
"discriminant".
Based on a patch by Jeff Muizelaar <jmuizelaar@mozilla.com>.
Søren Sandmann Pedersen [Mon, 4 Jul 2011 19:55:52 +0000 (15:55 -0400)]
Makefile.am: Add pixman@lists.freedesktop.org to RELEASE_ANNOUNCE_LIST
Søren Sandmann Pedersen [Mon, 4 Jul 2011 19:35:17 +0000 (15:35 -0400)]
Post-release version bump to 0.23.3
Søren Sandmann Pedersen [Mon, 4 Jul 2011 12:13:19 +0000 (08:13 -0400)]
Pre-release version bump to 0.23.2
Taekyun Kim [Mon, 13 Jun 2011 10:53:49 +0000 (19:53 +0900)]
Bilinear REPEAT_NORMAL source line extension for too short src_width
To avoid function call and other calculation overhead, extend source
scanline into temporary buffer when source width is too small.
Temporary buffer will be repeatedly accessed, so extension cost is
very small due to cache effect.
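A rough sketch of the extension step, assuming a caller-provided temporary buffer. The function name, the parameters and the "whole copies only" policy are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fill buf (capacity buf_len pixels) with whole copies of the source
 * row until at least min_width pixels are present or no further copy
 * fits. Returns the extended width, a multiple of src_width. */
static int
extend_repeat_normal (const uint32_t *src, int src_width,
                      uint32_t *buf, int buf_len, int min_width)
{
    int w = 0;
    int i;

    assert (src_width > 0);

    while (w < min_width && w + src_width <= buf_len)
    {
        for (i = 0; i < src_width; i++)
            buf[w + i] = src[i];
        w += src_width;
    }

    return w;
}
```

The scanline function can then read from the extended buffer as if it were a wider source image, which needs far fewer wrap-arounds per destination scanline.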
Taekyun Kim [Wed, 8 Jun 2011 08:17:42 +0000 (17:17 +0900)]
Enable REPEAT_NORMAL bilinear fast path entries
Taekyun Kim [Wed, 8 Jun 2011 08:14:29 +0000 (17:14 +0900)]
ARM: Add REPEAT_NORMAL functions to bilinear BIND macros
The bilinear template now supports REPEAT_NORMAL, so the corresponding
functions are added to the PIXMAN_ARM_BIND_SCALED_BILINEAR_ macros. Fast path
entries are not enabled yet.
Taekyun Kim [Wed, 8 Jun 2011 08:11:24 +0000 (17:11 +0900)]
sse2: Declare bilinear src_8888_8888 REPEAT_NORMAL composite function
The bilinear template now supports REPEAT_NORMAL, so declare composite
functions that use it. The functions are only declared, not used yet.
Taekyun Kim [Wed, 8 Jun 2011 06:58:01 +0000 (15:58 +0900)]
REPEAT_NORMAL support for bilinear fast path template
The basic idea is to break a normal repeat down into a set of
non-repeat scanline compositions and stitch them together.
Bilinear interpolation may need to blend the last and first pixels of the
source scanline. In this case, we can use a temporary wrap-around buffer.
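The segment splitting can be sketched like this for the simplest possible case. It is an illustration only: it ignores scaling and the bilinear wrap-around buffer, uses memcpy() as a stand-in for a non-repeat scanline composite, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill dst_width pixels from a REPEAT_NORMAL source row by issuing
 * one non-repeat copy per segment. */
static void
copy_repeat_normal (uint32_t *dst, int dst_width,
                    const uint32_t *src, int src_width, int src_x)
{
    assert (src_width > 0);

    /* Wrap the starting coordinate into [0, src_width). */
    src_x %= src_width;
    if (src_x < 0)
        src_x += src_width;

    while (dst_width > 0)
    {
        /* Longest run that stays inside the source row. */
        int n = src_width - src_x;
        if (n > dst_width)
            n = dst_width;

        /* Stand-in for a non-repeat scanline composite. */
        memcpy (dst, src + src_x, (size_t) n * sizeof (*dst));

        dst += n;
        dst_width -= n;
        src_x = 0;      /* later segments start at the left edge */
    }
}
```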
Taekyun Kim [Wed, 8 Jun 2011 06:37:31 +0000 (15:37 +0900)]
Replace boolean arguments with flags for bilinear fast path template
By replacing boolean arguments with flags, the code can be more
readable and flags can be extended to do some more things later.
Currently the following flags are defined:
FLAG_NONE
- No flags are turned on.
FLAG_HAVE_SOLID_MASK
- Template will generate solid mask composite functions.
FLAG_HAVE_NON_SOLID_MASK
- Template will generate bits mask composite functions.
FLAG_HAVE_SOLID_MASK and FLAG_HAVE_NON_SOLID_MASK should be mutually
exclusive.
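As an illustration, the flags could be declared and checked as below. The flag names come from the list above; the numeric values, the type name and the validity helper are assumptions.

```c
#include <assert.h>

typedef enum
{
    FLAG_NONE                = 0,
    FLAG_HAVE_SOLID_MASK     = (1 << 1),    /* value assumed */
    FLAG_HAVE_NON_SOLID_MASK = (1 << 2)     /* value assumed */
} template_flags_t;                         /* hypothetical type name */

/* The two mask flags are mutually exclusive, so a flag word with
 * both bits set is rejected. */
static int
template_flags_are_valid (int flags)
{
    int both = FLAG_HAVE_SOLID_MASK | FLAG_HAVE_NON_SOLID_MASK;

    return (flags & both) != both;
}
```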
Søren Sandmann [Sat, 25 Jun 2011 14:16:25 +0000 (10:16 -0400)]
test: Make fuzzer-find-diff.pl executable
Søren Sandmann [Mon, 20 Jun 2011 00:29:08 +0000 (20:29 -0400)]
ARM: Fix two bugs in neon_composite_over_n_8888_0565_ca().
The first bug is that a vmull.u8 instruction would store its result in
the q1 register, clobbering the d2 register used later on. The second
is that a vraddhn instruction would overwrite d25, corrupting the q12
register used later.
Fixing the second bug caused a pipeline bubble where the d18 register
would be unavailable for a clock cycle. This is fixed by swapping the
instruction with its successor.
Søren Sandmann Pedersen [Sun, 19 Jun 2011 23:10:45 +0000 (19:10 -0400)]
blitters-test: Make common formats more likely to be tested.
Move the eight most common formats to the top of the list of image
formats and make create_random_image() much more likely to select one
of those eight formats.
This should help catch more bugs in SIMD optimized operations.
Andrea Canciani [Fri, 10 Jun 2011 06:56:10 +0000 (08:56 +0200)]
Silence autoconf warnings
Autoconf 2.68 reports:
warning: AC_LANG_CONFTEST: no AC_LANG_SOURCE call detected in body
Every code fragment must be wrapped in [AC_LANG_SOURCE([...])]
Søren Sandmann Pedersen [Fri, 25 Mar 2011 19:09:17 +0000 (15:09 -0400)]
Replace arguments to composite functions with a pointer to a struct
This allows more information, such as flags or the composite region,
to be passed to the composite functions.
Søren Sandmann Pedersen [Fri, 25 Mar 2011 18:20:43 +0000 (14:20 -0400)]
In pixman-general.c rename image_parameters to {src, mask, dest}_image
All the fast paths generally use these names as well.
Søren Sandmann Pedersen [Fri, 25 Mar 2011 18:17:08 +0000 (14:17 -0400)]
Replace instances of "dst_*" with "dest_*"
The variables in question were dst_x, dst_y, dst_image. The majority
of _x and _y uses were already dest_x and dest_y, while the majority
of _image uses were dst_image.
Søren Sandmann [Sat, 28 May 2011 16:32:35 +0000 (12:32 -0400)]
demos: Comment out some unused variables
Søren Sandmann [Sat, 28 May 2011 15:56:32 +0000 (11:56 -0400)]
sse2: Delete some unused variables
Søren Sandmann [Sat, 28 May 2011 15:51:31 +0000 (11:51 -0400)]
mmx: Delete some unused variables
Andrea Canciani [Mon, 23 May 2011 10:08:54 +0000 (12:08 +0200)]
Include noop in win32 builds
Nis Martensen [Mon, 2 May 2011 19:43:58 +0000 (21:43 +0200)]
Fix a few typos in pixman-combine.c.template
Some equations have too much multiplication with alpha.
Søren Sandmann Pedersen [Sat, 23 Apr 2011 14:26:49 +0000 (10:26 -0400)]
Move NOP src iterator into noop implementation.
The iterator for sources where neither RGB nor ALPHA is needed, really
belongs in the noop implementation.
Søren Sandmann Pedersen [Sat, 23 Apr 2011 14:24:41 +0000 (10:24 -0400)]
Move NULL iterator into pixman-noop.c
Iterating a NULL image returns NULL for all scanlines. We may as well
do this in the noop iterator.
Søren Sandmann Pedersen [Wed, 9 Feb 2011 04:42:36 +0000 (23:42 -0500)]
Add a noop src iterator
When the image is a8r8g8b8 and not transformed, and the fetched
rectangle is within the image bounds, scanlines can be fetched by
simply returning a pointer instead of copying the bits.
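A simplified sketch of such an iterator. The structures and function names are hypothetical and much smaller than pixman's real image and iterator types; the format and transform checks are left out and only the bounds check is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t *bits;     /* a8r8g8b8 pixels */
    int width, height;
    int stride;         /* row stride in uint32_t units */
} image_t;

typedef struct
{
    const image_t *image;
    int x, y;           /* position of the next scanline */
} noop_iter_t;

/* Returns 1 if the rectangle lies inside the image, so scanlines can
 * be returned in place; 0 means a copying iterator is needed. */
static int
noop_iter_init (noop_iter_t *iter, const image_t *image,
                int x, int y, int width, int height)
{
    assert (iter != NULL && image != NULL);

    if (x < 0 || y < 0 || width < 0 || height < 0 ||
        width > image->width - x || height > image->height - y)
        return 0;

    iter->image = image;
    iter->x = x;
    iter->y = y;
    return 1;
}

/* No copying: hand out a pointer into the image itself. The caller
 * asks for at most `height` scanlines. */
static uint32_t *
noop_get_scanline (noop_iter_t *iter)
{
    const image_t *image = iter->image;

    return image->bits + (size_t) iter->y++ * image->stride + iter->x;
}
```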
Søren Sandmann Pedersen [Mon, 24 Jan 2011 17:16:03 +0000 (12:16 -0500)]
Move noop dest fetching to noop implementation
It will at some point become useful to have CPU specific destination
iterators. However, a problem with that, is that such iterators should
not be used if we can composite directly in the destination image.
By moving the noop destination iterator to the noop implementation, we
can ensure that it will be chosen before any CPU specific iterator.
Søren Sandmann Pedersen [Mon, 24 Jan 2011 16:35:27 +0000 (11:35 -0500)]
Add a noop composite function for the DST operator
The DST operator doesn't actually do anything, so add a noop "fast
path" for it, instead of checking in pixman_image_composite32().
The performance tradeoff here is that we get rid of a test for DST in
the common case where the operator is not DST, in return for an extra
walk over the clip rectangles in the uncommon case where the operator
actually is DST.
Søren Sandmann Pedersen [Mon, 24 Jan 2011 16:31:49 +0000 (11:31 -0500)]
Add a "noop" implementation.
This new implementation is ahead of all other implementations in the
fallback chain and is supposed to contain operations that are "noops",
i.e., they don't require any work. For example, it might contain a
"fast path" for the DST operator that doesn't actually do anything or
an iterator for a8r8g8b8 that just returns a pointer into the image.
Andrea Canciani [Thu, 5 May 2011 08:17:08 +0000 (10:17 +0200)]
test: Fix compilation on win32
MSVC complains about uint32_t being used as an expression:
composite.c(902) : error C2275: 'uint32_t' : illegal use of this type
as an expression
Dave Yeo [Mon, 9 May 2011 10:38:44 +0000 (12:38 +0200)]
Check for working mmap()
OS/2 doesn't have a working mmap().
Søren Sandmann Pedersen [Mon, 2 May 2011 09:11:49 +0000 (05:11 -0400)]
Post-release version bump to 0.23.1
Søren Sandmann Pedersen [Mon, 2 May 2011 09:06:33 +0000 (05:06 -0400)]
Pre-release version bump to 0.22.0
Søren Sandmann Pedersen [Tue, 19 Apr 2011 04:22:29 +0000 (00:22 -0400)]
Post-release version bump to 0.21.9
Søren Sandmann Pedersen [Tue, 19 Apr 2011 04:00:37 +0000 (00:00 -0400)]
Pre-release version bump to 0.21.8
Taekyun Kim [Wed, 13 Apr 2011 02:57:35 +0000 (11:57 +0900)]
ARM: Enable bilinear fast paths using scanline functions in pixman-arm-neon-asm-bilinear.S
Enable the fast paths that are supported by the scanline functions in
pixman-arm-neon-asm-bilinear.S
Taekyun Kim [Wed, 13 Apr 2011 02:48:40 +0000 (11:48 +0900)]
ARM: NEON scanline functions for bilinear scaling
General fetch->combine->store based bilinear scanline functions.
They need further optimization and will eventually be replaced with optimal
functions one by one.
General functions should be located in pixman-arm-neon-asm-bilinear.S and
optimal functions in pixman-arm-neon-asm.S
The following general bilinear scanline functions are implemented:
over_8888_8888
add_8888_8888
src_8888_8_8888
src_8888_8_0565
src_0565_8_x888
src_0565_8_0565
over_8888_8_8888
add_8888_8_8888
Taekyun Kim [Wed, 13 Apr 2011 02:43:44 +0000 (11:43 +0900)]
ARM: Common macro for scaled bilinear scanline function with A8 mask
Defining PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_A8_DST macro for declaration of
scaled bilinear scanline functions in common header.
Søren Sandmann Pedersen [Fri, 11 Mar 2011 12:52:57 +0000 (07:52 -0500)]
Offset rendering in pixman_composite_trapezoids() by (x_dst, y_dst)
Previously, this function would do coordinate calculations in such a
way that (x_dst, y_dst) would only affect the alignment of the source
image, but not of the traps, which would always be considered to be in
absolute destination coordinates. This is unlike the
pixman_image_composite() function which also registers the mask to the
destination.
This patch makes it so that traps are also offset by (x_dst, y_dst).
Also add a comment explaining how this function is supposed to
operate, and update tri-test.c and composite-trap-test.c to deal with
the new semantics.
Søren Sandmann Pedersen [Sun, 3 Apr 2011 03:24:48 +0000 (23:24 -0400)]
ARM: Add 'neon_composite_over_n_8888_0565_ca' fast path
This improves the performance of the firefox-talos-gfx benchmark with
the image16 backend. Benchmark on an 800 MHz ARM Cortex A8:
Before:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image16 firefox-talos-gfx 121.773 122.218 0.15% 6/6
After:
[ # ] backend test min(s) median(s) stddev. count
[ 0] image16 firefox-talos-gfx 85.247 85.563 0.22% 6/6
V2: Slightly better instruction scheduling based on comments from Taekyun Kim.
V3: Eliminate all stalls from the inner loop. Also based on comments from Taekyun Kim.
Gilles Espinasse [Tue, 12 Apr 2011 20:44:56 +0000 (22:44 +0200)]
Fix OpenMP not supported case
PIXMAN_LINK_WITH_ENV did not fail unless -Wall -Werror is used.
So even when the compiler did not support OpenMP, USE_OPENMP was defined.
Fix that by running the second OpenMP test only when the first AC_OPENMP check finds support.
configure was tested in the following cases:
gcc without libgomp support: no openmp option, --enable-openmp and --disable-openmp
gcc with libgomp support: no openmp option, --enable-openmp and --disable-openmp
Not tested with autoconf versions that do not know about OpenMP (<2.62).
Warn when --enable-openmp is requested but no support is found.
Signed-off-by: Gilles Espinasse <g.esp@free.fr>
Gilles Espinasse [Tue, 12 Apr 2011 20:44:25 +0000 (22:44 +0200)]
Fix missing AC_MSG_RESULT value from Werror test
Use the correct variable name
Signed-off-by: Gilles Espinasse <g.esp@free.fr>
Siarhei Siamashka [Mon, 21 Mar 2011 18:25:27 +0000 (20:25 +0200)]
ARM: pipelined NEON implementation of bilinear scaled 'src_8888_0565'
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=33.59 MPix/s
after: op=1, src=20028888, dst=10020565, speed=46.25 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=63.86 MPix/s
after: op=1, src=20028888, dst=10020565, speed=84.22 MPix/s
Siarhei Siamashka [Wed, 16 Mar 2011 15:24:49 +0000 (17:24 +0200)]
ARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'
Performance of the inner loop when working with the data in L1 cache:
ARM Cortex-A8: 41 cycles per 4 pixels (no stalls and partial dual issue)
ARM Cortex-A9: 48 cycles per 4 pixels (no stalls)
It might be still possible to improve performance even more on ARM Cortex-A8
with a better use of dual issue.
Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=40.38 MPix/s
after: op=1, src=20028888, dst=20028888, speed=48.47 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=79.68 MPix/s
after: op=1, src=20028888, dst=20028888, speed=93.11 MPix/s
Siarhei Siamashka [Thu, 17 Mar 2011 17:42:01 +0000 (19:42 +0200)]
ARM: support different levels of loop unrolling in bilinear scaler
Now an extra 'flag' parameter is supported in bilinear scanline scaling
function generation macro. It can be used to enable 4 or 8 pixels per
loop iteration unrolling and provide save/restore code for d8-d15
registers.
Siarhei Siamashka [Mon, 21 Mar 2011 16:41:53 +0000 (18:41 +0200)]
ARM: use less ARM instructions in NEON bilinear scaling code
This reduces code size and also puts less pressure on the
instruction decoder.
Siarhei Siamashka [Wed, 16 Mar 2011 14:33:41 +0000 (16:33 +0200)]
ARM: support for software pipelining in bilinear macros
Now it's possible to override the main loop of bilinear scaling code
with optimized pipelined implementation.
Siarhei Siamashka [Thu, 10 Mar 2011 14:12:23 +0000 (16:12 +0200)]
ARM: use aligned memory writes in NEON bilinear scaling code
Siarhei Siamashka [Thu, 10 Mar 2011 13:34:10 +0000 (15:34 +0200)]
ARM: tweaked horizontal weights update in NEON bilinear scaling code
Moving the horizontal interpolation weights update instructions from the
beginning of the loop to its end makes it possible to hide some pipeline
stalls and improve performance.
Søren Sandmann Pedersen [Mon, 4 Apr 2011 00:32:30 +0000 (20:32 -0400)]
ARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head
Instead of two
mvn d24, d24
mvn d25, d25
use just one
mvn q12, q12
Also move another vmvn instruction into the created pipeline bubble,
as pointed out by Siarhei.
Søren Sandmann Pedersen [Sat, 2 Apr 2011 18:12:12 +0000 (14:12 -0400)]
Makefile.am: Put development releases in "snapshots" directory
Up until now, all pixman releases, both snapshots and stable releases, were
uploaded to the "releases" directory on www.cairographics.org, but
it's better to put development snapshots in the "snapshots" directory.
This patch changes Makefile.am to do that.
Søren Sandmann Pedersen [Tue, 22 Mar 2011 17:42:05 +0000 (13:42 -0400)]
test: Fix infinite loop in composite
When run in PIXMAN_RANDOMIZE_TESTS mode, this test would go into an
infinite loop because the loop started at 'seed' but the stop
condition was still N_TESTS.
Alexandros Frantzis [Fri, 18 Mar 2011 12:37:27 +0000 (14:37 +0200)]
Add support for the r8g8b8a8 and r8g8b8x8 formats to the tests.
Alexandros Frantzis [Fri, 18 Mar 2011 12:36:15 +0000 (14:36 +0200)]
Add simple support for the r8g8b8a8 and r8g8b8x8 formats.
This format is particularly useful on big-endian architectures, where RGBA in
memory/file order corresponds to r8g8b8a8 as a uint32_t. This is important
because RGBA is in some cases the only available choice (for example as a pixel
format in OpenGL ES 2.0).
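A small sketch of why the two layouts coincide on big-endian machines. The helper names are hypothetical; store_be32() writes the bytes in big-endian order explicitly so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four channels as an r8g8b8a8 pixel: R in the top byte. */
static uint32_t
pack_r8g8b8a8 (uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t) r << 24) | ((uint32_t) g << 16) |
           ((uint32_t) b << 8)  |  (uint32_t) a;
}

/* Store a 32-bit value the way a big-endian CPU lays it out in
 * memory: most significant byte first. */
static void
store_be32 (uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t) (v >> 24);
    out[1] = (uint8_t) (v >> 16);
    out[2] = (uint8_t) (v >> 8);
    out[3] = (uint8_t) v;
}
```

Storing the packed pixel this way yields the bytes R, G, B, A in that order, which is the RGBA memory/file order mentioned above.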
Søren Sandmann Pedersen [Mon, 14 Mar 2011 18:56:22 +0000 (14:56 -0400)]
test: Randomize some tests if PIXMAN_RANDOMIZE_TESTS is set
This patch makes it so that composite and stress-test will start from a
random seed if the PIXMAN_RANDOMIZE_TESTS environment variable is
set. Running the test suite in this mode is useful to get more test
coverage.
Also, in stress-test.c make it so that setting the initial seed causes
threads to be turned off. This makes it much easier to see when
something fails.
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:42:58 +0000 (19:42 -0500)]
Simplify the prototype for iterator initializers.
All of the information previously passed to the iterator initializers
is now available in the iterator itself, so there is no need to pass
it as arguments anymore.
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:12:35 +0000 (19:12 -0500)]
Fill out parts of iters in _pixman_implementation_{src,dest}_iter_init()
This makes _pixman_implementation_{src,dest}_iter_init() responsible
for filling parts of the information in the iterators. Specifically,
the information passed as arguments is stored in the iterator.
Also add a height field to pixman_iter_t().
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:06:02 +0000 (19:06 -0500)]
In delegate_{src,dest}_iter_init() call delegate directly.
There is no reason to go through
_pixman_implementation_{src,dest}_iter_init(), especially since
_pixman_implementation_src_iter_init() is doing various other checks
that only need to be done once.
Also call delegate->src_iter_init() directly in pixman-sse2.c
Siarhei Siamashka [Wed, 9 Mar 2011 11:55:48 +0000 (13:55 +0200)]
ARM: a bit faster NEON bilinear scaling for r5g6b5 source images
Instructions scheduling improved in the code responsible for fetching r5g6b5
pixels and converting them to the intermediate x8r8g8b8 color format used in
the interpolation part of code. Still a lot of NEON stalls are remaining,
which can be resolved later by the use of pipelining.
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
after: op=1, src=10020565, dst=10020565, speed=41.35 MPix/s
op=1, src=10020565, dst=20020888, speed=49.16 MPix/s
Siarhei Siamashka [Wed, 9 Mar 2011 11:27:41 +0000 (13:27 +0200)]
ARM: NEON optimization for bilinear scaled 'src_0565_0565'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=3.30 MPix/s
after: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
Siarhei Siamashka [Wed, 9 Mar 2011 11:21:53 +0000 (13:21 +0200)]
ARM: NEON optimization for bilinear scaled 'src_0565_x888'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=20020888, speed=3.39 MPix/s
after: op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
Siarhei Siamashka [Wed, 9 Mar 2011 09:53:04 +0000 (11:53 +0200)]
ARM: NEON optimization for bilinear scaled 'src_8888_0565'
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=10020565, speed=6.56 MPix/s
after: op=1, src=20028888, dst=10020565, speed=61.65 MPix/s
Siarhei Siamashka [Wed, 9 Mar 2011 09:46:48 +0000 (11:46 +0200)]
ARM: use common macro template for bilinear scaled 'src_8888_8888'
This is a cleanup for old and now duplicated code. The performance improvement
is mostly coming from the enabled use of software prefetch, but instructions
scheduling is also slightly better.
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=53.24 MPix/s
after: op=1, src=20028888, dst=20028888, speed=74.36 MPix/s
Siarhei Siamashka [Wed, 9 Mar 2011 09:34:15 +0000 (11:34 +0200)]
ARM: NEON: common macro template for bilinear scanline scalers
This makes it possible to generate bilinear scanline scaling functions targeting
various source and destination color formats. Right now a8r8g8b8/x8r8g8b8
and r5g6b5 color formats are supported. More formats can be added if needed.
Siarhei Siamashka [Wed, 9 Mar 2011 08:59:46 +0000 (10:59 +0200)]
ARM: new bilinear fast path template macro in 'pixman-arm-common.h'
It can be reused in different ARM NEON bilinear scaling fast path functions.
Siarhei Siamashka [Sun, 6 Mar 2011 20:16:32 +0000 (22:16 +0200)]
ARM: assembly optimized nearest scaled 'src_8888_8888'
Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=44.36 MPix/s
after: op=1, src=20028888, dst=20028888, speed=39.79 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=102.36 MPix/s
after: op=1, src=20028888, dst=20028888, speed=163.12 MPix/s
Siarhei Siamashka [Mon, 7 Mar 2011 01:10:43 +0000 (03:10 +0200)]
ARM: common macro for nearest scaling fast paths
The code of nearest scaled 'src_0565_0565' function was generalized
and moved to a common macro, so that it can be reused for other
fast paths.
Siarhei Siamashka [Sun, 6 Mar 2011 14:17:12 +0000 (16:17 +0200)]
ARM: use prefetch in nearest scaled 'src_0565_0565'
Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=75.02 MPix/s
after: op=1, src=10020565, dst=10020565, speed=73.63 MPix/s
Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=10020565, dst=10020565, speed=176.12 MPix/s
after: op=1, src=10020565, dst=10020565, speed=267.50 MPix/s
Søren Sandmann Pedersen [Fri, 4 Mar 2011 20:51:18 +0000 (15:51 -0500)]
test: Do endian swapping of the source and destination images.
Otherwise the test fails on big endian. Fix for bug 34767, reported by
Siarhei Siamashka.
Søren Sandmann Pedersen [Mon, 7 Mar 2011 18:45:54 +0000 (13:45 -0500)]
test: In image_endian_swap() use pixman_image_get_format() to get the bpp.
There is no reason to pass in the bpp as an argument; it can be gotten
directly from the image.
Siarhei Siamashka [Tue, 22 Feb 2011 16:45:03 +0000 (18:45 +0200)]
ARM: NEON optimization for bilinear scaled 'src_8888_8888'
Initial NEON optimization for bilinear scaling. Can be probably
improved more.
Benchmark on ARM Cortex-A8:
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=6.70 MPix/s
after: op=1, src=20028888, dst=20028888, speed=44.27 MPix/s
Siarhei Siamashka [Mon, 21 Feb 2011 18:18:02 +0000 (20:18 +0200)]
SSE2 optimization for bilinear scaled 'src_8888_8888'
A primitive naive implementation of bilinear scaling using SSE2 intrinsics,
which only handles one pixel at a time. It is approximately 2x faster than
pixman general compositing path. Single pass processing without intermediate
temporary buffer contributes to ~15% and loop unrolling contributes to ~20%
of this speedup.
Benchmark on Intel Core i7 (x86-64):
Using cairo-perf-trace:
before: image firefox-planet-gnome 12.566 12.610 0.23% 6/6
after: image firefox-planet-gnome 10.961 11.013 0.19% 5/6
Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
before: op=1, src=20028888, dst=20028888, speed=70.48 MPix/s
after: op=1, src=20028888, dst=20028888, speed=165.38 MPix/s
Siarhei Siamashka [Mon, 21 Feb 2011 00:07:09 +0000 (02:07 +0200)]
test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
Individual correctness check for the new bilinear scaling related
supplementary function. This test program uses a bit wider range
of input arguments, not covered by other tests.
Siarhei Siamashka [Sun, 20 Feb 2011 23:29:02 +0000 (01:29 +0200)]
Main loop template for fast single pass bilinear scaling
Can be used for implementing SIMD optimized fast path
functions which work with bilinear scaled source images.
Similar to the template for nearest scaling main loop, the
following types of mask are supported:
1. no mask
2. non-scaled a8 mask with SAMPLES_COVER_CLIP flag
3. solid mask
PAD repeat is fully supported. NONE repeat is partially
supported (right now only works if source image has alpha
channel or when alpha channel of the source image does not
have any effect on the compositing operation).