profile/ivi/pixman.git
13 years agosse2: Delete some unused variables
Søren Sandmann [Sat, 28 May 2011 15:56:32 +0000 (11:56 -0400)]
sse2: Delete some unused variables

13 years agommx: Delete some unused variables
Søren Sandmann [Sat, 28 May 2011 15:51:31 +0000 (11:51 -0400)]
mmx: Delete some unused variables

13 years agoInclude noop in win32 builds
Andrea Canciani [Mon, 23 May 2011 10:08:54 +0000 (12:08 +0200)]
Include noop in win32 builds

13 years agoFix a few typos in pixman-combine.c.template
Nis Martensen [Mon, 2 May 2011 19:43:58 +0000 (21:43 +0200)]
Fix a few typos in pixman-combine.c.template

Some equations have too much multiplication with alpha.

13 years agoMove NOP src iterator into noop implementation.
Søren Sandmann Pedersen [Sat, 23 Apr 2011 14:26:49 +0000 (10:26 -0400)]
Move NOP src iterator into noop implementation.

The iterator for sources where neither RGB nor ALPHA is needed really
belongs in the noop implementation.

13 years agoMove NULL iterator into pixman-noop.c
Søren Sandmann Pedersen [Sat, 23 Apr 2011 14:24:41 +0000 (10:24 -0400)]
Move NULL iterator into pixman-noop.c

Iterating a NULL image returns NULL for all scanlines. We may as well
do this in the noop iterator.

13 years agoAdd a noop src iterator
Søren Sandmann Pedersen [Wed, 9 Feb 2011 04:42:36 +0000 (23:42 -0500)]
Add a noop src iterator

When the image is a8r8g8b8 and not transformed, and the fetched
rectangle is within the image bounds, scanlines can be fetched by
simply returning a pointer instead of copying the bits.
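
A minimal sketch of the idea described above, not pixman's actual code: when the fetched rectangle is inside an untransformed a8r8g8b8 image, a scanline "fetch" can hand back a pointer into the image bits. The `sketch_*` types and functions are hypothetical simplifications introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for pixman's image and iterator types. */
typedef struct {
    uint32_t *bits;   /* a8r8g8b8 pixels */
    int       width;
    int       height;
    int       stride; /* row stride in uint32_t units */
} sketch_image_t;

typedef struct {
    const sketch_image_t *image;
    int x, y, width;  /* rectangle being fetched; y advances per scanline */
} sketch_iter_t;

/* Returns nonzero if the rectangle (x, y, w, h) lies fully inside the image. */
static int sketch_rect_in_bounds(const sketch_image_t *img,
                                 int x, int y, int w, int h)
{
    return x >= 0 && y >= 0 && w >= 0 && h >= 0 &&
           x <= img->width - w && y <= img->height - h;
}

/* "Fetches" the next scanline without copying: just a pointer into bits. */
static uint32_t *sketch_noop_get_scanline(sketch_iter_t *iter)
{
    const sketch_image_t *img = iter->image;
    uint32_t *line = img->bits + (ptrdiff_t)iter->y * img->stride + iter->x;

    iter->y++;
    return line;
}
```

The caller is assumed to check `sketch_rect_in_bounds()` once, when the iterator is set up, and to fall back to a copying iterator otherwise.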

13 years agoMove noop dest fetching to noop implementation
Søren Sandmann Pedersen [Mon, 24 Jan 2011 17:16:03 +0000 (12:16 -0500)]
Move noop dest fetching to noop implementation

It will at some point become useful to have CPU specific destination
iterators. A problem with that, however, is that such iterators should
not be used if we can composite directly in the destination image.

By moving the noop destination iterator to the noop implementation, we
can ensure that it will be chosen before any CPU specific iterator.

13 years agoAdd a noop composite function for the DST operator
Søren Sandmann Pedersen [Mon, 24 Jan 2011 16:35:27 +0000 (11:35 -0500)]
Add a noop composite function for the DST operator

The DST operator doesn't actually do anything, so add a noop "fast
path" for it, instead of checking in pixman_image_composite32().

The performance tradeoff here is that we get rid of a test for DST in
the common case where the operator is not DST, in return for an extra
walk over the clip rectangles in the uncommon case where the operator
actually is DST.

13 years agoAdd a "noop" implementation.
Søren Sandmann Pedersen [Mon, 24 Jan 2011 16:31:49 +0000 (11:31 -0500)]
Add a "noop" implementation.

This new implementation is ahead of all other implementations in the
fallback chain and is supposed to contain operations that are "noops",
i.e., they don't require any work. For example, it might contain a
"fast path" for the DST operator that doesn't actually do anything or
an iterator for a8r8g8b8 that just returns a pointer into the image.
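
The fallback chain can be pictured roughly as follows. This is a sketch under assumptions: the `sketch_*` names and the `can_handle` callback are invented for illustration and do not mirror pixman's real `pixman_implementation_t`.

```c
#include <stddef.h>

typedef enum { SKETCH_OP_SRC, SKETCH_OP_OVER, SKETCH_OP_DST } sketch_op_t;

typedef struct sketch_impl sketch_impl_t;
struct sketch_impl {
    const char    *name;
    sketch_impl_t *delegate;                 /* next implementation in the chain */
    int          (*can_handle)(sketch_op_t); /* nonzero if a path exists here */
};

/* The noop implementation only claims operations that require no work. */
static int noop_can_handle(sketch_op_t op)    { return op == SKETCH_OP_DST; }

/* The general implementation at the end of the chain handles everything. */
static int general_can_handle(sketch_op_t op) { (void)op; return 1; }

/* Walks the chain from the top and returns the first implementation
 * that claims the operation, or NULL if none does. */
static const sketch_impl_t *sketch_lookup(const sketch_impl_t *top,
                                          sketch_op_t op)
{
    for (const sketch_impl_t *i = top; i != NULL; i = i->delegate)
        if (i->can_handle(op))
            return i;
    return NULL;
}
```

Because the noop implementation sits at the head of the chain, it is consulted before any CPU specific implementation, which is the ordering the commit message relies on.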

13 years agotest: Fix compilation on win32
Andrea Canciani [Thu, 5 May 2011 08:17:08 +0000 (10:17 +0200)]
test: Fix compilation on win32

MSVC complains about uint32_t being used as an expression:

composite.c(902) : error C2275: 'uint32_t' : illegal use of this type
as an expression

13 years agoCheck for working mmap()
Dave Yeo [Mon, 9 May 2011 10:38:44 +0000 (12:38 +0200)]
Check for working mmap()

OS/2 doesn't have a working mmap().

13 years agoPost-release version bump to 0.23.1
Søren Sandmann Pedersen [Mon, 2 May 2011 09:11:49 +0000 (05:11 -0400)]
Post-release version bump to 0.23.1

13 years agoPre-release version bump to 0.22.0
Søren Sandmann Pedersen [Mon, 2 May 2011 09:06:33 +0000 (05:06 -0400)]
Pre-release version bump to 0.22.0

13 years agoPost-release version bump to 0.21.9
Søren Sandmann Pedersen [Tue, 19 Apr 2011 04:22:29 +0000 (00:22 -0400)]
Post-release version bump to 0.21.9

13 years agoPre-release version bump to 0.21.8
Søren Sandmann Pedersen [Tue, 19 Apr 2011 04:00:37 +0000 (00:00 -0400)]
Pre-release version bump to 0.21.8

13 years agoARM: Enable bilinear fast paths using scanline functions in pixman-arm-neon-asm-bilin...
Taekyun Kim [Wed, 13 Apr 2011 02:57:35 +0000 (11:57 +0900)]
ARM: Enable bilinear fast paths using scanline functions in pixman-arm-neon-asm-bilinear.S

Enable the fast paths that are supported by the scanline functions in
pixman-arm-neon-asm-bilinear.S

13 years agoARM: NEON scanline functions for bilinear scaling
Taekyun Kim [Wed, 13 Apr 2011 02:48:40 +0000 (11:48 +0900)]
ARM: NEON scanline functions for bilinear scaling

General fetch->combine->store based bilinear scanline functions.
They need further optimization and will eventually be replaced with
optimal functions one by one.
General functions should be located in pixman-arm-neon-asm-bilinear.S and
optimal functions in pixman-arm-neon-asm.S.

The following general bilinear scanline functions are implemented:
    over_8888_8888
    add_8888_8888
    src_8888_8_8888
    src_8888_8_0565
    src_0565_8_x888
    src_0565_8_0565
    over_8888_8_8888
    add_8888_8_8888

13 years agoARM: Common macro for scaled bilinear scanline function with A8 mask
Taekyun Kim [Wed, 13 Apr 2011 02:43:44 +0000 (11:43 +0900)]
ARM: Common macro for scaled bilinear scanline function with A8 mask

Define the PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_A8_DST macro for declaring
scaled bilinear scanline functions in the common header.

13 years agoOffset rendering in pixman_composite_trapezoids() by (x_dst, y_dst)
Søren Sandmann Pedersen [Fri, 11 Mar 2011 12:52:57 +0000 (07:52 -0500)]
Offset rendering in pixman_composite_trapezoids() by (x_dst, y_dst)

Previously, this function would do coordinate calculations in such a
way that (x_dst, y_dst) would only affect the alignment of the source
image, but not of the traps, which would always be considered to be in
absolute destination coordinates. This is unlike the
pixman_image_composite() function which also registers the mask to the
destination.

This patch makes it so that traps are also offset by (x_dst, y_dst).

Also add a comment explaining how this function is supposed to
operate, and update tri-test.c and composite-trap-test.c to deal with
the new semantics.

13 years agoARM: Add 'neon_composite_over_n_8888_0565_ca' fast path
Søren Sandmann Pedersen [Sun, 3 Apr 2011 03:24:48 +0000 (23:24 -0400)]
ARM: Add 'neon_composite_over_n_8888_0565_ca' fast path

This improves the performance of the firefox-talos-gfx benchmark with
the image16 backend. Benchmark on an 800 MHz ARM Cortex A8:

Before:

[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]  image16            firefox-talos-gfx  121.773  122.218   0.15%    6/6

After:

[ # ]  backend                         test   min(s) median(s) stddev. count
[  0]  image16            firefox-talos-gfx   85.247   85.563   0.22%    6/6

V2: Slightly better instruction scheduling based on comments from Taekyun Kim.
V3: Eliminate all stalls from the inner loop. Also based on comments from Taekyun Kim.

13 years agoFix OpenMP not supported case
Gilles Espinasse [Tue, 12 Apr 2011 20:44:56 +0000 (22:44 +0200)]
Fix OpenMP not supported case

PIXMAN_LINK_WITH_ENV did not fail unless -Wall -Werror was used, so
USE_OPENMP was defined even when the compiler did not support OpenMP.
Fix that by running the second OpenMP test only when the first
AC_OPENMP check finds support.

configure was tested in the following cases:
gcc without libgomp support: no openmp option, --enable-openmp and --disable-openmp
gcc with libgomp support: no openmp option, --enable-openmp and --disable-openmp

Not tested with autoconf versions that do not know about OpenMP (<2.62).

Warn when --enable-openmp is requested but no support is found.

Signed-off-by: Gilles Espinasse <g.esp@free.fr>

13 years agoFix missing AC_MSG_RESULT value from Werror test
Gilles Espinasse [Tue, 12 Apr 2011 20:44:25 +0000 (22:44 +0200)]
Fix missing AC_MSG_RESULT value from Werror test

Use the correct variable name

Signed-off-by: Gilles Espinasse <g.esp@free.fr>

13 years agoARM: pipelined NEON implementation of bilinear scaled 'src_8888_0565'
Siarhei Siamashka [Mon, 21 Mar 2011 18:25:27 +0000 (20:25 +0200)]
ARM: pipelined NEON implementation of bilinear scaled 'src_8888_0565'

Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=33.59 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=46.25 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=63.86 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=84.22 MPix/s

13 years agoARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'
Siarhei Siamashka [Wed, 16 Mar 2011 15:24:49 +0000 (17:24 +0200)]
ARM: pipelined NEON implementation of bilinear scaled 'src_8888_8888'

Performance of the inner loop when working with the data in L1 cache:
    ARM Cortex-A8: 41 cycles per 4 pixels (no stalls and partial dual issue)
    ARM Cortex-A9: 48 cycles per 4 pixels (no stalls)

It might be still possible to improve performance even more on ARM Cortex-A8
with a better use of dual issue.

Benchmark on ARM Cortex-A8 r1p3 @600MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=40.38 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=48.47 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=79.68 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=93.11 MPix/s

13 years agoARM: support different levels of loop unrolling in bilinear scaler
Siarhei Siamashka [Thu, 17 Mar 2011 17:42:01 +0000 (19:42 +0200)]
ARM: support different levels of loop unrolling in bilinear scaler

An extra 'flag' parameter is now supported in the bilinear scanline
scaling function generation macro. It can be used to enable unrolling
to 4 or 8 pixels per loop iteration and to provide save/restore code
for the d8-d15 registers.

13 years agoARM: use less ARM instructions in NEON bilinear scaling code
Siarhei Siamashka [Mon, 21 Mar 2011 16:41:53 +0000 (18:41 +0200)]
ARM: use less ARM instructions in NEON bilinear scaling code

This reduces code size and also puts less pressure on the
instruction decoder.

13 years agoARM: support for software pipelining in bilinear macros
Siarhei Siamashka [Wed, 16 Mar 2011 14:33:41 +0000 (16:33 +0200)]
ARM: support for software pipelining in bilinear macros

Now it's possible to override the main loop of bilinear scaling code
with optimized pipelined implementation.

13 years agoARM: use aligned memory writes in NEON bilinear scaling code
Siarhei Siamashka [Thu, 10 Mar 2011 14:12:23 +0000 (16:12 +0200)]
ARM: use aligned memory writes in NEON bilinear scaling code

13 years agoARM: tweaked horizontal weights update in NEON bilinear scaling code
Siarhei Siamashka [Thu, 10 Mar 2011 13:34:10 +0000 (15:34 +0200)]
ARM: tweaked horizontal weights update in NEON bilinear scaling code

Moving the horizontal interpolation weights update instructions from the
beginning of the loop to its end makes it possible to hide some pipeline
stalls and improve performance.

13 years agoARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head
Søren Sandmann Pedersen [Mon, 4 Apr 2011 00:32:30 +0000 (20:32 -0400)]
ARM: Tiny improvement in over_n_8888_8888_ca_process_pixblock_head

Instead of two

mvn d24, d24
mvn d25, d25

use just one

mvn q12, q12

Also move another vmvn instruction into the created pipeline bubble,
as pointed out by Siarhei.

13 years agoMakefile.am: Put development releases in "snapshots" directory
Søren Sandmann Pedersen [Sat, 2 Apr 2011 18:12:12 +0000 (14:12 -0400)]
Makefile.am: Put development releases in "snapshots" directory

Up until now, all pixman releases, both snapshots and stable releases,
were uploaded to the "releases" directory on www.cairographics.org, but
it's better to put development snapshots in the "snapshots" directory.

This patch changes Makefile.am to do that.

13 years agotest: Fix infinite loop in composite
Søren Sandmann Pedersen [Tue, 22 Mar 2011 17:42:05 +0000 (13:42 -0400)]
test: Fix infinite loop in composite

When run in PIXMAN_RANDOMIZE_TESTS mode, this test would go into an
infinite loop because the loop started at 'seed' but the stop
condition was still N_TESTS.
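
One way to bound such a loop, shown only as a sketch (the actual fix in composite.c is not reproduced here, and `sketch_count_tests` is a hypothetical helper): run exactly `n_tests` iterations starting at `seed`, so the stop condition moves together with the start value.

```c
#include <stdint.h>

/* Counts how many test indices are visited when starting at 'seed'.
 * The stop condition is relative to the seed, so the loop always
 * terminates after n_tests iterations, even if seed + n_tests wraps
 * around (unsigned arithmetic wraps in a well-defined way in C). */
static uint32_t sketch_count_tests(uint32_t seed, uint32_t n_tests)
{
    uint32_t count = 0;

    for (uint32_t i = seed; i != seed + n_tests; i++)
        count++;

    return count;
}
```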

13 years agoAdd support for the r8g8b8a8 and r8g8b8x8 formats to the tests.
Alexandros Frantzis [Fri, 18 Mar 2011 12:37:27 +0000 (14:37 +0200)]
Add support for the r8g8b8a8 and r8g8b8x8 formats to the tests.

13 years agoAdd simple support for the r8g8b8a8 and r8g8b8x8 formats.
Alexandros Frantzis [Fri, 18 Mar 2011 12:36:15 +0000 (14:36 +0200)]
Add simple support for the r8g8b8a8 and r8g8b8x8 formats.

This format is particularly useful on big-endian architectures, where RGBA in
memory/file order corresponds to r8g8b8a8 as an uint32_t. This is important
because RGBA is in some cases the only available choice (for example as a pixel
format in OpenGL ES 2.0).
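
To illustrate the correspondence described above, the following sketch packs a pixel in r8g8b8a8 layout (red in the most significant byte) and compares it with four bytes in R, G, B, A memory order read as a big-endian 32-bit value. The helper names are hypothetical.

```c
#include <stdint.h>

/* Packs channels as r8g8b8a8: red in the top byte, alpha in the low byte. */
static uint32_t sketch_pack_r8g8b8a8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)r << 24) | ((uint32_t)g << 16) |
           ((uint32_t)b << 8)  |  (uint32_t)a;
}

/* Reads four bytes the way a big-endian CPU would load a uint32_t. */
static uint32_t sketch_load_be32(const uint8_t bytes[4])
{
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}
```

For bytes `{R, G, B, A}` in memory, `sketch_load_be32()` yields the same value as `sketch_pack_r8g8b8a8(R, G, B, A)`, which is why no byte swapping is needed on big-endian machines.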

13 years agotest: Randomize some tests if PIXMAN_RANDOMIZE_TESTS is set
Søren Sandmann Pedersen [Mon, 14 Mar 2011 18:56:22 +0000 (14:56 -0400)]
test: Randomize some tests if PIXMAN_RANDOMIZE_TESTS is set

This patch makes it so that composite and stress-test will start from a
random seed if the PIXMAN_RANDOMIZE_TESTS environment variable is
set. Running the test suite in this mode is useful to get more test
coverage.

Also, in stress-test.c make it so that setting the initial seed causes
threads to be turned off. This makes it much easier to see when
something fails.
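
A sketch of the seed selection, assuming only that the presence of the environment variable switches to a random seed; how the tests actually derive the random value is not stated in the commit message, so it is passed in as a parameter. `sketch_pick_seed` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* env_value is the result of getenv("PIXMAN_RANDOMIZE_TESTS").
 * If the variable is set (to anything), use the random seed;
 * otherwise keep the fixed, reproducible one. */
static uint32_t sketch_pick_seed(const char *env_value,
                                 uint32_t    fixed_seed,
                                 uint32_t    random_seed)
{
    return env_value != NULL ? random_seed : fixed_seed;
}
```

A caller might use it as `sketch_pick_seed(getenv("PIXMAN_RANDOMIZE_TESTS"), 1, (uint32_t)time(NULL))`, with `<stdlib.h>` and `<time.h>` included.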

13 years agoSimplify the prototype for iterator initializers.
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:42:58 +0000 (19:42 -0500)]
Simplify the prototype for iterator initializers.

All of the information previously passed to the iterator initializers
is now available in the iterator itself, so there is no need to pass
it as arguments anymore.

13 years agoFill out parts of iters in _pixman_implementation_{src,dest}_iter_init()
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:12:35 +0000 (19:12 -0500)]
Fill out parts of iters in _pixman_implementation_{src,dest}_iter_init()

This makes _pixman_implementation_{src,dest}_iter_init() responsible
for filling parts of the information in the iterators. Specifically,
the information passed as arguments is stored in the iterator.

Also add a height field to pixman_iter_t.

13 years agoIn delegate_{src,dest}_iter_init() call delegate directly.
Søren Sandmann Pedersen [Sun, 13 Mar 2011 00:06:02 +0000 (19:06 -0500)]
In delegate_{src,dest}_iter_init() call delegate directly.

There is no reason to go through
_pixman_implementation_{src,dest}_iter_init(), especially since
_pixman_implementation_src_iter_init() is doing various other checks
that only need to be done once.

Also call delegate->src_iter_init() directly in pixman-sse2.c

13 years agoARM: a bit faster NEON bilinear scaling for r5g6b5 source images
Siarhei Siamashka [Wed, 9 Mar 2011 11:55:48 +0000 (13:55 +0200)]
ARM: a bit faster NEON bilinear scaling for r5g6b5 source images

Instruction scheduling is improved in the code responsible for fetching
r5g6b5 pixels and converting them to the intermediate x8r8g8b8 color format
used in the interpolation part of the code. Many NEON stalls still remain,
which can be resolved later by the use of pipelining.

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=32.29 MPix/s
          op=1, src=10020565, dst=20020888, speed=36.82 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=41.35 MPix/s
          op=1, src=10020565, dst=20020888, speed=49.16 MPix/s
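
For reference, the r5g6b5 to x8r8g8b8 conversion mentioned in this commit can be written in scalar C as below. This is an illustrative sketch, not the NEON code; it expands each channel by replicating its high bits into the low bits so that full intensity maps to 0xff.

```c
#include <stdint.h>

/* Converts one r5g6b5 pixel to x8r8g8b8 (the unused top byte is left 0). */
static uint32_t sketch_0565_to_x888(uint16_t s)
{
    uint32_t r = (s >> 11) & 0x1f;
    uint32_t g = (s >> 5)  & 0x3f;
    uint32_t b =  s        & 0x1f;

    /* Expand 5/6/5 bits to 8 bits by bit replication. */
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);

    return (r << 16) | (g << 8) | b;
}
```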

13 years agoARM: NEON optimization for bilinear scaled 'src_0565_0565'
Siarhei Siamashka [Wed, 9 Mar 2011 11:27:41 +0000 (13:27 +0200)]
ARM: NEON optimization for bilinear scaled 'src_0565_0565'

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=3.30 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=32.29 MPix/s

13 years agoARM: NEON optimization for bilinear scaled 'src_0565_x888'
Siarhei Siamashka [Wed, 9 Mar 2011 11:21:53 +0000 (13:21 +0200)]
ARM: NEON optimization for bilinear scaled 'src_0565_x888'

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=20020888, speed=3.39 MPix/s
  after:  op=1, src=10020565, dst=20020888, speed=36.82 MPix/s

13 years agoARM: NEON optimization for bilinear scaled 'src_8888_0565'
Siarhei Siamashka [Wed, 9 Mar 2011 09:53:04 +0000 (11:53 +0200)]
ARM: NEON optimization for bilinear scaled 'src_8888_0565'

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=10020565, speed=6.56 MPix/s
  after:  op=1, src=20028888, dst=10020565, speed=61.65 MPix/s

13 years agoARM: use common macro template for bilinear scaled 'src_8888_8888'
Siarhei Siamashka [Wed, 9 Mar 2011 09:46:48 +0000 (11:46 +0200)]
ARM: use common macro template for bilinear scaled 'src_8888_8888'

This is a cleanup of old and now duplicated code. The performance improvement
comes mostly from enabling software prefetch, but instruction scheduling is
also slightly better.

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=53.24 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=74.36 MPix/s

13 years agoARM: NEON: common macro template for bilinear scanline scalers
Siarhei Siamashka [Wed, 9 Mar 2011 09:34:15 +0000 (11:34 +0200)]
ARM: NEON: common macro template for bilinear scanline scalers

This makes it possible to generate bilinear scanline scaling functions
targeting various source and destination color formats. Right now the
a8r8g8b8/x8r8g8b8 and r5g6b5 color formats are supported. More formats can
be added if needed.

13 years agoARM: new bilinear fast path template macro in 'pixman-arm-common.h'
Siarhei Siamashka [Wed, 9 Mar 2011 08:59:46 +0000 (10:59 +0200)]
ARM: new bilinear fast path template macro in 'pixman-arm-common.h'

It can be reused in different ARM NEON bilinear scaling fast path functions.

13 years agoARM: assembly optimized nearest scaled 'src_8888_8888'
Siarhei Siamashka [Sun, 6 Mar 2011 20:16:32 +0000 (22:16 +0200)]
ARM: assembly optimized nearest scaled 'src_8888_8888'

Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=44.36 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=39.79 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=102.36 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=163.12 MPix/s

13 years agoARM: common macro for nearest scaling fast paths
Siarhei Siamashka [Mon, 7 Mar 2011 01:10:43 +0000 (03:10 +0200)]
ARM: common macro for nearest scaling fast paths

The code of nearest scaled 'src_0565_0565' function was generalized
and moved to a common macro, so that it can be reused for other
fast paths.

13 years agoARM: use prefetch in nearest scaled 'src_0565_0565'
Siarhei Siamashka [Sun, 6 Mar 2011 14:17:12 +0000 (16:17 +0200)]
ARM: use prefetch in nearest scaled 'src_0565_0565'

Benchmark on ARM Cortex-A8 r1p3 @500MHz, 32-bit LPDDR @166MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=75.02 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=73.63 MPix/s

Benchmark on ARM Cortex-A8 r2p2 @1GHz, 32-bit LPDDR @200MHz:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=10020565, dst=10020565, speed=176.12 MPix/s
  after:  op=1, src=10020565, dst=10020565, speed=267.50 MPix/s

13 years agotest: Do endian swapping of the source and destination images.
Søren Sandmann Pedersen [Fri, 4 Mar 2011 20:51:18 +0000 (15:51 -0500)]
test: Do endian swapping of the source and destination images.

Otherwise the test fails on big endian. Fix for bug 34767, reported by
Siarhei Siamashka.

13 years agotest: In image_endian_swap() use pixman_image_get_format() to get the bpp.
Søren Sandmann Pedersen [Mon, 7 Mar 2011 18:45:54 +0000 (13:45 -0500)]
test: In image_endian_swap() use pixman_image_get_format() to get the bpp.

There is no reason to pass in the bpp as an argument; it can be gotten
directly from the image.

13 years agoARM: NEON optimization for bilinear scaled 'src_8888_8888'
Siarhei Siamashka [Tue, 22 Feb 2011 16:45:03 +0000 (18:45 +0200)]
ARM: NEON optimization for bilinear scaled 'src_8888_8888'

Initial NEON optimization for bilinear scaling. It can probably be
improved further.

Benchmark on ARM Cortex-A8:
 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=6.70 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=44.27 MPix/s

13 years agoSSE2 optimization for bilinear scaled 'src_8888_8888'
Siarhei Siamashka [Mon, 21 Feb 2011 18:18:02 +0000 (20:18 +0200)]
SSE2 optimization for bilinear scaled 'src_8888_8888'

A primitive naive implementation of bilinear scaling using SSE2 intrinsics,
which only handles one pixel at a time. It is approximately 2x faster than
pixman general compositing path. Single pass processing without intermediate
temporary buffer contributes to ~15% and loop unrolling contributes to ~20%
of this speedup.

Benchmark on Intel Core i7 (x86-64):
 Using cairo-perf-trace:
  before: image        firefox-planet-gnome   12.566   12.610   0.23%    6/6
  after:  image        firefox-planet-gnome   10.961   11.013   0.19%    5/6

 Microbenchmark (scaling 2000x2000 image with scale factor close to 1x):
  before: op=1, src=20028888, dst=20028888, speed=70.48 MPix/s
  after:  op=1, src=20028888, dst=20028888, speed=165.38 MPix/s
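
What "one pixel at a time" bilinear interpolation computes can be sketched in plain C as follows. This is not the SSE2 code from the commit; the 0..256 weight range is an assumption chosen to keep the arithmetic simple, and pixman's real fixed-point precision may differ.

```c
#include <stdint.h>

/* Interpolates one a8r8g8b8 pixel from its four neighbours.
 * tl/tr/bl/br: top-left, top-right, bottom-left, bottom-right pixels.
 * distx, disty: horizontal and vertical weights in the range 0..256,
 * where 0 selects the left/top pixel and 256 the right/bottom one. */
static uint32_t sketch_bilinear(uint32_t tl, uint32_t tr,
                                uint32_t bl, uint32_t br,
                                uint32_t distx, uint32_t disty)
{
    uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (tl >> shift) & 0xff;
        uint32_t b = (tr >> shift) & 0xff;
        uint32_t c = (bl >> shift) & 0xff;
        uint32_t d = (br >> shift) & 0xff;

        /* Horizontal pass on both rows, then one vertical pass. */
        uint32_t top = a * (256 - distx) + b * distx;
        uint32_t bot = c * (256 - distx) + d * distx;
        uint32_t v   = (top * (256 - disty) + bot * disty) >> 16;

        result |= v << shift;
    }

    return result;
}
```

Doing both passes per pixel in registers is what lets a single pass implementation avoid an intermediate temporary buffer.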

13 years agotest: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
Siarhei Siamashka [Mon, 21 Feb 2011 00:07:09 +0000 (02:07 +0200)]
test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'

Individual correctness check for the new bilinear scaling related
supplementary function. This test program uses a bit wider range
of input arguments, not covered by other tests.

13 years agoMain loop template for fast single pass bilinear scaling
Siarhei Siamashka [Sun, 20 Feb 2011 23:29:02 +0000 (01:29 +0200)]
Main loop template for fast single pass bilinear scaling

Can be used for implementing SIMD optimized fast path
functions which work with bilinear scaled source images.

Similar to the template for nearest scaling main loop, the
following types of mask are supported:
1. no mask
2. non-scaled a8 mask with SAMPLES_COVER_CLIP flag
3. solid mask

PAD repeat is fully supported. NONE repeat is partially
supported (right now only works if source image has alpha
channel or when alpha channel of the source image does not
have any effect on the compositing operation).
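
The difference between the two repeat modes can be sketched with a single-row fetch. The names below are hypothetical and the real template works on whole scanlines rather than single pixels.

```c
#include <stdint.h>

typedef enum { SKETCH_REPEAT_NONE, SKETCH_REPEAT_PAD } sketch_repeat_t;

/* Fetches pixel x from a row of 'width' pixels (width must be >= 1).
 * PAD clamps out-of-range coordinates to the nearest edge pixel.
 * NONE treats everything outside the image as transparent black. */
static uint32_t sketch_fetch(const uint32_t *row, int width, int x,
                             sketch_repeat_t repeat)
{
    if (x >= 0 && x < width)
        return row[x];

    if (repeat == SKETCH_REPEAT_PAD)
        return row[x < 0 ? 0 : width - 1];

    return 0; /* NONE: zero alpha, zero color */
}
```

Under NONE repeat the outside pixels carry zero alpha, which suggests why a source without an alpha channel needs extra care in that mode.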

13 years agotest: Silence MSVC warnings
Andrea Canciani [Thu, 24 Feb 2011 11:53:39 +0000 (12:53 +0100)]
test: Silence MSVC warnings

MSVC does not notice non-returning functions (abort() / assert(0))
and warns about paths which end with them in non-void functions:

c:\cygwin\home\ranma42\code\fdo\pixman\test\fetch-test.c(114) :
warning C4715: 'reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\stress-test.c(133) :
warning C4715: 'real_reader' : not all control paths return a value
c:\cygwin\home\ranma42\code\fdo\pixman\test\composite.c(431) :
warning C4715: 'calc_op' : not all control paths return a value

These warnings can be silenced by adding a return after the
termination call.
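
The pattern looks roughly like this. The function is a made-up example in the spirit of the test helpers named in the warnings, not a copy of them.

```c
#include <stdint.h>
#include <stdlib.h>

/* Reads a 1, 2 or 4 byte value; any other size is a programming error. */
static uint32_t sketch_read(const void *src, int size)
{
    switch (size) {
    case 1:
        return *(const uint8_t *)src;
    case 2:
        return *(const uint16_t *)src;
    case 4:
        return *(const uint32_t *)src;
    default:
        abort();
        return 0; /* never reached; only here to silence warning C4715 */
    }
}
```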

13 years agoDo not include unused headers
Andrea Canciani [Tue, 22 Feb 2011 21:43:48 +0000 (22:43 +0100)]
Do not include unused headers

pixman-combine32.h is included without being used both in
pixman-image.c and in pixman-general.c.

13 years agotest: Add Makefile for Win32
Andrea Canciani [Tue, 22 Feb 2011 21:04:49 +0000 (22:04 +0100)]
test: Add Makefile for Win32

13 years agotest: Fix tests for compilation on Windows
Andrea Canciani [Tue, 22 Feb 2011 20:46:37 +0000 (21:46 +0100)]
test: Fix tests for compilation on Windows

The Microsoft C compiler cannot handle subobject initialization and
Win32 does not provide snprintf.

Work around these limitations by using normal struct initialization
and using sprintf (a manual check shows that the buffer size is
sufficient).
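
A sketch of both workarounds, assuming that "subobject initialization" refers to C99 designated initializers. The struct, the values and the buffer size are invented for illustration.

```c
#include <stdio.h>

typedef struct {
    int width;
    int height;
    int stride;
} sketch_case_t;

/* C99 form that the compiler in question rejected:
 *     sketch_case_t c = { .width = 4, .height = 2, .stride = 16 };
 * Positional initialization, in declaration order, works everywhere. */
static const sketch_case_t sketch_case = { 4, 2, 16 };

/* Formats a name with sprintf instead of snprintf. The caller's buffer
 * must hold at least 16 bytes: "test-" (5), up to 10 digits for a
 * 32-bit unsigned value, and the terminating NUL. */
static int sketch_format_name(char *buf, unsigned index)
{
    return sprintf(buf, "test-%u", index);
}
```

Without snprintf the bound is not enforced at run time, so the size argument has to be checked by hand, as the commit message notes.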

13 years agoFix compilation on Win32
Andrea Canciani [Thu, 24 Feb 2011 09:44:04 +0000 (10:44 +0100)]
Fix compilation on Win32

Makefile.win32 contained a typo and was missing the dependency from
the built sources.

13 years agoPost-release version bump to 0.21.7
Søren Sandmann Pedersen [Tue, 22 Feb 2011 21:13:32 +0000 (16:13 -0500)]
Post-release version bump to 0.21.7

13 years agoPre-release version bump to 0.21.6
Søren Sandmann Pedersen [Tue, 22 Feb 2011 20:43:41 +0000 (15:43 -0500)]
Pre-release version bump to 0.21.6

13 years agoMinor fix to the RELEASING file
Søren Sandmann Pedersen [Tue, 22 Feb 2011 20:40:34 +0000 (15:40 -0500)]
Minor fix to the RELEASING file

13 years agoDelete pixman-x64-mmx-emulation.h from pixman/Makefile.am
Søren Sandmann Pedersen [Tue, 22 Feb 2011 20:28:17 +0000 (15:28 -0500)]
Delete pixman-x64-mmx-emulation.h from pixman/Makefile.am

13 years agoEnsure that tests run as the last step of a build for 'make check'
Siarhei Siamashka [Tue, 22 Feb 2011 17:28:08 +0000 (19:28 +0200)]
Ensure that tests run as the last step of a build for 'make check'

Previously 'make check' would compile and run the tests first, and only
then proceed to compiling the demos. That was not very convenient
because one had to scroll back through the console output to see the
test verdict. Swapping the order of the SUBDIRS variable entries in
Makefile.am resolves this.

13 years agosse2: Minor coding style cleanups.
Søren Sandmann Pedersen [Fri, 18 Feb 2011 12:38:49 +0000 (07:38 -0500)]
sse2: Minor coding style cleanups.

Also make pixman_fill_sse2() static.

13 years agosse2: Remove pixman-x64-mmx-emulation.h
Søren Sandmann Pedersen [Fri, 18 Feb 2011 12:40:02 +0000 (07:40 -0500)]
sse2: Remove pixman-x64-mmx-emulation.h

Also stop including mmintrin.h

13 years agosse2: Delete obsolete or redundant comments
Søren Sandmann Pedersen [Fri, 18 Feb 2011 12:38:03 +0000 (07:38 -0500)]
sse2: Delete obsolete or redundant comments

13 years agosse2: Remove all the core_combine_* functions
Søren Sandmann Pedersen [Fri, 18 Feb 2011 12:07:45 +0000 (07:07 -0500)]
sse2: Remove all the core_combine_* functions

Now that _mm_empty() is not used anymore, they are no longer different
from the sse2_combine_* functions, so they can be consolidated.

13 years agosse2: Don't compile pixman-sse2.c with -mmmx anymore
Søren Sandmann Pedersen [Fri, 18 Feb 2011 10:15:50 +0000 (05:15 -0500)]
sse2: Don't compile pixman-sse2.c with -mmmx anymore

It's not necessary now that the file doesn't use MMX instructions.

13 years agosse2: Delete unused MMX functions and constants and all _mm_empty()s
Søren Sandmann Pedersen [Fri, 18 Feb 2011 10:07:08 +0000 (05:07 -0500)]
sse2: Delete unused MMX functions and constants and all _mm_empty()s

These are not needed because the SSE2 implementation doesn't use MMX
anymore.

13 years agosse2: Convert all uses of MMX registers to use SSE2 registers instead.
Søren Sandmann Pedersen [Fri, 18 Feb 2011 08:56:20 +0000 (03:56 -0500)]
sse2: Convert all uses of MMX registers to use SSE2 registers instead.

By avoiding use of MMX registers we won't need to call emms all over
the place, which avoids various miscompilation issues.

13 years agoCoding style: core_combine_in_u_pixelsse2 -> core_combine_in_u_pixel_sse2
Søren Sandmann Pedersen [Fri, 18 Feb 2011 08:57:55 +0000 (03:57 -0500)]
Coding style:  core_combine_in_u_pixelsse2 -> core_combine_in_u_pixel_sse2

13 years agoIn pixman_image_set_transform() allow NULL for transform
Søren Sandmann Pedersen [Tue, 15 Feb 2011 14:11:44 +0000 (09:11 -0500)]
In pixman_image_set_transform() allow NULL for transform

Previously, this would crash unless the existing transform were also
NULL.

13 years agoAvoid marking images dirty when properties are reset
Søren Sandmann Pedersen [Tue, 15 Feb 2011 09:55:02 +0000 (04:55 -0500)]
Avoid marking images dirty when properties are reset

When an image property is set to the value it already has, there is
no reason to mark the image dirty and incur a recomputation of the
flags.
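
In outline, a setter following this rule returns early when nothing changes. The struct and function below are hypothetical and much simpler than pixman's image type.

```c
typedef struct {
    int repeat; /* some image property */
    int dirty;  /* nonzero when derived flags must be recomputed */
} sketch_props_t;

/* Sets the property and marks the image dirty only on a real change. */
static void sketch_set_repeat(sketch_props_t *img, int repeat)
{
    if (img->repeat == repeat)
        return; /* same value: keep cached flags valid */

    img->repeat = repeat;
    img->dirty = 1;
}
```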

13 years agoAdd new public function pixman_add_triangles()
Søren Sandmann Pedersen [Fri, 11 Feb 2011 13:57:42 +0000 (08:57 -0500)]
Add new public function pixman_add_triangles()

This allows some more code to be deleted from the X server. The
implementation consists of converting to trapezoids, and is shared
with pixman_composite_triangles().

13 years agoOptimize adding opaque trapezoids onto a8 destination.
Søren Sandmann Pedersen [Fri, 14 Jan 2011 11:19:08 +0000 (06:19 -0500)]
Optimize adding opaque trapezoids onto a8 destination.

When the source is opaque and the destination is alpha only, we can
avoid the temporary mask and just add the trapezoids directly.

13 years agoAdd a test program, tri-test
Søren Sandmann Pedersen [Wed, 12 Jan 2011 08:02:59 +0000 (03:02 -0500)]
Add a test program, tri-test

This program tests whether the new triangle support works.

13 years agoAdd support for triangles to pixman.
Søren Sandmann Pedersen [Tue, 11 Jan 2011 15:15:21 +0000 (10:15 -0500)]
Add support for triangles to pixman.

The Render X extension can draw triangles as well as trapezoids, but
the implementation has always converted them to trapezoids. This patch
moves the X server's triangle conversion code into pixman, where we
can reuse the pixman_composite_trapezoid() code.

13 years agoAdd a test program for pixman_composite_trapezoids().
Søren Sandmann Pedersen [Thu, 10 Feb 2011 15:37:08 +0000 (10:37 -0500)]
Add a test program for pixman_composite_trapezoids().

A CRC32 based test program to check that pixman_composite_trapezoids()
actually works.
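
A CRC32 based test typically renders into a buffer, checksums the result, and compares it with a known value. The bitwise CRC-32 below (reflected polynomial 0xEDB88320) is a generic sketch, not the checksum routine from pixman's test utilities.

```c
#include <stddef.h>
#include <stdint.h>

/* Updates a running CRC-32 with len bytes of data. Pass 0 as the
 * initial crc; the result can be fed back in to continue a checksum. */
static uint32_t sketch_crc32(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

A test would call this over the destination image's pixel data after compositing and fail if the value differs from the reference checksum.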

13 years agoAdd pixman_composite_trapezoids().
Søren Sandmann Pedersen [Tue, 11 Jan 2011 14:23:43 +0000 (09:23 -0500)]
Add pixman_composite_trapezoids().

This function is an implementation of the X server request
Trapezoids. That request is what the X backend of cairo is using all
the time; by moving it into pixman we can hopefully make it faster.

13 years agotest/Makefile.am: Move all the TEST_LDADD into a new global LDADD.
Søren Sandmann Pedersen [Wed, 19 Jan 2011 00:40:53 +0000 (19:40 -0500)]
test/Makefile.am: Move all the TEST_LDADD into a new global LDADD.

This gets rid of a bunch of replicated *_LDADD clauses

13 years agoAdd @TESTPROGS_EXTRA_LDFLAGS@ to AM_LDFLAGS
Søren Sandmann Pedersen [Wed, 19 Jan 2011 00:20:18 +0000 (19:20 -0500)]
Add @TESTPROGS_EXTRA_LDFLAGS@ to AM_LDFLAGS

Instead of explicitly adding it to each test program.

13 years agoMove all the GTK+ based test programs to a new subdir, "demos"
Søren Sandmann Pedersen [Wed, 19 Jan 2011 00:16:39 +0000 (19:16 -0500)]
Move all the GTK+ based test programs to a new subdir, "demos"

This separates the test suite from the assorted GTK+-based test
programs. "demos" is somewhat misleading because the programs there
are not particularly exciting (with the possible exception of
composite-test, which shows off all the compositing operators).

13 years agoSSE2 optimization for nearest scaled over_8888_n_8888
Siarhei Siamashka [Thu, 3 Feb 2011 22:47:36 +0000 (00:47 +0200)]
SSE2 optimization for nearest scaled over_8888_n_8888

This operation shows up, to a small extent, in some of the HTML5-based
games from http://www.kesiev.com/akihabara/

=== Cairo trace of the game intro animation for 'Legend of Sadness' ===

before:
[  0]    image    firefox-legend-of-sadness   46.286   46.298   0.01%    5/6

after:
[  0]    image    firefox-legend-of-sadness   45.088   45.102   0.04%    6/6

=== Microbenchmark (scaling ~2000x~2000 -> ~2000x~2000) ===

before:
    translucent: op=3, src=8888, mask=s dst=8888, speed=131.30 MPix/s
    transparent: op=3, src=8888, mask=s dst=8888, speed=132.38 MPix/s
    opaque:      op=3, src=8888, mask=s dst=8888, speed=167.90 MPix/s
after:
    translucent: op=3, src=8888, mask=s dst=8888, speed=301.93 MPix/s
    transparent: op=3, src=8888, mask=s dst=8888, speed=770.70 MPix/s
    opaque:      op=3, src=8888, mask=s dst=8888, speed=301.80 MPix/s

13 years agoARM: NEON optimization for nearest scaled over_0565_8_0565
Siarhei Siamashka [Wed, 3 Nov 2010 13:22:28 +0000 (15:22 +0200)]
ARM: NEON optimization for nearest scaled over_0565_8_0565

This may be used in some cases for HTML5 video when hardware
acceleration is not available.

13 years agoARM: NEON optimization for nearest scaled over_8888_8_0565
Siarhei Siamashka [Wed, 3 Nov 2010 13:16:28 +0000 (15:16 +0200)]
ARM: NEON optimization for nearest scaled over_8888_8_0565

This may be used in some cases for HTML5 video when hardware
acceleration is not available.

13 years agoARM: new macro template for using scaled fast paths with a8 mask
Siarhei Siamashka [Wed, 3 Nov 2010 13:15:15 +0000 (15:15 +0200)]
ARM: new macro template for using scaled fast paths with a8 mask

13 years agoBetter support for NONE repeat in nearest scaling main loop template
Siarhei Siamashka [Wed, 2 Feb 2011 16:14:56 +0000 (18:14 +0200)]
Better support for NONE repeat in nearest scaling main loop template

The scaling function now gets an extra boolean argument, which is set
to TRUE when we are fetching padding pixels for NONE repeat. This
makes it possible to decide whether to interpret alpha as 0xFF or 0x00
for such pixels when working with formats that don't have an alpha
channel (for example x8r8g8b8 and r5g6b5).
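
The effect of the extra argument can be sketched with a scalar x8r8g8b8
fetch. The function name and the is_padding parameter are hypothetical and
only illustrate the decision described above: pixels inside the image are
opaque, while padding pixels outside a NONE-repeat image are transparent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Expand x8r8g8b8 pixels to a8r8g8b8. The format has no alpha channel,
 * so alpha must be chosen by the caller's context:
 *   - real image pixels are opaque (alpha 0xFF),
 *   - padding pixels for NONE repeat are transparent black (alpha 0x00).
 * src is not read when is_padding is true. */
static void fetch_x8r8g8b8_scanline(uint32_t *dst, const uint32_t *src,
                                    int width, bool is_padding)
{
    assert(width >= 0);
    assert(is_padding || src != NULL);

    for (int i = 0; i < width; i++) {
        if (is_padding)
            dst[i] = 0x00000000u;
        else
            dst[i] = src[i] | 0xFF000000u;
    }
}
```

Without the flag, a fast path for an alpha-less format could not tell the two
cases apart and would have to treat every fetched pixel as opaque.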

13 years agoSupport for a8 and solid mask in nearest scaling main loop template
Siarhei Siamashka [Fri, 22 Oct 2010 14:54:41 +0000 (17:54 +0300)]
Support for a8 and solid mask in nearest scaling main loop template

In addition to the most common case of not having any mask at all, two
variants of scaling with mask show up in cairo traces:
1. non-scaled a8 mask with SAMPLES_COVER_CLIP flag
2. solid mask

This patch extends the nearest scaling main loop template to also
support these cases.
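
A simplified view of such a main loop, handling the three mask cases above,
might look like the sketch below. All names are hypothetical. It copies
nearest-sampled source pixels multiplied by the mask, whereas the real
template is instantiated per operator and would typically select the mask
kind at compile time rather than per pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MASK_NONE, MASK_A8, MASK_SOLID } mask_kind_t;

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Multiply all four channels of an a8r8g8b8 pixel by m / 255. */
static uint32_t scale_pixel(uint32_t p, uint8_t m)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), m) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), m) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8), m) << 8) |
            (uint32_t)mul_un8((uint8_t)p, m);
}

/* Nearest-neighbour scanline: vx is the 16.16 fixed-point source x
 * coordinate of the first destination pixel and unit_x the step per
 * destination pixel. The a8 mask is not scaled: it is indexed by the
 * destination x. The caller must keep vx inside the source scanline. */
static void nearest_scanline(uint32_t *dst, const uint32_t *src, int width,
                             int32_t vx, int32_t unit_x, mask_kind_t kind,
                             const uint8_t *a8_mask, uint8_t solid)
{
    assert(width >= 0 && vx >= 0 && unit_x >= 0);
    assert(kind != MASK_A8 || a8_mask != NULL);

    for (int i = 0; i < width; i++) {
        uint32_t s = src[vx >> 16];
        uint8_t m = 0xFF;               /* no mask: fully covered */

        if (kind == MASK_A8)
            m = a8_mask[i];
        else if (kind == MASK_SOLID)
            m = solid;

        dst[i] = scale_pixel(s, m);
        vx += unit_x;
    }
}
```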

13 years agotest: Extend scaling-test to support a8/solid mask and ADD operation
Siarhei Siamashka [Fri, 22 Oct 2010 13:29:01 +0000 (16:29 +0300)]
test: Extend scaling-test to support a8/solid mask and ADD operation

The image width has also been increased because SIMD optimizations typically
do more unrolling in the inner loops, and this needs to be tested.

13 years agoUse const modifiers for source buffers in nearest scaling fast paths
Siarhei Siamashka [Mon, 17 Jan 2011 00:29:43 +0000 (02:29 +0200)]
Use const modifiers for source buffers in nearest scaling fast paths

13 years agoC fast paths for a simple 90/270 degrees rotation
Siarhei Siamashka [Fri, 30 Jul 2010 15:37:51 +0000 (18:37 +0300)]
C fast paths for a simple 90/270 degrees rotation

Depending on the CPU architecture, performance is 1.5 to 4 times slower
than a simple nonrotated copy (which would be the ideal case, perfectly
utilizing memory bandwidth), but it is still more than 7 times faster
than the general path.

This implementation sets a performance baseline for rotation. The use
of SIMD instructions may further improve memory bandwidth utilization.
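
For reference, the simplest form of a 90 degree rotation is shown below. It
is a naive sketch with a hypothetical function name, not the code from this
commit: every source row becomes a destination column, so the writes stride
through memory, which is one reason rotation is slower than a plain copy. A
real fast path would likely work in small tiles to be kinder to the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a w x h image of 32-bit pixels by 90 degrees clockwise.
 * The destination is h pixels wide and w pixels tall. Strides are
 * given in pixels, not bytes. Source pixel (x, y) lands at
 * destination (h - 1 - y, x). */
static void rotate_90_cw(uint32_t *dst, int dst_stride,
                         const uint32_t *src, int src_stride,
                         int w, int h)
{
    assert(w >= 0 && h >= 0);
    assert(dst_stride >= h && src_stride >= w);

    for (int y = 0; y < h; y++) {
        const uint32_t *s = src + (size_t)y * src_stride;

        for (int x = 0; x < w; x++)
            dst[(size_t)x * dst_stride + (h - 1 - y)] = s[x];
    }
}
```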

13 years agoNew flags for 90/180/270 rotation
Siarhei Siamashka [Thu, 29 Jul 2010 14:58:13 +0000 (17:58 +0300)]
New flags for 90/180/270 rotation

These flags are set when the transform is a simple nonscaled 90/180/270
degree rotation.
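
One way such a check could work is to compare the 2x2 part of the transform
against the three rotation matrices in 16.16 fixed point. The types, enum and
function below are hypothetical, translation is ignored, and which matrix is
labelled 90 rather than 270 is an arbitrary choice made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16_t;
#define FIXED_ONE ((fixed_16_16_t)1 << 16)

/* Upper-left 2x2 part of an affine transform (hypothetical type). */
typedef struct { fixed_16_16_t m[2][2]; } matrix2_t;

typedef enum { ROT_OTHER = 0, ROT_90, ROT_180, ROT_270 } rotation_t;

/* Classify a transform as an exact, unscaled 90/180/270 degree
 * rotation. Everything else, including identity, is ROT_OTHER. */
static rotation_t classify_rotation(const matrix2_t *t)
{
    assert(t);

    fixed_16_16_t a = t->m[0][0], b = t->m[0][1];
    fixed_16_16_t c = t->m[1][0], d = t->m[1][1];

    if (a == 0 && d == 0) {
        if (b == -FIXED_ONE && c == FIXED_ONE)
            return ROT_90;
        if (b == FIXED_ONE && c == -FIXED_ONE)
            return ROT_270;
    } else if (b == 0 && c == 0 && a == -FIXED_ONE && d == -FIXED_ONE) {
        return ROT_180;
    }
    return ROT_OTHER;
}
```

Computing this once per image lets the fast path lookup match on a flag
instead of inspecting the matrix for every composite operation.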

13 years agotest: affine-test updated to stress 90/180/270 degrees rotation more
Siarhei Siamashka [Tue, 26 Oct 2010 12:40:01 +0000 (15:40 +0300)]
test: affine-test updated to stress 90/180/270 degrees rotation more

13 years agoAdd pixman-conical-gradient.c to Makefile.win32.
Søren Sandmann Pedersen [Thu, 10 Feb 2011 10:21:42 +0000 (05:21 -0500)]
Add pixman-conical-gradient.c to Makefile.win32.

Pointed out by Kirill Tishin.

13 years agoAdd SSE2 fetcher for 0565
Søren Sandmann Pedersen [Sun, 23 Jan 2011 21:53:26 +0000 (16:53 -0500)]
Add SSE2 fetcher for 0565

Before:

add_0565_0565 = L1:  61.08  L2:  61.03  M: 60.57 ( 10.95%)  HT: 46.85  VT: 45.25  R: 39.99  RT: 20.41 ( 233Kops/s)

After:

add_0565_0565 = L1:  77.84  L2:  76.25  M: 75.38 ( 13.71%)  HT: 55.99  VT: 54.56  R: 45.41  RT: 21.95 ( 255Kops/s)

13 years agoImprove performance of sse2_combine_over_u()
Søren Sandmann Pedersen [Fri, 31 Dec 2010 05:57:46 +0000 (00:57 -0500)]
Improve performance of sse2_combine_over_u()

Split this function into two, one that has a mask, and one that
doesn't. This is a fairly substantial speed-up in many cases.

New output of lowlevel-blt-bench over_x888_8_0565:

over_x888_8_0565 =  L1:  63.76  L2:  62.75  M: 59.37 ( 21.55%)  HT: 45.89  VT: 43.55  R: 34.51  RT: 16.80 ( 201Kops/s)
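
The split described above can be illustrated in scalar C. This is a sketch
with hypothetical function names, not the SSE2 code: the point is that the
mask test happens once per call, so the unmasked loop carries no per-pixel
branch or mask multiply. Pixels are assumed to be premultiplied a8r8g8b8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Multiply all four channels of an a8r8g8b8 pixel by m / 255. */
static uint32_t scale_pixel(uint32_t p, uint8_t m)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), m) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), m) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8), m) << 8) |
            (uint32_t)mul_un8((uint8_t)p, m);
}

/* OVER for premultiplied pixels: s + d * (1 - alpha(s)). With valid
 * premultiplied input no channel overflows, so a plain add suffices. */
static uint32_t over_pixel(uint32_t s, uint32_t d)
{
    uint8_t inv_alpha = (uint8_t)(255 - (s >> 24));
    return s + scale_pixel(d, inv_alpha);
}

static void combine_over_no_mask(uint32_t *dst, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = over_pixel(src[i], dst[i]);
}

static void combine_over_mask(uint32_t *dst, const uint32_t *src,
                              const uint32_t *mask, int n)
{
    /* Only the alpha channel of the mask is used. */
    for (int i = 0; i < n; i++)
        dst[i] = over_pixel(scale_pixel(src[i], (uint8_t)(mask[i] >> 24)),
                            dst[i]);
}

/* Entry point: choose the specialised loop once, outside the loop. */
static void combine_over(uint32_t *dst, const uint32_t *src,
                         const uint32_t *mask, int n)
{
    assert(n >= 0);

    if (mask)
        combine_over_mask(dst, src, mask, n);
    else
        combine_over_no_mask(dst, src, n);
}
```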

13 years agoAdd SSE2 fetcher for a8
Søren Sandmann Pedersen [Sun, 23 Jan 2011 21:17:17 +0000 (16:17 -0500)]
Add SSE2 fetcher for a8

New output of lowlevel-blt-bench over_x888_8_0565:

over_x888_8_0565 =  L1:  57.85  L2:  56.80  M: 54.14 ( 19.50%)  HT: 42.64  VT: 40.56  R: 32.67  RT: 16.22 ( 195Kops/s)

Based in part on code by Steve Snyder from

    https://bugs.freedesktop.org/show_bug.cgi?id=21173

13 years agoAdd SSE2 fetcher for x8r8g8b8
Søren Sandmann Pedersen [Wed, 12 Jan 2011 11:38:54 +0000 (06:38 -0500)]
Add SSE2 fetcher for x8r8g8b8

New output of lowlevel-blt-bench over_x888_8_0565:

over_x888_8_0565 =  L1:  55.68  L2:  55.11  M: 52.83 ( 19.04%)  HT: 39.62  VT: 37.70  R: 30.88  RT: 14.62 ( 174Kops/s)

The fetcher is looked up in a table, so that other fetchers can easily
be added.

See also https://bugs.freedesktop.org/show_bug.cgi?id=20709
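
A table-driven lookup of the kind mentioned above could be sketched like
this. The format enum, function names and table layout are hypothetical, and
the fetchers are plain scalar C rather than SSE2. Each one expands a scanline
to a8r8g8b8, and supporting another format only requires one more table
entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format codes, for illustration only. */
typedef enum { FMT_X8R8G8B8, FMT_A8, FMT_R5G6B5, FMT_OTHER } format_t;

typedef void (*fetch_fn)(uint32_t *dst, const void *src, int width);

static void fetch_x8r8g8b8(uint32_t *dst, const void *src, int width)
{
    const uint32_t *s = src;
    assert(width >= 0);
    for (int i = 0; i < width; i++)
        dst[i] = s[i] | 0xFF000000u;        /* force opaque alpha */
}

static void fetch_a8(uint32_t *dst, const void *src, int width)
{
    const uint8_t *s = src;
    assert(width >= 0);
    for (int i = 0; i < width; i++)
        dst[i] = (uint32_t)s[i] << 24;      /* alpha only, colour is zero */
}

static void fetch_r5g6b5(uint32_t *dst, const void *src, int width)
{
    const uint16_t *s = src;
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        uint32_t r = (s[i] >> 11) & 0x1F;
        uint32_t g = (s[i] >> 5) & 0x3F;
        uint32_t b = s[i] & 0x1F;

        /* Replicate the top bits so that full scale maps to 0xFF. */
        r = (r << 3) | (r >> 2);
        g = (g << 2) | (g >> 4);
        b = (b << 3) | (b >> 2);
        dst[i] = 0xFF000000u | (r << 16) | (g << 8) | b;
    }
}

static const struct {
    format_t format;
    fetch_fn fetch;
} fetchers[] = {
    { FMT_X8R8G8B8, fetch_x8r8g8b8 },
    { FMT_A8,       fetch_a8 },
    { FMT_R5G6B5,   fetch_r5g6b5 },
};

/* Returns NULL when no specialised fetcher exists for the format, in
 * which case the caller would fall back to a generic implementation. */
static fetch_fn lookup_fetcher(format_t format)
{
    for (size_t i = 0; i < sizeof fetchers / sizeof fetchers[0]; i++) {
        if (fetchers[i].format == format)
            return fetchers[i].fetch;
    }
    return NULL;
}
```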