Matt Turner [Fri, 29 Jun 2012 18:24:30 +0000 (14:24 -0400)]
Fix distcheck due to custom iwMMXt rules
Siarhei Siamashka [Mon, 25 Jun 2012 04:24:27 +0000 (07:24 +0300)]
sse2: faster bilinear scaling (use _mm_loadl_epi64)
Using _mm_loadl_epi64() to load two pixels at once (pairs of top
and bottom pixels) is faster than loading each pixel separately
and combining them with _mm_set_epi32().
=== cairo-perf-trace ===
before: image firefox-fishtank 66.912 66.931 0.13% 3/3
after: image firefox-fishtank 57.584 58.349 0.74% 3/3
=== lowlevel-blt-bench ===
before: src_8888_8888 = L1: 181.10 L2: 179.14 M:178.08 ( 11.02%) HT:153.22 VT:133.45 R:142.24 RT: 95.32
after: src_8888_8888 = L1: 228.68 L2: 225.75 M:223.98 ( 14.23%) HT:185.32 VT:155.06 R:162.73 RT:102.52
This improvement was suggested by Matt Turner on irc.
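The difference between the two loading strategies can be sketched as follows. This is a minimal illustration, not the code from pixman-sse2.c: the helper names are hypothetical, and it assumes an x86 target with SSE2 and two horizontally adjacent 32-bit pixels stored contiguously.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Slower variant: fetch each pixel separately and combine them. */
static __m128i load_two_pixels_separately(const uint32_t *row)
{
    return _mm_set_epi32(0, 0, (int)row[1], (int)row[0]);
}

/* Faster variant: one 64-bit load fills the low half of the register,
 * and the upper half is zeroed. */
static __m128i load_two_pixels_at_once(const uint32_t *row)
{
    return _mm_loadl_epi64((const __m128i *)row);
}

/* Returns 1 when all 128 bits of a and b are equal. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xffff;
}

/* Hypothetical self-check: both variants must produce the same register. */
static void check_loads_agree(const uint32_t *row)
{
    assert(same_bits(load_two_pixels_separately(row),
                     load_two_pixels_at_once(row)));
}
```

Both helpers yield the same register contents, so the gain presumably comes from replacing two scalar loads plus a combine step with a single load.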
Siarhei Siamashka [Mon, 25 Jun 2012 04:11:59 +0000 (07:11 +0300)]
test: support nearest/bilinear scaling in lowlevel-blt-bench
Scale factor is selected to be nearly 1x, so that the MPix/s results
can be directly compared with the results of non-scaled compositing
operations.
Siarhei Siamashka [Sat, 23 Jun 2012 01:08:28 +0000 (04:08 +0300)]
test: Fix for strict aliasing issue in 'get_random_seed'
Gets rid of a gcc warning when compiled with the -fstrict-aliasing option in CFLAGS.
Andrea Canciani [Wed, 20 Jun 2012 15:13:33 +0000 (17:13 +0200)]
build: Fix compilation on win32
When compiling using the win32 build system, config.h is neither
available nor needed.
Fixes:
pixman-glyph.c(26) : fatal error C1083: Cannot open include file:
'config.h': No such file or directory
Matt Turner [Thu, 3 May 2012 03:13:43 +0000 (23:13 -0400)]
sse2: add src_x888_0565
Port of 2ddd1c498b to SSE2.
Uses the pmadd technique described in
http://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_24-16_Bit_Conversion.pdf
Works around lack of packusdw instruction by first sign extending the
values.
fast: src_8888_0565 = L1: 681.40 L2: 689.20 M: 644.76 ( 25.51%) HT:404.42 VT:288.04 R:306.07 RT:150.80 (1619Kops/s)
mmx: src_8888_0565 = L1:2056.03 L2:1985.44 M:1574.91 ( 61.87%) HT:533.10 VT:376.35 R:416.10 RT:178.79 (1833Kops/s)
sse2: src_8888_0565 = L1:3793.42 L2:3653.44 M:1878.83 ( 73.94%) HT:535.03 VT:407.96 R:421.46 RT:163.31 (1727Kops/s)
and for reference, using packusdw
sse4: src_8888_0565 = L1:4396.18 L2:4229.25 M:1904.04 ( 75.18%) HT:559.79 VT:427.96 R:440.06 RT:165.71 (1744Kops/s)
Notice that MMX is faster in the RT case because it can operate on
8 bytes at a time instead of the current 16 bytes for SSE2.
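Why sign extension helps can be shown with a scalar model of a single lane. This is only a sketch: pack_signed_saturate() models the signed-saturating 32-to-16-bit pack (packssdw), and all function names are hypothetical rather than taken from pixman.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of a signed-saturating 32->16 bit pack. */
static uint16_t pack_signed_saturate(int32_t v)
{
    if (v > 32767)
        v = 32767;
    if (v < -32768)
        v = -32768;
    return (uint16_t)v;   /* keeps the low 16 bits */
}

/* Reinterpret a 16-bit pattern as a signed value in a 32-bit lane. */
static int32_t sign_extend_16(uint32_t x)
{
    assert(x <= 0xffffu);
    return x >= 0x8000u ? (int32_t)x - 0x10000 : (int32_t)x;
}

/* Packing directly clamps anything above 0x7fff. */
static uint16_t pack_naive(uint32_t x)
{
    assert(x <= 0xffffu);
    return pack_signed_saturate((int32_t)x);
}

/* After sign extension the value is within range, so no clamping occurs. */
static uint16_t pack_with_sign_extension(uint32_t x)
{
    return pack_signed_saturate(sign_extend_16(x));
}
```

With SSE2 intrinsics, such a sign extension is commonly expressed as a left shift followed by an arithmetic right shift of each 32-bit lane.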
Matt Turner [Wed, 13 Jun 2012 17:18:49 +0000 (13:18 -0400)]
sse2: enable over_n_0565 for b5g6r5
Same as b950bb12 for MMX.
Matt Turner [Wed, 13 Jun 2012 20:37:48 +0000 (16:37 -0400)]
.gitignore: add test/glyph-test
Søren Sandmann Pedersen [Wed, 13 Jun 2012 02:04:29 +0000 (22:04 -0400)]
test: Add missing break in stress-test.c
Found by coverity:
https://bugzilla.redhat.com/show_bug.cgi?id=756069
Siarhei Siamashka [Wed, 6 Jun 2012 20:54:20 +0000 (23:54 +0300)]
test: fix bisecting issue in fuzzer-find-diff.pl
Before bisecting to find the exact test which has failed, we
first need to make sure that the first test is fine (the first
test is "good" and the whole range is "bad"). Otherwise
test 2 gets incorrectly flagged as problematic when we
already got a failure on test 1 right from the start.
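The invariant described above can be sketched in C (the real script is written in Perl). The callback type, the demo predicate, and the function names are hypothetical; the predicate is assumed to report whether tests 1..n all pass.

```c
#include <assert.h>

/* Returns nonzero when tests 1..n all pass (hypothetical callback). */
typedef int (*prefix_ok_fn)(int n, void *ctx);

/* Demo predicate: ctx points to the number of the first failing test. */
static int prefix_ok_below(int n, void *ctx)
{
    return n < *(const int *)ctx;
}

/* Returns the first failing test in 1..last, or 0 if none fails. */
static int find_first_failing(prefix_ok_fn ok, void *ctx, int last)
{
    int good = 1, bad = last;

    assert(last >= 1);

    /* Establish the invariant: test 1 is good, the whole range is bad. */
    if (!ok(1, ctx))
        return 1;
    if (ok(last, ctx))
        return 0;

    /* Invariant: tests 1..good pass, tests 1..bad fail. */
    while (bad - good > 1)
    {
        int mid = good + (bad - good) / 2;

        if (ok(mid, ctx))
            good = mid;
        else
            bad = mid;
    }
    return bad;
}
```

Without the first check, a failure in test 1 would leave the loop with good = 1 and it would report test 2 at the earliest.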
Siarhei Siamashka [Wed, 6 Jun 2012 19:21:32 +0000 (22:21 +0300)]
test: OpenMP 2.5 requires signed loop iteration variables
Unsigned loop variables are only supported since version 3.0
of the OpenMP specification. Changing loop variables to use the int32_t
type fixes pixman build problems with the path64 compiler.
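A minimal sketch of a parallel loop written with a signed iteration variable. The function is hypothetical and not taken from the test suite; the pragma is guarded so the code also builds without OpenMP.

```c
#include <assert.h>
#include <stdint.h>

/* Sum n values, optionally in parallel. */
static uint32_t sum_values(const uint32_t *values, int32_t n)
{
    uint32_t total = 0;
    int32_t i;   /* signed, so OpenMP 2.5 compilers accept the loop */

    assert(n >= 0);

#ifdef _OPENMP
#pragma omp parallel for reduction(+:total)
#endif
    for (i = 0; i < n; i++)
        total += values[i];

    return total;
}
```

Declaring i as uint32_t instead would require a compiler implementing OpenMP 3.0 or later.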
Søren Sandmann Pedersen [Mon, 11 Jun 2012 23:13:45 +0000 (19:13 -0400)]
test: Make glyph test pass on big endian
The destination buffer was initialized with random uint32_t values, so
it started out different on big endian vs. little endian. Fix that by
initializing the buffer with random uint8_t values instead.
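The difference between the two initialization styles can be sketched like this. The generator below is a hypothetical stand-in for the test suite's random number helper, and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tiny LCG standing in for the test suite's generator. */
static uint32_t next_random(uint32_t *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Byte-wise fill: byte k of the buffer is the same on every host. */
static void fill_random_bytes(uint8_t *buf, size_t n_bytes, uint32_t seed)
{
    size_t i;

    assert(buf != NULL || n_bytes == 0);
    for (i = 0; i < n_bytes; i++)
        buf[i] = (uint8_t)(next_random(&seed) & 0xffu);
}

/* Word-wise fill: the words have the same values everywhere, but their
 * byte layout in memory depends on the host byte order. */
static void fill_random_words(uint32_t *buf, size_t n_words, uint32_t seed)
{
    size_t i;

    assert(buf != NULL || n_words == 0);
    for (i = 0; i < n_words; i++)
    {
        uint32_t hi = next_random(&seed);
        uint32_t lo = next_random(&seed);

        buf[i] = (hi << 16) | lo;
    }
}
```

When the buffer is later interpreted through a byte-oriented format such as a8, only the byte-wise fill gives the same starting image on both kinds of hosts.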
Søren Sandmann Pedersen [Sun, 8 Jan 2012 18:21:11 +0000 (13:21 -0500)]
bits-image: Turn all the fetchers into iterator getters
Instead of caching these fetchers in the image structure, and then
have the iterator getter call them from there, simply change them to
be iterator getters themselves.
This avoids an extra indirect function call and lets us get rid of the
get_scanline_32/64 fields in pixman_image_t.
Antti S. Lankila [Sun, 10 Jun 2012 16:22:56 +0000 (19:22 +0300)]
Faster unorm_to_unorm for wide processing.
Optimizing the unorm_to_unorm functions allows a speedup from:
src_8888_2x10 = L1: 62.08 L2: 60.73 M: 59.61 ( 4.30%) HT: 46.81
VT: 42.17 R: 43.18 RT: 26.01 (325Kops/s)
to:
src_8888_2x10 = L1: 76.94 L2: 78.43 M: 75.87 ( 5.59%) HT: 56.73
VT: 52.39 R: 53.00 RT: 29.29 (363Kops/s)
on an i7 Q720-based laptop.
The key of the patch is the observation that unorm_to_unorm's work can
more easily be done with a simple multiplication and shift, when the
function is applied repeatedly and the parameters are not compile-time
constants. For instance, converting from 0xfe to 0xfefe (expanding
from 8 bits to 16 bits) can be done by calculating
c = c * 0x101
However, sometimes the result is not a neat replication of all the
bits. For instance, going from 10 bits to 16 bits can be done by
calculating
c = c * 0x401UL >> 4
where the intermediate result is a 20-bit-wide repetition of the 10-bit
pattern, followed by shifting off the unnecessary lowest bits.
The patch has the algorithm to calculate the factor and the shift, and
converts the code to use it.
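One way to derive the factor and the shift is sketched below. This is an illustration of the idea under stated assumptions, not the code from the patch: the type and function names are hypothetical, and widths are limited to 1..16 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
    uint64_t factor;   /* 1 | 1 << from_bits | 1 << (2 * from_bits) | ... */
    int      shift;    /* surplus low bits to drop after multiplying */
} unorm_scale_t;

/* Compute the multiplier that replicates a from_bits pattern until it
 * covers at least to_bits, and the shift that trims the excess. */
static unorm_scale_t unorm_scale_init(int from_bits, int to_bits)
{
    unorm_scale_t s;
    int bits = from_bits;

    assert(from_bits >= 1 && from_bits <= 16);
    assert(to_bits >= 1 && to_bits <= 16);

    s.factor = 1;
    while (bits < to_bits)
    {
        s.factor |= (uint64_t)1 << bits;
        bits += from_bits;
    }
    s.shift = bits - to_bits;
    return s;
}

/* Convert val, which must fit in from_bits, to the target width. */
static uint32_t unorm_scale_apply(const unorm_scale_t *s, uint32_t val)
{
    return (uint32_t)(((uint64_t)val * s->factor) >> s->shift);
}
```

For 8 to 16 bits this yields factor 0x101 and shift 0, and for 10 to 16 bits factor 0x401 and shift 4, matching the two examples above. When from_bits is at least to_bits, the loop does not run and the conversion degenerates to a plain right shift.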
Matt Turner [Wed, 30 May 2012 20:44:04 +0000 (16:44 -0400)]
configure.ac: add iwmmxt2 configure flag
The flag allows the user to select whether pixman-mmx.c is compiled with
-march=iwmmxt or -march=iwmmxt2.
gcc has scheduling support for the Marvell CPU in the XO 1.75 when
building with -march=iwmmxt2.
Matt Turner [Wed, 30 May 2012 20:26:32 +0000 (16:26 -0400)]
autotools: use custom build rule to build iwMMXt code
gcc has no sane way of enabling iwmmxt code generation, like -msse for
SSE, so you have to use -march=iwmmxt{,2}. User CFLAGS are placed after
-march=iwmmxt and override the march value, so we have to use a custom
build rule to order the CFLAGS such that pixman-mmx.c will be built with
the necessary CFLAGS.
Søren Sandmann Pedersen [Tue, 3 May 2011 11:25:50 +0000 (07:25 -0400)]
Speed up _pixman_image_get_solid() in common cases
Make _pixman_image_get_solid() faster by special-casing the common
cases where the image is SOLID or a repeating a8r8g8b8 image.
This optimization together with the previous one results in a small
but reproducible performance improvement on the xfce4-terminal-a1
cairo trace:
[ # ] backend test min(s) median(s) stddev. count
Before:
[ 0] image xfce4-terminal-a1 1.221 1.239 1.21% 100/100
After:
[ 0] image xfce4-terminal-a1 1.170 1.199 1.26% 100/100
Either optimization by itself is difficult to separate from noise.
Søren Sandmann Pedersen [Mon, 28 May 2012 06:36:22 +0000 (02:36 -0400)]
Speed up _pixman_composite_glyphs_no_mask()
Bypass much of the overhead of pixman_image_composite32() by only
computing the composite region once instead of once per glyph, and by
only looking up the composite function whenever the glyph format or
flags change.
As part of this, the pixman_compute_composite_region32() was renamed
to _pixman_compute_composite_region32() and exported in
pixman-private.h.
I couldn't find a trace that would reliably demonstrate that this is
actually an improvement by itself (since _pixman_composite_glyphs_no_mask()
is called so rarely), but together with the following optimization for
solid sources, there is a small but reliable improvement to the
xfce4-terminal-a1 cairo trace.
Søren Sandmann Pedersen [Mon, 28 May 2012 05:22:26 +0000 (01:22 -0400)]
Speed up pixman_composite_glyphs()
When adding glyphs to the mask, bypass most of the overhead of
pixman_image_composite32() by:
- Only looking up the composite function when the glyph changes either
format or flags.
- Only using a white source when the glyph format is different from
the mask format.
- Simply intersecting the glyph rectangle with the destination
rectangle instead of doing the full _pixman_composite_region32().
Performance results:
[ # ] backend test min(s) median(s) stddev. count
Before:
[ 0] image firefox-talos-gfx 6.570 6.577 0.13% 8/10
After:
[ 0] image firefox-talos-gfx 4.272 4.289 0.28% 10/10
V2: Changes to deal with white sources
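The simplified clipping step, intersecting the glyph rectangle with the destination rectangle, can be sketched as follows. The box type and function name are hypothetical, and boxes are assumed to be half-open (x2 and y2 exclusive).

```c
#include <assert.h>

/* Hypothetical half-open box: covers x1 <= x < x2 and y1 <= y < y2. */
typedef struct
{
    int x1, y1, x2, y2;
} box_t;

/* Store the intersection of a and b in out.
 * Returns nonzero when the intersection is non-empty. */
static int box_intersect(box_t *out, const box_t *a, const box_t *b)
{
    assert(a->x1 <= a->x2 && a->y1 <= a->y2);
    assert(b->x1 <= b->x2 && b->y1 <= b->y2);

    out->x1 = a->x1 > b->x1 ? a->x1 : b->x1;
    out->y1 = a->y1 > b->y1 ? a->y1 : b->y1;
    out->x2 = a->x2 < b->x2 ? a->x2 : b->x2;
    out->y2 = a->y2 < b->y2 ? a->y2 : b->y2;

    return out->x1 < out->x2 && out->y1 < out->y2;
}
```

A glyph whose intersection is empty can be skipped without looking up a composite function at all.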
Søren Sandmann Pedersen [Sun, 27 May 2012 22:23:20 +0000 (18:23 -0400)]
test: Add glyph-test
This test tests the new glyph cache and compositing API. Much of this
test is intended to make sure that clipping and alpha map handling
survive any optimizations that may be added to the glyph compositing.
V2: Evaluating lcg_rand_n() multiple times in an argument list led
to undefined behavior.
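C leaves the order in which function arguments are evaluated unspecified, so two calls with side effects in one argument list may run in either order. A minimal sketch of the portable pattern, using a hypothetical counter in place of lcg_rand_n():

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    int x, y;
} point_t;

/* Hypothetical stand-in for a generator with side effects. */
static int counter_next(int *counter)
{
    assert(counter != NULL);
    return (*counter)++;
}

static point_t make_point(int x, int y)
{
    point_t p;

    p.x = x;
    p.y = y;
    return p;
}

/* Portable: each call is a separate statement, so the order is fixed.
 * Writing make_point (counter_next (c), counter_next (c)) instead would
 * let the compiler pick either order. */
static point_t next_point(int *counter)
{
    int x = counter_next(counter);
    int y = counter_next(counter);

    return make_point(x, y);
}
```

Hoisting the calls into local variables makes the generated sequence identical across compilers, which matters for a test that compares checksums.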
Søren Sandmann Pedersen [Mon, 28 May 2012 20:14:12 +0000 (16:14 -0400)]
Add support for alpha maps to compute_crc32_for_image().
When a destination image I has an alpha map A, the following rules apply:
- If I has an alpha channel itself, the content of that channel is
undefined
- If A has RGB channels, the content of those channels is
undefined.
Hence in order to compute the CRC32 for such an image, we have to mask
off the alpha channel of the image, and the RGB channels of the alpha
map.
V2: Shifting by 32 is undefined in C
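A sketch of how a channel mask can be built without ever shifting a 32-bit value by 32. The helper names are hypothetical and this is not the code from utils.c.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low n_bits set; valid for n_bits in 0..32. */
static uint32_t low_bits_mask(int n_bits)
{
    assert(n_bits >= 0 && n_bits <= 32);

    if (n_bits == 32)
        return 0xffffffffu;   /* (uint32_t)1 << 32 would be undefined */

    return ((uint32_t)1 << n_bits) - 1;
}

/* Clear a channel of n_bits starting at bit position shift. */
static uint32_t mask_off_channel(uint32_t pixel, int shift, int n_bits)
{
    assert(shift >= 0 && shift < 32);
    assert(n_bits >= 0 && n_bits <= 32 - shift);

    return pixel & ~(low_bits_mask(n_bits) << shift);
}
```

For an a8r8g8b8 pixel, mask_off_channel (pixel, 24, 8) clears the alpha channel, while the 32-bit special case covers a format in which a single channel spans the whole pixel.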
Søren Sandmann Pedersen [Sun, 27 May 2012 17:38:14 +0000 (13:38 -0400)]
Move CRC32 computation from blitters-test.c into utils.c
This way it can be used in other tests.
Søren Sandmann Pedersen [Tue, 29 May 2012 08:14:38 +0000 (04:14 -0400)]
Add pixman_glyph_cache_t API
This new API allows entire glyph strings to be composited in one go
which reduces overhead compared to multiple calls to
pixman_image_composite32().
The pixman_glyph_cache_t is a hash table that maps two keys (a "font"
and a "glyph" key, but they are just keys; there is no distinction
between them as far as pixman is concerned) to a glyph. Glyphs in the
cache can be composited through two new entry points
pixman_glyph_cache_composite_glyphs() and
pixman_glyph_cache_composite_glyphs_no_mask().
A glyph cache may only be inserted into when it is "frozen", which is
achieved by calling pixman_glyph_cache_freeze(). When
pixman_glyph_cache_thaw() is later called, if the cache has become too
crowded, some glyphs (currently the least-recently-used) will
automatically be evicted. This means that a user must ensure that all
the required glyphs are present in the cache before compositing a
string. The intended way to use the cache is like this:
    pixman_glyph_t glyphs[MAX_GLYPHS];
    pixman_glyph_cache_freeze (cache);
    for (i = 0; i < n_glyphs; ++i)
    {
        const void *g;
        if (!(g = pixman_glyph_cache_lookup (cache, font_key, glyph_key)))
        {
            img = <rasterize glyph as a pixman_image_t>;
            g = pixman_glyph_cache_insert (cache, font_key, glyph_key,
                                           glyph_origin_x, glyph_origin_y,
                                           img);
            if (!g)
            {
                /* Clean up out-of-memory condition */
                goto oom;
            }
        }
        /* Record every glyph, whether it was cached or just inserted */
        glyphs[i].pos_x = glyph_x_pos;
        glyphs[i].pos_y = glyph_y_pos;
        glyphs[i].glyph = g;
    }
    pixman_composite_glyphs (op, src, dest, ..., cache, n_glyphs, glyphs);
    pixman_glyph_cache_thaw (cache);
V2:
- Move glyphs to front of the MRU list when they are used. Pointed
out by Behdad Esfahbod.
- Composite glyphs with (white IN glyph) ADD mask in order to support
mixed a8 and a8r8g8b8 glyphs. Also pointed out by Behdad.
- Add pixman_glyph_get_mask_format
Søren Sandmann Pedersen [Wed, 27 Apr 2011 16:07:16 +0000 (12:07 -0400)]
Add doubly linked lists
This commit adds some new inline functions to maintain a doubly linked
list.
The way to use them is to embed a pixman_link_t into the structures
that should be linked, and use a pixman_list_t as the head of the
list.
The new functions are
pixman_list_init (pixman_list_t *list);
pixman_list_prepend (pixman_list_t *list, pixman_link_t *link);
pixman_list_move_to_front (pixman_list_t *list, pixman_link_t *link);
There is also a new macro:
CONTAINER_OF(type, member, data);
that can be used to get from a pointer to a member to the containing
structure.
V2: Use the C89 macro offsetof() instead of rolling our own -
suggested by Alan Coopersmith.
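The embedding pattern can be sketched as below. This is an independent illustration with hypothetical names (link_t, list_t, entry_t) and a sentinel-based layout, not the implementation added by this commit; only the CONTAINER_OF argument order follows the description above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct link_t
{
    struct link_t *next, *prev;
} link_t;

/* The list head is a sentinel link; an empty list points at itself. */
typedef struct
{
    link_t sentinel;
} list_t;

#define CONTAINER_OF(type, member, data) \
    ((type *)((char *)(data) - offsetof (type, member)))

static void list_init(list_t *list)
{
    assert(list != NULL);
    list->sentinel.next = list->sentinel.prev = &list->sentinel;
}

static void list_prepend(list_t *list, link_t *link)
{
    link->next = list->sentinel.next;
    link->prev = &list->sentinel;
    list->sentinel.next->prev = link;
    list->sentinel.next = link;
}

static void list_unlink(link_t *link)
{
    link->prev->next = link->next;
    link->next->prev = link->prev;
    link->next = link->prev = NULL;
}

static void list_move_to_front(list_t *list, link_t *link)
{
    list_unlink(link);
    list_prepend(list, link);
}

/* Example payload with an embedded link, e.g. a cache entry. */
typedef struct
{
    int    glyph_id;
    link_t mru_link;
} entry_t;

/* Returns the first entry, or NULL when the list is empty. */
static entry_t *list_first_entry(list_t *list)
{
    if (list->sentinel.next == &list->sentinel)
        return NULL;
    return CONTAINER_OF(entry_t, mru_link, list->sentinel.next);
}
```

Because the link lives inside the payload, moving an entry to the front needs no allocation and no search, which suits most-recently-used bookkeeping.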
Søren Sandmann Pedersen [Thu, 24 May 2012 07:10:34 +0000 (03:10 -0400)]
Make use of image flags in mmx and sse2 iterators
Now that we have the full image flags available, the SSE2 and MMX
iterators can simply check against SAMPLES_COVER_CLIP_NEAREST (which
is computed in pixman_image_composite32()) instead of comparing all
the x/y/width/height parameters.
Søren Sandmann Pedersen [Thu, 24 May 2012 07:00:38 +0000 (03:00 -0400)]
Pass the full image flags to iterators
When pixman_image_composite32() is called some flags are computed that
indicate various things about the composite operation that can't be
deduced from the image flags themselves. These additional flags are
not currently available to iterators. All they can do is read the
image flags in image->common.flags.
Fix that by passing the info->{src, mask, dest}_flags on to the
iterator initialization and store the flags in the iter struct as
"image_flags". At the same time rename the *iterator* flags variable
to "iter_flags" to avoid confusion.
Matt Turner [Sun, 27 May 2012 17:01:57 +0000 (13:01 -0400)]
mmx: add missing _mm_empty calls
Fixes spurious test failures on x86-32.
Matt Turner [Fri, 18 May 2012 05:37:07 +0000 (01:37 -0400)]
mmx: add over_reverse_n_8888
Loongson:
over_reverse_n_8888 = L1: 16.04 L2: 15.35 M: 10.20 ( 27.96%) HT: 10.95 VT: 10.45 R: 9.18 RT: 6.99 ( 76Kops/s)
over_reverse_n_8888 = L1: 27.40 L2: 26.67 M: 16.97 ( 45.78%) HT: 16.66 VT: 15.38 R: 14.15 RT: 9.44 ( 97Kops/s)
image poppler 34.106 35.500 1.48% 6/6
image poppler 29.598 30.835 1.70% 6/6
ARM/iwMMXt:
over_reverse_n_8888 = L1: 15.63 L2: 14.33 M: 10.83 ( 27.55%) HT: 9.78 VT: 9.91 R: 9.49 RT: 6.96 ( 69Kops/s)
over_reverse_n_8888 = L1: 22.79 L2: 19.40 M: 13.76 ( 34.19%) HT: 11.66 VT: 11.86 R: 11.17 RT: 7.85 ( 75Kops/s)
image poppler 38.040 38.606 1.10% 6/6
image poppler 31.686 32.278 0.80% 5/6
Matt Turner [Fri, 18 May 2012 03:27:59 +0000 (23:27 -0400)]
mmx: add add_0565_0565
Loongson:
add_0565_0565 = L1: 15.37 L2: 14.91 M: 11.83 ( 16.06%) HT: 10.53 VT: 10.15 R: 9.74 RT: 6.19 ( 68Kops/s)
add_0565_0565 = L1: 45.06 L2: 46.71 M: 27.45 ( 38.00%) HT: 23.76 VT: 22.84 R: 18.96 RT: 9.79 ( 104Kops/s)
ARM/iwMMXt:
add_0565_0565 = L1: 12.87 L2: 11.58 M: 10.11 ( 12.50%) HT: 9.06 VT: 8.66 R: 7.70 RT: 5.62 ( 58Kops/s)
add_0565_0565 = L1: 31.14 L2: 28.87 M: 22.46 ( 28.60%) HT: 18.61 VT: 17.04 R: 15.21 RT: 9.35 ( 90Kops/s)
Matt Turner [Fri, 18 May 2012 03:29:51 +0000 (23:29 -0400)]
fast: add add_0565_0565 function
I'll need this code for header and tail alignment loops in MMX, so I
might as well implement a fast path here.
Matt Turner [Thu, 17 May 2012 17:22:18 +0000 (13:22 -0400)]
mmx: implement expand_4x565 in terms of expand_4xpacked565
Loongson:
over_n_0565 = L1: 38.57 L2: 38.88 M: 30.01 ( 20.97%) HT: 23.60 VT: 23.88 R: 21.95 RT: 11.65 ( 113Kops/s)
over_n_0565 = L1: 56.28 L2: 55.90 M: 34.20 ( 23.82%) HT: 25.66 VT: 26.60 R: 23.78 RT: 11.80 ( 115Kops/s)
over_8888_0565 = L1: 35.89 L2: 36.11 M: 21.56 ( 45.47%) HT: 18.33 VT: 17.90 R: 16.27 RT: 9.07 ( 98Kops/s)
over_8888_0565 = L1: 40.91 L2: 41.06 M: 23.13 ( 48.46%) HT: 19.24 VT: 18.71 R: 16.82 RT: 9.18 ( 99Kops/s)
over_n_8_0565 = L1: 28.92 L2: 29.12 M: 21.42 ( 30.00%) HT: 18.37 VT: 17.75 R: 16.15 RT: 8.79 ( 91Kops/s)
over_n_8_0565 = L1: 32.32 L2: 32.13 M: 22.44 ( 31.27%) HT: 19.15 VT: 18.66 R: 16.62 RT: 8.86 ( 92Kops/s)
over_n_8888_0565_ca = L1: 29.33 L2: 29.22 M: 18.99 ( 66.69%) HT: 16.69 VT: 16.22 R: 14.63 RT: 8.42 ( 88Kops/s)
over_n_8888_0565_ca = L1: 34.97 L2: 34.14 M: 20.32 ( 71.73%) HT: 17.67 VT: 17.19 R: 15.23 RT: 8.50 ( 89Kops/s)
ARM/iwMMXt:
over_n_0565 = L1: 29.70 L2: 30.53 M: 24.47 ( 14.84%) HT: 22.28 VT: 21.72 R: 21.13 RT: 12.58 ( 105Kops/s)
over_n_0565 = L1: 41.42 L2: 40.00 M: 30.95 ( 19.13%) HT: 27.06 VT: 27.28 R: 23.43 RT: 14.44 ( 114Kops/s)
over_8888_0565 = L1: 12.73 L2: 11.53 M: 9.07 ( 16.47%) HT: 9.00 VT: 9.25 R: 8.44 RT: 7.27 ( 76Kops/s)
over_8888_0565 = L1: 23.72 L2: 21.76 M: 15.89 ( 29.51%) HT: 14.36 VT: 14.05 R: 12.44 RT: 8.94 ( 86Kops/s)
over_n_8_0565 = L1: 6.80 L2: 7.15 M: 6.37 ( 7.90%) HT: 6.58 VT: 6.24 R: 6.49 RT: 5.94 ( 59Kops/s)
over_n_8_0565 = L1: 12.06 L2: 11.02 M: 10.16 ( 13.43%) HT: 9.57 VT: 8.49 R: 9.10 RT: 6.86 ( 69Kops/s)
over_n_8888_0565_ca = L1: 7.62 L2: 7.01 M: 6.27 ( 20.52%) HT: 6.00 VT: 6.07 R: 5.68 RT: 5.53 ( 57Kops/s)
over_n_8888_0565_ca = L1: 13.54 L2: 11.96 M: 9.76 ( 30.66%) HT: 9.72 VT: 8.45 R: 9.37 RT: 6.85 ( 67Kops/s)
Matt Turner [Mon, 14 May 2012 00:39:05 +0000 (20:39 -0400)]
mmx: add and use expand_4xpacked565 function
Loongson:
add_0565_0565 = L1: 14.39 L2: 13.98 M: 11.28 ( 15.22%) HT: 10.11 VT: 9.74 R: 9.39 RT: 6.05 ( 67Kops/s)
add_0565_0565 = L1: 15.37 L2: 14.91 M: 11.83 ( 16.06%) HT: 10.53 VT: 10.15 R: 9.74 RT: 6.19 ( 68Kops/s)
ARM/iwMMXt:
add_0565_0565 = L1: 11.12 L2: 10.40 M: 8.82 ( 10.65%) HT: 7.98 VT: 7.41 R: 7.57 RT: 5.21 ( 54Kops/s)
add_0565_0565 = L1: 12.87 L2: 11.58 M: 10.11 ( 12.50%) HT: 9.06 VT: 8.66 R: 7.70 RT: 5.62 ( 58Kops/s)
Søren Sandmann Pedersen [Sat, 26 May 2012 20:34:13 +0000 (16:34 -0400)]
Post-release version bump to 0.27.1
Søren Sandmann Pedersen [Sat, 26 May 2012 20:17:14 +0000 (16:17 -0400)]
Pre-release version bump to 0.26.0
Ingmar Runge [Sat, 19 May 2012 13:45:18 +0000 (15:45 +0200)]
Fix MSVC compilation
Only up to three SSE intrinsics supported in function declaration.
Søren Sandmann Pedersen [Thu, 24 May 2012 19:30:41 +0000 (15:30 -0400)]
test: Composite with solid images instead of using pixman_image_fill_*
There are a couple of places where the test suite uses the
pixman_image_fill_* functions to initialize images. These functions
can fail, and will do so if the "fast" implementation is disabled.
So to make sure the test suite passes even using
PIXMAN_DISABLE="fast", use pixman_image_composite32() with a solid
image instead of pixman_image_fill_*.
Nemanja Lukic [Wed, 2 May 2012 22:03:43 +0000 (00:03 +0200)]
MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
Performance numbers before/after on MIPS-74kc @ 1GHz
Referent (before):
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.3
[ 0] image firefox-fishtank 2289.180 2290.567 0.05% 5/6
Optimized:
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.3
[ 0] image firefox-fishtank 1700.925 1708.314 0.22% 5/6
Nemanja Lukic [Wed, 23 May 2012 16:53:43 +0000 (18:53 +0200)]
MIPS: DSPr2: Fix bug in over_n_8888_8888_ca/over_n_8888_0565_ca routines
In the main loop (unrolled by a factor of 2), instead of negating the
mask values multiplied by srca, the value of srca was negated and passed
as the alpha argument to the UN8x4_MUL_UN8x4_ADD_UN8x4 macro.
Instead of:
ma = ~ma;
UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
Code was doing this:
ma = ~srca;
UN8x4_MUL_UN8x4_ADD_UN8x4 (d, ma, s);
The key is substituting registers s0/s1 (containing the srca value) with
t0/t1, which contain the mask values multiplied by srca. Register usage
is also improved (fewer registers are saved on the stack for the
over_n_8888_8888_ca routine).
The bug was introduced in commit d2ee5631 and revealed by the composite test.
Søren Sandmann Pedersen [Sun, 20 May 2012 17:09:16 +0000 (13:09 -0400)]
demos: Add parrot.jpg to EXTRA_DIST
Pointed out by Cyril Brulebois.
Matt Turner [Tue, 15 May 2012 20:32:08 +0000 (16:32 -0400)]
configure.ac: Fail the ARM/iwMMXt test if not compiling with -march=iwmmxt
If not compiling with -march=iwmmxt, the configure test will still pass,
thinking that the __builtin_arm_* intrinsic is a function instead of
generating a single instruction. Since no linking is done, the configure
test doesn't catch this, and we get linking errors in the build.
Søren Sandmann Pedersen [Tue, 15 May 2012 17:38:44 +0000 (13:38 -0400)]
Post-release version bump to 0.25.7
Søren Sandmann Pedersen [Tue, 15 May 2012 17:20:09 +0000 (13:20 -0400)]
Pre-release version bump to 0.25.6
Note that 0.25.4 was a botched release that doesn't have a tag and
doesn't correspond to any commit ID. It was however uploaded and
announced, so I'll just use the 0.25.6 version number.
Søren Sandmann Pedersen [Tue, 15 May 2012 17:19:19 +0000 (13:19 -0400)]
demos/Makefile.am: Add parrot.c to EXTRA_DIST
To get 'make distcheck' to pass.
Matt Turner [Sat, 12 May 2012 01:59:13 +0000 (21:59 -0400)]
configure.ac: Rename loongson -> loongson-mmi
Make it match with the other fast paths, and the PIXMAN_DISABLE value is
already loongson-mmi.
Matt Turner [Sat, 12 May 2012 01:49:42 +0000 (21:49 -0400)]
configure.ac: Fix loongson-mmi out-of-tree builds
When building out-of-tree, gcc wasn't able to find loongson-mmintrin.h
to compile the test program. Add -I$srcdir to CFLAGS to point gcc to it.
Nemanja Lukic [Wed, 2 May 2012 22:03:42 +0000 (00:03 +0200)]
MIPS: DSPr2: Added over_n_8_8888 and over_n_8_0565 fast paths.
Performance numbers before/after on MIPS-74kc @ 1GHz
Referent (before):
lowlevel-blt-bench:
over_n_8_8888 = L1: 10.40 L2: 9.79 M: 8.47 ( 33.62%) HT: 7.64 VT: 7.59 R: 7.48 RT: 5.30 ( 40Kops/s)
over_n_8_0565 = L1: 7.40 L2: 7.23 M: 6.78 ( 17.94%) HT: 6.23 VT: 6.17 R: 6.14 RT: 4.62 ( 37Kops/s)
Optimized:
lowlevel-blt-bench:
over_n_8_8888 = L1: 27.25 L2: 26.24 M: 18.15 ( 72.12%) HT: 14.52 VT: 14.31 R: 13.83 RT: 7.57 ( 48Kops/s)
over_n_8_0565 = L1: 18.91 L2: 17.59 M: 15.06 ( 39.90%) HT: 12.18 VT: 11.98 R: 11.83 RT: 6.80 ( 46Kops/s)
Matt Turner [Wed, 9 May 2012 23:20:55 +0000 (19:20 -0400)]
mmx: add and use pack_4x565 function
The pack_4x565 makes use of the pack_4xpacked565 function which uses pmadd.
Some of the speed up is probably attributable to removing the artificial
serialization imposed by the
vdest = pack_565 (..., vdest, 0);
vdest = pack_565 (..., vdest, 1);
...
pattern.
Loongson:
over_n_0565 = L1: 16.44 L2: 16.42 M: 13.83 ( 9.85%) HT: 12.83 VT: 12.61 R: 12.34 RT: 8.90 ( 93Kops/s)
over_n_0565 = L1: 42.48 L2: 42.53 M: 29.83 ( 21.20%) HT: 23.39 VT: 23.72 R: 21.80 RT: 11.60 ( 113Kops/s)
over_8888_0565 = L1: 15.61 L2: 15.42 M: 12.11 ( 25.79%) HT: 11.07 VT: 10.70 R: 10.37 RT: 7.25 ( 82Kops/s)
over_8888_0565 = L1: 35.01 L2: 35.20 M: 21.42 ( 45.57%) HT: 18.12 VT: 17.61 R: 16.09 RT: 9.01 ( 97Kops/s)
over_n_8_0565 = L1: 15.17 L2: 14.94 M: 12.57 ( 17.86%) HT: 11.96 VT: 11.52 R: 10.79 RT: 7.31 ( 79Kops/s)
over_n_8_0565 = L1: 29.83 L2: 29.79 M: 21.85 ( 30.94%) HT: 18.82 VT: 18.25 R: 16.15 RT: 8.72 ( 91Kops/s)
over_n_8888_0565_ca = L1: 15.25 L2: 15.02 M: 11.64 ( 41.39%) HT: 11.08 VT: 10.72 R: 10.02 RT: 7.00 ( 77Kops/s)
over_n_8888_0565_ca = L1: 30.12 L2: 29.99 M: 19.47 ( 68.99%) HT: 17.05 VT: 16.55 R: 14.67 RT: 8.38 ( 88Kops/s)
ARM/iwMMXt:
over_n_0565 = L1: 19.29 L2: 19.88 M: 17.38 ( 10.54%) HT: 15.53 VT: 16.11 R: 13.69 RT: 11.00 ( 96Kops/s)
over_n_0565 = L1: 36.02 L2: 34.85 M: 28.04 ( 16.97%) HT: 22.12 VT: 24.21 R: 22.36 RT: 12.22 ( 103Kops/s)
over_8888_0565 = L1: 18.38 L2: 16.59 M: 12.34 ( 22.29%) HT: 11.67 VT: 11.71 R: 11.02 RT: 6.89 ( 72Kops/s)
over_8888_0565 = L1: 24.96 L2: 22.17 M: 15.11 ( 26.81%) HT: 14.14 VT: 13.71 R: 13.18 RT: 8.13 ( 78Kops/s)
over_n_8_0565 = L1: 14.65 L2: 12.44 M: 11.56 ( 14.50%) HT: 10.93 VT: 10.39 R: 10.06 RT: 7.05 ( 70Kops/s)
over_n_8_0565 = L1: 18.37 L2: 14.98 M: 13.97 ( 16.51%) HT: 12.67 VT: 10.35 R: 11.80 RT: 8.14 ( 74Kops/s)
over_n_8888_0565_ca = L1: 14.27 L2: 12.93 M: 10.52 ( 33.23%) HT: 9.70 VT: 9.90 R: 9.31 RT: 6.34 ( 65Kops/s)
over_n_8888_0565_ca = L1: 19.69 L2: 17.58 M: 13.40 ( 42.35%) HT: 11.75 VT: 11.33 R: 11.17 RT: 7.49 ( 73Kops/s)
Matt Turner [Thu, 10 May 2012 20:15:34 +0000 (16:15 -0400)]
configure.ac: make -march=loongson2f come before CFLAGS
Otherwise we'd have -march=loongson2f being overridden by automake's
CFLAGS ordering which causes build failures when -march=<not loongson2f>
is specified by the user.
Søren Sandmann Pedersen [Tue, 8 May 2012 14:05:18 +0000 (10:05 -0400)]
Add Makefile.win32 and Makefile.win32.common to EXTRA_DIST
https://bugs.freedesktop.org/show_bug.cgi?id=46905
Matt Turner [Thu, 10 May 2012 02:50:50 +0000 (22:50 -0400)]
.gitignore: add demos/checkerboard and demos/quad2quad
Matt Turner [Fri, 27 Apr 2012 18:12:56 +0000 (14:12 -0400)]
mmx: Use wpackhus in src_x888_0565 on iwMMXt
iwMMXt has an unsigned saturation pack instruction, while MMX/EXT
and Loongson don't.
ARM/iwMMXt:
src_8888_0565 = L1: 110.38 L2: 82.33 M: 40.92 ( 73.22%) HT: 35.63 VT: 32.22 R: 30.07 RT: 18.40 ( 132Kops/s)
src_8888_0565 = L1: 117.91 L2: 83.05 M: 41.52 ( 75.58%) HT: 37.63 VT: 35.40 R: 29.37 RT: 19.39 ( 134Kops/s)
Matt Turner [Thu, 19 Apr 2012 21:33:27 +0000 (17:33 -0400)]
mmx: add src_8888_0565
Uses the pmadd technique described in
http://software.intel.com/sites/landingpage/legacy/mmx/MMX_App_24-16_Bit_Conversion.pdf
The technique uses the packssdw instruction, which uses signed
saturation. This works in their example because they pack 888 to 555,
leaving the high bit as zero. For packing to 565 it is unsuitable, so
we replace it with an or+shuffle.
Loongson:
src_8888_0565 = L1: 106.13 L2: 83.57 M: 33.46 ( 68.90%) HT: 30.29 VT: 27.67 R: 26.11 RT: 15.06 ( 135Kops/s)
src_8888_0565 = L1: 122.10 L2: 117.53 M: 37.97 ( 78.58%) HT: 33.14 VT: 30.09 R: 29.01 RT: 15.76 ( 139Kops/s)
ARM/iwMMXt:
src_8888_0565 = L1: 67.88 L2: 56.61 M: 31.20 ( 56.74%) HT: 29.22 VT: 27.01 R: 25.39 RT: 19.29 ( 130Kops/s)
src_8888_0565 = L1: 110.38 L2: 82.33 M: 40.92 ( 73.22%) HT: 35.63 VT: 32.22 R: 30.07 RT: 18.40 ( 132Kops/s)
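The reason signed saturation is harmless for 555 but not for 565 can be seen in a scalar sketch of the two conversions. These helpers are hypothetical reference code, not the MMX implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar x8r8g8b8 -> r5g6b5: keep the top 5, 6 and 5 bits of r, g, b. */
static uint16_t convert_8888_to_0565(uint32_t s)
{
    return (uint16_t)(((s >> 8) & 0xf800u) |
                      ((s >> 5) & 0x07e0u) |
                      ((s >> 3) & 0x001fu));
}

/* Scalar x8r8g8b8 -> x1r5g5b5: bit 15 of the result is always zero. */
static uint16_t convert_8888_to_0555(uint32_t s)
{
    return (uint16_t)(((s >> 9) & 0x7c00u) |
                      ((s >> 6) & 0x03e0u) |
                      ((s >> 3) & 0x001fu));
}

/* A signed-saturating pack leaves only values up to 0x7fff unchanged. */
static int fits_signed_pack(uint16_t v)
{
    return v <= 0x7fffu;
}

/* Self-check for a fully saturated red pixel. */
static void check_red_pixel(void)
{
    assert(fits_signed_pack(convert_8888_to_0555(0x00ff0000u)));
    assert(!fits_signed_pack(convert_8888_to_0565(0x00ff0000u)));
}
```

Any 565 pixel whose red channel is 0x80 or above has bit 15 set, so a signed-saturating pack would clamp it to 0x7fff.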
Matt Turner [Wed, 18 Apr 2012 20:24:28 +0000 (16:24 -0400)]
mmx: add x8f8g8b8 fetcher
Loongson:
add_x888_x888 = L1: 29.36 L2: 27.81 M: 14.05 ( 38.74%) HT: 12.45 VT: 11.78 R: 11.52 RT: 7.23 ( 75Kops/s)
add_x888_x888 = L1: 36.06 L2: 34.55 M: 14.81 ( 41.03%) HT: 14.01 VT: 13.41 R: 13.06 RT: 9.06 ( 90Kops/s)
src_x888_8_x888 = L1: 21.92 L2: 20.15 M: 13.35 ( 41.42%) HT: 11.70 VT: 10.95 R: 10.53 RT: 6.18 ( 65Kops/s)
src_x888_8_x888 = L1: 25.43 L2: 23.51 M: 14.12 ( 44.00%) HT: 13.14 VT: 12.50 R: 11.86 RT: 7.49 ( 76Kops/s)
over_x888_8_0565 = L1: 10.64 L2: 10.17 M: 7.74 ( 21.35%) HT: 6.83 VT: 6.55 R: 6.34 RT: 4.03 ( 46Kops/s)
over_x888_8_0565 = L1: 11.41 L2: 10.97 M: 8.07 ( 22.36%) HT: 7.42 VT: 7.18 R: 6.92 RT: 4.62 ( 52Kops/s)
ARM/iwMMXt:
add_x888_x888 = L1: 22.10 L2: 18.93 M: 13.48 ( 32.29%) HT: 11.32 VT: 10.64 R: 10.36 RT: 6.51 ( 61Kops/s)
add_x888_x888 = L1: 24.26 L2: 20.83 M: 14.52 ( 35.64%) HT: 12.66 VT: 12.98 R: 11.34 RT: 7.69 ( 72Kops/s)
src_x888_8_x888 = L1: 19.33 L2: 17.66 M: 14.26 ( 38.43%) HT: 11.53 VT: 10.83 R: 10.57 RT: 6.12 ( 58Kops/s)
src_x888_8_x888 = L1: 21.23 L2: 19.60 M: 15.41 ( 42.55%) HT: 12.66 VT: 13.30 R: 11.55 RT: 7.32 ( 67Kops/s)
over_x888_8_0565 = L1: 8.15 L2: 7.56 M: 6.50 ( 15.58%) HT: 5.73 VT: 5.49 R: 5.50 RT: 3.53 ( 38Kops/s)
over_x888_8_0565 = L1: 8.35 L2: 7.85 M: 6.68 ( 16.40%) HT: 6.12 VT: 5.97 R: 5.78 RT: 4.03 ( 43Kops/s)
Matt Turner [Wed, 18 Apr 2012 20:14:08 +0000 (16:14 -0400)]
mmx: add a8 fetcher
oprofile of xfce4-terminal-a1
210535 9.0407 libpixman-1.so.0.25.3 fetch_scanline_a8
144802 6.0054 libpixman-1.so.0.25.3 mmx_fetch_a8
Loongson:
add_8_8_8 = L1: 17.98 L2: 17.28 M: 14.28 ( 19.79%) HT: 11.11 VT: 10.38 R: 9.97 RT: 5.14 ( 55Kops/s)
add_8_8_8 = L1: 20.44 L2: 19.65 M: 15.62 ( 21.53%) HT: 12.86 VT: 11.98 R: 11.32 RT: 6.13 ( 64Kops/s)
src_8888_8_0565 = L1: 19.97 L2: 18.59 M: 13.42 ( 32.55%) HT: 11.46 VT: 10.78 R: 10.33 RT: 5.87 ( 61Kops/s)
src_8888_8_0565 = L1: 21.16 L2: 19.68 M: 13.94 ( 33.64%) HT: 12.31 VT: 11.52 R: 11.02 RT: 6.54 ( 68Kops/s)
src_x888_8_x888 = L1: 20.54 L2: 18.88 M: 13.07 ( 40.74%) HT: 11.05 VT: 10.36 R: 10.02 RT: 5.68 ( 60Kops/s)
src_x888_8_x888 = L1: 21.92 L2: 20.15 M: 13.35 ( 41.42%) HT: 11.70 VT: 10.95 R: 10.53 RT: 6.18 ( 65Kops/s)
over_x888_8_0565 = L1: 10.32 L2: 9.85 M: 7.63 ( 21.13%) HT: 6.56 VT: 6.30 R: 6.12 RT: 3.80 ( 43Kops/s)
over_x888_8_0565 = L1: 10.64 L2: 10.17 M: 7.74 ( 21.35%) HT: 6.83 VT: 6.55 R: 6.34 RT: 4.03 ( 46Kops/s)
ARM/iwMMXt:
add_8_8_8 = L1: 13.10 L2: 11.67 M: 10.74 ( 13.46%) HT: 8.62 VT: 8.15 R: 7.94 RT: 4.39 ( 44Kops/s)
add_8_8_8 = L1: 13.81 L2: 12.79 M: 11.63 ( 13.93%) HT: 9.33 VT: 9.20 R: 9.04 RT: 5.43 ( 52Kops/s)
src_8888_8_0565 = L1: 16.62 L2: 15.07 M: 12.52 ( 27.46%) HT: 10.07 VT: 10.17 R: 9.95 RT: 5.64 ( 54Kops/s)
src_8888_8_0565 = L1: 16.84 L2: 16.11 M: 13.22 ( 27.71%) HT: 11.74 VT: 10.90 R: 10.80 RT: 6.66 ( 62Kops/s)
src_x888_8_x888 = L1: 17.49 L2: 16.22 M: 13.73 ( 38.73%) HT: 10.10 VT: 10.33 R: 9.55 RT: 5.21 ( 52Kops/s)
src_x888_8_x888 = L1: 19.33 L2: 17.66 M: 14.26 ( 38.43%) HT: 11.53 VT: 10.83 R: 10.57 RT: 6.12 ( 58Kops/s)
over_x888_8_0565 = L1: 7.57 L2: 7.29 M: 6.37 ( 15.97%) HT: 5.53 VT: 5.33 R: 5.21 RT: 3.22 ( 35Kops/s)
over_x888_8_0565 = L1: 8.15 L2: 7.56 M: 6.50 ( 15.58%) HT: 5.73 VT: 5.49 R: 5.50 RT: 3.53 ( 38Kops/s)
Matt Turner [Wed, 18 Apr 2012 20:08:57 +0000 (16:08 -0400)]
mmx: add r5g6b5 fetcher
Loongson:
add_0565_0565 = L1: 12.73 L2: 12.26 M: 10.05 ( 13.87%) HT: 8.77 VT: 8.50 R: 8.25 RT: 5.28 ( 58Kops/s)
add_0565_0565 = L1: 14.04 L2: 13.63 M: 10.96 ( 15.19%) HT: 9.73 VT: 9.43 R: 9.11 RT: 5.93 ( 64Kops/s)
ARM/iwMMXt:
add_0565_0565 = L1: 10.36 L2: 10.03 M: 9.04 ( 10.88%) HT: 3.11 VT: 7.16 R: 7.72 RT: 5.12 ( 51Kops/s)
add_0565_0565 = L1: 10.84 L2: 10.20 M: 9.15 ( 11.46%) HT: 7.60 VT: 7.82 R: 7.70 RT: 5.41 ( 53Kops/s)
Matt Turner [Tue, 17 Apr 2012 16:16:55 +0000 (12:16 -0400)]
mmx: Use Loongson pextrh instruction in expand565
Same story as pinsrh in the previous commit.
text data bss dec hex filename
25336 1952 0 27288 6a98 .libs/libpixman_loongson_mmi_la-pixman-mmx.o
25072 1952 0 27024 6990 .libs/libpixman_loongson_mmi_la-pixman-mmx.o
-dsll: 95
+dsll: 70
-dsrl: 135
+dsrl: 105
-ldc1: 462
+ldc1: 445
-lw: 721
+lw: 700
+pextrh: 30
Matt Turner [Tue, 17 Apr 2012 15:28:33 +0000 (11:28 -0400)]
mmx: Use Loongson pinsrh instruction in pack_565
The pinsrh instruction is analogous to MMX EXT's pinsrw, except like
other Loongson vector instructions it cannot access the general purpose
registers. In the cases of other Loongson vector instructions, this is a
headache, but it is actually a good thing here. Since the instruction is
different from MMX, I've named the intrinsic loongson_insert_pi16.
text data bss dec hex filename
25976 1952 0 27928 6d18 .libs/libpixman_loongson_mmi_la-pixman-mmx.o
25336 1952 0 27288 6a98 .libs/libpixman_loongson_mmi_la-pixman-mmx.o
-and: 181
+and: 147
-dsll: 143
+dsll: 95
-dsrl: 87
+dsrl: 135
-ldc1: 523
+ldc1: 462
-lw: 767
+lw: 721
+pinsrh: 35
Matt Turner [Fri, 24 Feb 2012 20:23:09 +0000 (15:23 -0500)]
mmx: don't pack and unpack src unnecessarily
The combine function was store8888'ing the result, and all consumers
were immediately load8888'ing it, causing lots of unnecessary pack and
unpack instructions.
It's a very straightforward conversion, except for mmx_combine_over_u
and mmx_combine_saturate_u. mmx_combine_over_u was testing the integer
result to skip pixels, so we use the is_* functions to test the __m64
data directly without loading it into an integer register.
For mmx_combine_saturate_u there's not a lot we can do, since it uses
DIV_UN8.
Matt Turner [Fri, 24 Feb 2012 22:39:39 +0000 (17:39 -0500)]
mmx: introduce is_equal, is_opaque, and is_zero functions
To be used by the next commit.
Matt Turner [Thu, 23 Feb 2012 21:25:11 +0000 (16:25 -0500)]
mmx: simplify srcsrcsrcsrc calculation in over_n_8_0565
Matt Turner [Thu, 23 Feb 2012 21:15:56 +0000 (16:15 -0500)]
mmx: remove unnecessary uint64_t<->__m64 conversions
Loongson:
add_8888_8888 = L1: 68.73 L2: 55.09 M: 25.39 ( 68.18%) HT: 25.28 VT: 22.42 R: 20.74 RT: 13.26 ( 131Kops/s)
add_8888_8888 = L1: 159.19 L2: 114.10 M: 30.74 ( 77.91%) HT: 27.63 VT: 24.99 R: 24.61 RT: 14.49 ( 141Kops/s)
Matt Turner [Fri, 24 Feb 2012 17:43:43 +0000 (12:43 -0500)]
mmx: compile on MIPS for Loongson MMI optimizations
                      image                 image16
evolution              32.985 ->  29.667     27.314 ->  23.870
firefox-planet-gnome  197.982 -> 180.437    220.986 -> 205.057
gnome-system-monitor   48.482 ->  49.752     52.820 ->  49.528
gnome-terminal-vim     60.799 ->  50.528     51.655 ->  44.131
grads-heat-map          3.167 ->   3.181      3.328 ->   3.321
gvim                   38.646 ->  32.552     38.126 ->  34.453
midori-zoomed          44.371 ->  43.338     28.860 ->  28.865
ocitysmap              23.065 ->  18.057     23.046 ->  18.055
poppler                43.676 ->  36.077     43.065 ->  36.090
swfdec-giant-steps     20.166 ->  20.365     22.354 ->  16.578
swfdec-youtube         31.502 ->  28.118     44.052 ->  41.771
xfce4-terminal-a1      69.517 ->  51.288     62.225 ->  53.309
Matt Turner [Wed, 15 Feb 2012 06:19:07 +0000 (01:19 -0500)]
mmx: make ldq_u take __m64* directly
Before, if __m64 is allocated in vector or floating-point registers,
__m64 vs = ldq_u((uint64_t *)src);
would cause src to be loaded into an integer register and then
transferred to an __m64 register. By switching ldq_u's argument type to
__m64 * we give the compiler enough information to recognize that it can
load to the vector register directly.
This patch is necessary for the Loongson optimizations when __m64 is
typedef'd as double.
Matt Turner [Fri, 24 Feb 2012 17:34:41 +0000 (12:34 -0500)]
mmx: add load function and use it in add_8888_8888
Matt Turner [Fri, 24 Feb 2012 17:32:03 +0000 (12:32 -0500)]
mmx: add store function and use it in add_8888_8888
Søren Sandmann Pedersen [Thu, 5 Apr 2012 04:52:21 +0000 (00:52 -0400)]
bits_image_fetch_pixel_convolution(): Make sure channels are signed
In the computation:
srtot += RED_8 (pixel) * f
RED_8 (pixel) is an unsigned quantity, which means the signed filter
coefficient f gets converted to an unsigned integer before the
multiplication. We get away with this because when the 32 bit unsigned
result is converted to int32_t, the correct sign is produced. But if
srtot had been an int64_t, the result would have been a very large
positive number.
Fix this by explicitly casting the channels to int.
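The conversion described above can be reproduced in isolation. The
following is a minimal sketch, assuming a 32-bit int; the RED_8 macro
here is a stand-in written for this example, and both function names
are hypothetical. It accumulates into an int64_t to show the "very
large positive number" case, and then the same product with the
channel cast to int.

```c
#include <stdint.h>

/* Stand-in for RED_8: extracts the red channel of an a8r8g8b8 pixel.
 * The result has type uint32_t, like the pixel it was taken from. */
#define RED_8(p) (((p) >> 16) & 0xffu)

/* Hypothetical helper: unsigned channel times signed coefficient.
 * f is converted to unsigned before the multiplication, so a negative
 * coefficient wraps around to a large 32-bit value. */
static int64_t accumulate_unsigned (uint32_t pixel, int32_t f)
{
    int64_t tot = 0;

    tot += RED_8 (pixel) * f;        /* unsigned * int -> unsigned */
    return tot;
}

/* Hypothetical helper: same computation with the channel cast to int,
 * so the product is formed in signed arithmetic. */
static int64_t accumulate_signed (uint32_t pixel, int32_t f)
{
    int64_t tot = 0;

    tot += (int) RED_8 (pixel) * f;  /* int * int -> int */
    return tot;
}
```

With a red value of 2 and f = -3, the unsigned variant yields
4294967290 while the signed variant yields -6.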
Søren Sandmann Pedersen [Thu, 5 Apr 2012 04:42:55 +0000 (00:42 -0400)]
test/utils.c: Clip values to the [0, 255] interval
Unpremultiplying a superluminescent pixel can result in values greater
than 255.
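As an illustration of why clipping is needed, here is a minimal
sketch of unpremultiplying one channel. It assumes that
"superluminescent" means a premultiplied channel larger than its
alpha; the function names and the rounding choice are hypothetical
and not taken from test/utils.c.

```c
#include <stdint.h>

/* Clamp an integer to the [0, 255] interval. */
static uint8_t clip_255 (int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t) v;
}

/* Divide a premultiplied channel c by alpha a (both 0..255), rounding
 * to nearest. When c > a the quotient exceeds 255, so it is clipped.
 * A zero alpha is mapped to 0 to avoid dividing by zero. */
static uint8_t unpremultiply_channel (uint8_t c, uint8_t a)
{
    if (a == 0)
        return 0;
    return clip_255 ((c * 255 + a / 2) / a);
}
```

For c = 200 and a = 100 the unclipped quotient is 510, which the
clamp reduces to 255.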
Matt Turner [Wed, 18 Apr 2012 22:14:13 +0000 (18:14 -0400)]
configure.ac: fix iwMMXt/gcc version error message
Matt Turner [Sun, 15 Apr 2012 18:03:08 +0000 (14:03 -0400)]
mmx: fix _mm_shuffle_pi16 function when compiling without optimization
The last argument must be an immediate value, and when compiling without
optimization the compiler might not recognize this. So use a macro if
not optimizing.
Matt Turner [Sun, 15 Apr 2012 18:00:17 +0000 (14:00 -0400)]
configure.ac: require >= gcc-4.5 for ARM iwMMXt
We're using a patched gcc-4.5, and having to modify configure.ac and
autoreconf between changes is annoying. And besides, 4.5, 4.6, and 4.7's
iwMMXt intrinsic support is equally broken, and we test a known broken
intrinsic in the configure test program, so the version check is rather
meaningless.
Matt Turner [Thu, 5 Apr 2012 21:36:05 +0000 (17:36 -0400)]
mmx: Use force_inline instead of __inline__ (bug 46906)
Fixes the build on MSVC.
Matt Turner [Thu, 15 Mar 2012 23:16:20 +0000 (19:16 -0400)]
mmx: enable over_n_0565 for b5g6r5
Signed-off-by: Matt Turner <mattst88@gmail.com>
Søren Sandmann Pedersen [Mon, 2 Apr 2012 19:16:18 +0000 (15:16 -0400)]
gtk-utils.c: In pixbuf_from_argb32() use a8r8g8b8_to_rgba_np()
Instead of inlining a copy of that functionality.
Søren Sandmann Pedersen [Mon, 2 Apr 2012 19:09:16 +0000 (15:09 -0400)]
test/utils.c: Rename and export the pngify_pixels() function.
This function converts from a8r8g8b8 to non-premultiplied RGBA (the
PNG or GdkPixbuf format that has the channels in this order: R, G, B,
A in memory regardless of the computer's endianness). The function's
new name is a8r8g8b8_to_rgba_np().
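A conversion of this kind can be sketched as follows. This is not
the code from test/utils.c; argb32_to_rgba_bytes and unpremul are
hypothetical names, and the rounding is an assumption. Writing the
output one byte at a time is what makes the R, G, B, A memory order
independent of endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpremultiply channel c by alpha a, clipping the result to 255. */
static uint8_t unpremul (uint32_t c, uint32_t a)
{
    uint32_t v;

    if (a == 0)
        return 0;
    v = (c * 255 + a / 2) / a;
    return v > 255 ? 255 : (uint8_t) v;
}

/* Convert n_pixels premultiplied a8r8g8b8 words into non-premultiplied
 * bytes stored as R, G, B, A. dst must hold 4 * n_pixels bytes. */
static void argb32_to_rgba_bytes (const uint32_t *src,
                                  uint8_t        *dst,
                                  size_t          n_pixels)
{
    size_t i;

    for (i = 0; i < n_pixels; ++i)
    {
        uint32_t p = src[i];
        uint32_t a = p >> 24;

        dst[4 * i + 0] = unpremul ((p >> 16) & 0xff, a);  /* R */
        dst[4 * i + 1] = unpremul ((p >> 8) & 0xff, a);   /* G */
        dst[4 * i + 2] = unpremul (p & 0xff, a);          /* B */
        dst[4 * i + 3] = (uint8_t) a;                     /* A */
    }
}
```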
Søren Sandmann Pedersen [Mon, 2 Apr 2012 18:59:02 +0000 (14:59 -0400)]
gtk-utils.c: Don't include pixman-private.h
Use pixman_image_get_format() instead of image->bits.format.
Søren Sandmann Pedersen [Sun, 25 Mar 2012 16:14:54 +0000 (12:14 -0400)]
Rename fast_composite_add_1000_1000 to _add_1_1()
The 1000_1000 name is a relic from before the refactoring.
Søren Sandmann Pedersen [Sun, 16 Jan 2011 11:46:52 +0000 (06:46 -0500)]
Add the original parrot image.
This is the Parrot image that was downscaled and cropped before being
used in the composite-test.c demo.
Søren Sandmann Pedersen [Wed, 6 Oct 2010 10:06:59 +0000 (06:06 -0400)]
composite-test.c: Add a parrot image
Instead of the yellow square, use a parrot as the source image. This
demonstrates the various blend modes much better.
The parrot is a cropped version of a finger painting by Rubens LP:
http://www.flickr.com/photos/dorubens/4030604504/in/set-72157622586088192/
where the background has been removed. Used here under Creative
Commons Attribution. The artist's web site:
http://www.rubenslp.com.br/
Søren Sandmann Pedersen [Wed, 6 Oct 2010 07:56:55 +0000 (03:56 -0400)]
composite-test.c: Use similar gradient to the one in the PDF spec.
Søren Sandmann Pedersen [Wed, 12 Oct 2011 08:49:27 +0000 (04:49 -0400)]
demos: Add checkerboard demo
This is a simple demo that displays a checkerboard with a projective
transformation.
Søren Sandmann Pedersen [Wed, 12 Oct 2011 08:48:33 +0000 (04:48 -0400)]
demos: Add quad2quad program
This program can compute the projective transformation that transforms
one quadrilateral into another. The code is basically maxima[1] output
translated into C.
[1] http://maxima.sourceforge.net/
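For context, applying a projective transformation to a point is a
3x3 matrix product followed by a perspective divide. The sketch
below uses doubles and hypothetical names (homography_t,
apply_homography); it does not reproduce the quad2quad solver, and
pixman's own transform types are not used here.

```c
/* Hypothetical 3x3 projective matrix in row-major order. */
typedef struct
{
    double m[3][3];
} homography_t;

/* Map (x, y) through h. Returns 1 and stores the result in
 * (*out_x, *out_y), or returns 0 if the homogeneous coordinate w is
 * zero, in which case the point maps to infinity. */
static int apply_homography (const homography_t *h,
                             double x, double y,
                             double *out_x, double *out_y)
{
    double u = h->m[0][0] * x + h->m[0][1] * y + h->m[0][2];
    double v = h->m[1][0] * x + h->m[1][1] * y + h->m[1][2];
    double w = h->m[2][0] * x + h->m[2][1] * y + h->m[2][2];

    if (w == 0.0)
        return 0;

    *out_x = u / w;
    *out_y = v / w;
    return 1;
}
```

A matrix whose bottom row is (0, 0, 1) reduces to an affine
transformation; any other bottom row makes the divide depend on the
point, which is what produces the perspective effect.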
Søren Sandmann Pedersen [Wed, 14 Mar 2012 21:11:14 +0000 (17:11 -0400)]
Use "=a" and "=d" constraints for rdtsc inline assembly
In 32 bit mode the "=A" constraint refers to the register pair
edx:eax, but according to GCC developers this is not the case in 64
bit mode, where it refers to "rax".
Hence, using "=A" for rdtsc is incorrect in 64 bit mode.
See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21249
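A minimal sketch of the two-register form, assuming GCC-style inline
assembly on x86. The names read_tsc and combine_tsc are hypothetical,
and the non-x86 fallback that returns 0 is only there so the example
compiles everywhere.

```c
#include <stdint.h>

/* Join the low and high 32-bit halves into one 64-bit counter. */
static uint64_t combine_tsc (uint32_t lo, uint32_t hi)
{
    return ((uint64_t) hi << 32) | lo;
}

/* Read the time stamp counter. rdtsc leaves the low half in eax and
 * the high half in edx in both 32 and 64 bit mode, so each half gets
 * its own output operand ("=a" and "=d") instead of relying on "=A". */
static uint64_t read_tsc (void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;

    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return combine_tsc (lo, hi);
#else
    return 0;   /* placeholder on other architectures */
#endif
}
```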
Jeremy Huddleston [Fri, 16 Mar 2012 18:37:23 +0000 (11:37 -0700)]
configure.ac: Fix a copy-paste-o in TLS detection
Regression from: a069da6c66da407cc52e1e92321d69c68fd6beb5
Signed-off-by: Jeremy Huddleston <jeremyhu@apple.com>
Tested-by: Matt Turner <mattst88@gmail.com>
Matt Turner [Wed, 14 Mar 2012 20:48:00 +0000 (16:48 -0400)]
Use AC_LANG_SOURCE for DSPr2 configure program
Signed-off-by: Matt Turner <mattst88@gmail.com>
Chun-wei Fan [Fri, 9 Mar 2012 07:54:06 +0000 (15:54 +0800)]
Just include xmmintrin.h on MSVC as well
The xmmintrin.h as shipped with recent Visual C++ (2003+) provides
_mm_shuffle_pi16 and _mm_mulhi_pu16, so including that header
will do for using these functions, and MSVC does not like the GCC-specific
implementations of _mm_shuffle_pi16 and _mm_mulhi_pu16 that are
currently in the code.
_MM_SHUFFLE is declared in the same way in MSVC's xmmintrin.h, so don't
re-define it here to avoid a compilation warning.
Jeremy Huddleston [Wed, 14 Mar 2012 17:26:18 +0000 (10:26 -0700)]
Fix a false-negative in MMX check
Silence warnings that could make -Werror give a false negative
Use signed char to avoid cases where int8_t isn't declared
Reported-by: Mike Lothian <mike@fireburn.co.uk>
Tested-by: Mike Lothian <mike@fireburn.co.uk>
Reviewed-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Jeremy Huddleston <jeremyhu@apple.com>
Nemanja Lukic [Sun, 11 Mar 2012 17:52:25 +0000 (18:52 +0100)]
MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.
Performance numbers before/after on MIPS-74kc @ 1GHz
Referent (before):
lowlevel-blt-bench:
over_n_8888_8888_ca = L1: 8.32 L2: 7.65 M: 6.38 ( 51.08%) HT: 5.78 VT: 5.74 R: 5.84 RT: 4.39 ( 37Kops/s)
over_n_8888_0565_ca = L1: 7.40 L2: 6.95 M: 6.16 ( 41.06%) HT: 5.72 VT: 5.52 R: 5.63 RT: 4.28 ( 36Kops/s)
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.3
[ 0] image xfce4-terminal-a1 138.223 139.070 0.33% 6/6
[ # ] image16: pixman 0.25.3
[ 0] image16 xfce4-terminal-a1 132.763 132.939 0.06% 5/6
Optimized:
lowlevel-blt-bench:
over_n_8888_8888_ca = L1: 19.35 L2: 23.84 M: 13.68 (109.39%) HT: 11.39 VT: 11.19 R: 11.27 RT: 6.90 ( 47Kops/s)
over_n_8888_0565_ca = L1: 18.68 L2: 17.00 M: 12.56 ( 83.70%) HT: 10.72 VT: 10.45 R: 10.43 RT: 5.79 ( 43Kops/s)
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.3
[ 0] image xfce4-terminal-a1 130.400 131.720 0.46% 6/6
[ # ] image16: pixman 0.25.3
[ 0] image16 xfce4-terminal-a1 125.830 126.604 0.34% 6/6
Jeremy Huddleston [Thu, 8 Mar 2012 17:41:34 +0000 (09:41 -0800)]
Expand TLS support beyond __thread to __declspec(thread)
This code was pretty much copied from a similar commit that I made to
xorg-server in April.
cf: xorg/xserver: bb4d145bd25e2aee988b100ecf1105ea3b6a40b8
Signed-off-by: Jeremy Huddleston <jeremyhu@apple.com>
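The idea of supporting both keywords can be sketched with a macro.
This is an illustration only: EXAMPLE_THREAD_LOCAL is a hypothetical
name, and selecting the keyword with preprocessor checks is an
assumption made for this example rather than a description of how
configure.ac detects it.

```c
/* Pick whichever thread-local storage keyword the compiler offers. */
#if defined(__GNUC__) || defined(__clang__)
#  define EXAMPLE_THREAD_LOCAL __thread
#elif defined(_MSC_VER)
#  define EXAMPLE_THREAD_LOCAL __declspec(thread)
#else
#  define EXAMPLE_THREAD_LOCAL  /* no keyword known: plain global */
#endif

/* Each thread gets its own copy of this counter when a keyword is
 * available. */
static EXAMPLE_THREAD_LOCAL int call_count;

/* Increment and return the calling thread's counter. */
static int bump_call_count (void)
{
    return ++call_count;
}
```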
Jeremy Huddleston [Thu, 8 Mar 2012 17:41:32 +0000 (09:41 -0800)]
Disable MMX when incompatible clang is being used.
Signed-off-by: Jeremy Huddleston <jeremyhu@apple.com>
Jeremy Huddleston [Thu, 8 Mar 2012 17:41:33 +0000 (09:41 -0800)]
Silence a warning about unused pixman_have_mmx
Signed-off-by: Jeremy Huddleston <jeremyhu@apple.com>
Jeremy Huddleston [Thu, 8 Mar 2012 17:41:31 +0000 (09:41 -0800)]
Revert "Disable MMX when Clang is being used."
This reverts commit 5eb4c12a79b3017ec6cc22ab756f53f225731533.
Søren Sandmann Pedersen [Thu, 8 Mar 2012 15:11:20 +0000 (10:11 -0500)]
Post-release version bump to 0.25.3
Søren Sandmann Pedersen [Thu, 8 Mar 2012 14:33:16 +0000 (09:33 -0500)]
Pre-release version bump to 0.25.2
Søren Sandmann Pedersen [Thu, 8 Mar 2012 14:29:46 +0000 (09:29 -0500)]
mmx: Squash a warning by making the argument to ldl_u() const
Alan Coopersmith [Sat, 25 Feb 2012 02:02:56 +0000 (18:02 -0800)]
Just use xmmintrin.h when building with Solaris Studio compilers
Since the Solaris Studio compilers don't have a mode where MMX
instructions are available and SSE instructions are not, we can
just use the <xmmintrin.h> header directly.
Fixes build failure due to Studio not supporting the __gnu_inline__
or __artificial__ attributes.
Signed-off-by: Alan Coopersmith <alan.coopersmith@oracle.com>
Acked-by: Matt Turner <mattst88@gmail.com>
Nemanja Lukic [Wed, 29 Feb 2012 11:04:33 +0000 (12:04 +0100)]
MIPS: DSPr2: Added mips_dspr2_blt and mips_dspr2_fill routines.
Performance numbers before/after on MIPS-74kc @ 1GHz
Referent (before):
lowlevel-blt-bench:
src_n_0565 = L1: 238.14 L2: 233.15 M: 57.88 ( 77.23%) HT: 53.22 VT: 49.99 R: 47.73 RT: 24.79 ( 91Kops/s)
src_n_8888 = L1: 190.19 L2: 187.57 M: 28.94 ( 77.23%) HT: 27.91 VT: 27.33 R: 26.64 RT: 14.68 ( 77Kops/s)
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.1
[ 0] image gnome-system-monitor 268.460 269.712 0.22% 6/6
Optimized:
lowlevel-blt-bench:
src_n_0565 = L1:1081.39 L2: 258.22 M:189.59 (252.91%) HT: 60.23 VT: 55.01 R: 53.44 RT: 23.68 ( 89Kops/s)
src_n_8888 = L1: 653.46 L2: 113.55 M:135.26 (360.86%) HT: 38.99 VT: 37.38 R: 34.95 RT: 18.67 ( 84Kops/s)
cairo-perf-trace:
[ # ] backend test min(s) median(s) stddev. count
[ # ] image: pixman 0.25.1
[ 0] image gnome-system-monitor 246.565 246.706 0.04% 6/6
Søren Sandmann Pedersen [Thu, 1 Mar 2012 07:24:54 +0000 (02:24 -0500)]
pixman-access.c: Remove some unused macros
The macros related to palette entries:
RGB15_TO_ENTRY,
RGB24_TO_ENTRY,
RGB24_TO_ENTRY_Y
are not used anywhere.
Søren Sandmann Pedersen [Wed, 29 Feb 2012 09:44:46 +0000 (04:44 -0500)]
pixman-accessors.h: Delete unused macros
The MEMCPY_WRAPPED and ACCESS macros are not used anymore.
Søren Sandmann Pedersen [Sun, 26 Feb 2012 22:35:20 +0000 (17:35 -0500)]
Move fetching for solid bits images to pixman-noop.c
This should be a bit faster because it can reuse the scanline on each iteration.
Matt Turner [Sat, 25 Feb 2012 01:11:11 +0000 (20:11 -0500)]
lowlevel-blt-bench: add in_8_8 and in_n_8_8
Signed-off-by: Matt Turner <mattst88@gmail.com>