/* pnggccrd.c - mixed C/assembler version of utilities to read a PNG file
*
* For Intel x86 CPU (Pentium-MMX or later) and GNU C compiler.
*
* See http://www.intel.com/drg/pentiumII/appnotes/916/916.htm
* and http://www.intel.com/drg/pentiumII/appnotes/923/923.htm
* for Intel's performance analysis of the MMX vs. non-MMX code.
*
* libpng version 1.2.5 - October 3, 2002
* For conditions of distribution and use, see copyright notice in png.h
*
* Based on MSVC code contributed by Nirav Chhatrapati, Intel Corp., 1998.
* Interface to libpng contributed by Gilles Vollant, 1999.
* GNU C port by Greg Roelofs, 1999-2001.
*
* Lines 2350-4300 converted in place with intel2gas 1.3.1:
*
* intel2gas -mdI pnggccrd.c.partially-msvc -o pnggccrd.c
*
* and then cleaned up by hand. See http://hermes.terminal.at/intel2gas/ .
*
* NOTE: A sufficiently recent version of GNU as (or as.exe under DOS/Windows)
* is required to assemble the newer MMX instructions such as movq.
* For djgpp, see
*
* ftp://ftp.simtel.net/pub/simtelnet/gnu/djgpp/v2gnu/bnu281b.zip
*
* (or a later version in the same directory). For Linux, check your
* distribution's web site(s) or try these links:
*
* http://rufus.w3.org/linux/RPM/binutils.html
* http://www.debian.org/Packages/stable/devel/binutils.html
* ftp://ftp.slackware.com/pub/linux/slackware/slackware/slakware/d1/
* binutils.tgz
*
* For other platforms, see the main GNU site:
*
* ftp://ftp.gnu.org/pub/gnu/binutils/
*
* Version 2.5.2l.15 is definitely too old...
*/
/*
* TEMPORARY PORTING NOTES AND CHANGELOG (mostly by Greg Roelofs)
* =====================================
*
* 19991006:
* - fixed sign error in post-MMX cleanup code (16- & 32-bit cases)
*
* 19991007:
* - additional optimizations (possible or definite):
* x [DONE] write MMX code for 64-bit case (pixel_bytes == 8) [not tested]
* - write MMX code for 48-bit case (pixel_bytes == 6)
* - figure out what's up with 24-bit case (pixel_bytes == 3):
* why subtract 8 from width_mmx in the pass 4/5 case?
* (only width_mmx case) (near line 1606)
* x [DONE] replace pixel_bytes within each block with the true
* constant value (or are compilers smart enough to do that?)
* - rewrite all MMX interlacing code so it's aligned with
* the *beginning* of the row buffer, not the end. This
* would not only allow one to eliminate half of the memory
* writes for odd passes (that is, pass == odd), it may also
* eliminate some unaligned-data-access exceptions (assuming
* there's a penalty for not aligning 64-bit accesses on
* 64-bit boundaries). The only catch is that the "leftover"
* pixel(s) at the end of the row would have to be saved,
* but there are enough unused MMX registers in every case,
* so this is not a problem. A further benefit is that the
* post-MMX cleanup code (C code) in at least some of the
* cases could be done within the assembler block.
* x [DONE] the "v3 v2 v1 v0 v7 v6 v5 v4" comments are confusing,
* inconsistent, and don't match the MMX Programmer's Reference
* Manual conventions anyway. They should be changed to
* "b7 b6 b5 b4 b3 b2 b1 b0," where b0 indicates the byte that
* was lowest in memory (e.g., corresponding to a left pixel)
* and b7 is the byte that was highest (e.g., a right pixel).
*
* 19991016:
* - Brennan's Guide notwithstanding, gcc under Linux does *not*
* want globals prefixed by underscores when referencing them--
* i.e., if the variable is const4, then refer to it as const4,
* not _const4. This seems to be a djgpp-specific requirement.
* Also, such variables apparently *must* be declared outside
* of functions; neither static nor automatic variables work if
* defined within the scope of a single function, but both
* static and truly global (multi-module) variables work fine.
*
* 19991023:
* - fixed png_combine_row() non-MMX replication bug (odd passes only?)
* - switched from string-concatenation-with-macros to cleaner method of
* renaming global variables for djgpp--i.e., always use prefixes in
* inlined assembler code (== strings) and conditionally rename the
* variables, not the other way around. Hence _const4, _mask8_0, etc.
*
* 19991024:
* - fixed mmxsupport()/png_do_read_interlace() first-row bug
* This one was severely weird: even though mmxsupport() doesn't touch
* ebx (where "row" pointer was stored), it nevertheless managed to zero
* the register (even in static/non-fPIC code--see below), which in turn
* caused png_do_read_interlace() to return prematurely on the first row of
* interlaced images (i.e., without expanding the interlaced pixels).
* Inspection of the generated assembly code didn't turn up any clues,
* although it did point at a minor optimization (i.e., get rid of
* mmx_supported_local variable and just use eax). Possibly the CPUID
* instruction is more destructive than it looks? (Not yet checked.)
* - "info gcc" was next to useless, so compared fPIC and non-fPIC assembly
* listings... Apparently register spillage has to do with ebx, since
* it's used to index the global offset table. Commenting it out of the
* input-reg lists in png_combine_row() eliminated compiler barfage, so
* ifdef'd with __PIC__ macro: if defined, use a global for unmask
*
* 19991107:
* - verified CPUID clobberage: 12-char string constant ("GenuineIntel",
* "AuthenticAMD", etc.) placed in ebx:ecx:edx. Still need to polish.
*
* 19991120:
* - made "diff" variable (now "_dif") global to simplify conversion of
* filtering routines (running out of regs, sigh). "diff" is still used
* in interlacing routines, however.
* - fixed up both versions of mmxsupport() (ORIG_THAT_USED_TO_CLOBBER_EBX
* macro determines which is used); original not yet tested.
*
* 20000213:
* - when compiling with gcc, be sure to use -fomit-frame-pointer
*
* 20000319:
* - fixed a register-name typo in png_do_read_interlace(), default (MMX) case,
* pass == 4 or 5, that caused visible corruption of interlaced images
*
* 20000623:
* - Various problems were reported with gcc 2.95.2 in the Cygwin environment,
* many of the form "forbidden register 0 (ax) was spilled for class AREG."
* This is explained at http://gcc.gnu.org/fom_serv/cache/23.html, and
* Chuck Wilson supplied a patch involving dummy output registers. See
* http://sourceforge.net/bugs/?func=detailbug&bug_id=108741&group_id=5624
* for the original (anonymous) SourceForge bug report.
*
* 20000706:
* - Chuck Wilson passed along these remaining gcc 2.95.2 errors:
* pnggccrd.c: In function `png_combine_row':
* pnggccrd.c:525: more than 10 operands in `asm'
* pnggccrd.c:669: more than 10 operands in `asm'
* pnggccrd.c:828: more than 10 operands in `asm'
* pnggccrd.c:994: more than 10 operands in `asm'
* pnggccrd.c:1177: more than 10 operands in `asm'
* They are all the same problem and can be worked around by using the
* global _unmask variable unconditionally, not just in the -fPIC case.
* Reportedly earlier versions of gcc also have the problem with more than
* 10 operands; they just don't report it. Much strangeness ensues, etc.
*
* 20000729:
* - enabled png_read_filter_row_mmx_up() (shortest remaining unconverted
* MMX routine); began converting png_read_filter_row_mmx_sub()
* - to finish remaining sections:
* - clean up indentation and comments
* - preload local variables
* - add output and input regs (order of former determines numerical
* mapping of latter)
* - avoid all usage of ebx (including bx, bh, bl) register [20000823]
* - remove "$" from addressing of Shift and Mask variables [20000823]
*
* 20000731:
* - global union vars causing segfaults in png_read_filter_row_mmx_sub()?
*
* 20000822:
* - ARGH, stupid png_read_filter_row_mmx_sub() segfault only happens with
* shared-library (-fPIC) version! Code works just fine as part of static
* library. Damn damn damn damn damn, should have tested that sooner.
* ebx is getting clobbered again (explicitly this time); need to save it
* on stack or rewrite asm code to avoid using it altogether. Blargh!
*
* 20000823:
* - first section was trickiest; all remaining sections have ebx -> edx now.
* (-fPIC works again.) Also added missing underscores to various Shift*
* and *Mask* globals and got rid of leading "$" signs.
*
* 20000826:
* - added visual separators to help navigate microscopic printed copies
* (http://pobox.com/~newt/code/gpr-latest.zip, mode 10); started working
* on png_read_filter_row_mmx_avg()
*
* 20000828:
* - finished png_read_filter_row_mmx_avg(): only Paeth left! (930 lines...)
* What the hell, did png_read_filter_row_mmx_paeth(), too. Comments not
* cleaned up/shortened in either routine, but functionality is complete
* and seems to be working fine.
*
* 20000829:
* - ahhh, figured out last(?) bit of gcc/gas asm-fu: if register is listed
* as an input reg (with dummy output variables, etc.), then it *cannot*
* also appear in the clobber list or gcc 2.95.2 will barf. The solution
* is simple enough...
*
* 20000914:
* - bug in png_read_filter_row_mmx_avg(): 16-bit grayscale not handled
* correctly (but 48-bit RGB just fine)
*
* 20000916:
* - fixed bug in png_read_filter_row_mmx_avg(), bpp == 2 case; three errors:
* - "_ShiftBpp.use = 24;" should have been "_ShiftBpp.use = 16;"
* - "_ShiftRem.use = 40;" should have been "_ShiftRem.use = 48;"
* - "psllq _ShiftRem, %%mm2" should have been "psrlq _ShiftRem, %%mm2"
*
* 20010101:
* - added new png_init_mmx_flags() function (here only because it needs to
* call mmxsupport(), which should probably become global png_mmxsupport());
* modified other MMX routines to run conditionally (png_ptr->asm_flags)
*
* 20010103:
* - renamed mmxsupport() to png_mmx_support(), with auto-set of mmx_supported,
* and made it public; moved png_init_mmx_flags() to png.c as internal func
*
* 20010104:
* - removed dependency on png_read_filter_row_c() (C code already duplicated
* within MMX version of png_read_filter_row()) so no longer necessary to
* compile it into pngrutil.o
*
* 20010310:
* - fixed buffer-overrun bug in png_combine_row() C code (non-MMX)
*
* 20020304:
* - eliminated incorrect use of width_mmx in pixel_bytes == 8 case
*
* STILL TO DO:
* - test png_do_read_interlace() 64-bit case (pixel_bytes == 8)
* - write MMX code for 48-bit case (pixel_bytes == 6)
* - figure out what's up with 24-bit case (pixel_bytes == 3):
* why subtract 8 from width_mmx in the pass 4/5 case?
* (only width_mmx case) (near line 1606)
* - rewrite all MMX interlacing code so it's aligned with beginning
* of the row buffer, not the end (see 19991007 for details)
* x pick one version of mmxsupport() and get rid of the other
* - add error messages to any remaining bogus default cases
* - enable pixel_depth == 8 cases in png_read_filter_row()? (test speed)
* x add support for runtime enable/disable/query of various MMX routines
*/
#define PNG_INTERNAL
#include "png.h"
#if defined(PNG_USE_PNGGCCRD)
int PNGAPI png_mmx_support(void);
#ifdef PNG_USE_LOCAL_ARRAYS
static const int FARDATA png_pass_start[7] = {0, 4, 0, 2, 0, 1, 0};
static const int FARDATA png_pass_inc[7] = {8, 8, 4, 4, 2, 2, 1};
static const int FARDATA png_pass_width[7] = {8, 4, 4, 2, 2, 1, 1};
#endif
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED)
/* djgpp, Win32, and Cygwin add their own underscores to global variables,
* so define them without: */
#if defined(__DJGPP__) || defined(WIN32) || defined(__CYGWIN__)
# define _mmx_supported mmx_supported
# define _const4 const4
# define _const6 const6
# define _mask8_0 mask8_0
# define _mask16_1 mask16_1
# define _mask16_0 mask16_0
# define _mask24_2 mask24_2
# define _mask24_1 mask24_1
# define _mask24_0 mask24_0
# define _mask32_3 mask32_3
# define _mask32_2 mask32_2
# define _mask32_1 mask32_1
# define _mask32_0 mask32_0
# define _mask48_5 mask48_5
# define _mask48_4 mask48_4
# define _mask48_3 mask48_3
# define _mask48_2 mask48_2
# define _mask48_1 mask48_1
# define _mask48_0 mask48_0
# define _LBCarryMask LBCarryMask
# define _HBClearMask HBClearMask
# define _ActiveMask ActiveMask
# define _ActiveMask2 ActiveMask2
# define _ActiveMaskEnd ActiveMaskEnd
# define _ShiftBpp ShiftBpp
# define _ShiftRem ShiftRem
#ifdef PNG_THREAD_UNSAFE_OK
# define _unmask unmask
# define _FullLength FullLength
# define _MMXLength MMXLength
# define _dif dif
# define _patemp patemp
# define _pbtemp pbtemp
# define _pctemp pctemp
#endif
#endif
/* These constants are used in the inlined MMX assembly code.
Ignore gcc's "At top level: defined but not used" warnings. */
/* GRR 20000706: originally _unmask was needed only when compiling with -fPIC,
* since that case uses the %ebx register for indexing the Global Offset Table
* and there were no other registers available. But gcc 2.95 and later emit
* "more than 10 operands in `asm'" errors when %ebx is used to preload unmask
* in the non-PIC case, so we'll just use the global unconditionally now.
*/
#ifdef PNG_THREAD_UNSAFE_OK
static int _unmask;
#endif
static unsigned long long _mask8_0 = 0x0102040810204080LL;
static unsigned long long _mask16_1 = 0x0101020204040808LL;
static unsigned long long _mask16_0 = 0x1010202040408080LL;
static unsigned long long _mask24_2 = 0x0101010202020404LL;
static unsigned long long _mask24_1 = 0x0408080810101020LL;
static unsigned long long _mask24_0 = 0x2020404040808080LL;
static unsigned long long _mask32_3 = 0x0101010102020202LL;
static unsigned long long _mask32_2 = 0x0404040408080808LL;
static unsigned long long _mask32_1 = 0x1010101020202020LL;
static unsigned long long _mask32_0 = 0x4040404080808080LL;
static unsigned long long _mask48_5 = 0x0101010101010202LL;
static unsigned long long _mask48_4 = 0x0202020204040404LL;
static unsigned long long _mask48_3 = 0x0404080808080808LL;
static unsigned long long _mask48_2 = 0x1010101010102020LL;
static unsigned long long _mask48_1 = 0x2020202040404040LL;
static unsigned long long _mask48_0 = 0x4040808080808080LL;
static unsigned long long _const4 = 0x0000000000FFFFFFLL;
//static unsigned long long _const5 = 0x000000FFFFFF0000LL; // NOT USED
static unsigned long long _const6 = 0x00000000000000FFLL;
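/* Illustrative sketch only (not part of libpng): a scalar emulation of how
 * the 8-bpp MMX path in png_combine_row() appears to turn the 8-bit mask
 * into a per-byte select mask.  The helper name is hypothetical.  Each byte
 * of _mask8_0 carries a single bit (0x80 in the lowest byte, which is the
 * first pixel in memory on little-endian x86).  ANDing it with the
 * replicated inverted mask and comparing each byte with zero (pcmpeqb)
 * gives 0xFF for pixels to copy and 0x00 for pixels to keep. */
static unsigned long long
sketch_expand_mask8(int mask)
{
   unsigned long long bits = 0x0102040810204080ULL;  /* same as _mask8_0 */
   unsigned long long unmask = (unsigned long long)(~mask & 0xff);
   unsigned long long select = 0;
   int byte;

   unmask *= 0x0101010101010101ULL;   /* punpcklbw/wd/dq: 8 copies of byte */
   bits &= unmask;                    /* pand: nonzero where mask bit is 0 */

   for (byte = 0; byte < 8; byte++)   /* pcmpeqb against an all-zero reg */
      if (((bits >> (8 * byte)) & 0xff) == 0)
         select |= 0xffULL << (8 * byte);

   return select;   /* e.g. mask 0x80 selects only the lowest byte */
}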
// These are used in the row-filter routines and should/would be local
// variables if not for gcc addressing limitations.
// WARNING: Their presence probably defeats the thread safety of libpng.
#ifdef PNG_THREAD_UNSAFE_OK
static png_uint_32 _FullLength;
static png_uint_32 _MMXLength;
static int _dif;
static int _patemp; // temp variables for Paeth routine
static int _pbtemp;
static int _pctemp;
#endif
void /* PRIVATE */
png_squelch_warnings(void)
{
#ifdef PNG_THREAD_UNSAFE_OK
_dif = _dif;
_patemp = _patemp;
_pbtemp = _pbtemp;
_pctemp = _pctemp;
_MMXLength = _MMXLength;
#endif
_const4 = _const4;
_const6 = _const6;
_mask8_0 = _mask8_0;
_mask16_1 = _mask16_1;
_mask16_0 = _mask16_0;
_mask24_2 = _mask24_2;
_mask24_1 = _mask24_1;
_mask24_0 = _mask24_0;
_mask32_3 = _mask32_3;
_mask32_2 = _mask32_2;
_mask32_1 = _mask32_1;
_mask32_0 = _mask32_0;
_mask48_5 = _mask48_5;
_mask48_4 = _mask48_4;
_mask48_3 = _mask48_3;
_mask48_2 = _mask48_2;
_mask48_1 = _mask48_1;
_mask48_0 = _mask48_0;
}
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
static int _mmx_supported = 2;
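/* Illustrative sketch only (not part of libpng): one possible way to query
 * MMX support through the <cpuid.h> helper shipped with current GCC and
 * Clang, instead of hand-written CPUID assembly.  The function name is
 * hypothetical.  CPUID leaf 1 reports MMX in bit 23 of EDX.  Because the
 * compiler allocates the registers itself, ebx (the PIC register discussed
 * in the changelog above) should not be clobbered behind its back. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

static int
sketch_mmx_available(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
   unsigned int eax, ebx, ecx, edx;

   if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
      return 0;                  /* CPUID leaf 1 is not available */
   return (int)((edx >> 23) & 1); /* 1 if the MMX feature bit is set */
#else
   return 0;                     /* not an x86 target: assume no MMX */
#endif
}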
/*===========================================================================*/
/* */
/* P N G _ C O M B I N E _ R O W */
/* */
/*===========================================================================*/
#if defined(PNG_HAVE_ASSEMBLER_COMBINE_ROW)
#define BPP2 2
#define BPP3 3 /* bytes per pixel (a.k.a. pixel_bytes) */
#define BPP4 4
#define BPP6 6 /* (defined only to help avoid cut-and-paste errors) */
#define BPP8 8
/* Combines the row recently read in with the previous row.
This routine takes care of alpha and transparency if requested.
This routine also handles the two methods of progressive display
of interlaced images, depending on the mask value.
The mask value describes which pixels are to be combined with
the row. The pattern always repeats every 8 pixels, so just 8
bits are needed. A one indicates the pixel is to be combined; a
zero indicates the pixel is to be skipped. This is in addition
to any alpha or transparency value associated with the pixel.
If you want all pixels to be combined, pass 0xff (255) in mask. */
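/* Illustrative sketch only (not part of libpng): a portable C rendering of
 * the mask rule described above, for one-byte pixels.  The function name
 * and signature are hypothetical.  Bit 0x80 of the mask governs the first
 * pixel of each group of eight, matching the m = 0x80 starting value used
 * by the packed-pixel cases in png_combine_row(). */
#include <stddef.h>

static void
sketch_combine_row_8bpp(unsigned char *dst, const unsigned char *src,
   size_t width, int mask)
{
   size_t i;

   for (i = 0; i < width; i++)
   {
      int m = 0x80 >> (i & 7);   /* the pattern repeats every 8 pixels */

      if (m & mask)
         dst[i] = src[i];        /* one: take the newly read pixel */
      /* zero: leave the pixel already in the row untouched */
   }
}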
/* Use this routine for the x86 platform - it uses a faster MMX routine
if the machine supports MMX. */
void /* PRIVATE */
png_combine_row(png_structp png_ptr, png_bytep row, int mask)
{
png_debug(1, "in png_combine_row (pnggccrd.c)\n");
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED)
if (_mmx_supported == 2) {
/* this should have happened in png_init_mmx_flags() already */
png_warning(png_ptr, "asm_flags may not have been initialized");
png_mmx_support();
}
#endif
if (mask == 0xff)
{
png_debug(2,"mask == 0xff: doing single png_memcpy()\n");
png_memcpy(row, png_ptr->row_buf + 1,
(png_size_t)((png_ptr->width * png_ptr->row_info.pixel_depth + 7) >> 3));
}
else /* (png_combine_row() is never called with mask == 0) */
{
switch (png_ptr->row_info.pixel_depth)
{
case 1: /* png_ptr->row_info.pixel_depth */
{
png_bytep sp;
png_bytep dp;
int s_inc, s_start, s_end;
int m;
int shift;
png_uint_32 i;
sp = png_ptr->row_buf + 1;
dp = row;
m = 0x80;
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (png_ptr->transformations & PNG_PACKSWAP)
{
s_start = 0;
s_end = 7;
s_inc = 1;
}
else
#endif
{
s_start = 7;
s_end = 0;
s_inc = -1;
}
shift = s_start;
for (i = 0; i < png_ptr->width; i++)
{
if (m & mask)
{
int value;
value = (*sp >> shift) & 0x1;
*dp &= (png_byte)((0x7f7f >> (7 - shift)) & 0xff);
*dp |= (png_byte)(value << shift);
}
if (shift == s_end)
{
shift = s_start;
sp++;
dp++;
}
else
shift += s_inc;
if (m == 1)
m = 0x80;
else
m >>= 1;
}
break;
}
case 2: /* png_ptr->row_info.pixel_depth */
{
png_bytep sp;
png_bytep dp;
int s_start, s_end, s_inc;
int m;
int shift;
png_uint_32 i;
int value;
sp = png_ptr->row_buf + 1;
dp = row;
m = 0x80;
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (png_ptr->transformations & PNG_PACKSWAP)
{
s_start = 0;
s_end = 6;
s_inc = 2;
}
else
#endif
{
s_start = 6;
s_end = 0;
s_inc = -2;
}
shift = s_start;
for (i = 0; i < png_ptr->width; i++)
{
if (m & mask)
{
value = (*sp >> shift) & 0x3;
*dp &= (png_byte)((0x3f3f >> (6 - shift)) & 0xff);
*dp |= (png_byte)(value << shift);
}
if (shift == s_end)
{
shift = s_start;
sp++;
dp++;
}
else
shift += s_inc;
if (m == 1)
m = 0x80;
else
m >>= 1;
}
break;
}
case 4: /* png_ptr->row_info.pixel_depth */
{
png_bytep sp;
png_bytep dp;
int s_start, s_end, s_inc;
int m;
int shift;
png_uint_32 i;
int value;
sp = png_ptr->row_buf + 1;
dp = row;
m = 0x80;
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (png_ptr->transformations & PNG_PACKSWAP)
{
s_start = 0;
s_end = 4;
s_inc = 4;
}
else
#endif
{
s_start = 4;
s_end = 0;
s_inc = -4;
}
shift = s_start;
for (i = 0; i < png_ptr->width; i++)
{
if (m & mask)
{
value = (*sp >> shift) & 0xf;
*dp &= (png_byte)((0xf0f >> (4 - shift)) & 0xff);
*dp |= (png_byte)(value << shift);
}
if (shift == s_end)
{
shift = s_start;
sp++;
dp++;
}
else
shift += s_inc;
if (m == 1)
m = 0x80;
else
m >>= 1;
}
break;
}
case 8: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
png_uint_32 len;
int diff;
int dummy_value_a; // fix 'forbidden register spilled' error
int dummy_value_d;
int dummy_value_c;
int dummy_value_S;
int dummy_value_D;
_unmask = ~mask; // global variable for -fPIC version
srcptr = png_ptr->row_buf + 1;
dstptr = row;
len = png_ptr->width &~7; // reduce to multiple of 8
diff = (int) (png_ptr->width & 7); // amount lost
__asm__ __volatile__ (
"movd _unmask, %%mm7 \n\t" // load bit pattern
"psubb %%mm6, %%mm6 \n\t" // zero mm6
"punpcklbw %%mm7, %%mm7 \n\t"
"punpcklwd %%mm7, %%mm7 \n\t"
"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
"movq _mask8_0, %%mm0 \n\t"
"pand %%mm7, %%mm0 \n\t" // nonzero if keep byte
"pcmpeqb %%mm6, %%mm0 \n\t" // zeros->1s, v versa
// preload "movl len, %%ecx \n\t" // load length of line
// preload "movl srcptr, %%esi \n\t" // load source
// preload "movl dstptr, %%edi \n\t" // load dest
"cmpl $0, %%ecx \n\t" // len == 0 ?
"je mainloop8end \n\t"
"mainloop8: \n\t"
"movq (%%esi), %%mm4 \n\t" // *srcptr
"pand %%mm0, %%mm4 \n\t"
"movq %%mm0, %%mm6 \n\t"
"pandn (%%edi), %%mm6 \n\t" // *dstptr
"por %%mm6, %%mm4 \n\t"
"movq %%mm4, (%%edi) \n\t"
"addl $8, %%esi \n\t" // inc by 8 bytes processed
"addl $8, %%edi \n\t"
"subl $8, %%ecx \n\t" // dec by 8 pixels processed
"ja mainloop8 \n\t"
"mainloop8end: \n\t"
// preload "movl diff, %%ecx \n\t" // (diff is in eax)
"movl %%eax, %%ecx \n\t"
"cmpl $0, %%ecx \n\t"
"jz end8 \n\t"
// preload "movl mask, %%edx \n\t"
"sall $24, %%edx \n\t" // make low byte, high byte
"secondloop8: \n\t"
"sall %%edx \n\t" // move high bit to CF
"jnc skip8 \n\t" // if CF = 0
"movb (%%esi), %%al \n\t"
"movb %%al, (%%edi) \n\t"
"skip8: \n\t"
"incl %%esi \n\t"
"incl %%edi \n\t"
"decl %%ecx \n\t"
"jnz secondloop8 \n\t"
"end8: \n\t"
"EMMS \n\t" // DONE
: "=a" (dummy_value_a), // output regs (dummy)
"=d" (dummy_value_d),
"=c" (dummy_value_c),
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "3" (srcptr), // esi // input regs
"4" (dstptr), // edi
"0" (diff), // eax
// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
"2" (len), // ecx
"1" (mask) // edx
#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm4", "%mm6", "%mm7" // clobber list
#endif
);
}
else /* mmx _not supported - Use modified C routine */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
register png_uint_32 i;
png_uint_32 initial_val = png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff /* *BPP1 */ ;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
} /* end of else (_mmx_supported) */
break;
} /* end 8 bpp */
case 16: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
png_uint_32 len;
int diff;
int dummy_value_a; // fix 'forbidden register spilled' error
int dummy_value_d;
int dummy_value_c;
int dummy_value_S;
int dummy_value_D;
_unmask = ~mask; // global variable for -fPIC version
srcptr = png_ptr->row_buf + 1;
dstptr = row;
len = png_ptr->width &~7; // reduce to multiple of 8
diff = (int) (png_ptr->width & 7); // amount lost //
__asm__ __volatile__ (
"movd _unmask, %%mm7 \n\t" // load bit pattern
"psubb %%mm6, %%mm6 \n\t" // zero mm6
"punpcklbw %%mm7, %%mm7 \n\t"
"punpcklwd %%mm7, %%mm7 \n\t"
"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
"movq _mask16_0, %%mm0 \n\t"
"movq _mask16_1, %%mm1 \n\t"
"pand %%mm7, %%mm0 \n\t"
"pand %%mm7, %%mm1 \n\t"
"pcmpeqb %%mm6, %%mm0 \n\t"
"pcmpeqb %%mm6, %%mm1 \n\t"
// preload "movl len, %%ecx \n\t" // load length of line
// preload "movl srcptr, %%esi \n\t" // load source
// preload "movl dstptr, %%edi \n\t" // load dest
"cmpl $0, %%ecx \n\t"
"jz mainloop16end \n\t"
"mainloop16: \n\t"
"movq (%%esi), %%mm4 \n\t"
"pand %%mm0, %%mm4 \n\t"
"movq %%mm0, %%mm6 \n\t"
"movq (%%edi), %%mm7 \n\t"
"pandn %%mm7, %%mm6 \n\t"
"por %%mm6, %%mm4 \n\t"
"movq %%mm4, (%%edi) \n\t"
"movq 8(%%esi), %%mm5 \n\t"
"pand %%mm1, %%mm5 \n\t"
"movq %%mm1, %%mm7 \n\t"
"movq 8(%%edi), %%mm6 \n\t"
"pandn %%mm6, %%mm7 \n\t"
"por %%mm7, %%mm5 \n\t"
"movq %%mm5, 8(%%edi) \n\t"
"addl $16, %%esi \n\t" // inc by 16 bytes processed
"addl $16, %%edi \n\t"
"subl $8, %%ecx \n\t" // dec by 8 pixels processed
"ja mainloop16 \n\t"
"mainloop16end: \n\t"
// preload "movl diff, %%ecx \n\t" // (diff is in eax)
"movl %%eax, %%ecx \n\t"
"cmpl $0, %%ecx \n\t"
"jz end16 \n\t"
// preload "movl mask, %%edx \n\t"
"sall $24, %%edx \n\t" // make low byte, high byte
"secondloop16: \n\t"
"sall %%edx \n\t" // move high bit to CF
"jnc skip16 \n\t" // if CF = 0
"movw (%%esi), %%ax \n\t"
"movw %%ax, (%%edi) \n\t"
"skip16: \n\t"
"addl $2, %%esi \n\t"
"addl $2, %%edi \n\t"
"decl %%ecx \n\t"
"jnz secondloop16 \n\t"
"end16: \n\t"
"EMMS \n\t" // DONE
: "=a" (dummy_value_a), // output regs (dummy)
"=c" (dummy_value_c),
"=d" (dummy_value_d),
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "0" (diff), // eax // input regs
// was (unmask) " " RESERVED // ebx // Global Offset Table idx
"1" (len), // ecx
"2" (mask), // edx
"3" (srcptr), // esi
"4" (dstptr) // edi
#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm4" // clobber list
, "%mm5", "%mm6", "%mm7"
#endif
);
}
else /* mmx _not supported - Use modified C routine */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
register png_uint_32 i;
png_uint_32 initial_val = BPP2 * png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = BPP2 * png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = BPP2 * png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = BPP2 * len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff*BPP2;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
} /* end of else (_mmx_supported) */
break;
} /* end 16 bpp */
case 24: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
png_uint_32 len;
int diff;
int dummy_value_a; // fix 'forbidden register spilled' error
int dummy_value_d;
int dummy_value_c;
int dummy_value_S;
int dummy_value_D;
_unmask = ~mask; // global variable for -fPIC version
srcptr = png_ptr->row_buf + 1;
dstptr = row;
len = png_ptr->width &~7; // reduce to multiple of 8
diff = (int) (png_ptr->width & 7); // amount lost //
__asm__ __volatile__ (
"movd _unmask, %%mm7 \n\t" // load bit pattern
"psubb %%mm6, %%mm6 \n\t" // zero mm6
"punpcklbw %%mm7, %%mm7 \n\t"
"punpcklwd %%mm7, %%mm7 \n\t"
"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
"movq _mask24_0, %%mm0 \n\t"
"movq _mask24_1, %%mm1 \n\t"
"movq _mask24_2, %%mm2 \n\t"
"pand %%mm7, %%mm0 \n\t"
"pand %%mm7, %%mm1 \n\t"
"pand %%mm7, %%mm2 \n\t"
"pcmpeqb %%mm6, %%mm0 \n\t"
"pcmpeqb %%mm6, %%mm1 \n\t"
"pcmpeqb %%mm6, %%mm2 \n\t"
// preload "movl len, %%ecx \n\t" // load length of line
// preload "movl srcptr, %%esi \n\t" // load source
// preload "movl dstptr, %%edi \n\t" // load dest
"cmpl $0, %%ecx \n\t"
"jz mainloop24end \n\t"
"mainloop24: \n\t"
"movq (%%esi), %%mm4 \n\t"
"pand %%mm0, %%mm4 \n\t"
"movq %%mm0, %%mm6 \n\t"
"movq (%%edi), %%mm7 \n\t"
"pandn %%mm7, %%mm6 \n\t"
"por %%mm6, %%mm4 \n\t"
"movq %%mm4, (%%edi) \n\t"
"movq 8(%%esi), %%mm5 \n\t"
"pand %%mm1, %%mm5 \n\t"
"movq %%mm1, %%mm7 \n\t"
"movq 8(%%edi), %%mm6 \n\t"
"pandn %%mm6, %%mm7 \n\t"
"por %%mm7, %%mm5 \n\t"
"movq %%mm5, 8(%%edi) \n\t"
"movq 16(%%esi), %%mm6 \n\t"
"pand %%mm2, %%mm6 \n\t"
"movq %%mm2, %%mm4 \n\t"
"movq 16(%%edi), %%mm7 \n\t"
"pandn %%mm7, %%mm4 \n\t"
"por %%mm4, %%mm6 \n\t"
"movq %%mm6, 16(%%edi) \n\t"
"addl $24, %%esi \n\t" // inc by 24 bytes processed
"addl $24, %%edi \n\t"
"subl $8, %%ecx \n\t" // dec by 8 pixels processed
"ja mainloop24 \n\t"
"mainloop24end: \n\t"
// preload "movl diff, %%ecx \n\t" // (diff is in eax)
"movl %%eax, %%ecx \n\t"
"cmpl $0, %%ecx \n\t"
"jz end24 \n\t"
// preload "movl mask, %%edx \n\t"
"sall $24, %%edx \n\t" // make low byte, high byte
"secondloop24: \n\t"
"sall %%edx \n\t" // move high bit to CF
"jnc skip24 \n\t" // if CF = 0
"movw (%%esi), %%ax \n\t"
"movw %%ax, (%%edi) \n\t"
"xorl %%eax, %%eax \n\t"
"movb 2(%%esi), %%al \n\t"
"movb %%al, 2(%%edi) \n\t"
"skip24: \n\t"
"addl $3, %%esi \n\t"
"addl $3, %%edi \n\t"
"decl %%ecx \n\t"
"jnz secondloop24 \n\t"
"end24: \n\t"
"EMMS \n\t" // DONE
: "=a" (dummy_value_a), // output regs (dummy)
"=d" (dummy_value_d),
"=c" (dummy_value_c),
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "3" (srcptr), // esi // input regs
"4" (dstptr), // edi
"0" (diff), // eax
// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
"2" (len), // ecx
"1" (mask) // edx
#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2" // clobber list
, "%mm4", "%mm5", "%mm6", "%mm7"
#endif
);
}
else /* mmx _not supported - Use modified C routine */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
register png_uint_32 i;
png_uint_32 initial_val = BPP3 * png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = BPP3 * png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = BPP3 * png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = BPP3 * len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff*BPP3;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
} /* end of else (_mmx_supported) */
break;
} /* end 24 bpp */
case 32: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
png_uint_32 len;
int diff;
int dummy_value_a; // fix 'forbidden register spilled' error
int dummy_value_d;
int dummy_value_c;
int dummy_value_S;
int dummy_value_D;
_unmask = ~mask; // global variable for -fPIC version
srcptr = png_ptr->row_buf + 1;
dstptr = row;
len = png_ptr->width &~7; // reduce to multiple of 8
diff = (int) (png_ptr->width & 7); // amount lost //
__asm__ __volatile__ (
"movd _unmask, %%mm7 \n\t" // load bit pattern
"psubb %%mm6, %%mm6 \n\t" // zero mm6
"punpcklbw %%mm7, %%mm7 \n\t"
"punpcklwd %%mm7, %%mm7 \n\t"
"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
"movq _mask32_0, %%mm0 \n\t"
"movq _mask32_1, %%mm1 \n\t"
"movq _mask32_2, %%mm2 \n\t"
"movq _mask32_3, %%mm3 \n\t"
"pand %%mm7, %%mm0 \n\t"
"pand %%mm7, %%mm1 \n\t"
"pand %%mm7, %%mm2 \n\t"
"pand %%mm7, %%mm3 \n\t"
"pcmpeqb %%mm6, %%mm0 \n\t"
"pcmpeqb %%mm6, %%mm1 \n\t"
"pcmpeqb %%mm6, %%mm2 \n\t"
"pcmpeqb %%mm6, %%mm3 \n\t"
// preload "movl len, %%ecx \n\t" // load length of line
// preload "movl srcptr, %%esi \n\t" // load source
// preload "movl dstptr, %%edi \n\t" // load dest
"cmpl $0, %%ecx \n\t" // lcr
"jz mainloop32end \n\t"
"mainloop32: \n\t"
"movq (%%esi), %%mm4 \n\t"
"pand %%mm0, %%mm4 \n\t"
"movq %%mm0, %%mm6 \n\t"
"movq (%%edi), %%mm7 \n\t"
"pandn %%mm7, %%mm6 \n\t"
"por %%mm6, %%mm4 \n\t"
"movq %%mm4, (%%edi) \n\t"
"movq 8(%%esi), %%mm5 \n\t"
"pand %%mm1, %%mm5 \n\t"
"movq %%mm1, %%mm7 \n\t"
"movq 8(%%edi), %%mm6 \n\t"
"pandn %%mm6, %%mm7 \n\t"
"por %%mm7, %%mm5 \n\t"
"movq %%mm5, 8(%%edi) \n\t"
"movq 16(%%esi), %%mm6 \n\t"
"pand %%mm2, %%mm6 \n\t"
"movq %%mm2, %%mm4 \n\t"
"movq 16(%%edi), %%mm7 \n\t"
"pandn %%mm7, %%mm4 \n\t"
"por %%mm4, %%mm6 \n\t"
"movq %%mm6, 16(%%edi) \n\t"
"movq 24(%%esi), %%mm7 \n\t"
"pand %%mm3, %%mm7 \n\t"
"movq %%mm3, %%mm5 \n\t"
"movq 24(%%edi), %%mm4 \n\t"
"pandn %%mm4, %%mm5 \n\t"
"por %%mm5, %%mm7 \n\t"
"movq %%mm7, 24(%%edi) \n\t"
"addl $32, %%esi \n\t" // inc by 32 bytes processed
"addl $32, %%edi \n\t"
"subl $8, %%ecx \n\t" // dec by 8 pixels processed
"ja mainloop32 \n\t"
"mainloop32end: \n\t"
// preload "movl diff, %%ecx \n\t" // (diff is in eax)
"movl %%eax, %%ecx \n\t"
"cmpl $0, %%ecx \n\t"
"jz end32 \n\t"
// preload "movl mask, %%edx \n\t"
"sall $24, %%edx \n\t" // low byte => high byte
"secondloop32: \n\t"
"sall %%edx \n\t" // move high bit to CF
"jnc skip32 \n\t" // if CF = 0
"movl (%%esi), %%eax \n\t"
"movl %%eax, (%%edi) \n\t"
"skip32: \n\t"
"addl $4, %%esi \n\t"
"addl $4, %%edi \n\t"
"decl %%ecx \n\t"
"jnz secondloop32 \n\t"
"end32: \n\t"
"EMMS \n\t" // DONE
: "=a" (dummy_value_a), // output regs (dummy)
"=d" (dummy_value_d),
"=c" (dummy_value_c),
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "3" (srcptr), // esi // input regs
"4" (dstptr), // edi
"0" (diff), // eax
// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
"2" (len), // ecx
"1" (mask) // edx
#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2", "%mm3" // clobber list
, "%mm4", "%mm5", "%mm6", "%mm7"
#endif
);
}
else /* mmx _not supported - Use modified C routine */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
register png_uint_32 i;
png_uint_32 initial_val = BPP4 * png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = BPP4 * png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = BPP4 * png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = BPP4 * len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff*BPP4;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
} /* end of else (_mmx_supported) */
break;
} /* end 32 bpp */
case 48: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED) && defined(PNG_THREAD_UNSAFE_OK)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_COMBINE_ROW)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
png_uint_32 len;
int diff;
int dummy_value_a; // fix 'forbidden register spilled' error
int dummy_value_d;
int dummy_value_c;
int dummy_value_S;
int dummy_value_D;
_unmask = ~mask; // global variable for -fPIC version
srcptr = png_ptr->row_buf + 1;
dstptr = row;
len = png_ptr->width &~7; // reduce to multiple of 8
diff = (int) (png_ptr->width & 7); // amount lost //
__asm__ __volatile__ (
"movd _unmask, %%mm7 \n\t" // load bit pattern
"psubb %%mm6, %%mm6 \n\t" // zero mm6
"punpcklbw %%mm7, %%mm7 \n\t"
"punpcklwd %%mm7, %%mm7 \n\t"
"punpckldq %%mm7, %%mm7 \n\t" // fill reg with 8 masks
"movq _mask48_0, %%mm0 \n\t"
"movq _mask48_1, %%mm1 \n\t"
"movq _mask48_2, %%mm2 \n\t"
"movq _mask48_3, %%mm3 \n\t"
"movq _mask48_4, %%mm4 \n\t"
"movq _mask48_5, %%mm5 \n\t"
"pand %%mm7, %%mm0 \n\t"
"pand %%mm7, %%mm1 \n\t"
"pand %%mm7, %%mm2 \n\t"
"pand %%mm7, %%mm3 \n\t"
"pand %%mm7, %%mm4 \n\t"
"pand %%mm7, %%mm5 \n\t"
"pcmpeqb %%mm6, %%mm0 \n\t"
"pcmpeqb %%mm6, %%mm1 \n\t"
"pcmpeqb %%mm6, %%mm2 \n\t"
"pcmpeqb %%mm6, %%mm3 \n\t"
"pcmpeqb %%mm6, %%mm4 \n\t"
"pcmpeqb %%mm6, %%mm5 \n\t"
// preload "movl len, %%ecx \n\t" // load length of line
// preload "movl srcptr, %%esi \n\t" // load source
// preload "movl dstptr, %%edi \n\t" // load dest
"cmpl $0, %%ecx \n\t"
"jz mainloop48end \n\t"
"mainloop48: \n\t"
"movq (%%esi), %%mm7 \n\t"
"pand %%mm0, %%mm7 \n\t"
"movq %%mm0, %%mm6 \n\t"
"pandn (%%edi), %%mm6 \n\t"
"por %%mm6, %%mm7 \n\t"
"movq %%mm7, (%%edi) \n\t"
"movq 8(%%esi), %%mm6 \n\t"
"pand %%mm1, %%mm6 \n\t"
"movq %%mm1, %%mm7 \n\t"
"pandn 8(%%edi), %%mm7 \n\t"
"por %%mm7, %%mm6 \n\t"
"movq %%mm6, 8(%%edi) \n\t"
"movq 16(%%esi), %%mm6 \n\t"
"pand %%mm2, %%mm6 \n\t"
"movq %%mm2, %%mm7 \n\t"
"pandn 16(%%edi), %%mm7 \n\t"
"por %%mm7, %%mm6 \n\t"
"movq %%mm6, 16(%%edi) \n\t"
"movq 24(%%esi), %%mm7 \n\t"
"pand %%mm3, %%mm7 \n\t"
"movq %%mm3, %%mm6 \n\t"
"pandn 24(%%edi), %%mm6 \n\t"
"por %%mm6, %%mm7 \n\t"
"movq %%mm7, 24(%%edi) \n\t"
"movq 32(%%esi), %%mm6 \n\t"
"pand %%mm4, %%mm6 \n\t"
"movq %%mm4, %%mm7 \n\t"
"pandn 32(%%edi), %%mm7 \n\t"
"por %%mm7, %%mm6 \n\t"
"movq %%mm6, 32(%%edi) \n\t"
"movq 40(%%esi), %%mm7 \n\t"
"pand %%mm5, %%mm7 \n\t"
"movq %%mm5, %%mm6 \n\t"
"pandn 40(%%edi), %%mm6 \n\t"
"por %%mm6, %%mm7 \n\t"
"movq %%mm7, 40(%%edi) \n\t"
"addl $48, %%esi \n\t" // inc by 48 bytes processed
"addl $48, %%edi \n\t"
"subl $8, %%ecx \n\t" // dec by 8 pixels processed
"ja mainloop48 \n\t"
"mainloop48end: \n\t"
// preload "movl diff, %%ecx \n\t" // (diff is in eax)
"movl %%eax, %%ecx \n\t"
"cmpl $0, %%ecx \n\t"
"jz end48 \n\t"
// preload "movl mask, %%edx \n\t"
"sall $24, %%edx \n\t" // make low byte, high byte
"secondloop48: \n\t"
"sall %%edx \n\t" // move high bit to CF
"jnc skip48 \n\t" // if CF = 0
"movl (%%esi), %%eax \n\t"
"movl %%eax, (%%edi) \n\t"
"skip48: \n\t"
"addl $4, %%esi \n\t"
"addl $4, %%edi \n\t"
"decl %%ecx \n\t"
"jnz secondloop48 \n\t"
"end48: \n\t"
"EMMS \n\t" // DONE
: "=a" (dummy_value_a), // output regs (dummy)
"=d" (dummy_value_d),
"=c" (dummy_value_c),
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "3" (srcptr), // esi // input regs
"4" (dstptr), // edi
"0" (diff), // eax
// was (unmask) "b" RESERVED // ebx // Global Offset Table idx
"2" (len), // ecx
"1" (mask) // edx
#if 0 /* MMX regs (%mm0, etc.) not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2", "%mm3" // clobber list
, "%mm4", "%mm5", "%mm6", "%mm7"
#endif
);
}
else /* mmx _not supported - Use modified C routine */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
register png_uint_32 i;
png_uint_32 initial_val = BPP6 * png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = BPP6 * png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = BPP6 * png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = BPP6 * len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff*BPP6;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
} /* end of else (_mmx_supported) */
break;
} /* end 48 bpp */
case 64: /* png_ptr->row_info.pixel_depth */
{
png_bytep srcptr;
png_bytep dstptr;
register png_uint_32 i;
png_uint_32 initial_val = BPP8 * png_pass_start[png_ptr->pass];
/* png.c: png_pass_start[] = {0, 4, 0, 2, 0, 1, 0}; */
register int stride = BPP8 * png_pass_inc[png_ptr->pass];
/* png.c: png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1}; */
register int rep_bytes = BPP8 * png_pass_width[png_ptr->pass];
/* png.c: png_pass_width[] = {8, 4, 4, 2, 2, 1, 1}; */
png_uint_32 len = png_ptr->width &~7; /* reduce to mult. of 8 */
int diff = (int) (png_ptr->width & 7); /* amount lost */
register png_uint_32 final_val = BPP8 * len; /* GRR bugfix */
srcptr = png_ptr->row_buf + 1 + initial_val;
dstptr = row + initial_val;
for (i = initial_val; i < final_val; i += stride)
{
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
if (diff) /* number of leftover pixels: 3 for pngtest */
{
final_val+=diff*BPP8;
for (; i < final_val; i += stride)
{
if (rep_bytes > (int)(final_val-i))
rep_bytes = (int)(final_val-i);
png_memcpy(dstptr, srcptr, rep_bytes);
srcptr += stride;
dstptr += stride;
}
}
break;
} /* end 64 bpp */
default: /* png_ptr->row_info.pixel_depth != 1,2,4,8,16,24,32,48,64 */
{
/* this should never happen */
png_warning(png_ptr, "Invalid row_info.pixel_depth in pnggccrd");
break;
}
} /* end switch (png_ptr->row_info.pixel_depth) */
} /* end if (non-trivial mask) */
} /* end png_combine_row() */
#endif /* PNG_HAVE_ASSEMBLER_COMBINE_ROW */
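/* Illustration only, not part of libpng: a minimal, portable sketch of what
 * the row-combining code above does for byte-aligned pixels. The names
 * sketch_merge_by_mask(), sketch_merge_by_pass() and the sketch_pass_*
 * tables are hypothetical. The first helper assumes, as the cleanup loops
 * above do, that the high bit of the mask byte selects the first pixel of
 * each group of eight. The second mirrors the non-MMX fallback but folds
 * its main loop and leftover loop into one clamped loop, which should give
 * the same result. For pass 1 the pass tables pick pixels 4-7 of every
 * group of eight, which corresponds to a mask of 0x0f.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tables as quoted from png.c in the comments above. */
static const int sketch_pass_start[7] = {0, 4, 0, 2, 0, 1, 0};
static const int sketch_pass_inc[7]   = {8, 8, 4, 4, 2, 2, 1};
static const int sketch_pass_width[7] = {8, 4, 4, 2, 2, 1, 1};

/* Hypothetical helper: copy pixel i from src to dst when its mask bit is
 * set. Bit 0x80 selects pixel 0 of each group of eight, bit 0x01 pixel 7. */
static void sketch_merge_by_mask(unsigned char *dst, const unsigned char *src,
                                 size_t width, size_t bpp, unsigned mask)
{
    unsigned m = 0x80;
    size_t i;

    for (i = 0; i < width; i++) {
        if (mask & m)
            memcpy(dst + i * bpp, src + i * bpp, bpp);
        m = (m == 1) ? 0x80 : (m >> 1);   /* wrap after eight pixels */
    }
}

/* Hypothetical helper: copy the runs of pixels that belong to `pass`,
 * clamping the last run so it never extends past the end of the row. */
static void sketch_merge_by_pass(unsigned char *dst, const unsigned char *src,
                                 size_t width, size_t bpp, int pass)
{
    size_t row_bytes, stride, rep, i;

    assert(pass >= 0 && pass < 7);
    row_bytes = width * bpp;
    stride = bpp * (size_t)sketch_pass_inc[pass];
    rep = bpp * (size_t)sketch_pass_width[pass];

    for (i = bpp * (size_t)sketch_pass_start[pass]; i < row_bytes; i += stride) {
        size_t n = (rep > row_bytes - i) ? row_bytes - i : rep;
        memcpy(dst + i, src + i, n);
    }
}
```

/* With an 11-pixel, 4-byte-per-pixel row, both helpers copy bytes 16..31
 * (pixels 4-7) for pass 1 / mask 0x0f and leave the rest of dst untouched.
 */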
/*===========================================================================*/
/* */
/* P N G _ D O _ R E A D _ I N T E R L A C E */
/* */
/*===========================================================================*/
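/* Illustration only, not part of libpng: a minimal sketch of the in-place
 * pixel replication that png_do_read_interlace() performs for pixels of one
 * or more whole bytes. The name sketch_expand_row() and its parameters are
 * hypothetical. Because source and destination share one buffer, the row is
 * walked from the last pixel to the first so that no source pixel is
 * overwritten before it has been read. The caller is assumed to provide a
 * buffer of at least width * inc * pixel_bytes bytes.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replicate each of the `width` pixels in `row`
 * `inc` times, in place, working backward from the end of the row. */
static void sketch_expand_row(unsigned char *row, size_t width,
                              size_t pixel_bytes, size_t inc)
{
    unsigned char v[8];          /* one pixel, at most 8 bytes (64 bpp) */
    size_t s = width;            /* one past the current source pixel */
    size_t d = width * inc;      /* one past the current destination pixel */
    size_t j;

    assert(pixel_bytes >= 1 && pixel_bytes <= sizeof v);
    assert(inc >= 1);

    while (s > 0) {
        --s;
        memcpy(v, row + s * pixel_bytes, pixel_bytes);
        for (j = 0; j < inc; j++) {
            --d;
            memcpy(row + d * pixel_bytes, v, pixel_bytes);
        }
    }
}
```

/* The MMX paths in the function produce the same layout; they only process
 * several pixels per iteration and leave the remainder to a loop like this.
 */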
#if defined(PNG_READ_INTERLACING_SUPPORTED)
#if defined(PNG_HAVE_ASSEMBLER_READ_INTERLACE)
/* png_do_read_interlace() is called after any 16-bit to 8-bit conversion
* has taken place. [GRR: what other steps come before and/or after?]
*/
void /* PRIVATE */
png_do_read_interlace(png_structp png_ptr)
{
png_row_infop row_info = &(png_ptr->row_info);
png_bytep row = png_ptr->row_buf + 1;
int pass = png_ptr->pass;
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
png_uint_32 transformations = png_ptr->transformations;
#endif
png_debug(1, "in png_do_read_interlace (pnggccrd.c)\n");
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED)
if (_mmx_supported == 2) {
#if !defined(PNG_1_0_X)
/* this should have happened in png_init_mmx_flags() already */
png_warning(png_ptr, "asm_flags may not have been initialized");
#endif
png_mmx_support();
}
#endif
if (row != NULL && row_info != NULL)
{
png_uint_32 final_width;
final_width = row_info->width * png_pass_inc[pass];
switch (row_info->pixel_depth)
{
case 1:
{
png_bytep sp, dp;
int sshift, dshift;
int s_start, s_end, s_inc;
png_byte v;
png_uint_32 i;
int j;
sp = row + (png_size_t)((row_info->width - 1) >> 3);
dp = row + (png_size_t)((final_width - 1) >> 3);
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (transformations & PNG_PACKSWAP)
{
sshift = (int)((row_info->width + 7) & 7);
dshift = (int)((final_width + 7) & 7);
s_start = 7;
s_end = 0;
s_inc = -1;
}
else
#endif
{
sshift = 7 - (int)((row_info->width + 7) & 7);
dshift = 7 - (int)((final_width + 7) & 7);
s_start = 0;
s_end = 7;
s_inc = 1;
}
for (i = row_info->width; i; i--)
{
v = (png_byte)((*sp >> sshift) & 0x1);
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp &= (png_byte)((0x7f7f >> (7 - dshift)) & 0xff);
*dp |= (png_byte)(v << dshift);
if (dshift == s_end)
{
dshift = s_start;
dp--;
}
else
dshift += s_inc;
}
if (sshift == s_end)
{
sshift = s_start;
sp--;
}
else
sshift += s_inc;
}
break;
}
case 2:
{
png_bytep sp, dp;
int sshift, dshift;
int s_start, s_end, s_inc;
png_uint_32 i;
sp = row + (png_size_t)((row_info->width - 1) >> 2);
dp = row + (png_size_t)((final_width - 1) >> 2);
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (transformations & PNG_PACKSWAP)
{
sshift = (png_size_t)(((row_info->width + 3) & 3) << 1);
dshift = (png_size_t)(((final_width + 3) & 3) << 1);
s_start = 6;
s_end = 0;
s_inc = -2;
}
else
#endif
{
sshift = (png_size_t)((3 - ((row_info->width + 3) & 3)) << 1);
dshift = (png_size_t)((3 - ((final_width + 3) & 3)) << 1);
s_start = 0;
s_end = 6;
s_inc = 2;
}
for (i = row_info->width; i; i--)
{
png_byte v;
int j;
v = (png_byte)((*sp >> sshift) & 0x3);
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp &= (png_byte)((0x3f3f >> (6 - dshift)) & 0xff);
*dp |= (png_byte)(v << dshift);
if (dshift == s_end)
{
dshift = s_start;
dp--;
}
else
dshift += s_inc;
}
if (sshift == s_end)
{
sshift = s_start;
sp--;
}
else
sshift += s_inc;
}
break;
}
case 4:
{
png_bytep sp, dp;
int sshift, dshift;
int s_start, s_end, s_inc;
png_uint_32 i;
sp = row + (png_size_t)((row_info->width - 1) >> 1);
dp = row + (png_size_t)((final_width - 1) >> 1);
#if defined(PNG_READ_PACKSWAP_SUPPORTED)
if (transformations & PNG_PACKSWAP)
{
sshift = (png_size_t)(((row_info->width + 1) & 1) << 2);
dshift = (png_size_t)(((final_width + 1) & 1) << 2);
s_start = 4;
s_end = 0;
s_inc = -4;
}
else
#endif
{
sshift = (png_size_t)((1 - ((row_info->width + 1) & 1)) << 2);
dshift = (png_size_t)((1 - ((final_width + 1) & 1)) << 2);
s_start = 0;
s_end = 4;
s_inc = 4;
}
for (i = row_info->width; i; i--)
{
png_byte v;
int j;
v = (png_byte)((*sp >> sshift) & 0xf);
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp &= (png_byte)((0xf0f >> (4 - dshift)) & 0xff);
*dp |= (png_byte)(v << dshift);
if (dshift == s_end)
{
dshift = s_start;
dp--;
}
else
dshift += s_inc;
}
if (sshift == s_end)
{
sshift = s_start;
sp--;
}
else
sshift += s_inc;
}
break;
}
/*====================================================================*/
default: /* 8-bit or larger (this is where the routine is modified) */
{
#if 0
// static unsigned long long _const4 = 0x0000000000FFFFFFLL; no good
// static unsigned long long const4 = 0x0000000000FFFFFFLL; no good
// unsigned long long _const4 = 0x0000000000FFFFFFLL; no good
// unsigned long long const4 = 0x0000000000FFFFFFLL; no good
#endif
png_bytep sptr, dp;
png_uint_32 i;
png_size_t pixel_bytes;
int width = (int)row_info->width;
pixel_bytes = (row_info->pixel_depth >> 3);
/* point sptr at the last pixel in the pre-expanded row: */
sptr = row + (width - 1) * pixel_bytes;
/* point dp at the last pixel position in the expanded row: */
dp = row + (final_width - 1) * pixel_bytes;
/* New code by Nirav Chhatrapati - Intel Corporation */
#if defined(PNG_ASSEMBLER_CODE_SUPPORTED)
#if !defined(PNG_1_0_X)
if ((png_ptr->asm_flags & PNG_ASM_FLAG_MMX_READ_INTERLACE)
/* && _mmx_supported */ )
#else
if (_mmx_supported)
#endif
{
//--------------------------------------------------------------
if (pixel_bytes == 3)
{
if (((pass == 0) || (pass == 1)) && width)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $21, %%edi \n\t"
// (png_pass_inc[pass] - 1)*pixel_bytes
".loop3_pass0: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x x 2 1 0
"pand _const4, %%mm0 \n\t" // z z z z z 2 1 0
"movq %%mm0, %%mm1 \n\t" // z z z z z 2 1 0
"psllq $16, %%mm0 \n\t" // z z z 2 1 0 z z
"movq %%mm0, %%mm2 \n\t" // z z z 2 1 0 z z
"psllq $24, %%mm0 \n\t" // 2 1 0 z z z z z
"psrlq $8, %%mm1 \n\t" // z z z z z z 2 1
"por %%mm2, %%mm0 \n\t" // 2 1 0 2 1 0 z z
"por %%mm1, %%mm0 \n\t" // 2 1 0 2 1 0 2 1
"movq %%mm0, %%mm3 \n\t" // 2 1 0 2 1 0 2 1
"psllq $16, %%mm0 \n\t" // 0 2 1 0 2 1 z z
"movq %%mm3, %%mm4 \n\t" // 2 1 0 2 1 0 2 1
"punpckhdq %%mm0, %%mm3 \n\t" // 0 2 1 0 2 1 0 2
"movq %%mm4, 16(%%edi) \n\t"
"psrlq $32, %%mm0 \n\t" // z z z z 0 2 1 0
"movq %%mm3, 8(%%edi) \n\t"
"punpckldq %%mm4, %%mm0 \n\t" // 1 0 2 1 0 2 1 0
"subl $3, %%esi \n\t"
"movq %%mm0, (%%edi) \n\t"
"subl $24, %%edi \n\t"
"decl %%ecx \n\t"
"jnz .loop3_pass0 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width) // ecx
// doesn't work "i" (0x0000000000FFFFFFLL) // %1 (a.k.a. _const4)
#if 0 /* %mm0, ..., %mm4 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2" // clobber list
, "%mm3", "%mm4"
#endif
);
}
else if (((pass == 2) || (pass == 3)) && width)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $9, %%edi \n\t"
// (png_pass_inc[pass] - 1)*pixel_bytes
".loop3_pass2: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x x 2 1 0
"pand _const4, %%mm0 \n\t" // z z z z z 2 1 0
"movq %%mm0, %%mm1 \n\t" // z z z z z 2 1 0
"psllq $16, %%mm0 \n\t" // z z z 2 1 0 z z
"movq %%mm0, %%mm2 \n\t" // z z z 2 1 0 z z
"psllq $24, %%mm0 \n\t" // 2 1 0 z z z z z
"psrlq $8, %%mm1 \n\t" // z z z z z z 2 1
"por %%mm2, %%mm0 \n\t" // 2 1 0 2 1 0 z z
"por %%mm1, %%mm0 \n\t" // 2 1 0 2 1 0 2 1
"movq %%mm0, 4(%%edi) \n\t"
"psrlq $16, %%mm0 \n\t" // z z 2 1 0 2 1 0
"subl $3, %%esi \n\t"
"movd %%mm0, (%%edi) \n\t"
"subl $12, %%edi \n\t"
"decl %%ecx \n\t"
"jnz .loop3_pass2 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width) // ecx
#if 0 /* %mm0, ..., %mm2 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2" // clobber list
#endif
);
}
else if (width) /* && ((pass == 4) || (pass == 5)) */
{
int width_mmx = ((width >> 1) << 1) - 8; // GRR: huh?
if (width_mmx < 0)
width_mmx = 0;
width -= width_mmx; // 8 or 9 pix, 24 or 27 bytes
if (width_mmx)
{
// png_pass_inc[] = {8, 8, 4, 4, 2, 2, 1};
// sptr points at last pixel in pre-expanded row
// dp points at last pixel position in expanded row
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $3, %%esi \n\t"
"subl $9, %%edi \n\t"
// (png_pass_inc[pass] + 1)*pixel_bytes
".loop3_pass4: \n\t"
"movq (%%esi), %%mm0 \n\t" // x x 5 4 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // x x 5 4 3 2 1 0
"movq %%mm0, %%mm2 \n\t" // x x 5 4 3 2 1 0
"psllq $24, %%mm0 \n\t" // 4 3 2 1 0 z z z
"pand _const4, %%mm1 \n\t" // z z z z z 2 1 0
"psrlq $24, %%mm2 \n\t" // z z z x x 5 4 3
"por %%mm1, %%mm0 \n\t" // 4 3 2 1 0 2 1 0
"movq %%mm2, %%mm3 \n\t" // z z z x x 5 4 3
"psllq $8, %%mm2 \n\t" // z z x x 5 4 3 z
"movq %%mm0, (%%edi) \n\t"
"psrlq $16, %%mm3 \n\t" // z z z z z x x 5
"pand _const6, %%mm3 \n\t" // z z z z z z z 5
"por %%mm3, %%mm2 \n\t" // z z x x 5 4 3 5
"subl $6, %%esi \n\t"
"movd %%mm2, 8(%%edi) \n\t"
"subl $12, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop3_pass4 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, ..., %mm3 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
, "%mm2", "%mm3"
#endif
);
}
sptr -= width_mmx*3;
dp -= width_mmx*6;
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 3);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, 3);
dp -= 3;
}
sptr -= 3;
}
}
} /* end of pixel_bytes == 3 */
//--------------------------------------------------------------
else if (pixel_bytes == 1)
{
if (((pass == 0) || (pass == 1)) && width)
{
int width_mmx = ((width >> 2) << 2);
width -= width_mmx; // 0-3 pixels => 0-3 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $3, %%esi \n\t"
"subl $31, %%edi \n\t"
".loop1_pass0: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // x x x x 3 2 1 0
"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
"movq %%mm0, %%mm2 \n\t" // 3 3 2 2 1 1 0 0
"punpcklwd %%mm0, %%mm0 \n\t" // 1 1 1 1 0 0 0 0
"movq %%mm0, %%mm3 \n\t" // 1 1 1 1 0 0 0 0
"punpckldq %%mm0, %%mm0 \n\t" // 0 0 0 0 0 0 0 0
"punpckhdq %%mm3, %%mm3 \n\t" // 1 1 1 1 1 1 1 1
"movq %%mm0, (%%edi) \n\t"
"punpckhwd %%mm2, %%mm2 \n\t" // 3 3 3 3 2 2 2 2
"movq %%mm3, 8(%%edi) \n\t"
"movq %%mm2, %%mm4 \n\t" // 3 3 3 3 2 2 2 2
"punpckldq %%mm2, %%mm2 \n\t" // 2 2 2 2 2 2 2 2
"punpckhdq %%mm4, %%mm4 \n\t" // 3 3 3 3 3 3 3 3
"movq %%mm2, 16(%%edi) \n\t"
"subl $4, %%esi \n\t"
"movq %%mm4, 24(%%edi) \n\t"
"subl $32, %%edi \n\t"
"subl $4, %%ecx \n\t"
"jnz .loop1_pass0 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, ..., %mm4 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1", "%mm2" // clobber list
, "%mm3", "%mm4"
#endif
);
}
sptr -= width_mmx;
dp -= width_mmx*8;
for (i = width; i; i--)
{
int j;
/* I simplified this part in version 1.0.4e
* here and in several other instances where
* pixel_bytes == 1 -- GR-P
*
* Original code:
*
* png_byte v[8];
* png_memcpy(v, sptr, pixel_bytes);
* for (j = 0; j < png_pass_inc[pass]; j++)
* {
* png_memcpy(dp, v, pixel_bytes);
* dp -= pixel_bytes;
* }
* sptr -= pixel_bytes;
*
* Replacement code is in the next three lines:
*/
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp-- = *sptr;
}
--sptr;
}
}
else if (((pass == 2) || (pass == 3)) && width)
{
int width_mmx = ((width >> 2) << 2);
width -= width_mmx; // 0-3 pixels => 0-3 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $3, %%esi \n\t"
"subl $15, %%edi \n\t"
".loop1_pass2: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
"movq %%mm0, %%mm1 \n\t" // 3 3 2 2 1 1 0 0
"punpcklwd %%mm0, %%mm0 \n\t" // 1 1 1 1 0 0 0 0
"punpckhwd %%mm1, %%mm1 \n\t" // 3 3 3 3 2 2 2 2
"movq %%mm0, (%%edi) \n\t"
"subl $4, %%esi \n\t"
"movq %%mm1, 8(%%edi) \n\t"
"subl $16, %%edi \n\t"
"subl $4, %%ecx \n\t"
"jnz .loop1_pass2 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= width_mmx;
dp -= width_mmx*4;
for (i = width; i; i--)
{
int j;
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp-- = *sptr;
}
--sptr;
}
}
else if (width) /* && ((pass == 4) || (pass == 5)) */
{
int width_mmx = ((width >> 3) << 3);
width -= width_mmx; // 0-7 pixels => 0-7 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $7, %%esi \n\t"
"subl $15, %%edi \n\t"
".loop1_pass4: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
"punpcklbw %%mm0, %%mm0 \n\t" // 3 3 2 2 1 1 0 0
"punpckhbw %%mm1, %%mm1 \n\t" // 7 7 6 6 5 5 4 4
"movq %%mm1, 8(%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm0, (%%edi) \n\t"
"subl $16, %%edi \n\t"
"subl $8, %%ecx \n\t"
"jnz .loop1_pass4 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= width_mmx;
dp -= width_mmx*2;
for (i = width; i; i--)
{
int j;
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp-- = *sptr;
}
--sptr;
}
}
} /* end of pixel_bytes == 1 */
//--------------------------------------------------------------
else if (pixel_bytes == 2)
{
if (((pass == 0) || (pass == 1)) && width)
{
int width_mmx = ((width >> 1) << 1);
width -= width_mmx; // 0,1 pixels => 0,2 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $2, %%esi \n\t"
"subl $30, %%edi \n\t"
".loop2_pass0: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
"movq %%mm0, %%mm1 \n\t" // 3 2 3 2 1 0 1 0
"punpckldq %%mm0, %%mm0 \n\t" // 1 0 1 0 1 0 1 0
"punpckhdq %%mm1, %%mm1 \n\t" // 3 2 3 2 3 2 3 2
"movq %%mm0, (%%edi) \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"movq %%mm1, 16(%%edi) \n\t"
"subl $4, %%esi \n\t"
"movq %%mm1, 24(%%edi) \n\t"
"subl $32, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop2_pass0 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= (width_mmx*2 - 2); // sign fixed
dp -= (width_mmx*16 - 2); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 2;
png_memcpy(v, sptr, 2);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 2;
png_memcpy(dp, v, 2);
}
}
}
else if (((pass == 2) || (pass == 3)) && width)
{
int width_mmx = ((width >> 1) << 1) ;
width -= width_mmx; // 0,1 pixels => 0,2 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $2, %%esi \n\t"
"subl $14, %%edi \n\t"
".loop2_pass2: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
"movq %%mm0, %%mm1 \n\t" // 3 2 3 2 1 0 1 0
"punpckldq %%mm0, %%mm0 \n\t" // 1 0 1 0 1 0 1 0
"punpckhdq %%mm1, %%mm1 \n\t" // 3 2 3 2 3 2 3 2
"movq %%mm0, (%%edi) \n\t"
"subl $4, %%esi \n\t"
"movq %%mm1, 8(%%edi) \n\t"
"subl $16, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop2_pass2 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= (width_mmx*2 - 2); // sign fixed
dp -= (width_mmx*8 - 2); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 2;
png_memcpy(v, sptr, 2);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 2;
png_memcpy(dp, v, 2);
}
}
}
else if (width) // pass == 4 or 5
{
int width_mmx = ((width >> 1) << 1) ;
width -= width_mmx; // 0,1 pixels => 0,2 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $2, %%esi \n\t"
"subl $6, %%edi \n\t"
".loop2_pass4: \n\t"
"movd (%%esi), %%mm0 \n\t" // x x x x 3 2 1 0
"punpcklwd %%mm0, %%mm0 \n\t" // 3 2 3 2 1 0 1 0
"subl $4, %%esi \n\t"
"movq %%mm0, (%%edi) \n\t"
"subl $8, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop2_pass4 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0" // clobber list
#endif
);
}
sptr -= (width_mmx*2 - 2); // sign fixed
dp -= (width_mmx*4 - 2); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 2;
png_memcpy(v, sptr, 2);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 2;
png_memcpy(dp, v, 2);
}
}
}
} /* end of pixel_bytes == 2 */
//--------------------------------------------------------------
else if (pixel_bytes == 4)
{
if (((pass == 0) || (pass == 1)) && width)
{
int width_mmx = ((width >> 1) << 1);
width -= width_mmx; // 0,1 pixels => 0,4 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $4, %%esi \n\t"
"subl $60, %%edi \n\t"
".loop4_pass0: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
"movq %%mm0, (%%edi) \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"movq %%mm0, 16(%%edi) \n\t"
"movq %%mm0, 24(%%edi) \n\t"
"movq %%mm1, 32(%%edi) \n\t"
"movq %%mm1, 40(%%edi) \n\t"
"movq %%mm1, 48(%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm1, 56(%%edi) \n\t"
"subl $64, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop4_pass0 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= (width_mmx*4 - 4); // sign fixed
dp -= (width_mmx*32 - 4); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 4;
png_memcpy(v, sptr, 4);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 4;
png_memcpy(dp, v, 4);
}
}
}
else if (((pass == 2) || (pass == 3)) && width)
{
int width_mmx = ((width >> 1) << 1);
width -= width_mmx; // 0,1 pixels => 0,4 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $4, %%esi \n\t"
"subl $28, %%edi \n\t"
".loop4_pass2: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
"movq %%mm0, (%%edi) \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"movq %%mm1, 16(%%edi) \n\t"
"movq %%mm1, 24(%%edi) \n\t"
"subl $8, %%esi \n\t"
"subl $32, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop4_pass2 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= (width_mmx*4 - 4); // sign fixed
dp -= (width_mmx*16 - 4); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 4;
png_memcpy(v, sptr, 4);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 4;
png_memcpy(dp, v, 4);
}
}
}
else if (width) // pass == 4 or 5
{
int width_mmx = ((width >> 1) << 1) ;
width -= width_mmx; // 0,1 pixels => 0,4 bytes
if (width_mmx)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $4, %%esi \n\t"
"subl $12, %%edi \n\t"
".loop4_pass4: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, %%mm1 \n\t" // 7 6 5 4 3 2 1 0
"punpckldq %%mm0, %%mm0 \n\t" // 3 2 1 0 3 2 1 0
"punpckhdq %%mm1, %%mm1 \n\t" // 7 6 5 4 7 6 5 4
"movq %%mm0, (%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm1, 8(%%edi) \n\t"
"subl $16, %%edi \n\t"
"subl $2, %%ecx \n\t"
"jnz .loop4_pass4 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width_mmx) // ecx
#if 0 /* %mm0, %mm1 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0", "%mm1" // clobber list
#endif
);
}
sptr -= (width_mmx*4 - 4); // sign fixed
dp -= (width_mmx*8 - 4); // sign fixed
for (i = width; i; i--)
{
png_byte v[8];
int j;
sptr -= 4;
png_memcpy(v, sptr, 4);
for (j = 0; j < png_pass_inc[pass]; j++)
{
dp -= 4;
png_memcpy(dp, v, 4);
}
}
}
} /* end of pixel_bytes == 4 */
//--------------------------------------------------------------
else if (pixel_bytes == 8)
{
// GRR TEST: should work, but needs testing (special 64-bit version of rpng2?)
// GRR NOTE: no need to combine passes here!
if (((pass == 0) || (pass == 1)) && width)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
// source is 8-byte RRGGBBAA
// dest is 64-byte RRGGBBAA RRGGBBAA RRGGBBAA RRGGBBAA ...
__asm__ __volatile__ (
"subl $56, %%edi \n\t" // start of last block
".loop8_pass0: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, (%%edi) \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"movq %%mm0, 16(%%edi) \n\t"
"movq %%mm0, 24(%%edi) \n\t"
"movq %%mm0, 32(%%edi) \n\t"
"movq %%mm0, 40(%%edi) \n\t"
"movq %%mm0, 48(%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm0, 56(%%edi) \n\t"
"subl $64, %%edi \n\t"
"decl %%ecx \n\t"
"jnz .loop8_pass0 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width) // ecx
#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0" // clobber list
#endif
);
}
else if (((pass == 2) || (pass == 3)) && width)
{
// source is 8-byte RRGGBBAA
// dest is 32-byte RRGGBBAA RRGGBBAA RRGGBBAA RRGGBBAA
// (recall that expansion is _in place_: sptr and dp
// both point at locations within same row buffer)
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $24, %%edi \n\t" // start of last block
".loop8_pass2: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, (%%edi) \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"movq %%mm0, 16(%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm0, 24(%%edi) \n\t"
"subl $32, %%edi \n\t"
"decl %%ecx \n\t"
"jnz .loop8_pass2 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width) // ecx
#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0" // clobber list
#endif
);
}
}
else if (width) // pass == 4 or 5
{
// source is 8-byte RRGGBBAA
// dest is 16-byte RRGGBBAA RRGGBBAA
{
int dummy_value_c; // fix 'forbidden register spilled'
int dummy_value_S;
int dummy_value_D;
__asm__ __volatile__ (
"subl $8, %%edi \n\t" // start of last block
".loop8_pass4: \n\t"
"movq (%%esi), %%mm0 \n\t" // 7 6 5 4 3 2 1 0
"movq %%mm0, (%%edi) \n\t"
"subl $8, %%esi \n\t"
"movq %%mm0, 8(%%edi) \n\t"
"subl $16, %%edi \n\t"
"decl %%ecx \n\t"
"jnz .loop8_pass4 \n\t"
"EMMS \n\t" // DONE
: "=c" (dummy_value_c), // output regs (dummy)
"=S" (dummy_value_S),
"=D" (dummy_value_D)
: "1" (sptr), // esi // input regs
"2" (dp), // edi
"0" (width) // ecx
#if 0 /* %mm0 not supported by gcc 2.7.2.3 or egcs 1.1 */
: "%mm0" // clobber list
#endif
);
}
}
} /* end of pixel_bytes == 8 */
//--------------------------------------------------------------
else if (pixel_bytes == 6)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 6);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, 6);
dp -= 6;
}
sptr -= 6;
}
} /* end of pixel_bytes == 6 */
//--------------------------------------------------------------
else
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, pixel_bytes);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, pixel_bytes);
dp -= pixel_bytes;
}
sptr -= pixel_bytes;
}
}
} // end of _mmx_supported ========================================
else /* MMX not supported: use modified C code - takes advantage
* of inlining of png_memcpy for a constant */
/* GRR 19991007: does it? or should pixel_bytes in each
* block be replaced with immediate value (e.g., 1)? */
/* GRR 19991017: replaced with constants in each case */
#endif /* PNG_ASSEMBLER_CODE_SUPPORTED */
{
if (pixel_bytes == 1)
{
for (i = width; i; i--)
{
int j;
for (j = 0; j < png_pass_inc[pass]; j++)
{
*dp-- = *sptr;
}
--sptr;
}
}
else if (pixel_bytes == 3)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 3);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, 3);
dp -= 3;
}
sptr -= 3;
}
}
else if (pixel_bytes == 2)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 2);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, 2);
dp -= 2;
}
sptr -= 2;
}
}
else if (pixel_bytes == 4)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 4);
for (j = 0; j < png_pass_inc[pass]; j++)
{
#ifdef PNG_DEBUG
if (dp < row || dp+3 > row+png_ptr->row_buf_size)
{
printf("dp out of bounds: row=%p, dp=%p, rp=%p\n",
(void *)row, (void *)dp, (void *)(row+png_ptr->row_buf_size));
printf("row_buf=%lu\n", (unsigned long)png_ptr->row_buf_size);
}
#endif
png_memcpy(dp, v, 4);
dp -= 4;
}
sptr -= 4;
}
}
else if (pixel_bytes == 6)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 6);
for (j = 0; j < png_pass_inc[pass]; j++)
{
png_memcpy(dp, v, 6);
dp -= 6;
}
sptr -= 6;
}
}
else if (pixel_bytes == 8)
{
for (i = width; i; i--)
{
png_byte v[8];
int j;
png_memcpy(v, sptr, 8);