Half-Precision Floating Point Format

Half precision floating point is a 16-bit binary floating-point interchange format. It was not part of the original ANSI/IEEE 754 Standard for Binary Floating-Point Arithmetic published in 1985 but is included in the current version of the standard, IEEE 754-2008 (previously known as IEEE 754r) which was published last August. See this Wikipedia article for background information.

Floating point formats defined by this standard are classified as either interchange or non-interchange. In the standard storage formats are narrow interchange formats, i.e. the set of floating point values that can be stored by the specified binary encoding is a proper subset of wider floating point formats such as the 32-bit float and 64-bit double. In particular, the standard defines the encodings for one binary storage floating-point format of 16 bits length and radix 2, and one decimal storage floating-point format of 32 bits length. Note that these two formats are for storage only and are not defined for in-memory arithmetic operations. The remainder of this post is about the 16-bit binary storage format which we will refer to as half precision.

Half precision (also known as a 1.5.10 or s10e5 minifloat) was added to the standard because it is the de facto storage format for certain floating-point values frequently used in modern graphics processing units (GPUs) where minimizing memory usage and bus traffic is a major challange and priority.& It is used in several computer graphics environments including OpenGL, OpenEXR, and by hardware in MP3 decoders and nVidia graphic cards.&nbsp This format became popular because it can store a larger range of values than an int16 without requiring the bandwidth and storage space of a float type. Typically this increased range of numbers is used to preserve more highlighting and shadow detail. The minimum and maximum representable values are 2.98Ã—10-8 and 65504 respectively.

I only know of one C or C++ compiler which supports half precision i.e. Sourcery G++ Lite. It uses an __fp16 type to represent half precision with a number of limitations. However, whether __fp16 becomes part of ISO C remains to be seen.& The lastest version of the C++ ABI also provides some support for name mangling of half precisions. The GNU debugger appears to have some limited support. Ruby supports half precision (IEEE_binary16) using the float-formats package (but only for little endian platforms according to the float-formats README). The Python structs module which is the logical home for half-precision support does not currently support this format. A cursory search of CPAN did not reveal any modules with support for half precision in Perl.

The following is a small C program which demonstrates how to encode and decode the half precision binary format. It is a based on some code from OGRE (Object-oriented Graphics Rendering Engine) header.

/*
**  This program is free software; you can redistribute it and/or modify it under
**  the terms of the GNU Lesser General Public License, as published by the Free 
**  Software Foundation; either version 2 of the License, or (at your option) any
**  later version.
**
**   IEEE 758-2008 Half-precision Floating Point Format
**   --------------------------------------------------
**
**   | Field    | Last | First | Note
**   |----------|------|-------|----------
**   | Sign     | 15   | 15    |
**   | Exponent | 14   | 10    | Bias = 15
**   | Fraction | 9    | 0     |
*/

#include <stdio.h>
#include <inttypes.h>

typedef uint16_t HALF;

/* ----- prototypes ------ */
float HALFToFloat(HALF);
HALF floatToHALF(float);
static uint32_t halfToFloatI(HALF);
static HALF floatToHalfI(uint32_t);

float
HALFToFloat(HALF y)
{
    union { float f; uint32_t i; } v;
    v.i = halfToFloatI(y);
    return v.f;
}

uint32_t
static halfToFloatI(HALF y)
{
    int s = (y >> 15) & 0x00000001;                            // sign
    int e = (y >> 10) & 0x0000001f;                            // exponent
    int f =  y        & 0x000003ff;                            // fraction

    // need to handle 7c00 INF and fc00 -INF?
    if (e == 0) {
        // need to handle +-0 case f==0 or f=0x8000?
        if (f == 0)                                            // Plus or minus zero
            return s << 31;
        else {                                                 // Denormalized number -- renormalize it
            while (!(f & 0x00000400)) {
                f <<= 1;
                e -=  1;
            }
            e += 1;
            f &= ~0x00000400;
        }
    } else if (e == 31) {
        if (f == 0)                                             // Inf
            return (s << 31) | 0x7f800000;
        else                                                    // NaN
            return (s << 31) | 0x7f800000 | (f << 13);
    }

    e = e + (127 - 15);
    f = f << 13;

    return ((s << 31) | (e << 23) | f);
}

HALF
floatToHALF(float i)
{
    union { float f; uint32_t i; } v;
    v.f = i;
    return floatToHalfI(v.i);
}

HALF
static floatToHalfI(uint32_t i)
{
    register int s =  (i >> 16) & 0x00008000;                   // sign
    register int e = ((i >> 23) & 0x000000ff) - (127 - 15);     // exponent
    register int f =   i        & 0x007fffff;                   // fraction

    // need to handle NaNs and Inf?
    if (e <= 0) {
        if (e < -10) {
            if (s)                                              // handle -0.0
               return 0x8000;
            else
               return 0;
        }
        f = (f | 0x00800000) >> (1 - e);
        return s | (f >> 13);
    } else if (e == 0xff - (127 - 15)) {
        if (f == 0)                                             // Inf
            return s | 0x7c00;
        else {                                                  // NAN
            f >>= 13;
            return s | 0x7c00 | f | (f == 0);
        }
    } else {
        if (e > 30)                                             // Overflow
            return s | 0x7c00;
        return s | (e << 10) | (f >> 13);
    }
}

int
main(int argc, char *argv[])
{
   float f1, f2;
   HALF h;

   printf("Please enter a floating point number: ");
   scanf("%f", &f1);

   h = floatToHALF(f1);
   f2 = HALFToFloat(h);

   printf("Results are: %f %f %#lx\n", f1, f2, h);
}

The following example shows one way to read a single floating point number (1.0) encoded in a half precision binary floating point two byte string into a float using Python.

import struct

def HalfToFloat(h):
    s = int((h >> 15) & 0x00000001)     # sign
    e = int((h >> 10) & 0x0000001f)     # exponent
    f = int(h &         0x000003ff)     # fraction

    if e == 0:
       if f == 0:
          return int(s << 31)
       else:
          while not (f & 0x00000400):
             f <<= 1
             e -= 1
          e += 1
          f &= ~0x00000400
          print s,e,f
    elif e == 31:
       if f == 0:
          return int((s << 31) | 0x7f800000)
       else:
          return int((s << 31) | 0x7f800000 | (f << 13))

    e = e + (127 -15)
    f = f << 13

    return int((s << 31) | (e << 23) | f)


if __name__ == "__main__":

    # a half precision binary floating point string
    FP16='\x00\x3c'

    v = struct.unpack('H', FP16)
    x = HalfToFloat(v[0])

    # hack to coerce to float
    str = struct.pack('I',x)
    f=struct.unpack('f', str)

    # print the resulting floating point
    print f[0]

I hope this post provided you with some useful information on the interesting subject of the half precision floating point binary format. I suspect within a few years most compilers and scripting langauges will support it.

Musings

Translate

See Also

Archives

Half-Precision Floating Point Format

1 comment to Half-Precision Floating Point Format

Musings

Translate

See Also

Archives

Half-Precision Floating Point Format

1 comment to Half-Precision Floating Point Format

Tags