Manipulating Binary Data Using The Korn Shell

Most people are unaware that ksh93 (Korn Shell 93) can handle binary data. As the following examples will demonstrate, ksh93 is perfectly capable of generating binary data files, making an exact copy of a binary file and manipulating binary files.

For my first example, I demonstrate how to create a 256-byte binary file containing all the binary values from 0x00 (NUL) to 0xFF.

#!/bin/ksh93

typeset -i8 value

redirect 3>out.hex || exit 1

for ((value = 0; value < 256; value++))
do
   print -u 3 -f "\\${value#8#}"
done

redirect 3<&- || echo 'cannot close FD 3'

exit 0


The script creates a binary file containing the full ASCII table (NUL to 0x7F) plus the values from 0x80 to 0xFF (sometimes known as the Extended ASCII Table). By the way, redirect is simply an alias for command exec. I assume you are familiar with manipulating file descriptors in ksh93 or other shells. If not, read the appropriate section of the ksh93 man page.

Here is a hex dump, using xxd, of out.hex:

$ xxd out.hex
00000000: 0001 0203 0405 0607 0809 0a0b 0c0d 0e0f  ................
00000010: 1011 1213 1415 1617 1819 1a1b 1c1d 1e1f  ................
00000020: 2021 2223 2425 2627 2829 2a2b 2c2d 2e2f   !"#$%&'()*+,-./
00000030: 3031 3233 3435 3637 3839 3a3b 3c3d 3e3f  0123456789:;<=>?
00000040: 4041 4243 4445 4647 4849 4a4b 4c4d 4e4f  @ABCDEFGHIJKLMNO
00000050: 5051 5253 5455 5657 5859 5a5b 5c5d 5e5f  PQRSTUVWXYZ[\]^_
00000060: 6061 6263 6465 6667 6869 6a6b 6c6d 6e6f  `abcdefghijklmno
00000070: 7071 7273 7475 7677 7879 7a7b 7c7d 7e7f  pqrstuvwxyz{|}~.
00000080: 8081 8283 8485 8687 8889 8a8b 8c8d 8e8f  ................
00000090: 9091 9293 9495 9697 9899 9a9b 9c9d 9e9f  ................
000000a0: a0a1 a2a3 a4a5 a6a7 a8a9 aaab acad aeaf  ................
000000b0: b0b1 b2b3 b4b5 b6b7 b8b9 babb bcbd bebf  ................
000000c0: c0c1 c2c3 c4c5 c6c7 c8c9 cacb cccd cecf  ................
000000d0: d0d1 d2d3 d4d5 d6d7 d8d9 dadb dcdd dedf  ................
000000e0: e0e1 e2e3 e4e5 e6e7 e8e9 eaeb eced eeef  ................
000000f0: f0f1 f2f3 f4f5 f6f7 f8f9 fafb fcfd feff  ................
$
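
A hex dump is easy to misread, so it can help to check out.hex mechanically as well. The sketch below is plain POSIX shell rather than ksh93-specific code; check_all_bytes is a hypothetical helper name, and it assumes the external od utility is available.

```shell
# check_all_bytes FILE
# Hypothetical helper: succeeds only if FILE holds exactly the 256 byte
# values 0..255, in ascending order. Relies on the external od utility.
check_all_bytes() {
    # Build the expected list: one decimal value per line, 0 through 255.
    expected=$(i=0; while [ "$i" -lt 256 ]; do echo "$i"; i=$((i + 1)); done)
    # od prints each byte as an unsigned decimal; put one value per line.
    actual=$(od -An -v -tu1 "$1" | tr -s ' ' '\n' | sed '/^$/d')
    [ "$actual" = "$expected" ]
}
```

A call such as check_all_bytes out.hex && echo "out.hex is complete" would then confirm the result without reading the dump by eye.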

My next example simply uses builtin ksh93 functionality to copy a binary file, image.jpg, to a binary file named image.cpy.

#!/bin/ksh93
#
# copy a binary file
#
   
typeset -b byte

command exec 3<image.jpg || exit 1

bytes=0

eof=$(3<#((EOF)))
3<#((0))

:> image.cpy

while (( $(3<#((CUR))) < $eof ))
do
    # print "At offset $(3<#)"
    read -r -u 3 -N 1 byte
    printf "%B" byte >> image.cpy
    (( bytes ++ ))
done

redirect 3<&- || echo 'cannot close FD 3'

print "$bytes copied"
exit 0


The key to understanding how this script works is understanding what typeset -b does. From the ksh93 manpage:

-b     The variable can hold any number of bytes of data.  The data can be text
       or binary.  The value is represented by the base64 encoding of the  data.
       If -Z is also specified, the size in bytes of the data in the buffer will
       be determined by the size associated with the -Z.  If the  base64  string
       assigned  results in more data, it will be truncated.  Otherwise, it will
       be filled with bytes whose value is zero.  The printf format  %B  can  be
       used  to  output  the  actual  data  in this buffer instead of the base64
       encoding of the data.


The script works by reading the source file byte by byte and storing each byte in the variable called byte. Internally, this byte is stored as a base64-encoded string. This was David Korn’s solution to the design issue of how to store a NUL (character 0 in the portable character set corresponding to US ASCII) in a NUL-terminated string. Remember, unlike some other programming languages such as Pascal, strings are NUL-terminated in the C programming language, which is what ksh93 and zsh are written in.
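
To see why base64 sidesteps the NUL problem, it helps to look at what an encoded value looks like. This sketch uses the external base64 utility (for example the one shipped with GNU coreutils), which is an assumption on my part; encode_bytes is a hypothetical helper. It only illustrates the encoding and says nothing about how ksh93 implements typeset -b internally.

```shell
# encode_bytes FORMAT
# Hypothetical helper: FORMAT is a printf format made of octal escapes,
# e.g. '\000\001'. The raw bytes are piped through the external base64 tool.
encode_bytes() {
    printf "$1" | base64
}

# The bytes 0x00 and 0x01 cannot live in a NUL-terminated C string,
# but their base64 form is ordinary printable text.
encode_bytes '\000\001'    # prints AAE=
```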

The zsh shell uses a different mechanism but the end result is the same. It uses a guard byte, Meta, to guard the following byte.

From ..zsh/Src/zsh.h:

/* Meta together with the character following Meta denotes the character *
 * which is the exclusive or of 32 and the character following Meta.     *
 * This is used to represent characters which otherwise has special      *
 * meaning for zsh.  These are the characters for which the imeta() test *
 * is true: the null character, and the characters from Meta to Marker.  */

#define Meta            ((char) 0x83)

/* Note that the fourth character in DEFAULT_IFS is Meta *
 * followed by a space which denotes the null character. */

#define DEFAULT_IFS     " \t\n\203 "
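
The comment above boils down to simple arithmetic. The following shell sketch (meta_pair is a hypothetical name, not a zsh function) prints, in hex, the two-byte sequence that would represent a given byte value under this scheme. It does not check whether the byte is one that zsh would actually escape.

```shell
# meta_pair VALUE
# Hypothetical helper: print, in hex, the Meta byte (0x83, decimal 131)
# followed by VALUE XOR 32, as described in the zsh.h comment.
meta_pair() {
    printf '%02x %02x\n' 131 $(( $1 ^ 32 ))
}

meta_pair 0      # NUL  -> 83 20 (Meta followed by a space)
meta_pair 131    # Meta -> 83 a3
```

The first result matches the DEFAULT_IFS definition, where \203 followed by a space stands for the null character.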

The interesting thing about the zsh special character guard mechanism is that zsh provides a way to adjust the behavior of the two-byte sequence Meta NUL using the options POSIX_STRINGS (setopt posixstrings) or NO_POSIX_STRINGS (setopt noposixstrings). When POSIX_STRINGS is unset, the entire string, including Meta bytes and NUL, is output to files where necessary, although owing to “restrictions of the library interface a string is truncated at the NUL character in file names, environment variables, or in arguments to external programs.”

For my next example, we are going to reverse a binary file, i.e. image.gpj to image.jpg. Have a look at the following code:

#!/bin/ksh93
#
# reverse a binary file
#

typeset -b byte

redirect 3< image.gpj || exit 1

eof=$(3<#((EOF)))

read -r -u 3 -N 1 byte
printf "%B" byte > image.jpg
3<#((CUR - 1))

while (( $(3<#) > 0 ))
do
    read -r -u 3 -N 1 byte
    printf "%B" byte >> image.jpg
    3<#((CUR - 2))
done

read -r -u 3 -N 1 byte
printf "%B" byte >> image.jpg

redirect 3<&- || echo 'cannot close FD 3'

exit 0


Again, I use typeset -b byte to declare byte to be a binary type which, by the way, can hold up to 64KB of either binary or text data. Again, I use the ksh93 I/O mechanism to open the input file, image.gpj, using file descriptor 3. Again, I read byte by byte, but this time backwards from the last byte of the file to the byte at offset 0. Why decrement the file offset by 2 in the loop? Simple: read advances the file offset by 1, so the script has to compensate for the last read and also decrement the offset by 1 so that the next read reads the previous byte in the file.
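
The offset bookkeeping is easier to follow with concrete numbers. This is a small simulation in plain POSIX shell, not part of the script itself: trace_reverse_reads is a hypothetical helper that touches no file and simply prints the offset of each byte in the order the script would read it.

```shell
# trace_reverse_reads SIZE
# Hypothetical helper: simulate the offsets used by the byte-by-byte
# reversal script for a file of SIZE bytes (SIZE >= 1).
trace_reverse_reads() {
    cur=$(( $1 - 1 ))            # start at the last byte (EOF - 1)
    order=""
    while [ "$cur" -gt 0 ]; do
        order="$order$cur "      # read the byte at offset cur ...
        cur=$(( cur + 1 ))       # ... which advances the offset by 1
        cur=$(( cur - 2 ))       # step back 2 to land on the previous byte
    done
    echo "${order}0"             # the final read, after the loop, gets offset 0
}

trace_reverse_reads 4    # prints 3 2 1 0
```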

Obviously this script is fairly inefficient as it reads and writes individual bytes. Use strace to see how inefficient it is. It turns out that a lot of the inefficiency is actually due to the design of the sfio routines.

In the following example, the previous script has been modified to read and write in chunks of 16 bytes where possible.

#!/bin/ksh
#
# reverse a binary file - chunks
#

typeset -b bytes

redirect 3< image.gpj || exit 1

eof=$(3<#((EOF)))

read -r -u 3 -N 16 bytes
printf "%B" bytes > image.jpg
3<#((CUR - 16))

offset=0

while (( $(3<#) > 0 ))
do
    # print "At offset $(3<#)"
    read -r -u 3 -N 16 bytes
    printf "%B" bytes >> image.jpg
    offset=$(3<#)
    if (( offset > 32 ))
    then
       3<#((CUR - 32))
    else
       break
    fi
done

3<#((0))
# print "At final offset $offset"
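read -r -u 3 -N $((offset - 16))  bytes
# write the leading partial chunk, which would otherwise be read but never output
printf "%B" bytes >> image.jpg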

redirect 3<&- || echo 'cannot close FD 3'

exit 0

This script works as intended, but I am pushing the limits of ksh93 file I/O using the CUR and EOF builtins. If, instead of redirecting output to image.jpg using > and/or >>, I assigned file descriptor 4 to image.jpg, the script would never terminate. This is due to an implementation/design issue in ksh93 when using either or both of these two builtin variables, CUR or EOF, with more than one file descriptor simultaneously.

Look at how CUR and EOF are set in ../ast/src/cmd/ksh93/sh/io.c

static Sfdouble_t nget_cur_eof(Namval_t *np, Namfun_t *fp) {
    struct Eof *ep = (struct Eof *)fp;
    Sfoff_t end, cur = lseek(ep->fd, (Sfoff_t)0, SEEK_CUR);

    if (*np->nvname == 'C') return (Sfdouble_t)cur;
    if (cur < 0) return ((Sfdouble_t)-1);
    end = lseek(ep->fd, (Sfoff_t)0, SEEK_END);
    lseek(ep->fd, (Sfoff_t)0, SEEK_CUR);
    return (Sfdouble_t)end;
}

static const Namdisc_t EOF_disc = {sizeof(struct Eof), 0, 0, nget_cur_eof};

static Sfoff_t file_offset(Shell_t *shp, int fd, char *fname) {
    Sfio_t *sp;
    char *cp;
    Sfoff_t off;
    struct Eof endf;
    Namval_t *mp = nv_open("EOF", shp->var_tree, 0);
    Namval_t *pp = nv_open("CUR", shp->var_tree, 0);

    sh_iovalidfd(shp, fd);
    sp = shp->sftable[fd];

    memset(&endf, 0, sizeof(struct Eof));
    endf.fd = fd;
    endf.hdr.disc = &EOF_disc;
    endf.hdr.nofree = 1;
    if (mp) nv_stack(mp, &endf.hdr);
    if (pp) nv_stack(pp, &endf.hdr);
    if (sp) sfsync(sp);
    off = sh_strnum(shp, fname, &cp, 0);
    if (mp) nv_stack(mp, NULL);
    if (pp) nv_stack(pp, NULL);
    return *cp ? (Sfoff_t)-1 : off;
}


As you can see, the code for these two builtins (name-value pairs) is inextricably tangled together in both the file_offset function and the discipline function associated with each builtin. Not the best of designs; the result is that the shell can easily get confused as to which file descriptor to use. A redesign is definitely warranted if ksh93 is intended to support seeking to more than one user-specified file offset in a shell script. The man page is silent on the issue.

My final example shows how to work around this issue by limiting the use of EOF and avoiding the use of CUR.

#!/bin/ksh
#
# reverse a binary file
#

typeset -b byte

redirect 3< image.gpj || exit 1
iof=$(3<#((EOF)))

redirect 4> image.jpg || exit 1
# oof=0

read -r -u 3 -N 1 byte
3<#(( --iof ))
print -u 4 -f  "%B" byte
# (( oof++ ))

while (( iof > 0 ))
do
    # print "At offset $iof $oof"

    read -r -u 3 -N 1 byte
    3<#(( --iof))
    print -u 4 -f "%B" byte
    # (( oof++ ))
done

read -r -u 3 -N 1 byte
print -u 4 -f "%B" byte

redirect 3<&- || echo 'cannot close FD 3'
redirect 4>&- || echo 'cannot close FD 4'

exit 0


In the above script, the builtin variable EOF is used only once, to initially set the variable iof, which stores the current offset of the input file. The CUR builtin variable, used in previous examples, is never used. From that point on, the script tracks the input file offset using iof, and the loop ends when iof decrements to 0.
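
Whichever reversal script is used, its output can be cross-checked against an independent implementation. The sketch below is one such check in plain POSIX shell, using the external dd and wc utilities instead of the ksh93 seek syntax; reverse_with_dd is a hypothetical helper, and it is far slower than the scripts above.

```shell
# reverse_with_dd FILE
# Hypothetical helper: write the bytes of FILE to stdout in reverse order,
# using one dd invocation per byte.
reverse_with_dd() {
    # The arithmetic expansion strips any padding wc puts around the count.
    size=$(( $(wc -c < "$1") ))
    i=$size
    while [ "$i" -gt 0 ]; do
        i=$(( i - 1 ))
        # Copy exactly one byte, starting at offset i.
        dd if="$1" bs=1 skip="$i" count=1 2>/dev/null
    done
}
```

Something like reverse_with_dd image.gpj | cmp - image.jpg should then report no differences if the ksh93 script did its job.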

Well, I have run out of time and must finish this post. The above examples should have adequately demonstrated that ksh93 is perfectly capable of handling NULs and binary data. The next time somebody tells you that ksh93 cannot handle binary data internally, or that zsh is the only shell that can handle binary data, just point that person to this blog post.

Enjoy!
