Localizing Korn Shell Scripts

In response to recent messages on the ast-users mailing list asking for a how-to or FAQ on how to localize Korn Shell (ksh93) shell scripts, I decided to write this post as there is a paucity of good information available on the Internet or in print on this particular topic.

First of all, what is meant by localization? Internationalization (internationalisation, I18N) and localization (localisation, L10N) are means of adapting software applications to different languages and cultural differences. Internationalization is the process of designing and engineering a software application so that it can readily support various languages and regional differences without changes to the source code. Localization is the process of adapting internationalized software for a particular geographical area by translating text strings in the user interface into a local language and providing any necessary environment variables which affect codesets, character sorting order, date and time display, thousands separators and suchlike.

An example is probably the simplest way to demonstrate what is involved in localizing a shell script. Assume we want to localize the following very simple shell script called demo, which is located in the directory /example:

#!/bin/ksh

name="John Kane"

print  "Simple demonstration of  ksh93 message translation"
print  "Message locale is: $LC_MESSAGES"

echo   "Hello"
print  "Goodbye"
printf "Welcome %s\n" "$name"

print  "This string should not be translated because it is not in the message catalog"

exit 0


This shell script is to be localized for French and Italian users so that the user-visible strings enclosed in double quotes (the message text strings) are displayed in their native language.

Before the shell script can be localized, it must first be internationalized. The message text strings must be written in a format which ksh93 understands to mean "replace this text string, if possible, with the appropriate text string from a message catalog for the current locale". Fortunately this is easy to do in ksh93 using the special syntax $"...". A $ in front of a double-quoted string is ignored in the C or POSIX locale, but in other locales it may cause the text inside the double quotes (the default message text string) to be replaced by a locale-specific message text string. Why may instead of shall in the previous sentence? If the shell script has not yet been localized, a suitable message catalog may not yet exist, and the default message text string is displayed instead.

Here is the internationalized version of demo.

#!/bin/ksh

name="John Kane"

print  "Simple demonstration of  ksh93 message translation"
print  "Message locale is: $LC_MESSAGES"

echo   $"Hello"
print  $"Goodbye"
printf $"Welcome %s\n" "$name"

print  $"This string should not be translated because it is not in the message catalog"

exit 0


The default message text strings are still displayed if you execute the shell script. It works just like the original version since no message catalogs have so far been created.

The next stage of the process is to extract the message text strings and translate them into the appropriate languages. You can manually extract these text strings or you can let ksh93 do all the work of extracting the text strings by invoking ksh93 with the -D option.

$ ksh -D demo
"Hello"
"Goodbye"
"Welcome %s\n"
"This string should not be translated because it is not in the message catalog"
$


Incidentally, the bash shell also supports the $"..." message string syntax and the -D option. However, bash looks up translations through GNU gettext rather than through the message catalogs described in this post, so its documentation refers to the GNU gettext PO (portable object) file format and localization methodology.

When localizing a shell script, a decision has to be made as to where to place the localized message catalogs. Typically they are placed in a subdirectory under the directory where the script is located, but they can be placed elsewhere if the NLSPATH environment variable is set. ksh93 supports the following locations for message catalogs by default:

${ROOT}/share/lib/locale/%l/%C/%N
${ROOT}/share/locale/%l/%C/%N
${ROOT}/lib/locale/%l/%C/%N


where ${ROOT} is the directory containing the shell script and %l, %C and %N have the same meaning as when used with the NLSPATH environment variable.

NLSPATH is the environment variable which catopen() uses to attempt to locate message catalogs. The NLS in NLSPATH stands for National Language Support. An NLSPATH variable consists of one or more templates. Templates consist of an optional prefix, one or more format elements, a filename and an optional suffix. Templates are separated by colons. For example, the following NLSPATH variable consists of two templates:

NLSPATH=":%N.cat:/shlib/message/%L/%N.cat"


A leading colon or two adjacent colons (‘::’) is equivalent to specifying %N. A string describing the current locale is expected to have the form language[_territory[.codeset]], e.g. en_US.utf8, de_DE.utf8, as all three components are used by NLSPATH formatting elements.

  • %N This format element is substituted with the name of the message catalog file.
  • %L  This format element is substituted with the current locale name.
  • %l  This format element is substituted with the language component of the current locale name.
  • %t  This format element is substituted with the territory component of the current locale name.
  • %c  This format element is substituted with the codeset component of the current locale name.
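To make the substitution rules concrete, here is a minimal sketch of how a single template might be expanded for a given locale and catalog name. The function name expand_nls_template is my own invention, and the code only approximates what catopen() does internally: it ignores locale modifiers such as @euro and does not handle a literal %% placed directly in front of a format letter.

```sh
# Expand one NLSPATH template (hypothetical helper, not part of ksh93 or libast).
# usage: expand_nls_template TEMPLATE LOCALE CATALOG_NAME
expand_nls_template() {
    tmpl=$1 locale=$2 name=$3

    # Split language[_territory[.codeset]] into its three components.
    lang=${locale%%[_.]*}
    rest=${locale#"$lang"}
    terr= code=
    case $rest in
        _*) terr=${rest#_}; terr=${terr%%.*} ;;
    esac
    case $locale in
        *.*) code=${locale#*.} ;;
    esac

    # Substitute each format element in turn.
    printf '%s\n' "$tmpl" | sed \
        -e "s|%N|$name|g" \
        -e "s|%L|$locale|g" \
        -e "s|%l|$lang|g" \
        -e "s|%t|$terr|g" \
        -e "s|%c|$code|g" \
        -e 's|%%|%|g'
}

expand_nls_template '/example/locale/%l/%N.cat' fr_FR.utf8 demo
```

With the arguments shown, the template expands to /example/locale/fr/demo.cat, which is the path used for the French catalog later in this post.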

In order to demonstrate the use of NLSPATH and to keep things simple, our example places the message catalogs in a subdirectory under the directory where the shell script is located, using the following directory structure:

/example/demo
/example/locale
/example/locale/C
/example/locale/fr
/example/locale/it


Note the /example/locale/C subdirectory. It is mandatory to have a message catalog in this directory; otherwise localization does not work and only the default message text strings are displayed. The message text strings in this message catalog must be exactly the same as the message text strings in your script. Code in libast (../libast/port/mg.c) compares the default message text string from the shell script to all the message text strings in this message catalog. If there is a match, the message catalog set and member numbers (more about these shortly) are used to quickly retrieve the corresponding message text string from the appropriate locale message catalog if one exists.

Here is the message text source file (demo.msg) for the C locale:

$quote "
$set 3  This is the C locale message set
1 "Hello"
2 "Goodbye"
3 "Welcome %s\\n"


Message text source files must conform to the gencat format specification. See IEEE Std 1003.1-2008 for the full specification. Here is a brief summary of the more important directives in this specification:

  • $ comment  A line beginning with $ followed by a blank character is treated as a comment.
  • $delset N comment  Delete message set N from an existing message catalog. Any text following the set number is treated as a comment.
  • $quote C  Specifies a quote character C to surround message-text so that trailing spaces or empty messages are visible in a message source line. By default no quoting of message-text is recognized.
  • $set N comment  Specifies the set identifier N of the following messages until the next $set or EOF. Any text following the set identifier is treated as a comment. If no $set directive is specified, all messages are placed in message set 1.
  • M message-text  The message-text is stored in the message catalog with the set identifier specified by the last $set directive and with a message identifier of M.

Refer to the msggen or gencat man pages or the IEEE Standard for the complete specification. There is much more to the specification than what I have described here.
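Since the C locale source file has to repeat the script's strings exactly, it can be generated rather than typed by hand. The following is a minimal sketch that turns lines in the format printed by ksh -D into a message text source file. The function name strings_to_msgsrc is hypothetical, it uses set 3 as in demo.msg above, and it assumes one quoted string per line with no embedded double quotes.

```sh
# Convert quoted strings on stdin (one per line, as printed by `ksh -D script`)
# into a gencat-style message source on stdout. Hypothetical helper.
strings_to_msgsrc() {
    awk '
        BEGIN { print "$quote \""; print "$set 3" }
        /^".*"$/ {
            body = substr($0, 2, length($0) - 2)   # strip the surrounding quotes
            gsub(/\\/, "&&", body)                 # double each backslash
            printf "%d \"%s\"\n", ++n, body
        }
    '
}
```

A possible use would be ksh -D demo | strings_to_msgsrc > demo.msg, followed by deleting any strings you do not want in the catalog.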

You must use the msggen utility from the AT&T AST (Advanced Software Technologies) Open Source Collection to generate (compile) a message catalog from a message text source file. In case you are unaware of it, ksh93 is also part of the AST Open Source Collection. The msggen utility is part of the ast-base package; it is not part of the ast-ksh package. Note the use of 3 for the set number in the above example. This is mandatory for ksh93 shell scripts. It is hardcoded into ksh93; no other set number will work. AST libraries use set 1, AST commands and utilities use set 2, and shell scripts use set 3.

Message catalogs produced by msggen are platform independent and are smaller than the equivalent catalog produced by the gencat utility. On the other hand, message catalogs produced by gencat are platform dependent and may have to be recompiled when ported to a different platform. If you have difficulty distinguishing between message catalogs produced by gencat and those produced by msggen, an easy way to differentiate the two formats is by means of the first 4 bytes of a message catalog. Those generated by msggen contain the magic string:

"\015\023\007\000"

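As a rough illustration, the magic string can be checked from the shell by dumping the first four bytes in octal. This is only a sketch: the function name is_msggen_catalog is hypothetical, and it relies on the POSIX od utility.

```sh
# Succeed if the file starts with the msggen magic bytes 015 023 007 000.
# Hypothetical helper; relies only on the POSIX od and tr utilities.
is_msggen_catalog() {
    magic=$(od -A n -t o1 -N 4 "$1" | tr -d '[:space:]')
    [ "$magic" = "015023007000" ]
}
```

For example, is_msggen_catalog locale/C/demo.cat && echo "msggen format" would report catalogs built by msggen and stay silent for anything else.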

Here is how to use msggen to generate a message catalog from a message text source file.

$ msggen locale/C/demo.cat demo.msg


If the specified message catalog already exists, msggen merges the message text source file into it; otherwise a new message catalog is created. If set and message numbers collide, the new message text replaces the message text currently contained in the message catalog. Non-ASCII characters must be UTF-8 encoded. Message text source files containing symbolic identifiers cannot be processed by the msggen utility.

You do not have to give a message catalog an extension. However it is common practice for message catalogs to use the .cat extension. In fact you can call the message catalog anything you like but the default is for the name of the message catalog to be exactly the same as the name of the shell script. You can work around this restriction to a certain extent by the use of an appropriate NLSPATH string.

In this example we set NLSPATH so that it handles the .cat extension.

$ NLSPATH=/example/locale/%l/%N.cat; export NLSPATH


where %l is the language element and %N is the catalog name parameter.

You can also use msggen to check a compiled message catalog:

$ msggen -l locale/C/demo.cat
$quote "
$set 3
1 "Hello"
2 "Goodbye"
3 "Welcome %s\\n"
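Because the strings in the C locale catalog must match the script's strings exactly, it can be worth cross-checking the two listings. The sketch below compares saved output of ksh -D against saved output of msggen -l and prints any script string that has no exact counterpart in the catalog. The function name missing_from_catalog and the use of two saved files are my own assumptions.

```sh
# Print script strings that have no exact match in a catalog listing.
# $1: file holding the output of `ksh -D script`
# $2: file holding the output of `msggen -l catalog`
# Hypothetical helper; succeeds unless grep itself reports an error.
missing_from_catalog() {
    strings_file=$1 listing_file=$2
    known=$(mktemp) || return 2

    # Keep only numbered message lines, drop the number, and undo the \\ escaping.
    sed -n 's/^[0-9][0-9]* \(".*"\)$/\1/p' "$listing_file" |
        sed 's/\\\\/\\/g' > "$known"

    # -F fixed strings, -x whole-line match, -v invert: what is left is missing.
    grep -F -x -v -f "$known" "$strings_file" && status=0 || status=$?
    rm -f "$known"
    [ "$status" -le 1 ]
}
```

For the demo script this should print only the string that was deliberately left out of the catalog.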


Another useful utility is msgget, which retrieves a specific message string from a message catalog as shown here (3 is the set, 2 is the id):

$ msgget C demo 3.2
Goodbye
$ msgget fr_FR.utf8  demo 3.2
Au Revoir
$  


Here are the contents of the French message text source file (demo.msg.fr):

$quote "
$set 3  This is the French locale message set
1 "Bonjour"
2 "Au Revoir"
3 "Bienvenu %s\\n"


which is compiled into the French locale message catalog by:

$ msggen locale/fr/demo.cat demo.msg.fr


and here are the contents of the Italian message text source file (demo.msg.it):

$quote "
$set 3  This is the Italian locale message set
1 "Ciao"
2 "Addio"
3 "Benvenuto %s\\n"


which is compiled into the Italian locale message catalog by:

$ msggen locale/it/demo.cat demo.msg.it


Now that we have generated all the necessary message catalogs and placed them in the appropriate subdirectories, we are ready to test the localization of the demo script.

$ NLSPATH=/example/locale/%l/%N.cat; export NLSPATH
$ LC_MESSAGES=en_US.utf8; export LC_MESSAGES
$ ./demo
Simple demonstration of  ksh93 message translation
Message locale is: en_US.utf8
Hello
Goodbye
Welcome John Kane
This string should not be translated because it is not in the message catalog

$ LC_MESSAGES=fr_FR.utf8; export LC_MESSAGES
$ ./demo
Simple demonstration of  ksh93 message translation
Message locale is: fr_FR.utf8
Bonjour
Au Revoir
Bienvenu John Kane
This string should not be translated because it is not in the message catalog

$ LC_MESSAGES=it_IT.utf8; export LC_MESSAGES
$ ./demo
Simple demonstration of  ksh93 message translation
Message locale is: it_IT.utf8
Ciao
Addio
Benvenuto John Kane
This string should not be translated because it is not in the message catalog


In the above example I used the LC_MESSAGES environment variable to indicate to ksh93 which message catalog to use when displaying message strings. This is all that ksh93 actually needs to locate the right message catalog.
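Keep in mind that POSIX defines a precedence among the locale variables: LC_ALL overrides LC_MESSAGES, which in turn overrides LANG. A minimal sketch of that rule, using a hypothetical function name, looks like this:

```sh
# Report which locale name applies to messages under the POSIX precedence
# LC_ALL > LC_MESSAGES > LANG, falling back to C. Hypothetical helper.
effective_msg_locale() {
    printf '%s\n' "${LC_ALL:-${LC_MESSAGES:-${LANG:-C}}}"
}
```

If a translated message unexpectedly fails to appear, checking whether LC_ALL is set in the environment is a reasonable first step.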

In the real world, however, the LANG environment variable would be set to the appropriate locale instead of just LC_MESSAGES. Because LANG supplies the default for every locale category, including LC_NUMERIC, LC_TIME and LC_COLLATE, this can cause unexpected output and errors in your scripts. Consider floating point numbers for example. Do not assume that the decimal point is always a period.

 
$ float pi=3.14159; printf "%.5f\n" pi
3.14159
$ LANG=es_ES.utf8; export LANG
$ locale -k LC_NUMERIC
decimal_point=","
thousands_sep=""
grouping=-1;-1
numeric-decimal-point-wc=44
numeric-thousands-sep-wc=0
numeric-codeset="UTF-8"
$ printf "%.5f\n" pi
3,14159
$ float pi=3.14159; printf "%.5f\n" pi
ksh: 3.14159: arithmetic syntax error
$


When the locale is set to es_ES (Spain), the decimal point is a comma – not a period as in the USA. Note how the assignment float pi=3.14159 fails in the es_ES locale because of the use of a period as the decimal point.
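One possible way to guard against this, sketched below, is to pin the locale to C for the script's internal number formatting while leaving LC_MESSAGES alone for translated text. The wrapper name c_printf is hypothetical. It uses LC_ALL rather than LC_NUMERIC because LC_ALL, if set by the user, would otherwise take precedence.

```sh
# Run printf with the C locale so that a period is always the decimal point.
# Hypothetical wrapper; the locale override applies to this one command only.
c_printf() {
    LC_ALL=C printf "$@"
}

c_printf '%.5f\n' 3.14159
```

Output intended for the user can still be produced with a plain printf so that it follows the user's locale.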

Do not assume that numbers group in threes and that the grouping separator is a comma. Consider the following:

$ printf "%d %'d\n" 10000000 10000000
10000000 10000000
$ LC_NUMERIC=en_GB printf "%d %'d\n" 10000000 10000000
10000000 10,000,000
$ LC_NUMERIC=de_DE printf "%d %'d\n" 10000000 10000000
10000000 10.000.000
$ LC_NUMERIC=de_CH printf "%d %'d\n" 10000000 10000000
10000000 10'000'000


Note the use of %'d to indicate that the grouping separator should be included in the output.

Do not make assumptions about the format of the output of commands such as date and who. Such assumptions will generally fail in a non-US locale. For example, determining the day of the month by piping the output of the date command to awk will fail in a non-US locale. If a shell script makes assumptions about the format of the output from locale-sensitive commands and utilities, then it needs to be changed.
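For the date case, a more robust approach is to ask for exactly the fields you need with an explicit format string instead of picking them out of the default output. A minimal sketch, with hypothetical function names:

```sh
# Ask date for specific numeric fields rather than parsing its default output.
# Hypothetical helpers; %d, %m and %Y are numeric conversions defined by POSIX.
day_of_month() {
    date +%d
}

iso_date() {
    date +%Y-%m-%d
}
```

These return values such as 07 and 2010-07-16 regardless of how the current locale orders or names the parts of a date.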

Finally, for the more technically inclined, here is the source code for a loadable ksh93 builtin called gencat which can generate message catalogs from message text source files and also list the contents of a message catalog. Most of this code is not mine – hence the AT&T copyright. I merely modified the source code for msggen so as to make it a loadable ksh93 builtin and removed unnecessary options and code.

/***********************************************************************
*                                                                      *
*               This software is part of the ast package               *
*          Copyright (c) 2000-2010 AT&T Intellectual Property          *
*                      and is licensed under the                       *
*                  Common Public License, Version 1.0                  *
*                    by AT&T Intellectual Property                     *
*                                                                      *
*                A copy of the License is available at                 *
*            http://www.opensource.org/licenses/cpl1.0.txt             *
*                                                                      *
***********************************************************************/
#include <shell.h>
#include <ctype.h>
#include <ccode.h>
#include <error.h> 
#include <mc.h>

#define SH_DICT "gencat"

static const char usage[] =
   "[-?\n@(#)$Id: gencat 2010-07-16 $\n]"
   "[-author?Finnbarr P. Murphy <fpmATfpmurphyDOTcom>]"
   "[-licence?http://www.opensource.org/licenses/cpl1.0.txt]"
   "[+NAME?gencat - generate a message catalog for ksh93]"
   "[l:list?list message catalog contents]"
   "\n"
   "\ncatfile [msgfile]\n"
   "\n"
   "[+EXIT STATUS?] {"
      "[+0?Success.]"
      "[+>0?An error occurred.]"
   "}"
;


typedef struct Xl_s
{
    struct Xl_s* next;
    char*        date;
    char         name[1];
} Xl_t;


/* append s to the translation list */
static Xl_t*
translation(Xl_t* xp, 
            register char* s)
{
    register Xl_t*    px;
    register char*    t;
    char              *d, *e;

    do {
        for (; isspace(*s); s++);
        for (d = e = 0, t = s; *t; t++)
            if (*t == ',') {
                e = t;
                *e++ = 0;
                break;
            } else if (isspace(*t))
                d = t;

        if (d) {
            *d++ = 0;
            for (px = xp; px; px = px->next)
                if (streq(px->name, s)) {
                    if (strcoll(px->date, d) < 0) {
                        free(px->date);
                        if (!(px->date = strdup(d)))
                            error(ERROR_SYSTEM|3, "out of space [translation]");
                    }
                    break;
                }
            if (!px) {
                if (!(px = newof(0, Xl_t, 1, strlen(s))) || !(px->date = strdup(d)))
                    error(ERROR_SYSTEM|3, "out of space [translation]");
                strcpy(px->name, s);
                px->next = xp;
                xp = px;
            }
        }
    } while (s = e);

    return xp;
}


static int
ccsfprintf(int from, 
           int to, 
           Sfio_t* sp, 
           const char* format, 
           ...)
{
    va_list  ap;
    Sfio_t*  tp;
    char*     s;
    int       n;

    va_start(ap, format);
    if (from == to)
        n = sfvprintf(sp, format, ap);
    else if (tp = sfstropen()) {
        n = sfvprintf(tp, format, ap);
        s = sfstrbase(tp);
        ccmaps(s, n, from, to);
        n = sfwrite(sp, s, n);
        sfstrclose(tp);
    } else
        n = -1;
    va_end(ap);    /* every va_start needs a matching va_end */

    return n;
}

/* entry point */
int
b_gencat(int argc, 
         char *argv[], 
         void *extra)
{
    register Mc_t  *mc;
    register char  *s, *t;
    register int   c, i;
    register int   q = 0;    /* quote character; 0 until a $quote directive is seen */
    int            num, list = 0, set = 0;
    char           *b, *e;
    char           *catfile, *msgfile;
    Sfio_t         *sp, *mp, *tp;
    Xl_t           *px, *bp, *xp = 0;

    NoP(argc);
    error_info.id = "gencat";

    for (;;) {
        switch (optget(argv, usage)) {
        case 'l':
            list = 1;
            continue;
        case '?':
            error(ERROR_USAGE|4, "%s", opt_info.arg);
            continue;
        case ':':
            error(2, "%s", opt_info.arg);
            continue;
        }
        break;
    }

    argv += opt_info.index;
    if (error_info.errors || !(catfile = *argv++))
        error(ERROR_USAGE|4, "%s", optusage(NiL));

    if (list) {
        if (!(sp = sfopen(NiL, catfile, "r")))
            error(ERROR_SYSTEM|3, "cannot read catalog: %s", catfile);
        if (!(mc = mcopen(sp)))
            error(ERROR_SYSTEM|3, "catalog content error: %s", catfile);
        sfclose(sp);

        if (*mc->translation) {
            ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "$translation ");
            sfprintf(sfstdout, "%s", mc->translation);
            ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "\n");
        }

        ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "$quote \"\n");

        for (set = 1; set <= mc->num; set++)
            if (mc->set[set].num) {
                ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "$set %d\n", set);
                for (num = 1; num <= mc->set[set].num; num++)
                if (s = mc->set[set].msg[num]) {
                    ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "%d \"", num);
                    while (c = *s++) {
                        switch (c) {
                            case 0x22: /* " */
                            case 0x5C: /* \ */
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x07: /* \a */
                                c = 0x61;
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x08: /* \b */
                                c = 0x62;
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x0A: /* \n */
                                c = 0x6E;
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x0B: /* \v */
                                c = 0x76;
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x0C: /* \f */
                                c = 0x66;
                                sfputc(sfstdout, 0x5C);
                                break;
                            case 0x0D: /* \r */
                                c = 0x72;
                                sfputc(sfstdout, 0x5C);
                                break;
                         }
                         /*...UNDENT*/
                         sfputc(sfstdout, c);
                    }
                    ccsfprintf(CC_NATIVE, CC_ASCII, sfstdout, "\"\n");
               }
         }
         mcclose(mc);
         return error_info.errors != 0;
    }


    if (!(msgfile = *argv++) || *argv)
            error(3, "exactly one message file must be specified");

    /* open the files and handles */
    if (!(tp = sfstropen()))
        error(ERROR_SYSTEM|3, "out of space [string stream]");
    if (!(mp = sfopen(NiL, msgfile, "r")))
        error(ERROR_SYSTEM|3, "%s: cannot read message file", msgfile);
    sp = sfopen(NiL, catfile, "r");
    if (!(mc = mcopen(sp)))
        error(ERROR_SYSTEM|3, "%s: catalog content error", catfile);
    if (sp)
        sfclose(sp);
    xp = translation(xp, mc->translation);

    /* read the message file */
    set = 1;
    error_info.file = msgfile;
    while (s = sfgetr(mp, '\n', 1)) {
        error_info.line++;
        if (!*s)
            continue;
        if (*s == '$') {
            if (!*++s || isspace(*s))
                continue;
            for (t = s; *s && !isspace(*s); s++);
            if (*s)
                *s++ = 0;
            if (streq(t, "delset")) {
                while (isspace(*s))
                    s++;
                num = (int)strtol(s, NiL, 0);
                if (num < mc->num && mc->set[num].num)
                    for (i = 1; i <= mc->set[num].num; i++)
                        mcput(mc, num, i, NiL);
            } else if (streq(t, "quote")) {
                q = *s ? *s : 0;
            } else if (streq(t, "set")) {
                while (isspace(*s))
                    s++;
                num = (int)strtol(s, &e, 0);
                if (e != s)
                    set = num;
                else
                    error(2, "set number expected");
            } else if (streq(t, "translation"))
                xp = translation(xp, s);
        } else {
            t = s + sfvalue(mp);
            num = (int)strtol(s, &e, 0);
            if (e != s) {
                s = e;
                if (!*s) {
                    if (mcput(mc, set, num, NiL))
                        error(2, "(%d,%d): cannot delete message", set, num);
                } else if (isspace(*s++)) {
                    if (t > (s + 1) && *(t -= 2) == '\\') {
                        sfwrite(tp, s, t - s);
                        while (s = sfgetr(mp, '\n', 0)) {
                            error_info.line++;
                            t = s + sfvalue(mp);
                            if (t <= (s + 1) || *(t -= 2) != '\\')
                                break;
                            sfwrite(tp, s, t - s);
                        }
                        if (!(s = sfstruse(tp)))
                            error(ERROR_SYSTEM|3, "out of space");
                    }
                    if (q) {
                        if (*s++ != q) {
                            error(2, "(%d,%d): %c quote expected", set, num, q);
                            continue;
                        }
                        b = t = s;
                        while (c = *s++) {
                            if (c == '\\') {
                                c = chresc(s - 1, &e);
                                s = e;
                                if (c)
                                    *t++ = c;
                                else
                                    error(1, "nul character ignored");
                            } else if (c == q)
                                break;
                            else
                                *t++ = c;
                        }
                        if (*s) {
                            error(2, "(%d,%d): characters after quote not expected", set, num);
                            continue;
                        }
                        *t = 0;
                        s = b;
                    }
                    if (mcput(mc, set, num, s))
                        error(2, "(%d,%d): cannot add message", set, num);
                } else
                    error(2, "message text expected");
            } else
                error(2, "message number expected");
        }
    }

    error_info.file = 0;
    error_info.line = 0;

    /* fix up the translation record */
    if (xp) {
        t = "";
        for (;;) {
            for (bp = 0, px = xp; px; px = px->next)
                if (px->date && (!bp || strcoll(bp->date, px->date) < 0))
                    bp = px;
            if (!bp)
                break;
            sfprintf(tp, "%s%s %s", t, bp->name, bp->date);
            t = ", ";
            bp->date = 0;
        }
        if (!(mc->translation = sfstruse(tp)))
            error(ERROR_SYSTEM|3, "out of space");
    }

    /* dump the catalog to a local temporary file  Rename if no errors */
    if (!(s = pathtemp(NiL, 0, "", error_info.id, NiL)) || !(sp = sfopen(NiL, s, "w")))
        error(ERROR_SYSTEM|3, "%s: cannot write catalog file", catfile);
    if (mcdump(mc, sp) || mcclose(mc) || sfclose(sp)) {
        remove(s);
        error(ERROR_SYSTEM|3, "%s: temporary catalog file write error", s);
    }

    remove(catfile);
    if (rename(s, catfile))
        error(ERROR_SYSTEM|3, "%s: cannot rename temporary catalog file %s", catfile, s);

    return error_info.errors != 0;
}


Assuming that you have the ast-base include files available at /usr/include/ast together with libshell.a in the build directory, you can build a shared object called gencat.so using the following build script:

$ gcc -fPIC -g -c -I /usr/include/ast gencat.c
$ gcc -shared -Wl,-soname,gencat.so -o gencat.so gencat.o libshell.a


and load the gencat builtin into ksh93 as follows:

$ builtin -f ./gencat.so  gencat


You can then use the gencat builtin to generate ksh93 compatible message catalogs.

In conclusion, I hope this post helps readers understand how to localize their ksh93 shell scripts. Obviously it is only a quick introduction to the subject. Please let me know if there is anything important that I have not discussed or got wrong and I will update the post.

P.S. The above example was tested on ksh93 version 93t+.2010-03-05
 

1 comment to Localizing Korn Shell Scripts

  • Jim Ryan

    Thank you. Thank you! I have looked for information about Korn Shell script localization for long time but this is the first detailed account that I have come across.