POSIX Way to Check User Input in Shell Scripts

This short blog post demonstrates a number of ways to check user input that work with the POSIX shell, can be localized, and should work with most other shells that derive from the Bourne shell. The examples provided avoid all non-POSIX shell syntax and extensions. In particular, the examples demonstrate how to make your user input checking locale-aware.

Let’s start off with a simple example that is not locale-aware. This example demonstrates how to check that the input, in this case a command line argument, is an integer.

$ cat demo1.sh
#!/bin/sh

is_integer()
{
    case "${1#[+-]}" in
        (*[!0123456789]*) return 1 ;;
        ('')              return 1 ;;
        (*)               return 0 ;;
    esac
}

[ $# -ne 1 ] && {
    echo "One argument required"
    exit 1
}

if is_integer "$1"
then
   echo "Argument is an integer [$1]"
else
   echo "Argument is not an integer [$1]"
fi
$

$ ./demo1.sh  '123'
Argument is an integer [123]
$ ./demo1.sh  '12 3'
Argument is not an integer [12 3]
$ ./demo1.sh  '123c'
Argument is not an integer [123c]
$


The is_integer function strips a leading - or + from the input string. The first case pattern tests whether the remaining string contains any character that is not one of 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9, using a bracket expression. The exclamation mark (!) immediately after the left square bracket negates the match. The second pattern rejects an empty string, which would otherwise pass because it contains no non-digit characters.
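
To see the ${1#[+-]} expansion used by is_integer in isolation, here is a small sketch. The helper name strip_sign is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# strip_sign is a hypothetical helper used only for illustration.
# ${1#[+-]} removes the shortest prefix matching [+-], that is,
# at most one leading + or - character.
strip_sign()
{
    printf '%s\n' "${1#[+-]}"
}

for s in '+42' '-7' '42' '--5'
do
    printf '[%s] -> [%s]\n' "$s" "$(strip_sign "$s")"
done
```

Only one sign character is removed, so an input such as --5 still contains a - after the expansion and is rejected by the non-digit pattern. A lone + or - becomes the empty string and is rejected by the empty-string pattern.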

Sometimes you just want to check that the user entered an integer containing only the digits 0 through 9, irrespective of the locale, and the above is_integer function meets that requirement. However, you may want to make the is_integer function locale-aware, and typically you would use a POSIX character class to achieve this.

The POSIX standard defines 12 character classes. An expression of the form [[:name:]] matches the named character class “name”. For example [[:digit:]] matches any digit. In the POSIX C locale, it would match 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. An expression of the form [![:name:]] matches anything except the named character class “name”. For example [![:digit:]] matches anything but digits.

The next example modifies the is_integer function to use the POSIX [:digit:] character class to test for digits and should work across all locales.

$ cat demo2.sh
#!/bin/sh

is_integer()
{
    case "${1#[+-]}" in
        (*[![:digit:]]*) return 1 ;;
        ('')             return 1 ;;
        (*)              return 0 ;;
    esac
}

[ $# -ne 1 ] && {
    echo "One argument required"
    exit 1
}

if is_integer "$1"
then
   echo "Argument is an integer [$1]"
else
   echo "Argument is not an integer [$1]"
fi
$

$ ./demo2.sh  '123'
Argument is an integer [123]
$ ./demo2.sh  '12 3'
Argument is not an integer [12 3]
$ ./demo2.sh  '123c'
Argument is not an integer [123c]
$
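
The same negated-class pattern can be applied to the other character classes. The following is a minimal sketch along those lines; the function names is_alpha and is_word are hypothetical and are not part of the demo scripts.

```shell
#!/bin/sh
# Hypothetical helpers that follow the same pattern as is_integer.

# Succeeds if the argument is non-empty and consists only of letters,
# as defined by the LC_CTYPE category of the current locale.
is_alpha()
{
    case "$1" in
        (''|*[![:alpha:]]*) return 1 ;;
        (*)                 return 0 ;;
    esac
}

# Succeeds if the argument is non-empty and contains no whitespace
# or control characters, for example a single-word answer.
is_word()
{
    case "$1" in
        (''|*[[:space:][:cntrl:]]*) return 1 ;;
        (*)                         return 0 ;;
    esac
}
```

Which characters count as letters depends on the locale in effect when the script runs, so the same input may be accepted under one locale and rejected under another.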

Another common use case is validating that a user entered only “Y”, “N”, “YES”, or “NO”, in uppercase, lowercase, or some combination of the two, in response to a yes or no question.

Here is our first version of an is_yesno function:

$ cat demo3.sh
#!/bin/sh

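# Returns 0 (success) if the argument is a yes/no answer, 1 otherwise.
is_yesno()
{
    # printf is used instead of echo so that input such as -n or
    # text containing backslashes is passed through unchanged.
    yn=$(printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]')
    case "$yn" in
        (Y|YES) return 0 ;;
        (N|NO)  return 0 ;;
        (*)     return 1 ;;
    esac
}

[ $# -ne 1 ] && {
    echo "One argument required"
    exit 1
}

if is_yesno "$1"
then
   echo "Argument is valid [$1]"
else
   echo "Argument is invalid [$1]"
fi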
$

$ ./demo3.sh
One argument required
$ ./demo3.sh S
Argument is invalid [S]
$ ./demo3.sh Y
Argument is valid [Y]
$ ./demo3.sh n
Argument is valid [n]
$ 


Note the use of the tr utility with two POSIX character classes, namely [:lower:] and [:upper:], to convert the input to uppercase so as to simplify the case statement.
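
If you would rather avoid the extra tr process, the case folding can be written directly into the patterns. This is a sketch of that alternative, not a replacement for the approach above; the name is_yesno_nocase is hypothetical, and the function follows the usual shell convention of returning 0 for a valid answer.

```shell
#!/bin/sh
# Hypothetical variant that needs no tr subprocess: each letter is
# matched in either case by its own bracket expression.
# Returns 0 for a yes/no answer, 1 otherwise.
is_yesno_nocase()
{
    case "$1" in
        ([Yy]|[Yy][Ee][Ss]) return 0 ;;
        ([Nn]|[Nn][Oo])     return 0 ;;
        (*)                 return 1 ;;
    esac
}
```

The trade-off is readability: the patterns grow quickly as the accepted words get longer, which is why converting the input once with tr is often the simpler choice.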

The previous example only tests “English” input. We can do better than that. The POSIX standard specifies a number of categories that locales must support. One of these is the LC_MESSAGES category which we now examine in more detail. This category of a locale definition source file defines the format for affirmative and negative system responses.

The following is an example of a possible LC_MESSAGES category listed in a locale definition source file:

LC_MESSAGES
#
yesexpr   "([yY][[:alpha:]]*)|(OK)"
noexpr    "[nN][[:alpha:]]*"
yesstr    "Y:y:yes"
nostr     "N:n:no"
#
END LC_MESSAGES


The keywords yesexpr and noexpr are mandated by the current POSIX standard (IEEE Std 1003.1-2017, which is also The Open Group Base Specifications Issue 7, 2018 edition). They define extended regular expressions (EREs).

The keyword yesstr defines a colon-separated string of acceptable affirmative responses. The keyword nostr defines a colon-separated string of acceptable negative responses. These two keywords, yesstr and nostr, were deprecated in a previous revision of the POSIX standard and were removed from the current standard. However, most Unix and Unix-like operating systems continue to support them.

You can use the locale command to view, amongst other things, the keywords and strings in the LC_MESSAGES category. Here is what a recent version of Fedora outputs:

$ locale LC_MESSAGES
^[+1yY]
^[-0nN]
yes
no
UTF-8

$ locale -k LC_MESSAGES
yesexpr="^[+1yY]"
noexpr="^[-0nN]"
yesstr="yes"
nostr="no"
messages-codeset="UTF-8"

$ locale -k yesexpr
yesexpr="^[+1yY]"

$ LC_ALL=de_DE.UTF-8 locale -k LC_MESSAGES
yesexpr="^[+1jJyY]"
noexpr="^[-0nN]"
yesstr="ja"
nostr="nein"
messages-codeset="UTF-8"

$ LC_ALL=de_DE.UTF-8 locale -k yesexpr
yesexpr="^[+1jJyY]"
$ LC_ALL=de_DE.UTF-8 locale yesexpr
^[+1jJyY]


Interestingly, and somewhat unexpectedly, note that yesexpr specifies that 1 and + are valid affirmative responses, and noexpr specifies that 0 and - are valid negative responses.

In the next example, the is_yesno function used in the previous example is modified to become locale-aware.

$ cat demo4.sh
#!/bin/sh

# Returns 0 (success) if the argument is a yes/no answer in the
# current locale, 1 otherwise.
is_yesno()
{
    yn="$1"

    yesexpr=$(locale yesexpr)
    yesstr=$(locale yesstr)
    noexpr=$(locale noexpr)
    nostr=$(locale nostr)

    # yesexpr and noexpr are EREs. Stripping the leading ^ and using the
    # rest as a shell pattern only works for simple bracket expressions
    # such as ^[+1yY].
    if [ -z "$yesstr" ]
    then
        case "$yn" in
            (${yesexpr##^}) return 0 ;;
            (${noexpr##^})  return 0 ;;
        esac
    else
        yn=$(printf '%s\n' "$yn" | tr '[:upper:]' '[:lower:]')
        case "$yn" in
            ("$yesstr"|yes) return 0 ;;
            (${yesexpr##^}) return 0 ;;
            ("$nostr"|no)   return 0 ;;
            (${noexpr##^})  return 0 ;;
        esac
    fi

    return 1
}

[ $# -ne 1 ] && {
    echo "One argument required"
    exit 1
}

if is_yesno "$1"
then
   echo "Argument is valid [$1]"
else
   echo "Argument is invalid [$1]"
fi
$

$ ./demo4.sh
One argument required
$ ./demo4.sh y
Argument is valid [y]
$ ./demo4.sh N
Argument is valid [N]
$ ./demo4.sh 0
Argument is valid [0]
$ ./demo4.sh 1
Argument is valid [1]
$ ./demo4.sh -
Argument is valid [-]
$ ./demo4.sh +
Argument is valid [+]

$ LC_MESSAGES=de_DE.UTF-8 ./demo4.sh j
Argument is valid [j]
$ LC_MESSAGES=de_DE.UTF-8 ./demo4.sh J
Argument is valid [J]
$ LC_MESSAGES=de_DE.UTF-8 ./demo4.sh Y
Argument is valid [Y]
$ LC_MESSAGES=de_DE.UTF-8 ./demo4.sh n
Argument is valid [n]
$ LC_MESSAGES=de_DE.UTF-8 ./demo4.sh k
Argument is invalid [k]
$
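
Because yesexpr and noexpr are EREs rather than shell patterns, a more general approach is to hand them to a utility that understands EREs, such as grep -E. The following is a minimal sketch under that assumption. The function names answer_kind and confirm are hypothetical, and the ^[yY] and ^[nN] fallbacks used when the locale utility prints nothing are an assumption, not something taken from the standard.

```shell
#!/bin/sh
# Sketch: classify an answer using the locale's yesexpr/noexpr as EREs.
# answer_kind and confirm are hypothetical names. The '^[yY]' and
# '^[nN]' fallbacks are an assumption for systems where the locale
# utility is missing or prints nothing.

# Prints yes, no or unknown for the given answer.
answer_kind()
{
    yesexpr=$(locale yesexpr 2>/dev/null)
    noexpr=$(locale noexpr 2>/dev/null)
    [ -n "$yesexpr" ] || yesexpr='^[yY]'
    [ -n "$noexpr" ]  || noexpr='^[nN]'

    # grep -E treats the pattern as an ERE; -q suppresses output;
    # -e protects patterns that might begin with a hyphen.
    if printf '%s\n' "$1" | grep -Eq -e "$yesexpr"
    then
        echo yes
    elif printf '%s\n' "$1" | grep -Eq -e "$noexpr"
    then
        echo no
    else
        echo unknown
    fi
}

# Prompts on standard error, reads one line from standard input and
# returns 0 for yes, 1 for no, 2 for anything else or a failed read.
confirm()
{
    printf '%s ' "$1" >&2
    IFS= read -r reply || return 2
    case "$(answer_kind "$reply")" in
        (yes) return 0 ;;
        (no)  return 1 ;;
        (*)   return 2 ;;
    esac
}
```

A script could then call confirm "Continue?" in an if statement, or in a loop that prompts again while the return status is 2. Note that the expressions are typically anchored only at the start, so an answer such as "yes please" is also accepted as affirmative.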

Here is the full list of the character classes defined in the current POSIX standard. Their interpretation in shell scripts depends on the LC_CTYPE locale category setting. For example, [:alnum:] means the character class of digits and letters in the current locale.

  • [:alnum:] – Alphanumeric characters: [:alpha:] and [:digit:]. In the POSIX C locale and ASCII encoding, this is the same as [0-9A-Za-z].
  • [:alpha:] – Alphabetic characters: [:lower:] and [:upper:]. In the POSIX C locale and ASCII encoding, this is the same as [A-Za-z].
  • [:blank:] – Blank characters: space and tab.
  • [:cntrl:] – Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
  • [:digit:] – Digits: 0 1 2 3 4 5 6 7 8 9.
  • [:graph:] – Graphical characters: [:alnum:] and [:punct:].
  • [:lower:] – Lowercase letters. In the C locale and ASCII encoding, this is a b c d e f g h i j k l m n o p q r s t u v w x y z.
  • [:print:] – Printable characters: [:alnum:], [:punct:], and space.
  • [:punct:] – Punctuation characters. In the POSIX C locale and ASCII encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
  • [:space:] – Space characters. In the POSIX C locale, this is tab, newline, vertical tab, form feed, carriage return, and space.
  • [:upper:] – Uppercase letters. In the POSIX C locale and ASCII encoding, this is A B C D E F G H I J K L M N O P Q R S T U V W X Y Z.
  • [:xdigit:] – Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

Note that the square brackets in these POSIX character class names are part of the symbolic names, and must be included in addition to the square brackets delimiting the bracket expression.
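
The difference between one and two pairs of brackets is easy to miss, so here is a short sketch that contrasts them. Both function names are hypothetical, and has_digit_wrong is deliberately incorrect.

```shell
#!/bin/sh
# Sketch: the inner brackets belong to the class name, the outer pair
# delimits the bracket expression. Both helpers are hypothetical.

# Correct: [[:digit:]] matches any single digit.
has_digit()
{
    case "$1" in
        (*[[:digit:]]*) return 0 ;;
        (*)             return 1 ;;
    esac
}

# Incorrect: with only one pair of brackets, [:digit:] is an ordinary
# bracket expression listing the characters ':', 'd', 'i', 'g' and 't'.
has_digit_wrong()
{
    case "$1" in
        (*[:digit:]*) return 0 ;;
        (*)           return 1 ;;
    esac
}
```

With these definitions, has_digit accepts a1b and rejects dog, while has_digit_wrong does the opposite for 123 and dog, because it is looking for the literal letters and the colon rather than for digits.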

I will extend this post as I come across other useful examples of locale-aware user input validation that conforms to the POSIX standard.
