pericmd 008: Basic structure of a CLI application (Getopt::Long)

Here’s the basic structure of a typical CLI application, when using just Getopt::Long. This is the nauniq script, in its entirety:

#!perl

use 5.010001;
use strict;
use warnings;

use Getopt::Long;

# VERSION

my %Opts = (
    append         => 0,
    check_chars    => -1,
    forget_pattern => undef,
    ignore_case    => 0,
    md5            => 0,
    num_entries    => -1,
    read_output    => 0,
    show_unique    => 1,
    show_repeated  => 0,
    skip_chars     => 0,
);

sub parse_cmdline {
    my $res = GetOptions(
        'repeated|d'       =>
            sub { $Opts{show_unique} = 0; $Opts{show_repeated} = 1 },
        'ignore-case|i'    => \$Opts{ignore_case},
        'num-entries=i'    => \$Opts{num_entries},
        'skip-chars|s=i'   => \$Opts{skip_chars},
        'unique|u'         =>
            sub { $Opts{show_unique} = 1; $Opts{show_repeated} = 0 },
        'check-chars|w=i'  => \$Opts{check_chars},
        'a'                => sub {
            $Opts{append} = 1; $Opts{read_output} = 1;
        },
        'append'           => \$Opts{append},
        'forget-pattern=s' => sub {
            my ($cbobj, $val) = @_;
            eval { $val = $Opts{ignore_case} ? qr/$val/i : qr/$val/ };
            if ($@) {
                warn "Invalid regex pattern in --forget-pattern: $@\n"; exit 99;
            }
            $Opts{forget_pattern} = $val;
        },
        'md5'              => \$Opts{md5},
        'read-output'      => \$Opts{read_output},
        'help|h'           => sub {
            print <<USAGE;
Usage:
  nauniq [OPTIONS]... [INPUT [OUTPUT]]
  nauniq --help
Options:
  --repeated, -d
  --ignore-case, -i
  --num-entries=N, -n
  --skip-chars=N, -s
  --unique, -u
  --check-chars=N, -w
  --append
  --read-output
  -a
  --md5
  --forget-pattern=S
For more details, see the manpage/documentation.
USAGE
            exit 0;
        },
    );
    exit 99 if !$res;
}

sub run {
    my $ifh; # input handle
    if (@ARGV) {
        my $fname = shift @ARGV;
        if ($fname eq '-') {
            $ifh = *STDIN;
        } else {
            open $ifh, "<", $fname or die "Can't open input file $fname: $!\n";
        }
    } else {
        $ifh = *STDIN;
    }

    my $phase = 2;
    my $ofh; # output handle
    if (@ARGV) {
        my $fname = shift @ARGV;
        if ($fname eq '-') {
            $ofh = *STDOUT;
        } else {
            open $ofh,
                ($Opts{read_output} ? "+" : "") . ($Opts{append} ? ">>" : ">"),
                    $fname
                or die "Can't open output file $fname: $!\n";
            if ($Opts{read_output}) {
                seek $ofh, 0, 0;
                $phase = 1;
            }
        }
    } else {
        $ofh = *STDOUT;
    }

    my ($line, $memkey);
    my %mem;
    my $sub_reset_mem = sub {
        if ($Opts{num_entries} > 0) {
            require Tie::Cache;
            tie %mem, 'Tie::Cache', $Opts{num_entries};
        } else {
            %mem = ();
        }
    };
    $sub_reset_mem->();
    require Digest::MD5 if $Opts{md5};
    no warnings; # we want to shut up 'substr outside of string'
    while (1) {
        if ($phase == 1) {
            # phase 1 is just reading the output file
            $line = <$ofh>;
            if (!defined $line) { # a final line of "0" is still data
                $phase = 2;
                next;
            }
        } else {
            $line = <$ifh>;
            if (!defined $line) { # stop only at EOF, not on a false-ish line
                last;
            }
        }
        if ($Opts{forget_pattern} && $line =~ $Opts{forget_pattern}) {
            $sub_reset_mem->();
        }

        $memkey = $Opts{check_chars} > 0 ?
            substr($line, $Opts{skip_chars}, $Opts{check_chars}) :
                substr($line, $Opts{skip_chars});
        $memkey = lc($memkey) if $Opts{ignore_case};
        $memkey = Digest::MD5::md5($memkey) if $Opts{md5};

        if ($phase == 2) {
            if ($mem{$memkey}) {
                print $ofh $line if $Opts{show_repeated};
            } else {
                print $ofh $line if $Opts{show_unique};
            }
        }

        $mem{$memkey} = 1;
    }
}

# MAIN

parse_cmdline();
run();

1;
# ABSTRACT: Non-adjacent uniq
# PODNAME:

=head1 SYNOPSIS

 nauniq [OPTION]... [INPUT [OUTPUT]]


=head1 DESCRIPTION

C<nauniq> is similar to the Unix command C<uniq> but detects repeated lines even
if they are not adjacent. To do this, C<nauniq> must remember the lines being
fed to it. There are options to control memory usage: option to only remember a
certain number of unique lines, option to remember a certain number of
characters for each line, and option to only remember the MD5 hash (instead of
the content) of each line.


=head1 OPTIONS

=over

=item * --repeated, -d

Print only duplicate lines. The opposite of C<--unique>.

=item * --ignore-case, -i

Ignore case.

=item * --num-entries=N

Number of unique entries to remember. The default is -1 (unlimited). This option
is to control memory usage, but the consequence is that lines that are too far
apart will be forgotten.

=item * --skip-chars=N, -s

Number of characters from the beginning of line to skip when checking
uniqueness.

=item * --unique, -u

Print only unique lines. This is the default. The opposite of C<--repeated>.

=item * --check-chars=N, -w

The number of characters to check for uniqueness. The default is -1 (check all
characters in a line).

=item * --append

Open output file in append mode. See also C<-a>.

=item * -a

Equivalent to C<--append --read-output>.

=item * --forget-pattern=S

This is an alternative to C<--num-entries>. Instead of instructing C<nauniq> to
remember only a fixed number of entries, you can specify a regex pattern that
triggers forgetting the remembered lines. An example use case is when you have a
file like this:

 * entries for 2014-03-13
 foo
 bar
 baz
 * entries for 2014-03-14
 foo
 baz

and you want unique lines for each day (in which you'll specify
C<--forget-pattern '^\*'>).

=item * --md5

Remember the MD5 hash instead of the actual characters of the line. Might be
useful to reduce memory usage if the lines are long.

=item * --read-output

Whether to read output file first. This option works only with C<--append> and
is usually used via C<-a> to append lines to file if they do not exist yet in
the file.

=back


=head1 EXIT CODES

0 on success.

255 on I/O error.

99 on command-line options error.


=head1 FAQ

=head2 How do I append lines to a file only if they do not exist in the file?

You cannot do this with C<uniq>:

 % ( cat FILE ; produce-lines ) | uniq - FILE
 % ( cat FILE ; produce-lines ) | uniq >> FILE

as it will clobber the file first. But you can do this with C<nauniq>:

 % produce-lines | nauniq -a - FILE


=head1 SEE ALSO

L<uniq>

=cut

As you can see, the application has three big parts: option/argument parsing (in the parse_cmdline() subroutine, lines 24-71), the main/core logic (in the run() subroutine, lines 73-154), and the manpage (POD, lines 161-278).

Argument parsing

Argument parsing is done as usual. We put the option values in a hash, %Opts (the initial capital letter signals that this is a global variable used across subroutines). We predeclare %Opts to set the default values, then call Getopt::Long’s GetOptions(). The --help handler displays the usage message and exits immediately. The rest of the option handlers set values, either directly through a reference or through a small callback.
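To make the pattern concrete, here is a minimal sketch of the same idea on a smaller scale. It is not taken from nauniq: the parse_args() helper is hypothetical, only a handful of options are shown, and it uses Getopt::Long’s GetOptionsFromArray() instead of GetOptions() so that the parsing step can be called with any argument list rather than only @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not part of nauniq): parse an argument list into a
# fresh options hash so the parsing step can be exercised without @ARGV.
sub parse_args {
    my @args = @_;

    # Defaults are declared up front, as with %Opts in the script.
    my %opts = (
        ignore_case => 0,
        skip_chars  => 0,
        help        => 0,
    );

    my $ok = GetOptionsFromArray(
        \@args,
        'ignore-case|i'  => \$opts{ignore_case},
        'skip-chars|s=i' => \$opts{skip_chars},
        # The script prints usage and exits here; this sketch only sets a flag.
        'help|h'         => sub { $opts{help} = 1 },
    );
    return unless $ok;    # the script calls "exit 99" in this case

    # Whatever is left in @args are the positional arguments (INPUT, OUTPUT).
    return (\%opts, \@args);
}

my ($opts, $rest) = parse_args('-i', '--skip-chars=3', 'in.txt');
print "ignore_case=$opts->{ignore_case} skip_chars=$opts->{skip_chars} rest=@$rest\n";
```

Returning the hash instead of filling a global is a design choice of this sketch only; the script above keeps %Opts global so that run() can read it directly.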

Main/core logic

This is the actual program. What it does exactly in this example is irrelevant because I only want to show the structure of a CLI application, but anyway: nauniq is a utility like Unix’s uniq, except that it detects repeated lines even when they are not adjacent. I use it often, usually to append lines to a log file where I don’t want previously added lines to be re-added between runs. The core of the program is basically this loop:

while (<>) {
    print unless $memory{$_}++;
}

except that there are several options.
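As a rough illustration of how a few of those options change the loop, the sketch below computes the lookup key the way run() does (skip some characters, optionally limit the length, optionally lowercase) before consulting the memory hash. The unique_lines() function and its in-memory list interface are assumptions made for the example; nauniq itself streams from filehandles and supports more options than shown here.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical function for illustration; nauniq itself streams from
# filehandles rather than working on an in-memory list.
sub unique_lines {
    my ($lines, %opt) = @_;
    my $skip  = $opt{skip_chars}  // 0;
    my $check = $opt{check_chars} // -1;    # -1 means "compare everything"

    my (%seen, @out);
    for my $line (@$lines) {
        # Lines shorter than $skip yield an empty key instead of a warning.
        my $rest = length($line) > $skip ? substr($line, $skip) : '';
        my $key  = $check > 0 ? substr($rest, 0, $check) : $rest;
        $key = lc $key if $opt{ignore_case};

        # Same idea as "print unless $memory{$_}++", but keyed on $key.
        push @out, $line unless $seen{$key}++;
    }
    return @out;
}

print for unique_lines(["foo\n", "bar\n", "FOO\n", "foo\n"], ignore_case => 1);
```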

Manpage/POD

Perl grew up in the Unix environment and it shows. Creating a manpage is very easy: you just write POD documentation, which should come naturally to any Perl developer. Despite the invention of newer documentation formats, and despite lacking fancy features like hypertext or pictures/videos, the manpage is still one of the most useful and most often used formats in Unix CLI land (if not the most), probably because it is simple and searchable. Texinfo, for example, was invented later and was meant to replace the manpage, but it failed to displace it and has fallen out of favor itself.

Aside from the NAME, SYNOPSIS, DESCRIPTION, and SEE ALSO sections that are usually present in a Perl module, a CLI application usually also has these sections: OPTIONS (list of command-line options and their descriptions, sometimes subdivided into categories if the list is quite long), EXIT CODES (list of possible exit codes; for another example see curl’s manpage), ENVIRONMENT (list of environment variables that are observed by the application), and FILES (list of configuration file paths that are searched by the application).
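A bare skeleton with those sections might look like the following. Everything specific in it is a placeholder invented for the example: the application name myapp, the --verbose option, the MYAPP_OPTS variable, and the ~/.myapprc path are all hypothetical.

```perl
#!perl
# Hypothetical application "myapp"; the option, variable and file names
# below are placeholders for illustration only.

=head1 NAME

myapp - one-line description of the application

=head1 SYNOPSIS

 myapp [OPTION]... [FILE]

=head1 OPTIONS

=over

=item * --verbose

Print more messages.

=back

=head1 EXIT CODES

0 on success.

99 on command-line options error.

=head1 ENVIRONMENT

=head2 MYAPP_OPTS

Extra options to prepend to the command line.

=head1 FILES

=head2 ~/.myapprc

Per-user configuration file.

=cut
```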

 
