The True Warts of Perl

Preface

Perl has a reputation for being an unreadable language. It can be, in the wrong hands, and it has a long history of being used for quick hacks, so wrong hands are plentiful, and unreadable examples are easy to find. This article is not about unreadable Perl; this article is about actual flaws in the language that make writing good code harder.

I like Perl. I have been programming in it for many years, and I intend to program in it for many more. However, as with any programming language, Perl has flaws; honestly acknowledging them is the only way Perl programmers are going to be able to work around them, and perhaps one day fix them.

Some of these warts have been partially addressed, and I try to say so where that is the case. Some of these warts are addressed in Perl 6, but this article is about Perl 5, so I won't discuss those. Most of these warts are fundamental parts of the language that cannot be fixed or changed without seriously breaking a lot of code.

do{}, the Block that Isn't

The problem with do {} is one of inconsistency. It resembles a block, but isn't one. It causes special behavior when used with a trailing while.

do {} is evaluated as an expression. It's used primarily to group several expressions wherever one expression is needed; it's also used to introduce a new scope for an otherwise simple expression, such as the idiom for slurping a file: do { local $/; <$fh> }.

Unfortunately, this is where the similarity with a block ends. You cannot use loop control statements (next, last, redo) with do{}. This makes it especially difficult to use in a do { } while (...).
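One workaround, described in perlsyn, is to double the braces. The inner pair forms a bare block, which Perl treats as a loop that runs once, so next has something to act on. The following is only a minimal sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my @even;
my $n = 0;

# The inner braces form a bare block (a loop that runs once), so next
# leaves only that block and the while condition is still tested.
do {{
    $n++;
    next if $n % 2;          # skip odd numbers
    push @even, $n;
}} while $n < 6;

print "@even\n";             # prints "2 4 6"
```

Getting last to work takes more effort: the whole do { } while has to be wrapped in an outer, usually labeled, bare block.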

Normally, LEFT_EXPR while RIGHT_EXPR tests RIGHT_EXPR first and evaluates LEFT_EXPR only while it is true. This does not hold if LEFT_EXPR is a do {}; in that case, LEFT_EXPR is evaluated once before RIGHT_EXPR is tested for the first time, so the body always runs at least once.
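The difference is easiest to see with a condition that is false from the start. This is a small illustrative sketch, not taken from any real code:

```perl
use strict;
use warnings;

my $plain = 0;
$plain++ while $plain < 0;               # condition tested first: never runs

my $with_do = 0;
do { $with_do++ } while $with_do < 0;    # body runs before the first test

print "plain=$plain with_do=$with_do\n"; # prints "plain=0 with_do=1"
```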

Action at a Distance

Action at a distance is when an action at one point in a code base affects completely unrelated code elsewhere. Global variables are the biggest single cause of action at a distance; any code anywhere can, and probably eventually will, modify a global variable. This leads to bugs that involve debugging the entire code base to see what may have altered the variable.

Unfortunately, Perl still relies heavily on global variables for some basic operations. The biggest culprits are $_, $/, and $@, though bareword filehandles (which are package globals) can occasionally cause hard-to-trace bugs as well.

Using local() on the variable can solve most of the issues that you might run into, but it's not a perfect solution. local() simply hides the value of the global and restores it at the end of the enclosing block; any code called within that block will see the new value. Even local() itself can get you in trouble; see Tie Magic Still There.
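To make the scoping concrete, here is a minimal sketch. The $separator global and both subroutines are hypothetical names invented for the example.

```perl
use strict;
use warnings;

our $separator = ",";           # hypothetical package global

sub render { return join($separator, @_) }

sub render_dashed {
    local $separator = "-";     # temporary value for this dynamic scope
    return render(@_);          # the callee sees "-" as well
}

my $dashed = render_dashed(1, 2, 3);
my $plain  = render(1, 2, 3);   # the old value is back by now

print "$dashed $plain\n";       # prints "1-2-3 1,2,3"
```

The localized value is visible to everything called from inside the block, which is convenient here but is also exactly how unrelated code ends up seeing a value it did not expect.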

The implicit nature of $_, combined with the aliasing nature of foreach, can lead to some difficult-to-trace problems:

    my @x = qw( one two three );
    foreach (@x) {
        open(my $fh, '<', 'foo.txt') || die("open: $!.\n");
        while (<$fh>) {
            print;
        }
    }

    print join(" ", @x), "\n";

The result of the above is that all of the elements in @x are undef. The while loop assigns each line to $_ without localizing it, and the final read at end of file leaves $_ undef; because foreach has made $_ an alias for the current element of @x, that undef lands in the array.
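The usual way to avoid this is to name the loop variables instead of relying on $_. The sketch below runs both versions side by side; it reads from an in-memory string rather than foo.txt so that it is self-contained, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "line 1\nline 2\n";  # stands in for the contents of foo.txt

# Implicit $_ in both loops: the inner loop clobbers the elements.
my @clobbered = qw( one two three );
foreach (@clobbered) {
    open(my $fh, '<', \$text) or die("open: $!.\n");
    while (<$fh>) { }
    close($fh);
}

# Named lexicals in both loops: nothing is shared, nothing is clobbered.
my @kept = qw( one two three );
foreach my $item (@kept) {
    open(my $fh, '<', \$text) or die("open: $!.\n");
    while (my $line = <$fh>) { }
    close($fh);
}

my $defined_clobbered = grep { defined } @clobbered;
print "clobbered: $defined_clobbered defined; kept: @kept\n";
# prints "clobbered: 0 defined; kept: one two three"
```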

Thankfully, the filehandle problem has been solved with the introduction of lexical filehandles. Unfortunately, a lot of documentation and code still uses bareword filehandles.

Exceptions at a Distance

The exception variable, $@, is global, and therefore subject to action at a distance problems. Any eval in the call chain can change the variable, effectively hiding or changing the exception being raised.

The normal idiom for checking for exceptions is to call the code within an eval block, then check $@ afterwards:

    eval { potentially_fatal_action() };
    if ($@) {
        # handle the exception
    }

The problem with this is that any evals within cleanup code will hide or change the exception that was raised:

    sub Foo::DESTROY { eval 1 }
    eval {
        my $x = bless {}, 'Foo';
        die("Raise\n");
    };

After this code executes, $@ is empty, instead of being "Raise\n". This is because, when the Foo object went out of scope, the DESTROY method was called; the DESTROY method runs an eval, which raises no exception, effectively assigning $@ = ''. This is the value as seen after the eval block. (Perl 5.14 changed when $@ is set during unwinding, which addresses this particular case; older perls behave as described.)

There is a partial solution for this. I have not fully tested it yet, but the principle is to rely on the eval return value to determine whether an exception was raised, to store any exceptions raised, and to examine the call stack. If the exception was raised within another eval then it was caught and can be ignored. Unfortunately, this solution depends on $SIG{__DIE__}, another global; if any of the called code changes $SIG{__DIE__} and neglects to call the original then the exception message may be useless. The try() subroutine below is part of my as yet unreleased Exception::Class::Functions module.

    use List::Util qw();

    sub try (&) {
        my($code) = @_;

        my $framecount = 0;
        $framecount++ while caller($framecount);

        # We use $framecount to strip off frames inside the __DIE__ handler.
        # The extra two frames are for the outer block eval, and the function call
        # inside.
        $framecount += 2;

        my @errors;
        my $old_sigdie = $SIG{__DIE__};
        local $SIG{__DIE__} = sub {
            my @frames;
            my $i = 0;
            while (my @caller = caller($i)) {
                push @frames, $caller[3];
                $i++;
            }

            splice(@frames, -$framecount);

            push @errors, @_ == 1 ? $_[0] : join("", @_)
                unless List::Util::first { $_ eq '(eval)' } @frames;

            $old_sigdie->(@_) if $old_sigdie;
        };

        if (eval { $code->(); 1 }) {
            $@ = '';
            return undef;

        } else {
            $@ = List::Util::first { length } reverse @errors;
            $@ = 'Internal error, unable to determine actual error.'
                unless length $@;
            return $@;
        }
    }
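Stripped of the stack inspection, the core idea of trusting eval's return value rather than $@ can be sketched in a few lines. This is a simplified illustration, not a replacement for try(): it detects that something died, but if $@ was clobbered it can only report a generic message (the "Unknown error" text is my own placeholder).

```perl
use strict;
use warnings;

sub Foo::DESTROY { eval { 1 } }     # cleanup code that runs its own eval

my $ok = eval {
    my $x = bless {}, 'Foo';
    die("Raise\n");
    1;                              # only reached if nothing died
};

my $error;
if (!$ok) {
    # $@ may have been emptied by the eval in DESTROY, so do not rely on
    # it being true; fall back to a generic message if it is empty.
    $error = $@ || "Unknown error\n";
    print "caught: $error";
}
```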

Tie Magic Still There

Normally when you want to confine changes to a global to a specific block you use local(). Unfortunately, if the variable just localized has tie magic associated with it, the tie is not broken; any updates to the variable will call the underlying tie methods.

This is not typically a problem, as most variables you'd wish to tie are lexicals that you cannot localize anyway. However, recall that $_ is global, and is implicit for several operations; suddenly a tied value can work its way into code that would otherwise be safe. What's worse, code that is trying to be careful by localizing $_ before using it still gets the tie magic.

For example:

    use warnings;
    use strict;

    {
        package MyTie;
        sub TIESCALAR { bless \my $dummy, $_[0]     }
        sub FETCH     { print "FETCH\n"; ${ $_[0] } }
        sub STORE {
            my($self, $value) = @_;
            print "STORE ", defined($value) ? $value : 'undef', "\n";
            ${$self} = $value;
        }
    }

    tie(my $tied, 'MyTie');
    $tied = "good";

    foreach ($tied) {
        frobnicate();
    }

    print "$tied\n";

    sub frobnicate {
        my @n = @_;
        local $_;
        $_ = "foo";
    }

The output of this program is:

    STORE good
    FETCH
    STORE undef
    STORE foo
    STORE good
    FETCH
    good

The first STORE is from the assignment $tied = "good", and the final FETCH is from the print. Everything in between comes from frobnicate(): the first FETCH and the second STORE are from the local $_, the third STORE is from the $_ assignment, and the fourth STORE is from leaving frobnicate(), as the original value is restored. These extra calls are not a problem here, but if you're dealing with a class where STORE or FETCH has side effects, there have just now been four extraneous method calls, from code that was otherwise trying to be careful.

There are two workarounds for this problem; the first is to localize the entire GLOB, i.e. local *_. The other, introduced in 5.10 as a direct result of this problem, is to use my $_ (lexical $_ was later marked experimental in 5.18 and removed in 5.24, so it is only an option on the perls in between). Either of these workarounds causes the example program to display only one STORE and one FETCH.
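The glob workaround can be sketched as follows. This is a cut-down variant of the example program that records the tie calls in an array (the @calls log is my addition) rather than printing them, so the calls made during the loop can be counted.

```perl
use strict;
use warnings;

my @calls;    # log of tie method calls, instead of printing them

{
    package MyTie;
    sub TIESCALAR { bless \my $dummy, $_[0] }
    sub FETCH     { push @calls, "FETCH"; ${ $_[0] } }
    sub STORE     { push @calls, "STORE"; ${ $_[0] } = $_[1] }
}

sub frobnicate {
    local *_;         # localize the whole glob, not just the scalar
    $_ = "foo";       # assigns to a fresh, untied $_
}

tie(my $tied, 'MyTie');
$tied = "good";

my $before = @calls;
foreach ($tied) {
    frobnicate();
}
my $during = @calls - $before;    # tie calls made while the loop ran

print "tie calls during loop: $during\n";
print "value afterwards: $tied\n";
```

Note that local *_ also localizes @_ and everything else in the glob, so any arguments need to be copied out of @_ before it.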

Incidentally, foreach and map loops are safe from this problem. They both have a special form of localization that doesn't invoke the tying callbacks, and prevents any changes to the variable from being visible once the loop ends. If any of the values being iterated over are tied, however, it does propagate to any code within the loop.

The Hash Iterator

In order to support the each operator, each hash has its own internal iterator. This iterator is not reset until the hash is exhausted, or keys or values is called on the hash.

This is another example of action at a distance. It most often causes a problem when a hash reference is being passed around, and as it's being passed around it's being iterated over:

    my %data = ( one => 1, two => 2, three => 3, four => 4, five => 5 );

    while (my($k, $v) = each %data) {
        frobnicate(\%data);
    }

    sub frobnicate {
        my($href) = @_;
        print join(" ", keys %$href), "\n";
    }

This code actually results in an infinite loop; the subroutine is resetting the iterator, and the while loop never gets past the first key. Usually the problem is more subtle:

    my %data = ( one => 1, two => 2, three => 3, four => 4, five => 5 );

    dumper(\%data, 'one');
    dumper(\%data, 'two');

    sub dumper {
        my($href, $key) = @_;
        while (my($k, $v) = each %$href) {
            print "$k = $v\n";
            return if $k eq $key;
        }
    }

The output of the code is hard to determine; the first dumper call may be leaving the hash iterator in the middle of the hash, depending on what order the key-value pairs are returned. This leaves the second call to only find part of the hash, and perhaps not find the key it was looking for. Sometimes the code looks like it's working normally, while other times it appears the hash is incomplete.

The workaround is to never leave the iterator in the middle of the hash. In a subroutine, always reset the hash with keys or values, or make sure to exhaust the hash, before returning. If you are calling code within the each loop, don't pass it a reference to the hash, and don't call any subroutines that have access to the hash (subroutines in the same scope if it's a lexical, or anywhere if it's a global). If you need to pass a reference to the hash, or call a subroutine that has access to it, do so after the each loop is done.
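As a sketch of the first part of that advice, the subroutine below resets the iterator both before it starts and before it returns. The find_key() name is hypothetical; calling keys in void context is documented in perlfunc as a way to reset the iterator without building a list.

```perl
use strict;
use warnings;

my %data = ( one => 1, two => 2, three => 3, four => 4, five => 5 );

sub find_key {
    my($href, $key) = @_;
    my $found = 0;

    keys %$href;      # void context: resets the iterator before we start
    while (my($k, $v) = each %$href) {
        if ($k eq $key) {
            $found = 1;
            last;
        }
    }
    keys %$href;      # reset again so the next caller starts from the top

    return $found;
}

my @results = map { find_key(\%data, $_) } qw( one two three four five );
print "@results\n";   # prints "1 1 1 1 1"
```

When the hash is small, iterating with foreach over keys %$href avoids the shared iterator entirely, at the cost of building the key list up front.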

Regex Match Variables

Perl provides three variables, $`, $&, and $' after a successful regex match. In use English terms these are $PREMATCH, $MATCH, and $POSTMATCH; they are filled with the text before the match, the text the regex matched, and the text after the match, respectively.

Unfortunately, there are two problems with these variables. First, they are global; second, filling them takes extra effort. Because they are global, perl has no real way to know which match the user wants them for.

So, any mention of these variables anywhere in a program or module will cause perl to fill them for every pattern match. Depending on the number and complexity of the pattern matches performed, this can result in a serious performance hit.

The only workaround is to never mention any of these variables in your code, and to avoid any modules that mention them. If you use the English module you need to import -no_match_vars (use English qw( -no_match_vars );), otherwise using the module will count as mentioning them.

With recent Perls (5.10 and up) you can use the /p option on the regex. This will fill ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH}, but will only do so for that regex. While this is useful going forward, the old variables are still available, and still impose a performance penalty globally.
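A short sketch of the /p form, which needs perl 5.10 or later; the sample string and variable names are arbitrary:

```perl
use strict;
use warnings;
use English qw( -no_match_vars );   # readable names without $`, $&, $'

my $text = "hello world";
my @parts;

if ($text =~ /o w/p) {
    # These are filled in only for matches that asked for them with /p.
    @parts = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print join("|", @parts), "\n";      # prints "hell|o w|orld"
```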