Perl Weekly Challenge: Week 186

Challenge 1:

Zip List

You are given two lists @a and @b of the same size.

Create a subroutine sub zip(@a, @b) that merges the two lists as shown in the example below.

Example
Input:  @a = qw/1 2 3/; @b = qw/a b c/;
Output: zip(@a, @b) should return qw/1 a 2 b 3 c/;
        zip(@b, @a) should return qw/a 1 b 2 c 3/;

Raku has an operator, Z, that performs the zip operation. So our zip() subroutine can be a simple wrapper around it like this, right?

sub zip(@a, @b) {
    return @a Z @b;
}

Not quite. Raku returns each pair from the two lists as its own sub-list, giving for the first example:

((1 a) (2 b) (3 c))

which is not what the spec wants. We have to "flatten" the result with .flat() to merge all the sub-lists. The return type of .flat() is Seq, so for good measure we should also use .list() to turn it back into a List. The final result is this:

sub zip(@a, @b) {
    return (@a Z @b).flat.list;
}

(Full code on Github.)

Perl doesn't have a Z operator but it is easy enough to mimic it.

sub zip {
    my @a = @{ $_[0] };
    my @b = @{ $_[1] };

    my @result;
    for my $i (0 .. $#a) {
        push @result, $a[$i], $b[$i];
    }
    return @result;
}

(Full code on Github.)

This isn't exactly equivalent to Raku; for one thing, it flattens the result internally. Also, the arrays have to be passed by reference, i.e. zip(\@a, \@b), which is not quite according to spec. I tried to make it work properly with prototypes and even signatures (which are still experimental in the version of Perl I am using, 5.30) but I was not able to get it right.
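For what it's worth, a (\@\@) prototype is one approach that should allow the zip(@a, @b) calling convention from the spec. The following is only a minimal sketch under that assumption, not necessarily what was attempted above. The prototype only takes effect when the sub is declared before the call site and is called without a leading &; the names $first and $second are just illustrative.

```perl
use strict;
use warnings;

# The (\@\@) prototype makes Perl pass a reference to each array,
# so the caller can write zip(@a, @b) without the backslashes.
sub zip(\@\@) {
    my ($first, $second) = @_;

    # Interleave the elements, using the first array's indices.
    return map { ($first->[$_], $second->[$_]) } 0 .. $#{$first};
}

my @a = qw/1 2 3/;
my @b = qw/a b c/;

print join(' ', zip(@a, @b)), "\n";   # 1 a 2 b 3 c
print join(' ', zip(@b, @a)), "\n";   # a 1 b 2 c 3
```

The trade-off is that the prototype insists on literal arrays at the call site; passing a list or an existing array reference will not compile against it.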

Challenge 2:

Unicode Makeover

You are given a string with possible Unicode characters.

Create a subroutine sub makeover($str) that replaces the Unicode characters with their ASCII equivalents. For this task, let us assume it only contains alphabetic characters.

Example 1
Input: $str = 'ÃÊÍÒÙ';
Output: 'AEIOU'
Example 2
Input: $str = 'ãÊíÒÙ';
Output: 'aEiOU'

Perl.

In today's global environment, plain old ASCII doesn't cut it anymore. Luckily there is Unicode, which does a brilliant job of capturing the complexity and diversity of mankind's various languages while still remaining interoperable with legacy encodings such as ASCII. And Perl and Raku have some of the best support for Unicode in any programming language.

Unicode makes a distinction between graphemes and codepoints. A grapheme is the actual character you see and use, e.g. વ્યા in my surname. However, a character may be a composite of several Unicode codepoints. That character, for example, is encoded as GUJARATI LETTER VA + GUJARATI SIGN VIRAMA + GUJARATI LETTER YA + GUJARATI VOWEL SIGN AA. (By the way, these names are defined by Unicode. They represent numeric values of differing lengths but it is much more convenient to use the names.) It is up to whatever font you are using to display this. Good quality Gujarati fonts will omit the virama (which indicates there is no intermediate vowel sound between the previous and next characters) and display a ligature of va and ya. Lesser fonts might show the virama between entire va and ya characters, and if you don't have a Gujarati font at all, you might see blank spaces, squares or numbers.
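To make the grapheme/codepoint distinction concrete, here is a small sketch that builds that character from its four named codepoints and prints each one with its number and name. It assumes only the core charnames module; the variable name $conjunct is illustrative.

```perl
use strict;
use warnings;
use charnames ':full';   # enables \N{...} names and charnames::viacode()

# One visible character, built from four codepoints.
my $conjunct = "\N{GUJARATI LETTER VA}\N{GUJARATI SIGN VIRAMA}"
             . "\N{GUJARATI LETTER YA}\N{GUJARATI VOWEL SIGN AA}";

# length() counts codepoints, not what the reader sees on screen.
print length($conjunct), " codepoints\n";

for my $char (split //, $conjunct) {
    printf "U+%04X %s\n", ord($char), charnames::viacode(ord($char));
}
```

This should list U+0AB5, U+0ACD, U+0AAF and U+0ABE alongside the names given above.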

Many languages, particularly in Europe, extend ASCII with various accents and other phonetic signs. For interoperability with legacy encodings in those languages, Unicode has codepoints for entire characters such as Ã. It is encoded as LATIN CAPITAL LETTER A WITH TILDE, but it could also be broken down to LATIN CAPITAL LETTER A + COMBINING TILDE. The Unicode standard refers to methods of composition or decomposition as normalization. In particular, the decomposition of a character into a base + modifiers is called normalization form D.
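As a quick illustration of normalization form D, the sketch below decomposes the precomposed Ã with the core Unicode::Normalize module and prints the resulting codepoints. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw/ normalize /;

# A single precomposed codepoint, U+00C3.
my $precomposed = "\N{LATIN CAPITAL LETTER A WITH TILDE}";

# Form D splits it into a base letter plus a combining mark.
my $decomposed = normalize('D', $precomposed);

print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $decomposed), "\n";
# U+0041 U+0303
```

U+0041 is the plain ASCII A and U+0303 is COMBINING TILDE, so dropping everything that is not an ASCII letter leaves just the base.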

We can use this in Perl with the Unicode::Normalize module which is part of the standard Perl distribution.

use Unicode::Normalize qw/ normalize /;

Note that we are importing the normalize() function. We can use it in our makeover() function.

sub makeover {
    my ($str) = @_;

    my $result = normalize('D', $str);

Once we have a normalized string, all we have to do is strip out any codepoints which are not ASCII lower or upper case alphabetical characters and return the result.

    $result =~ s/[^a-zA-Z]+//g;

    return $result;
}

This is sufficient to make over the example strings. If you wanted to use this function in the real world, there is one catch: due to the need to support legacy encodings, there are certain characters which cannot be decomposed to an ASCII base + signs. They have to be converted by hand, which I did with the following code after the call to normalize(). (It needs use feature 'state', and initializing a state hash from a list like this requires Perl 5.28 or later.)

    state %special = (
        "\N{LATIN CAPITAL LETTER AE}"            => 'AE',
        "\N{LATIN CAPITAL LETTER ETH}"           => 'D',
        "\N{LATIN CAPITAL LETTER F WITH HOOK}"   => 'F',
        "\N{LATIN CAPITAL LETTER O WITH STROKE}" => 'O',
        "\N{LATIN CAPITAL LETTER SHARP S}"       => 'SS',
        "\N{LATIN CAPITAL LETTER THORN}"         => 'TH',
        "\N{LATIN SMALL LETTER AE}"              => 'ae',
        "\N{LATIN SMALL LETTER ETH}"             => 'd',
        "\N{LATIN SMALL LETTER F WITH HOOK}"     => 'f',
        "\N{LATIN SMALL LETTER O WITH STROKE}"   => 'o',
        "\N{LATIN SMALL LETTER SHARP S}"         => 'ss',
        "\N{LATIN SMALL LETTER THORN}"           => 'th',
        "\N{LATIN CAPITAL LIGATURE OE}"          => 'OE',
        "\N{LATIN SMALL LIGATURE OE}"            => 'oe',
    );
    for my $k (keys %special) {
        $result =~ s/$k/$special{$k}/g;
    }
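Putting the fragments together, a self-contained version might look like the sketch below. It makes a few assumptions not stated above: the source file is saved as UTF-8 (hence use utf8), the table is trimmed to two entries for brevity, and it uses a file-scoped my %special instead of state so that it also compiles on Perls older than 5.28. The 'Straße' input is an extra illustrative case, not one of the challenge examples.

```perl
use strict;
use warnings;
use utf8;                                   # string literals below are UTF-8
use Unicode::Normalize qw/ normalize /;

binmode STDOUT, ':encoding(UTF-8)';

# Trimmed version of the hand-made conversion table.
my %special = (
    "\N{LATIN SMALL LETTER AE}"      => 'ae',
    "\N{LATIN SMALL LETTER SHARP S}" => 'ss',
);

sub makeover {
    my ($str) = @_;

    # Decompose into base letters + combining marks.
    my $result = normalize('D', $str);

    # Convert the characters that have no decomposition by hand.
    for my $k (keys %special) {
        $result =~ s/$k/$special{$k}/g;
    }

    # Strip everything that is not an ASCII letter.
    $result =~ s/[^a-zA-Z]+//g;

    return $result;
}

print makeover('ÃÊÍÒÙ'), "\n";    # AEIOU
print makeover('ãÊíÒÙ'), "\n";    # aEiOU
print makeover('Straße'), "\n";   # Strasse
```

Note that the hand conversion has to run before the final strip; otherwise characters such as ß would simply be deleted.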

(Full code on Github.)

(There must be some reason why Æ is a letter but Œ is a ligature but I don't know it.)

Raku doesn't need a module for normalization form D; the Str class has a method .NFD() for it.

sub makeover($str) {
    my $result = $str.NFD

The problem is that it returns an NFD object (a buffer of codepoints), not a string. It does have .list(), which gives a List of codepoints. Running chr() on each element of that list via .map() converts it into a Seq of characters, and finally .Str will convert that Seq to a Str.

        .list.map({ chr($_); })
        .Str;

We manually convert the special characters as we did in Perl, though note that the \N escape for Unicode codepoint names has now become \c.

    state %special = (
        "\c[LATIN CAPITAL LETTER AE]"            => 'AE',
        "\c[LATIN CAPITAL LETTER ETH]"           => 'D',
        "\c[LATIN CAPITAL LETTER F WITH HOOK]"   => 'F',
        "\c[LATIN CAPITAL LETTER O WITH STROKE]" => 'O',
        "\c[LATIN CAPITAL LETTER SHARP S]"       => 'SS',
        "\c[LATIN CAPITAL LETTER THORN]"         => 'TH',
        "\c[LATIN SMALL LETTER AE]"              => 'ae',
        "\c[LATIN SMALL LETTER ETH]"             => 'd',
        "\c[LATIN SMALL LETTER F WITH HOOK]"     => 'f',
        "\c[LATIN SMALL LETTER O WITH STROKE]"   => 'o',
        "\c[LATIN SMALL LETTER SHARP S]"         => 'ss',
        "\c[LATIN SMALL LETTER THORN]"           => 'th',
        "\c[LATIN CAPITAL LIGATURE OE]"          => 'OE',
        "\c[LATIN SMALL LIGATURE OE]"            => 'oe',
    );
    for %special.keys -> $k {
        $result = $result.subst(/$k/, %special{$k}, :g);
    }

And then we strip out the non-alphabet characters just as in Perl.

    $result = $result.subst(/<-[a..zA..Z]>+/, q{}, :g);

    return $result;
}

(Full code on Github.)