Perl Weekly Challenge: Week 10

Challenge 1:

Write a script to encode/decode Roman numerals. For example, given Roman numeral CCXLVI, it should return 246. Similarly, for decimal number 39, it should >return XXXIX. Checkout wikipedia page for more information.

For Raku (Perl 6) I used the powerful grammar feature to solve this challenge. There are four parts:

The grammar RomanNumerals defines the structure of valid Roman numerals. It breaks a numeral into thousands, hundreds, tens, and ones, each represented by their own tokens. Each token matches specific Roman numeral patterns i.e. MMM for 3000, IV for 4 etc. The grammar can only match numbers up to 3000 but that was the practical limit for Roman numerals any way.

grammar RomanNumerals {
    token TOP {
        <thousands>? <hundreds>? <tens>? <ones>?
    }

    token thousands {
        <ThreeM> | <TwoM> | <OneM>
    }

    token hundreds {
        <NineC> | <EightC> | <SevenC> | <SixC> | <FiveC> | <FourC> | <ThreeC> | 
        <TwoC> | <OneC>
    }

    token tens {
        <NineX> | <EightX> | <SevenX> | <SixX> | <FiveX> | <FourX> | <ThreeX> |
        <TwoX> | <OneX>
    }

    token ones {
        <NineI> | <EightI> | <SevenI> | <SixI> | <FiveI> | <FourI> | <ThreeI> |
        <TwoI> | <OneI>
    }

    token ThreeM { MMM }
    token TwoM { MM }
    token OneM { M }

    token NineC { CM }
    token EightC { DCCC }
    token SevenC { DCC }
    token SixC { DC }
    token FiveC { D }
    token FourC { CD }
    token ThreeC { CCC }
    token TwoC { CC }
    token OneC { C }

    token NineX { XC }
    token EightX { LXXX }
    token SevenX { LXX }
    token SixX { LX }
    token FiveX { L }
    token FourX { XL }
    token ThreeX { XXX }
    token TwoX { XX }
    token OneX { X }

    token NineI { IX }
    token EightI { VIII }
    token SevenI { VII }
    token SixI { VI }
    token FiveI { V }
    token FourI { IV }
    token ThreeI { III }
    token TwoI { II }
    token OneI { I }
}

The RomanNumeralsAction class provides methods that are called when the grammar matches a part of the input. Each method uses the make keyword to associate a numeric value with the matched Roman numeral part. For example, if the grammar matches DCCC, the EightC method returns 800. The TOP() method sums the values of all matched parts to produce the final decimal value.

class RomanNumeralsAction {
    method TOP($/) {
        make $/.values».made.sum;
    }

    method thousands($/) {
        make $/.values[0].made;
    }

    method hundreds($/) {
        make $/.values[0].made;
    }

    method tens($/) {
        make $/.values[0].made;
    }

    method ones($/) {
        make $/.values[0].made;
    }

    method ThreeM($/) { make 3000; }
    method TwoM($/)   { make 2000; }
    method OneM($/)   { make 1000; }

    method NineC($/)  { make 900; }
    method EightC($/) { make 800; }
    method SevenC($/) { make 700; }
    method SixC($/)   { make 600; }
    method FiveC($/)  { make 500; }
    method FourC($/)  { make 400; }
    method ThreeC($/) { make 300; }
    method TwoC($/)   { make 200; }
    method OneC($/)   { make 100; }

    method NineX($/)  { make 90; }
    method EightX($/) { make 80; }
    method SevenX($/) { make 70; }
    method SixX($/)   { make 60; }
    method FiveX($/)  { make 50; }
    method FourX($/)  { make 40; }
    method ThreeX($/) { make 30; }
    method TwoX($/)   { make 20; }
    method OneX($/)   { make 10; }

    method NineI($/)  { make 9; }
    method EightI($/) { make 8; }
    method SevenI($/) { make 7; }
    method SixI($/)   { make 6; }
    method FiveI($/)  { make 5; }
    method FourI($/)  { make 4; }
    method ThreeI($/) { make 3; }
    method TwoI($/)   { make 2; }
    method OneI($/)   { make 1; }
}

fromRoman() is a subroutine that takes a Roman numeral string, parses it using the grammar and action class, and returns the corresponding decimal value. If the input is not a valid Roman numeral, it returns 0.

sub fromRoman(Str $number where .uc) {
    my $decimal = RomanNumerals.parse($number,
                    actions => RomanNumeralsAction.new);
    if defined $decimal {
        return $decimal.made;
    }
    return 0;
}

The toRoman() subroutine converts a decimal number (between 1 and 3000) to its Roman numeral representation. It does this by breaking the number into thousands, hundreds, tens, and ones, and appending the appropriate Roman numeral symbols for each part to the result string.

sub toRoman(Int $number where { $number >= 1 && $number <= 3000 }) {
    my Str $result;

    given ($number div 1000) {
        when 1 { $result ~= 'M'; }
        when 2 { $result ~= 'MM'; }
        when 3 { $result ~= 'MMM'; }
        default {}
    }

    given ($number div 100 % 10) {
        when 1 { $result ~= 'C'; }
        when 2 { $result ~= 'CC'; }
        when 3 { $result ~= 'CCC'; }
        when 4 { $result ~= 'CD'; }
        when 5 { $result ~= 'D'; }
        when 6 { $result ~= 'DC'; }
        when 7 { $result ~= 'DCC'; }
        when 8 { $result ~= 'DCCC'; }
        when 9 { $result ~= 'CM'; }
        default {}
    }

    given ($number div 10 % 10) {
        when 1 { $result ~= 'X'; }
        when 2 { $result ~= 'XX'; }
        when 3 { $result ~= 'XXX'; }
        when 4 { $result ~= 'XL'; }
        when 5 { $result ~= 'L'; }
        when 6 { $result ~= 'LX'; }
        when 7 { $result ~= 'LXX'; }
        when 8 { $result ~= 'LXXX'; }
        when 9 { $result ~= 'XC'; }
        default {}
    }

    given ($number % 10) {
        when 1 { $result ~= 'I'; }
        when 2 { $result ~= 'II'; }
        when 3 { $result ~= 'III'; }
        when 4 { $result ~= 'IV'; }
        when 5 { $result ~= 'V'; }
        when 6 { $result ~= 'VI'; }
        when 7 { $result ~= 'VII'; }
        when 8 { $result ~= 'VIII'; }
        when 9 { $result ~= 'IX'; }
        default {}
    }

    return $result;
}

(Full code on Github.)

Perl doesn't have grammars so I just used a big given/when statement with regular expressions to parse the numbers directly within fromRoman(). The \G anchor ensures that matching continues from where the last match left off, and the /gc modifiers update the regex position. For each successful match, the corresponding value is added to $result. If the end of the string is reached, the loop exits. If there is invalid input, i.e. unexpected pattern occurred, $result is set to 0 and the loop is exited.

sub fromRoman {
    my ($number) = @_;
    my $result;

    while () {
        given ($number) {
            when (/\GMMM/gc) { $result += 3000; }
            when (/\GMM/gc)  { $result += 2000; }
            when (/\GM/gc)   { $result += 1000; }

            when (/\GCM/gc)   { $result += 900; }
            when (/\GDCCC/gc) { $result += 800; }
            when (/\GDCC/gc)  { $result += 700; }
            when (/\GDC/gc)   { $result += 600; }
            when (/\GD/gc)    { $result += 500; }
            when (/\GCD/gc)   { $result += 400; }
            when (/\GCCC/gc)  { $result += 300; }
            when (/\GCC/gc)   { $result += 200; }
            when (/\GC/gc)    { $result += 100; }

            when (/\GXC/gc)   { $result += 90; }
            when (/\GLXXX/gc) { $result += 80; }
            when (/\GLXX/gc)  { $result += 70; }
            when (/\GLX/gc)   { $result += 60; }
            when (/\GL/gc)    { $result += 50; }
            when (/\GXL/gc)   { $result += 40; }
            when (/\GXXX/gc)  { $result += 30; }
            when (/\GXX/gc)   { $result += 20; }
            when (/\GX/gc)    { $result += 10; }

            when (/\GIX/gc)   { $result += 9; }
            when (/\GVIII/gc) { $result += 8; }
            when (/\GVII/gc)  { $result += 7; }
            when (/\GVI/gc)   { $result += 6; }
            when (/\GV/gc)    { $result += 5; }
            when (/\GIV/gc)   { $result += 4; }
            when (/\GIII/gc)  { $result += 3; }
            when (/\GII/gc)   { $result += 2; }
            when (/\GI/gc)    { $result += 1; }

            when (/\G$/gc)    { last; }
            default           { $result = 0; last; } # some unexpected input
        }
    }
    return $result;
}

For toRoman()...

sub toRoman {
    my ($number) = @_;

...first we check if the input is within the valid range. If not, processing stops and a usage message is printed.

    if ($number < 1 || $number > 3000) {
        usage();
    }

    my $result;

Otherwise, we break the number into thousands, hundreds, tens, and ones using division and modulo operations. For each digit, it appends the appropriate Roman numeral symbols to the result string using a given/when for each possible value. This process constructs the Roman numeral from left to right, ensuring the correct order and format.

    given ($number / 1000 % 10) {
        when (1) { $result .= 'M'; }
        when (2) { $result .= 'MM'; }
        when (3) { $result .= 'MMM'; }
        default {}
    }

    given ($number / 100 % 10) {
        when (1) { $result .= 'C'; }
        when (2) { $result .= 'CC'; }
        when (3) { $result .= 'CCC'; }
        when (4) { $result .= 'CD'; }
        when (5) { $result .= 'D'; }
        when (6) { $result .= 'DC'; }
        when (7) { $result .= 'DCC'; }
        when (8) { $result .= 'DCCC'; }
        when (9) { $result .= 'CM'; }
        default {}
    }

    given ($number / 10 % 10) {
        when (1) { $result .= 'X'; }
        when (2) { $result .= 'XX'; }
        when (3) { $result .= 'XXX'; }
        when (4) { $result .= 'XL'; }
        when (5) { $result .= 'L'; }
        when (6) { $result .= 'LX'; }
        when (7) { $result .= 'LXX'; }
        when (8) { $result .= 'LXXX'; }
        when (9) { $result .= 'XC'; }
        default {}
    }

    given ($number % 10) {
        when (1) { $result .= 'I'; }
        when (2) { $result .= 'II'; }
        when (3) { $result .= 'III'; }
        when (4) { $result .= 'IV'; }
        when (5) { $result .= 'V'; }
        when (6) { $result .= 'VI'; }
        when (7) { $result .= 'VII'; }
        when (8) { $result .= 'VIII'; }
        when (9) { $result .= 'IX'; }
        default {}
    }

    return $result;
}

(Full code on Github.)

Challenge 2:

Write a script to find Jaro-Winkler distance between two strings. For more information check wikipedia page.

In MAIN() we take two command-line arguments as the strings to be compared., and prints the Jaro-Winkler distance between them. This structure allows for easy command-line usage and clear separation of the algorithm's components.

multi sub MAIN(
    Str $arg1,   #= First string to compare.
    Str $arg2    #= Second string to compare.
) {

The strings are swap()ed if necessary and copied so they can be modified. (Raku/Perl6 function arguments are immutable.) The swap() function is described below.

    my ($string1, $string2) = swap($arg1, $arg2);

jaroWinklerDistance() (described below) performs the task at hand and the result is printed out with say().

    say jaroWinklerDistance($string1, $string2);
}

This is swap(). The point of it is to make sure the longer string is always the first argument which helps optimize the comparison process.

sub swap(Str $s1, Str $s2) {
    return (max($s1.chars, $s2.chars) ~~ $s1.chars) ?? ($s1, $s2) !! ($s2, $s1);
}

jaroWinklerDistance() simply returns one minus the Jaro-Winkler similarity, giving a distance metric where 0 means identical strings and values closer to 1 indicate less similarity.

sub jaroWinklerDistance(Str $string1, Str $string2) {
    return 1 - jaroWinklerSimilarity($string1, $string2);
}

The jaroWinklerSimilarity() subroutine enhances the Jaro similarity by giving more weight to strings that match from the beginning.

sub jaroWinklerSimilarity(Str $string1, Str $string2) {

The Jaro similarity score is calculated...

    my $j = jaroSimilarity($string1, $string2);

...and a bonus is added to it based on the length of the common prefix, scaled by 0.1. The final score is returned.

    return $j + (prefix($string1, $string2) * 0.1) * (1 - $j);
}

prefix() computes the length of the common prefix (up to 4 characters) shared by two strings.

sub prefix(Str $string1, Str $string2) {

First we split both strings into arrays of characters.

    my @prefix1 = $string1.comb;
    my @prefix2 = $string2.comb;

A counter is created to keep track of how many characters match.

    my $result = 0;

Then we compare the arrays of characters one by one,

    for 0 .. 3 -> $i {

For each matching character at the same position, $result is incremented.

        if (@prefix2[$i].defined && @prefix1[$i] eq @prefix2[$i]) {
            $result++;
        } else {

The loop stops at the first mismatch or after four characters.

            last;
        }
    }

And $result is returned.

    return $result;
}

jaroSimilarity() calculates the Jaro similarity score between two strings.

sub jaroSimilarity(Str $string1, Str $string2) {

$string2 is immutable but it has to be modified so a copy is made.

    my Str $string2Copy = $string2;

We will need to count the number of matching characters ($m) and transpositions ($t).

    my $m = 0;
    my $t = 0;

Based on the length of $string1 (calculated with .chars()) we determine what distance is deemed "near".

    my $near = $string1.chars / 2 - 1;

And we have to break up $string1 into a list of individual characters with .comb().

    my @chars = $string1.comb;

For each character in that list...

    for 0 .. @chars.elems - 1 -> $i {

...it is considered matching if it appears in both strings and is within a $near distance.

        my $pos = $string2Copy.index(@chars[$i]);
        if $pos.defined {
            $m++;
            if abs($pos - $i) > $near {
                $t++;
            }

To avoid counting the same character twice, matched characters in the second string are replaced with a space.

            $string2Copy.substr-rw($pos, 1) = ' '; # prevent using same char twice.
        }
    }

The final Jaro similarity is computed using the standard formula mentioned in Wikipedia, which averages the ratios of matches to string lengths and adjusts for transpositions.

    $t /= 2;
    return $m
        ?? 1/3 * ($m / $string1.chars + $m / $string2.chars + ($m - $t) / $m)
        !! 0;
}

(Full code on Github.)

The Perl version is coded exactly the same as in Raku/Perl6.

sub prefix {
    my ($string1, $string2) = @_;
    my @prefix1 = split //, substr($string1, 0, 4);
    my @prefix2 = split //, substr($string2, 0, 4);
    my $result = 0;

    for my $i (0 .. 3) {
        if (defined $prefix2[$i] && $prefix1[$i] eq $prefix2[$i]) {
            $result++;
        } else {
            last;
        }
    }

    return $result;
}

sub jaroSimilarity {
    my ($string1, $string2) = @_;

    my $m = 0; # matches
    my $t = 0; # transpositions
    my $near = int((length $string1) / 2 - 1);
    my @chars = split //, $string1;

    for my $i (0 .. scalar @chars - 1) {
        my $pos = index $string2, $chars[$i];
        if ($pos > -1) {
            $m++;
            if (abs($pos - $i) > $near) {
                $t++;
            }
            substr($string2, $pos, 1) = ' '; # prevent using same char twice.
        }
    }
    $t /= 2;

    return $m
    ? 1/3 * ($m / (length $string1) + $m / (length $string2) + ($m - $t) / $m)
    : 0;
}

sub jaroWinklerSimilarity {
    my ($string1, $string2) = @_;
    my $j = jaroSimilarity($string1, $string2);

    return $j + (prefix($string1, $string2) * 0.1) * (1 - $j);
}

sub jaroWinklerDistance {
    my ($string1, $string2) = @_;

    return 1 - jaroWinklerSimilarity($string1, $string2);
}

sub swap {
    my ($s1, $s2) = @_;

    return length $s1 >= length $s2 ? ($s1, $s2) : ($s2, $s1);
}

my ($string1, $string2) = swap($ARGV[0], $ARGV[1]);

say jaroWinklerDistance($string1, $string2);

(Full code on Github.)