Perl Weekly Challenge: Week 81

Challenge 1:

Common Base String

You are given 2 strings, $A and $B.

Write a script to find out common base strings in $A and $B.

A substring of a string $S is called base string if repeated concatenation of the substring results in the string.

Example 1:
Input:
    $A = "abcdabcd"
    $B = "abcdabcdabcdabcd"

Output:
    ("abcd", "abcdabcd")
Example 2:
Input:
    $A = "aaa"
    $B = "aa"

Output:
    ("a")

In Raku:

sub MAIN(*$A, *$B) {

It occurred to me that a common base string cannot be any longer than than the shorter of the two strings $A and $B. So the first thing I do is find out which one is smaller and which is larger.

    my ($smaller, $larger) = $A.chars < $B.chars ?? ($A, $B) !! ($B, $A);

    my @base_strings;

Then for each substring $s of $smaller from the first character, the first two characters and so on upto the full length of the string, I check if it is a base string.

A simple optimization I thought of is to check if the length of $larger is an even multiple of $c (the length of $s.) If not, there will be left over characters so it not going to be a base string and we can skip processing.

    for 1 .. $smaller.chars -> $c {
        if  ($larger.chars %% $c) {
            my $l = $larger;
            my $s = $smaller.substr(0, $c);

Then I try removing all occurrences of $s from $l (a copy of $larger.) If there is nothing left in $l it means $s is a base string.

            $l ~~ s:g/ $s //;
            if ($l eq q{}) {
                @base_strings.push($s);
            }
        }
    }

Finally we print out any base strings we have found. In hindsight I probably don't need the sort as the results will be processed in the right order anyway.

    say sort @base_strings;
}

(Full code on Github.)

This is the Perl version. It is a straight port of the Raku version so there is nothing remarkable to tell about it.

my ($smaller, $larger) = length $A < length $B ? ($A, $B) : ($B, $A);
my @base_strings;

for my $c (1 .. length $smaller) {
    if  ((length $larger) % $c == 0) {
        my $l = $larger;
        my $s = substr $smaller, 0, $c;
        $l =~ s/ $s //gmsx;
        if ($l eq q{}) {
            push @base_strings, $s;
        }
    }
}

say q{(}, (join q{ }, sort @base_strings), q{)};

(Full code on Github.)

Challenge 2:

Frequency Sort

You are given file named input.

Write a script to find the frequency of all the words.

It should print the result as first column of each line should be the frequency of the the word followed by all the words of that frequency arranged in lexicographical order. Also sort the words in the ascending order of frequency.

INPUT file

West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo and Juliet". The feuding families become two warring New York City gangs, the white Jets led by Riff and the Latino Sharks, led by Bernardo. Their hatred escalates to a point where neither can coexist with any form of understanding. But when Riff's best friend (and former Jet) Tony and Bernardo's younger sister Maria meet at a dance, no one can do anything to stop their love. Maria and Tony begin meeting in secret, planning to run away. Then the Sharks and Jets plan a rumble under the highway--whoever wins gains control of the streets. Maria sends Tony to stop it, hoping it can end the violence. It goes terribly wrong, and before the lovers know what's happened, tragedy strikes and doesn't stop until the climactic and heartbreaking ending.

NOTE For the sake of this task, please ignore the following in the input file:

. " ( ) , 's --

OUTPUT

1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger

2 Bernardo Jets Riff Sharks The by it led tragedy

3 Maria Tony a can of stop

4 to

9 and the

This time I did the Perl version first. My solution consists of a set of functions which take as input the output of an enclosed function as you would in a functional language such as LISP.

display(
    countWords(
        removePunctuation(
            readText(
                './input'
            )
        )
    )
);

Looking at these functions from inside out, first we have readText() which takes the input file and slurps it into a string.

sub readText {
    my ($filename) = @_;
    my $text;

    open my $fh, '<', $filename or die "$OS_ERROR\n";
    local $RS = undef;
    $text = <$fh>;
    close $fh;

    return $text;
}

removePunctuation() takes that string output by readText() and removes all those bits that the spec says to ignore and replaces them with spaces and outputs the result.

sub removePunctuation {
    my ($text) = @_;

    $text =~ s/ \. | " | \( | \) | , | 's | --/ /gmsx;

    return $text;
}

countWords() takes the output of removePunctuation() and splits it into words. These words are counted by being added to a hash where the keys are words and the values are the number of times each word occurs. A reference to this hash is returned.

I could have made this function into almost a one-liner by using map() instead of a for loop. I don't know why I didn't.

sub countWords {
    my ($text) = @_;
    my %count;
    my @words = split /\s+/, $text;

    for my $word (@words) {
        $count{$word}++;
    }

    return \%count;
}

Finally we display the results. To do this display() must invert the hash provided by countWords() so that the keys are the frequency with which a word occurred and the values are lists of words which have that frequency.

sub display {
    my ($count) = @_;
    my %frequency;

    map {push @{$frequency{$count->{$_}}}, $_; } sort keys %{$count};

    for my $key (sort keys %frequency) {
        say "$key ", (join q{ }, @{$frequency{$key}}), "\n";
    }
}

(Full code on Github.)

The Raku version follows the same design as Perl. Here are a few notes about differences.

IO.slurp() does Perls' open/read/close file loop all in one go.

sub readText($filename) {
    my $text;

    return $filename.IO.slurp;
}

A principle of functional programming is that the input of a function should be immutable. I didn't always follow this in my Perl version but Raku enforces this for you, so the first step I have to do here is to make a copy of the input and work on that.

sub removePunctuation($text) {
    my $processed = $text;

    $processed ~~ s:g/ '.' | '"' | '(' | ')' | ',' | "'s" | '--'/ /;

    return $processed;
}

Raku lists have a nice method called .classify() that takes the elements and sorts them into a hash with keys whose criteria you specify and lists of elements that specify those criteria as values. Unfortunately, I could not figure out a way to count the number of values so my hash keys looked like e.g. 'Maria' => ['Maria', 'Maria', 'Maria'] instead of what I really wanted, 'Maria' => 3. So I had to do this processing myself as a second step. Instead of recreating the hash I returned a list of key-value pairs so that this function could be used as input for the feed operator. (See below.)

sub countWords($text) {
    my %count = $text.split( /\s+/ ).classify({ $_; });
    return %count.keys.map({ %count{$_}.elems => $_; });
}

Actually countWords returns 'antipairs' (i.e 3 => 'Maria'.) I uses these to reassemble a hash where keys are the frequency with which a word occurred and the values are lists of words which have that frequency.

sub display(@count) {
    my %frequency;

    @count.sort.map({ %frequency{$_.key}.push($_.value); });

    %frequency.keys.sort.map({
        say "$_ ", %frequency{$_}.join(q{ }), "\n";
    });
}

As you can see the main driver of this code is much easier to read than in Perl thanks to the ==> or 'feed' operator. The only problem with the feed operator is that its' input cannot be a hash. I got around this by returning key-value pairs instead as mentioned before.

sub MAIN() {
    readText('./input')
        ==> removePunctuation()
        ==> countWords()
        ==> display()
    ;
}

(Full code on Github.)