SRM 683, D2, 250-Pointer (EqualSubstrings2)

James S. Plank

Wed Jan 27 16:58:02 EST 2021

Problem Statement.
A main() with the examples compiled in.
A skeleton that compiles with main.cpp.
tests.sh. MD5 hash: 32f1eb1db03de3703535bdb2953860be
answers.txt. MD5 hash: 8f45ec1b5c2b3e6c6c3fe3c2037c8010
Problem Given in Topcoder: 2016
Competitors who opened the problem: 505
Competitors who submitted a solution: 393
Number of correct solutions: 305
Accuracy (percentage correct vs those who opened): 60.4%
Average Correct Time: 18:39.

In case Topcoder's servers are down

Here is a summary of the problem:

You are givin a string s composed exclusively of lower-case letters.
Your job is to compute and return the number of non-overlapping, non-empty, identical substrings of s
Topcoder limits the size of s 50, but I'm going to update that to 400 to make it a little more challenging.

The examples

#  S                           Answer
0 "aa"                         1 - the only matching substrings are "a" and "a"
1 "abcd"                       0
2 "aba"                        1 - again, "a" and "a" are the only matching substrings.
3 "abab"                       3 - "a" & "a", "b" & "b", "ab" & "ab".
4 "aaaab"                      7 - 6 different pairs of "a", one pair of "aa".
5 "onetwothreeonetwothree"     86

Testing yourself

Like the Cryptography Problem, I have a shell script tests.sh, that you can use to test your program. When you run tests.sh, your answer should be identical to answers.txt

Hints

Let's evaluate the big-O of a brain-dead answer:

Enumerate all substrings x of s. (O(n²)).
For each of these, enumerate all substrings that start after x and that are the same size as x. That's O(n).
For each of these, compare the substring to x, and increment a counter if they match. This is O(n) as well, although when the strings don't match, the comparison can exit early.

That's looking like a whopping O(n⁴), but with n capped at 50, it's fine. Even with n at 400, it finishes within the Topcoder limits when compiler optimization is used.

You can make this a little faster if, instead of generating substrings and comparing them, you simply use the C library routine strncmp(). Now, you don't need to make copies of any substrings -- it compares them all in place.

A final way of making it faster is call s.find(s1, k), where s1 is the first substring, and k is the index for the second substring. If it returns string::npos, then you don't have to look any more. Otherwise, you increment the total, and set k to be one after the return value.

I've programmed all of these, and their timings on tests.sh are as follows:

Simplest approach: generating and comparing substrings: 15.481 seconds.
Using the find() method to find the first substring in s: 1.107 seconds.
Using strncmp(): 3.284 seconds

This would be a great application of the Rabin-Karp algorithm. Someday, I'll hack it up here and see what the speedup is. Until then, it would be good practice for you!