Leetcode.com problem 336: "Palindrome Pairs"

James S. Plank

Tue Sep 20 01:16:57 EDT 2022

Problem Statement.
Success Rate (as of September, 2022): 35.3%
This is a hard problem.
ex1.txt -- Their example 1.
ex2.txt -- Their example 2.
ex3.txt -- Their example 3.
ex-nasty.txt -- My nasty example.
ex-theirs.txt -- Their nasty example.
A skeleton with my main() that you can use to test your answer.
A solution that uses hashing.

In case Leetcode's servers are down

You are to implement the following method:

class Solution {
  public:
    vector <vector <int> > palindromePairs(vector<string>& words);
};

The vector words is composed of lowercase letters.
The words in words are unique.
There are at most 5,000 elements in words.
Each element in words is at most 300 characters, and may be the empty string.
You need to return a vector of pairs of indices. Each of these pairs should be indices of distinct elements of words, which, when concatenated, make a palindrome. The return value needs to have all of these pairs, in any order.

The Driver Program, and the Examples

The driver program lets you enter words on standard input. It simply reads with cin, so it doesn't matter if each word is on its own line. The word "-" will be turned into the empty string.

You can give it a single command line argument. If that argument has an 's' character, then it will print the answer with a very slow algorithm before it runs yours. That lets you test your answers, at least for small inputs.

If the argument has a 'p', then it will print your answers.

If the argument has a 't', then it will show the timing of your solution.

So here are the first three examples, solved with the "slow" solution:

UNIX> g++ skeleton.cpp 
UNIX> cat ex1.txt
abcd dcba lls s sssll
UNIX> ./a.out s < ex1.txt
0 1             # abcddcba
1 0             # dcbaabcd
2 4             # llsssll 
3 2             # slls      

UNIX> cat ex2.txt
bat tab cat
UNIX> ./a.out s < ex2.txt
0 1
1 0

UNIX> cat ex3.txt
a -
UNIX> ./a.out s < ex3.txt
0 1
1 0

UNIX>

Hints

This problem vexed me, and it's not one of my favorite Leetcode problems, because it's hard to evaluate the proper speed and memory usage of your program. Basically, you submit and pray. Still, it's a nice program to think about in terms of solutions, their speed and their memory usage.

The Slow Solution

The direct solution -- where you concatenate each pair of words and then test to see if the result is a palindrome -- is easy to write, but it's too slow. Here it is:

vector <vector <int> > Solution::slow(vector<string>& words)
{
  int i, j, k;
  string w;
  vector <vector <int> > rv;
  vector <int> p;
  bool ok;

  p.resize(2);

  for (i = 0; i < words.size(); i++) {            // Enumerate all first words.
    for (j = 0; j < words.size(); j++) {          // Enumerate all second words.
      if (j != i) {
        w = words[i] + words[j];                     // Concatenate them.
        ok = true;
        for (k = 0; ok && k < w.size()/2; k++) {  // Test to see if the result is a palindrome.
          ok = (w[k] == w[w.size()-k-1]);
        }
        if (ok) { p[0] = i; p[1] = j; rv.push_back(p); }
      }
    }
  }

  return rv;
}

Why is this too slow? Well, let's name the variables for the analysis:

n is the number of elements in words (max 5000).
m is the maximum size of a string in words (max 300).

The enumeration is O(n²), and the size of the word is O(m). The palindrome test is therefore O(m), which makes this program O(n²m). Plugging in those maximum values, that's 5000*5000*300, which is 7,500,000,000. You only get 10⁷ with Leetcode, so that's way too slow.

What makes this ominous

The total number of characters in words is 5000*300, which is 1,500,000. We're already close to 10⁷ with that, so we can't get much slower.

The Fundamental Observation

Suppose you have two words, a and b, and b's size is greater than or equal to a's size. Then, ab is a palindrome iff b is equal to pa^rev, where p is a palindrome and a^rev is the reverse of a.

That's not a difficult observation to come up with, but turning it into a solution is more difficult.

The Trie Data Structure

Wikipedia's page on the Trie

A data structure that we don't teach in our "Data Structures and Algorithms" courses is a "Trie". It's useful when it's useful. And it's not when it's not. So let's define it.

A trie is a multiway tree. Each node represents a character in a string, and its links to children represent subsequent characters in the string. I know that sounds pretty obtuse, so here's the trie for Example 1 above. I've put stars into nodes that end words.

The links are stored as an array of pointers with 26 entries, one for each character. Therefore, when you're using the trie to find a word, you move on from each node in O(1) time. Here's my node definition (I don't represent that a node ends a word here).

struct t_node {
  char c;                // You actually don't need c or the parent.
  char parent;           // However, they are useful for debugging
  bool eow;              // True if this node ends a word (starred nodes above)
  vector <struct t_node *> children;
};

How can we use the Trie? There are a bunch of ways. One way is the following -- for each node in the trie, maintain a vector indices. Process every word, and for every prefix of c characters in the word, if the resulting suffix is a palindrome, then put the word's index into the trie at character c.

I know that's a mouthful, so let's take a look at the trie for our example above.

Consider the string "abcd". When the prefix of this string is "a", then its suffix is "bcd", which is not a palindrome. That's why a's indices vector is empty. However, when the prefix is "abc", then the suffix is "d", which is a palindrome. That's why c's indices vector contains the value 0 ("abcd" is words[0]).

One we have this trie built, we can traverse the words, and process them from back to front. When you get to the end of a word, if you are at a node in the trie whose indices vector is non-empty, then you add answers to the return value.

Let's look at some examples with the trie above:

Look at "abcd", which is words[0]. You find it in the trie in reverse order: d-c-b-a. You look at that node, and indices equals { 1 }. That means you can put [0,1], which corresponds to "abcddcba" into the return value.
Look at "dcba", which is words[1]. You find it in the trie in reverse order: a-b-c-d. You look at that node, and indices equals { 0 }. That means you can put [1,0], which corresponds to "dbcaabcd" into the return value.
Look at "lls", which is words[2]. You look for it in the trie in reverse order: s-l-l, which is not in the trie. So you move on.
Look at "s", which is words[3]. You look for it in the trie in reverse order: s, which is in the tree with indices equal to { 3 }. You have to use distinct words from words, so that's not a valid answer.
Finally, look at "ssll", which is words[4]. You look for it in the trie in reverse order: l-l-s-s. That's not in the trie, so you're done.

At this point, we've found two of the four answers: [0,1] and [1,0]. To find the other two, we can build a trie for the words in reverse order. Here's that trie:

And let's go over processing the words, where we process each word from front to back. In case you're lost,

Look at "abcd", which is words[0]. We want to find words that end with "dcba", so we process "abcd" in forward order: a-b-c-d. The node there has indices with a value of 1, which corresponds to [0,1]. That's already in the return value so we ignore it (basically, in this second pass, we ignore matches that are the same size, like "abcd" and "dcba").
Look at "dcba", which is words[1]. This works like "abcd" - we find a match but ignore it.
Look at "lls", which is words[2]. You look for it in the trie in forward order: l-l-s, which is in the trie and indices is 4. This corresponds to "llssssll" -- we add [2,4] to the return vector.
Look at "s", which is words[3]. You look for it in the trie, and find it with indices = { 2,3 }. This gives us [3,2] to add to the return value (we ignore [3,3]).
Finally, look at "sssll", which is words[4]. You don't find it in the trie, and you're done. The final return value is correct: { {0,1}, {1,0}, {2,4}, {3,2} }.

If you go through the trouble of programming this up, it's too slow. It took around 4 seconds on the big inputs, and that felt too slow.

Instead of storing prefixes and suffixes, just store the words

For my second try, I created the trie, but only stored the indices of the completed words. Here are the forward and reverse trie's for example 1:

Now, for each trie, you traverse all of the words, and search the trie (either forward-to-backward or backward-to-forward). At each character, if the node is an ending node, then you test the word to see if its prefix/suffix is a palindrome, and if so, you have a pair of words for the return value. This should require both less storage and less palindrome checking.

Let's use words[4], "sssll", as an example. We'll traverse the first trie, and when we do so, we need to traverse "sssll" backward: l-l-s-s-s.

We find the first "l" node, but it isn't storing an index, so it's not the ending character of a string. We ignore it.
Same with the second "l" node.
The "s" node is and ending character, so we see that we can break "sssll" into "ss" and "sll". The "sll" matches the trie, so we need to test whether "ss" is a palindrome. It is, so "llssssll" is a palendrome, and we can add [2,4] to the return value.

This was indeed faster than my previous program, but when I submitted it, it exceeded the memory limit! It did so again, even when I worked in two passes, deleting the first trie completely before creating the second one. Dang it!

A combined approach with one trie

I started looking at solutions at this point, and one that purported to work used a mixed solution: It used just one trie and kept both pieces of information in the trie:

Then, it did one traversal of words, and basically mixed my previous solutions. I don't have time to go over it, so I won't, but it was indeed both fast and small enough to pass the test.

A solution involving hashing

You can use hashing, and forego the trie. I do that in hash_solution.cpp. Instead of creating a trie, I hash each word and store the hash into an unordered map. Then I go through each word and incrementally create the hash of its characters in reverse. I check the map, and if the hash is there, and the prefix is a palindrome, then it's a match. I then do the same thing in reverse.

For the hash function, I use my favorite -- the "djb_hash" function from COSC202, but I store it into an unsigned long long rather than an int. That reduces my chance of collisions:

unsigned long long djb_hash(const string &s)
{
  size_t i;
  unsigned long long h;

  h = 0x1505;

  for (i = 0; i < s.size(); i++) {
    h = (h << 5) + h + s[i];
  }
  return h;
}

Here's the code that processes the words from back to front:

  // For each word, go from back to front, and look up the hash
  // of the suffix in the map.  If it's there, and the prefix
  // is a palindrome, then you have a match!

  for (i = 0; i < words.size(); i++) {
    string &s = words[i];
    h = 0x1505;
    for (j = s.size()-1; j >= -1; j--) {
      mit = m.find(h);
      if (mit != m.end() && mit->second != i && ispal(s, 0, j)) {
        rv.resize(rvs+1);
        rv[rvs].push_back(mit->second);
        rv[rvs].push_back(i);
        rvs++;
      }
      if (j >= 0) h = (h << 5) + h + s[j];
    }
  }

If you need an example, let's use the one above, with "abcd", "dcba", "s", "lls" and "sssll".

0x000000017c93ee4f is the hash of abcd
0x000000017c95978f is the hash of dcba
0x000000000b888cb0 is the hash of lls
0x000000000002b618 is the hash of s
0x0000003110610936 is the hash of sssll

Here's the pass where we process the strings backward.

Considering abcd
0x0000000000001505 is the hash of "" backward
0x000000000002b609 is the hash of "d" backward
0x000000000059778c is the hash of "cd" backward
0x000000000b88696e is the hash of "bcd" backward
0x000000017c95978f is the hash of "abcd" backward
Found a match with dcba
Considering dcba
0x0000000000001505 is the hash of "" backward
0x000000000002b606 is the hash of "a" backward
0x0000000000597728 is the hash of "ba" backward
0x000000000b885c8b is the hash of "cba" backward
0x000000017c93ee4f is the hash of "dcba" backward
Found a match with abcd
Considering lls
0x0000000000001505 is the hash of "" backward
0x000000000002b618 is the hash of "s" backward
Found a match with s
0x0000000000597984 is the hash of "ls" backward
0x000000000b88aa70 is the hash of "lls" backward
Considering s
0x0000000000001505 is the hash of "" backward
0x000000000002b618 is the hash of "s" backward
Considering sssll
0x0000000000001505 is the hash of "" backward
0x000000000002b611 is the hash of "l" backward
0x000000000059789d is the hash of "ll" backward
0x000000000b888cb0 is the hash of "sll" backward
Found a match with lls
0x000000017c9a2323 is the hash of "ssll" backward
0x000000310fde87f6 is the hash of "sssll" backward

This approach gets faster scores on Leetcode (no collisions -- yay!), although it's slower on my ex-nasty.txt file. Go figure. I think I've beaten this problem to death sufficiently.