maximum number of non-overlapping substrings

24 September 2024
Leetcode,
RC

This was todays daily for the #AlgoBot Daily Question channel in the RC zulip.

I may choose a different channel in the future, there are a few and nobody did this one.

problem #

Given a string s of lowercase letters, you need to find the maximum number of non-empty substrings of s that meet the following conditions:

1. The substrings do not overlap, that is for any two substrings s[i..j] and s[x..y], either j < x or i > y is true.

2. A substring that contains a certain character c must also contain all occurrences of c.

Find the maximum number of substrings that meet the above conditions. If there are multiple solutions with the same number of substrings, return the one with minimum total length. It can be shown that there exists a unique solution of minimum total length.

Notice that you can return the substrings in any order.

Input: s = "adefaddaccc"
Output: ["e","f","ccc"]

Input: s = "abbaccd"
Output: ["d","bb","cc"]

original solution #

map every char in the string to all its indexes
- "abbbcc" -> {a: [0], b: [1, 2, 3], c: [4, 5]}
turn the indexes into ranges
- drop all indexes but the start and end
- characters with a single index are expanded to the same index twice
sort the ranges by their end
starting with a negative ending time as our "last ending":
- take the next range that doesn't overlap our last ending time
- intuition:
  - choosing the interval that ends earliest leaves as much remaining time as possible

# stinky!
from collections import OrderedDict


class Solution:
    def maxNumOfSubstrings(self, s: str) -> List[str]:
        res = []

        # keys are chars, values are a list of all indexes
        ranges = {}

        # populate indexes
        for i, char in enumerate(s):
            if char not in ranges:
                ranges[char] = []

            ranges[char].append(i)

        for char in ranges.keys():
            # make sure every range has two indexes
            if len(ranges[char]) == 1:
                ind = ranges[char][0]
                ranges[char] = [ind, ind]

            # drop everything but the first and last index
            elif len(ranges[char]) >= 2:
                ranges[char] = [ranges[char][0], ranges[char][-1]]

        # sort all the intervals by the end of their range
        intervals = OrderedDict(sorted(ranges.items(), key=lambda x: x[1][-1]))

        # Greedily select non-overlapping substrings
        last_end = -1
        for char, (start, end) in intervals.items():
            if start > last_end:
                res.append(s[start : end + 1])
                last_end = end

        return res

This "solution" passed the two example test cases, but only passes 81/284 of the real test cases. :(

Input: s = "abab"
Output: ["aba"]
Expected Output: ["abab"]

The first problem is that I only count runs that start and end with the same letter.

I also failed to think about the second constraint.

2. A substring that contains a certain character c must also contain all occurrences of c.

My solution includes one b and drops another.

fixed solution #

map every char in the string to its start and end index
- "ababbcc" -> {a: [0, 2], b: [1, 4], c: [5, 6]}
- str.rindex() to the rescue
sort the ranges by their end
looking at each chars range:
- find every different character in that range
- expand your start and end if those other character's ranges are wider
- "ababbcc" -> {a: [0, 4], b: [0, 4], c: [5, 6]}
  - notice how a and b are now intermingled, and their ranges have changed
starting with a negative ending time as our "last ending":
- take the next range that doesn't overlap our last ending time
- intuition:
  - choosing the interval that ends earliest leaves as much remaining time as possible

Also this can be a ...lot more complicated

class Solution:
    def maxNumOfSubstrings(self, s: str) -> List[str]:
        # keys are chars, values are first and last index (inclusive)
        ranges = {}
        for char in set(s):
            ranges[char] = [s.index(char), s.rindex(char)]

        # we need to expand ranges to include the index of any other characters
        expanded_ranges = []
        for char in set(s):
            orig_start = ranges[char][0]
            orig_end = ranges[char][1]

            # bounds haven't really changed, but we need to loop at least once
            bounds_changed = True
            while bounds_changed:
                bounds_changed = False

                # note the + 1: python slices are upper bound exclusive
                chars_in_range = set(s[orig_start : orig_end + 1])

                for char in chars_in_range:
                    # choose the earliest range start
                    new_start = min(orig_start, ranges[char][0])
                    # choose the latest range end
                    new_end = max(orig_end, ranges[char][1])

                    # if either change, keep bounds_changed true and keep checking
                    if new_start != orig_start or new_end != orig_end:
                        orig_start, orig_end = new_start, new_end
                        bounds_changed = True

            # note the + 1: see above
            expanded_ranges.append([orig_start, orig_end + 1])

        # greedily select non-overlapping substrings
        expanded_ranges.sort(key=lambda x: x[1])
        res = []
        last_end_index = -1
        for start, end in expanded_ranges:
            if start >= last_end_index:
                res.append(s[start:end])
                last_end_index = end
        return res

notes #

python -c "help()"
- spawns a repl that shows you the docs in less for any keyword you enter
- the output is the python standard library docs
- my python LSP doesn't show docs for the std lib, so this lets me skip a browser
str.rindex()
- i knew about list.index(), but only strings support the rightmost index
- apparently "nobody uses this for lists" from the BDFL himself
dict.keys() vs dict.vals() vs dict.items()
- keys returns keys
- vals returns vals
- items returns tuples of (key, val)
- straightforward, the footgun is that for _ in dict iterates over keys, not items
this is a standard dynamic programming problem called "interview scheduling"
- some slides here
- the greedy part at the end is the standard problem statement
- the twist in this leetcode was the expansion of the ranges, i.e. the part I ignored :^)
python slices are upper bound exclusive
- foo = [1, 2, 3] -> foo[0:1] == [1]
- common cause of off by 1 errors

Previous: the hub
Next: sum of prefix scores of strings