Sliding window question -- how to optimize the loop usage

john316 · March 31, 2022, 5:41am

This is a standard question of finding the longest substring without repeating characters. My code does work, but I have two while loops of which I am not sure any one can be further simplified:

class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        n = len(s)
        if n == 0:
            return 0
        curr_str = s[0]
        max_str = s[0]
        max_len = 1
        curr_set = set(s[0])
        i = 0
        j = 1
        
        while j < n:
            while j < n and s[j] not in curr_set:
                curr_str = s[i:j+1]
                curr_set.add(s[j])
                j += 1
            if len(curr_str) > max_len:
                max_len = len(curr_str)
            curr_set.remove(s[i])
            i += 1
        
        return max_len

I have while j < n in the first while loop and also in the inner loop. I could not see a way to simplify the two while j < n statements. It seems that both need to exist. Is it common for such sliding window questions? Is there any good way to simplify based on the current code? Thank you!!

PS: I know this class Solution is an empty, useless class that does nothing, but it is the way Lxxxcode does it, and I just copy my solution. I will certainly not have this class if I code for myself.

john316 · March 31, 2022, 7:36am

I redo it. I think it would be easier.

class Solution:
    def lengthOfLongestSubstring(self, s: str) -> int:
        n = len(s)
        if n == 0 or n ==1:
            return n
        set1 = set(s[0])
        i = 0  # i: left; j: right 
        max_len = 1  
        
        for j in range(1,n): 
            if s[j] not in set1:
                set1.add(s[j])
                max_len = max(max_len, j+1-i)
            else:  
                while s[i] != s[j]:
                    set1.remove(s[i])
                    i += 1
                i += 1
        
        return max_len

steven.daprano · March 31, 2022, 8:55am

It might be a standard question, but not standard enough for me to know what you mean by “longest substring without repeating characters”.

Can you give an example or two of input and expected output?

Or link to a discussion where the expected behaviour is described?

john316 · March 31, 2022, 1:34pm

Sorry. This is the question:

Given a string s, find the length of the longest substring without repeating characters.

Example 1:

Input: s = “abcabcbb”
Output: 3
Explanation: The answer is “abc”, with the length of 3.
Example 2:

Input: s = “bbbbb”
Output: 1
Explanation: The answer is “b”, with the length of 1.
Example 3:

Input: s = “pwwkew”
Output: 3
Explanation: The answer is “wke”, with the length of 3.
Notice that the answer must be a substring, “pwke” is a subsequence and not a substring.

CAM-Gerlach · April 1, 2022, 5:07am

When asking about “optimization”, its important to specify what you’re optimizing for—execution time? Memory usage? Fewest characters/lines? Simplest/cleanest overall code?

In any case, your second function is definitely cleaner, but it still can be simplified a good deal further, which both makes it shorter and easier to understand, and also improves its performance. You can simply iterate over each char in the string, check if each is in your running set, set max_len to the max of the set's len and the current max_len and clear the set if so, and then add the new character to the set.

I.e.

def get_len_longest_unique_substring(the_string):
    max_len = 0
    unique_chars = set()
    for char in the_string:
        if char in unique_chars:
            max_len = max(len(unique_chars), max_len)
            unique_chars.clear()
        unique_chars |= {char}
    return max(len(unique_chars), max_len)

You can optimize for some specific cases, like len(set(the_string)) == 1 (all characters are the same, so return 1, i.e. your second case), or len(set(the_string)) == len(the_string) (all characters are unique, so return len(the_string) to perform better on shorter strings, or returning immediately if max_len == len(set(the_string)) to do better on longer ones, at the cost of code complexity and possibly slower performance on the general case, and there’s probably some low-level trickery you can do to speed things up, at the cost of making your code less readable, but all of that is almost certainly an over-optimization unless you are dealing with on the order of millions of strings, as it will never pay back the development time it cost.

Comparing code length (as a proxy for complexity) with your solution, it weighs in at 8 lines, relative to 21 in your first solution and 18 in your second. Regarding execution time, I ran %timeit (via IPython) on all three:

>>> %timeit lengthOfLongestSubstring("abcabcbb")
16.9 µs ± 697 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit lengthOfLongestSubstring("abcabcbb")
8.47 µs ± 403 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit get_len_longest_unique_substring("abcabcbb")
6.07 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can see that your second iteration (pun not intended) is twice as fast due to the lack of a nested loop, but the version above does 50% faster still due to the lower complexity.

On a slightly longer, all-unique string:

>>> %timeit lengthOfLongestSubstring(string.ascii_lowercase)
31.7 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit lengthOfLongestSubstring2(string.ascii_lowercase)
32 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>> %timeit longest_unique_substring(string.ascii_lowercase)
11.6 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Both of your methods take essentially the same time, since the nested loop doesn’t come into effect, while the simpler method above is nearly three times faster.

On a very long string with many repeating segments, on the other hand:

>>> %timeit lengthOfLongestSubstring(string.ascii_lowercase * 1000)
57.6 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit lengthOfLongestSubstring2(string.ascii_lowercase * 1000)
15.4 ms ± 580 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit longest_unique_substring(string.ascii_lowercase * 1000)
10.7 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can that your first function is nearly four times lower than your second due to the nested loop, while the simpler above code is around 2/3rds faster.

The moral of the story, though, is that unless you’re dealing with huge data, lots of data, working on a very constrained system or have very limited response time, the most important thing is to first write code that is easy to read, understand and modify, as that will likely save far more net time in the long run than shaving a few microseconds off the run time time.