String Search Overview, Coverage and API

dg-pb · June 26, 2024, 2:59am

The main idea is to cover the bases of string search. This is a more general topic for higher level discussion. Individual proposals have (will have) their own threads.

I decided to lay this out to get a better picture of the situation, get feedback, ideas and list what is being worked on.

1. String search is generally classified into 3 categories:

Single-pattern
- Linear Complexity: Boyer–Moore + Galil rule, Knuth–Morris–Pratt, Two-way
Finite set of patterns
- Linear Complexity: Aho-Corasick
Inifinite number of patterns
- Linear Complexity: <?>

Each category below is a superset of the above.

However, with higher flexibility comes higher computational cost.

1.1. Core set of string search operations and parameters:

find first
find many (count)
count (maxcount)
split (maxsplit, keepsep)
replace (count)

2. Current state of Python

Single pattern (str.method)
1. Algorithm: Horspool + Two-Way
2. Efficiency: Optimal algorithms, sub-optimal algorithm selection
3. Operations
  - find first: yes
  - find many: no
  - count: yes (maxcount: at C level, but not exposed in Python)
  - split: yes (maxplit: yes, keepsep: no)
  - replace: yes (count: yes)
4. Reversed: Only Horspool (No two-way)
  - find first: yes
  - find many: no
  - count: no
  - split: yes
  - replace: no
Finite set of patterns (<?>)
1. Algorithm: None
2. Efficiency: None
3. Operations: None
4. Reversed: None
Intinite set of patterns (re)
1. Algorithm: recursive backtracking?
2. Efficiency: “When compiled as a Deterministic finite automaton, regular expressions can be checked in linear time”
3. Operations
  - find first: yes
  - find many: yes (count: yes)
  - count: yes (maxcount: yes)
  - split: yes (maxsplit: yes, keepsep: via groups)
  - replace: yes (count: yes)
4. Reversed: Supported by 3rd party regex module

Functionality laid out above pretty much covers the bases:

Single-pattern and re are most important. Finite set is secondary, but would nevertheless be nice to have. Currently “Finite set of patterns” searches are being handled by “Infinite set of patterns” search algorithms. Finite set of patterns implementation can improve performance for many problems, where re is overkill. They would also be simpler and more convenient to use being a method of str.
reversed search is used much less frequently
not implemented single-pattern functionality extensions are potential targets for improvement. e.g. github: /\.count\(.*\).*[><=]/ shows many use cases that could take advantage of short-circuiting.
re module (from this perspective) is covered, except for reverse search. I am sure there are other missing features compared to other regular expression implementations, but this IMO is the main one.

2. Current Work that is being done

2.1. https://github.com/python/cpython/pull/120025

Improves Horspool Find (default) algorithm
Eliminates hard-coded step-changes in algorithm selection
Implements direction agnostic routines. So that reversed search is done using same code

It is in a final state. Would be great if someone could have a look at it. It has been reviewed, but coredev review is still needed.

2.2. Support tuples for `[r]find()` & `[r]index()` by nineteendo · Pull Request #2 · nineteendo/cpython · GitHub

This is the first step for finite set of patterns category.

The concept is sound and works well in practice. The algorithm itself is finished and tested.
Although it is not theoretically optimal, but in practice it would be optimal solution for up to 20 patterns. At around 20 patterns re starts outperforming in cases where it performs exceptionally well. Sometimes re is still slower at 40 patterns.
Having initial implementation of this would open up the path for other convenient and efficient methods where re is overkill. E.g. str.find(multiple) can be implemented sensibly and easily once str.split(multiple) is available.

If this was to be merged, I would propose extensions to str.split in due time. Namely keepsep parameter first, allowing multiple-pattern split later.

2.3. Many speed improvments in GitHub · Where software is built

3. Also

I have little to none experience with re so if anyone wants to fill in let me know.
If anyone has any comments on what is needed, what is not, any requests, observations, willingness to volunteer - this is a good place for that.

devdanzin · June 26, 2024, 9:14am

Maybe it’s worth discussing O(1) string comparisons · Issue #486 · faster-cpython/ideas · GitHub again, given the comment that " dicts use the function unicode_eq, whereas something like the COMPARE_OP_STR_JUMP opcode uses unicode_compare_eq". A benchmark or two would be helpful to assess how interesting that change would be.

Thank you for the great summary and analysis, and even more for all the work you’ve been doing on PRs.

methane · June 26, 2024, 10:21am

I don’t think it is relating to this thread.

unicode_eq is not O(1). It doesn’t compare hashes. It is nearly equal to unicode_compare_eq.
We may be able to remove unicode_eq and use unicode_compare_eq if it is inline function.

devdanzin · June 26, 2024, 10:27am

AFAICT the idea would be to add hash comparison to unicode_compare_eq so it becomes O(1). Would that make sense?

methane · June 26, 2024, 10:38am

Only if it shows improvements.

comparison in dict/set and COMPARE_OP_STR are very different. So dict/set uses hash doesn’t justify using it in generic compare.

kknechtel · June 26, 2024, 12:51pm

Should this be in Ideas ? There doesn’t seem to be a concrete proposal at all… ?

dg-pb · June 26, 2024, 1:11pm

I am open to suggestions where this should be moved to.