Some Pattern Annoyance using Constants vs. Literals

Hi folks,

I just used Structural Pattern Matching quite intensively and I’m
pretty amazed of the new possibilities.

But see this code, trying to implement Mark Pilgrim’s regex
algorithm for roman literals with SPM:

With constants, I can write

    match seq:
        case "M", "M", "M", "M", *r:
            return 4 * 1000, r

But if I want to use abbreviations by assignment, this is no longer
possible, and I have to write something weird like:

    M = "M" # should be deduced as equivalent?
    match seq:
        case M, a, b, c, *r if M == a == b == c:
            return 4 * 1000, r

So what is missing seems to be a notion of const-ness, which
could be dynamically deduced. Maybe this is a reason to introduce
a real constant which is treated like a literal?

2 Likes

You are aware that some_object.some_attribute is treated as a constant pattern, right?

No, I’m pretty new to this since as a PySide core developer, we still are using Python 3.8 and I learned about this just 2 days ago. Where can I find that? It is not in PEP-0634

It is described in PEP 634 under PEP 634 – Structural Pattern Matching: Specification | peps.python.org. Or the tutorial PEP 636 under “matching against constants and enums”.

1 Like

It is, see PEP 634 – Structural Pattern Matching: Specification | peps.python.org

See also

1 Like

Thanks guys! I did not realize that for using constants I have to use Classes :smiley:
I never looked into using classes since I wanted strings, only. Now it becomes clear.
Still not too obvious.

Ah, I just tried this, using attributes instead. This is quite a bit slower :frowning:

1 Like

types.SimpleNamespace is a rather cheap alternative to creating a class in case you just want a one-off string.

1 Like

Tried this:

M = SimpleNamespace(M = "M")

is even slower than

class M:
    M = "M"

which is sad.

To make an optimum comparison between my version and Pilgrim’s (much shorter)
regex implementation, I probably have to generate the matches explicitly with
constants. Not important, but outperforming regex would have been cool.

1 Like

How did you measure it?

This is the result for me:

$ python
Python 3.11.4 (main, Jun  7 2023, 00:00:00) [GCC 13.1.1 20230511 (Red Hat 13.1.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from timeit import timeit
>>> from types import SimpleNamespace
>>> timeit("M = SimpleNamespace(M='M')", globals=globals())
0.14027663000160828
>>> timeit("""
... class M:
...     M = "M"
... """)
6.152667925998685
1 Like

:smiley: No, that’s not the point. I was timing my resulting Roman to integer
function, which is using SPM instead of Mark Pilgrim who used regexen.
When using classes, it was slower dan using variables and certain if-constraints.

When building the classes natively, the resulting code was faster than using simpleNamespace.

I will post that program here with benchmarks when it is ready.

1 Like

This is the updated version that generates all possible constants in advance.
By chance, I found an even faster correct algorithm without using any patterns :smiley:
“”"
pilgrim_spm.py

(C) Christian Tismer-Sperling 2023-08-02

The Roman Numerals Converter as Structural Patterm Matching

This is a proof of concept, to learn about SPM and to see how far we get.

The idea:

The Pilgrim algorithm is not the fastest, but completely correct, because
it can never admit a Roman Numeral that is not allowed.

Instead of implementing another sub-optimal algorithm, this implementation
uses the identical idea as the regex based Pilgrim solution, BUT it models
the pattern matching using SPM.

How it works:

In the implementation, you can see the original regex code as comment.
Below that is the equivalent implementation as SPM.

If you look at match_four, this is easiest to understand. It shows how
we map the regex to equivalent SPM. In match_group1, this function is
then turned into a parameterless function using functools.partial .

The match_digit function is a bit more complicated because there are more
alternatives. The variable names have been choosen to be a little reminding
of the Roman characters. As an additional complication, we cannot use a
variable multiple times in a “case” clause, so we need to fix that with
guard-if constructs.

We could of course also have used character constants, directly and do all
calculations inline. But that would have grown the code size even bigger.
The elegance of this approach is that we can use the same function over
and over with different constants, see match_seq.

Things to try

This attempt was almost as fast as the Pilgrim solution.
If we really have too much time, we might write instead a generator for
all expressions, making them all constants. That might give some speedup
but also more code bloat.

2023-08-02: Tried this, after using constants everywhere a little faster
than Pilgrim’s algorithm, at least on an M1 Mac Studio.

The fastest version

The fastest version is now implemented very differently:

  • generate a fast result using from_roman_numeral which might be erroneous
  • convert it back with roman.toRoman (very fast)
  • if the created string is the input string, it is correct.

“”"

import re
import roman
from functools import partial
from textwrap import dedent

__all__ = ["from_roman_spm", "RomanError", "InvalidRomanNumeralError"]

class RomanError(Exception):
    pass


class InvalidRomanNumeralError(RomanError):
    pass


# This is copied from M. Pilgrim for reference.
romanNumeralPattern = re.compile("""
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """, re.VERBOSE)


value_map = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def match_four(*seq):
    # M{0,4}
    match seq:
        case "M", "M", "M", "M", *rest:
            return 4 * value_map["M"], rest
        case "M", "M", "M", *rest:
            return 3 * value_map["M"], rest
        case "M", "M", *rest:
            return 2 * value_map["M"], rest
        case "M", *rest:
            return value_map["M"], rest
    return 0, seq

# Old version, slightly slower.
def match_digit(C, M, D, *seq):
    # (CM|CD|D?C{0,3})
    match seq:
        # CM | CD
        case xC, xM, *r if xC == C and xM == M:
            return value_map[M] - value_map[C], r
        case xC, xD, *r if xC == C and xD == D:
            return value_map[D] - value_map[C], r
        # D C{0..3}
        case xD, xC, a, b, *r if xD == D and C == xC == a == b:
            return value_map[D] + value_map[C] * 3, r
        case xD, xC, a, *r if xD == D and xC == C == a:
            return value_map[D] + value_map[C] * 2, r
        case xD, xC, *r if xD == D and xC == C:
            return value_map[D] + value_map[C], r
        case xD, *r if xD == D:
            return value_map[D], r
        # C{0..3}
        case xC, a, b, *r if xC == C == a == b:
            return value_map[C] * 3, r
        case xC, a, *r if xC == C == a:
            return value_map[C] * 2, r
        case xC, *r if xC == C:
            return value_map[C], r
    return 0, seq

# New version, faster after I generated everything..
def gen_match_digit(name, C, M, D):
    s = dedent(f"""
        def {name}(*seq):
            # (CM|CD|D?C{{0,3}})
            match seq:
                # CM | CD
                case "{C}", "{M}", *r:
                    return {value_map[M] - value_map[C]}, r
                case "{C}", "{D}", *r:
                    return {value_map[D] - value_map[C]}, r
                # D C{{0..3}}
                case "{D}", "{C}", "{C}", "{C}", *r:
                    return {value_map[D] + value_map[C] * 3}, r
                case "{D}", "{C}", "{C}", *r:
                    return {value_map[D] + value_map[C] * 2}, r
                case "{D}", "{C}", *r:
                    return {value_map[D] + value_map[C]}, r
                case "{D}", *r:
                    return value_map["{D}"], r
                # C{{0..3}}
                case "{C}", "{C}", "{C}", *r:
                    return {value_map[C] * 3}, r
                case "{C}", "{C}", *r:
                    return {value_map[C] * 2}, r
                case "{C}", *r:
                    return {value_map[C]}, r
            return 0, seq
    """)
    exec(s, globals())

match_group1 = match_four
# match_group2 = partial(match_digit, "C", "M", "D")
# match_group3 = partial(match_digit, "X", "C", "L")
# match_group4 = partial(match_digit, "I", "X", "V")

# This version is slightlly faster, almost the same as Pilgrim's.
gen_match_digit("match_group2", "C", "M", "D")
gen_match_digit("match_group3", "X", "C", "L")
gen_match_digit("match_group4", "I", "X", "V")


def match_seq(*seq):
    res1, seq = match_group1(*seq)
    res2, seq = match_group2(*seq)
    res3, seq = match_group3(*seq)
    res4, seq = match_group4(*seq)
    if not seq:
        return res1 + res2 + res3 + res4


def from_roman_spm(s):
    # special case
    if s == 'N':
        return 0
    if not s:
        raise InvalidRomanNumeralError('Input can not be blank')
    res = match_seq(*s)
    if res == None:
        raise InvalidRomanNumeralError(f"Invalid Roman numeral: {s}")
    return res


# For comparison, one of the many fast but incomplete solution from the internet.
# https://bas.codes/posts/python-roman-numerals#converting-roman-numerals-to-numbers-in-python
def from_roman_numeral(numeral):
    value_map = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    value = 0
    last_digit_value = 0

    for roman_digit in numeral[::-1]:              # 1
        digit_value = value_map[roman_digit]

        if digit_value >= last_digit_value:        # 2
            value += digit_value
            last_digit_value = digit_value
        else:                                      # 3
            value -= digit_value

    return value


def from_roman_fastest(numeral):
    if numeral == 'N':
        return 0
    num = from_roman_numeral(numeral)
    cmp = roman.toRoman(num)
    if numeral != cmp:
        raise InvalidRomanNumeralError(f"Invalid Roman numeral: {numeral}")
    return num


if __name__ == "__main__":
    import argparse
    import sys

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--unittest", "-u", action="store_true")
    group.add_argument("--benchmark", "-b", action="store_true")
    options = parser.parse_args()

    if options.unittest:
        import unittest

        class TestCompat(unittest.TestCase):

            def test_all_correct(self):
                for num in range(5000):
                    s = roman.toRoman(num)
                    my_result = from_roman_spm(s)
                    orig_result = roman.fromRoman(s)
                    self.assertEqual(my_result, orig_result)

            def test_all_correct_fastest(self):
                for num in range(5000):
                    s = roman.toRoman(num)
                    my_result = from_roman_fastest(s)
                    orig_result = roman.fromRoman(s)
                    self.assertEqual(my_result, orig_result)

            def test_incomplete(self):
                ok_arg = "MCMLXXV"
                bad_arg = "MC" + ok_arg
                # Correct cases are correctly handled
                self.assertEqual(from_roman_spm(ok_arg), 1975)
                self.assertEqual(from_roman_numeral(ok_arg), 1975)
                self.assertEqual(from_roman_fastest(ok_arg), 1975)
                # The bad argument is correctly not recognized
                with self.assertRaises(RomanError):
                    from_roman_spm(bad_arg)
                with self.assertRaises(RomanError):
                    from_roman_fastest(bad_arg)
                # but the incomplete solution recognizes it
                self.assertEqual(from_roman_numeral(bad_arg), 2875)

        unittest.main(argv=sys.argv[:1], verbosity=2)

    if options.benchmark:

        from timeit import timeit
        from roman import fromRoman, toRoman

        fname = "from_roman_spm"
        t = timeit(fname + "('MCMLXXV')", globals=globals())
        print(f"A million times {fname!r:20}: {t:g} s")
        fname = "fromRoman"
        t = timeit(fname + "('MCMLXXV')", globals=globals())
        print(f"A million times {fname!r:20}: {t:g} s (Mark Pilgrim)")
        fname = "from_roman_fastest"
        t = timeit(fname + "('MCMLXXV')", globals=globals())
        print(f"A million times {fname!r:20}: {t:g} s (incomplete with Pilgrim)")
        fname = "toRoman"
        t = timeit(fname + "(1975)", globals=globals())
        print(f"A million times {fname!r:20}: {t:g} s (Mark Pilgrim)")
        fname = "from_roman_numeral"
        t = timeit(fname + "('MCMLXXV')", globals=globals())
        print(f"A million times {fname!r:20}: {t:g} s (incomplete)")
        sys.exit(0)

    parser.print_help()
1 Like

Note: The fast solution was only possible because the Roman <-> Integer mapping is bijective. I believe this was not the case when Mark Pilgrim developed his algorithm.

FWIW there is the Final type to indicate “const-ness” to type checkers, i.e. M: Final = "M"

but presumably that doesn’t work for pattern matching currently

2 Likes