DSL Operator – A different approach to DSLs

  1. compile time (macros) - this has no extra speed penalty, as stated in the PEP.

That sounds like “caching is useful” to me, relying on the interpreter’s pycache to do it. It would be nice if this just happened automatically for “pure” transformations (that is, ast.parse("1+1") could either be evaluated, or could do a lookup and see “calling this specific function with this specific value should return this specific value”). There are definitely less pure forms (e.g. reading the text from a file and then parsing, depending on the result of an environment variable, using another module’s state, …) which would be really cool to cache conditionally / “check these assertions”. Plain “this input → this output” with a manual cache is a repeated pattern (e.g. re.compile caches its results).

1 Like

Just to correct my statement for clarity: “It is stated in the PEP that compile-time macros have no runtime speed penalty for code which has already been compiled.” Which makes sense.

Performance Implications PEP 638

Caching could indeed be useful, especially when the AST needs to be constructed at runtime. It is less important when construction happens at compile time, but could still be of benefit.

But this would be a separate project on its own. ast nodes have attributes like lineno, so although I think there might be a way to speed things up via caching, it might not be as straightforward as the re.compile case.
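
To illustrate the pattern, here is a minimal sketch of a re.compile-style manual cache around ast.parse, with the lineno caveat noted above (cached_parse is an invented name):

import ast
from functools import lru_cache

# Minimal sketch of the "this input -> this output" manual-cache pattern,
# analogous to re.compile's internal cache. Because ast nodes are mutable
# (lineno and friends can be patched after the fact), a real cache would
# have to copy results or treat them as read-only.
@lru_cache(maxsize=256)
def cached_parse(source: str) -> ast.AST:
    return ast.parse(source, mode='eval')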

t-strings seem like a pretty good fit for CAS, as far as I can see.

Taking cas(t"{r} * (cos({theta}) + i*sin({theta}))") as an example, the Template.strings would be:

('', ' * (cos(', ') + i*sin(', '))')

You could then lex it into:

(REF, '*', '(', 'cos', '(', REF, ')', '+', 'i', '*', 'sin', '(', REF, ')', ')')

In reverse-Polish (if I do it right…):

(REF, REF, 'cos', APPLY, 'i', REF, 'sin', APPLY, MULT, ADD, MULT)

Now when I get my interpolations, I just need:

from functools import lru_cache

@lru_cache
def seq_to_rpn(strings):
    # lex `strings` and run the shunting-yard algorithm,
    # returning a tuple of RPN tokens
    ...

def apply(template):
    interp = iter(template.interpolations)

    stack = []
    for item in seq_to_rpn(template.strings):
        if item is REF:
            stack.append(next(interp))
        ...  # handle operator/function tokens by popping operands

Seems like it could be pretty quick to me, since the strings would be known at compile time for any template literals. The lexing and shunting algorithm could be done in a compiled extension, if more speed is needed.
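
To make that concrete, here is a minimal runnable version of the sketch. REF/APPLY/MULT/ADD are invented sentinels, and seq_to_rpn is hard-coded where a real implementation would lex and shunt Template.strings:

import math
from functools import lru_cache

REF, APPLY, MULT, ADD = (object() for _ in range(4))
FUNCS = {'cos': math.cos, 'sin': math.sin}

@lru_cache
def seq_to_rpn(strings):
    # Hard-coded for the example template; a real version would lex
    # `strings` and run the shunting-yard algorithm here.
    return (REF, REF, 'cos', APPLY, 1j, REF, 'sin', APPLY, MULT, ADD, MULT)

def apply_rpn(rpn, interpolations):
    interp = iter(interpolations)
    stack = []
    for item in rpn:
        if item is REF:
            stack.append(next(interp))      # substitute the next interpolation
        elif item is APPLY:
            func, arg = stack.pop(), stack.pop()
            stack.append(FUNCS[func](arg))  # e.g. cos(theta)
        elif item in (MULT, ADD):
            b, a = stack.pop(), stack.pop()
            stack.append(a * b if item is MULT else a + b)
        else:
            stack.append(item)              # literal token: 'cos', 1j, ...
    return stack.pop()

r, theta = 2.0, 0.5
strings = ('', ' * (cos(', ') + i*sin(', '))')
print(apply_rpn(seq_to_rpn(strings), (r, theta, theta)))
# ≈ (1.7552+0.9589j), i.e. r * (cos(theta) + i*sin(theta))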

Fair,
(The way r and theta are managed is missing in your examples, but whatever…)
${}.ast is semantically equivalent to a template string in the latter examples, only syntactically simpler. In the end, though, the ast attribute should be a tree of functions rather than a string.

I’ve been thinking about this thread, so I went back to the OP’s post and thought about what was going on.

I believe that the general problem is “how to write something in one’s code that is not to be interpreted as Python, but translated to and finally processed as Python”. Is this a definition of a DSL?

Consider the OP’s problem.

y = x**2 + 3*x - 5  # (1)

This looks like Python and, given x = 1, will evaluate to -1. However, we want it to be computed using different rules of arithmetic, e.g. modulo 11, where numbers are constrained to the range 0 to 10, or even a bitwise logic notation where ** is exclusive-or, * is and, + is or and - is inverted-or.

A Python way to do this would be to define a class inheriting from int and then overload the operators, as sketched below.
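
For instance, a minimal sketch of that approach for the modulo-11 case (Mod11 is an invented name, and only the operators used in (1) are covered):

class Mod11(int):
    # Arithmetic modulo 11: every result is reduced into the range 0..10.
    def __new__(cls, value):
        return super().__new__(cls, value % 11)
    def __add__(self, other): return Mod11(int(self) + int(other))
    def __sub__(self, other): return Mod11(int(self) - int(other))
    def __mul__(self, other): return Mod11(int(self) * int(other))
    def __pow__(self, other): return Mod11(int(self) ** int(other))
    __radd__, __rmul__ = __add__, __mul__

x = Mod11(1)
print(x**2 + 3*x - 5)  # 10, i.e. -1 mod 11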

However, the OP is looking for a solution that does not overload operators and evaluates the expression as function calls, so

y = cas.sub(cas.add(cas.pow(cas.name("x"), cas.constant(2)), cas.mul(cas.constant(3), cas.name("x"))), cas.constant(5)) # (2)

would be a solution, given that cas was a module that contained functions (sub, add, …) to compute all the operations as required. We then need a way to translate (1) to (2).

Using my macro tools, an example might look like this…

# coding: magic_macro
# above line invokes a preprocessor to deal with the macros.
# This is standard python using magic_codec (see pypi)
#
import cas
#
macro! cas
# extracts macro definitions from module cas
#
result = cas!$ x**2 + 3*x - 5

This would result in effectively the following code

import cas
result = cas.sub(cas.add(cas.pow(x, 2), cas.mul(3, x)), 5)

This has a compile-time overhead but no runtime delay over and above calculating the function. Variable interpolation and constant references are a compile-time issue. Nothing is currently cached, though the macro interpreter could be built to cache the macro parameters and result, to be reused when processing different parts of the code.

The module cas is the key. Firstly, it contains the functions sub, add, etc. Then it contains the definition for the macro cas:

def _cas(node, vargs, macro_list):
    # simplified
    return _translate(vargs[V.PARAMETERS])

plus some boilerplate to add _cas to the macro processor as cas

vargs[V.PARAMETERS] is simply a string containing all the text passed to the macro.

and finally a function _translate does the work, which has the signature
def _translate(a: str) -> ast.Node:

Note that _translate is executed at compile time, NOT run time. Calling either _translate or _cas at runtime would result in an exception being raised. A runtime version could be constructed using the eval built-in, i.e. result = cas.cas('{x}**2 + 3*{x} - 5'), stealing ideas from f-strings and PEP 750.

This scheme would work for any DSL; the problem is to write the _translate function, which is part of cas or an equivalent module, not a Python built-in or part of the standard library.

DSLs which look like Python are easy-ish to translate, as the DSL expression can be converted to an ast tree, which in turn is easy-ish to walk and convert to another form as required. A non-Pythonic DSL would need a bespoke parser to generate the ast tree.
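
As a rough illustration of such a walk, here is a toy translate that returns source text rather than an ast.Node for brevity, and handles only BinOp, Name and Constant nodes:

import ast

_OP_NAMES = {ast.Add: 'add', ast.Sub: 'sub', ast.Mult: 'mul', ast.Pow: 'pow'}

def translate(source: str) -> str:
    # Walk the ast of a pythonic DSL expression and rebuild it as cas.* calls.
    def walk(node):
        if isinstance(node, ast.BinOp):
            fname = _OP_NAMES[type(node.op)]
            return f'cas.{fname}({walk(node.left)}, {walk(node.right)})'
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        raise NotImplementedError(ast.dump(node))
    return walk(ast.parse(source, mode='eval').body)

print(translate('x**2 + 3*x - 5'))
# cas.sub(cas.add(cas.pow(x, 2), cas.mul(3, x)), 5)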

An alternative solution might be based on ideas from PEP 750 but this would constrain the syntax of the DSL and involve lots of {} pairs and explicit strings. One might even think of PEP 750 as a very specific DSL.

One could easily work with

answer = dsl!{ a[b] | c[d]}

What does a _translate function do? Remember that answer is the result returned by processing the DSL, not a reference to the code, parameters or calculation. These are only valid at compile time and are only accessible by the compiler, or in my case the macro processor, which is run first.

What about this?

answer = dsl!{ f(23)[a & ~b] | (a or True)[d & d]}

It’s all valid Python; it’s just that the original a, b, c, d are expressions.

2 Likes

I consider it a bonus that t'cos({x})' clearly distinguishes between x, which is a local variable, and cos, which is an identifier that the CAS needs to provide.

Anyway, I think PEP 750 is likely to be approved, while PEP 638 seems to me less likely, even if it were picked back up and submitted for a decision. So I think we’ll have a chance to explore the possibilities provided by t-strings well before macros are on the table.

1 Like

My current opinion is that, if the effort and size of the addition could be justified, macros would be best. They would be the most performant and most flexible solution.

They would also be able to just use ASTs, which are a complete syntax representation, without worries about performance.

And they would truly be able to cover a large portion of problems with solutions of high quality in terms of verbosity, convenience and performance.

I think that would lift a lot of weight in terms of demands for various language additions.


I think it is worth anticipating the possibility of adding a custom non-Pythonic DSL parser, but the default and the starting point should focus on a “Pythonic DSL”.

Reasons as per DSL Operator – A different approach to DSLs - #38 by dg-pb and also I am in agreement with The Zen of CherryPy · cherrypy/cherrypy Wiki · GitHub, which says: “Domain-specific Python is better than new DSL’s.”


I feel that any other runtime extension would be insufficiently flexible/performant to justify a new syntax addition, given that these would offer too little on top of t-strings.


However, I think that PEP 638 (and your implementation) might not be the ultimate form of it. It might be; I haven’t spent much time thinking about and looking at these yet. But if there are possible simplifications or elegant ways to make it more intuitive, I think it is worthwhile spending time exploring them.

In line with the well-known “Everything should be made as simple as possible, but not simpler”, I think the current state of PEP 638 has a bit of room for improvement.

r and theta are the variables that will be inside template.interpolations, and the REF sentinel was intended to identify when they should be substituted.

I’m not sure if you’re implying the result of cas(t'...') must be a string, but to be clear, I would expect the result to be:

cas(t"{r} * (cos({theta}) + i*sin({theta}))")

cas.mul(r, cas.add(cas.cos(theta), cas.mul(cas.i, cas.sin(theta))))

I’m here using cas along the lines of https://discuss.python.org/t/dsl-operator-a-different-approach-to-dsls/79874/18:

class cas:
    i = ...
    def mul(a, b): ...
    def add(a, b): ...
    def cos(x): ...
    def sin(x): ...
    ...

I think there is no need to talk hypotheticals/concepts here, as an actual working solution can easily be made using string.Formatter.

This is actually not the desired behavior: we want to be able to override cos as well, which is moreover a reference to a function. The other way around, you get a cheap, truncated version of metaprogramming… Only strings should be strings in the AST.

I was going to say that the references should be separated, but I realized that is paradoxical with what I said just before… Now I think I understand why the Julia operator is made this way, as said above:

Some people have been asking for practical examples of why one would want to embed a DSL in Python. One of my main motivations for thinking about this stuff is that I want a good way to write relational database queries in a Python program.

There are currently a couple of ways this is typically done. One is to put SQL queries in strings, which has all the usual problems of putting code inside strings that I won’t repeat here.

The other is to use an ORM such as Django’s or SQLAlchemy that hides the SQL altogether and tries to make the database look like it’s made of objects. This adds a very complicated layer of abstraction that for me seems like massive overkill most of the time.

What I’d really like to be able to do is simply write my query in Python. For example, the SQL query

SELECT * FROM Customers C WHERE C.CustomerGroup = 'CREDIT' AND C.Balance > C.CreditLimit

could be expressed in Python very naturally as

result = [c for c in Customers if c.CustomerGroup == 'CREDIT' and c.Balance > c.CreditLimit]

This could be made to work pretty easily, but it would require retrieving ALL the data from the Customers table, shipping it to the Python process and doing the filtering there.

With my proposed DSL operator, one could write

result = db$[c for c in Customers if c.CustomerGroup == 'CREDIT' and c.Balance > c.CreditLimit]

The db module could then construct an SQL statement and send it to the server, which has far more efficient ways of performing the query.

In general, the DSL operator is for situations where you want to write code in Python, but don’t want it executed as Python. I hope this example helps to explain why one might want to do that.
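
To illustrate what the db module would have to do, here is a hedged sketch; comp_to_sql and _emit are invented names, and a real implementation would need proper quoting, parameter binding and many more node types:

import ast

_OPS = {ast.Eq: '=', ast.Gt: '>', ast.Lt: '<'}

def _emit(node, alias):
    # Turn the comprehension's condition into SQL text (no escaping; sketch only).
    if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
        return ' AND '.join(_emit(v, alias) for v in node.values)
    if isinstance(node, ast.Compare):
        return (f'{_emit(node.left, alias)} {_OPS[type(node.ops[0])]} '
                f'{_emit(node.comparators[0], alias)}')
    if isinstance(node, ast.Attribute):   # c.Balance -> C.Balance
        return f'{alias}.{node.attr}'
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(ast.dump(node))

def comp_to_sql(source):
    comp = ast.parse(source, mode='eval').body   # a ListComp node
    gen = comp.generators[0]
    table = gen.iter.id                          # Customers
    alias = gen.target.id.upper()                # c -> C
    where = ' AND '.join(_emit(test, alias) for test in gen.ifs)
    return f'SELECT * FROM {table} {alias} WHERE {where}'

print(comp_to_sql(
    "[c for c in Customers"
    " if c.CustomerGroup == 'CREDIT' and c.Balance > c.CreditLimit]"))
# SELECT * FROM Customers C WHERE C.CustomerGroup = 'CREDIT' AND C.Balance > C.CreditLimit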

3 Likes
from _string import formatter_parser
import re
import operator as opr


CMP_REPL_ID = re.compile(r'\{([^\d\W]\w*)\}')


class ns_dict(dict):
    def __init__(self, d, symbols, f_symbol):
        dict.__init__(self, d)
        self.symbols = symbols
        if f_symbol is None:
            f_symbol = lambda x: x
        self.f_symbol = f_symbol

    def __getitem__(self, item):
        # __init__ guarantees f_symbol is callable, so no None check is needed
        if item in self.symbols:
            return self.f_symbol(item)
        return dict.__getitem__(self, item)


def parse(string, namespace, f_symbol=None):
    parse_it = formatter_parser(string)
    symbols = set(filter(None, map(opr.itemgetter(1), parse_it)))
    assert not symbols & set(namespace)
    debrack = CMP_REPL_ID.sub(r'\1', string)
    namespace = ns_dict(namespace, symbols, f_symbol)
    return eval(debrack, namespace)


print(parse('{x}**2 + 3*{x} - 5', {}, lambda x: 1))
# x**2 + 3*x - 5
# -1


class cas:
    i = 'i'
    def mul(a, b): return ['*', a, b]
    def add(a, b): return ['+', a, b]
    def cos(x): return ['cos', x]
    def sin(x): return ['sin', x]


print(parse('mul({r}, add(cos({theta}), mul(i, sin({theta}))))', cas.__dict__, lambda x: 1))
# mul(r, add(cos(theta), mul(i, sin(theta))))
# ['*', 1, ['+', ['cos', 1], ['*', 'i', ['sin', 1]]]]


Issues with this:

1. Flexibility (Inability to override operators)

Solution: an ast.parse-like parse_dsl:

class cas:
    i = 'i'
    def name(name): return ...
    def bin_op(op, a, b): return ['op', a, b]
    def args(...): return ...
    def call(name, args, kwds): return ['call', name, args, kwds]
ast.parse_dsl(string, namespace=cas.__dict__)

So that one can override anything pythonic (e.g. a[1] or b[2]); a rough sketch of such a parse_dsl follows.
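
Here is how such a parse_dsl could be layered on top of ast.parse (parse_dsl and the constructor names are hypothetical):

import ast

def parse_dsl(source, namespace, env=None):
    # Reuse Python's parser, then dispatch each node type to an
    # overridable constructor taken from `namespace`.
    env = env or {}
    def build(node):
        if isinstance(node, ast.BinOp):
            op = type(node.op).__name__   # 'Mult', 'Add', ...
            return namespace['bin_op'](op, build(node.left), build(node.right))
        if isinstance(node, ast.Call):
            return namespace['call'](node.func.id, [build(a) for a in node.args])
        if isinstance(node, ast.Name):
            return env.get(node.id, namespace['name'](node.id))
        if isinstance(node, ast.Constant):
            return namespace['constant'](node.value)
        raise NotImplementedError(ast.dump(node))
    return build(ast.parse(source, mode='eval').body)

cas_ns = {
    'bin_op': lambda op, a, b: [op, a, b],
    'call': lambda f, args: [f, *args],
    'name': lambda n: n,
    'constant': lambda c: c,
}
print(parse_dsl('r * (cos(theta) + i*sin(theta))', cas_ns, {'r': 1, 'theta': 6}))
# ['Mult', 1, ['Add', ['cos', 6], ['Mult', 'i', ['sin', 6]]]]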

2. Performance

import ast
from math import cos, sin
i, r, theta = -1, 1, 6
%timeit parse('mul({r}, add(cos({theta}), mul(i, sin({theta}))))', cas.__dict__)    # 51 µs
%timeit eval('r * (cos(theta) + i*sin(theta))')                                     # 25 µs
%timeit ast.parse('r * (cos(theta) + i*sin(theta))')                                # 23 µs

So parse_dsl could be as fast as 23 µs, and extra optimization efforts could be developed to bring it down further.
e.g.:

ast.parse_dsl(string, namespace=cas.__dict__, caching=MyCacheStrategy())

etc…



So with t-strings the solution would look similar to:

def parse(string, namespace):
    interpolations = t-string-stuff # == {'dummy_r': 1, 'dummy_theta': 2}
    old_name = namespace.name
    namespace.name = lambda x: interpolations[x] if x in interpolations else old_name(x)
    debrack = ... # place dummy_names into places of `{theta}, {r}, etc`
    return parse_dsl(debrack, namespace.__dict__)

cas = functools.partial(parse, namespace=cas)

cas(t'{r} * (cos({theta}) + i*sin({theta}))')


So an ast.parse-like function with overridable node construction methods
would pretty much be the best that can be done.

Note that in Julia the need for $ is inevitable; {} is not that much worse.

So I don’t think it is worth going any further for a runtime solution,
as no runtime solution exists that would make implementing the following sensible anyway:

a = maybe('{a}[1] or {b}[2]')

which, if done in pure Python without a DSL, takes 50 ns, while a runtime solution will not be faster than 10 µs: roughly 200× slower.



The above is a rough draft; things could certainly be done more elegantly.
But I think it is enough to paint a high-level picture of what is possible with t-strings and a bit of non-syntax-changing utility development.



So my proposal is the following:

  1. For runtime solutions, aim to re-use Python’s existing ast.parse logic and other existing machinery to devise tools along the lines of parse_dsl above, and combine that with t-strings.
  2. Concentrate on devising a well-designed Pythonic solution for macros.

So:

  • t-strings + parse_dsl = Pythonic DSL with interpolation @ Run-time
  • t-strings + parse_custom = Non-Pythonic DSL with interpolation @ Run-time
  • MACROS = Pythonic DSL @ Compile-time
  • Other programming languages = Non-Pythonic DSL @ Compile-time

How is this different from an ORM? You would need to have something that takes all of the pieces of Python AST and figures out an SQL query. That’s exactly what ORMs do.

3 Likes

Yes, but all the ORMs I’ve seen do it with a very elaborate and convoluted API. For example, to do this with SQLAlchemy I would first have to create a class defining the Customer table and all of its fields, which would look like

class Customer(DeclarativeBase):
    __tablename__ = "Customers"
    CustomerCode = mapped_column(String(20), primary_key=True)
    CustomerName = mapped_column(String(40))
    CustomerGroup = mapped_column(String(8))
    Balance = mapped_column(Numeric(10, 2))
    CreditLimit = mapped_column(Numeric(10, 2))
    # etc etc

Then the actual query would be something like

stmt = (select(Customer)
    .where(Customer.CustomerGroup == 'CREDIT')
    .where(Customer.Balance > Customer.CreditLimit))
result = session.scalars(stmt)

(This is kind of cheating by using cascaded .where() clauses to implement the boolean and. I don’t know what you need to do in SQLAlchemy to get an or.)

To my mind, this is an enormous amount of complexity and awkwardness to do something that should be very straightforward.

I’ve often wanted something along these lines.

Can you say more about what this gives us that operator overloading does not? Is the idea that people are resistant to a PEP 335 style approach because it removes existing semantic guarantees (namely, that and/or short-circuit), so we can sidestep that by wrapping everything in an explicit marker so that it’s clearer when such guarantees may be violated? Are there any other things besides and/or that would be nice to be able to customize but cannot be under current behavior? (You did give one example in another post which I’ll look at below.)

It seems to me that the condition part is the heart of the matter here. I agree it is nice to be able to write c.CustomerGroup == 'CREDIT' and c.Balance > c.CreditLimit, but I don’t see it as vital to be able to write this as a list comprehension rather than something like:

results = someorm.query(c.CustomerGroup == 'CREDIT' and c.Balance > c.CreditLimit)
for result in results:
    do_stuff()

And this can already be done by clever definition of c — except for the inability to override and.

Some ORMs are a bit nicer than that. For instance py4web’s “DAL” lets you do something like db((db.person.name=='Alex') & (db.person.id > 3)).select(). Again, the main awkwardness here is having to use & instead of and, and then having to parenthesize things because of the different operator precedence.

Another possible sticking point with current Python is the need to have all names defined before the expression. This means you can’t even, e.g., have a query like (Column1 == "Blah") & (Column2 == "Stuff"); you have to prefix them with some kind of database object and write db.Column1 or the like. With your proposal it would be possible to dispense with some of this and roll it into the expression evaluator.

To be fair, I think with many ORMs only part of this complexity is needed to actually construct the query. Part of it is to provide a useful result object that can then be used to manipulate the db. For something like your example, what’s often wanted is a result list of “row” or “record” objects, each of which might handle stuff like row.SomeColumn = 2 (to update a value in the DB). This would still be needed regardless of whether the query is generated via an operator-overloaded object or via a new DSL “context”.

And this kind of leads to one worry I have about using such a context: it requires the syntactic wrapping every time the special behavior is needed. I’m not sure this would be a problem, but it does seem that, for instance, if people are doing stuff where they want to do custom_obj and other_obj and have it do magic stuff, they may want to do that repeatedly throughout a section of code, which would require wrapping many expressions. The advantage of an operator overload is that you have a normal object that incorporates its own specialness, so it can be used anywhere.

Overall, like I say, I’ve often wished for the ability to create these kinds of DSLs. I think that most times I want them, the full flexibility of parsing the entire expression would be overkill. It’s just a tantalizingly small gap between what can currently be done with operator overloading and what I’d like to do. Perhaps this would simply mean that someone would write a generic DSL library that transforms the expression into a hierarchy of standardized objects, similar to an AST but more constrained, so that you could call some kind of .eval() function on nodes as needed, and then everyone would wind up using that library rather than directly writing their own DSL handler.

I’ll be honest, as someone who’s used SQL for most of my career, I find the list comprehension version less readable than the SELECT statement. And I would find it difficult to work out how to translate complex SQL (window functions, for example) into Python-like syntax, so this would only be helpful for simpler queries.

Personally, I’d prefer something like

result = sql(t"SELECT C from Customers WHERE C.CustomerGroup = 'CREDIT' AND C.Balance > C.CreditLimit")

especially if t-strings get highlighted based on the embedded language (which is a hope the t-string PEP expresses, but frankly I’m taking an “I’ll believe it when I see it” attitude to that…)

But I don’t want to write database queries in Python. For me, a DSL is exactly that - a domain-specific language, i.e., not Python.

There is something emotionally unsatisfying about putting code in another language into a string, but realistically we have to mark “this is a different language” somehow, and I can’t honestly come up with any logical reason why sql(t"...") is any different than db$[...]. So while I understand the appeal of this sort of “DSL operator”, I struggle to justify it.

6 Likes
A bit OT on highlighting in t-strings

If your editor uses treesitter for syntax highlighting, you’re likely able to add your own injections, making it possible to do today! Next step would be making your LSP work, but that seems a lot harder.

Yes, this is indeed an “ORM” (or rather, a “Python Syntax Relational Mapper”), and would require similar logic to an ORM. The point is that ORMs, despite currently implementing all the complicated parts in thousands of lines of code, can’t deliver the “last mile” that would allow this syntax, because the iteration is the first thing resolved in this kind of expression, and the best they can offer is method chaining.

IMO this simple example illustrates the point of this whole thread better than any other.
But I agree that t-strings may allow expressing this in ways that satisfy both groups: the same ORM library could easily provide both a “comprehension-like syntax” call and a “raw SQL syntax” call, and still execute mostly the same code underneath.

1 Like
session.query(Customer).filter(Customer.CustomerGroup == 'CREDIT', Customer.Balance > Customer.CreditLimit)

There’s and_ and or_
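
For example, in the 2.0-style API (reusing the Customer model and session from earlier):

from sqlalchemy import select, or_

# OR across the two conditions instead of the implicit AND
stmt = select(Customer).where(
    or_(Customer.CustomerGroup == 'CREDIT',
        Customer.Balance > Customer.CreditLimit))
result = session.scalars(stmt)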

To your point, ORMs are confusing, and to my knowledge SQLAlchemy has at least 3 different APIs: 0.x, 1.0-1.4, and 2.0.