Which Python-supported database supports regex?

I’m using Python 3.11 on Windows 10, but my production programs will end up in the MS Azure ecosystem. I’m still a bit new to Python.

Requirements

  1. It should be a database that is supported by Python and does not require a separate server to be installed (Azure won’t allow some things in some cases).
  2. It should support regex.
  3. I will be dealing with 100,000 records at some point in the future.

Notes

Azure has many different parts and many ways to deploy a Python program. The most expensive is a full server deployment, which I’m trying to avoid; this is why I have the requirements above. I’m trying to use a Function App or Function App Service to deploy my Python programs, but I don’t yet have enough information about Azure to know for sure, as I’m a noob when it comes to Azure.

EDIT: Azure supports SQL Server, but SQL Server does not support regex natively; something else has to be installed to get regex support.

I did not see that sqlite3 supported regex. MongoDB seems to require a server component to be installed.

I don’t have to use SQL to get the records. My last resort is persistent storage like shelve, looping through the records myself using Python regex (see the sketch below), but then I don’t have fields per se that I can select on, and implementing a field-like structure would be tedious.
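For illustration, a minimal sketch of that last-resort approach (the shelve file name and record layout here are made up):

import re
import shelve

# hypothetical layout: each record is a dict stored under a string key
with shelve.open("shipments") as db:
    matches = [record for record in db.values()
               if re.search(r"(?i)boston", record.get("destination", ""))]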

Thank you!

According to the official release log (SQLite Release 3.36.0 on 2021-06-18), it seems to include a regex extension by default in CLI builds.

And PostgreSQL seems to support regex too. [1]


  1. There are likely more DB systems; I can only recommend using Google to find them. ↩︎

Yes, PostgreSQL does support a regex operator; I use it sometimes. But it requires a server component to be installed, which I’m trying to avoid.

Is the CLI version supported by Python? I’m still new to Python.

You don’t have many options if you want to avoid having a server.

SQLite supports a REGEXP operator if you provide a SQL function called regexp. So if you register such a function, like so:

import re
import sqlite3

def regexp(pattern, string):
    # SQLite rewrites "x REGEXP y" as "regexp(y, x)", so the pattern
    # arrives first and the value being tested second
    return re.search(pattern, string) is not None

con = sqlite3.connect(...) # whichever database
con.create_function("regexp", 2, regexp)

It seems to work. (Though I don’t think the performance will be very good.)
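For example, a complete runnable sketch (the shipments table and sample rows are made up, using an in-memory database):

import re
import sqlite3

def regexp(pattern, string):
    return re.search(pattern, string) is not None

con = sqlite3.connect(":memory:")
con.create_function("regexp", 2, regexp)
con.execute("CREATE TABLE shipments (store TEXT)")
con.executemany("INSERT INTO shipments VALUES (?)",
                [("Store #12 BOSTON",), ("store 12 boston",), ("Store 9 Austin",)])

rows = con.execute("SELECT store FROM shipments WHERE store REGEXP ?",
                   [r"(?i)boston"]).fetchall()
print(rows)  # both Boston rows, whatever their casing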

What are the performance requirements for the 100k records? Technically, SQLite can be fast enough, depending on the requirements.

Otherwise, without a server your options are limited, but you could use a (sketchy) directory structure to make a tree of sorts.

Thanks. I didn’t know I could define a function in Python and make it work in SQLite. I have never worked with SQL functions before. I know basic SQL and that’s it.

That will help.

I assume I have to run this code each time in my program? Or is the regexp function stored in that SQLite database when it is first run?

I have no performance requirements. It’s OK if the program takes 40 minutes or more to run; my boss understands these things. 100,000 records is the worst-case scenario, if I have to store a copy of our database’s shipping records due to other technical difficulties and business-process limitations which I cannot get into.

I’m referring to our shipment records, and we make 5,000+ shipments every month. A single shipment might have 1 package going to each of 300 stores.

I use regex to do case-insensitive searches and matches, because the data I’m searching is entered by users and it’s not possible for them to be 100% consistent with their data entry. So you see I’m dealing with the human element here.
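For example (with made-up values), that kind of forgiving match in plain Python looks like:

import re

# user-entered values are rarely consistent
entries = ["BOSTON, MA", "boston ma", "Boston,MA", "Austin TX"]
hits = [e for e in entries if re.search(r"boston", e, re.IGNORECASE)]
print(hits)  # the three Boston variants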

* Some of my Perl programs that get data from a spreadsheet (in tab-delimited form) with 500,000 records do take 40 minutes to run. I haven’t rewritten that specific program in Python yet, but we are moving away from Perl and to Python for general processing tasks.

Honestly, I think you could do this pretty easily with just flat JSON files if you wanted to. With no performance requirements, do you really even need a database? Design a file-system structure with directories and files in a way that makes sense, and do all of your querying work in Python. Or just have a single ginormous JSON file, read it in, and then do the querying in memory. It depends on whether RAM usage is a problem; if it isn’t, it’ll be simpler to load everything up-front.
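A sketch of the single-file version (the file name and record shape are made up):

import json
import re

# load everything up-front, then query in memory
with open("shipments.json", encoding="utf-8") as f:
    records = json.load(f)  # assume a list of dicts

hits = [r for r in records
        if re.search(r"(?i)boston", r.get("destination", ""))]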

Alternatively, go with PostgreSQL, accept the fact that it’s a full database, and then see how much else you can take advantage of. You might find that the payoff for the extra hassle of running a DBMS (on localhost, so you don’t have to worry much about security or anything) is that, instead of taking 40 minutes to run, it takes 40 seconds. Maybe it’ll be worth the effort.

My recommendation there is to try to have a “cleaned up” search-only version of the data, stored as its own column. For example, casefold, strip all diacritical marks, get rid of punctuation, maybe even strip out spaces. Make it so that everything that would match the same regex actually becomes the same string. Then, you can do MUCH faster lookups - and this is true whether you’re using a database engine like PostgreSQL, or an in-Python database like SQLite, or flat JSON files - simply by performing the same transformation on the search, and then matching.
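A sketch of that kind of normalization (the exact rules are up to you):

import re
import unicodedata

def searchable(text):
    # casefold, then decompose and drop diacritical marks
    text = unicodedata.normalize("NFKD", text.casefold())
    text = "".join(c for c in text if not unicodedata.combining(c))
    # strip punctuation, underscores, and whitespace: keep letters and digits
    return re.sub(r"[\W_]+", "", text)

print(searchable("  Café-du Parc, #7 "))  # cafeduparc7

Store searchable(value) in its own column next to the raw value, and run the same function over the search term before comparing.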

Yes, I like PostgreSQL because it supports regex natively and I’ve used it before, and MS Azure supports PostgreSQL; I would just have to spin up a server instance on their service. MS SQL does not support regex natively, although there is a function out there that can add regex support.