Is there a faster way to do string operations? I’m reading a file with 3700+ lines. Each line contains 4 model numbers, and I have to run a function cleanmodelread() on all 4 models for every line. This read function alone takes 2-3 minutes.
Is there a faster way to do what I want in cleanmodelread()?
def cleanmodelread(options, oldmodel):
    '''Clean model of unwanted stuff.
    In: one uncleaned model.
    Out: one cleaned model.
    '''
    procname = str(inspect.stack()[0][3]) + ":"
    newmodel = oldmodel
    newmodel = newmodel.upper()
    newmodel = undupespace(newmodel)  # My custom function from another .py file.
    newmodel = newmodel.strip()  # Do last
    # checkmodel(options, newmodel)
    return newmodel
In this line
newmodel = undupespace(newmodel) # My custom function from another .py file.
this calls undupespace(), which is defined in an imported file, my personal library of commonly used functions. Could this be slowing my program down?
This is what undupespace does:
def undupespace(s):
    procname = libfilename + ":" + str(inspect.stack()[0][3]) + ":"
    t = re.sub(' +', ' ', s)
    return t
But it’s a really common thing I do so it’s in my library file of utilities.
Is there another way of removing dupe spaces without regex?
What’s the point of procname? It’s a local variable, and it’s not used anywhere in the function, as far as I can see, so that’s wasting time.
What I do to remove excess whitespace is ' '.join(string.split()):
>>> s = ' a string with lots of spaces '
>>> s.split()
['a', 'string', 'with', 'lots', 'of', 'spaces']
>>> ' '.join(s.split())
'a string with lots of spaces'
There isn’t anything in the code you shared that I would expect to take 2-3 minutes to run 3700*4 times. It’s difficult to predict what will be slow, since it partially depends on what the input is.
For example, the regex seems like a likely slow spot, but in my tests, it’s quite fast.
What I recommend is that you do some profiling to get a better picture of where the time is actually being spent. There are a lot of excellent profilers for Python, but a good starting point is the built-in cProfile. If you normally run your program with something like python script.py or python -m module, you can easily profile it by changing that to python -m cProfile script.py or python -m cProfile -m module. See: The Python Profilers — Python 3.13.2 documentation
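If you would rather profile just the one function instead of the whole script, here is a minimal sketch using cProfile and pstats directly. The read_file() below is a hypothetical stand-in for your actual read function, just so the example runs on its own:

```python
# Minimal sketch: profile one function with cProfile and print the
# top entries. read_file() is a hypothetical stand-in workload.
import cProfile
import io
import pstats

def read_file():
    # Stand-in workload: clean 3700 lines of 4 models each.
    for _ in range(3700):
        for model in ('  abc 1 ', 'def  2', ' ghi 3', 'jkl   4'):
            ' '.join(model.upper().split())

profiler = cProfile.Profile()
profiler.enable()
read_file()
profiler.disable()

# Print the 10 most expensive calls, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
print(out.getvalue())
```

Sorting by cumulative time ('cumulative') shows which high-level calls dominate; sorting by 'tottime' instead shows where the time is spent excluding sub-calls.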
If you run the profile, share the results here and we can help with diagnosis.
procname is a standard variable which contains the function name. It’s in all my functions.
What I do to remove excess whitespace is ' '.join(string.split())
I don’t want to remove all whitespace, I want to remove duplicate spaces so that "My   words  here" becomes "My words here" with only one space between words.
Ok I used ' '.join(string.split()) instead of calling undupespace() and that also sped things up. Thank you. I misunderstood how that code worked at first.
Using @MRAB’s version of undupespace and removing temporary variables and unused assignments, your function becomes:
def cleanmodelread(options, oldmodel):
    '''Clean model of unwanted stuff.
    In: one uncleaned model.
    Out: one cleaned model.
    '''
    # Clean by removing leading+trailing whitespace and converting
    # multiple whitespaces to one space.
    return ' '.join(oldmodel.split())
It is an optimisation, and maybe less readable, but should be faster.
You might want to step back a little and take a look at the whole file. If it looks something like this:
I wonder if introspection: inspect.stack()[0][3] is way more computationally expensive than a few string operations that use native C code. But I haven’t profiled this code any more than you have.
I wonder if introspection: inspect.stack()[0][3] is way more computationally expensive than a few string operations that use native C code.
It’s called eight times per line by my reading, just in the code given. Doing that call and nothing else that many times takes over 30 seconds in a quick and dirty test on my laptop:
In [2]: def foo(): inspect.stack()[0][3]
In [3]: def bar(): foo()
In [4]: %timeit for i in range (3700*8): bar()
30.9 s ± 433 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But OP says this is in “all their functions” so the actual number of times it is called may be even higher. Getting rid of the introspection everywhere it is not explicitly used is definitely the first thing I would try.
But as suggested, running with cProfile will give a clearer picture of where time is spent.
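If you do still want the current function’s name in error messages, a much cheaper option in CPython is sys._getframe(), which grabs a single frame instead of building info for the entire call stack like inspect.stack() does. A sketch, with hypothetical names (cheap_procname, cleanmodelread_demo), and with the caveat that _getframe() is CPython-specific and technically a private API:

```python
# sys._getframe(1) returns the caller's frame; its code object carries
# the function name. CPython-specific and technically private.
import sys

def cheap_procname():
    return sys._getframe(1).f_code.co_name + ":"

def cleanmodelread_demo():
    procname = cheap_procname()
    return procname

print(cleanmodelread_demo())
```

Cheapest of all, of course, is simply hardcoding the name as a string literal in each function, since it never changes at runtime.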
OMG you’re right. Getting rid of procname = inspect.stack()[0][3] really sped things up. That was just the function name used in error messages, nothing more. Some functions didn’t have any error messages at all, so there I just commented that line out.
My read time went from 2.96 minutes to 0.58 minutes.