Celery task runs about 50% of the time, doesn't even start the other 50%

The title pretty much says it, but:

  • One of my Celery tasks is being skipped roughly every other time it should run. I can see this both in a debugger (the debugger only starts up every other time) and with a simple little logging module I wrote (it only logs anything every other time).
  • I wrote a little Celery on_success handler that logs its 4 parameters, including task_id. I can see it logging the same task_id twice (everything else is identical too, even the timestamps), then 0 times, then twice, then 0 times.

The problem comes and goes, and I don’t know why. It consistently behaves as described above for a while, and then suddenly starts working again, with what appears to be no code or input changes.

I’ve got the task in question decorated the same as most of the other tasks in this program:
@asynchronous.task(base=TPMplus, bind=True, max_retries=None, reject_on_worker_lost=True)
…except for TPMplus, a base class that just adds on_success and on_failure methods.
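
For context, here is a trimmed-down sketch of the sort of thing TPMplus does. This is not the real class, just my assumption of its general shape, using Celery’s documented on_success/on_failure hooks; the on_success logging of all four parameters mentioned above looks roughly like this:

  import logging

  from celery import Task

  logger = logging.getLogger(__name__)

  class TPMplus(Task):
      # Hypothetical sketch of a base task supplying success/failure hooks.

      def on_success(self, retval, task_id, args, kwargs):
          # Celery calls this on the worker after the task returns; logging all
          # four parameters makes duplicate or missing runs visible by task_id.
          logger.info("on_success: retval=%r task_id=%s args=%r kwargs=%r",
                      retval, task_id, args, kwargs)

      def on_failure(self, exc, task_id, args, kwargs, einfo):
          # Called when the task raises; einfo carries the traceback.
          logger.error("on_failure: exc=%r task_id=%s args=%r kwargs=%r",
                       exc, task_id, args, kwargs)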

I need it to fire 100% of the times it’s called.

For CPython 2.7 I’m using:

celery==3.1.26.post2
  - billiard [required: >=3.3.0.23,<3.4, installed: 3.3.0.23]
  - kombu [required: >=3.0.37,<3.1, installed: 3.0.37]
    - amqp [required: >=1.4.9,<2.0, installed: 1.4.9]
    - anyjson [required: >=0.3.3, installed: 0.3.3]
  - pytz [required: >dev, installed: 2018.3]

And for CPython 3.10 we’re using:

celery==5.2.3
├── billiard [required: >=3.6.4.0,<4.0, installed: 3.6.4.0]
├── click [required: >=8.0.3,<9.0, installed: 8.0.3]
├── click-didyoumean [required: >=0.0.3, installed: 0.3.0]
│   └── click [required: >=7, installed: 8.0.3]
├── click-plugins [required: >=1.1.1, installed: 1.1.1]
│   └── click [required: >=4.0, installed: 8.0.3]
├── click-repl [required: >=0.2.0, installed: 0.2.0]
│   ├── click [required: Any, installed: 8.0.3]
│   ├── prompt-toolkit [required: Any, installed: 3.0.30]
│   │   └── wcwidth [required: Any, installed: 0.2.5]
│   └── six [required: Any, installed: 1.10.0]
├── kombu [required: >=5.2.3,<6.0, installed: 5.2.4]
│   ├── amqp [required: >=5.0.9,<6.0.0, installed: 5.1.0]
│   │   └── vine [required: >=5.0.0, installed: 5.0.0]
│   └── vine [required: Any, installed: 5.0.0]
├── pytz [required: >=2021.3, installed: 2021.3]
├── setuptools [required: >=59.1.1,<59.7.0, installed: 59.6.0]
└── vine [required: >=5.0.0,<6.0, installed: 5.0.0]

We’re on some really old versions there, because we’re still porting from CPython 2.7 to CPython 3.10. But if changing those is likely to help, it’ll probably be worth it.

For this part of the system we’re using CPython 3.10, so those versions probably matter far more.

Any suggestions?

I like celery a lot.

I did some more googling and found someone saying they’d seen a similar problem that turned out to be caused by more than one celery worker running at the same time. I’m not sure yet whether that fits what I’m seeing, but I’m hopeful.

I don’t think it’s a problem of errant celery workers running - we just have one top-level worker, plus the 8 child processes it was asked to start:

tact@celery_worker:/app$ ps -eo pid,ppid,cmd
    PID    PPID CMD
      1       0 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     13       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     14       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     15       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     16       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     17       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     18       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
     20       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
    116       1 /usr/local/bin/tact-python -u -m celery --app=service.run:celery worker --concurrency=8 -E --loglevel=DEBUG -Ofair --logfile=/var/log/celery-worker/celery-worker.log
    294       0 /bin/bash
    309     294 ps -eo pid,ppid,cmd

Any suggestions, folks? I haven’t given up on Google yet, but at the moment I’m out of ideas.

BTW, for a while I thought it might be a bad interaction between debugpy and pudb, but I’ve since ruled that out; it happens even without any debugger running at all.

Have you tried installing Flower? That might give you more insight into the status of your tasks.

What broker are you using? Do you have access to your broker’s admin UI, such as the RabbitMQ management UI? From the broker you can confirm that each task call successfully publishes a message, and then that each message is being consumed. If both are true, then you know there is a consumer somewhere processing the messages.
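
Celery’s remote control commands can answer part of this from Python, too. A minimal sketch, assuming the app object from your --app=service.run:celery command line is importable as shown (adjust the import to your real layout):

  # Ask the running workers what they are doing; empty replies just mean
  # nothing answered within the timeout, not necessarily that nothing is running.
  from service.run import celery as app

  insp = app.control.inspect(timeout=5.0)

  print("ping:          ", insp.ping())           # which worker nodes respond
  print("active_queues: ", insp.active_queues())  # which queues each worker consumes
  print("active:        ", insp.active())         # tasks currently executing
  print("reserved:      ", insp.reserved())       # tasks prefetched but not yet started

In particular, active_queues() will show whether more than one worker node is consuming the queue this task is published to.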

Setting up Flower is on my list of things to do, but I may have to accelerate that. We’re using RabbitMQ as our broker. I don’t know for sure that we have a result backend set up, but if we do, it’s not Redis; it would pretty much have to be RabbitMQ.

I’m talking about a large software system written by someone who’s no longer with the company.

Thanks for the reply.

I’ve got Flower set up now - thanks for recommending that, Joshua.

Our development environments have two celery workers, “celery_worker” and “celery_worker2”. We only use one at a time though. They don’t share much data, even though they have a lot of common code.

Based on what I’m seeing in Flower, this one task wants to alternate between those two celery workers - but it should only be going to celery_worker. celery_worker2 isn’t fully configured in this case, so the task fails on even-numbered calls (works on the 1st call, fails on the 2nd, works on the 3rd, fails on the 4th…).

Almost all of our API calls go to the correct celery worker, “celery_worker”. But for some reason this one celery task wants to alternate between them.

Here’s how I’m calling the task that mistakenly alternates workers:

  options = {}  # empty: no routing options such as queue= get passed to apply_async below
  t = _finishclean.si(
    storage=pipeline.storage.serialize(admin=True),
    pipeline=pipeline.serialize(admin=True),
    builds_to_delete=builds_to_delete,
    builds_to_unpublish=builds_to_unpublish,
    already_locked=already_locked,
  )
  t.apply_async(**options)

And here’s the definition, with its decorator, of the task being called:

@asynchronous.task(base=TACTPublishingMixin, bind=True, max_retries=None, reject_on_worker_lost=True)
def _finishclean(self, *args, **kwargs):

We have about 15 tasks using base=TACTPublishingMixin, so it seems that’s unlikely to be the cause of the problem.

How might a list of valid celery workers be specified to Celery? I’m guessing I need to remove a host from that list, or at least look at the list and see if it spurs any new thoughts. Or could it be a celery bug?

Any (further) suggestions, folks?

I’d suggest reading the documentation on task routing. It will probably also be helpful to look at the AMQP Primer section.

Almost all of our API calls go to the correct celery worker, “celery_worker”. But for some reason this one celery task wants to alternate between them.

Assuming you are using default settings, and you have two workers connected to a single queue, this is working as intended: each worker is available to pick up any task from that queue. If you do not want celery_worker2 to pick up tasks off the queue, then turn it off. If you want celery_worker to pick up some tasks off the queue while celery_worker2 picks up others, then you need to change your routing and worker setup, such as running two queues.

If you’re not using default settings, then we’d need to see more information such as your routing configuration and what queues your workers are connecting to.
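
As a rough illustration of the two-queue approach, here is a minimal sketch. The queue name "worker1_queue" and the dotted task name are assumptions for illustration, not taken from your project:

  from service.run import celery as app  # same app object the workers use

  # Route just this task to a dedicated queue; everything else keeps using
  # the default "celery" queue.
  app.conf.task_routes = {
      "service.tasks._finishclean": {"queue": "worker1_queue"},
  }

  # The queue can also be pinned per call instead:
  #   _finishclean.si(...).apply_async(queue="worker1_queue")
  #
  # Each worker is then started against only the queue(s) it should consume:
  #   celery --app=service.run:celery worker -Q worker1_queue ...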

It turns out we do have a result backend set up, and it’s MySQL - no SQLAlchemy involved, just MySQL.

This turned out to be caused by a missing-but-required, undocumented parameter to the celery task. In other words, it was an oddity of our routing code.

Thanks for reading and suggesting, folks!
