Venus OS crash/reboot debug advice

I am trying to figure out why my Venus OS is crashing/rebooting, but there is nothing useful in /data/log/messages immediately before the line
syslog.info syslogd started: BusyBox v1.31.1
The entries before it are mostly connman, the avahi daemon etc., and they are from some minutes before the reboot.

A few weeks ago I updated my rpi zero from 3.33 to 3.42.
The rpi runs a service that has a paho mqtt client subscribing to several MQTT topics on my Cerbo.

In the past, when the Cerbo rebooted or otherwise stopped its MQTT server, the rpi would see several MQTT messages with no topic. I have always assumed they were last-will-and-testament related, but never really looked at the detail. The service then sat happily and the paho loop reconnected when the Cerbo MQTT server came back.

Now the rpi is rebooting.
I have added an OnDisconnect method to my service and I can see that we are being disconnected, but Venus reboots before it can attempt to reconnect manually.
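For context, the handler is roughly along these lines (a minimal sketch using the paho 1.x callback signature; the delay, logging setup and everything else here is illustrative, not the exact code in my service):

import logging
import time

import paho.mqtt.client as mqtt

log = logging.getLogger("mqtt-service")

def on_disconnect(client, userdata, rc):
    # rc == 0 means we requested the disconnect ourselves;
    # anything else is an unexpected drop from the broker side.
    log.warning("MQTT disconnected, rc=%s", rc)
    # Give the Cerbo broker a moment to come back before reconnecting.
    # In practice the rpi reboots before this ever gets to run.
    time.sleep(5)
    try:
        client.reconnect()
    except OSError as exc:
        log.error("manual reconnect failed: %s", exc)

client = mqtt.Client()   # paho 1.x constructor
client.on_disconnect = on_disconnect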

Hopefully somebody has an insight into where I can find out why BusyBox is rebooting, if it doesn't write it to syslog.

Thanks Colin

Use top to see how busy the system is.
If it's too busy, maybe the watchdog parameters need to be adjusted.

Thanks Alex,

When running happily, top on the rpi displays:
Mem: 382152K used, 55020K free, 1724K shrd, 58676K buff, 92856K cached
CPU: 5% usr 1% sys 0% nic 92% idle 0% io 0% irq 0% sirq
Load average: 0.03 0.35 0.36 3/271 24082

Then I rebooted the Cerbo, and immediately received the OnDisconnect callback on the rpi client. Moments later the top display froze as:
Mem: 381624K used, 55548K free, 1724K shrd, 58692K buff, 92856K cached
CPU: 2% usr 0% sys 0% nic 96% idle 0% io 0% irq 0% sirq
Load average: 0.00 0.22 0.31 3/273 24383

Would watchdog log intervention anywhere?

I am assuming there is a change in behaviour of the client and/or broker between releases. I have always just left the paho client loop to handle the disconnect/reconnect. I will have to do some reading and try to improve the OnDisconnect handling, or at least make it crash my service rather than rebooting.
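By "crash my service" I mean something along these lines (a sketch; whether taking the service down actually avoids the OS reboot is exactly what I want to find out):

import logging
import os

log = logging.getLogger("mqtt-service")

def on_disconnect(client, userdata, rc):
    if rc != 0:
        # Unexpected disconnect: log it and take only the service down.
        # The service runs under a daemontools-style supervisor on Venus OS,
        # so it should get restarted, and that tells me whether the reboot
        # follows my service or happens regardless.
        log.error("unexpected MQTT disconnect (rc=%s), exiting service", rc)
        os._exit(1)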

The top results are more than OK, so the CPU watchdog is, for sure, not the cause of the reboot.

Thanks for the suggestion anyway. Certainly worth a look.

I have been doing a little experimenting, but no progress.

All of the mqtt code in my service is within try blocks, and they are not catching any exceptions.

I rewrote the mqtt thread to issue mqtt.loop() calls manually rather than using mqtt.loop_start() and mqtt.loop_stop(), so paho would not try to reconnect automatically.
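Roughly what the rewritten thread does now (a sketch; the broker hostname and the N/# topic filter are placeholders for my actual config):

import threading

import paho.mqtt.client as mqtt

def mqtt_worker(client, stop_event):
    # Drive the network loop by hand instead of loop_start()/loop_stop(),
    # so paho never attempts to reconnect on its own.
    while not stop_event.is_set():
        rc = client.loop(timeout=1.0)
        if rc != mqtt.MQTT_ERR_SUCCESS:
            # Connection gone: stop the thread and let the rest of the
            # service decide what to do next (no automatic reconnect).
            break

client = mqtt.Client()                 # paho 1.x constructor
client.connect("cerbo.local", 1883)    # placeholder broker address
client.subscribe("N/#")                # placeholder topic filter
stop_event = threading.Event()
worker = threading.Thread(target=mqtt_worker, args=(client, stop_event), daemon=True)
worker.start()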

I installed paho 1.6.1 and 2.1.0. Maybe a slight difference in the timing of the reboot, but no real difference.
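For anyone repeating this: the client constructor changed between those releases, so I had to account for it when switching versions (a sketch; the hasattr check is just my way of running the same code against both, keeping the old 1.x callback signatures on 2.1.0):

import paho.mqtt.client as mqtt

# paho-mqtt 2.x requires an explicit callback API version when constructing
# the client; 1.6.1 does not have CallbackAPIVersion at all.
if hasattr(mqtt, "CallbackAPIVersion"):
    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION1)  # keep 1.x-style callbacks
else:
    client = mqtt.Client()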

I have found some references online to issues with disconnect and “too many” subscriptions, but no clarity on what “too many” means.

Open to suggestions on how to locate the source of the decision to reboot.

Are you confident that it's not a failing power supply? Random reboots are a classic symptom of a PSU that isn't providing quite enough current: the voltage drops and a reboot occurs.
RP


Especially on the Raspberry Pi 3 and 4.
I've never had an issue with cheap cell phone chargers on a Pi 2, or with 24V->5V DC-DC converters (even ones from AliExpress), but on the Pi 4 you really need a good PSU or DC-DC converter.