Venus OS crash/reboot debug advice

I am trying to figure out why my Venus OS is crashing/rebooting, but there is nothing useful in /data/log/messages immediately before this line:
syslog.info syslogd started: BusyBox v1.31.1
The preceding entries are mostly connman, avahi-daemon etc., and they date from some minutes before the reboot.

A few weeks ago I updated Venus OS on my rpi zero from 3.33 to 3.42.
The rpi runs a service that has a paho mqtt client subscribing to several MQTT topics on my Cerbo.

In the past, when the Cerbo rebooted or otherwise stopped its MQ server, the rpi would see several MQ msgs with no topic. I have always assumed these were last will and testament related, but never really looked at the detail. The service then sat happily until the paho loop reconnected when the Cerbo MQ server came back.

Now the rpi is rebooting.
I have added an OnDisconnect method to my service and I can see that we are being disconnected, but venus reboots before it can attempt to manually reconnect.

Hopefully somebody has an insight into where I can find why Busybox is rebooting, if it doesn’t write it to syslog.

Thanks Colin

Use top to see how busy the system is.
If it’s too busy, maybe the watchdog parameters need to be adjusted.

Thanks Alex,

When running happily top on rpi displays:
Mem: 382152K used, 55020K free, 1724K shrd, 58676K buff, 92856K cached
CPU: 5% usr 1% sys 0% nic 92% idle 0% io 0% irq 0% sirq
Load average: 0.03 0.35 0.36 3/271 24082

Then I rebooted the Cerbo, and immediately received the OnDisconnect callback on the rpi client. Moments later the top display froze as:
Mem: 381624K used, 55548K free, 1724K shrd, 58692K buff, 92856K cached
CPU: 2% usr 0% sys 0% nic 96% idle 0% io 0% irq 0% sirq
Load average: 0.00 0.22 0.31 3/273 24383

Would watchdog log intervention anywhere?

I am assuming there is a change in behaviour of the client and/or broker between releases. I have always just left the paho client loop to handle the disconnect/reconnect. I will have to do some reading and try to improve the OnDisconnect handling, or at least make it crash my service rather than rebooting.

The top results are more than OK, so the CPU watchdog is almost certainly not the cause of the reboot.

Thanks for the suggestion anyway. Certainly worth the look.

I have been doing a little experimenting, but no progress.

All of the mqtt code in my service is within try blocks, and they are not catching any exceptions.

I rewrote the mqtt thread to call mqtt.loop() manually rather than using mqtt.loop_start() and mqtt.loop_stop(), so paho would not try to reconnect automatically.
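
For anyone wanting to repeat that experiment, a minimal sketch of the manual-loop approach might look like the following. It is not the code from the service above: the names run and backoff_delays are made up for illustration, and the callbacks take *args only so the same sketch should work with both the 1.x and 2.x paho callback signatures.

```python
import time


def backoff_delays(first=1.0, factor=2.0, ceiling=60.0):
    """Yield reconnect delays that grow by `factor` up to `ceiling` seconds."""
    delay = first
    while True:
        yield min(delay, ceiling)
        delay = min(delay * factor, ceiling)


def run(host, topics, port=1883):
    """Connect, subscribe and pump the network loop by hand (hypothetical helper)."""
    import paho.mqtt.client as mqtt  # imported here so the helper above has no dependency

    def on_connect(client, userdata, *args):
        # Re-subscribe on every (re)connect; subscriptions are not kept for us.
        for topic in topics:
            client.subscribe(topic)

    def on_disconnect(client, userdata, *args):
        print("disconnected:", args, flush=True)

    def on_message(client, userdata, msg):
        print(msg.topic, msg.payload, flush=True)

    # paho 2.x takes a callback API version; 1.x has no such argument.
    if hasattr(mqtt, "CallbackAPIVersion"):
        client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    else:
        client = mqtt.Client()
    client.on_connect = on_connect
    client.on_disconnect = on_disconnect
    client.on_message = on_message

    delays = backoff_delays()
    while True:
        try:
            client.connect(host, port, keepalive=60)
        except OSError as exc:
            wait = next(delays)
            print(f"connect failed ({exc}); retrying in {wait:.0f}s", flush=True)
            time.sleep(wait)
            continue
        delays = backoff_delays()  # connection opened, so start the backoff again
        # loop() returns a non-zero code once the connection is lost.
        while client.loop(timeout=1.0) == mqtt.MQTT_ERR_SUCCESS:
            pass
        time.sleep(next(delays))
```

Calling run() with the Cerbo's address and a list of topic filters would then block forever, reconnecting at its own pace instead of leaving that to loop_start().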

I installed paho 1.6.1 and 2.1.0. Maybe slight difference in timing of reboot, but no real difference.

I have found some references online to issues with disconnect and “too many” subscriptions, but no clarity on what “too many” means.

Open to suggestions to locate source of decision to reboot.
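
One thing that might help narrow it down is having the service write its own marker lines and force each one onto storage straight away, so they survive even if the box stops abruptly. The sketch below is only an illustration using the Python standard library; log_event and last_event are hypothetical names, and the log path is whatever persistent location suits (something under /data, for example).

```python
import os
import time


def log_event(path, text, now=None):
    """Append one timestamped line and force it out to storage."""
    stamp = time.time() if now is None else now
    line = f"{stamp:.3f} {text}\n"
    with open(path, "a") as fh:
        fh.write(line)
        fh.flush()             # Python buffer -> kernel
        os.fsync(fh.fileno())  # kernel cache -> storage device
    return line


def last_event(path):
    """Return the last line recorded before the reboot, or None if there is none."""
    try:
        with open(path) as fh:
            lines = fh.read().splitlines()
    except OSError:
        return None
    return lines[-1] if lines else None
```

Calling log_event() from the disconnect callback and again just before each reconnect attempt shows how far the service got. If the last marker is the disconnect and syslog has nothing from a shutdown sequence either, that would point towards an abrupt stop (power loss or a hard reset) rather than an orderly reboot.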

Are you confident that it’s not a failing power supply? Random reboots are a classic symptom of a PSU that is not providing quite enough current: the voltage drops and a reboot occurs.
RP

Especially in the Raspberry Pi 3 and 4.
I’ve never had an issue with cheap cell phone chargers on Pi2, or with 24v->5v DCDC (even ones from AliExpress), but on the Pi4 you really need a good PSU or DCDC.
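
On a Raspberry Pi, the firmware keeps under-voltage flags that can be read with vcgencmd get_throttled. The sketch below decodes that value. It assumes vcgencmd is available on the image in question, which may not be the case on Venus OS, and decode_throttled and read_throttled are names made up for this example. The flags start from zero at each boot, so a complete loss of power leaves no trace here; this only helps catch a supply that sags while the Pi keeps running.

```python
import subprocess

# Bit meanings of the value reported by `vcgencmd get_throttled`.
FLAGS = {
    0: "under-voltage detected now",
    1: "ARM frequency capped now",
    2: "currently throttled",
    3: "soft temperature limit active now",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output):
    """Turn a line such as 'throttled=0x50005' into a list of flag descriptions."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in sorted(FLAGS.items()) if value & (1 << bit)]


def read_throttled():
    """Run vcgencmd (assumed to be installed) and decode its output."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"],
        capture_output=True, text=True, check=True,
    )
    return decode_throttled(result.stdout)
```

An empty list from read_throttled() means no flags are set since the last boot.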

Power supply is an interesting idea !

The reboots are not random; they happen when the Cerbo reboots (for example after shutdown -r).

The Cerbo and rpi zero share a grid meter. The Cerbo has a direct connection and I run a service on the rpi that subscribes to the grid mqtt topics. With Mosquitto I used a bridge with topic remapping to send the grid dbus updates to the rpi with N/cerboid/… changed to W/rpi0id/…
FlashMQ doesn’t support that, so the rpi zero service now subscribes to the Cerbo’s mqtt server. In fact, while trying to debug these crashes, I bridged the two mqtt servers so my service could subscribe to the local rpi server. It didn’t make any difference. I still don’t get a rush of grid topic, None msgs to tell me the remote service has shut down.

You made me think about the power supply, and of course, the rpi zero is powered from a usb port of the Cerbo. I suspect the Cerbo momentarily drops power to the usb port when it is rebooting, and that would cause the rpi to reboot itself once power came back.
Something to test on the weekend.
Thank you

I hunted out a usb power supply and longer cables, and the rpi zero reboot no longer occurs when the Cerbo is rebooting.

It appears the console usb port remains powered, but the two data usb ports drop power.

Great result.
It would make sense for the USB ports to be power cycled.
AFAIK (I’m not the RoarPower expert on this, but the expert is super busy so I won’t ask right now given you have resolved it), in some (most?) USB integrated circuits these days the IC is responsible for shutting itself down if too much power is drawn (or a short circuit is detected), and this requires repowering the whole IC to reset, so a momentary power loss on the USB interface would be expected.