For a week now I have been having issues with my system (MP2 + US2000 + 2x SmartSolar + Cerbo GX + ET112). My Cerbo GX keeps crashing, day and night, without any explanation I could find, and most of the time it takes more than 10 minutes to reboot (sometimes I even have to hard reboot it myself).
This issue does not seem to be linked to the Venus OS version: when it started, it was running 3.42. I then tried 3.52 and 3.6 (nightly build), with the same result.
I also tried a reset to factory settings and finally reflashed the whole system, but none of these actions fixed the issue.
I also thought it could be a hardware problem, so I removed the RS485/USB dongle of the ET112 to reduce the power draw. No luck.
Finally I connected via SSH, went to /data/Logs and read about 100 different logs, but there is no trace at all of any reboot order. It just stops suddenly and restarts (sometimes).
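For anyone who wants to repeat this kind of log search without opening every file by hand, here is a minimal Python sketch. It walks a directory and lists every line that matches a few reboot-related keywords. The function name `scan_logs` and the keyword list are my own assumptions for illustration, not something Venus OS defines, and finding nothing does not prove that no reboot was logged elsewhere.

```python
import re
from pathlib import Path

# Hypothetical keyword list: adjust it to whatever your system actually logs.
PATTERN = re.compile(
    r"reboot|watchdog|panic|oom-killer|out of memory|segfault", re.IGNORECASE
)

def scan_logs(log_dir, pattern=PATTERN):
    """Return (file, line_number, text) for every line matching the pattern."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps going if a log contains odd bytes.
            with path.open("r", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        hits.append((str(path), number, line.rstrip("\n")))
        except OSError:
            continue  # unreadable file, skip it
    return hits
```

Usage would be something like `for hit in scan_logs("/path/to/logs"): print(hit)`, pointing it at the log directory on the device.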
When it tries to restart, it usually gets stuck with only the Ethernet port’s LEDs blinking slowly, then restarts again. Sometimes it also gets as far as turning the red “Wifi Access Point” LED on, but then restarts again.
This Cerbo GX has only been running for 4 months.
Does anyone have any clue, or has anyone experienced something similar?
Hello, I have the same issue. With the beta version, and also with the “stable” version with GUI 2.0, the device restarted from time to time. Since I switched back to the old GUI on the device itself, it has been stable.
Thank you all for your advice! I really appreciate it. @dognose the Cerbo is powered from the batteries, but I’ll still give it a try tomorrow with a laboratory power supply.
@Mr_G8N the problem started with V3.42, which did not yet have the GUI V2. I’m back on V3.42 after trying the latest versions, and I am now using the GUI V1 again. @ejrossouw Thanks for the advice, I didn’t know about it, it’s useful! The D-Bus RTT seems fine, as you can see below. The CPU temperature, CPU load and RAM couldn’t be monitored this way, so I made a script to check them, as I suspected them. Everything seems fine except the CPU temperature, which reaches almost the same value (roughly 60°C) each time before the Cerbo crashes. Does somebody know the max CPU temperature setting (60°C seems very low)? Can somebody tell me what their Cerbo’s CPU temperature is?
Every drop in temperature is a crash. Yes, that’s quite a lot for half a day… And because I have lithium batteries, the SmartSolars go into security mode to protect against overvoltage… So the crashing itself is OK, but losing so much solar energy is really annoying, especially in the northern hemisphere right now, where the days are really short.
I forgot to mention: the Cerbo used to run a little script, started at boot, which drove my hot water tank and shared load to it according to the inverter capacity. It used between 0 and 1% of the CPU (same for RAM), but it has been deactivated for a week now, since I reflashed the whole system, and the problem remains.
@dognose you were right: since the Cerbo has been powered from a stable power supply, there have been no more crashes.
The ripple is not that high, between 0.4 and 0.6V during inrush, so I think the Cerbo’s voltage regulator filter has an issue; a cap must have aged prematurely.
I’m gonna buy a stable 48/12V buck converter.
Thank you, you have indeed a good dog’s nose !
The Cerbo has a very broad voltage range on the input (8-70 VDC).
So a little “ripple” shouldn’t really hurt. Possibly it really is a defective capacitor, or an (unseen) issue with the continuity of your battery power? (Inverters can withstand it, the Cerbo and SmartSolars can’t?)
Maybe the SmartSolars going into security mode is not a consequence of the Cerbo shutting down but the cause of it, and you should investigate why a potential overvoltage is happening there?
(Or something else happening on your DC bus is the cause of the issue seen on both device types.)
Depending on where you live, it’s very cold outside these days. The first thing I would check is whether the solar voltage possibly exceeds the input limit of the SmartSolars, being the root of a chain of events. (Mind the temperature sensitivity of the panels themselves.)
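To put a number on the cold-weather check, a rough Python sketch using the usual linear model for open-circuit voltage, Voc(T) = Voc_STC * (1 + coeff * (T - 25)). Every value below is a placeholder I made up for illustration (per-panel Voc, temperature coefficient, controller input limit); take the real ones from the panel and charge controller datasheets.

```python
def string_voc(voc_stc_per_panel, panels, temp_coeff_pct_per_c, cell_temp_c):
    """Open-circuit voltage of a series string at a given cell temperature.

    Linear model: Voc(T) = Voc_STC * (1 + coeff/100 * (T - 25)).
    The coefficient is negative for silicon panels, so Voc rises in the cold.
    """
    factor = 1 + temp_coeff_pct_per_c / 100 * (cell_temp_c - 25)
    return voc_stc_per_panel * panels * factor

# All of these are hypothetical example values, not taken from any datasheet.
VOC_STC = 40.0          # V per panel at 25 degC
PANELS = 5              # panels in series per string
COEFF = -0.30           # % per degC
CONTROLLER_LIMIT = 250  # V, assumed PV input limit of the charge controller

for temp in (25, 0, -10, -20):
    voc = string_voc(VOC_STC, PANELS, COEFF, temp)
    status = "OK" if voc < CONTROLLER_LIMIT else "OVER LIMIT"
    print(f"{temp:>4} degC: {voc:6.1f} V  {status}")
```

The point is only to see how much headroom is left at the coldest temperature you expect, with the panels unloaded.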
The issue happened again today. While I was handling the cables of the temperature probes, a short circuit occurred (bad wiring on my part, I had not used wire ferrules on these 2). So I replaced these 2 probes with new ones, using the original wire ferrules. I tried to restart it properly (i.e. from the GUI) and it got stuck again at boot. I had to power cycle the laboratory power supply to make it start.
It crashed 30 mins later, and I noticed from the log of my little script (restarted since everything was fine yesterday), which is connected to D-Bus, that D-Bus had stopped responding before the crash.
About the SmartSolars, the wiring is good, it can go up to 200Vdc @25°C (5 panels per string), and anyway the problem also occurs at night.
I got “BMS Lost connection” on both SmartSolars and the MP2, and all of them are responding after the Cerbo crashes (I checked via MK3 and Bluetooth). I think they (SmartSolars & MP2) raise the alarm because the Cerbo is off and cannot forward the BMS info for too long.
About the DC bus, I cannot check: I no longer have an oscilloscope to look at it carefully, but everything had worked fine for 4 months before.
I also tried unplugging the communication ports from the Cerbo one after the other (CAN bus, VE.Direct, etc.), without being able to identify the source of the problem.
Well, let’s see with the new temp probes. It hasn’t crashed for nearly 3 hours, but I still have issues rebooting it from the GUI.
Thank you for your help so far
When you set up third-party scripts interacting with D-Bus, you have to be careful. A lot of the scripts you can find on GitHub use outdated methods to register themselves on D-Bus. That usually ends up with other D-Bus monitors getting stuck for a certain amount of time.
Possibly one of those scripts keeps stopping, gets restarted and causes these stalls over and over, until the Cerbo crashes.
Yes, you’re right. I first stopped my script when the problem occurred nearly 2 weeks ago, thinking it might be the cause. But even without it, and with a freshly flashed system, the issue was still there.
I don’t use high-level libraries much; I wrote my own around a low-level one, which I have often used on industrial-grade embedded systems.
Anyway, I’m gonna kill the script and see how it goes without it again.
Thank you @dognose
A general list of possible reasons for random reboots (this applies to most small form factor devices like Cerbo, Pi, ESP32, etc.; not an exhaustive list):
CPU Temperature exceeding the threshold.
50°C is the limit for the Cerbo, but @guystewart might be able to confirm that this is the environmental temperature limit, not the temperature of the CPU. Your earlier post mentions “CPU temperature” but your graph is titled “Internal temp” with no data, and then your “Temp_Max” graph which does have data doesn’t say where this is, so we have to presume this is a sensor you have placed close to the CPU?
Either way, looking at your Temp_Max graph, if temperature were the issue I would have expected to see the reboots occurring at roughly the same temperature, but this doesn’t happen. Some reboots happen at ~58, some at ~59, some at ~62. None of those temperatures are unusual for CPU temp in a device like the Cerbo.
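The same check can be done on logged data rather than by eye. A small Python sketch, assuming you have the temperature samples as a plain list and that a crash shows up as a sudden drop between two consecutive samples. The function name, the 5°C drop threshold and the sample values are all made up for illustration.

```python
def pre_crash_temps(samples, min_drop=5.0):
    """Return the temperature recorded just before each sudden drop.

    A "drop" is any fall of at least min_drop between consecutive samples,
    which is treated here as a crash followed by a cool-down.
    """
    peaks = []
    for previous, current in zip(samples, samples[1:]):
        if previous - current >= min_drop:
            peaks.append(previous)
    return peaks

# Made-up samples in degC, roughly shaped like the graph in the thread.
samples = [55, 57, 58, 45, 50, 56, 59, 44, 52, 60, 62, 46]

peaks = pre_crash_temps(samples)
if peaks:
    print("temperature before each crash:", peaks)
    print("spread:", max(peaks) - min(peaks), "degC")
```

With these made-up samples the pre-crash values are 58, 59 and 62. A spread of several degrees like that points away from a fixed thermal cut-off, which would trip at nearly the same value each time.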
Poor power supply, or poor cabling to the device.
I’ve seen boards like this, especially Pi, where a sudden uptick in CPU load caused a reboot, and it was because of a poor fuse holder: the sudden current demand caused a large voltage drop over the bad connection (due to a dirty fuse contact, or poor tension in the spring that holds the fuse), and as soon as the voltage drops below the critical point, even for a tiny period, the board reboots. With Pi, it’s often due to USB-C PSUs that claim to be 2A but struggle to keep a 5V supply at even 1.2A: fine for charging a phone, but not good enough for powering a Pi. The Cerbo is supplied with a very long power cable, and I’ve seen these pinched and damaged, leading to a weak supply. If you can, shorten this cable.
The Pi runs on 5V, so you don’t have much headroom there, and 2A is a fair bit for a cheap phone charger to supply. With the Cerbo, as @dognose says, the accepted voltage range is wide, so you can supply it directly from a 48V battery, which keeps the current very low (typically around 10mA when powered from a 48V LiFePO4), so your cabling can be lightweight.
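A quick Ohm’s law illustration of why the same bad connection hurts a 5V board far more than a device fed from 48V. The 0.5 ohm contact resistance is a value I invented for the example; the voltages and currents are the ones mentioned above.

```python
def voltage_at_device(supply_v, current_a, contact_resistance_ohm):
    """Voltage left at the device after the drop across a resistive contact."""
    return supply_v - current_a * contact_resistance_ohm

BAD_CONTACT_OHM = 0.5  # hypothetical dirty fuse holder

cases = {
    "Pi at 5V, 1.2A": (5.0, 1.2),
    "Cerbo at 48V, 10mA": (48.0, 0.010),
}

for name, (supply, current) in cases.items():
    at_device = voltage_at_device(supply, current, BAD_CONTACT_OHM)
    drop_pct = 100 * (supply - at_device) / supply
    print(f"{name}: {at_device:.3f} V at the device ({drop_pct:.2f}% drop)")
```

In this example the Pi loses about 0.6V of its 5V, while the Cerbo loses about 5mV of 48V. Note that this only covers steady current; short inrush peaks can be much higher than the average draw.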
Settings;
There is an option to reboot the Cerbo if the connection to VRM drops - this can resolve lock ups, but I feel that it gets used as a first resort, when it should be the last resort. If you have a flaky internet connection, target that rather than presuming the Cerbo is locked up and a reboot will fix it.
See Settings > VRM Online Portal to see if this setting is enabled;
You can also control how long the connection must be down for before a reboot will be forced.
Memory card corruption (SD card on Pi)
A corrupt memory card or corrupt onboard memory can cause seg-faults or kernel panics, which will usually cause a lockup or a reboot. It seems you have factory-reset which is a good step. SD cards on Pi have an extremely variable range of power consumption and read/write speed, even in cards that are all stamped as Class10, and swapping cards is often one of the early troubleshooting steps. Obviously this is not possible on Cerbo, so a factory reset like you have done is the best you can do.
Seg-faults caused by corrupt data.
I’ve seen Pi lock up when flooded with a packet storm (over Ethernet) which was caused by a desk leg being placed on the Ethernet cable, crushing the cable. I’ve also seen whole networks get taken down by packet storms caused by cable damage, even with switches that auto-partition ports when they see too many malformed packets.
My experience with VE.Can and VE.Bus is limited, but there will be a good reason why Victron recommend you use moulded, professional (Cat6? or Cat5e?) data cables between the MP and the Cerbo, and not hand-crimped cables made from RJ45 plugs and random Cat5 cable.
I would not be surprised if data cables running close to certain things, like AC power cables or relays, were picking up interference. In data cabling work I’ve done, we never ever ran data cables close to the ballast of fluorescent light fittings for this reason.
Overloaded memory, CPU, or both
This can be caused by a rogue addition to the Cerbo, a bad NodeRed flow, corrupt data (as above) causing the Cerbo to be stuck in a loop, plus many other issues. CPU Temperature is a good proxy for CPU load, as is D-Bus RTT (round trip time). Your D-Bus RTT is very low (<=4ms), so this doesn’t seem to be the case for you.
One note is that as more devices get added to the system, the CPU load on the Cerbo will naturally go up, but for your system the number of devices looks modest.
Thank you @RoarPowerNZ for your time and your detailed post, I appreciate it.
About the temperature graph I attached to my previous post: it is the CPU temperature, from a probe inside the CPU. The data is accessible from a Linux file. It is common to see threshold temperatures above 80°C. The 50°C in the Cerbo datasheet is indeed the ambient temperature limit.
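For reference, reading such a file takes only a few lines of Python. This is a sketch, not the script used in this thread. It assumes the generic Linux thermal sysfs path `/sys/class/thermal/thermal_zone0/temp`, which reports millidegrees Celsius on most boards; I have not confirmed that the Cerbo exposes its CPU sensor at exactly this path, so check which file exists on your device.

```python
from pathlib import Path

# Assumption: generic Linux thermal sysfs path, may differ on your device.
THERMAL_FILE = "/sys/class/thermal/thermal_zone0/temp"

def parse_millidegrees(raw):
    """Convert the raw file content (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0

def read_cpu_temp(path=THERMAL_FILE):
    """Return the CPU temperature in degrees C, or None if it cannot be read."""
    try:
        return parse_millidegrees(Path(path).read_text())
    except (OSError, ValueError):
        return None

temp = read_cpu_temp()
print("CPU temperature:", "unavailable" if temp is None else f"{temp:.1f} degC")
```

Returning `None` instead of raising keeps a logging loop alive if the file is missing or briefly unreadable.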
About the power supply: my Cerbo used to be powered straight from the 50Vdc Pylontech US2000 batteries. When I followed @dognose’s advice, I tested with a stable laboratory power supply (set to 30Vdc). And to remove any doubt about the cable or fuse/fuse holder integrity, I used the cable supplied with the Cerbo (I cut one side though, because it’s way too long) without a fuse (I set the lab power supply to a high but not critical max current).
About the automatic reboot when the VRM connection drops: it has never been activated. Also, as you mention the case of a data flood or DDoS attack, I tried unplugging the Ethernet cable, but the problem remained. And indeed, I use moulded cables (Ethernet) and Victron cables (CAN bus, VE.Direct, VE.Bus).
About the corrupted data: I remember now that I had a VRM alarm about it on the very first day or days the issue appeared. That’s why I first reflashed the /data partition, and it fixed the problem. Is there something else I can do?
And about the overloaded CPU and RAM, that is not my case. I checked, and they stay way below the red line: between 8 and 13% for the CPU, with some maximum values measured at 38% (rarely). The RAM is stable, no overflow.
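For others who want to take the same kind of measurement, here is a minimal Python sketch based on the standard Linux `/proc/stat` and `/proc/meminfo` files. It is not the script used above, and the function names are my own. CPU usage is computed from the difference between two snapshots of the first `cpu` line, counting idle plus iowait as idle time.

```python
import time

def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:9]]  # user .. steal
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return idle, sum(fields)

def cpu_busy_percent(before, after):
    """CPU usage in percent between two (idle, total) snapshots."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - idle_delta / total_delta)

def parse_meminfo(text):
    """Return a dict of /proc/meminfo values in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info

def memory_used_percent(info):
    """Used memory in percent, based on MemAvailable (MemFree as fallback)."""
    total = info["MemTotal"]
    available = info.get("MemAvailable", info.get("MemFree", 0))
    return 100.0 * (total - available) / total

def snapshot():
    """Read the aggregate CPU counters once."""
    with open("/proc/stat") as handle:
        return parse_cpu_line(handle.readline())

def sample(interval_s=1.0):
    """Return (cpu_percent, memory_percent) measured over interval_s seconds."""
    first = snapshot()
    time.sleep(interval_s)
    second = snapshot()
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    return cpu_busy_percent(first, second), memory_used_percent(info)
```

Calling `sample()` in a loop and appending the result to a file on /data gives a trace that survives a crash, so the last values before each reboot can be compared.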
Please Note: this driver is not supported on CCGX due to it’s limited system resources. Installation on CCGX can cause random reboots.
It made me think if there was a possibility that the root cause of reboots in the CCGX situation might be the same as your situation.
After you did a factory reset of your Cerbo, what tools/packages/extras have you loaded?
Thank you @RoarPowerNZ for thinking about my case. Unfortunately, I don’t use this library, nor any other; the /data partition only has my script and a bash script to start it from crontab (when activated). No library is imported.
the /data partition only has my script and a bash script to start it from crontab (when activated).
How often are the reboots? Your temperature graph implies 8 in one day, but is this a typical day?
If so, then we could say that no reboots in a 24h period is probably proof that the problem is fixed?
I would remove or disable your script in the /data dir; after that, if you get even one reboot, it is unlikely to have anything to do with the script.
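One way to answer the "how often" question objectively is a heartbeat file: something on the device appends a timestamp every minute, and any gap much longer than a minute is counted as an outage (crash, hang or slow reboot). A minimal Python sketch of the gap detection follows. The function name, the 60s interval and the 30s slack are assumptions, and it presumes the device clock is correct again after each boot.

```python
def find_outages(heartbeats, interval_s=60, slack_s=30):
    """Return (last_seen, gap_seconds) for every gap longer than expected.

    heartbeats: sorted Unix timestamps written by a periodic job.
    """
    outages = []
    for earlier, later in zip(heartbeats, heartbeats[1:]):
        gap = later - earlier
        if gap > interval_s + slack_s:
            outages.append((earlier, gap))
    return outages

# Made-up example: regular beats, then 13 minutes of silence.
beats = [0, 60, 120, 900, 960]
for last_seen, gap in find_outages(beats):
    print(f"outage after t={last_seen}s lasting about {gap}s")
```

Counting the outages per 24h before and after a change (script disabled, different power supply) gives a cleaner comparison than reading reboots off a temperature graph.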
What voltage are you supplying to your Cerbo GX? I went through three of them. I was running on 48V. I put in a simple Amazon 48V DC to 12V DC converter to power the Cerbo GX. It has been brilliant since. An added bonus is that I now have a rock-solid 12V supply to use on other devices.