From time to time my Cerbo shows unwanted reboots with a loss of power production. Since Venus still records logged-in users and devices right up to the shutdown, it does not look like an unexpected hardware problem. 10.10.20.158 is my PC with a port 1881 connection to NR, and ttyS0 connects a SmartShunt.
Is there any way to see the reason from the system side?
Last login: Sat Jul 26 07:00:40 2025
root@einstein:~# last
root pts/0 10.10.20.158 Sat Jul 26 07:59 still logged in
root ttyS0 Sat Jul 26 07:00 still logged in
reboot system boot 5.10.109-venus-1 Sat Jul 26 07:00 - 07:59 (00:59)
wtmp begins Sat Jul 26 07:00:30 2025
root@einstein:~#
It seems I will have to change from the Cerbo to an Ekrano. The CPU load is probably caused by the large image and by transferring SunSpec data from the Ziehl relay. Should I try increasing the polling interval from 1 second to 5 or 10 seconds, add more memory, or are there other solutions?
root@einstein:/var/log# cat messages.0
...
...
Jul 26 02:29:18 einstein daemon.info connmand[761]: ntp: time slew +0.172154 s
Jul 26 04:29:19 einstein daemon.info connmand[761]: ntp: time slew +0.174667 s
Jul 26 06:29:19 einstein daemon.info connmand[761]: ntp: time slew +0.173114 s
Jul 26 06:57:53 einstein daemon.err watchdog[555]: loadavg 7 7 6 is higher than the given threshold 0 6 6!
Jul 26 06:57:54 einstein daemon.err watchdog[555]: repair binary /usr/sbin/store_watchdog_error.sh returned 253 = 'load average too high'
Jul 26 06:57:54 einstein daemon.alert watchdog[555]: shutting down the system because of error 253 = 'load average too high'
Jul 26 06:57:54 einstein daemon.err watchdog[555]: /usr/sbin/sendmail does not exist or is not executable (errno = 2)
Jul 26 06:58:04 einstein syslog.info syslogd exiting
Jul 26 07:00:32 einstein syslog.info syslogd started: BusyBox v1.31.1
Jul 26 07:00:32 einstein user.notice kernel: klogd started: BusyBox v1.31.1 (2025-01-28 13:25:27 UTC)
Jul 26 07:00:32 einstein user.info kernel: [ 0.000000] Booting Linux on physical CPU 0x0
Jul 26 07:00:32 einstein user.notice kernel: [ 0.000000] Linux version 5.10.109-venus-17 (oe-user@oe-host) (arm-ve-linux-gnueabi-gcc (GCC) 9.5.0, GNU ld (GNU Binutils) 2.34.0.20200910) #1 SMP Tue Jan 28 14:19:01 UTC 2025
Or look at the mods or Node-RED flows you have, dial back polling intervals, etc.
Sometimes mods just need an update.
If you run top, you should be able to see what is eating up the CPU time.
Of course it's NR, as its JavaScript is super inefficient. Sometimes dbus_systemcalc jumps to the top; the probability of a reboot looks like a beat-frequency problem between those two. Maybe dropping gui-v1 will save a few percent? Another thing to try is always closing the NR editor tab after deploying.
Mem: 634136K used, 395948K free, 35312K shrd, 43612K buff, 226880K cached
CPU: 58% usr 14% sys 0% nic 21% idle 0% io 2% irq 2% sirq
Load average: 3.34 4.78 4.74 4/303 32050
PID PPID USER STAT VSZ %VSZ %CPU COMMAND
1909 990 nodered S 248m 25% 18% node-red
1006 977 root S 224m 22% 11% /opt/victronenergy/gui-v2/venus-gui-v2
1062 1034 root S 26088 3% 7% {dbus_systemcalc} /usr/bin/python3 -u /opt/victronenergy/dbus-systemcalc-py/db
1022 996 root S 29724 3% 6% {localsettings.p} /usr/bin/python3 -u /opt/victronenergy/localsettings/localse
1007 967 root S 44392 4% 5% {vrmlogger.py} /usr/bin/python3 -u /opt/victronenergy/vrmlogger/vrmlogger.py
689 687 messageb S 4060 0% 3% dbus-daemon --system --nofork
1827 1822 root S 4116 0% 2% /opt/victronenergy/vecan-dbus/vecan-dbus -c socketcan:can0 --banner --log-befo
1773 1013 www-data S 7544 1% 2% nginx: worker process
1934 1932 root S 55004 5% 2% /usr/bin/flashmq
1841 1834 root S 38308 4% 2% {dbus-modbus-cli} /usr/bin/python3 -u /opt/victronenergy/dbus-modbus-client/db
997 969 root S 57212 6% 1% /opt/victronenergy/venus-platform/venus-platform
1053 1026 root S 51172 5% 1% /opt/victronenergy/hub4control/hub4control
1664 1628 root S 3796 0% 1% /opt/victronenergy/mk2-dbus/mk2-dbus --log-before 25 --log-after 25 --banner -
1068 1046 root S 53064 5% 1% /opt/victronenergy/dbus-fronius/dbus-fronius
1828 1824 root S 3184 0% 1% /opt/victronenergy/can-bus-bms/can-bus-bms --log-before 25 --log-after 25 -vv
1003 986 root S 3176 0% 1% {serial-starter.} /bin/bash /opt/victronenergy/serial-starter/serial-starter.s
1699 1691 root S 3376 0% 0% /opt/victronenergy/vedirect-interface/vedirect-dbus -v --log-before 25 --log-a
1787 1784 root S 3184 0% 0% /opt/victronenergy/can-bus-bms/can-bus-bms --log-before 25 --log-after 25 -vv
1069 1042 root S 48776 5% 0% {dbus-modbus-cli} /usr/bin/python3 -u /opt/victronenergy/dbus-modbus-client/db
30929 30499 root R 2784 0% 0% top
1080 1048 root S 25268 2% 0% {dbus_digitalinp} /usr/bin/python3 -u /opt/victronenergy/dbus-digitalinputs/db
1060 1028 root S 23136 2% 0% {dbus_vebus_to_p} /usr/bin/python3 -u /opt/victronenergy/dbus-vebus-to-pvinver
1081 1073 root S 3416 0% 0% /opt/victronenergy/dbus-adc/dbus-adc --banner
1627 1625 root S 3244 0% 0% /opt/victronenergy/vedirect-interface/vedirect-dbus -v --log-before 25 --log-a
1648 1642 root S 3244 0% 0% /opt/victronenergy/vedirect-interface/vedirect-dbus
Node-RED is high, but on average the CPU is capable of coping with the workload. The watchdog threshold sitting at 6 is too trigger-happy. I've set it to 10, and my htop currently looks like this:
I found that the gui-v1 service occasionally consumes quite some CPU, even though I'm no longer using it.
svc -d /service/start-gui stops it, but be aware that this also terminates the VNC service, which is used to display the Remote Console (for either GUI version) through VRM.
Then you can only access the local UI remotely when you have a VPN connection and can reach the device IP directly.
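If you want gui-v1 to stay stopped after a reboot, my approach (please check the Venus OS root-fs documentation for your firmware before copying this) is to put the command into /data/rc.local, which as far as I recall is executed at the end of boot and survives firmware updates:

#!/bin/sh
# /data/rc.local: runs at the end of boot on Venus OS (my recollection, please verify)
# make it executable first: chmod +x /data/rc.local
svc -d /service/start-gui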
Other than that, run df -h and check the remaining disk space on the root partition. With the large image there is little room, and when your device fills it up, it will become unstable.
The large image also ships with Signal K, which I'm not using at all, so I dropped that as well (it is quite huge): rm -rf /usr/lib/node_modules/signalk-server
(Advice as a user, not as Victron staff; do this at your own risk.)
Closing the NR editor tab saves about 2% on average for the NR task.
Erasing Signal K frees up about a quarter of the root partition.
Continuing with further ideas and observations:
root@einstein:~# df -h
Filesystem Size Used Available Use% Mounted on
/dev/root 1.1G 1.0G 50.5M 95% /
/dev/mmcblk1p5 1.1G 81.2M 952.6M 8% /data
tmpfs 503.0M 956.0K 502.0M 0% /service
root@einstein:~# rm -rf /usr/lib/node_modules/signalk-server
root@einstein:~# df -h
Filesystem Size Used Available Use% Mounted on
/dev/root 1.1G 767.6M 328.2M 70% /
/dev/mmcblk1p5 1.1G 81.2M 952.7M 8% /data
root@einstein:~#
I am collecting Modbus TCP/SunSpec data from various sources: Ziehl, ABB, Kaco and Fronius devices. For the moment I am trying to reduce protocol handshake overhead by reading a ~200-byte block of registers per device and dropping the unused values in between, instead of reading 20 scattered doublewords (the IEEE float variants) separately for each value from each device.
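For illustration, a rough sketch of the parsing side as a function node; in my real flow the buffer-parser node does this job, and the register offsets and value names below are made up:

// Decode one large Modbus read (FC3) into the few floats actually used.
// msg.payload is assumed to be the array of 16-bit registers from the Modbus Read node.
const regs = msg.payload;

// Pack the registers into a Buffer (SunSpec uses big-endian word order).
const buf = Buffer.alloc(regs.length * 2);
regs.forEach((r, i) => buf.writeUInt16BE(r, i * 2));

// Hypothetical register offsets of the values we keep; the rest of the block
// is read once and simply ignored, saving one request/response per value.
const wanted = { acPower: 40, dcVoltage: 62, dcCurrent: 64 };

msg.payload = {};
for (const [name, off] of Object.entries(wanted)) {
    msg.payload[name] = buf.readFloatBE(off * 2);   // IEEE-754 float32
}
return msg;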
SOC and power regulation is done with a PID regulator: simple code for a regulator cascade plus a state machine, with the cycle time throttled to 10 seconds.
let error = (flow.get("pv2") - flow.get("sv2"));
msg.errsum += error;                                  // accumulate integral term
if (msg.errsum > 120000) msg.errsum = 120000;         // integral anti-windup clamp
if (msg.errsum < -120000) msg.errsum = -120000;       // integral anti-windup clamp
msg.out = (Kp * error) + (Ki * msg.errsum) + Kd * (error - msg.errold); // PID regulator
msg.errold = error;                                   // save error for next cycle
msg.out = Math.round(msg.out) * -1;                   // negative = grid feed-in, positive = battery load
Updating values on the Dashboard V1 seems to take quite some time. 4 gauges, 3 radar charts and 3 dropdown menus display roughly two dozen important values. The digital clock display sometimes skips a second, which is a hint that the HTTP server is busy. There are no other chart diagrams; I already dropped the 24-hour and shorter charts for frequency and mains voltage long ago for performance reasons. I am considering moving this job to InfluxDB, or maybe trying the VRM API for it.
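One idea I have not tried yet (so only a sketch, the names and the deadband are mine): put a report-by-exception filter in front of the dashboard nodes, so the HTTP/websocket side only sees real changes. The built-in filter (rbe) node does something similar.

// Only pass the message on when the value changed by more than a small deadband.
// msg.topic identifies the value, msg.payload is assumed to be numeric.
const DEADBAND = 5;                          // hypothetical threshold, e.g. watts
const last = context.get(msg.topic) || 0;

if (Math.abs(msg.payload - last) < DEADBAND) {
    return null;                             // drop the update, nothing changed
}
context.set(msg.topic, msg.payload);
return msg;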
The FlowFuse dashboard is already on my to-do list. It is delayed because I won't try such a change on my 100 kWp production installation, just as I avoid running the latest Venus and beta versions there. On the other hand, the functionality cannot be tested without real data from the devices in use.
The only installed library nodes are
node-red-dashboard (deprecated V1)
node-red-contrib-modbus (only the Modbus Read node in use, to access SunSpec)
node-red-contrib-buffer-parser (only the buffer parser node in use, to convert IEEE floats)
My first attempt was to start with FHEM, but for active control (not only visualisation), using the pre-installed Node-RED with the 24/7 availability of Venus seems an unbeatable benefit.
To repeat myself: Average load is high, the watchdog does as it has been told and issues a reset of the system. Tweaking the workflows may or may not help, but if the CPU load is intermittent, load average will come down eventually.
In my experience, setting the watchdog threshold higher in its config file /etc/watchdog.conf
can eliminate the reboots. If the workload is far too high, other measures will be needed.
For more than a year now, I’ve been running the following settings:
Your htop headline with the calculated uptime since the last boot seems useful.
How did you install this on Venus without apt-get? Did you compile it from source using make?
The last downtime was due to scheduled maintenance (installation of more battery capacity).
I'm running 8 MPPT 150/35 on VE.Direct, a Lynx Shunt and an MPPT 150/70 on VE.Can, a three-phase MultiPlus-II 48/5000 ESS, 8x Pylontech US5000 and two Shelly 3EM. The ESS is governed by Node-RED. Load is high, but operation is stable.
Obviously, Modbus TCP operations sometimes take a lot of CPU time. After changing the data transfer from single scattered float values to 100-byte block transfers, the occasional peaks in CPU time for NR seem to be smaller. The typical average for the NR task is now about 15-20% CPU time. There is still roughly one reboot per day.
Using vi (is there anything else on Venus?) I changed watchdog.conf. What are the values max-load-5 and max-load-15, and do I have to change both?
While I was writing here, another reboot happened, so I also changed max-load-15 = 8.
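For reference, the lines in question in /etc/watchdog.conf look roughly like the sketch below (the values other than max-load-15 are placeholders taken from the "0 6 6" threshold in the watchdog log above). As far as I understand, the three keys are the 1-, 5- and 15-minute load-average thresholds, 0 disables a check, and each non-zero threshold is checked on its own; please correct me if that is wrong.

# /etc/watchdog.conf (excerpt, sketch)
max-load-1  = 0     # 1-minute load average: 0 = this check is disabled
max-load-5  = 6     # 5-minute load average threshold (placeholder, stock value)
max-load-15 = 8     # 15-minute load average threshold (the one I just raised)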
Another thing I need to understand is my flow: it works as expected when I use the "Deploy" button, but the same script reports a NaN problem when it autostarts after a reset.
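My own guess (not verified): after a reboot, msg.errsum and msg.errold do not exist on the first incoming message, so the first msg.errsum += error yields NaN. A sketch of keeping the regulator state in flow context with defined start values instead (untested, the gains are placeholders):

// Keep the PID state in flow context so a restart begins from defined values.
const Kp = 1.0, Ki = 0.01, Kd = 0;                 // placeholder gains, tuned elsewhere

let errsum = flow.get("errsum") || 0;              // integral state, 0 after a fresh boot
const errold = flow.get("errold") || 0;            // previous error, 0 after a fresh boot

const error = flow.get("pv2") - flow.get("sv2");

errsum += error;                                   // accumulate integral term
if (errsum > 120000) errsum = 120000;              // integral anti-windup clamp
if (errsum < -120000) errsum = -120000;

msg.out = Math.round(Kp * error + Ki * errsum + Kd * (error - errold)) * -1;
// negative = grid feed-in, positive = battery load

flow.set("errsum", errsum);
flow.set("errold", error);
return msg;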