Attempted to create Ethernet-to-LTE failover service: failed. But learned something interesting

I tried to post this in the Modifications forum, but received an error saying I don’t have enough experience – how stupid…

I have a Cerbo-GX, which has an Ethernet connection to the rest of my home network. It also has the official GX LTE 4G modem attached.

The Problem

As the documentation for the LTE modem states, if there’s an active Ethernet connection, that will take priority over LTE. Further, no check is made on whether the active Ethernet connection actually has Internet reachability, so LTE won’t take over when Ethernet is up but has no Internet connectivity. This makes sense (I’m plenty familiar with networking in Linux systems), but it’s also kind of a disappointment: if the system can’t determine that the Internet isn’t reachable through an active Ethernet connection, and thus fail over to LTE, then what’s the point?

The (Impossible) Solution

I set about writing my own daemontools service for the Cerbo that would actively monitor Internet reachability via eth0, and if that fails, somehow command the system to prefer the gateway on the ppp0 interface. I was plenty successful in writing a good service that runs perfectly on the Cerbo (including re-installing itself after a firmware update), but failed colossally in actually manipulating the route priority.
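For anyone wanting to try the same thing, the monitoring half can be sketched roughly as follows. This is not the actual service, only a minimal illustration: the target 1.1.1.1, the 2-second timeout, the threshold of three misses, and the names ping_command, is_reachable and FailoverTracker are all assumptions made up for the example. It also assumes the ping on the device accepts -I, -c and -W.

```python
import subprocess


def ping_command(interface, target="1.1.1.1", timeout_s=2):
    """Build a one-shot ping that is forced out through a specific interface."""
    return ["ping", "-I", interface, "-c", "1", "-W", str(timeout_s), target]


def is_reachable(interface, target="1.1.1.1", runner=subprocess.run):
    """Return True if a single ping sent via `interface` exits successfully."""
    result = runner(
        ping_command(interface, target),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class FailoverTracker:
    """Report a failure only after several consecutive misses (avoids flapping)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical value, tune to taste
        self.misses = 0

    def update(self, reachable):
        # Any success resets the counter; each miss increments it.
        self.misses = 0 if reachable else self.misses + 1
        return self.misses >= self.threshold
```

A service loop would call is_reachable("eth0") every few seconds, feed the result into update(), and act once update() returns True. The part that acts is exactly where things fell apart on Venus OS.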

I understand Venus OS uses connman to manage network connections, and because of this, trying to manipulate route priority by lowering the metric on ppp0 and raising the metric on eth0 has no effect (e.g. running ip route replace … doesn’t actually replace the route as expected, it just adds a new one). No matter what I tried, there was no way I could either remove the default gateway from the Ethernet interface (to prevent it from having any priority in route selection), or tweak the route metrics to cause the LTE interface to have a higher priority.
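To see which default route currently wins, it helps to list the default routes sorted by metric (the lowest metric is preferred, and a route with no metric shown counts as 0). Here is a rough sketch that parses the text printed by ip route show default. The sample output and the function name parse_default_routes are made up for illustration and are not taken from a real Cerbo.

```python
# Hypothetical sample of `ip route show default` output (addresses are examples).
SAMPLE_ROUTES = """\
default via 192.0.2.1 dev eth0
default dev ppp0 scope link metric 10
"""


def parse_default_routes(ip_route_output):
    """Return the default routes as dicts, sorted so the preferred one is first."""
    routes = []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        route = {"dev": None, "via": None, "metric": 0}
        # Walk adjacent token pairs and pick out the "keyword value" ones we need.
        for key, value in zip(tokens[1:], tokens[2:]):
            if key in ("dev", "via"):
                route[key] = value
            elif key == "metric":
                route["metric"] = int(value)
        routes.append(route)
    return sorted(routes, key=lambda r: r["metric"])


preferred = parse_default_routes(SAMPLE_ROUTES)[0]
print(preferred["dev"])
```

Running this before and after an ip route replace attempt makes it easy to confirm whether the change took effect or whether a second route was simply added alongside the first.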

While poking around connman, I noticed that only the Ethernet interface had a service in connman (and maybe the WiFi interface would, too, except I have that disabled on my Cerbo). Why wasn’t the LTE interface being managed by connman? Then I noticed that ppp0 was blacklisted in /etc/connman/main.conf. D’oh.
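A quick way to check this is to read the blacklist out of main.conf. The sketch below assumes the file uses connman's usual [General] section with a comma-separated NetworkInterfaceBlacklist option, and that entries are matched against the start of the interface name. The sample contents are invented for illustration and are not copied from Venus OS.

```python
import configparser

# Hypothetical example contents; the real file on a Cerbo may differ.
SAMPLE_MAIN_CONF = """\
[General]
NetworkInterfaceBlacklist = vmnet,vboxnet,virbr,ifb,ppp0
"""


def blacklisted_prefixes(main_conf_text):
    """Return the interface-name prefixes that connman is told to ignore."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(main_conf_text)
    raw = parser.get("General", "NetworkInterfaceBlacklist", fallback="")
    return [p.strip() for p in raw.split(",") if p.strip()]


def is_blacklisted(interface, main_conf_text):
    """True if the interface name starts with any blacklisted entry."""
    return any(interface.startswith(p) for p in blacklisted_prefixes(main_conf_text))


print(is_blacklisted("ppp0", SAMPLE_MAIN_CONF))
print(is_blacklisted("eth0", SAMPLE_MAIN_CONF))
```

On the device itself, the text would come from reading /etc/connman/main.conf instead of the sample string.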

The Discovery

To test my failover service, I set up a firewall filter rule in my router to drop traffic destined for the Internet that originates from the Cerbo IP address. That way, I could simulate a failure, but maintain an SSH connection and tail the log of my service to watch it work (or not work, as it were). From the SSH terminal, I’d try a ping 1.1.1.1 while that firewall rule was active, and there’d be no reply (so, obviously, my failover script didn’t work).

But I noticed something: I had a browser tab open to VRM, and my dashboard was showing “Last Updated: Real Time”. And data was indeed being updated live. How was this possible? I decided to view the list of TCP connections in my router, filtering out all traffic except what was going to/from the Cerbo IP. Nothing (except my SSH connection). Whut?

TL;DR

Then I realized: I bet Venus OS is maintaining some sort of tunnel back to the Victron cloud. And I also bet that the system is designed so that whatever service manages this tunnel, also looks for a ppp0 interface, and if one exists, explicitly uses that interface for the tunnel. If I’m right, that means all my time and effort spent on developing a failover solution was in vain. The downside to Victron’s strategy to explicitly tunnel through the LTE interface is that I’m chewing up LTE data unnecessarily (even if the use is minimal) when the tunnel could be formed over the Ethernet link for free.

Why post a forum topic?

While working on this, I couldn’t find any existing discussions on the topics of LTE and connman in Venus OS. I decided to post my experience just in case someone else might want to do the same thing, and maybe I’ll save them some trouble.

Also, it’d be nice if someone more experienced and knowledgeable about Venus OS (and Cerbo-GX) might chime in, especially if any of my assertions or discoveries are incorrect.

Did you restart the Cerbo after activating the firewall rule?
A previously established connection may still have been active when the rule was activated.

From what I’ve seen, yes, the Cerbo maintains active connections for three things:

  • reporting VRM historical data
  • reporting VRM realtime data
  • two-way communication for settings, updates, commands, etc.

All to Amazon servers…

It’s better to use netstat from inside the Cerbo to monitor active connections.
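As a rough illustration of that idea, the sketch below groups established TCP connections from netstat -tn output by local IP address. Comparing those local addresses with the addresses assigned to eth0 and ppp0 would show which interface each connection actually leaves through. The sample output, the addresses in it, and the function name established_by_local_ip are made up for the example.

```python
from collections import defaultdict

# Hypothetical `netstat -tn` output; all addresses are illustrative.
SAMPLE_NETSTAT = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.0.2.50:22           192.0.2.10:51234        ESTABLISHED
tcp        0      0 10.64.0.7:43210         198.51.100.20:443       ESTABLISHED
tcp        0      0 10.64.0.7:43212         198.51.100.21:8883      TIME_WAIT
"""


def established_by_local_ip(netstat_output):
    """Map each local IP to the remote endpoints of its ESTABLISHED TCP connections."""
    grouped = defaultdict(list)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the two header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        local_ip = local.rsplit(":", 1)[0]  # strip the port
        grouped[local_ip].append(remote)
    return dict(grouped)


for ip, remotes in established_by_local_ip(SAMPLE_NETSTAT).items():
    print(ip, remotes)
```

If the connections to the VRM servers show a local address that belongs to ppp0, that would support the theory that they go out over LTE even while Ethernet is up.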