host-device: copy host interface IP addresses and routes into container #1257

Open
SchSeba wants to merge 1 commit into containernetworking:main from SchSeba:host-device-l3-info

SchSeba commented May 4, 2026

Contributor

Add a new configuration option useInterfaceNetwork that instructs the host-device plugin to capture the interface's IP addresses and routes from the host before moving the device into the container namespace, and then apply them inside the container.

This is critical for virtual environments (AWS, IBM Cloud, GCP) where the cloud provider configures IP addresses and routes directly on the network device. In these environments, there is no traditional IPAM source; the ground truth for L3 configuration lives on the host interface itself.

When useInterfaceNetwork is enabled, the plugin:

  • Captures all global-scope addresses and non-local routes from the host device before moving it into the container namespace.
  • Applies the captured addresses and routes to the interface inside the container.
  • Reports the addresses and routes in the CNI result (merged with any IPAM result if an IPAM plugin is also configured).

NOTE: The interface configuration on the host node must be persistent. When the device is moved back to the host (via DEL) and renamed to its original name, the system's network management service (e.g. NetworkManager, systemd-networkd, cloud-init, or cloud-specific agents) is expected to detect the device and re-apply the IP addresses and routes. This plugin does NOT re-configure the host interface on DEL; it relies on the node's network configuration being declarative and reconciled by the platform's networking stack.

Also implements the STATUS command to verify the host device exists, replacing the previous TODO stub.
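
As a rough illustration of how the new option might appear in a network configuration, here is a minimal Go sketch that parses a host-device config with `useInterfaceNetwork` set. The `netConf` struct and `parseConf` helper are hypothetical and trimmed down for this example; only the `useInterfaceNetwork` key comes from this PR, and `device` is an existing host-device option.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// netConf is a hypothetical, reduced view of the host-device configuration.
// It is not the plugin's real NetConf type.
type netConf struct {
	Type                string `json:"type"`
	Device              string `json:"device,omitempty"`
	UseInterfaceNetwork bool   `json:"useInterfaceNetwork,omitempty"`
}

// parseConf decodes the JSON config; the flag defaults to false when absent.
func parseConf(data []byte) (*netConf, error) {
	conf := &netConf{}
	if err := json.Unmarshal(data, conf); err != nil {
		return nil, fmt.Errorf("failed to parse config: %w", err)
	}
	return conf, nil
}

func main() {
	raw := []byte(`{
	  "cniVersion": "1.0.0",
	  "name": "cloud-nic",
	  "type": "host-device",
	  "device": "eth1",
	  "useInterfaceNetwork": true
	}`)
	conf, err := parseConf(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("device=%s useInterfaceNetwork=%t\n", conf.Device, conf.UseInterfaceNetwork)
}
```

Because the field uses `omitempty` and a plain bool, existing configurations that never mention the option keep their current behavior.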

SchSeba force-pushed the host-device-l3-info branch 2 times, most recently from 0feea32 to df398aa (May 4, 2026 17:27)

SchSeba commented May 4, 2026

Contributor Author

Hi @s1061123 @squeed @LionelJouin, if you have time, please take a look at the PR.
This is critical for us to support virtual clusters running on clouds where the VFs are passed into the cluster's VM nodes with network configuration.

localRouteTable = 255
)

// HostNetworkStateFile holds the captured host-side L3 configuration

Contributor

Personal preference: no doc comment unless the symbol is exported.

Contributor Author

Done - removed doc comments from all unexported symbols.


// HostNetworkStateFile holds the captured host-side L3 configuration
// (addresses, routes, and rules) that should be applied to the container interface.
type HostNetworkStateFile struct {

Contributor

The File suffix is not very nice, since this is not really a file. Maybe InterfaceInfo, InterfaceConfig, or just Interface?

Contributor Author

Good call - renamed HostNetworkStateFile to HostNetworkState throughout.

type HostNetworkStateFile struct {
HostIfName string `json:"hostIfName"`
HostLinkWasUp bool `json:"hostLinkWasUp"`
Addresses []string `json:"addresses,omitempty"`

Contributor

Can we use netlink.Addr, netlink.Route, and netlink.Rule?

Contributor Author

The string-based representation is intentional here - netlink.Addr, netlink.Route and netlink.Rule contain net.IP / *net.IPNet fields that don't serialize cleanly to JSON (net.IP marshals as a string, but the net.IPMask inside *net.IPNet marshals as a base64 byte array, so prefixes are not human-readable). Using strings gives us portable, human-readable JSON and avoids coupling the serialization format to the netlink library's internal types.

We do convert back to netlink types when applying (applyOnLink), so the actual netlink interaction is the same.
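
To make the string-based approach concrete, here is a minimal sketch using only the Go standard library. The `state` type, `roundTrip`, and `parseAddresses` are hypothetical names for illustration; only the shape of the `Addresses` field (a list of CIDR strings) is taken from the diff above, and the real code converts to netlink types rather than `*net.IPNet`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// state is a hypothetical stand-in for the PR's state struct.
type state struct {
	Addresses []string `json:"addresses,omitempty"`
}

// roundTrip serializes the state to JSON and reads it back.
func roundTrip(s *state) (*state, error) {
	data, err := json.Marshal(s)
	if err != nil {
		return nil, err
	}
	out := &state{}
	if err := json.Unmarshal(data, out); err != nil {
		return nil, err
	}
	return out, nil
}

// parseAddresses converts stored CIDR strings back into *net.IPNet values,
// keeping the interface address (not the network address) in the IP field.
func parseAddresses(s *state) ([]*net.IPNet, error) {
	out := make([]*net.IPNet, 0, len(s.Addresses))
	for _, a := range s.Addresses {
		ip, ipnet, err := net.ParseCIDR(a)
		if err != nil {
			return nil, fmt.Errorf("invalid address %q: %w", a, err)
		}
		ipnet.IP = ip
		out = append(out, ipnet)
	}
	return out, nil
}

func main() {
	in := &state{Addresses: []string{"10.0.0.5/24", "2001:db8::1/64"}}
	out, err := roundTrip(in)
	if err != nil {
		panic(err)
	}
	nets, err := parseAddresses(out)
	if err != nil {
		panic(err)
	}
	for _, n := range nets {
		fmt.Println(n.String())
	}
}
```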


// applyNetworkStateToPod applies captured state to the moved interface inside the pod namespace.
func applyNetworkStateToPod(containerNs ns.NetNS, contDev netlink.Link, state *HostNetworkStateFile) error {
if state == nil {

Contributor

Having that check indicates, IMO, that this function should maybe be a method on the struct.

Contributor Author

Agreed - converted applyNetworkStateToPod and applyNetworkStateOnLink to methods on *HostNetworkState (applyToPod and applyOnLink). The nil-receiver check now reads more naturally.
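
For readers unfamiliar with the pattern, the following is a small sketch of a nil-receiver check on a pointer method. The `hostNetworkState` type and its `apply` method are hypothetical simplifications, not the PR's `applyToPod` or `applyOnLink`; the callback stands in for the netlink calls.

```go
package main

import "fmt"

// hostNetworkState is a hypothetical, reduced version of the PR's state type.
type hostNetworkState struct {
	Addresses []string
}

// apply calls addAddr for every captured address.
// A nil receiver means nothing was captured and is treated as a no-op.
func (s *hostNetworkState) apply(addAddr func(string) error) error {
	if s == nil {
		return nil
	}
	for _, a := range s.Addresses {
		if err := addAddr(a); err != nil {
			return fmt.Errorf("failed to add address %s: %w", a, err)
		}
	}
	return nil
}

func main() {
	var none *hostNetworkState // e.g. the option is disabled
	fmt.Println(none.apply(func(string) error { return nil }))

	some := &hostNetworkState{Addresses: []string{"10.0.0.5/24"}}
	fmt.Println(some.apply(func(a string) error {
		fmt.Println("would add", a)
		return nil
	}))
}
```

Calling a method on a nil pointer is legal in Go as long as the method does not dereference the receiver before the check, which is why the guard sits at the top.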

},
},
}
mergeNetworkStateIntoResult(result, state)

Contributor

IMO, if the user decides to keep the network config from the host, then we should ignore IPAM or block the combination with IPAM in the config. I think it is either IPAM or host network config (no IP is also a valid config).

Contributor Author

I see the point, but there are valid use cases for the combination: the host interface provides the base L3 config (addresses/routes from the cloud provider), and IPAM adds additional addresses on top (e.g. secondary IPs, service IPs). Blocking the combination would reduce flexibility for users who need both.

The current merge behavior is additive - IPAM addresses are appended alongside host-captured ones. If you feel strongly, we could add a validation warning instead of an error, but I'd prefer to keep the flexibility. WDYT?
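
A rough sketch of what an additive merge could look like, using plain CIDR strings. The `mergeAddresses` helper is hypothetical and is not the PR's `mergeNetworkStateIntoResult`; in particular, dropping exact duplicates while keeping the first occurrence is an assumption made for this example.

```go
package main

import "fmt"

// mergeAddresses returns host-captured addresses followed by IPAM addresses.
// Assumption: exact duplicates are dropped, keeping the first occurrence.
func mergeAddresses(host, ipam []string) []string {
	seen := make(map[string]struct{}, len(host)+len(ipam))
	merged := make([]string, 0, len(host)+len(ipam))
	for _, list := range [][]string{host, ipam} {
		for _, a := range list {
			if _, dup := seen[a]; dup {
				continue
			}
			seen[a] = struct{}{}
			merged = append(merged, a)
		}
	}
	return merged
}

func main() {
	host := []string{"10.0.0.5/24"}                // primary IP from the cloud provider
	ipam := []string{"10.0.0.5/24", "10.0.0.9/24"} // IPAM repeats one and adds a secondary IP
	fmt.Println(mergeAddresses(host, ipam))
}
```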

Contributor

What about conflicting default routes? IPs on the same prefix?

Contributor Author

Good question. Both cases are handled gracefully:

  • Conflicting IPs on the same prefix: applyOnLink wraps every AddrAdd / RouteAdd call with isAlreadyExistsErr - if the host-captured address or route already exists (because IPAM configured the same one), the duplicate is silently ignored.
  • Conflicting default routes: Linux allows multiple default routes as long as their metrics differ. If both the host config and IPAM supply a default route with the same gateway+metric, the second RouteAdd returns EEXIST which we handle. If they differ in metric, both coexist as expected.

The order of operations is: host addresses/routes are applied first (in applyToPod), then IPAM runs and calls ConfigureIface. So IPAM is always the "last writer" for identical entries, and duplicates are harmlessly skipped.

In practice, the combination is intended for cases like: host provides the primary IP/routes from the cloud provider, and IPAM adds secondary IPs or service-specific routes on top - they typically won't conflict.
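
The body of isAlreadyExistsErr is not shown in this thread, so the following is only a guess at an equivalent, written with the standard library. The names `isAlreadyExists`, `addIgnoringExisting`, `errExists`, and `errPerm` are hypothetical; the callback stands in for an AddrAdd or RouteAdd call.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Sample errors used to exercise the helper below.
var (
	errExists = fmt.Errorf("route add: %w", syscall.EEXIST)
	errPerm   = fmt.Errorf("route add: %w", syscall.EPERM)
)

// isAlreadyExists reports whether err is, or wraps, EEXIST.
func isAlreadyExists(err error) bool {
	return errors.Is(err, syscall.EEXIST)
}

// addIgnoringExisting runs add and swallows only "already exists" errors.
func addIgnoringExisting(add func() error) error {
	if err := add(); err != nil && !isAlreadyExists(err) {
		return err
	}
	return nil
}

func main() {
	// A duplicate entry is skipped without error.
	fmt.Println(addIgnoringExisting(func() error { return errExists }))
	// Any other failure is still reported.
	fmt.Println(addIgnoringExisting(func() error { return errPerm }))
}
```

Note that this pattern only covers entries that are identical from the kernel's point of view; two default routes with different metrics do not produce EEXIST and both end up installed.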

)

// TestUseInterfaceNetwork verifies useInterfaceNetwork boolean behavior.
func TestUseInterfaceNetwork(t *testing.T) {

Contributor

What exactly is this test doing?

Contributor Author

Removed this test - it was only validating the trivial boolean guard function which is already covered implicitly by the integration tests.

}

// TestStateJSONHasNoNeighbors verifies state serialization excludes neighbors.
func TestStateJSONHasNoNeighbors(t *testing.T) {

Contributor

What exactly is this test doing? I'm not sure the tests in this file really add value.

Contributor Author

Removed this test as well. Kept the TestMergeNetworkState* and TestLoadConf* tests since those exercise actual logic (result merging, config parsing, DPDK rejection).

SchSeba force-pushed the host-device-l3-info branch from df398aa to 20dc60e (May 11, 2026 13:04)

s1061123 left a comment

Contributor

This PR introduces a new parameter, 'useInterfaceNetwork', so could you please create another PR in https://github.com/containernetworking/cni.dev/pulls to update the host-device CNI documentation as well?

RuntimeConfig struct {
DeviceID string `json:"deviceID,omitempty"`
} `json:"runtimeConfig,omitempty"`
UseInterfaceNetwork bool `json:"useInterfaceNetwork,omitempty"`

Contributor

I recommend adding a comment that briefly explains what 'UseInterfaceNetwork' does, because the option name is not intuitive (what is 'useInterfaceNetwork' used for?).

Contributor Author

Done - added an inline comment: // When true, copy the host interface's IP addresses and routes into the container before IPAM runs.

Contributor Author

I also opened a PR for the documentation site: containernetworking/cni.dev#156

SchSeba force-pushed the host-device-l3-info branch 3 times, most recently from 72eec2d to 10936b3 (May 11, 2026 13:47)
@karampok

Contributor

I am a bit unsure whether we capture everything that needs to be re-applied, or whether we capture something we should not.
For example, with IPv6, when addresses come from SLAAC and routes from RA, nothing should be copied, IMO. (IPv6 tests are missing.)

An AI-generated list of issues that seem plausible:

1. Neighbours not captured: GCP assigns /32 to NICs; all egress needs an ARP entry for the gateway; LinkSetNsFd wipes the neighbour table; traffic blackholes.
2. rp_filter not captured: AWS multi-ENI requires loose mode (2); the container gets the netns default, strict (1); asymmetric return traffic is silently dropped.
3. RTPROT_RA routes copied: RA routes have an expiry (expires Nsec); they are re-applied without a lifetime; there is no RA daemon in the container to refresh them; stale routes persist forever.
4. SLAAC addresses copied without an IFA_F_PERMANENT filter: dynamic addresses (marked dynamic in ip addr show) become permanent in the container; no renewal; they never expire. (proof: /usr/include/linux/if_addr.h:54, IFA_F_PERMANENT 0x80)

squeed requested a review from mlguerrero12 (June 22, 2026 14:24)

SchSeba commented Jun 24, 2026

Contributor Author

Thanks for the thorough review @karampok - addressing your IPv6/SLAAC concerns:

Addressed in the latest push:

  • RTPROT_RA routes — now filtered out in captureHostNetworkState. RA-learned routes carry kernel-managed lifetimes and there's no RA daemon inside the container to refresh them, so copying them would create stale permanent routes.
  • SLAAC addresses — now filtered via IFA_F_PERMANENT flag check. Only permanently configured addresses are captured; dynamic (SLAAC/DHCPv6) addresses are skipped since they won't be renewed inside the container.

Out of scope for this PR (can be follow-ups):

  • Neighbours — On GCP with /32 NICs, the gateway ARP entry is indeed wiped when the device moves namespaces. However, once traffic flows in the container, the kernel will re-do ARP resolution (the gateway is reachable via the copied routes). If specific cloud setups require pre-populated neighbour entries, that can be added as a follow-up with a new captureNeighbours option. Adding it here would increase scope and complexity significantly.
  • rp_filter — This is a per-interface sysctl, not part of the netlink L3 configuration. Managing sysctls is the responsibility of the tuning CNI plugin, which is the standard approach in CNI. Users on AWS multi-ENI setups can chain tuning after host-device to set rp_filter=2.

Also added IPv6-related filtering (link-local unicast routes were already skipped).
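
The two filters described above can be sketched roughly as follows. The `addr` and `route` types and the `keepAddr`, `keepRoute`, and filter helpers are hypothetical simplifications, not the PR's captureHostNetworkState; the constant values are those of IFA_F_PERMANENT (0x80, as cited earlier in this thread) and RTPROT_RA (9) from the Linux UAPI headers.

```go
package main

import "fmt"

const (
	ifaFPermanent = 0x80 // IFA_F_PERMANENT, linux/if_addr.h
	rtprotRA      = 9    // RTPROT_RA, linux/rtnetlink.h
)

// addr and route are hypothetical, reduced views of netlink address and route entries.
type addr struct {
	CIDR  string
	Flags int
}

type route struct {
	Dst      string
	Protocol int
}

// keepAddr accepts only permanently configured addresses (skips SLAAC/DHCPv6).
func keepAddr(a addr) bool { return a.Flags&ifaFPermanent != 0 }

// keepRoute rejects routes learned from Router Advertisements.
func keepRoute(r route) bool { return r.Protocol != rtprotRA }

func filterAddrs(in []addr) []addr {
	var out []addr
	for _, a := range in {
		if keepAddr(a) {
			out = append(out, a)
		}
	}
	return out
}

func filterRoutes(in []route) []route {
	var out []route
	for _, r := range in {
		if keepRoute(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	addrs := []addr{
		{CIDR: "10.0.0.5/24", Flags: ifaFPermanent}, // static address
		{CIDR: "2001:db8::abcd/64", Flags: 0},       // dynamic (e.g. SLAAC) address
	}
	routes := []route{
		{Dst: "10.0.0.0/24", Protocol: 4}, // 4 is RTPROT_STATIC
		{Dst: "::/0", Protocol: rtprotRA}, // default route learned from an RA
	}
	fmt.Println(filterAddrs(addrs))
	fmt.Println(filterRoutes(routes))
}
```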

SchSeba force-pushed the host-device-l3-info branch from 10936b3 to a2cbe4c (June 24, 2026 17:10)
Add a new configuration option `useInterfaceNetwork` that instructs the
host-device plugin to capture the interface's IP addresses and routes
from the host before moving the device into the container namespace,
and then apply them inside the container.

This is critical for virtual environments (AWS, IBM Cloud, GCP) where
the cloud provider configures IP addresses and routes directly on the
network device. In these environments, there is no traditional IPAM
source; the ground truth for L3 configuration lives on the host
interface itself.

When `useInterfaceNetwork` is enabled, the plugin:
  - Captures all global-scope addresses and non-local routes from the
    host device before moving it into the container namespace.
  - Applies the captured addresses and routes to the interface inside
    the container.
  - Reports the addresses and routes in the CNI result (merged with
    any IPAM result if an IPAM plugin is also configured).

NOTE: The interface configuration on the host node must be persistent.
When the device is moved back to the host (via DEL) and renamed to its
original name, the system's network management service (e.g.
NetworkManager, systemd-networkd, cloud-init, or cloud-specific agents)
is expected to detect the device and re-apply the IP addresses and
routes. This plugin does NOT re-configure the host interface on DEL; it
relies on the node's network configuration being declarative and
reconciled by the platform's networking stack.

Also implements the STATUS command to verify the host device exists,
replacing the previous TODO stub.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>