
Traffic outage #54

@QQ2017

While monitoring network traffic with sar, I observed intermittent intervals in which the reported traffic drops to zero. An analysis using Codex indicates that this anomaly stems from the driver's statistics-gathering path itself, rather than from an actual "traffic outage."

  1. The netdev statistics update logic in this driver version is flawed. Tools such as sar -n DEV 1 ultimately invoke the ndo_get_stats64 handler, which maps to ice_get_stats64(); see /ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:9752. The intended design is for the read path to first call ice_update_vsi_ring_stats(vsi), which computes the current ring statistics, and then return the resulting values.

  2. However, the implementation of ice_update_vsi_ring_stats() is defective: it accumulates the statistics into a temporary vsi_stats buffer, then calls kfree(vsi_stats) without writing the accumulated TX/RX packet and byte counts back to the persistent vsi->net_stats structure. See lines /ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:9462 through /ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:9515. As a result, the values read by ice_get_stats64() are frequently stale cached data rather than ring counts freshly aggregated for the current request.

  3. The stale cached data comes mainly from the watchdog: ice_watchdog_subtask() refreshes these statistics once every pf->serv_tmr_period (defined as HZ, i.e. about once per second). See /ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:2245 and /ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:5583. If two consecutive reads by a sampling tool fall between the same pair of watchdog updates, the second read returns an unchanged counter, which converts to a traffic rate of zero. The following sample then picks up everything accumulated since the previous update and reports a much larger figure. This produces the observed alternating pattern of "0 / very large value / 0 / very large value." The pattern in your screenshot closely matches this scenario.

  4. There is also a secondary implementation issue: ice_fetch_u64_stats_per_ring() takes struct ice_q_stats by value, so the u64_stats_fetch_begin/retry loop protects only the read of a local copy, not the read of the live ring counters (see ethernet-linux-ice-main/ethernet-linux-ice-main/src/ice_main.c:9392). This is a consistency flaw that is unlikely on its own to cause the "reset-to-zero-every-other-second" behavior, but it does suggest that the current statistics-handling code has quality problems.
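The sampling effect described in point 3 can be reproduced with a small userspace model. This is only a sketch under stated assumptions: the struct, the function names, and the constant traffic rate are all hypothetical, and nothing here is taken from the driver source. It models a reader that only ever sees the counter value captured at the most recent periodic refresh.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not driver code. Times are in milliseconds and
 * traffic arrives at a constant rate starting at t = 0. */
struct stats_model {
    uint64_t rate_per_ms;       /* packets the rings receive per ms */
    uint64_t refresh_period_ms; /* interval between cache refreshes */
    uint64_t refresh_phase_ms;  /* time of the first refresh */
};

/* Value a reader sees at time t_ms: the true count as of the last refresh. */
static uint64_t cached_count(const struct stats_model *m, uint64_t t_ms)
{
    if (t_ms < m->refresh_phase_ms)
        return 0; /* no refresh has happened yet */

    uint64_t since_refresh = (t_ms - m->refresh_phase_ms) % m->refresh_period_ms;
    uint64_t last_refresh = t_ms - since_refresh;

    return last_refresh * m->rate_per_ms;
}

/* Emulate a sampling tool: deltas[i] is the difference between two
 * consecutive reads taken interval_ms apart, starting at start_ms. */
static void sample_deltas(const struct stats_model *m, uint64_t start_ms,
                          uint64_t interval_ms, uint64_t *deltas, size_t n)
{
    uint64_t prev = cached_count(m, start_ms);

    for (size_t i = 0; i < n; i++) {
        uint64_t cur = cached_count(m, start_ms + (i + 1) * interval_ms);

        deltas[i] = cur - prev;
        prev = cur;
    }
}
```

For example, with rate_per_ms = 10, a refresh every 2000 ms, and reads every 1000 ms starting at t = 500 ms, the deltas come out as 0, 20000, 0, 20000 even though the modeled traffic never stops. The 2000 ms refresh period is an arbitrary illustrative value chosen to make the alternation strict; it is not a measurement of the driver.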

Conclusion: What you are observing appears to be an anomaly or lag in the driver's statistics updates, not the actual business traffic intermittently dropping to zero. The most likely primary cause is that ice_update_vsi_ring_stats() fails to write the aggregated ring statistics back to vsi->net_stats; as a result, sar can only read the stale values that were last refreshed by the watchdog timer.

If you intend to fix this, I recommend prioritizing the following two points:

  1. After the aggregation process within ice_update_vsi_ring_stats() is complete, write the values from vsi_stats->{tx_packets, tx_bytes, rx_packets, rx_bytes} back to vsi->net_stats.
  2. Modify ice_fetch_u64_stats_per_ring() to accept a pointer—struct ice_q_stats *stats—and perform the dereferenced reads directly within the u64_stats_fetch_begin/retry loop.
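The shape of both changes can be sketched in a userspace model. Everything below is an assumption made for illustration: the struct layouts and function names are simplified stand-ins rather than the driver's real definitions, and the seqcount protection is indicated only by a comment, since u64_stats_fetch_begin/retry are kernel-only helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver types; not the real layouts. */
struct q_stats { uint64_t pkts; uint64_t bytes; };
struct ring { struct q_stats stats; };
struct net_stats { uint64_t tx_packets, tx_bytes, rx_packets, rx_bytes; };
struct vsi {
    struct ring *tx_rings;
    size_t num_txq;
    struct ring *rx_rings;
    size_t num_rxq;
    struct net_stats net_stats; /* persistent copy returned to readers */
};

/* Point 2: take a pointer so the reads touch the live ring counters.
 * In the driver, these two loads would sit inside the
 * u64_stats_fetch_begin/retry loop. */
static void fetch_ring_stats(const struct q_stats *stats,
                             uint64_t *pkts, uint64_t *bytes)
{
    *pkts = stats->pkts;
    *bytes = stats->bytes;
}

/* Point 1: aggregate per-ring counters, then write the totals back. */
static void update_vsi_ring_stats(struct vsi *vsi)
{
    struct net_stats sum = {0};
    uint64_t pkts, bytes;

    for (size_t i = 0; i < vsi->num_txq; i++) {
        fetch_ring_stats(&vsi->tx_rings[i].stats, &pkts, &bytes);
        sum.tx_packets += pkts;
        sum.tx_bytes += bytes;
    }
    for (size_t i = 0; i < vsi->num_rxq; i++) {
        fetch_ring_stats(&vsi->rx_rings[i].stats, &pkts, &bytes);
        sum.rx_packets += pkts;
        sum.rx_bytes += bytes;
    }

    /* The write-back step that the analysis above reports as missing. */
    vsi->net_stats.tx_packets = sum.tx_packets;
    vsi->net_stats.tx_bytes = sum.tx_bytes;
    vsi->net_stats.rx_packets = sum.rx_packets;
    vsi->net_stats.rx_bytes = sum.rx_bytes;
}
```

With the write-back in place, every call to the update function leaves vsi->net_stats reflecting the current ring totals, so a reader no longer depends on the periodic watchdog refresh. Any real patch would need to follow the driver's own locking and allocation conventions, which this model omits.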
