CPU Usage from walking interfaces on Juniper switches #694

neilschelly · 2021-10-14T22:44:02Z

I've got two Juniper Virtual Chassis switch stacks whose CPU usage climbs very high whenever SNMP Exporter walks the interfaces for the usual throughput, error, and counter OIDs. I think part of the problem is that I have a redundant pair of Prometheus servers querying SNMP Exporter, so SNMP Exporter may be running multiple walks simultaneously.

I was able to manage this reasonably well on an EX3400 stack just by reducing the scrape frequency to once every 2 minutes. On those, CPU usage jumps from about 35% to 70% while queries are running. On a new stack of EX4300 switches, the CPU stays pegged above 95% almost constantly while I'm querying for this data. It drops to around 40% when I'm not querying for interface metrics.

I've been tempted to put some kind of cache in front of SNMP exporter that can just reply with cached information from the last 2 minutes if it's asked. I'm sure that would help. I'm wondering if anyone's seen similar behavior and found ways to mitigate or work around it.

Answered by SuperQ

SuperQ (Maintainer) · 2021-10-15T08:37:04Z

Typically, what I do for Juniper is enable caching in the device config. I set the JunOS SNMP stats cache lifetime to 1 second less than my scrape interval.

set snmp stats-cache-lifetime 29

I also broke up my walks into smaller pieces to make JunOS do less work per scrape. I also dropped the ifDescr lookup on JunOS since it wasn't necessary for my setup.

Here's my JunOS generator config:

---
modules:
  # Trimmed down if_mib for slow JunOS devices.
  if_mib_junos:
    walk:
    - sysUpTime
    # ifXTable
    - ifHCInOctets
    - ifHCInUcastPkts
    - ifHCInBroadcastPkts
    - ifHCOutOctets
    - ifHCOutUcastPkts
    - ifHCOutBroadcastPkts
    lookups:
    - source_indexes: [ifIndex]
      lookup: ifAlias
    #- source_indexes: [ifIndex]
    #  # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
    #  lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
    - source_indexes: [ifIndex]
      # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
      lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
    overrides:
      ifAlias:
        ignore: true # Lookup metric
      ifName:
        ignore: true # Lookup metric
  # Trimmed down if_mib for slow JunOS devices.
  if_mib_junos_errors:
    walk:
    # ifTable
    - ifAdminStatus
    - ifOperStatus
    - ifInDiscards
    - ifInErrors
    - ifOutDiscards
    - ifOutErrors
    # ifXTable
    - ifHighSpeed
    lookups:
    - source_indexes: [ifIndex]
      lookup: ifAlias
    #- source_indexes: [ifIndex]
    #  # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
    #  lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
    - source_indexes: [ifIndex]
      # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
      lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
    overrides:
      ifAdminStatus:
        type: EnumAsStateSet
      ifAlias:
        ignore: true # Lookup metric
      ifName:
        ignore: true # Lookup metric
      ifOperStatus:
        type: EnumAsStateSet
      ifType:
        type: EnumAsInfo

This allows me to scrape every 30s without much trouble.
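
For anyone wiring this up on the Prometheus side, here is a minimal sketch of scrape jobs for the two modules above, one job per module so each scrape triggers a smaller walk. The job names, the target hostname, and the exporter address (127.0.0.1:9116) are assumptions for illustration and are not from the post; the 30s interval is chosen to pair with the 29-second stats-cache-lifetime.

```yaml
scrape_configs:
  # Hypothetical job name; walks the counters module.
  - job_name: snmp_junos_if
    scrape_interval: 30s          # pairs with "set snmp stats-cache-lifetime 29"
    metrics_path: /snmp
    params:
      module: [if_mib_junos]
    static_configs:
      - targets:
          - switch1.example.net   # hypothetical switch hostname
    relabel_configs:
      # Pass the switch address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the switch name as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual HTTP request to SNMP Exporter (assumed address).
      - target_label: __address__
        replacement: 127.0.0.1:9116

  # Hypothetical job name; walks the status/errors module separately.
  - job_name: snmp_junos_if_errors
    scrape_interval: 30s
    metrics_path: /snmp
    params:
      module: [if_mib_junos_errors]
    static_configs:
      - targets:
          - switch1.example.net   # hypothetical switch hostname
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116
```

If the scrape interval is changed, the cache lifetime on the switch would presumably be adjusted to stay one second below it, following the rule of thumb above.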

1 reply

neilschelly Oct 15, 2021
Author

I don't know how I didn't know about the stats-cache-lifetime configuration option. That alone was a huge improvement.

I also built an nginx proxy_cache in front of SNMP Exporter with the following configuration:

proxy_cache_path /tmp/nginx_cache levels=1:2 keys_zone=snmp_exporter:10m max_size=10m inactive=5m use_temp_path=off;
proxy_cache_min_uses 1;
proxy_cache_key $scheme$proxy_host$uri$is_args$args;

server {
        listen 9117 default_server;
        server_name _;
        location / {
                proxy_cache snmp_exporter;
                proxy_pass http://localhost:9116;

                proxy_ignore_headers X-Accel-Expires Expires Cache-Control Set-Cookie;
                proxy_cache_valid any 1m;

                add_header X-Proxy-Cache $upstream_cache_status;
                proxy_cache_lock on;

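                # Only module=if_mib requests use the cache; everything else bypasses it.
                set $bypass 'NOCACHE';
                if ($arg_module = 'if_mib') { set $bypass ''; }
                proxy_cache_bypass $bypass;
                # Also skip storing bypassed responses, so they never occupy the cache.
                proxy_no_cache $bypass;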
        }
}

Since we have redundant Prometheus servers each querying this exporter separately with the same parameters very close together in time, this configuration keeps both requests from passing through and triggering two separate walks. With proxy_cache_lock on, the second request just waits for the first to come back and gets the same answer. I only wanted some SNMP requests to be cached, so in the above configuration only requests with module=if_mib are served from the cache. With this in place, I set Prometheus to query port 9117 instead of 9116 on the same machines, and it works great.
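
As a rough illustration of that last step, the Prometheus job might look something like the sketch below. Only the ports (9117 for the nginx cache, 9116 for the exporter behind it) and the if_mib module name come from the post; the job name and target hostname are hypothetical, and localhost assumes nginx runs on the same machine as Prometheus.

```yaml
scrape_configs:
  - job_name: snmp_if_mib_cached      # hypothetical job name
    metrics_path: /snmp
    params:
      module: [if_mib]                # the only module the nginx config serves from cache
    static_configs:
      - targets:
          - ex4300-stack.example.net  # hypothetical switch hostname
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Point at the nginx cache on 9117 rather than SNMP Exporter on 9116.
      - target_label: __address__
        replacement: localhost:9117
```

Because the cache key includes the query string, both Prometheus servers need to send identical module and target parameters for the second request to hit the cached response.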
