Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1843926: SPCS service events & metrics #1954

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

sfc-gh-ashen
Copy link
Contributor

@sfc-gh-ashen sfc-gh-ashen commented Dec 13, 2024

Pre-review checklist

  • I've confirmed that instructions included in README.md are still correct after my changes in the codebase.
  • I've added or updated automated unit tests to verify correctness of my new code.
  • I've added or updated integration tests to verify correctness of my new code.
  • I've confirmed that my changes are working by executing CLI's commands manually on MacOS.
  • I've confirmed that my changes are working by executing CLI's commands manually on Windows.
  • I've confirmed that my changes are up-to-date with the target branch.
  • I've described my changes in the release notes.
  • I've described my changes in the section below.

Changes description

Design Document

1. Events Command: snow spcs service events

  • Purpose: Retrieve events for a specific service with advanced filtering options.

2. Metrics Command: snow spcs service metrics

  • Purpose: Fetch service metrics with similar filtering capabilities.

Change Motivation

  • Fill Monitoring Gaps: The current SnowCLI lacks support for retrieving service-specific metrics and events, limiting the ability to monitor and debug effectively.
  • Enhanced User Experience: Streamlined commands with filtering options for time, type, and scope will make monitoring easier and more efficient for users.

Alternate Approach: Standalone Command (snow spcs events)

  • Unified Command Consideration: Instead of separate commands for metrics and events, a unified snow spcs events command was considered. This would handle both service and compute pool telemetry.

    • Pros:
      • Retrieve all telemetry types with a single command.
      • Expose more filtering capabilities, offering customers more flexibility.
    • Cons:
      • Telemetry data for services is already supported under specific resource commands.
      • A generic command might add confusion, so a more intuitive design with separate commands is preferable.

    Given these drawbacks, the current design with distinct commands was chosen for clarity and future extensibility.

Design Logic

Required Parameters

For both events and metrics, the following parameters are mandatory to ensure precise identification of the service and its resources:

  • name: Service identifier (Service name).
  • --container-name: Name of the container.
  • --instance-id: Instance ID of the service (starting with 0).

Filtering Options

  • Time-Based Filters:

    • --since: Fetch telemetry data newer than the specified time (e.g., 1h, 5d).
    • --until: Fetch telemetry data older than the specified time.
  • Pagination (for Events Only):

    • --first <N>: Fetch only the first N events.
    • --last <N>: Fetch only the last N events.
    • --first and --last are mutually exclusive.
  • All Columns Display:

    • --all: Fetch all columns from the event or metrics table.

Behavior for Metrics

  • No Filters Provided (--since, --until not used):

    • Returns only the latest metrics for each unique metric type (grouped), ensuring deduplication of metrics.
  • With Any Filters (--since, --until):

    • Returns all matching metrics from the event table based on the specified filters.

Data Formats

Events (Formatted Data)

TIMESTAMP DATABASE NAME SCHEMA NAME SERVICE NAME INSTANCE NAME CONTAINER NAME SEVERITY EVENT NAME EVENT VALUE
-- -- -- -- -- -- -- -- --

Metrics (Formatted Data)

TIMESTAMP DATABASE NAME SCHEMA NAME SERVICE NAME INSTANCE NAME CONTAINER NAME METRIC NAME METRIC VALUE
-- -- -- -- -- -- -- --

Raw Data (--all Flag)

TIMESTAMP START_TIMESTAMP OBSERVED_TIMESTAMP TRACE RESOURCE RESOURCE_ATTRIBUTES SCOPE SCOPE_ATTRIBUTES RECORD_TYPE RECORD RECORD_ATTRIBUTES VALUE EXEMPLARS
-- -- -- -- -- -- -- -- -- -- -- -- --

Use Cases

snow spcs service events

  1. Retrieve all events for a specific service:
    snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0
    Output: Formatted table with key columns.
TIMESTAMP DATABASE NAME SCHEMA NAME SERVICE NAME INSTANCE NAME CONTAINER NAME SEVERITY EVENT NAME EVENT VALUE
2024-12-13 10:01:52.808692 TESTDB PUBLIC LOG_EVENT 0 log-printer INFO CONTAINER.STATUS_CHANGE { "message": "Running", "status": "READY" }
2024-12-14 22:27:25.420489 TESTDB PUBLIC LOG_EVENT 0 log-printer INFO CONTAINER.STATUS_CHANGE { "message": "Running", "status": "READY" }
2. **Retrieve a subset of events for a specific service:** - Limit the number of events: ```bash snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --first 10 ``` Output: Formatted table showing the first 10 events.
  • Or fetch the last 10 events:
    snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --last 10
    Output: Formatted table showing the last 10 events.
  1. Narrow the time range using interval syntax:

    • Fetch events newer than the last 5 minutes:
      snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --since '5 minutes'

    Output :

    • Fetch events older than 1 hour:
      snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --until '1 hour'
      Output: Events within the specified time range.
  2. Retrieve all events with all columns displayed:

    snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --all --last 1

    Output: Raw table including all available columns.

  3. Retrieve events formatted for JSON output:

    snow spcs service events LOG_EVENT --container-name log-printer --instance-id 0 --last 1 --format json

    Output (JSON):

    [
        {
            "TIMESTAMP": "2024-12-14T22:27:25.420489",
            "DATABASE NAME": "TESTDB",
            "SCHEMA NAME": "PUBLIC",
            "SERVICE NAME": "LOG_EVENT",
            "INSTANCE NAME": "0",
            "CONTAINER NAME": "log-printer",
            "SEVERITY": "INFO",
            "EVENT NAME": "CONTAINER.STATUS_CHANGE",
            "EVENT VALUE": "{\n  \"message\": \"Running\",\n  \"status\": \"READY\"\n}"
        }
    ]

snow spcs service metrics

  1. Retrieve metrics for a specific service:
    snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0
    Output: Formatted table showing the latest metrics for each unique metric type.
TIMESTAMP DATABASE NAME SCHEMA NAME SERVICE NAME INSTANCE NAME CONTAINER NAME METRIC NAME METRIC VALUE
2024-12-18 18:10:25.202000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.cpu.limit 1
2024-12-18 18:10:25.202000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.memory.requested 536870912
2024-12-18 18:10:25.202000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.memory.limit 6442450944
2024-12-18 18:10:25.202000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.cpu.requested 0.5
2024-12-18 18:10:08.957000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.cpu.usage 0.0004400012665396536
2024-12-18 18:10:08.957000 TESTDB PUBLIC LOG_EVENT 0 log-printer container.memory.usage 1323008
  1. Retrieve a subset of metrics with pagination:

    • Fetch the first 5 metrics:

      snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --first 5
    • Or fetch the last 5 metrics:

      snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --last 5

      Output: Formatted table showing the first or last 5 metrics.

  2. Retrieve metrics within a time range:

    • Fetch metrics newer than 1 hour:

      snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --since '1 hour'
    • Fetch metrics older than 2 hours:

      snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --until '2 hours'

      Output: Metrics within the specified time range.

  3. Retrieve metrics with all columns:

    snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --last 1 --all

    Output: Raw table showing the last metric with all columns.

  4. Retrieve metrics formatted for JSON output:

    snow spcs service metrics LOG_EVENT --container-name log-printer --instance-id 0 --format json

    Output (JSON):

    [
        {
            "TIMESTAMP": "2024-12-14T22:27:25.420489",
            "SERVICE NAME": "LOG_EVENT",
            "INSTANCE NAME": "0",
            "CONTAINER NAME": "log-printer",
            "METRIC TYPE": "CPU_UTILIZATION",
            "VALUE": "75.4"
        }
    ]

@sfc-gh-ashen sfc-gh-ashen requested a review from a team as a code owner December 13, 2024 22:02
@sfc-gh-ashen sfc-gh-ashen marked this pull request as draft December 13, 2024 22:03
@sfc-gh-ashen sfc-gh-ashen force-pushed the SNOW-1843926-spcs-event-table-snow-cli-integration branch from 0700514 to 70cc699 Compare December 13, 2024 22:15
@sfc-gh-ashen sfc-gh-ashen self-assigned this Dec 13, 2024
@sfc-gh-ashen sfc-gh-ashen changed the title SNOW-1843926: SPCS service events&metrics SNOW-1843926: SPCS service events & metrics Dec 18, 2024
@sfc-gh-ashen sfc-gh-ashen marked this pull request as ready for review December 18, 2024 18:54
@sfc-gh-ashen sfc-gh-ashen requested a review from a team as a code owner December 18, 2024 18:54
@sfc-gh-tkommineni
Copy link
Contributor

great job with design doc and description on the pr!

@sfc-gh-tkommineni sfc-gh-tkommineni force-pushed the SNOW-1843926-spcs-event-table-snow-cli-integration branch from 72e4916 to d8e47b7 Compare January 1, 2025 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants