Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚗️ Add tracing spans to simcore services #6792

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

bisgaard-itis
Copy link
Contributor

@bisgaard-itis bisgaard-itis commented Nov 21, 2024

What do these changes do?

  • This is an experimental PR which attempts to add functionality to add opentelemetry tracing spans to all methods within the simcore services.
  • In this PR it is only actually added to the api-server as a proof of concept.
  • Concretely, decorators are added which can be used to decorate free functions and classes in simcore services to wrap them in opentelemetry tracing spans. The experimental part is adding an import hook 🪄 which can be used to apply these automatically in all services.
  • Here's an example of the spans produced in jaeger:
    image
  • I believe the decorators themselves make sense and should be added to all fcns in simcore services. But please be skeptical 😡 in your reviews of the import hook

Related issue/s

How to test

Dev-ops checklist

@bisgaard-itis bisgaard-itis self-assigned this Nov 21, 2024
@bisgaard-itis bisgaard-itis added a:services-library issues on packages/service-libs a:apiserver api-server service labels Nov 21, 2024
@bisgaard-itis bisgaard-itis added this to the Event Horizon milestone Nov 21, 2024
Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 39.21569% with 31 lines in your changes missing coverage. Please review.

Project coverage is 88.51%. Comparing base (a87de46) to head (ccdb8f0).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6792      +/-   ##
==========================================
+ Coverage   85.19%   88.51%   +3.31%     
==========================================
  Files        1545     1201     -344     
  Lines       61634    52312    -9322     
  Branches     2155     1045    -1110     
==========================================
- Hits        52508    46302    -6206     
+ Misses       8799     5823    -2976     
+ Partials      327      187     -140     
Flag Coverage Δ
integrationtests 64.79% <ø> (+0.02%) ⬆️
unittests 86.04% <39.21%> (-0.38%) ⬇️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 75.96% <35.41%> (-0.42%) ⬇️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.36% <ø> (ø)
agent 97.00% <ø> (ø)
api_server 89.72% <100.00%> (+<0.01%) ⬆️
autoscaling 95.21% <ø> (ø)
catalog 90.57% <ø> (ø)
clusters_keeper 98.73% <ø> (ø)
dask_sidecar 91.26% <ø> (ø)
datcore_adapter 93.17% <ø> (ø)
director 75.97% <ø> (-0.17%) ⬇️
director_v2 91.13% <ø> (+14.56%) ⬆️
dynamic_scheduler 96.59% <ø> (ø)
dynamic_sidecar 89.72% <ø> (+29.85%) ⬆️
efs_guardian 90.12% <ø> (ø)
invitations 93.44% <ø> (ø)
osparc_gateway_server 85.49% <ø> (ø)
payments 92.76% <ø> (ø)
resource_usage_tracker 90.79% <ø> (+0.07%) ⬆️
storage 89.66% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.70% <ø> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a87de46...ccdb8f0. Read the comment docs.

---- 🚨 Try these New Features:

@mrnicegyu11
Copy link
Member

Alas, hotfix to prod ASAP, I say!

from servicelib.logging_utils import config_all_loggers
from simcore_service_api_server import exceptions

setup_tracing_spans_for_simcore_service_functions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do not like that the setup has to be called depending on the imports
if it is experimental, why not to use DEV_FEATURE to enable/disable it?
I assume you will be on top of it right?

from servicelib.logging_utils import config_all_loggers
from simcore_service_api_server import exceptions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use relative imports ... from .. import exceptions
Actually, perhaps you can do some cleanup? :-)
image

Exceptionally main might need absolute import, otherwise we should use relavite imports everywhere.

def _opentelemetry_method_span(cls):
for name, value in cls.__dict__.items():
if callable(value) and not name.startswith("_"):
setattr(cls, name, _opentelemetry_function_span(value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so here value is a callable and you basically decorate the member functions.
MINOR

setattr(cls, name, _opentelemetry_function_span(func=value))


def _opentelemetry_method_span(cls):
for name, value in cls.__dict__.items():
if callable(value) and not name.startswith("_"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about properties, or (in general) descriptors , do you decorate them?

return None


def setup_tracing_spans_for_simcore_service_functions():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why dont you pass the name of the module instead of relying on things that were imported?

If i undestand it right, yoy want to decorate all members and functions of a given module so the steps would be

  1. import all modules in a package
  2. inspect all members and functions
  3. apply decorators above

for 1 and 2 see what we do with walk_model_examples_in_package. I guess is similar to the MetaPathFinder but higher level since uses built-in pkgutil.walk_packages

setattr(module, name, _opentelemetry_method_span(cls))


class _AddTracingSpansFinder(MetaPathFinder):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaPathFinder ... ?? this sounds like a wonder name for a spaceship 🚀 :-D

Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is nice to start a trial, but I would be very wary of:

  • induce performance hits,
  • every single function is now wrapped by network calls,
  • what happens when the collector is down?
  • how much overhead?
  • how can I exclude some functions?

Actually I thought that you wanted to initialy allow to decorate some specific functions to be in the telemetry. I find instrumenting everything is too much with too many unknown on the effects.
If you wish we could try the benchmark test that lists all the services in the registry, I have it ready at the moment.

@@ -2,7 +2,16 @@

"""

import importlib
import importlib.machinery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool name...

return wrapper


def _opentelemetry_method_span(cls):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I have some questions:

  • how heavy is using this? If I get it right, using this wraps every single function with a network call to the telemetry collector right? That means any call is network-limited is that correct?
  • if yes, then what happens if the collector is broken or the network is shaky?
  • does this break the function?
  • if there are network issues, does it retry? how many times? what is the timeout?
  • if you telemetrize a ping function for example, that might be quite bad no? these function typically should run with no "brakes", can you exclude functions?

from servicelib.logging_utils import config_all_loggers
from simcore_service_api_server import exceptions

setup_tracing_spans_for_simcore_service_functions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this not be called in your import directly? I have the feeling this is very fragile as this will for sure move with some of our tools.

also, this instruments everything? I guess my questions above then remain more than ever.

Copy link
Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a:apiserver api-server service a:services-library issues on packages/service-libs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add opentelemetry spans to simcore service methods
5 participants