
Graph API for interacting with ros2_tracing events #1

Merged: 2 commits merged into main from graph_api on Feb 6, 2023

Conversation

@mjcarroll (Collaborator) commented Jan 30, 2023

This introduces an alternative to the mechanism for interacting with trace data introduced in https://github.com/ros-tracing/tracetools_analysis.

The primary data structure is a graph that represents the ROS 2 computation graph. From this entrypoint, users can introspect on the runtime layout of a traced ROS 2 system as recorded via tracetools.

In addition to the individual events, the data is associated across elements via:

  1. Publication and Subscription events on the same topic are associated
  2. Publications from within timer callbacks are associated
  3. Publications from within subscription callbacks are associated.

This allows users to inspect causal "chains" of events across a ROS 2 computation graph.

As an example, in the Mont Blanc test topology, we can trace the sequence of events that causes the arequipa subscription to fire:

| node | event | topic | timestamp (ms) |
| --- | --- | --- | --- |
| portsmouth | callback_start | timer | 0.000000 |
| portsmouth | rclcpp_pub | /danube | 0.023422 |
| portsmouth | rcl_pub | /danube | 0.025456 |
| portsmouth | rmw_pub | /danube | 0.025918 |
| portsmouth | callback_end | timer | 0.069164 |
| hamburg | rmw_take | /danube | 0.172475 |
| hamburg | rcl_take | /danube | 0.172953 |
| hamburg | rclcpp_take | /danube | 0.173848 |
| hamburg | callback_start | /danube | 0.182634 |
| hamburg | rclcpp_pub | /parana | 0.206418 |
| hamburg | rcl_pub | /parana | 0.208355 |
| hamburg | rmw_pub | /parana | 0.209520 |
| hamburg | callback_end | /danube | 0.315995 |
| geneva | rmw_take | /parana | 0.339240 |
| geneva | rcl_take | /parana | 0.339772 |
| geneva | rclcpp_take | /parana | 0.340232 |
| geneva | callback_start | /parana | 0.341611 |
| geneva | rclcpp_pub | /arkansas | 0.354110 |
| geneva | rcl_pub | /arkansas | 0.355227 |
| geneva | rmw_pub | /arkansas | 0.356448 |
| geneva | callback_end | /parana | 0.398542 |
| arequipa | rmw_take | /arkansas | 0.454997 |
| arequipa | rcl_take | /arkansas | 0.455474 |
| arequipa | rclcpp_take | /arkansas | 0.456482 |
| arequipa | callback_start | /arkansas | 0.462663 |
| arequipa | callback_end | /arkansas | 0.466198 |
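
Purely for illustration, walking such a chain from Python could look roughly like this (the module, function, and attribute names below are hypothetical stand-ins, not the actual API added here):

```python
# Hypothetical sketch -- names below are illustrative, not the API in this PR.
from trace_graph import load_graph  # assumed entry point for the graph

graph = load_graph("mont_blanc_trace/")  # trace directory recorded via tracetools

# Locate the arequipa subscription on /arkansas and walk backwards through
# the associated publications and callbacks that led to it firing.
sub = graph.find_subscription(node="arequipa", topic="/arkansas")
for event in sub.causal_chain():
    print(event.node, event.name, event.topic, event.timestamp)
```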

@christophebedard

I'm guessing you're going to bring tracetools_analysis back after upgrading it to babeltrace2? Have you done any trace-reading performance comparison between babeltrace and babeltrace2?

@mjcarroll (Collaborator, Author)

> I'm guessing you're going to bring tracetools_analysis back after upgrading it to babeltrace2?

While it is up to you, I think the better long-term plan would be to add this as a "peer" API in tracetools_analysis. It seems like a useful way of looking at the same data, associated with the relevant pieces of the ROS computation graph.

> Have you done any trace-reading performance comparison between babeltrace and babeltrace2?

Not yet. Is there a specific benchmark you are interested in?

@christophebedard

> While it is up to you, I think the better long-term plan would be to add this as a "peer" API in tracetools_analysis. It seems like a useful way of looking at the same data, associated with the relevant pieces of the ROS computation graph.

I'm mostly referring to the utilities for actually processing the trace, event by event (i.e., EventHandler/Processor). It's kind of a simplified version of Trace Compass' event processing architecture, which allows you to create one EventHandler that uses the result of another EventHandler.

I guess you don't really need this if you process everything at once, though. I was just wondering.
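
Roughly, the pattern looks like this (a simplified sketch; the handler classes, tracepoint names, and field names below are only illustrative, not the actual tracetools_analysis API):

```python
# Simplified sketch of the chaining idea; handler classes, tracepoint names,
# and field names are illustrative, not the actual tracetools_analysis API.

class NodeHandler:
    """'Analysis A': builds a node_handle -> node name map as events arrive."""

    def __init__(self):
        self.nodes = {}

    def handle(self, event):
        if event["_name"] == "ros2:rcl_node_init":
            self.nodes[event["node_handle"]] = event["node_name"]


class CallbackHandler:
    """'Analysis B': uses A's partial result while processing later events."""

    def __init__(self, node_handler):
        self.node_handler = node_handler
        self.callbacks = []

    def handle(self, event):
        if event["_name"] == "ros2:callback_start":
            node = self.node_handler.nodes.get(event.get("node_handle"), "?")
            self.callbacks.append((node, event["_timestamp"]))


def process(events):
    a = NodeHandler()
    b = CallbackHandler(a)
    for event in events:   # single pass over the trace
        a.handle(event)    # A updates its state first...
        b.handle(event)    # ...so B can already rely on what A has seen so far
    return b.callbacks
```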

> Not yet. Is there a specific benchmark you are interested in?

Nothing specific. Just wondering how performance would be affected for typical use-cases if we were to upgrade tracetools_analysis to babeltrace2.

@mjcarroll (Collaborator, Author)

We discussed performance a little bit offline; the initial conclusion is that babeltrace2 is actually slower than babeltrace, at least in the way that we are currently using it (processing all of the events in a single shot).

Here is a notebook that covers some of the potential ways this could be used, with performance measurements: https://gist.github.com/mjcarroll/34e7f06d761c8c6ce2cce36027900b34
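
For reference, the "single shot" style being measured is roughly the following (a minimal sketch using the babeltrace2 Python bindings; the trace path is a placeholder):

```python
import bt2

def load_events(trace_path):
    """Read every event message in the trace in one pass."""
    events = []
    for msg in bt2.TraceCollectionMessageIterator(trace_path):
        if type(msg) is bt2._EventMessageConst:
            events.append((
                msg.default_clock_snapshot.ns_from_origin,  # timestamp (ns)
                msg.event.name,                             # e.g. "ros2:callback_start"
            ))
    return events

events = load_events("mont_blanc_trace/")  # placeholder path
```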

@iluetkeb commented Feb 1, 2023

So, I haven't really looked at the new API in depth. But one thing I noticed is that you remove the "procname" field from the events. This is an example of something that you can probably ignore for your analysis, but that I can't ignore for what I'm doing.

And that is what I see a lot: Most people have some specific analysis in mind, and the code they design is tailored to that, and then it isn't re-usable.

Another example of this is that a lot of tracetools_analysis code expects certain kernel events, but in our use case, we'd like to avoid the need for users to have all the permissions set up, and so our traces don't contain these events.

I've been thinking a bit about how to avoid this, or whether it is actually possible to avoid. One particular problem I noticed is that CTF is not queryable, so the first part of each processing pipeline converts it into something that is queryable, usually either an internal data structure or a database. The second part of the pipeline then puts some convenient (where "convenient" is very use-case specific) API on whatever the first part produced.

Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.

Since our data is a time series, I would suggest making the base API time-series oriented. We can then put things like graph APIs on top.
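
To make this concrete, here is a rough strawman of what such a base API could look like (all names and types below are illustrative, not a concrete proposal):

```python
# Strawman only: a time-series-oriented event store with minimal assumptions
# (timestamp, tracepoint name, fixed payload, variable context).
from typing import Iterable, Iterator, Mapping, NamedTuple, Optional, Set


class Event(NamedTuple):
    timestamp: int                  # ns since trace origin
    name: str                       # tracepoint name, e.g. "ros2:callback_start"
    fields: Mapping[str, object]    # fixed per-tracepoint payload
    context: Mapping[str, object]   # variable context (procname, cpu_id, ...)


class EventStore:
    """Queryable, time-ordered store; an in-memory backend for illustration."""

    def __init__(self, events: Iterable[Event]):
        self._events = sorted(events, key=lambda e: e.timestamp)

    def query(self, start: int = 0, end: Optional[int] = None,
              names: Optional[Set[str]] = None) -> Iterator[Event]:
        """Iterate events in time order, filtered by range and tracepoint name."""
        for e in self._events:
            if e.timestamp < start:
                continue
            if end is not None and e.timestamp > end:
                break
            if names is None or e.name in names:
                yield e
```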

What do you guys think?

@mjcarroll (Collaborator, Author)

> But one thing I noticed is that you remove the "procname" field from the events. This is an example of something that you can probably ignore for your analysis, but that I can't ignore for what I'm doing.

Removing procname is a hold-over from me doing initial analysis with entirely composed nodes, so it wasn't particularly interesting at the time. I will add that back in.

> Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.

I think this actually makes a lot of sense.

To begin with here, I started from the API that @christophebedard had set up in tracetools_read, but ended up dropping it in the short term to try using babeltrace2. My intent (as mentioned above) is to potentially merge the bt2 improvements upstream and then take advantage of that API.

Maybe it would make sense to have a separate meeting to discuss what this DataModel should look like? I imagine that it would look a lot like what is already implemented in tracetools_analysis, and I think I could likely rewrite the processing here to take advantage of that underlying structure.

@mjcarroll (Collaborator, Author)

I would think that something like Apache Arrow may be a reasonable approach to storing data in memory and on disk. It has pretty substantial compatibility across languages if we wanted to do the ctf->time series conversion in a separate place from the remainder of the analysis.
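
As a rough illustration of the idea (assuming pyarrow; the columns just mirror the example table above and are not a settled schema):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One Arrow table of events, written to Parquet so any language with
# Arrow bindings could read it back for analysis.
events = pa.table({
    "node":      ["portsmouth", "portsmouth", "portsmouth"],
    "event":     ["callback_start", "rclcpp_pub", "rcl_pub"],
    "topic":     ["timer", "/danube", "/danube"],
    "timestamp": pa.array([0, 23422, 25456], type=pa.int64()),  # ns offsets
})
pq.write_table(events, "trace_events.parquet")
readback = pq.read_table("trace_events.parquet")  # e.g. from another language/process
```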

@christophebedard

> Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.
>
> I think this actually makes a lot of sense.

I agree. This is pretty similar to the way Trace Compass processes traces for performance/scaling purposes. Typically, "analyses" handle trace events one by one, and write some kind of higher-level result to what is basically a time series database. When the user zooms into a particular section of the trace, the corresponding range in the time series database is loaded from disk into RAM, transformed into UI elements, and displayed. That means that, while it is kind of limiting sometimes, if you follow this principle, you never need to load the whole trace (or the complete higher-level representation data) into RAM. Furthermore, if an analysis B depends on the result of an analysis A, it can simply process a trace event after analysis A has processed it. Then analysis B can query analysis A's time series database as it's getting built, but slightly delayed. This way, you only need to process trace events once.

I tried to copy some of it for tracetools_analysis (see the whole EventHandler thing), but the analysis result (i.e., DataModel) is still all loaded into RAM. The thing is that, for @mjcarroll's specific purpose here, or for something that doesn't really produce much data, this is a bit overkill, but I agree that it would make sense to try to come up with something we can all reuse. Sounds like you both have some ideas/tools that could get us close to how Trace Compass achieves this.

@iluetkeb commented Feb 2, 2023

> Maybe it would make sense to have a separate meeting to discuss what this DataModel should look like?

Count me in :-)

> I would think that something like Apache Arrow may be a reasonable approach to storing data in memory and on disk.

I've been looking a bit at Parquet, which is part of Apache Arrow, to store data -- primarily because it is also supported by Dask (the big data variant of pandas).

When looking at the API, I noticed that Arrow also has some query functionality, but I couldn't find something similar to pandas merge, which I, at least, am using heavily to merge the metadata (primarily, what the functions are called) with the other trace data.
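
For context, the kind of merge I mean looks like this in pandas (column names are illustrative, not an actual trace schema):

```python
import pandas as pd

# Raw trace events keyed by a callback handle...
events = pd.DataFrame({
    "callback_handle": [0x1, 0x2, 0x1],
    "timestamp":       [100, 150, 300],
})
# ...and the metadata that names those handles (e.g. the callback symbol).
metadata = pd.DataFrame({
    "callback_handle": [0x1, 0x2],
    "symbol":          ["TimerCallback(...)", "SubscriptionCallback(...)"],
})
# The merge that attaches human-readable names to every event row.
named = events.merge(metadata, on="callback_handle", how="left")
```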

> It has pretty substantial compatibility across languages if we wanted to do the ctf->time series conversion in a separate place from the remainder of the analysis.

Is there interest in languages other than Python? Maybe for performance? If there are significant performance advantages, it might be worth writing the converter from ctf to parquet in C++.

@mjcarroll (Collaborator, Author)

> Is there interest in languages other than Python? Maybe for performance? If there are significant performance advantages, it might be worth writing the converter from ctf to parquet in C++.

I think that it could make sense to do the converter in a non-interpreted language. I would probably want some evidence to prove that it would be worth the effort before starting, though.

@mjcarroll (Collaborator, Author)

I want to continue iterating in this repo, so I'm going to branch this conversation over to here: ros2/ros2_tracing#35

@mjcarroll merged commit aefac5e into main on Feb 6, 2023
@mjcarroll deleted the graph_api branch on February 6, 2023 14:13