
Graph API for interacting with ros2_tracing events #1

Merged: 2 commits merged into main from graph_api on Feb 6, 2023

Conversation

@mjcarroll (Collaborator) commented Jan 30, 2023

This introduces an alternative to the mechanism for interacting with trace data introduced in https://github.com/ros-tracing/tracetools_analysis.

The primary data structure is a graph that represents the ROS 2 computation graph. From this entrypoint, users can introspect on the runtime layout of a traced ROS 2 system as recorded via tracetools.

In addition to the individual events, the data is associated across elements via:

  1. Publication and Subscription events on the same topic are associated
  2. Publications from within timer callbacks are associated
  3. Publications from within subscription callbacks are associated.

This allows users to inspect causal "chains" of events across a ROS 2 computation graph.

As an example, in the Mont Blanc test topology, we can trace the sequence of events that causes the arequipa subscription to fire:

| node | event | topic | timestamp (ms) |
| --- | --- | --- | --- |
| portsmouth | callback_start | timer | 0.000000 |
| portsmouth | rclcpp_pub | /danube | 0.023422 |
| portsmouth | rcl_pub | /danube | 0.025456 |
| portsmouth | rmw_pub | /danube | 0.025918 |
| portsmouth | callback_end | timer | 0.069164 |
| hamburg | rmw_take | /danube | 0.172475 |
| hamburg | rcl_take | /danube | 0.172953 |
| hamburg | rclcpp_take | /danube | 0.173848 |
| hamburg | callback_start | /danube | 0.182634 |
| hamburg | rclcpp_pub | /parana | 0.206418 |
| hamburg | rcl_pub | /parana | 0.208355 |
| hamburg | rmw_pub | /parana | 0.209520 |
| hamburg | callback_end | /danube | 0.315995 |
| geneva | rmw_take | /parana | 0.339240 |
| geneva | rcl_take | /parana | 0.339772 |
| geneva | rclcpp_take | /parana | 0.340232 |
| geneva | callback_start | /parana | 0.341611 |
| geneva | rclcpp_pub | /arkansas | 0.354110 |
| geneva | rcl_pub | /arkansas | 0.355227 |
| geneva | rmw_pub | /arkansas | 0.356448 |
| geneva | callback_end | /parana | 0.398542 |
| arequipa | rmw_take | /arkansas | 0.454997 |
| arequipa | rcl_take | /arkansas | 0.455474 |
| arequipa | rclcpp_take | /arkansas | 0.456482 |
| arequipa | callback_start | /arkansas | 0.462663 |
| arequipa | callback_end | /arkansas | 0.466198 |
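
Purely for illustration, walking such a chain from Python could look roughly like this (the module, function, and attribute names below are hypothetical stand-ins, not the actual API added here):

```python
# Hypothetical sketch -- names below are illustrative, not the API in this PR.
from trace_graph import load_graph  # assumed entry point for the graph

graph = load_graph("mont_blanc_trace/")  # trace directory recorded via tracetools

# Locate the arequipa subscription on /arkansas and walk backwards through
# the associated publications and callbacks that led to it firing.
sub = graph.find_subscription(node="arequipa", topic="/arkansas")
for event in sub.causal_chain():
    print(event.node, event.name, event.topic, event.timestamp)
```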

@christophebedard

I'm guessing you're going to bring tracetools_analysis back after upgrading it to babeltrace2? Have you done any trace-reading performance comparison between babeltrace and babeltrace2?

@mjcarroll (Collaborator, Author)

> I'm guessing you're going to bring tracetools_analysis back after upgrading it to babeltrace2?

While it is up to you, I think the better long-term plan would be to add this as a "peer" API in tracetools_analysis. It seems like a useful way of looking at the same data, associated with the relevant pieces of the ROS computation graph.

> Have you done any trace-reading performance comparison between babeltrace and babeltrace2?

Not yet. Is there a specific benchmark you are interested in?

@christophebedard

> While it is up to you, I think the better long-term plan would be to add this as a "peer" API in tracetools_analysis. It seems like a useful way of looking at the same data, associated with the relevant pieces of the ROS computation graph.

I'm mostly referring to the utilities for actually processing the trace, event by event (i.e., EventHandler/Processor). It's kind of a simplified version of Trace Compass' event processing architecture, which allows you to create one EventHandler that uses the result of another EventHandler.

I guess you don't really need this if you process everything at once, though. I was just wondering.
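
Roughly, the pattern looks like this (a simplified sketch; the handler classes, tracepoint names, and field names below are only illustrative, not the actual tracetools_analysis API):

```python
# Simplified sketch of the chaining idea; handler classes, tracepoint names,
# and field names are illustrative, not the actual tracetools_analysis API.

class NodeHandler:
    """'Analysis A': builds a node_handle -> node name map as events arrive."""

    def __init__(self):
        self.nodes = {}

    def handle(self, event):
        if event["_name"] == "ros2:rcl_node_init":
            self.nodes[event["node_handle"]] = event["node_name"]


class CallbackHandler:
    """'Analysis B': uses A's partial result while processing later events."""

    def __init__(self, node_handler):
        self.node_handler = node_handler
        self.callbacks = []

    def handle(self, event):
        if event["_name"] == "ros2:callback_start":
            node = self.node_handler.nodes.get(event.get("node_handle"), "?")
            self.callbacks.append((node, event["_timestamp"]))


def process(events):
    a = NodeHandler()
    b = CallbackHandler(a)
    for event in events:   # single pass over the trace
        a.handle(event)    # A updates its state first...
        b.handle(event)    # ...so B can already rely on what A has seen so far
    return b.callbacks
```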

> Not yet. Is there a specific benchmark you are interested in?

Nothing specific. Just wondering how performance would be affected for typical use-cases if we were to upgrade tracetools_analysis to babeltrace2.

@mjcarroll (Collaborator, Author)

We discussed performance a little bit offline; the initial conclusion is that babeltrace2 is actually slower than babeltrace, at least in the way that we are currently using it (processing all of the events in a single shot).

Here is a notebook that covers some of the potential ways this could be used, with performance measurements: https://gist.github.com/mjcarroll/34e7f06d761c8c6ce2cce36027900b34
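
For reference, the "single shot" style being measured is roughly the following (a minimal sketch using the babeltrace2 Python bindings; the trace path is a placeholder):

```python
import bt2

def load_events(trace_path):
    """Read every event message in the trace in one pass."""
    events = []
    for msg in bt2.TraceCollectionMessageIterator(trace_path):
        if type(msg) is bt2._EventMessageConst:
            events.append((
                msg.default_clock_snapshot.ns_from_origin,  # timestamp (ns)
                msg.event.name,                             # e.g. "ros2:callback_start"
            ))
    return events

events = load_events("mont_blanc_trace/")  # placeholder path
```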

@iluetkeb commented Feb 1, 2023

So, I haven't really looked at the new API in depth. But one thing I noticed is that you remove the "procname" field from the events. This is an example of something that you can probably ignore for your analysis, but that I can't ignore for what I'm doing.

And that is what I see a lot: Most people have some specific analysis in mind, and the code they design is tailored to that, and then it isn't re-usable.

Another example of this is that a lot of tracetools_analysis code expects certain kernel events, but in our use case, we'd like to avoid the need for users to have all the permissions set up, and so our traces don't contain these events.

I've been thinking a bit about how to avoid this, or whether it is actually possible to avoid. One particular problem I noticed is that CTF is not queryable, so the first part of each processing pipeline converts it into something that is queryable, usually either an internal data structure or a database. The second part of the pipeline then puts some convenient (where "convenient" is very use-case specific) API on whatever the first part produced.

Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.

Since our data is a time series, I would suggest making the base API time-series oriented. We can then put things like graph APIs on top.
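
To make this concrete, here is a rough strawman of what such a base API could look like (all names and types below are illustrative, not a concrete proposal):

```python
# Strawman only: a time-series-oriented event store with minimal assumptions
# (timestamp, tracepoint name, fixed payload, variable context).
from typing import Iterable, Iterator, Mapping, NamedTuple, Optional, Set


class Event(NamedTuple):
    timestamp: int                  # ns since trace origin
    name: str                       # tracepoint name, e.g. "ros2:callback_start"
    fields: Mapping[str, object]    # fixed per-tracepoint payload
    context: Mapping[str, object]   # variable context (procname, cpu_id, ...)


class EventStore:
    """Queryable, time-ordered store; an in-memory backend for illustration."""

    def __init__(self, events: Iterable[Event]):
        self._events = sorted(events, key=lambda e: e.timestamp)

    def query(self, start: int = 0, end: Optional[int] = None,
              names: Optional[Set[str]] = None) -> Iterator[Event]:
        """Iterate events in time order, filtered by range and tracepoint name."""
        for e in self._events:
            if e.timestamp < start:
                continue
            if end is not None and e.timestamp > end:
                break
            if names is None or e.name in names:
                yield e
```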

What do you guys think?

@mjcarroll (Collaborator, Author)

> But one thing I noticed is that you remove the "procname" field from the events. This is an example of something that you can probably ignore for your analysis, but that I can't ignore for what I'm doing.

Removing procname is a hold-over from me doing initial analysis with entirely composed nodes, so it wasn't particularly interesting at the time. I will add that back in.

> Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.

I think this actually makes a lot of sense.

To begin with here, I started from the API that @christophebedard had set up in tracetools_read, but ended up dropping it in the short term to try using babeltrace2. My intent (as mentioned above) is to potentially merge the bt2 improvements upstream and then take advantage of that API.

Maybe it would make sense to have a separate meeting to discuss what this DataModel should look like? I imagine that it would look a lot like what is already implemented in tracetools_analysis, and I think I could likely rewrite the processing here to take advantage of that underlying structure.

@mjcarroll (Collaborator, Author)

I would think that something like Apache Arrow may be a reasonable approach to storing data in memory and on disk. It has pretty substantial compatibility across languages if we wanted to do the ctf->time series conversion in a separate place from the remainder of the analysis.
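
As a rough illustration of the idea (assuming pyarrow; the columns just mirror the example table above and are not a settled schema):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One Arrow table of events, written to Parquet so any language with
# Arrow bindings could read it back for analysis.
events = pa.table({
    "node":      ["portsmouth", "portsmouth", "portsmouth"],
    "event":     ["callback_start", "rclcpp_pub", "rcl_pub"],
    "topic":     ["timer", "/danube", "/danube"],
    "timestamp": pa.array([0, 23422, 25456], type=pa.int64()),  # ns offsets
})
pq.write_table(events, "trace_events.parquet")
readback = pq.read_table("trace_events.parquet")  # e.g. from another language/process
```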

@christophebedard

> Therefore, I've been wondering whether it would be useful for us to have a base API which makes very few, if any, assumptions beyond that it knows certain tracepoints (processing only those which are actually present) and that they have certain fixed elements and certain variable elements (the context), but which is queryable and iterable efficiently. This can be implemented both for in-memory and for on-disk storage.
>
> I think this actually makes a lot of sense.

I agree. This is pretty similar to the way Trace Compass processes traces for performance/scaling purposes. Typically, "analyses" handle trace events one by one, and write some kind of higher-level result to what is basically a time series database. When the user zooms into a particular section of the trace, the corresponding range in the time series database is loaded from disk into RAM, transformed into UI elements, and displayed. That means that, while it is kind of limiting sometimes, if you follow this principle, you never need to load the whole trace (or the complete higher-level representation data) into RAM. Furthermore, if an analysis B depends on the result of an analysis A, it can simply process a trace event after analysis A has processed it. Then analysis B can query analysis A's time series database as it's getting built, but slightly delayed. This way, you only need to process trace events once.

I tried to copy some of it for tracetools_analysis (see the whole EventHandler thing), but the analysis result (i.e., DataModel) is still all loaded into RAM. The thing is that, for @mjcarroll's specific purpose here, or for something that doesn't really produce much data, this is a bit overkill, but I agree that it would make sense to try to come up with something we can all reuse. Sounds like you both have some ideas/tools that could get us close to how Trace Compass achieves this.

@iluetkeb commented Feb 2, 2023

> Maybe it would make sense to have a separate meeting to discuss what this DataModel should look like?

Count me in :-)

> I would think that something like Apache Arrow may be a reasonable approach to storing data in memory and on disk.

I've been looking a bit at Parquet, which is part of Apache Arrow, to store data -- primarily because it is also supported by Dask (the big data variant of pandas).

When looking at the API, I noticed that Arrow also has some query functionality, but I couldn't find something similar to pandas merge, which I, at least, am using heavily to merge the metadata (primarily, what the functions are called) with the other trace data.
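
For context, the kind of merge I mean looks like this in pandas (column names are illustrative, not an actual trace schema):

```python
import pandas as pd

# Raw trace events keyed by a callback handle...
events = pd.DataFrame({
    "callback_handle": [0x1, 0x2, 0x1],
    "timestamp":       [100, 150, 300],
})
# ...and the metadata that names those handles (e.g. the callback symbol).
metadata = pd.DataFrame({
    "callback_handle": [0x1, 0x2],
    "symbol":          ["TimerCallback(...)", "SubscriptionCallback(...)"],
})
# The merge that attaches human-readable names to every event row.
named = events.merge(metadata, on="callback_handle", how="left")
```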

> It has pretty substantial compatibility across languages if we wanted to do the ctf->time series conversion in a separate place from the remainder of the analysis.

Is there interest in languages other than Python? Maybe for performance? If there are significant performance advantages, it might be worth writing the converter from ctf to parquet in C++.

@mjcarroll (Collaborator, Author)

> Is there interest in languages other than Python? Maybe for performance? If there are significant performance advantages, it might be worth writing the converter from ctf to parquet in C++.

I think that it could make sense to do the converter in a non-interpreted language. I would probably want some evidence to prove that it would be worth the effort before starting, though.

@mjcarroll (Collaborator, Author)

I want to continue iterating in this repo, so I'm going to branch this conversation over to here: ros2/ros2_tracing#35

@mjcarroll merged commit aefac5e into main on Feb 6, 2023
@mjcarroll deleted the graph_api branch on February 6, 2023 14:13