
Improve LAPPD data monitoring #319

Open · wants to merge 43 commits into base: online_monitoring

Conversation

furkan-bilgin

This PR improves the LAPPD data monitoring capabilities of ToolAnalysis. It primarily:

  • Reformats the MonitorLAPPDData tool
  • Adds the handles ParseDataMonitoring needs to pass along the additional data required by the new plots
  • Adds plots of the PPS event counter from raw LAPPD data vs PPS time (one for each LAPPD)
  • Adds a plot of the PPS accumulated event number vs system time
  • Adds a plot of PPS time vs PPS accumulated number
  • Adds a plot of the PPS interval drift
  • Adds a histogram of part file number vs data event count (still WIP)

The last histogram is still a work in progress, so I plan to move this PR out of draft once it is done.

- We were previously summing all of the points recorded in the PPS accumulated number vector, resulting in an exponential graph, which is incorrect. We now add only the previous number, which produces a meaningful graph.
- ROOT handles it anyway if the numbers get too big, so we don't need to do it ourselves
- Turns out I didn't need to store some of the properties in vectors, as tree entries serve much the same purpose
- `MonitorLAPPDData` also reads `LAPPD_ID` from `LAPPDTemp` now
@marc1uk
Collaborator

marc1uk commented Nov 27, 2024

Hi Furkan.

  • This is perhaps not really a fault in your code but in its history: I made several attempts at reading through this PR's code and basically my brain crapped out each time. 3,200 lines of code in a single file is too much; I can't hold that much in my head. If possible, please refactor this Tool into multiple, smaller Tools.

  • One part of your PR that could be improved is the number of lines changed. MonitorLAPPDData.cpp has 1,366 changed lines. That's a lot, and the majority of those are just whitespace and coding-style changes (moving curly braces, adding newlines). Some of those kinds of changes can be filtered out when viewing on GitHub, but many can't.
    While I appreciate that working with poorly formatted code can be a pain, for the sake of simplifying PR checking, try to avoid unnecessary changes unless the formatting is really making your life difficult - and when it is, make a pull request that only changes formatting (without any functionality changes), and then a separate PR for the functionality changes. Both are then much easier to validate.

  • Next up is the usual suspects: memory leaks. I haven't gone through this in great detail (with this much code that's not really feasible), but it appears that

canvas_pf_vs_data_events
canvas_pps_interval_drift
canvas_pps_accumulated_number_vs_psec_timestamp
canvas_pps_time_vs_accumulated_number
graph_pps_accumulated_number_vs_psec_timestamp
graph_pps_time_vs_accumulated_number

are not deleted.

  • Similarly I also see some sketchy memory management happening on variables such as t_raw_lappd_data_pps_counts in WriteToFile. It's newed at 1041 and potentially deleted at 1135, but that would appear to leave a TTree with a branch pointing to a deleted object (a segfault waiting to happen, if so); on another code path it gets used at 1214, but it is not subsequently deleted at 1250 - which to me looks like a memory leak. There are several variables on adjacent lines that look similar, and similar allocation happening at 1370 that may well also be leaking t_data_event_timestamps, for example.

  • More fundamentally, the simplest way to avoid memory leaks is not to allocate memory. Is there a reason these vectors need to be on the heap? Can they not just be local function variables? Or member variables? (just make sure to clear them when appropriate in the latter case). The same applies to the numerous histograms and graphs allocated on the heap.

  • Speaking of graphs on the heap, I was particularly concerned by graph_pps_event_counter and graph_pps_interval_drift, which appear to be maps of TGraphs allocated on the heap. Again these are not deleted, so appear to be leaked, and calling new in a for loop is basically a factory for leaked objects. Again, do they really need to be on the heap? In fact, do you need this kind of variable at all? From what I can tell each TGraph is only holding a single datapoint, which seems wildly inefficient.
    I do see this is somewhat copied over from existing (bad) code - which I would also encourage you to fix if at all possible - but at the least please copy the corresponding deletes around line 273.

Considering that the monitoring code should, in theory, run 24/7 for potentially months on end, memory leaks are a real problem. Using the stack is much easier than using valgrind. :)

  • Moving on from memory leaks: I notice at one point you have
// Because ROOT doesn't support serializing std::map, we need to pack it ourselves
// to a vector

and subsequently seem to have two vectors, one of keys and one of values, standing in for the map? I'm not sure where this idea came from - ROOT has supported nearly all standard containers, including maps, since ROOT 6 (and even ROOT 5 could, with a little help from a dictionary). So this seems like an unnecessary complication.

  • Speaking of unnecessary complications, I also see
    std::string psec_timestamp_string;
    // Get the PSecTimestamp given by ParseDataMonitoring tool
    m_data->CStore.Get("PSecTimestamp", psec_timestamp_string);
    // Cast PSec timestamp to long, then push it
    t_pps_accumulated_psec_timestamp = std::stol(psec_timestamp_string);

I'm not sure why you get a variable into a string, only to cast the string to a number. The CStore (an instance of BoostStore) uses the streamer operator (>>) to retrieve variables, so will happily populate a long directly. No need for an intermediate string.

  • Final minor remarks: I've not really followed the logic, but it's always concerning seeing magic numbers - especially very large ones. What's the significance of 3.2E9 at line 2507? Checking for a range of +-1 on 3.2 billion is quite specific... what's going on here?

  • One of my initial concerns on seeing this comparison was whether the diff variable was a double (3.2 billion won't fit in a normal int, and if it's a double then floating-point comparisons are not reliable). It took an annoying amount of tracing to figure out that it's a uint64_t, because of the frequent use of auto. This is admittedly a controversial topic, but I would say: unless there's a reason to do so, please avoid auto and use types explicitly. Being able to easily see the types of variables makes code easier to follow.

@furkan-bilgin furkan-bilgin force-pushed the online_monitoring branch 2 times, most recently from bc96727 to 3f33a9d on November 29, 2024 19:17
@furkan-bilgin
Author

furkan-bilgin commented Nov 29, 2024

Hi Marcus. Thanks a lot for your comments on the PR. So far, I've fixed/refactored most of the things you've pointed out, except:

  • On my claim of ROOT "not supporting" std::map: upon further investigation, I've found that when we try to store nested containers, in our case std::map<int, std::vector<unsigned long>>, we get the following warning, which ToolAnalysis somehow silently drops: "The class requested (map<int,vector<unsigned long> >) for the branch "test_map" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (map<int,vector<unsigned long> >) to avoid to write corrupted data." The silent drop initially led me to assume the type was not supported; I only discovered the warning when I tried to reproduce the behaviour in a standalone ROOT macro afterwards. I'll work on implementing the functionality using dictionaries. (ROOT version 6.24)
  • WriteToFile/ReadToFile heap allocation for ROOT branches: while I wholeheartedly agree on avoiding the heap as much as possible, the entire Monitoring toolchain uses this bad practice for storing things in ROOT (see MonitorMRDTime, MonitorLAPPDSC, MonitorTankTime, etc.), and since (as far as I know) we don't have the means to test the toolchain in a practical fashion (i.e. unit tests), I'm currently hesitant to rewrite this critical part of the code. In a commit that has since been force-pushed out of the branch, I tried to implement it, but got a bunch of segfaults and other runtime errors in the process. I don't think it would be best for the stability of ToolAnalysis for me to do it alone, but I could use some help to pull it off.

I'll post another update when I implement the rest, including the PF# vs Data Event Count plot. Thanks again, and have a nice one.

@marc1uk
Collaborator

marc1uk commented Dec 2, 2024

Ok, that's a shame about ROOT not supporting std::map<int, std::vector<int>> without a dictionary. I'm surprised, I thought it fully supported nesting. :|
As for moving things off the heap, I do understand it's a pain, especially when you're building on already bad code (perhaps we should have been more strict in the past).
From what I can see you're just replacing heap allocations with stack usage. You can't create a segfault without taking a pointer somewhere, so if you're getting segfaults, it should be straightforward enough to find where, by looking at where you take pointers to these stack objects - i.e. in the Branch and SetBranchAddress calls.
Firstly SetBranchAddress needs to take the address of a pointer to an object, not just the address of an object (in contrast to Branch which takes the address of an object). So for example instead of

std::vector<double> t_pps_rate;
t->SetBranchAddress("pps_rate", &t_pps_rate);

you need

std::vector<double> t_pps_rate;
std::vector<double>* t_pps_rate_p = & t_pps_rate;
t->SetBranchAddress("pps_rate", &t_pps_rate_p);

the extra variable is a pain, i know, but c'est la ROOT. :(
The only other thing would then be to ensure the scope of those variables encapsulates the usage. So if you were to try to do:

std::vector<double> t_pps_rate;
if(no_tree){
  t = new TTree("newtree", "mytree");
  t->Branch("pps_rate", &t_pps_rate);
} else {
  t = (TTree*)file->Get("mytree");
  std::vector<double>* t_pps_rate_p = &t_pps_rate;
  t->SetBranchAddress("pps_rate", &t_pps_rate_p);
}
t_pps_rate.push_back(5);
t->Fill();

for example, this would segfault if no_tree was false, because the address of the pps_rate branch relies on the variable t_pps_rate_p, which goes out of scope before the call to t->Fill(). The variables to which you set the branch addresses have to remain in scope for the lifetime of the Tree. From a look at your code you aren't doing this (yet), so maybe try just changing the former (with the latter in mind) and try your code again.

@furkan-bilgin
Author

Wouldn't just using BoostStores be more appropriate for this task? As far as I know, we use ROOT in monitoring as a means to an end, effectively as a database, and I don't think the output files are used for any kind of analysis. Also, since BoostStore uses Boost serialization, we'd be able to serialize any type of container without much of a problem, and without an intermediary pointer variable.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

I'll be honest that I didn't go through the code well enough to know that the files are only used as temporary storage.

That's generally not good practice, but simply swapping TFiles for BoostStores would be more complicated than you might think.
To serialise a class (i.e. store it in a BoostStore) your class needs to declare it as a friend of boost::serialization::access and supply a serialize method that serialises relevant members. You can see this on the various classes in the DataModel of the main ToolAnalysis Application branch - for example in the TimeClass (lines 17 and 60).
Since we don't control ROOT classes, we would need to provide a wrapper to serialise all relevant internal members and perform any necessary initialisation of non-stored state variables to save and retrieve ROOT objects in a BoostStore. 😞 Not very streamlined.
(Technically it is possible to pass pointers to non-serialisable objects via a BoostStore, provided the pointed-to type is obfuscated into an intptr_t, allowing you to use a BoostStore to hold and pass around pointers to non-serialisable heap objects, but I'm not sure this really helps your use-case).

So I think sticking with the file for now is fine. If performance were an issue, you could make a RAMdisk and put the TFile there, so that in effect it stays in RAM, but I'm not sure that's a concern for us.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

If you wanted to move to BoostStores it would be better to just drop the ROOT classes and store the underlying data. A histogram is really just a pair of arrays - one of bin edges and one of contents. A TGraph is likewise a pair of x and y arrays. stl containers such as std::vector are natively supported by BoostStores, so you could store those arrays and reconstruct the ROOT object if/when needed.

@furkan-bilgin
Author

TH1F, TGraph and the other plots are constructed after we read the data in (t_time, t_end, etc.). So, changing:

// ...
std::vector<ULong64_t> *t_time = new std::vector<ULong64_t>;
if (f->GetListOfKeys()->Contains("lappddatamonitor_tree")) {
    // ...
    t->SetBranchAddress("t_start", &t_time);
} else {
    // ...
    t->Branch("t_start", &t_time);
}
t->Write("", TObject::kOverwrite);

to

/*
typechecking=false, type=2=multievent 
(almost every allocation in the codebase uses 
these params)
*/
BoostStore data(false, 2); 

if (/* file exists */) {
    data.Initialise(/* file */);
}
std::vector<ULong64_t> t_time;
data.Set("t_start", t_time);

data.Save(/* file */);

/* Similar to ROOT, each save in BoostStore creates an entry that
we can iterate later when reading the file */

could work. One question I have: most of the codebase allocates BoostStores on the heap (see ParseDataMonitoring ln. 16, MonitorReceive ln. 196-199), so do you think it would be problematic to move them to the stack? Or should we also new it?

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

There would be no problem putting a BoostStore on the stack. I'm pretty sure nothing in C++ will ever function differently on the heap vs the stack; that's just part of the language.

typechecking=false, type=2=multievent 
(almost every allocation in the codebase uses these params)

typechecking will check that when you try to Get something from a BoostStore, the variable you try to populate has the same type as the one that was saved, i.e.

BoostStore b;
double input = 5;
b.Set("adouble", input);
int output;
int ok = b.Get("adouble", output);

is technically undefined behaviour (potentially even crashing the program depending on the types involved), and will result in garbage in output, but the return value ok will be true as long as the requested variable exists in the BoostStore.
With typechecking enabled it will instead bail out and return false - so I would strongly recommend turning it on (and of course checking the return values of Get calls) to avoid unexpected behaviour.

multievent is what lets you use Save and GetEntry to read back old entries, like with a TTree, so you'd need that to be true.

The code you have may work (I'm honestly not sure off the top of my head whether you can re-open and append to multi-entry BoostStores; perhaps you can...), but in any case I don't think there would be much, if any, benefit compared to just moving the ROOT objects off the heap.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

:) Though if you are going to go the ROOT way, make sure you sort your branch addresses out. Your last example again passes the same variable to Branch and SetBranchAddress, whereas in this case TTree::Branch should use just t_time, not &t_time.

@furkan-bilgin
Author

I've refactored the code as we've discussed, for both reading from and writing to ROOT, and it now works without a problem.

@furkan-bilgin furkan-bilgin marked this pull request as ready for review December 20, 2024 15:26