
Improve LAPPD data monitoring #319

Open · wants to merge 43 commits into base: online_monitoring

Conversation

furkan-bilgin

This PR improves the LAPPD data monitoring capabilities of ToolAnalysis. It primarily:

  • Reformats the MonitorLAPPDData tool
  • Adds the handles ParseDataMonitoring needs to pass along the additional data required by the new plots
  • Adds plots of the PPS event counter from raw LAPPD data vs PPS time (one for each LAPPD)
  • Adds a plot of the PPS accumulated event number vs system time
  • Adds a plot of PPS time vs PPS accumulated number
  • Adds a plot of the PPS interval drift
  • Adds a histogram of part file number vs data event count (still WIP)

The last histogram is still a work in progress, so I plan to move this PR out of draft once it is done.

- We were previously summing all of the points recorded in the PPS accumulated number vector, resulting in an exponential graph, which is incorrect. We now add only the previous number, which produces a meaningful graph.
- ROOT handles it anyway if the numbers get too big, so we don't need to do it ourselves
- Turns out I didn't need to store some of the properties in vectors, as tree entries serve much the same purpose
- `MonitorLAPPDData` also reads `LAPPD_ID` from `LAPPDTemp` now
@marc1uk
Collaborator

marc1uk commented Nov 27, 2024

Hi Furkan.

  • This is perhaps not really a fault in your code but in its history: I made several attempts at reading through this PR's code and basically my brain crapped out each time. 3,200 lines of code in a single file is too much; I can't hold that much in my head. If possible, please refactor this Tool into multiple, smaller Tools.

  • One part of your PR that could be improved is the number of lines changed. MonitorLAPPDData.cpp has 1,366 changed lines. That's a lot, and the majority of those are just whitespace and coding-style changes (moving curly braces, adding newlines). Some of those kinds of changes can be filtered out when viewing on GitHub, but many can't.
    While I appreciate that working with poorly formatted code can be a pain, for the sake of simplifying PR checking, try to avoid unnecessary changes unless the formatting is really making your life difficult - and when it is, make a pull request that only changes formatting (without any functionality changes), and then a separate PR for the functionality changes. Both are then much easier to validate.

  • Next up is the usual suspects: memory leaks. I haven't gone through this in great detail (with this much code that's not really feasible), but it appears that

canvas_pf_vs_data_events
canvas_pps_interval_drift
canvas_pps_accumulated_number_vs_psec_timestamp
canvas_pps_time_vs_accumulated_number
graph_pps_accumulated_number_vs_psec_timestamp
graph_pps_time_vs_accumulated_number

are not deleted.

  • Similarly I also see some sketchy memory management happening on variables such as t_raw_lappd_data_pps_counts in WriteToFile. It's newed at 1041 and potentially deleted at 1135, but that would appear to leave a TTree with a branch pointing to a deleted object (a segfault waiting to happen, if so); on another code path it gets used at 1214, but it is not subsequently deleted at 1250 - which to me looks like a memory leak. There are several variables on adjacent lines that look similar, and similar allocation happening at 1370 that may well also be leaking t_data_event_timestamps, for example.

  • More fundamentally, the simplest way to avoid memory leaks is not to allocate memory. Is there a reason these vectors need to be on the heap? Can they not just be local function variables? Or member variables? (just make sure to clear them when appropriate in the latter case). The same applies to the numerous histograms and graphs allocated on the heap.

  • Speaking of graphs on the heap, I was particularly concerned by graph_pps_event_counter and graph_pps_interval_drift, which appear to be maps of TGraphs allocated on the heap. Again these are not deleted, so appear to be leaked, and calling new in a for loop is basically a factory for leaked objects. Again, do they really need to be on the heap? In fact, do you need this kind of variable at all? From what I can tell each TGraph is only holding a single datapoint, which seems wildly inefficient.
    I do see this is somewhat copied over from existing (bad) code - which I would also encourage you to fix if at all possible - but at the least please copy the corresponding deletes around line 273.

Considering that the monitoring code should, in theory, run 24/7 for potentially months on end, memory leaks are a real problem. Using the stack is much easier than using valgrind. :)

  • Moving on from memory leaks: I notice at one point you have
// Because ROOT doesn't support serializing std::map, we need to pack it ourselves
// to a vector

and subsequently seem to have two vectors, one of keys and one of values, standing in for the map? I'm not sure where this idea came from - ROOT has supported nearly all standard containers, including maps, since ROOT 6 (and even ROOT 5 could, with a little help from a dictionary). So this seems like an unnecessary complication.

  • Speaking of unnecessary complications, I also see
    std::string psec_timestamp_string;
    // Get the PSecTimestamp given by ParseDataMonitoring tool
    m_data->CStore.Get("PSecTimestamp", psec_timestamp_string);
    // Cast PSec timestamp to long, then push it
    t_pps_accumulated_psec_timestamp = std::stol(psec_timestamp_string);

I'm not sure why you get a variable into a string, only to cast the string to a number. The CStore (an instance of BoostStore) uses the streamer operator (>>) to retrieve variables, so will happily populate a long directly. No need for an intermediate string.

  • Final minor remarks: I've not really followed the logic, but it's always concerning seeing magic numbers - especially very large ones. What's the significance of 3.2E9 at line 2507? Checking for a range of +-1 on 3.2 billion is quite specific... what's going on here?

  • One of my initial concerns on seeing this comparison was whether the diff variable was a double (3.2 billion won't fit in a normal int, and if it's a double then floating-point comparisons are not reliable). It took an annoying amount of tracing to figure out that it's a uint64_t, because of the frequent use of auto. This is admittedly a controversial topic, but I would say: unless there's a reason to do so, please avoid auto and use types explicitly. Being able to easily see the types of variables makes code easier to follow.

@furkan-bilgin furkan-bilgin force-pushed the online_monitoring branch 2 times, most recently from bc96727 to 3f33a9d on November 29, 2024 19:17
@furkan-bilgin
Author

furkan-bilgin commented Nov 29, 2024

Hi Marcus. Thanks a lot for your comments on the PR. So far, I've fixed/refactored most of the things you've pointed out, except:

  • On my claim of ROOT "not supporting" std::map: upon further investigation, I've found that when we try to store nested containers, in our case std::map<int, std::vector<unsigned long>>, we get the following warning, which ToolAnalysis somehow silently drops: "The class requested (map<int,vector<unsigned long> >) for the branch "test_map" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (map<int,vector<unsigned long> >) to avoid to write corrupted data." The silent drop initially led me to assume the type was not supported; I only discovered the warning when I tried to reproduce the behaviour in a standalone ROOT macro afterwards. I'll work on implementing the functionality using dictionaries. (ROOT version 6.24)
  • WriteToFile/ReadToFile heap allocation for ROOT branches: while I wholeheartedly agree on avoiding the heap as much as possible, the entire Monitoring toolchain uses this bad practice for storing things in ROOT (see MonitorMRDTime, MonitorLAPPDSC, MonitorTankTime, etc.), and since (as far as I know) we don't have the means to test the toolchain in a practical fashion (i.e. unit tests), I'm currently hesitant to rewrite this critical part of the code. In a commit that has since been force-pushed out of the branch, I tried to implement it, but got a bunch of segfaults and other runtime errors in the process. I don't think it would be best for the stability of ToolAnalysis for me to do it alone, but I could use some help to pull it off.

I'll post another update when I implement the rest, including the PF# vs Data Event Count plot. Thanks again, and have a nice one.

@marc1uk
Collaborator

marc1uk commented Dec 2, 2024

Ok, that's a shame about ROOT not supporting std::map<int, std::vector<int>> without a dictionary. I'm surprised, I thought it fully supported nesting. :|
As for moving things off the heap, I do understand it's a pain, especially when you're building on already bad code (perhaps we should have been more strict in the past).
From what I can see you're just replacing heap allocations with stack usage. You can't create a segfault without taking a pointer somewhere, so if you're getting segfaults, it should be straightforward enough to find where, by looking at where you take pointers to these stack objects - i.e. in the Branch and SetBranchAddress calls.
Firstly SetBranchAddress needs to take the address of a pointer to an object, not just the address of an object (in contrast to Branch which takes the address of an object). So for example instead of

std::vector<double> t_pps_rate;
t->SetBranchAddress("pps_rate", &t_pps_rate);

you need

std::vector<double> t_pps_rate;
std::vector<double>* t_pps_rate_p = & t_pps_rate;
t->SetBranchAddress("pps_rate", &t_pps_rate_p);

the extra variable is a pain, i know, but c'est la ROOT. :(
The only other thing would then be to ensure the scope of those variables encapsulates the usage. So if you were to try to do:

std::vector<double> t_pps_rate;
if(no_tree){
  t = new TTree("newtree", "mytree");
  t->Branch("pps_rate", &t_pps_rate);
} else {
  t = (TTree*)file->Get("mytree");
  std::vector<double>* t_pps_rate_p = &t_pps_rate;
  t->SetBranchAddress("pps_rate", &t_pps_rate_p);
}
t_pps_rate.push_back(5);
t->Fill();

for example, this would segfault if no_tree was false, because the address of the pps_rate branch relies on the variable t_pps_rate_p, which goes out of scope before the call to t->Fill(). The variables to which you set the branch addresses have to remain in scope for the lifetime of the Tree. From a look at your code you aren't doing this (yet), so maybe try just changing the former (with the latter in mind) and try your code again.

@furkan-bilgin
Author

Wouldn't just using BoostStores be more appropriate for this task? As far as I know, we use ROOT in monitoring as a means to an end, effectively as a database, and I don't think the output files are used for any kind of analysis. Also, since BoostStore uses Boost serialization, we'd be able to serialize any type of container without much of a problem, and without an intermediary pointer variable.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

I'll be honest that I didn't go through the code well enough to know that the files are only used as temporary storage.

That's generally not good practice, but simply swapping TFiles for BoostStores would be more complicated than you might think.
To serialise a class (i.e. store it in a BoostStore) your class needs to declare it as a friend of boost::serialization::access and supply a serialize method that serialises relevant members. You can see this on the various classes in the DataModel of the main ToolAnalysis Application branch - for example in the TimeClass (lines 17 and 60).
Since we don't control ROOT classes, we would need to provide a wrapper to serialise all relevant internal members and perform any necessary initialisation of non-stored state variables to save and retrieve ROOT objects in a BoostStore. 😞 Not very streamlined.
(Technically it is possible to pass pointers to non-serialisable objects via a BoostStore, provided the pointed-to type is obfuscated into an intptr_t, allowing you to use a BoostStore to hold and pass around pointers to non-serialisable heap objects, but I'm not sure this really helps your use-case).

So I think sticking with the file for now is fine. If performance were an issue, you could make a RAMdisk and put the TFile there, so that in effect it stays in RAM, but I'm not sure that's a concern for us.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

If you wanted to move to BoostStores it would be better to just drop the ROOT classes and store the underlying data. A histogram is really just a pair of arrays - one of bin edges and one of contents. A TGraph is likewise a pair of x and y arrays. stl containers such as std::vector are natively supported by BoostStores, so you could store those arrays and reconstruct the ROOT object if/when needed.

@furkan-bilgin
Author

TH1F, TGraph and the other plots are constructed after we read the data in (t_time, t_end, etc.). So, changing:

// ...
std::vector<ULong64_t> *t_time = new std::vector<ULong64_t>;
if (f->GetListOfKeys()->Contains("lappddatamonitor_tree")) {
    // ...
    t->SetBranchAddress("t_start", &t_time);
} else {
    // ...
    t->Branch("t_start", &t_time);
}
t->Write("", TObject::kOverwrite);

to

/*
typechecking=false, type=2=multievent 
(almost every allocation in the codebase uses 
these params)
*/
BoostStore data(false, 2); 

if (/* file exists */) {
    data.Initialise(/* file */);
}
std::vector<ULong64_t> t_time;
data.Set("t_start", t_time);

data.Save(/* file */);

/* Similar to ROOT, each save in BoostStore creates an entry that
we can iterate later when reading the file */

could work. One question I have: most of the codebase allocates BoostStores on the heap (see ParseDataMonitoring ln. 16, MonitorReceive ln. 196-199), so do you think it would be problematic to move them to the stack? Or should we also new it?

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

There would be no problem putting a BoostStore on the stack. I'm pretty sure nothing in C++ will ever function differently on the heap vs the stack; that's just part of the language.

typechecking=false, type=2=multievent 
(almost every allocation in the codebase uses these params)

typechecking will check that when you try to Get something from a BoostStore, the variable you try to populate has the same type as the one that was saved, i.e.

BoostStore b;
double input = 5;
b.Set("adouble", input);
int output;
int ok = b.Get("adouble", output);

is technically undefined behaviour (potentially even crashing the program depending on the types involved), and will result in garbage in output, but the return value ok will be true as long as the requested variable exists in the BoostStore.
With typechecking enabled it will instead bail out and return false - so I would strongly recommend turning it on (and of course checking the return values of Get calls) to avoid unexpected behaviour.

multievent is what lets you use Save and GetEntry to read back old entries, like with a TTree, so you'd need that to be true.

The code you have may work (I'm honestly not sure off the top of my head whether you can re-open and append to multi-entry BoostStores; perhaps you can...), but in any case I don't think there would be much, if any, benefit compared to just moving the ROOT objects off the heap.

@marc1uk
Collaborator

marc1uk commented Dec 3, 2024

:) Though if you are going to go the ROOT way, make sure you sort your branch addresses out. Your last example again passes the same variable to Branch and SetBranchAddress, whereas in this case TTree::Branch should use just t_time, not &t_time.

@furkan-bilgin
Author

I've refactored the code as we've discussed, for both reading from and writing to ROOT, and it now works without a problem.

@furkan-bilgin furkan-bilgin marked this pull request as ready for review December 20, 2024 15:26