MSUnmerged: Track accumulated errors #10982
Comments
FYI @amaltaro |
Thanks for creating this issue, Todor. Do I read it right that there are only 2 problems reported:
If so, I guess the only ones that deserve attention are those reporting a communication error, right? It could be that this is a somewhat generic error from the gfal plugin, though. |
Hi @amaltaro, yes, I think you are correct. I'll double check today. |
The more I look into those logs the more confident I get that we should exclude the We are going to have a deployment in production today due to another minor configuration error I've found, so I am about to make the relevant configuration PRs for excluding this part of the tree too. FYI @amaltaro |
And here are the relevant configuration changes: |
@ivmfnal Igor, don't we have this path skipped from the Rucio ConMon system? @todor-ivanov the deployment is just the deployment of the new configuration, right? |
Hi @amaltaro,
Almost... here is another minor configuration error (a typo) that we found with @muhammadimranfarooqi earlier today and which actually triggered the need for this deployment, but it should not have caused any trouble while the service was running. |
And just for logging purposes, here follows a short list of all the RSEs that are missing from MongoDB for various reasons, listed below:
Hi @ivmfnal, could we check if this information I am extracting from our database and logs is correct? Sorry for causing additional trouble with such a request. @amaltaro in order to fetch those errors I had to do a several [1]
|
Todor, you can see all the status and some history about CC here: https://cmsweb-k8s-prod.cern.ch/rucioconmon/index
Igor
…________________________________
From: todor-ivanov ***@***.***>
Sent: Friday, February 11, 2022 10:25 AM
To: dmwm/WMCore ***@***.***>
Cc: Igor V Mandrichenko ***@***.***>; Mention ***@***.***>
Subject: Re: [dmwm/WMCore] MSUnmerged: Track accumulated errors (Issue #10982)
And just for logging purposes, here follows a short list of all the RSEs that are missing from MongoDB for various reasons, listed below:
* T1:
T1_FR_CCIN2P3_Disk In non-final state in Rucio Consistency Monitor
T1_IT_CNAF_Disk In non-final state in Rucio Consistency Monitor
* T2T3:
T2_BE_UCL In non-final state in Rucio Consistency Monitor
T2_ES_IFCA In non-final state in Rucio Consistency Monitor
T2_FR_CCIN2P3 Missing in stats records at Rucio Consistency Monitor
T2_GR_Ioannina In non-final state in Rucio Consistency Monitor
T2_PK_NCP In non-final state in Rucio Consistency Monitor
T2_RU_IHEP In non-final state in Rucio Consistency Monitor
T2_UK_London_Brunel In non-final state in Rucio Consistency Monitor
T3_BG_UNI_SOFIA Missing in stats records at Rucio Consistency Monitor
T3_CH_CERNBOX Missing in stats records at Rucio Consistency Monitor
T3_CH_CERN_OpenData Missing in stats records at Rucio Consistency Monitor
T3_CH_PSI Missing in stats records at Rucio Consistency Monitor
T3_FR_IPNL Missing in stats records at Rucio Consistency Monitor
T3_HR_IRB Missing in stats records at Rucio Consistency Monitor
T3_IR_IPM Missing in stats records at Rucio Consistency Monitor
T3_IT_MIB Missing in stats records at Rucio Consistency Monitor
T3_IT_Trieste Missing in stats records at Rucio Consistency Monitor
T3_KR_KISTI Missing in stats records at Rucio Consistency Monitor
T3_KR_KNU Missing in stats records at Rucio Consistency Monitor
T3_KR_UOS Missing in stats records at Rucio Consistency Monitor
T3_MX_Cinvestav Missing in stats records at Rucio Consistency Monitor
T3_TW_NCU Missing in stats records at Rucio Consistency Monitor
T3_TW_NTU_HEP Missing in stats records at Rucio Consistency Monitor
* T2T3_US:
T2_US_Caltech_Ceph is skipped due to a restriction set in msConfig
T2_US_Florida In non-final state in Rucio Consistency Monitor
T2_US_Nebraska In non-final state in Rucio Consistency Monitor
T3_US_Baylor Missing in stats records at Rucio Consistency Monitor
T3_US_Brown Missing in stats records at Rucio Consistency Monitor
T3_US_CMU Missing in stats records at Rucio Consistency Monitor
T3_US_Colorado Missing in stats records at Rucio Consistency Monitor
T3_US_FNALLPC Missing in stats records at Rucio Consistency Monitor
T3_US_MIT Missing in stats records at Rucio Consistency Monitor
T3_US_NERSC Missing in stats records at Rucio Consistency Monitor
T3_US_NotreDame Missing in stats records at Rucio Consistency Monitor
T3_US_OSU Missing in stats records at Rucio Consistency Monitor
T3_US_Princeton_ICSE Missing in stats records at Rucio Consistency Monitor
T3_US_PuertoRico Missing in stats records at Rucio Consistency Monitor
T3_US_Rice Missing in stats records at Rucio Consistency Monitor
T3_US_TAMU Missing in stats records at Rucio Consistency Monitor
T3_US_UMD Missing in stats records at Rucio Consistency Monitor
T3_US_UMiss Missing in stats records at Rucio Consistency Monitor
Hi @ivmfnal, could we check if this information I am extracting from our database and logs is correct? Sorry for causing additional trouble with such a request.
@amaltaro in order to fetch those errors I had to do several set operations to figure out which RSEs are actually missing from the database, and then I had to parse all those long logs that we have in order to find the reasons. This is because all of those RSEs exit the pipeline by raising an MSUnmergedException quite early [1], and we never considered putting a record in the database for such cases. Given how difficult and time consuming it was to track all that, I'd say we had better record this as an actual error in the RSE object and preserve it in the database. I may create yet another bugfix for that too. What do you think?
[1]
https://github.com/dmwm/WMCore/blob/90a3cb7703bebe2957f94a1887183da7bd6d8a55/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L491
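The set operations and the proposed error record described above could look roughly like the sketch below. It is only an illustration: the document fields `errors` and `isClean` and both helper names are hypothetical and are not taken from the MSUnmerged code or its MongoDB schema.

```python
def find_missing_rses(expected, recorded):
    """Return the RSEs that are expected but have no database record."""
    return sorted(set(expected) - set(recorded))


def record_early_exit(rse_doc, reason):
    """Attach an early-exit reason to an RSE document (hypothetical fields).

    The idea is to keep the reason in the record itself, so that it does
    not have to be reconstructed from the service logs later.
    """
    rse_doc.setdefault("errors", []).append(reason)
    rse_doc["isClean"] = False
    return rse_doc


# Names taken from the list in this thread; the split is made up for the example.
expected = ["T2_BE_UCL", "T2_ES_IFCA", "T2_CH_CERN"]
recorded = ["T2_CH_CERN"]

for name in find_missing_rses(expected, recorded):
    doc = record_early_exit(
        {"name": name}, "In non-final state in Rucio Consistency Monitor"
    )
    print(doc)
```

With such a record in place, the missing RSEs and their reasons could be fetched with a single query instead of a comparison against the logs.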
|
Thank you @ivmfnal, we will track the information provided there too. I am also creating another issue just to address the missing records in our database, and I will put the link in the docstring inside the code so that the sites in error can be linked from both systems. |
With the latest version of MSUnmerged deployed in production, we can easily check the result of the above query to MongoDB through the From what I can see in the current run there are
Three of those 4 RSEs above are actually marked as clean, because the number of directories found to be purged equals the number of directories which either have been successfully deleted or were already gone at the time MSUnmerged retried the RSE. The fourth one: |
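The "clean" rule described above amounts to a simple counter comparison. A minimal sketch, assuming the three counters are available as plain integers (the function and parameter names are hypothetical, not the ones used in MSUnmerged):

```python
def is_rse_clean(dirs_to_purge, dirs_deleted, dirs_already_gone):
    """An RSE counts as clean when every directory found for purging was
    either deleted successfully or was already gone on retry."""
    return dirs_to_purge == dirs_deleted + dirs_already_gone


print(is_rse_clean(10, 7, 3))  # all 10 directories accounted for
print(is_rse_clean(10, 7, 2))  # one directory still pending
```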
The error we get for
I did try to run the service in standalone mode in my VM, but every attempt to load this document into memory is killed by the kernel (most probably because of VM resource limitations, but anyway):
Checking the status of the RSE at RucioConMon it seems like the record is completely broken. Even though it says here the RSE is in status So:
And this is, kind of, not unexpected, because RAL has (if I am not wrong) a customized setup, which may somehow be affecting the consistency monitor run. @ivmfnal Do you know any more details on that? And should we even try to run the service on that RSE, or should we skip it completely? Maybe @KatyEllis can shed some light here too. [1]
|
@todor-ivanov JSON is a text file. Even if it is 800MB you can still open it with vim and look up the malformed character. The key line here is It is too bad that we have such an enormous JSON; a few comments about it:
|
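As an alternative to searching for the malformed character by hand, the standard `json` module reports where parsing failed. This is a minimal sketch with a hypothetical helper name; note that it still reads the whole text into memory, so for a file of this size the same resource limits would apply.

```python
import json


def locate_json_error(text, context=20):
    """Return details about the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        return {
            "msg": err.msg,       # short description of the problem
            "pos": err.pos,       # character offset into the text
            "line": err.lineno,
            "col": err.colno,
            "snippet": text[start:err.pos + context],
        }
    return None


print(locate_json_error('{"a": 1}'))
print(locate_json_error('{"a": 1, "b": }'))
```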
I was able to download https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json and parse it as JSON |
|
@todor-ivanov, the short answer is that we do have plenty: we monitor nodes and services. But it really depends on what the service metrics are. For instance, all Python-based services under cmsweb report some stats, e.g. here is the reqmgr2 dashboard. It provides cpu, ram, fds, etc. Since you're querying rucioconmon, it is up to this service to provide its metrics. Just a few days ago we added node metrics of the k8s cluster, see here and here. Whether they are useful for your use case, I have no idea. The monitoring comes from the services: if a service provides relevant metrics, we can easily plug them in. If nobody cares about service metrics, then you have nothing. |
I am going to implement ignore-list filtering at the web server, so that the file list is filtered before being sent to the client. |
I made changes to the web server:
Examples:
|
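The ignore-list filtering mentioned above can be pictured as a prefix match on each path before the list is returned. This is only a sketch of the general idea, not the RucioConMon implementation; the function name and the example paths are hypothetical.

```python
def filter_ignored(paths, ignore_prefixes):
    """Drop every path that lives under one of the ignored directories."""
    # Normalise to a trailing slash so "/a/b" does not also match "/a/bc".
    prefixes = tuple(p.rstrip("/") + "/" for p in ignore_prefixes)
    return [p for p in paths if not p.startswith(prefixes)]


paths = [
    "/store/unmerged/area_a/file1.root",
    "/store/unmerged/area_b/file2.root",
    "/store/unmerged/area_bc/file3.root",
]
print(filter_ignored(paths, ["/store/unmerged/area_b"]))
```

Filtering on the server side keeps the ignored part of the tree out of the response, which also reduces the size of the document the client has to load.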
Thanks @ivmfnal, and indeed your change took immediate effect. Now RAL is iterated properly. Here is an excerpt of the service logs:
And here is the fresh record for the RSE at MongoDB upon the successful run. |
Just logging here something we have previously noticed, but it is now a question that needs to be answered. While looking at the logs related to
And those are both mapped to the This is a question that has been asked to the FYI @amaltaro [1]
|
@todor-ivanov have I missed the different errors at the protocol level from your message above? My naive and generic view of this error handling is the following, though: if the posix error is good enough for us to track, then we should not try to micromanage those errors. To me, it looks fair to map multiple protocol errors into the same posix error (since the posix errors are very limited as well). |
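The many-to-one mapping suggested above could be sketched as a small lookup table. The message fragments below are made up for illustration and are not the actual gfal or protocol error strings; only the `errno` constants come from the Python standard library.

```python
import errno

# Hypothetical protocol-level message fragments, several per posix code.
PROTOCOL_TO_POSIX = {
    "no such file": errno.ENOENT,
    "file not found": errno.ENOENT,
    "permission denied": errno.EACCES,
    "not authorized": errno.EACCES,
    "timed out": errno.ETIMEDOUT,
}


def classify(message):
    """Return the posix errno for a protocol error message, or None."""
    lowered = message.lower()
    for fragment, code in PROTOCOL_TO_POSIX.items():
        if fragment in lowered:
            return code
    return None


code = classify("HTTP 404: File not found")
print(code, errno.errorcode.get(code))
```

Tracking only the posix code keeps the number of error categories small, at the cost of losing the protocol-specific detail, which would remain available in the logs.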
hi @amaltaro [1]
|
I am not sure if you have... but, maybe I should have pointed them better. Here they are:
|
So far I have managed to go through all the logs since the last change of the document structure at MongoDB, and the full list of RSEs that have issues with deletions is:
I am creating the relevant GGUS tickets for them. |
And here is the list of GGUS tickets created for this [1]. I am closing the current issue here. We will follow up with the sites and the Site Support team. If any other changes are needed, another WMCore issue is to be opened. [1] |
Impact of the new feature
MSUnmerged
Is your feature request related to a problem? Please describe.
This is NEITHER a feature request NOR a bug.
After we deployed the latest version of MSUnmerged in production yesterday, we are already capable of tracking all the errors we have per RSE with a single query directly to the database. [1] The current issue is just for tracking purposes, in order to follow up on whether any of those errors are critical, so that we can announce the service as completely functional later on.
Describe the solution you'd like
Follow up on all the critical errors that we observe and make sure we do not leave an RSE behind.
Describe alternatives you've considered
No alternative.
Additional context
[1]