MSUnmerged: Track accumulated errors #10982

Closed
todor-ivanov opened this issue Feb 9, 2022 · 26 comments

Comments

@todor-ivanov
Contributor

Impact of the new feature
MSUnmerged

Is your feature request related to a problem? Please describe.
This is NOT a feature request, nor is it a bug.

After we deployed the latest version of MSUnmerged in production yesterday, we are now able to track all the errors we accumulate per RSE with a single query directly to the database [1].
The current issue is just for tracking purposes: to follow up on whether any of those errors are critical, so that we can later declare the service fully functional.

Describe the solution you'd like
Follow up on all the critical errors that we observe and make sure we do not leave any RSE behind.

Describe alternatives you've considered
No alternative.

Additional context
[1]

In [47]: filter = {}

In [48]: projection = {
    ...:     "_id": False,
    ...:     "name": True,
    ...:     "isClean": True,
    ...:     "counters": {
    ...:         "gfalErrors": True,
    ...:         "dirsToDelete": True,
    ...:         "dirsDeletedSuccess": True,
    ...:         "dirsDeletedFail": True}}

In [49]: list(msUnmergedCollProd.find(filter, projection))
Out[49]: 
[{'name': 'T1_US_FNAL_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 37,
   'dirsDeletedSuccess': 37,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_TW_NCHC',
  'isClean': False,
  'counters': {'dirsToDelete': 57,
   'dirsDeletedSuccess': 55,
   'dirsDeletedFail': 2,
   'gfalErrors': {'No such file or directory': 720}}},
 {'name': 'T2_US_Vanderbilt',
  'isClean': True,
  'counters': {'dirsToDelete': 302,
   'dirsDeletedSuccess': 302,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_PT_NCG_Lisbon',
  'isClean': True,
  'counters': {'dirsToDelete': 8,
   'dirsDeletedSuccess': 8,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 8}}},
 {'name': 'T2_UA_KIPT',
  'isClean': False,
  'counters': {'dirsToDelete': 70,
   'dirsDeletedSuccess': 69,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 279}}},
 {'name': 'T2_TR_METU',
  'isClean': True,
  'counters': {'dirsToDelete': 0,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_IT_Rome',
  'isClean': True,
  'counters': {'dirsToDelete': 111,
   'dirsDeletedSuccess': 111,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 440}}},
 {'name': 'T2_UK_SGrid_RALPP',
  'isClean': False,
  'counters': {'dirsToDelete': 569,
   'dirsDeletedSuccess': 561,
   'dirsDeletedFail': 8,
   'gfalErrors': {'No such file or directory': 4820}}},
 {'name': 'T1_UK_RAL_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 194,
   'dirsDeletedSuccess': 194,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 61}}},
 {'name': 'T2_DE_DESY',
  'isClean': True,
  'counters': {'dirsToDelete': 194,
   'dirsDeletedSuccess': 194,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1376}}},
 {'name': 'T2_UK_London_IC',
  'isClean': True,
  'counters': {'dirsToDelete': 291,
   'dirsDeletedSuccess': 291,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 3533}}},
 {'name': 'T2_DE_RWTH',
  'isClean': False,
  'counters': {'dirsToDelete': 268,
   'dirsDeletedSuccess': 264,
   'dirsDeletedFail': 4,
   'gfalErrors': {'No such file or directory': 2260}}},
 {'name': 'T2_IT_Bari',
  'isClean': False,
  'counters': {'dirsToDelete': 166,
   'dirsDeletedSuccess': 165,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 670}}},
 {'name': 'T2_RU_INR',
  'isClean': True,
  'counters': {'dirsToDelete': 31,
   'dirsDeletedSuccess': 31,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 31}}},
 {'name': 'T2_FR_GRIF_LLR',
  'isClean': False,
  'counters': {'dirsToDelete': 593,
   'dirsDeletedSuccess': 565,
   'dirsDeletedFail': 28,
   'gfalErrors': {'No such file or directory': 8054}}},
 {'name': 'T2_US_Wisconsin',
  'isClean': False,
  'counters': {'dirsToDelete': 673,
   'dirsDeletedSuccess': 666,
   'dirsDeletedFail': 7,
   'gfalErrors': {'Communication error on send': 7139}}},
 {'name': 'T2_FR_IPHC',
  'isClean': False,
  'counters': {'dirsToDelete': 241,
   'dirsDeletedSuccess': 238,
   'dirsDeletedFail': 3,
   'gfalErrors': {'No such file or directory': 1383}}},
 {'name': 'T2_RU_ITEP',
  'isClean': False,
  'counters': {'dirsToDelete': 1,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Input/output error': 3}}},
 {'name': 'T2_IN_TIFR',
  'isClean': False,
  'counters': {'dirsToDelete': 176,
   'dirsDeletedSuccess': 171,
   'dirsDeletedFail': 5,
   'gfalErrors': {'No such file or directory': 3053}}},
 {'name': 'T1_DE_KIT_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 126,
   'dirsDeletedSuccess': 126,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1639}}},
 {'name': 'T1_RU_JINR_Disk',
  'isClean': True,
  'counters': {'dirsToDelete': 10,
   'dirsDeletedSuccess': 10,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 5}}},
 {'name': 'T1_ES_PIC_Disk',
  'isClean': False,
  'counters': {'dirsToDelete': 367,
   'dirsDeletedSuccess': 366,
   'dirsDeletedFail': 1,
   'gfalErrors': {'No such file or directory': 827758}}},
 {'name': 'T2_US_Caltech',
  'isClean': True,
  'counters': {'dirsToDelete': 126,
   'dirsDeletedSuccess': 126,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1317}}},
 {'name': 'T2_US_UCSD',
  'isClean': True,
  'counters': {'dirsToDelete': 54,
   'dirsDeletedSuccess': 54,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_IT_Pisa',
  'isClean': True,
  'counters': {'dirsToDelete': 92,
   'dirsDeletedSuccess': 92,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 245}}},
 {'name': 'T2_KR_KISTI',
  'isClean': True,
  'counters': {'dirsToDelete': 64,
   'dirsDeletedSuccess': 64,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 196}}},
 {'name': 'T2_RU_JINR',
  'isClean': True,
  'counters': {'dirsToDelete': 32,
   'dirsDeletedSuccess': 32,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 85}}},
 {'name': 'T2_AT_Vienna',
  'isClean': True,
  'counters': {'dirsToDelete': 107,
   'dirsDeletedSuccess': 107,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 506}}},
 {'name': 'T2_BR_UERJ',
  'isClean': True,
  'counters': {'dirsToDelete': 167,
   'dirsDeletedSuccess': 167,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 1}}},
 {'name': 'T2_FI_HIP',
  'isClean': True,
  'counters': {'dirsToDelete': 28,
   'dirsDeletedSuccess': 28,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 76}}},
 {'name': 'T2_UK_SGrid_Bristol',
  'isClean': False,
  'counters': {'dirsToDelete': 94,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 94,
   'gfalErrors': {'Communication error on send': 282}}},
 {'name': 'T2_US_MIT',
  'isClean': True,
  'counters': {'dirsToDelete': 113,
   'dirsDeletedSuccess': 113,
   'dirsDeletedFail': 0,
   'gfalErrors': {'Communication error on send': 36438}}},
 {'name': 'T2_US_Purdue',
  'isClean': True,
  'counters': {'dirsToDelete': 18,
   'dirsDeletedSuccess': 18,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T3_US_Rutgers',
  'isClean': True,
  'counters': {'dirsToDelete': 0,
   'dirsDeletedSuccess': 0,
   'dirsDeletedFail': 0,
   'gfalErrors': {}}},
 {'name': 'T2_CN_Beijing',
  'isClean': True,
  'counters': {'dirsToDelete': 42,
   'dirsDeletedSuccess': 42,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 64}}},
 {'name': 'T2_EE_Estonia',
  'isClean': False,
  'counters': {'dirsToDelete': 185,
   'dirsDeletedSuccess': 184,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 1666}}},
 {'name': 'T2_BE_IIHE',
  'isClean': True,
  'counters': {'dirsToDelete': 141,
   'dirsDeletedSuccess': 141,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 902}}},
 {'name': 'T2_IT_Legnaro',
  'isClean': True,
  'counters': {'dirsToDelete': 310,
   'dirsDeletedSuccess': 310,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 1881}}},
 {'name': 'T2_PL_Swierk',
  'isClean': True,
  'counters': {'dirsToDelete': 36,
   'dirsDeletedSuccess': 36,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 2}}},
 {'name': 'T2_CH_CSCS',
  'isClean': True,
  'counters': {'dirsToDelete': 68,
   'dirsDeletedSuccess': 68,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 328}}},
 {'name': 'T2_BR_SPRACE',
  'isClean': False,
  'counters': {'dirsToDelete': 263,
   'dirsDeletedSuccess': 254,
   'dirsDeletedFail': 9,
   'gfalErrors': {'No such file or directory': 982}}},
 {'name': 'T2_HU_Budapest',
  'isClean': True,
  'counters': {'dirsToDelete': 85,
   'dirsDeletedSuccess': 85,
   'dirsDeletedFail': 0,
   'gfalErrors': {'No such file or directory': 465}}},
 {'name': 'T2_CH_CERN',
  'isClean': False,
  'counters': {'dirsToDelete': 245,
   'dirsDeletedSuccess': 244,
   'dirsDeletedFail': 1,
   'gfalErrors': {'Communication error on send': 701}}},
 {'name': 'T2_FR_GRIF_IRFU',
  'isClean': False,
  'counters': {'dirsToDelete': 523,
   'dirsDeletedSuccess': 499,
   'dirsDeletedFail': 24,
   'gfalErrors': {'No such file or directory': 3811}}},
 {'name': 'T2_ES_CIEMAT',
  'isClean': False,
  'counters': {'dirsToDelete': 514,
   'dirsDeletedSuccess': 512,
   'dirsDeletedFail': 2,
   'gfalErrors': {'No such file or directory': 2586}}}]
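For completeness, the same accumulated counters can be summed up per error type across all RSEs directly from the database. The snippet below is only an illustrative sketch, not part of the service code: the connection URI and database/collection names are placeholders for the handle used as msUnmergedCollProd above.

# Illustrative sketch: sum the per-RSE gfal error counters by error message.
# The Mongo URI and database/collection names below are placeholders.
from collections import Counter
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["msUnmergedDB"]["msUnmergedColl"]

totals = Counter()
for doc in coll.find({}, {"_id": False, "counters.gfalErrors": True}):
    for errMsg, count in doc.get("counters", {}).get("gfalErrors", {}).items():
        totals[errMsg] += count

for errMsg, count in totals.most_common():
    print(f"{count:>10}  {errMsg}")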
@todor-ivanov
Contributor Author

FYI @amaltaro

@amaltaro
Contributor

amaltaro commented Feb 9, 2022

Thanks for creating this issue, Todor. Do I read it right that there are only 2 problems reported:

  1. 'No such file or directory'; or
  2. 'Communication error on send'?

If so, I guess only the RSEs reporting the communication error deserve attention, right? It could be that this is a somewhat generic error from the gfal plugin, though.

@todor-ivanov
Contributor Author

Hi @amaltaro, yes, I think you are correct. I'll double-check today.

@todor-ivanov
Contributor Author

todor-ivanov commented Feb 11, 2022

The more I look into those logs, the more confident I get that we should exclude the /store/unmerged/SAM directory from the tree to be cleaned. The contents there are too volatile: files are constantly written and deleted by the SAM tests, which in general seem to be doing a good job of cleaning up after themselves, and the contents are usually small enough. On the other hand, if we keep this part of the tree we will constantly be blocked by one or two non-empty directories per site, and hence some sites will not be declared clean for quite a long time.

We are going to have a deployment in production today due to another minor configuration error I've found, so I am about to make the relevant configuration PRs for excluding this part of the tree too.

FYI @amaltaro

@todor-ivanov
Contributor Author

And here are the relevant configuration changes:

cmsweb-k8s/services_config!127

cmsweb-k8s/services_config!126

@amaltaro
Contributor

@ivmfnal Igor, don't we have this path skipped from the Rucio ConMon system?

@todor-ivanov the deployment is just the deployment of the new configuration, right?

@todor-ivanov
Contributor Author

Hi @amaltaro,

the deployment is just the deployment of the new configuration, right?

Almost... here is another minor configuration error (a typo) which we found with @muhammadimranfarooqi earlier today, and which actually triggered the need for this deployment, but this one should not have caused any trouble in the service run.

dmwm/CMSKubernetes@0b09abb

@todor-ivanov
Contributor Author

And just for logging purposes, here follows a short list of all the RSEs that are missing from MongoDB for various reasons, pasted below:

  • T1:
    T1_FR_CCIN2P3_Disk       In non-final state in Rucio Consistency Monitor
    T1_IT_CNAF_Disk          In non-final state in Rucio Consistency Monitor

  • T2T3:
    T2_BE_UCL                In non-final state in Rucio Consistency Monitor
    T2_ES_IFCA               In non-final state in Rucio Consistency Monitor
    T2_FR_CCIN2P3            Missing in stats records at Rucio Consistency Monitor
    T2_GR_Ioannina           In non-final state in Rucio Consistency Monitor
    T2_PK_NCP                In non-final state in Rucio Consistency Monitor
    T2_RU_IHEP               In non-final state in Rucio Consistency Monitor
    T2_UK_London_Brunel      In non-final state in Rucio Consistency Monitor
    T3_BG_UNI_SOFIA          Missing in stats records at Rucio Consistency Monitor
    T3_CH_CERNBOX            Missing in stats records at Rucio Consistency Monitor
    T3_CH_CERN_OpenData      Missing in stats records at Rucio Consistency Monitor
    T3_CH_PSI                Missing in stats records at Rucio Consistency Monitor
    T3_FR_IPNL               Missing in stats records at Rucio Consistency Monitor
    T3_HR_IRB                Missing in stats records at Rucio Consistency Monitor
    T3_IR_IPM                Missing in stats records at Rucio Consistency Monitor
    T3_IT_MIB                Missing in stats records at Rucio Consistency Monitor
    T3_IT_Trieste            Missing in stats records at Rucio Consistency Monitor
    T3_KR_KISTI              Missing in stats records at Rucio Consistency Monitor
    T3_KR_KNU                Missing in stats records at Rucio Consistency Monitor
    T3_KR_UOS                Missing in stats records at Rucio Consistency Monitor
    T3_MX_Cinvestav          Missing in stats records at Rucio Consistency Monitor
    T3_TW_NCU                Missing in stats records at Rucio Consistency Monitor
    T3_TW_NTU_HEP            Missing in stats records at Rucio Consistency Monitor

  • T2T3_US:
    T2_US_Caltech_Ceph       Skipped due to a restriction set in msConfig
    T2_US_Florida            In non-final state in Rucio Consistency Monitor
    T2_US_Nebraska           In non-final state in Rucio Consistency Monitor
    T3_US_Baylor             Missing in stats records at Rucio Consistency Monitor
    T3_US_Brown              Missing in stats records at Rucio Consistency Monitor
    T3_US_CMU                Missing in stats records at Rucio Consistency Monitor
    T3_US_Colorado           Missing in stats records at Rucio Consistency Monitor
    T3_US_FNALLPC            Missing in stats records at Rucio Consistency Monitor
    T3_US_MIT                Missing in stats records at Rucio Consistency Monitor
    T3_US_NERSC              Missing in stats records at Rucio Consistency Monitor
    T3_US_NotreDame          Missing in stats records at Rucio Consistency Monitor
    T3_US_OSU                Missing in stats records at Rucio Consistency Monitor
    T3_US_Princeton_ICSE     Missing in stats records at Rucio Consistency Monitor
    T3_US_PuertoRico         Missing in stats records at Rucio Consistency Monitor
    T3_US_Rice               Missing in stats records at Rucio Consistency Monitor
    T3_US_TAMU               Missing in stats records at Rucio Consistency Monitor
    T3_US_UMD                Missing in stats records at Rucio Consistency Monitor
    T3_US_UMiss              Missing in stats records at Rucio Consistency Monitor

Hi @ivmfnal, could you check whether this information I am extracting from our database and logs is correct? Sorry if such a request causes additional trouble.

@amaltaro in order to fetch those errors I had to do several set operations to figure out which RSEs are actually missing from the database, and on top of that I had to parse all those long logs in order to find the reasons. This is because all of those RSEs exit the pipeline by raising an MSUnmergedException quite early [1], and we never considered putting a record in the database for such cases. Given how difficult and time consuming it was to track all that, I'd say we had better record this as an actual error in the RSE object and preserve it in the database. I may create yet another bugfix issue for that too. What do you think?

[1]

msg = "RSE: %s In non-final state in Rucio Consistency Monitor. " % rseName

@ivmfnal

ivmfnal commented Feb 11, 2022 via email

@todor-ivanov
Contributor Author

Thank you @ivmfnal. We will track the information provided there too. I am also creating another issue to address the missing records in our database, and I will put the link in a docstring inside the code so that the sites in error can be linked from both systems.

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 10, 2022

With the latest version of MSUnmerged deployed in production, we can easily check the result of the above query to MongoDB through the info=ALL_DOCS REST API of the service. The link is here.
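For reference, a single RSE document can also be fetched from the service REST API. The sketch below follows the info endpoint quoted further down in this thread; it is illustrative only, and the certificate/key paths are placeholders for a valid grid credential.

# Illustrative sketch: fetch one RSE document from the MSUnmerged REST API.
# Certificate/key paths are placeholders; 'verify' can point to the CERN CA bundle.
import requests

resp = requests.get("https://cmsweb.cern.ch/ms-unmerged/data/info",
                    params={"rse": "T1_UK_RAL_Disk", "detail": "False"},
                    cert=("/path/to/usercert.pem", "/path/to/userkey.pem"),
                    verify=False)
resp.raise_for_status()
print(resp.json()["result"][0]["rseData"][0]["counters"])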

From what I can see in the current run there are gfal errors mostly of the following two types:

  • Communication error on send:
    • T1_UK_RAL_Disk
    • T2_IN_TIFR
    • T2_CH_CERN
    • T2_US_Nebraska
  • No such file or directory:
    • All RSEs which have files or directories that vanished from /store/unmerged since the last RucioConMon run.

Three out of those four RSEs above are actually marked as clean, because the number of directories found to be purged equals the number of directories which either have been successfully deleted or were already gone by the time MSUnmerged retried the RSE.

The fourth one, T1_UK_RAL_Disk, is a different story, which I explain below in the next comment.

@todor-ivanov
Contributor Author

The error we get for T1_UK_RAL_Disk is [1]. This is basically an error triggered while trying to decode the JSON document fetched from RucioConMon for the RSE. I did try to fetch the document by hand, and it is enormous:

wget "https://cmsweb.cern.ch/rucioconmon/WM/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
...
-rw-r--r--  1   827M Mar 10 15:13  T1_UK_RAL_Disk_wm_file_list.json

I did try to run the service in standalone mode in my VM, but every attempt to load this document into memory is killed by the kernel (most probably because of VM resource limitations, but anyway):

Mar 10 13:58:18 tivanov-unit02 kernel: Out of memory: Kill process 27141 (ipython) score 802 or sacrifice child
Mar 10 13:58:18 tivanov-unit02 kernel: Killed process 27141 (ipython), UID 71437, total-vm:4273580kB, anon-rss:2826508kB, file-rss:0kB, shmem-rss:0kB
Mar 10 14:00:02 tivanov-unit02 systemd: Created slice User Slice of root.

Checking the status of the RSE at RucioConMon, it seems like the record is completely broken. Even though it says here that the RSE is in status done, there is no additional stat info other than the number of files.

So:

  • Here is what the statistics for this RSE look like, and
  • here is what the statistics for a working T1 look like.

And this is, kind of, not unexpected, because RAL has (if I am not wrong) a customized setup, which may somehow be affecting the consistency monitor run. @ivmfnal do you know any more details on that? And should we even try to run the service on that RSE, or should we skip it completely? Maybe @KatyEllis can shed some light here too.

[1]

2022-03-10 10:15:37,127:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk Reading rse data from MongoDB.
2022-03-10 10:15:37,131:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk With old consistency record in Rucio Consistency Monitor. But the RSE has NOT been fully cleaned during the last Rucio Consistency Monitor polling cycle.Retrying cleanup in the current run.
2022-03-10 09:15:09,786:DEBUG:RucioConMon: Fetching data from WM/files?rse=T1_UK_RAL_Disk&format=json, with args None
2022-03-10 09:15:09,786:DEBUG:Service: Data is from the Service cache
2022-03-10 09:15:10,673:ERROR:MSUnmerged: plineUnmerged: General error from pipeline. RSE: T1_UK_RAL_Disk. Error: Unterminated string starting at: line 1 column 168804222 (char 168804221) Will retry again in the next cycle.
Traceback (most recent call last):
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 269, in _execute
    pline.run(MSUnmergedRSE(rseName))
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in run
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in <lambda>
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in __call__
    return self.run(obj)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/Utils/Pipeline.py", line 75, in run
    return self.func(obj, *self.args, **self.kwargs)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 550, in getUnmergedFiles
    rse['files']['allUnmerged'] = self.rucioConMon.getRSEUnmerged(rse['name'])
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 103, in getRSEUnmerged
    rseUnmerged = self._getResult(uri, callname=rseName)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
    results = json.loads(results)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)

@vkuznet
Contributor

vkuznet commented Mar 10, 2022

@todor-ivanov JSON is a text file. Even if it is 800MB you can still open it with vim and look up the malformed character. The key line here is line 1 column 168804222, so in vim you just need to jump to column 168804222 and see the offending character. The trick is to use <number>|; in your case it would be 168804222|, and the vim cursor will jump to that column.
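Equivalently, a couple of lines of Python can seek to the reported offset and print the surrounding bytes without parsing (or even fully loading) the document; the file name below is just the local copy from the wget above.

# Peek at the bytes around the offset reported by json (char 168804221),
# without loading or parsing the whole 800+ MB document.
offset = 168804221
with open("T1_UK_RAL_Disk_wm_file_list.json", "rb") as fd:
    fd.seek(max(0, offset - 80))
    print(fd.read(160))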

It is too bad that we have such an enormous JSON; a few comments about it:

  • the JSON data format was not designed for large data structures
  • a large JSON leads to large RAM usage, usually 2-3 times the size of the JSON itself
  • unformatted JSON (like this one, on a single line) is hard to read since you can't scroll it with more/less tools
  • sending such a huge JSON via HTTP is a path to blocking the entire service; we saw DBS memory usage of 10GB and you will likely hit this too if you use such large JSONs

@ivmfnal

ivmfnal commented Mar 10, 2022

@ivmfnal

ivmfnal commented Mar 10, 2022

$ curl -k -o file_list.json --cert ~/certs/CERN_crt.pem --key ~/certs/CERN_key.pem "https://cmsweb-prod.cern.ch/rucioconmon/unmerged/files/T1_UK_RAL_Disk_wm_file_list.json?rse=T1_UK_RAL_Disk&format=json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  826M    0  826M    0     0  10.5M      0 --:--:--  0:01:18 --:--:-- 12.8M

$ python -m json.tool file_list.json | wc -l
 3790380

@todor-ivanov
Contributor Author

Thanks @vkuznet and @ivmfnal. I will continue checking whether there are any issues/limitations on our side.
Valentin, do we have any statistics about resource consumption on the K8s clusters?

@vkuznet
Contributor

vkuznet commented Mar 10, 2022

@todor-ivanov, the short answer is we do have plenty; we monitor both nodes and services. But it really depends on what the service metrics are. For instance, all Python-based services under cmsweb report some stats, e.g. here is the reqmgr2 dashboard. It provides CPU, RAM, fds, etc. Since you're querying rucioconmon, it is up to that service to provide its metrics.

Just a few days ago we added node metrics for the k8s cluster, see here and here. Whether they are useful for your use case, I have no idea. The monitoring comes from the services: if a service provides relevant metrics we can easily plug them in; if nobody cares about service metrics then you have nothing.

@ivmfnal

ivmfnal commented Mar 11, 2022

I am going to implement ignore-list filtering at the web server, so that the file list is filtered before being sent to the client.

@ivmfnal

ivmfnal commented Mar 11, 2022

I made changes to the web server:

  • Now the server can filter out some unwanted files from the output of a URL like https://cmsweb.cern.ch/rucioconmon/WM/files/T1_UK_RAL_Disk_wm_file_list.txt?rse=T1_UK_RAL_Disk&format=json. For now, by default it filters out /store/unmerged/logs/*.
  • The default can be overridden by URL parameters:
    • include=<path>[,<path>] - only files under specified paths will be included in the resulting list
    • exclude=<path>[,<path>] - files under the specified paths will be excluded from the list
    • if both include and exclude are present, the list will include only those paths matched by include minus those matched by exclude
  • format values:
    • raw (old) - gzip-compressed original list from the scanner, without any filters described above applied
    • text (new) - plain text, paths are separated with newlines
    • json (old) - JSON list of paths
    • zip-stream (new) - zip stream (see https://docs.python.org/3/library/zlib.html#module-zlib)

Examples:

https://...?rse=T1_UK_RAL_Disk&format=json - exclude /store/unmerged/logs/* by default
https://...?rse=T1_UK_RAL_Disk&format=json&include=/store/unmerged&exclude=/store/unmerged/SAM
https://...?rse=T1_UK_RAL_Disk&format=json&exclude=/store/unmerged/SAM/,/store/unmerged/data/logs/prod/

@todor-ivanov
Contributor Author

Thanks @ivmfnal.

And indeed your change took immediate effect. Now RAL is iterated properly. Here is an excerpt of the service logs:

2022-03-11 18:20:10,338:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: B
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]
2022-03-11 18:20:10,338:INFO:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL17RECO/TTJets_HT-1200to2500_TuneCP5_1
2022-03-11 18:20:10,338:DEBUG:MSUnmerged: Purging dirEntry: gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:/store/unmerged/RunIISummer20UL17

2022-03-11 18:20:10,339:DEBUG:MSUnmerged:   -> [gfal_gridftp_statG]
2022-03-11 18:20:10,339:DEBUG:MSUnmerged:  -> [GridFTPModule::stat] 
2022-03-11 18:20:10,339:DEBUG:MSUnmerged:  -> [Gridftp_stat_module::globus_gass_stat] 
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using certificate /etc/proxy/proxy
2022-03-11 18:20:10,339:DEBUG:MSUnmerged: GSIFTP using private key /etc/proxy/proxy
2022-03-11 18:20:10,340:DEBUG:MSUnmerged: gridftp session for: gsiftp://gridftp.echo.stfc.ac.uk:2811 found in  cache !

And here is the fresh record for the RSE in MongoDB after the successful run.
The RSE is still not clean, but we are at least able to debug now. I do see plenty of gfal-related errors like the one above.

@todor-ivanov
Contributor Author

Just logging here something we have previously noticed, but which is now a question that needs to be answered.

While looking at the logs related to T1_UK_RAL_Disk, we noticed that more than one (in this case two) different errors at the protocol level, with two different error messages but the same error code returned by gfal, have been mapped to the same posix error:

2022-03-11 18:17:17,450:DEBUG:MSUnmerged: RSE: T1_UK_RAL_Disk, Dir: /store/unmerged/RunIISummer20UL16DIGIPremixAPV/BsToMuMuPhi_PhiToKK_BMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/GE
N-SIM-DIGI/106X_mcRun2_asymptotic_preVFP_v8-v1, delResult: [GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] an end-of-file was reached globus_xio: An end of file occurred ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),
 GError('[gfal_gridftp_unlinkG][gfal_gridftp_unlinkG] globus_xio: System error in send: Broken pipe globus_xio: A system call failed: Broken pipe ', 70),

Both of those are mapped to the posix error 70: ECOMM: Communication error on send, and reported as such in MongoDB by our service [1].
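The numeric mapping itself is simply the standard errno table; on Linux, error code 70 is ECOMM, which can be verified quickly:

# Quick check of the posix mapping (Linux): gfal returns code 70,
# which is errno.ECOMM, i.e. "Communication error on send".
import errno
import os

print(errno.ECOMM)               # 70
print(os.strerror(errno.ECOMM))  # 'Communication error on send'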

This question has been asked of the gfal developers here, but we still have no answer.

FYI @amaltaro

[1]
https://cmsweb.cern.ch/ms-unmerged/data/info?rse=T1_UK_RAL_Disk&detail=False

{"result": [
 {
  "wmcore_version": "2.0.1.pre6",
  "microservice_version": "2.0.1.pre6",
  "microservice": "MSManager",
  "query": "rse=T1_UK_RAL_Disk&detail=False",
  "rseData": [
    {
      "name": "T1_UK_RAL_Disk",
      "pfnPrefix": "gsiftp://gridftp.echo.stfc.ac.uk:2811/cms:",
      "rucioConMonStatus": "done",
      "isClean": true,
      "timestamps": {
        "rseConsStatTime": 1646451101,
        "prevStartTime": 1647322438.8978102,
        "startTime": 1647326040.5946178,
        "prevEndTime": 1647322438.8980052,
        "endTime": 1647326040.5947526
      },
      "counters": {
        "totalNumFiles": 20348,
        "totalNumDirs": 1052,
        "dirsToDelete": 235,
        "filesToDelete": 0,
        "filesDeletedSuccess": 5551,
        "filesDeletedFail": 0,
        "dirsDeletedSuccess": 235,
        "dirsDeletedFail": 2,
        "gfalErrors": {
          "Communication error on send": 408
        }
      }
    }
  ]
}]}

@amaltaro
Contributor

@todor-ivanov have I missed the different errors at the protocol level in your message above?

My naive and generic view of this error handling is the following though: if the posix error is good enough for us to track
a) problems in MSUnmerged, and
b) problems at the storage level,

then we should not try to micromanage those errors. To me, it looks fair to map multiple protocol errors onto the same posix error (since the posix errors are quite limited as well).

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 15, 2022

Hi @amaltaro,
In general this should be the case, yes, and that's exactly why I believed this to be the correct mapping in the first place. We should only watch that permissions/authorisation-related errors from the protocol level are not masked inside the Communication error on send posix error. Here [1] is an example. (This particular error is actually triggered because of a broken URI construction for the gsiftp protocol, but we still need to pay attention to those.)

[1]

$ gfal-ls gsiftp://gridftp.echo.stfc.ac.uk:2811/cms/store/unmerged/RunIISummer20UL18DIGIPremix/DYJetsToEE_M-100To160_VBFFilter_TuneCP5_13TeV-amcatnloFXFX-pythia8/GEN-SIM-DIGI/106X_upgrade2018_realistic_v11_L1v1-v1
gfal-ls error: 70 (Communication error on send) - globus_ftp_client: the server responded with an error 500 Command failed : globus_l_gfs_ceph_stat: authorization error: 'MLST' operation not allowed
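One way to avoid losing such cases would be to look at the GError message text before collapsing everything into the posix counter. The sketch below is only an idea, not existing MSUnmerged code, and the keyword list is a rough guess:

# Hypothetical sketch: flag authorization-related gfal failures separately,
# even when the numeric code maps to ECOMM (70). Keyword list is a rough guess.
AUTH_MARKERS = ("authorization error", "permission denied", "credential")

def classifyGfalError(gError):
    """Return a coarse category for a gfal GError-like object."""
    msg = str(gError).lower()
    if any(marker in msg for marker in AUTH_MARKERS):
        return "authorization"
    if getattr(gError, "code", None) == 70:
        return "communication"
    return "other"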

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 15, 2022

have I missed the different errors at the protocol level from your message above

I am not sure if you have... but maybe I should have pointed them out better. Here they are:

  • Error message: ...globus_xio: An end of file occurred, Error code: 70 = Posix error: Communication error on send
  • Error message: ...globus_xio: A system call failed: Broken pipe, Error code: 70 = Posix error: Communication error on send

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 25, 2022

So far I have managed to go through all the logs since the last change of the document structure in MongoDB, and the full list of RSEs having issues with deletions is:

T2_UK_London_Brunel
T2_BE_UCL

I am creating the relevant GGUS tickets for them.

@todor-ivanov
Contributor Author

And here is the list of GGUS tickets created for this [1]. I am closing the current issue here. We will follow up with the sites and the Site Support team. If any other changes are needed, another WMCore issue will be opened.

[1]
https://ggus.eu/?mode=ticket_info&ticket_id=156526
https://ggus.eu/?mode=ticket_info&ticket_id=156527
