MSUnmerged: Implement Zipped I/O streams #11036

Closed
todor-ivanov opened this issue Mar 11, 2022 · 20 comments · Fixed by #11142


@todor-ivanov
Contributor

todor-ivanov commented Mar 11, 2022

Impact of the new feature
MSUnmerged

Is your feature request related to a problem? Please describe.
In the current service construction we already have several APIs from external services to use. One on which we strongly depend is the RucioConsistencyMonitor, used for fetching unmerged file lists and RSE statistics. The lists provided by RucioConMon can grow significantly. A typical example is the case discussed in those few comments of this issue. RucioConMon provides two APIs for downloading the list contents:

So far we are only using the .json interface, but working with single-line JSON files of sizes on the order of gigabytes is quite difficult, resource-consuming (in terms of memory and network), error-prone, etc. Currently we do not benefit from compression during download, and we do not have any zipped I/O streams implemented either that would let us properly load the contents into memory.

The feature request in this issue is to implement zipped I/O streams in order to alleviate the impact of the huge list sizes and avoid hitting the Kubernetes pods' resource limits in the future.

NOTE:
Since these are text archives, the compression ratio is high. We should expect a significant difference between the download size and the in-memory size if we happen to load the whole contents into memory, even though we should plan to work in chunks of data all the time. Python provides well-suited libraries to help with this, such as shutil.copyfileobj(). Just as a note, here is a non-exhaustive list of modules to be considered during this development: io, zipfile, shutil, etc.
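
As a minimal sketch of the chunked approach (illustrative names, not existing WMCore code), assuming the gzipped dump has already been downloaded to disk:

    import gzip
    import shutil

    def unpackDump(archivePath, outputPath, chunkSize=1024 * 1024):
        """Stream-decompress a gzipped dump to disk, one chunk at a time."""
        with gzip.open(archivePath, "rb") as compressed, \
                open(outputPath, "wb") as plain:
            # copyfileobj reads and writes chunkSize bytes at a time, so
            # memory usage stays flat regardless of the archive size.
            shutil.copyfileobj(compressed, plain, chunkSize)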

Describe the solution you'd like
Implement the methods needed for downloading the compressed file to disk, as provided by the RucioConMon API &format=raw. Then use the proper libraries to implement a zipped I/O stream for loading the contents into memory in chunks of configurable size.

There are already two existing placeholders for these methods to be implemented:
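
For illustration only, a rough sketch of how these two methods could look, assuming a plain urllib download (the real code would go through the pycurl-based manager) and purely illustrative names:

    import gzip
    import shutil
    import urllib.request

    def downloadRseDump(url, filePath, chunkSize=1024 * 1024):
        """Hypothetical: stream the &format=raw payload to disk in chunks."""
        with urllib.request.urlopen(url) as resp, open(filePath, "wb") as fd:
            shutil.copyfileobj(resp, fd, chunkSize)

    def readRseDump(filePath):
        """Hypothetical: yield unmerged LFNs one by one from the gzipped dump."""
        with gzip.open(filePath, "rt") as fd:
            for line in fd:
                yield line.rstrip("\n")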

Describe alternatives you've considered
Continue using the current JSON format, and hit resource limitations from time to time.

Additional context
None

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

FYI @vkuznet @amaltaro @klannon @ivmfnal

I was quite reluctant to create this issue, because we are struggling to complete this service development, but I cannot stop the tide. If we continue seeing errors related to those huge JSON files we are working with, this will become an important feature to implement. I am setting this with an Enhancement label and putting it in the Q2 developments.

@amaltaro
Contributor

Thanks for creating this issue, Todor. However, I see this more as a redesign of the service than an enhancement, since it's not going to affect only how we load data into memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on.

Perhaps the best option to avoid this major effort right now would be to work on an isolated instance of the service (I guess you have it fully operational in your virtual environment, right?) and try to delete as many no-longer-needed files as we can, such that the next Rucio ConMon unmerged storage dump comes with a much smaller number of files. For instance, we could split that 800MB JSON dump into 4 or 5 pieces and load it from disk instead of making the Rucio ConMon get call, where each cycle of the service would fetch a different piece.
OR, if you write the script that you were considering, then we could use that instead of the private MSUnmerged instance itself. I would vote for whatever takes less effort here, of course, provided the risks of making a mistake are close to 0.

@ivmfnal

ivmfnal commented Mar 11, 2022

To clarify, format=raw is in fact gzipped, not zipped. The difference is that the gzip format works in terms of files, while zip is good for stream compression. I think we should leave the "raw" format as is (a gzipped line-by-line list of paths) and add a third option, format=zip, which would be a zipped stream (see https://docs.python.org/3/library/zlib.html#module-zlib).
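
For reference, a sketch of what incremental decompression with the zlib module could look like (a hypothetical helper, not ConMon code); the wbits value selects between gzip and raw zlib framing:

    import zlib

    def iterDecompressed(chunks, wbits=zlib.MAX_WBITS | 16):
        """Decompress an iterable of byte chunks incrementally.

        wbits = MAX_WBITS | 16 expects a gzip header; use plain
        MAX_WBITS for a zlib-framed stream.
        """
        decomp = zlib.decompressobj(wbits)
        for chunk in chunks:
            yield decomp.decompress(chunk)
        yield decomp.flush()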

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

Thanks for taking a look @amaltaro !

However, I see this more as a redesign of the service than an enhancement, since it's not going to affect only how we load data into memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on.

Actually, I was not seeing it as a major change. Absolutely nothing should be changed in the MSUnmerged.py file itself at this stage. All the changes need to be enclosed in the Service/RucioConMon.py file. We should just implement the zipped I/O stream so that we can download the zipped version of the file list instead of the .json one, and then feed it back to MSUnmerged the normal way. There is only one entry point at the micro service itself, which is here.

@todor-ivanov
Contributor Author

Thanks @ivmfnal

To clarify, format=raw is in fact gzipped, not zipped.

That is an important clarification indeed. I failed to catch that subtle detail, but it must be taken into account when we dive deeper into the coding here.

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

And just to be even more precise about what I mean when I say "The whole change needs to be enclosed in the Service/RucioConMon.py":

For this Error, the disaster begins not when we try to load the file into memory for the MSUnmerged object to deal with the data, but rather when the Service/RucioConMon.py module tries to decode the JSON file:

 File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
    results = json.loads(results)
...
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)

This is basically a sign of a different issue. To me it looks more like a broken session and an incompletely downloaded JSON file, because both @ivmfnal and I did manage to download the JSON file properly and decode it on private machines. The cache size for the Service module could also play a role here... and many other things.
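
If we do move to the gzipped &format=raw download, one cheap way to detect such truncation (an assumption on my side, not existing code) is that gzip stores a CRC and length in its trailer, so reading a truncated file through to the end raises an error:

    import gzip
    import zlib

    def isCompleteGzip(path, chunkSize=1024 * 1024):
        """Return True if the gzipped file reads through to its end-of-stream marker."""
        try:
            with gzip.open(path, "rb") as fd:
                while fd.read(chunkSize):
                    pass
            return True
        except (EOFError, OSError, zlib.error):
            return False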

@ivmfnal

ivmfnal commented Mar 11, 2022

I wonder if there is some timeout on the client side which causes premature download termination.

@todor-ivanov
Contributor Author

yet another possibility indeed

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

Somehow I missed the announcement of this issue, but it is not too late for me to make a few remarks:

  • please stop reinventing the wheel with extra positional arguments; this is a bad approach and goes against the HTTP standard, and will only complicate things. As I pointed out in Define gzip encoding by default in the headers; properly decompress it #11041, the proper way is to set up the Content-Encoding and Accept-Encoding HTTP headers, which can deal with different encoding formats. This should be done instead of using custom optional arguments like the proposed format=zip. Instead, we should only enhance the underlying library (pycurl manager) to support different encodings (see the sketch after this list).
  • I suggest adding to RucioConMon.py parts similar to Define gzip encoding by default in the headers; properly decompress it #11041 which will use different encodings. Then, based on the requested encoding, the code will properly wrap the I/O stream.
  • if you want to generalize encoding, we may introduce a dedicated module for it, which will contain gzip, zip, lzma, etc. Please refer to the HTTP standard, e.g. here, or here, etc.
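
A minimal sketch of the header-based approach with raw pycurl (the actual change would live in pycurl_manager; the function name is illustrative):

    import io
    import pycurl

    def fetchGzipped(url):
        """Ask the server for gzip; libcurl decodes the body transparently."""
        buf = io.BytesIO()
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url)
        # ENCODING (CURLOPT_ACCEPT_ENCODING) sets the Accept-Encoding header
        # and enables automatic decompression of the response body.
        curl.setopt(pycurl.ENCODING, "gzip")
        curl.setopt(pycurl.WRITEDATA, buf)
        curl.perform()
        curl.close()
        return buf.getvalue()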

@klannon

klannon commented Mar 22, 2022

Quick comments:

  • I am all for following the HTTP standard, so let's please do that if possible.
  • I think that one compression algorithm is enough for now. Let's try not to let this grow in scope.
  • If changes are needed on the Rucio side, can someone please open an issue there so that they can respond accordingly?

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

The reason why using an optional argument is the wrong approach is the following:

  • if you use format=raw or format=zip then you still need Content-Encoding, which you can't set to application/json since that would not be correct
  • therefore, if you use format=zip then you must use Content-Encoding: application/zip, which refers to a zip archive, but in this case you have no way to inform the code about your actual data format (JSON). Maybe the zip archive contains txt files, or maybe JSON. In this case there is no way to tell in the code, and if you assume one, your code will not be transparent to others
  • this will require additional checks in the code, like:
    • check for the format option
    • manage different format options
    • check whether the underlying data from the unzip procedure satisfies a certain data format
      etc.

@ivmfnal

ivmfnal commented Mar 22, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

@ivmfnal, I'm not sure I understand the implemented choices. My questions are:

  • does your service produce data in different data formats, JSON vs text vs CSV?
  • if so, what is the default format?
  • the Content-Type should correspond to the data format you're sending, e.g. JSON
  • the Content-Encoding should correspond to the encoding applied to your data, e.g. gzip

In particular, it is unclear what format=raw means; from your description it can be a gzipped file (of which format?) or an uncompressed text stream. How is it possible that a single option provides different data formats?

What should be done is the following:

  • the client provides your server two HTTP headers
  • Accept-Encoding: gzip (or any other encoding), which your server should use to encode the data
  • Accept-Type: application/json (or any other format), which your server should acknowledge by providing that data format.
    That's it: the client asks for data in a specific format and encoding. Then your server either acknowledges it or not. If it acknowledges the request, it will set Content-Type: application/json (or another data format) and Content-Encoding: gzip (or another encoding) such that the client will know how to deal with the provided data. If your server can't accept the requested headers, it will provide data in whatever default format it uses and set the proper HTTP headers. (A sketch of this flow is given below.)
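
As an illustration of this flow with the well-known requests library (the URL is a placeholder and this is not the actual WMCore client), using the standard Accept and Accept-Encoding request headers:

    import requests

    resp = requests.get(
        "https://example.cern.ch/rucioconmon/dump",  # placeholder URL
        headers={"Accept": "application/json", "Accept-Encoding": "gzip"},
    )
    # The response headers tell the client what the server actually sent;
    # requests decompresses a gzipped body transparently.
    print(resp.headers.get("Content-Type"))
    print(resp.headers.get("Content-Encoding"))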

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 23, 2022

Thank you all, @klannon @vkuznet @ivmfnal, for the really good points you make here.

I am all for following the HTTP standard, so let's please do that if possible.

I believe we are all on the same page here. The only thing needed is to agree on the set of combinations between HTTP header types. I need to mention one thing though, @vkuznet: what you suggest here:

That's it, the client asks for data in specific format and encoding. Then, your server either acknowledge it or not. If it is acknowledge it, then it will set Content-Type: application/json (or other data format) and Content-Encoding: gzip (or other encoding) such that client will know how to deal with provided data. If your server can't accept requested headers it will provide data in whatever default format it use to and provide proper HTTP headers.

is really a shortcut into a broader area, which is the content negotiation mechanism. I do agree with your suggestion. But just to stay on the safe side and not deviate from best practices, let's note two things:

  • Setting just a single Accept-Encoding header would give the server exact guidance on the choice of compression for the response. But doing so would require the server to stick to only this mechanism for choosing the proper response for all clients, not only us, and to ignore cases where the client supports multiple encodings. As @amaltaro already properly mentioned here, we do set multiple encoding mechanisms in our service's requests here. I'd say this would be an incomplete implementation of the standard.
  • If we need to be really precise in what we are doing, an intermediate step is needed: in the cases where multiple encodings are supported by the client, the server side needs to respond with an HTTP 300 (Multiple Choices) status and a list of content representations (content type, encodings, etc.) together with the URI associated with each of them, for the client to choose from in the next step. So we are basically talking mostly about Agent-Driven Negotiation, as explained here. Hence the set of different URLs already developed on the RucioConMon side is not a waste of time, and I'd say they won't go in vain.

P.S. - One more thing worth adding here too.
The WMCore caches actually identify the cache to be used uniquely based mostly on the URL, not including any header information. This comes from the way we build the name for the cache (a sketch of the idea is given below). This means that if we always use the same URL for all returned encodings, we may get unexpected behavior when the contents returned by previous calls to the server were for a call using a different encoding type, if we do not have a way to set a preferred encoding among the multiple encodings supported in the client's request. Because the cache is static content with an associated lifetime and nothing more, it will not follow the negotiation mechanism that it would with the server. Of course we can use the HTTP header already set in the reply stored in the cache, but we still need a mechanism to signal to the top-level method (in this case that would be _getResultsZipped) whether the contents of this cache are what it expects. Yet another point in favor of preserving the different URIs associated with the different encoding types, which would strongly facilitate all of this. But I'd let @amaltaro correct me in case I am wrong here.
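
To illustrate the concern, a hypothetical sketch of a URL-keyed cache name (the actual WMCore naming logic lives in the Service layer and may differ):

    import hashlib

    def cacheFileName(url):
        # Hypothetical: since only the URL feeds the key, two calls that
        # differ solely in their Accept-Encoding header would collide on
        # the same cache file.
        return "rucioconmon_%s.cache" % hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]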

@ivmfnal

ivmfnal commented Mar 23, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 23, 2022

@ivmfnal, how do you define a general-purpose server? In other words, any server which relies on the HTTP protocol is a general-purpose server, regardless of what it is doing.

@ivmfnal

ivmfnal commented Mar 23, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 23, 2022

@todor-ivanov, your first bullet item does not require clients to stick to a single encoding. A client can choose one or another; therefore all of them have the freedom to request a specific encoding. The limitation comes from the server having to support them. Now, your pointer to Requests does not mean that the HTTP server will only be accessed through this class. We may change clients, or even move away from Python to access the HTTP server. As such, the server implementation should be independent of the choice of client and its settings.

For your second bullet item, I don't know the use-case here. Why do you need multiple encodings and multiple data formats to be supported in clients? I don't really understand their purpose.

Please outline use-cases here. For example:

  • do you need to support multiple data formats? If so, please justify every single one and give a specific reason why you need it
  • do you need to support multiple encodings? If so, please justify every single one with a specific use-case

To me, this discussion is useless without concrete use-cases. In DMWM land we use JSON everywhere as the data format, which can be either gzipped or zipped. So we need one data format and one encoding to compress the data. Unless I am missing something, I don't understand the point of this discussion and of multiple implementations for different formats/encodings.

@todor-ivanov
Contributor Author

Thanks @vkuznet

I myself do not want to go deep into general discussions here. As I already mentioned:

I do agree on your suggestion.

I was just making a note about a few cases we will not be able to cover, where we will not be following all best practices by the book, but that's fine here. I am not advocating that we should waste time implementing all of this right now. I still stand behind this:

The only thing that is needed is to agree on the set of combinations between HTTP header types.

Once we merge your PR for the zipped I/O in pycurl_manager, I will immediately rebase on top of it and use it here.

@todor-ivanov
Contributor Author

Writing this here as a reminder: there is a change in RucioConMon that needs to be reverted once we are done with the current issue.

Last time, when we were having trouble with the large JSON file for RAL, we requested a change in RucioConMon: to filter out the /store/unmerged/logs/ branch. The change was done at their frontend, and it also provides the ability to filter some parts of the tree through the API calls [1], which is indeed good, but the filter we requested above is hardwired until we fix the issue of implementing the zipped I/O streams. Once we are done, we must request that this filter be dropped from RucioConMon and rely on our own mechanism to filter the input list. This way we will continue cleaning the /store/unmerged/logs branch as before. If at some point we decide to take advantage of the API extensions Igor has provided, that would be another story, to be tracked in a separate issue.

[1]
#10982 (comment)
