MSUnmerged: Implement Zipped I/O streams #11036

Closed
todor-ivanov opened this issue Mar 11, 2022 · 20 comments · Fixed by #11142


@todor-ivanov
Contributor

todor-ivanov commented Mar 11, 2022

Impact of the new feature
MSUnmerged

Is your feature request related to a problem? Please describe.
In the current service construction we already have several APIs from external services to use. One on which we strongly depend is the RucioConsistencyMonitor, used for fetching unmerged file lists and RSE statistics. The lists provided by RucioConMon can grow significantly. A typical example is the case discussed in those few comments of this issue. RucioConMon provides two APIs for downloading the list contents:

So far we are only using the .json interface, but working with single-line JSON files of sizes on the order of gigabytes is quite difficult, resource-consuming (in terms of memory and network), error-prone, etc. Currently we do not benefit from compression during download, and we do not have any zipped I/O streams implemented either that would let us properly load the contents into memory.

The feature request in this issue is to implement zipped I/O streams in order to alleviate the impact of the huge list sizes and avoid hitting the Kubernetes pods' resource limits in the future.

NOTE:
Since these are text archives, the compression ratio is high. We should expect a significant difference between the download size and the in-memory size if we happen to load the whole contents into memory, even though we should plan to work in chunks of data all the time. Python provides well-suited libraries to help with this, such as shutil.copyfileobj(). Just as a note, here is a non-exhaustive list of modules to be considered during this development: io, zipfile, shutil, etc.
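
As a minimal sketch of the chunked approach (illustrative names, not existing WMCore code), assuming the gzipped dump has already been downloaded to disk:

    import gzip
    import shutil

    def unpackDump(archivePath, outputPath, chunkSize=1024 * 1024):
        """Stream-decompress a gzipped dump to disk, one chunk at a time."""
        with gzip.open(archivePath, "rb") as compressed, \
                open(outputPath, "wb") as plain:
            # copyfileobj reads and writes chunkSize bytes at a time, so
            # memory usage stays flat regardless of the archive size.
            shutil.copyfileobj(compressed, plain, chunkSize)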

Describe the solution you'd like
Implement the methods needed for downloading the compressed file to disk, as provided by the RucioConMon API &format=raw. Then use the proper libraries to implement a zipped I/O stream for loading the contents into memory in chunks of configurable size.

There are already two existing placeholders for these methods to be implemented:
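
For illustration only, a rough sketch of how these two methods could look, assuming a plain urllib download (the real code would go through the pycurl-based manager) and purely illustrative names:

    import gzip
    import shutil
    import urllib.request

    def downloadRseDump(url, filePath, chunkSize=1024 * 1024):
        """Hypothetical: stream the &format=raw payload to disk in chunks."""
        with urllib.request.urlopen(url) as resp, open(filePath, "wb") as fd:
            shutil.copyfileobj(resp, fd, chunkSize)

    def readRseDump(filePath):
        """Hypothetical: yield unmerged LFNs one by one from the gzipped dump."""
        with gzip.open(filePath, "rt") as fd:
            for line in fd:
                yield line.rstrip("\n")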

Describe alternatives you've considered
Continue using the current JSON format, and hit resource limitations from time to time.

Additional context
None

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

FYI @vkuznet @amaltaro @klannon @ivmfnal

I was quite reluctant to create this issue, because we are struggling to complete this service development, but I cannot stop the tide. If we continue seeing errors related to those huge JSON files we are working with, this will become an important feature to implement. I am setting this with an Enhancement label and putting it in the Q2 developments.

@amaltaro
Contributor

Thanks for creating this issue, Todor. However, I see this more as a redesign of the service than an enhancement, since it's not going to affect only how we load data into memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on.

Perhaps the best option to avoid this major effort right now would be to work on an isolated instance of the service (I guess you have it fully operational in your virtual environment, right?) and try to delete as many no-longer-needed files as we can, such that the next Rucio ConMon unmerged storage dump comes with a much smaller number of files. For instance, we could split that 800MB JSON dump into 4 or 5 pieces and load it from disk instead of making the Rucio ConMon get call, where each cycle of the service would fetch a different piece.
OR, if you write the script that you were considering, then we could use that instead of the private MSUnmerged instance itself. I would vote for whatever takes less effort here, of course, provided the risks of making a mistake are close to 0.

@ivmfnal

ivmfnal commented Mar 11, 2022

To clarify, format=raw is in fact gzipped, not zipped. The difference is that the gzip format works in terms of files, while zip is good for stream compression. I think we should leave the "raw" format as is (a gzipped line-by-line list of paths) and add a third option, format=zip, which would be a zipped stream (see https://docs.python.org/3/library/zlib.html#module-zlib).
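
For reference, a sketch of what incremental decompression with the zlib module could look like (a hypothetical helper, not ConMon code); the wbits value selects between gzip and raw zlib framing:

    import zlib

    def iterDecompressed(chunks, wbits=zlib.MAX_WBITS | 16):
        """Decompress an iterable of byte chunks incrementally.

        wbits = MAX_WBITS | 16 expects a gzip header; use plain
        MAX_WBITS for a zlib-framed stream.
        """
        decomp = zlib.decompressobj(wbits)
        for chunk in chunks:
            yield decomp.decompress(chunk)
        yield decomp.flush()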

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

Thanks for taking a look @amaltaro !

However, I see this more as a redesign of the service than an enhancement, since it's not going to affect only how we load data into memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on.

Actually, I was not seeing it as a major change. Absolutely nothing should be changed in the MSUnmerged.py file itself at this stage. All the changes need to be enclosed in the Service/RucioConMon.py file. We should just implement the zipped I/O stream so that we can download the zipped version of the file list instead of the .json one, and then feed it back to MSUnmerged the normal way. There is only one entry point at the micro service itself, which is here.

@todor-ivanov
Contributor Author

Thanks @ivmfnal

To clarify, format=raw is in fact gzipped, not zipped.

That is an important clarification indeed. I failed to catch that subtle detail, but it must be taken into account when we dive deeper into the coding here.

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 11, 2022

And just to be even more precise about what I mean when I say "The whole change needs to be enclosed in the Service/RucioConMon.py":

For this Error, the disaster begins not when we try to load the file into memory for the MSUnmerged object to deal with the data, but rather when the Service/RucioConMon.py module tries to decode the JSON file:

 File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
    results = json.loads(results)
...
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)

This is basically a sign of a different issue. To me it looks more like a broken session and an incompletely downloaded JSON file, because both @ivmfnal and I did manage to download the JSON file properly and decode it on private machines. The cache size for the Service module could also play a role here... and many other things.
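
If we do move to the gzipped &format=raw download, one cheap way to detect such truncation (an assumption on my side, not existing code) is that gzip stores a CRC and length in its trailer, so reading a truncated file through to the end raises an error:

    import gzip
    import zlib

    def isCompleteGzip(path, chunkSize=1024 * 1024):
        """Return True if the gzipped file reads through to its end-of-stream marker."""
        try:
            with gzip.open(path, "rb") as fd:
                while fd.read(chunkSize):
                    pass
            return True
        except (EOFError, OSError, zlib.error):
            return False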

@ivmfnal

ivmfnal commented Mar 11, 2022

I wonder if there is some timeout on the client side which causes premature download termination.

@todor-ivanov
Contributor Author

yet another possibility indeed

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

Somehow I missed the announcement of this issue, but it is not too late for me to make a few remarks:

  • please stop reinventing the wheel with extra positional arguments; this is a bad approach and goes against the HTTP standard, and will only complicate things. As I pointed out in Define gzip encoding by default in the headers; properly decompress it #11041, the proper way is to set up the Content-Encoding and Accept-Encoding HTTP headers, which can deal with different encoding formats. This should be done instead of using custom optional arguments like the proposed format=zip. Instead, we should only enhance the underlying library (pycurl manager) to support different encodings (see the sketch after this list).
  • I suggest adding to RucioConMon.py parts similar to Define gzip encoding by default in the headers; properly decompress it #11041 which will use different encodings. Then, based on the requested encoding, the code will properly wrap the I/O stream.
  • if you want to generalize encoding, we may introduce a dedicated module for it, which will contain gzip, zip, lzma, etc. Please refer to the HTTP standard, e.g. here, or here, etc.
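
A minimal sketch of the header-based approach with raw pycurl (the actual change would live in pycurl_manager; the function name is illustrative):

    import io
    import pycurl

    def fetchGzipped(url):
        """Ask the server for gzip; libcurl decodes the body transparently."""
        buf = io.BytesIO()
        curl = pycurl.Curl()
        curl.setopt(pycurl.URL, url)
        # ENCODING (CURLOPT_ACCEPT_ENCODING) sets the Accept-Encoding header
        # and enables automatic decompression of the response body.
        curl.setopt(pycurl.ENCODING, "gzip")
        curl.setopt(pycurl.WRITEDATA, buf)
        curl.perform()
        curl.close()
        return buf.getvalue()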

@klannon

klannon commented Mar 22, 2022

Quick comments:

  • I am all for following the HTTP standard, so let's please do that if possible.
  • I think that one compression algorithm is enough for now. Let's try not to let this grow in scope.
  • If changes are needed on the Rucio side, can someone please open an issue there so that they can respond accordingly?

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

The reason why using an optional argument is the wrong approach is the following:

  • if you use format=raw or format=zip then you still need Content-Encoding, which you can't set to application/json since that would not be correct
  • therefore, if you use format=zip then you must use Content-Encoding: application/zip, which refers to a zip archive, but in this case you have no way to inform the code about your actual data format (JSON). Maybe the zip archive contains txt files, or maybe JSON. In this case there is no way to tell in the code, and if you assume one, your code will not be transparent to others
  • this will require additional checks in the code, like:
    • check for the format option
    • manage different format options
    • check whether the underlying data from the unzip procedure satisfies a certain data format
      etc.

@ivmfnal

ivmfnal commented Mar 22, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 22, 2022

@ivmfnal, I'm not sure I understand the implemented choices. My questions are:

  • does your service produce data in different data formats, JSON vs text vs CSV?
  • if so, what is the default format?
  • the Content-Type should correspond to the data format you're sending, e.g. JSON
  • the Content-Encoding should correspond to the encoding applied to your data, e.g. gzip

In particular, it is unclear what format=raw means; from your description it can be a gzipped file (of which format?) or an uncompressed text stream. How is it possible that a single option provides different data formats?

What should be done is the following:

  • the client provides your server two HTTP headers
  • Accept-Encoding: gzip (or any other encoding), which your server should use to encode the data
  • Accept-Type: application/json (or any other format), which your server should acknowledge by providing that data format.
    That's it: the client asks for data in a specific format and encoding. Then your server either acknowledges it or not. If it acknowledges the request, it will set Content-Type: application/json (or another data format) and Content-Encoding: gzip (or another encoding) such that the client will know how to deal with the provided data. If your server can't accept the requested headers, it will provide data in whatever default format it uses and set the proper HTTP headers. (A sketch of this flow is given below.)
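
As an illustration of this flow with the well-known requests library (the URL is a placeholder and this is not the actual WMCore client), using the standard Accept and Accept-Encoding request headers:

    import requests

    resp = requests.get(
        "https://example.cern.ch/rucioconmon/dump",  # placeholder URL
        headers={"Accept": "application/json", "Accept-Encoding": "gzip"},
    )
    # The response headers tell the client what the server actually sent;
    # requests decompresses a gzipped body transparently.
    print(resp.headers.get("Content-Type"))
    print(resp.headers.get("Content-Encoding"))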

@todor-ivanov
Contributor Author

todor-ivanov commented Mar 23, 2022

Thank you all, @klannon @vkuznet @ivmfnal, for the really good points you make here.

I am all for following the HTTP standard, so let's please do that if possible.

I believe we are all on the same page here. The only thing needed is to agree on the set of combinations between HTTP header types. I need to mention one thing though, @vkuznet: what you suggest here:

That's it, the client asks for data in specific format and encoding. Then, your server either acknowledge it or not. If it is acknowledge it, then it will set Content-Type: application/json (or other data format) and Content-Encoding: gzip (or other encoding) such that client will know how to deal with provided data. If your server can't accept requested headers it will provide data in whatever default format it use to and provide proper HTTP headers.

is really a shortcut into a broader area, which is the content negotiation mechanism. I do agree with your suggestion. But just to stay on the safe side and not deviate from best practices, let's note two things:

  • Setting just a single Accept-Encoding header would give the server exact guidance on the choice of compression for the response. But doing so would require the server to stick to only this mechanism for choosing the proper response for all clients, not only us, and to ignore cases where the client supports multiple encodings. As @amaltaro already properly mentioned here, we do set multiple encoding mechanisms in our service's requests here. I'd say this would be an incomplete implementation of the standard.
  • If we need to be really precise in what we are doing, an intermediate step is needed: in the cases where multiple encodings are supported by the client, the server side needs to respond with an HTTP 300 (Multiple Choices) status and a list of content representations (content type, encodings, etc.) together with the URI associated with each of them, for the client to choose from in the next step. So we are basically talking mostly about Agent-Driven Negotiation, as explained here. Hence the set of different URLs already developed on the RucioConMon side is not a waste of time, and I'd say they won't go in vain.

P.S. - One more thing worth adding here too.
The WMCore caches actually identify the cache to be used uniquely based mostly on the URL, not including any header information. This comes from the way we build the name for the cache (a sketch of the idea is given below). This means that if we always use the same URL for all returned encodings, we may get unexpected behavior when the contents returned by previous calls to the server were for a call using a different encoding type, if we do not have a way to set a preferred encoding among the multiple encodings supported in the client's request. Because the cache is static content with an associated lifetime and nothing more, it will not follow the negotiation mechanism that it would with the server. Of course we can use the HTTP header already set in the reply stored in the cache, but we still need a mechanism to signal to the top-level method (in this case that would be _getResultsZipped) whether the contents of this cache are what it expects. Yet another point in favor of preserving the different URIs associated with the different encoding types, which would strongly facilitate all of this. But I'd let @amaltaro correct me in case I am wrong here.
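
To illustrate the concern, a hypothetical sketch of a URL-keyed cache name (the actual WMCore naming logic lives in the Service layer and may differ):

    import hashlib

    def cacheFileName(url):
        # Hypothetical: since only the URL feeds the key, two calls that
        # differ solely in their Accept-Encoding header would collide on
        # the same cache file.
        return "rucioconmon_%s.cache" % hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]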

@ivmfnal

ivmfnal commented Mar 23, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 23, 2022

@ivmfnal, how do you define a general-purpose server? In other words, any server which relies on the HTTP protocol is a general-purpose server, regardless of what it is doing.

@ivmfnal

ivmfnal commented Mar 23, 2022 via email

@vkuznet
Contributor

vkuznet commented Mar 23, 2022

@todor-ivanov, your first bullet item does not require clients to stick to a single encoding. A client can choose one or another; therefore all of them have the freedom to request a specific encoding. The limitation comes from the server having to support them. Now, your pointer to Requests does not mean that the HTTP server will only be accessed through this class. We may change clients, or even move away from Python to access the HTTP server. As such, the server implementation should be independent of the choice of client and its settings.

For your second bullet item, I don't know the use-case here. Why do you need multiple encodings and multiple data formats to be supported in clients? I don't really understand their purpose.

Please outline use-cases here. For example:

  • do you need to support multiple data formats? If so, please justify every single one and give a specific reason why you need it
  • do you need to support multiple encodings? If so, please justify every single one with a specific use-case

To me, this discussion is useless without concrete use-cases. In DMWM land we use JSON everywhere as the data format, which can be either gzipped or zipped. So we need one data format and one encoding to compress the data. Unless I am missing something, I don't understand the point of this discussion and of multiple implementations for different formats/encodings.

@todor-ivanov
Contributor Author

Thanks @vkuznet

I myself do not want to go deep into general discussions here. As I already mentioned:

I do agree on your suggestion.

I was just making a note about a few cases we will not be able to cover, where we will not be following all best practices by the book, but that's fine here. I am not advocating that we should waste time implementing all of this right now. I still stand behind this:

The only thing that is needed is to agree on the set of combinations between HTTP header types.

Once we merge your PR for the zipped I/O in pycurl_manager, I will immediately rebase on top of it and use it here.

@todor-ivanov
Contributor Author

Writing this here as a reminder: there is a change in RucioConMon that needs to be reverted once we are done with the current issue.

Last time, when we were having trouble with the large JSON file for RAL, we requested a change in RucioConMon: to filter out the /store/unmerged/logs/ branch. The change was done at their frontend, and it also provides the ability to filter some parts of the tree through the API calls [1], which is indeed good, but the filter we requested above is hardwired until we fix the issue of implementing the zipped I/O streams. Once we are done, we must request that this filter be dropped from RucioConMon and rely on our own mechanism to filter the input list. This way we will continue cleaning the /store/unmerged/logs branch as before. If at some point we decide to take advantage of the API extensions Igor has provided, that would be another story, to be tracked in a separate issue.

[1]
#10982 (comment)
