MSUnmerged: Implement Zipped I/O streams #11036
Comments
FYI @vkuznet @amaltaro @klannon @ivmfnal I was quite reluctant to create this issue, because we are struggling to complete this service development, but I cannot stop the tide. If we continue seeing errors related to the huge JSON files we are working with, this will become an important feature to implement. I am setting this with an …
Thanks for creating this issue, Todor. However, I see this more as a redesign of the service than an enhancement, since it is not going to affect only how we load data in memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on. Perhaps the best option to avoid this major effort right now would be to work on an isolated instance of the service (I guess you have it fully operational in your virtual environment, right?) and try to delete as many no-longer-needed files as we can, such that the next Rucio ConMon unmerged storage dump comes with a much smaller number of files. For instance, we could split that 800MB JSON dump into 4 or 5 pieces and load it from disk instead of making the Rucio ConMon GET call, where each cycle of the service would fetch a different piece.
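For illustration, here is a minimal sketch of that splitting step. The file names, the piece count, and the assumption that the dump is one flat JSON list of LFNs are all hypothetical:

```python
import json

PIECES = 5  # hypothetical piece count

# Load the dump once (on a machine with enough memory), then split it
# so that each service cycle can later load one piece from disk.
with open("unmerged_dump.json") as fd:
    lfns = json.load(fd)  # assumed layout: a flat list of LFN strings

step = len(lfns) // PIECES + 1
for i in range(PIECES):
    with open(f"unmerged_dump_{i}.json", "w") as fd:
        json.dump(lfns[i * step:(i + 1) * step], fd)
```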
To clarify, …
Thanks for taking a look @amaltaro!
Actually, I was not seeing it as a major change. Absolutely nothing should be changed in the …
Thanks @ivmfnal
That is an important clarification indeed. I failed to catch that slight detail, but it must be taken into account when we dive deeper into the coding here.
And just to be even more precise about what I mean when I say "the whole change needs to be enclosed in the …": for this error, the disaster begins not when we try to load the file into memory for the MSUnmerged object to deal with the data, but rather when the …
Which is basically a sign of a different issue. To me it looks more like a broken session and an incompletely downloaded JSON file, because both @ivmfnal and I did manage to download the JSON file properly and decode it on private machines. It could also be the cache size for the …
I wonder if there is some timeout on the client side which causes premature download termination.
Yet another possibility indeed.
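For illustration, a minimal sketch of a client-side download with an explicit timeout and streamed writes, so that a prematurely terminated transfer surfaces as an exception during iteration rather than as a silently truncated file. The URL and chunk size are assumptions:

```python
import requests

# Hypothetical RucioConMon dump URL; the real endpoint may differ.
URL = "https://cmsweb.cern.ch/rucioconmon/files?rse=T1_UK_RAL&format=raw"

def download_dump(url, dest, chunk_size=1024 * 1024, timeout=(5, 300)):
    """Stream the dump to disk; timeout is (connect, read) in seconds."""
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fd:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fd.write(chunk)
    return dest
```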
Somehow I missed the announcement of this issue, but it is not too late for me to make a few remarks: …
Quick comments: …
The reason why using an optional argument is the wrong approach is the following:
* if you use format=raw or format=zip, then you still need a Content-Encoding, which you can't set to application/json since it would not be correct
* therefore, if you use format=zip, then you must use Content-Encoding: application/zip, which refers to a zip archive; but in this case you have no way to inform the code about your actual data format (JSON). Maybe the zip archive contains txt files, or maybe JSON. There is no way to tell in code, and if you assume one, your code will not be transparent to others
* this will require additional checks in the code, like:
  * check for the format option
  * manage different format options
  * check whether the underlying data from the unzip procedure satisfies a certain data format
etc.
To clarify: currently the following 4 options are implemented:
* format = "raw":
  * the server sends the file produced by the scanner (compressed or not) as is
  * formats:
    * gzipped flat file (newline-separated list of file paths), Content-Type header sent by the server: application/x-gzip
    * or uncompressed flat file, Content-Type: text/plain
* format = "zip-stream":
  * zlib-compressed flat file (see https://docs.python.org/3/library/zlib.html)
  * Content-Type: application/zip
* format = "text":
  * uncompressed flat file
  * Content-Type: text/plain
* format = "json":
  * uncompressed JSON file (list of strings)
  * Content-Type: text/json
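For illustration, a minimal client-side sketch of dispatching on the Content-Type returned for each of the options above. The endpoint URL and query parameters are assumptions, not the actual RucioConMon interface:

```python
import gzip
import json
import zlib

import requests

# Hypothetical endpoint; the real URL and parameter names may differ.
URL = "https://cmsweb.cern.ch/rucioconmon/files"

def fetch_file_list(rse, fmt="raw"):
    """Fetch an unmerged file list and decode it based on Content-Type."""
    resp = requests.get(URL, params={"rse": rse, "format": fmt}, timeout=300)
    resp.raise_for_status()
    ctype = resp.headers.get("Content-Type", "")
    if ctype == "application/x-gzip":
        text = gzip.decompress(resp.content).decode("utf-8")
    elif ctype == "application/zip":  # the zlib-compressed "zip-stream" case
        text = zlib.decompress(resp.content).decode("utf-8")
    elif ctype == "text/json":
        return json.loads(resp.text)  # already a list of strings
    else:  # text/plain
        text = resp.text
    return text.splitlines()
```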
@ivmfnal, I'm not sure I understand the implemented choices. My questions are: …
Especially, it is unclear what … What should be done is the following: …
Thank you all @klannon @vkuznet @ivmfnal, those are really good points that you make here.
I believe we are all on the same page. The only thing needed is to agree on the set of combinations of HTTP header types. I need to mention one thing though, @vkuznet: what you suggest here …
is really a shortcut to a broader area, which is the content negotiation mechanism (see the sketch after this comment). I do agree with your suggestion. But just to stay on the safe side and not deviate from best practices, let's note two things: …
P.S. One more thing that is worth adding here too: …
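For illustration, a minimal sketch of the standard content negotiation headers mentioned above, as a generic HTTP client would send them. This shows the mechanism only and is not a description of what RucioConMon currently implements:

```python
import requests

# Hypothetical endpoint, used only to illustrate the negotiation headers.
URL = "https://cmsweb.cern.ch/rucioconmon/files"

headers = {
    "Accept": "application/json",  # the data format we want back
    "Accept-Encoding": "gzip",     # the transfer compression we can handle
}
resp = requests.get(URL, params={"rse": "T1_UK_RAL"}, headers=headers, timeout=300)
resp.raise_for_status()
# requests transparently decodes a gzip Content-Encoding, so .json()
# sees plain JSON regardless of the compression used on the wire.
data = resp.json()
```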
@ivmfnal, how do you define a general-purpose server? In other words, any server which relies on the HTTP protocol is a general-purpose server, regardless of what it is doing.
@todor-ivanov, your first bullet item does not require clients to stick to a single encoding. A client can choose one or another; therefore all of them have the freedom to request a specific encoding. The limitation comes from what the server supports. Now, your pointer to Requests does not mean that the HTTP server will only be accessed through this class. We may change clients, or even stop using Python to access the HTTP server. As such, the server implementation should be independent from the choice of client and its settings. For your second bullet item, I don't know the use case. Why do you need multiple encodings and multiple data formats to be supported in clients? I don't really understand their purpose. Please outline use cases here. For example: …
To me, this discussion is useless without concrete use cases. In DMWM land we use JSON everywhere as the data format, which can be either gzipped or zipped. So we need one data format and one encoding to compress the data. Unless I am missing something, I don't see the point of this discussion and of multiple implementations for different formats/encodings.
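For illustration, a minimal sketch of the single-format, single-encoding approach described above: JSON as the one data format, gzip as the one encoding. File names and contents are placeholders:

```python
import gzip
import json

records = ["/store/unmerged/a.root", "/store/unmerged/b.root"]

# Write: one data format (JSON) with one encoding (gzip).
with gzip.open("dump.json.gz", "wt", encoding="utf-8") as fd:
    json.dump(records, fd)

# Read it back the same way; gzip.open handles the decompression.
with gzip.open("dump.json.gz", "rt", encoding="utf-8") as fd:
    data = json.load(fd)
```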
Thanks @vkuznet. I myself do not want to go deep into general discussions here. As I already mentioned: …
I was just making a note about a few cases we will not be able to cover, and that we will not be following all best practices by the book, but that's fine here. I am not advocating that we should waste time implementing all of this right now. I still stand behind this: …
Once we merge your PR for the zipped I/O in …
Writing it here as a reminder: there is a change in RucioConMon that needs to be reverted once we are done with the current issue. Last time, when we were having trouble with the large JSON file for RAL, we requested a change in RucioConMon - to filter out [1] …
Impact of the new feature
MSUnmerged
Is your feature request related to a problem? Please describe.
In the current service construction we already have several APIs from external services to use. One on which we strongly depend is the RucioConsistencyMonitor, for fetching unmerged file lists and RSE statistics. The lists provided by RucioConMon can grow significantly; a typical example is the case discussed in the first few comments of this issue. RucioConMon provides two APIs for downloading the list contents: `&format=json` and `&format=raw`. So far we are only using the `.json` interface, but working with single-line JSON files with sizes of the order of gigabytes is quite difficult, resource consuming (memory, network), error prone, etc. Currently we do not benefit from zip compression during download, and we do not have any zipped I/O streams implemented either, so that we could properly load the contents into memory. The feature request in this issue is to implement zipped I/O streams in order to alleviate the impact of the huge list sizes and to avoid hitting the K8s pods' resource limits in the future.
NOTE:
Since these are text archives, the compression ratio is high. We should expect a significant difference between the download size and the in-memory size if we happen to load the whole contents into memory, even though we should plan to work in chunks of data all the time. There are well-established Python libraries to help with this, such as `shutil.copyfileobj()`. Just as a note, here is a non-exhaustive list of libraries to be considered during this development: `io`, `zipfile`, `shutil`, etc.
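For illustration, a minimal sketch of that chunked approach, decompressing a gzipped dump already on disk with `shutil.copyfileobj()` instead of inflating everything in memory at once. File names and chunk size are placeholders:

```python
import gzip
import shutil

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB per copy step; tune as needed

# Decompress dump.json.gz into dump.json while holding at most one
# chunk of decompressed data in memory at any time.
with gzip.open("dump.json.gz", "rb") as src, open("dump.json", "wb") as dst:
    shutil.copyfileobj(src, dst, CHUNK_SIZE)
```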
Describe the solution you'd like
Implement the needed methods for downloading the compressed file to disk, as provided by the RucioConMon API with `&format=raw`. Then use the proper libraries to implement a zipped I/O stream for loading the contents into memory in chunks of configurable data size. There are already two existing placeholders for these methods to be implemented: …
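For illustration, a minimal sketch of such a zipped I/O stream: a generator that downloads the `&format=raw` payload and yields decompressed chunks of configurable size. The function name and URL are placeholders, not the actual WMCore placeholders referenced above:

```python
import zlib

import requests

def stream_unmerged_dump(url, chunk_size=8 * 1024 * 1024):
    """Download a compressed dump and yield decompressed chunks."""
    # wbits=47 (32 + 15) lets zlib auto-detect gzip or zlib headers.
    decomp = zlib.decompressobj(wbits=47)
    with requests.get(url, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            data = decomp.decompress(chunk)
            if data:
                yield data
    tail = decomp.flush()
    if tail:
        yield tail

# Hypothetical usage: process the list without materializing it whole.
# for block in stream_unmerged_dump("https://.../files?rse=X&format=raw"):
#     handle(block)
```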
Describe alternatives you've considered
Continue using the current JSON format, and hit some resource limitations from time to time.
Additional context
None