Agent Job Re-submission Project #11881
Comments
To modify the job.pkl file, the first step is to get its path. I see that this line of code retrieves all the job information from the database and stores it in the variable loadAction:
Then, in
This variable is the output of the 'execute' function in: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52, which is a dictionary. The cache dir is the information of interest, and I am not 100% sure whether it will simply be a key of that dictionary, since some formatting is applied to the result.
An update on my previous comment: the 'execute' function returns a list in which each element is a dictionary holding the result of the SQL query for one job id in the input list: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52
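To illustrate the lookup described above, here is a minimal sketch of how the job.pkl path could be derived from such a list of dictionaries. The key names `id` and `cache_dir`, the helper name `job_pickle_paths`, and the sample rows are assumptions made for illustration; they are not taken from the actual output of `LoadFromID.execute`, so the real keys should be checked first.

```python
import os


def job_pickle_paths(load_results, cache_key="cache_dir"):
    """Map each job id to the expected location of its job.pkl file.

    `load_results` is assumed to be a list of dicts, one per job id.
    The key names used here are hypothetical.
    """
    paths = {}
    for row in load_results:
        cache_dir = row.get(cache_key)
        if not cache_dir:
            # Skip rows that carry no cache directory.
            continue
        paths[row["id"]] = os.path.join(cache_dir, "job.pkl")
    return paths


# Stand-in rows, not real database output.
rows = [
    {"id": 1, "cache_dir": "/data/JobCache/workflow/task/job_1"},
    {"id": 2, "cache_dir": None},
]
print(job_pickle_paths(rows))
```

Rows without a cache directory are skipped rather than raising, so a single incomplete record does not block the rest of the batch.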
@amaltaro @todor-ivanov could you take a look at the proposed solution here: In summary:
@germanfgv @LinaresToine apologies for the delay in getting back to this. The idea looks good in general, but I do have a few concerns and further comments to be considered:
Thank you very much @amaltaro for your comments. We will make sure the sandbox is changed only when a job's new memory is greater than the one in the sandbox. @germanfgv, any ideas on this? A PR to the WMCore master branch was created to track the progress: #11928
Hello @amaltaro. The PR was updated so that jobs get modified by task rather than by sandbox.
For clarity, the error I have stumbled upon is: Failed to execute JobCreator. Error: pickle data was truncated. After several tests, I observe that this error is intermittent: it shows up in some runs and not in others. I believe it comes from how pickle loads and manages the data of the pickle file, and for some reason it does not always tolerate changes in the pickle data. This gets in the way of the automatic retries of high memory jobs, because modifying the memory requires modifying the job pickle file. I would appreciate any guidance or ideas on how to tackle this problem.
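One possible cause, offered only as a hypothesis: pickle typically reports truncated data when the byte stream ends before the object is complete, which could happen if another component reads job.pkl while it is still being rewritten. A minimal sketch of a write that avoids that window is to dump to a temporary file in the same directory and then rename it over the original. The helper name `atomic_pickle_dump` is hypothetical and is not part of WMCore.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, path):
    """Write obj to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the error still appears with a write like this, the race hypothesis is probably wrong and the cause lies elsewhere.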
Hello @amaltaro, I believe the patch is ready for review. The only addition that remains to be tested is the change in ErrorHandlerPoller.py, described in https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/ErrorHandler/ErrorHandlerPoller.py#L11 In the replays I ran, I took advantage of the PauseAlgo parameter, which allows you to retry jobs an arbitrary number of times according to their job type and exit code. Specifically:
Since Central Production does not use PauseAlgo, I thought adding the changes in ErrorHandler was the easiest way. Please let me know what you think. Also, the maxPSS parameter of a sandbox is not easily accessible, so I decided to keep track of that value in a dictionary called dataDict, which is stored in a json file in the RetryManager component directory: https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/RetryManager/Modifier/BaseModifier.py#L34 Finally, after several replays, the jobs are being resumed automatically and successfully, with no mismatch between job[estimatedMemoryUsage] and maxPSS. Also, to minimize the number of jobs affected by the sandbox modification, the maxPSS is changed per task rather than for the entire sandbox. All the proposed changes can be seen in LinaresToine#3 Thanks again for your time and attention.
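For readers following the bookkeeping idea, this is a rough sketch of a per-task maxPSS record kept in a json file. It is not the code in BaseModifier.py; the function name `record_max_pss`, the file layout (task name mapped to maxPSS), and the example values are all assumptions.

```python
import json
import os


def record_max_pss(json_path, task_name, max_pss):
    """Store the latest maxPSS for a task and return the full record."""
    data_dict = {}
    if os.path.exists(json_path):
        with open(json_path, encoding="utf-8") as handle:
            data_dict = json.load(handle)
    # One entry per task; a later call overwrites the earlier value.
    data_dict[task_name] = max_pss
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(data_dict, handle, indent=2)
    return data_dict
```

A call such as `record_max_pss("dataDict.json", "/workflow/task", 3000)` would create the file on first use and update it afterwards.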
Hello. A quick update on this issue. The patch in #11928 was tested in a T0 agent, and all jobs get modified successfully. Additional changes were required for a central production agent, since those agents do not use the PauseAlgo, which allows for multiple retries of a failed job with a given exit code. That modification is in the ErrorHandler. I talked with @hassan11196 to get the patch tested in a central production agent. I would also like to note that the patch currently keeps the data of the retried jobs and the new memory values in a json file in the RetryManager component log. I believe a more elegant way to do it is to adapt the Oracle database to hold this data, for example by making maxPSS available in a WMBS table, along with the job estimatedMemoryUsage, for better bookkeeping. @amaltaro what do you think?
Impact of the new feature
Impact on the WMAgent
Is your feature request related to a problem? Please describe.
There are exit codes for which the jobs are simply retried without modifying anything, when in reality something should or could be modified before resubmission.
Describe the solution you'd like
For example, for exit code 50660, raised when a job requires more memory, the resubmission should update the pkl file and the sandbox with a higher memory value before resubmitting the failed job. Other exit codes of interest should have a similarly specific resubmission procedure.
Describe alternatives you've considered
For now we have only given thought to the retry process of high memory jobs. More additions should come with this project, though, since the main motivation is to make the retry process more automatic for the exit codes that allow it. For this, we need a set of functions that modify the job parameters and that the retry manager uses when dealing with a specific error code.
For the retry of high memory jobs, changing the maxPSS parameter requires modifying the job sandbox as well as the job.pkl file. A function that takes care of these modifications should take a job id as a parameter. It should also either define the new maxPSS itself or receive it as a parameter.
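As a rough sketch of the job.pkl half of such a function: the version below takes the pickle path directly (the job id to path lookup and the sandbox change are left out), and either receives the new value or derives one from a scaling factor. The function name `bump_job_memory`, the default factor of 1.5, and the assumption that the unpickled job behaves like a dict with an `estimatedMemoryUsage` key are all illustrative, not the proposed implementation.

```python
import pickle


def bump_job_memory(pkl_path, new_max_pss=None, factor=1.5):
    """Raise estimatedMemoryUsage in a job pickle; never lower it.

    Returns the memory value stored in the file after the call.
    """
    with open(pkl_path, "rb") as handle:
        job = pickle.load(handle)

    current = job.get("estimatedMemoryUsage") or 0
    # Use the caller's value if given, otherwise scale the current one.
    target = new_max_pss if new_max_pss is not None else int(current * factor)
    if target <= current:
        # Leave the file untouched when the new value is not higher.
        return current

    job["estimatedMemoryUsage"] = target
    with open(pkl_path, "wb") as handle:
        pickle.dump(job, handle)
    return target
```

Returning early when the new value is not higher keeps the file untouched in that case, so repeated retries cannot shrink the request.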
Additional context