Agent Job Re-submission Project #11881

Open
LinaresToine opened this issue Jan 30, 2024 · 9 comments

Comments

@LinaresToine

LinaresToine commented Jan 30, 2024

Impact of the new feature
Impact on the WMAgent

Is your feature request related to a problem? Please describe.
There are exit codes for which jobs are simply retried without modifying anything, when in reality something should or could be modified before resubmission.

Describe the solution you'd like
For example, exit code 50660 indicates that a job requires more memory. In that case, the resubmission should update the pkl file and the sandbox with a higher memory value before resubmitting the failed job. Other exit codes of interest should have a similarly specific resubmission procedure.

Describe alternatives you've considered
For now we have only given thought to the retry process for high memory jobs. More additions should come with this project, though, since the main motivation is to make the retry process more automatic for exit codes that allow it. This calls for a set of functions that modify the job parameters and that the retry manager uses when dealing with a specific error code.

For the retry of high memory jobs, changing the maxPSS parameter requires modifying the job sandbox as well as the job.pkl file. A function that takes care of these modifications should take a job id as a parameter. It should also either define a new maxPSS or receive it as a parameter.
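
A minimal sketch of the job.pkl half of such a function. It assumes the pickle holds a dict-like job with an 'estimatedMemoryUsage' entry (a key mentioned later in this thread). The name `bump_job_memory`, the path argument (resolving a job id to a path is a separate step), and the default factor are hypothetical and not part of WMCore:

```python
import pickle


def bump_job_memory(job_pkl_path, new_memory=None, factor=1.5):
    """Raise the memory estimate stored in a pickled job and return the new value.

    Hypothetical helper: assumes the pickle holds a dict-like object with an
    'estimatedMemoryUsage' entry.
    """
    with open(job_pkl_path, "rb") as handle:
        job = pickle.load(handle)

    current = job["estimatedMemoryUsage"]
    # Use the caller's value if given, otherwise scale the current estimate.
    target = new_memory if new_memory is not None else int(current * factor)
    # Never lower the estimate on a retry.
    job["estimatedMemoryUsage"] = max(current, target)

    with open(job_pkl_path, "wb") as handle:
        pickle.dump(job, handle)
    return job["estimatedMemoryUsage"]
```

The sandbox side of the change is not covered here, since the thread does not describe how maxPSS is stored there.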

Additional context

@LinaresToine
Author

To modify the job.pkl file, the first step is to get the path to that file. I see that this line of code creates the database action used to load the job information and stores it in the variable loadAction:

loadAction = self.daoFactory(classname="Jobs.LoadFromID")

Then, the call

results = loadAction.execute(jobID=binds)

creates a new variable, 'results'.
This variable is the output of the 'execute' function in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52, which is a dictionary.

The cache dir is the information of interest, and I am not 100% sure whether it will simply be a key of that dictionary, since there is some formatting going on.

@LinaresToine
Author

An update on my previous comment:

The 'execute' function returns a list in which each element is a dictionary with the result of the SQL query for one job id from the input list: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Jobs/LoadFromID.py#L52
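
To illustrate how the job.pkl path could be derived from a result of that shape, here is a small sketch that runs on stand-in data. The key names 'id' and 'cache_dir' are assumptions (the thread does not confirm the exact keys), and `cache_dirs_by_job` is a hypothetical helper:

```python
import os


def cache_dirs_by_job(results, key="cache_dir"):
    """Map job id -> cache dir from a list of per-job result dictionaries."""
    return {row["id"]: row[key] for row in results if key in row}


# Stand-in data shaped like the list described above; not real query output.
results = [
    {"id": 101, "cache_dir": "/tmp/JobCache/job_101"},
    {"id": 102, "cache_dir": "/tmp/JobCache/job_102"},
]

# The job.pkl file is assumed to sit directly inside each job's cache dir.
job_pkl_paths = {
    job_id: os.path.join(cache_dir, "job.pkl")
    for job_id, cache_dir in cache_dirs_by_job(results).items()
}
```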

This was referenced Feb 15, 2024
@germanfgv
Contributor

@amaltaro @todor-ivanov could you take a look at the proposed solution here:
LinaresToine#3

In summary:

  • We are creating the concept of JobModifiers, to be used within the Retry Manager. These are additional plugins that can automatically change an attribute of a failed job before retrying it, according to the plugin configuration. They can change the job's job.pkl or workload.pkl, as needed.
  • JobModifiers are assigned to specific error codes.
  • Add an additional step to RetryManager: after deciding which jobs to retry (as usual), RetryManager will check whether any of those jobs failed with an error code assigned to a JobModifier and, if so, modify them.
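
As a rough illustration of the three points above, the following sketch maps error codes to modifiers and applies them after the retry decision. It works on plain in-memory dictionaries rather than on job.pkl files, and every name in it (`MemoryModifier`, `MODIFIERS`, `apply_modifiers`, the 'exit_code' key, the factor) is hypothetical and not taken from the proposed patch:

```python
class MemoryModifier:
    """Hypothetical modifier: raises the job memory estimate by a fixed factor."""

    def __init__(self, factor=1.5):
        self.factor = factor

    def modify(self, job):
        job["estimatedMemoryUsage"] = int(job["estimatedMemoryUsage"] * self.factor)
        return job


# Error code -> modifier; 50660 is the high-memory exit code named in this issue.
MODIFIERS = {50660: MemoryModifier()}


def apply_modifiers(jobs_to_retry, modifiers=MODIFIERS):
    """Extra step after the retry decision.

    Modifies only the jobs whose exit code has a modifier assigned and
    returns the ids of the jobs that were changed.
    """
    modified = []
    for job in jobs_to_retry:
        modifier = modifiers.get(job.get("exit_code"))
        if modifier is not None:
            modifier.modify(job)
            modified.append(job["id"])
    return modified
```

Jobs with any other exit code pass through untouched, so the regular retry path stays as it is.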

@amaltaro
Contributor

amaltaro commented Mar 8, 2024

@germanfgv @LinaresToine apologies for the delay on getting back to this.

The idea looks good in general, but I do have a few concerns and further comments to be considered:
a) workload.pkl is shared among all the jobs, from the WMSandbox area. Which means, if one job changes it, those changes will be visible to any other job. So this is something that needs to be further investigated.
b) changing the job.pkl file means that files need to be changed in the filesystem. Which initially does not look like a great idea (compared to in memory or database changes), but given that only jobs in a given error code would go through this, I think we should proceed with this.
c) monitoring!!! At the moment, the only way I see to know whether a job was customized or not, would be through the agent logs (ComponentLog of the component). If everyone agrees, we can probably move forward with this, but that means we cannot commit to debug such cases.

@LinaresToine
Author

Thank you very much @amaltaro for your comments. We shall take care of the sandbox change so that it only happens when a job's new memory is greater than the one in the sandbox. @germanfgv, any ideas on this?

A PR to the WMCore master branch was created to track the progress: #11928

@LinaresToine
Author

Hello @amaltaro. The PR was updated so that jobs get modified by task rather than by sandbox.
On another note, in the tests we have performed so far, the JobCreator complains about the pkl files being truncated. Would you have an idea of how to work around this?

@LinaresToine
Author

LinaresToine commented Apr 30, 2024

For clarity, the error I have stumbled upon is:

Failed to execute JobCreator. Error: pickle data was truncated
Traceback (most recent call last):
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 376, in algorithm
    self.pollSubscriptions()
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 440, in pollSubscriptions
    wmWorkload = retrieveWMSpec(workflow=workflow)
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py", line 47, in retrieveWMSpec
    wmWorkload.load(wmWorkloadURL)
  File "/data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMCore/WMSpec/Persistency.py", line 65, in load
    self.data = pickle.load(handle)
_pickle.UnpicklingError: pickle data was truncated
2024-04-29 17:59:20,954:140150441658112:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
<WMComponent.JobCreator.JobCreatorPoller.JobCreatorPoller object at 0x7f775f72dfa0> <@========== WMException Start ==========@>
Exception Class: JobCreatorException
Message: Failed to execute JobCreator. Error: pickle data was truncated
ClassName : None
ModuleName : WMComponent.JobCreator.JobCreatorPoller
MethodName : algorithm
ClassInstance : None
FileName : /data/tier0/srv/wmagent/3.1.5/sw/slc7_amd64_gcc630/cms/t0/3.1.5/lib/python3.8/site-packages/WMComponent/JobCreator/JobCreatorPoller.py
LineNumber : 397
ErrorNr : 0

What I observe after several tests is that this error happens rather randomly: it does not show up in every run. I believe it comes from how pickle loads and manages the data of the pickle file, and for some reason it does not always tolerate changes in the pickle data. This is a problem for the automatic retry of high memory jobs, because modifying the memory requires modifying the job pickle file.

I would appreciate any guidance or ideas on how to tackle this problem.
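
One possible explanation, not confirmed in this thread, is that a component reads a pickle file while another one is still rewriting it in place. If that is the case, writing the new content to a temporary file in the same directory and then swapping it in with `os.replace` would let readers see either the old file or the new one, never a partial one. A sketch under that assumption (`atomic_pickle_dump` is a hypothetical helper, not existing WMCore code):

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, path):
    """Write obj to path so that concurrent readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that `tempfile.mkstemp` creates the file readable and writable only by its owner, so the file mode may need adjusting if other accounts read the pickle.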

@LinaresToine
Author

LinaresToine commented Jul 10, 2024

Hello @amaltaro ,

I believe the patch is ready for review. The only addition that remains to be tested is the set of changes in ErrorHandlerPoller.py described in https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/ErrorHandler/ErrorHandlerPoller.py#L11

In the replays I ran, I took advantage of the PauseAlgo parameter that allows you to retry jobs an arbitrary number of times according to their job type and exit code. Specifically:

config.RetryManager.PauseAlgo.section_('Processing')
config.RetryManager.PauseAlgo.Processing.retryErrorCodes = { 70: 0, 50660: 0, 50661: 1, 50664: 0, 71304: 1 }

Since Central Production does not use PauseAlgo, I thought adding the changes in ErrorHandler was the easiest way. Please let me know what you think.

Also, the maxPSS parameter of a sandbox is not easily accessible, so I decided to keep track of that value in a dictionary called dataDict, which is persisted in a JSON file in the RetryManager component directory: https://github.com/LinaresToine/WMCore/blob/76c1019c364fa4b94e4d191c329666b3c5e2d73c/src/python/WMComponent/RetryManager/Modifier/BaseModifier.py#L34
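
For readers who have not opened the patch, this kind of bookkeeping could look roughly like the sketch below. The actual layout of dataDict is not described in this thread, so a flat mapping from task name to maxPSS is assumed, and `load_data_dict` and `record_max_pss` are hypothetical names. The record is only raised when the new value is larger, in line with the rule discussed earlier in the thread:

```python
import json
import os


def load_data_dict(path):
    """Return the recorded values, or an empty dict if nothing was saved yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def record_max_pss(path, task_name, new_max_pss):
    """Store new_max_pss for task_name only if it exceeds the recorded value.

    Returns True when the record changed, i.e. when the sandbox would
    need to be updated as well.
    """
    data = load_data_dict(path)
    if new_max_pss <= data.get(task_name, 0):
        return False
    data[task_name] = new_max_pss
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)
    return True
```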

Finally, after several replays, the jobs are being resumed automatically and successfully, with no mismatch between job[estimatedMemoryUsage] and maxPSS. Also, to minimize the number of jobs affected by the sandbox modification, the maxPSS is changed per task rather than for the entire sandbox.

All the changes proposed can be seen in LinaresToine#3

Thanks again for your time and attention.

@LinaresToine
Author

Hello.

A quick update on what is going on with this issue. The patch in #11928 was tested in a T0 agent and gets all jobs modified successfully. Additional changes were required for a central production agent, given that they do not use the PauseAlgo, which allows for multiple retries of a failed job with a given exit code. That modification is in the ErrorHandler. I talked with @hassan11196 to get the patch tested in a central production agent.

I would also like to note that the patch currently keeps the data on retried jobs and the new memory values used in a JSON file in the RetryManager component log. I believe the more elegant way to do it is to adapt the Oracle database to hold this data, for example by making the maxPSS data available in a WMBS table, along with the job estimatedMemoryUsage, for better bookkeeping. @amaltaro what do you think?
