# CRAB vs HammerCloud
HammerCloud uses CRAB to submit jobs to CMS sites for continuous site monitoring.
This page describes the CRAB features that are meant explicitly for HammerCloud (HC) use and are not part of the general user documentation.
If you do not have a place to keep this, I guess I can make a twiki page in CRAB, but I do not want to encourage users to play with the activity flag.
The user (i.e. you, i.e. HC) sets the configuration parameter `General.activity`.
If that string contains "hc" (case insensitive), CRAB flags the task as a HammerCloud task and sets these classAds for reporting to MONIT, so that they become keys in ES/Grafana/Kibana searches:

```
CMS_WMTool   = 'HammerCloud'
CMS_TaskType = <same string as found in General.activity above>
CMS_Type     = 'Test'
```

Be aware that `CMS_Type = 'Test'` is used also by WMA.
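As a concrete example, here is a minimal sketch of a `crabConfig.py` fragment that triggers the HammerCloud tagging; the activity string `hctest` is only an illustration, any string containing "hc" (case insensitive) has the same effect:

```python
# Sketch of a crabConfig.py fragment, using the standard
# CRABClient.UserUtilities config object.
from CRABClient.UserUtilities import config
config = config()

# An activity string containing "hc" (case-insensitive) makes CRAB tag the
# task as HammerCloud, which results in these classAds for MONIT:
#   CMS_WMTool   = 'HammerCloud'
#   CMS_TaskType = 'hctest'   (the same string as General.activity)
#   CMS_Type     = 'Test'
config.General.activity = 'hctest'
```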
Besides what is reported, there is the matter of what runs and where. CRAB uses a parameter in the TaskWorker config [2]:

```python
config.TaskWorker.ActivitiesToRunEverywhere = ['hctest', 'hcdev']
```

to disable black lists [3] and the stageout check [4].
So if you want to use e.g. 'hctestNew' and still want it to run at blacklisted sites, you need to tell the CRAB operators in advance so that we change the config.
Alternatively, you can explicitly put in `crabConfig.py`:

```python
config.Site.ignoreGlobalBlacklist = True
config.General.transferOutputs = False
config.General.transferLogs = False
```

(no transfers... no need for the stageout check [5])
[4] https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/StageoutCheck.py#L14-L21
    https://github.com/dmwm/CRABServer/blob/32066a9248142e7851ebf9ebe0dd12f95679bef4/src/python/TaskWorker/Actions/StageoutCheck.py#L96-L100
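The reason a new string needs operator intervention is that the activity is matched exactly against the configured list. A minimal illustrative sketch of such a gate (not the actual TaskWorker code; the function name is made up):

```python
# Illustrative only: an exact-match lookup like this is why 'hctestNew'
# does not inherit the special treatment given to 'hctest' and 'hcdev'.
ACTIVITIES_TO_RUN_EVERYWHERE = ['hctest', 'hcdev']  # from the TaskWorker config

def skip_blacklists_and_stageout_check(activity):
    """Return True if site black lists and the stageout check should be bypassed."""
    return activity in ACTIVITIES_TO_RUN_EVERYWHERE

assert skip_blacklists_and_stageout_check('hctest') is True
assert skip_blacklists_and_stageout_check('hctestNew') is False
```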
For HammerCloud, CRAB can release the jobs in a task slowly, so that they are hopefully executed in a constant flow at the sites rather than all at the same time in O(100)-job bunches:
- standard operation: the user submits a 100-job task, and 100 jobs are queued in HTCondor "asap" via a quick succession of `condor_submit` calls (this is done by DAGMAN)
- slow release: the user specifies in `crabConfig.py` this line: `config.Debug.extraJDL=['+CRAB_JobReleaseTimeout=Nsec']`, where Nsec is an integer indicating a number of seconds (see the timing sketch after this list)
- then (still via DAGMAN, inserting a delay in each DAG node):
  - the task starts in the schedd at time T0
  - job #1 is submitted to HTCondor at T0 + Nsec
  - job #2 is submitted to HTCondor at T0 + 2*Nsec
  - ...
  - job #N is submitted to HTCondor at T0 + N*Nsec
  - there is no guarantee and no way to predict when jobs will start running; new submissions do not wait for previous jobs to complete
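As a quick back-of-the-envelope check of this schedule (illustrative Python only; the actual scheduling is done by DAGMAN):

```python
# Release schedule implied by +CRAB_JobReleaseTimeout, for illustration only.
n_jobs = 100   # jobs in the task
nsec = 300     # example value of Nsec (matches the RunJobs.dag excerpt below)

for job in (1, 2, n_jobs):
    offset = job * nsec
    print(f"job #{job:3d} is submitted at T0 + {offset:5d} s (~{offset / 3600:.1f} h)")

# job #  1 is submitted at T0 +   300 s (~0.1 h)
# job #  2 is submitted at T0 +   600 s (~0.2 h)
# job #100 is submitted at T0 + 30000 s (~8.3 h)
```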
And here is the code, which is all in all clear enough:
- all the work happens in the PreJob context
- at task start time, when DAGMAN starts, all PreJobs are executed quickly. The code in `PreJob.py` simply returns from all but the first of them with `status=4`, which asks DAGMAN to defer them. The printout from `PreJob.py` is only informational and the deferral time computed inside it is irrelevant
- deferred PreJobs are executed again by DAGMAN after the delay indicated in the DAG configuration (`SPOOL_DIR/RunJobs.dag`)
- relevant (trimmed) lines from `RunJobs.dag` in a real example:
```
SCRIPT DEFER 4 300 PRE Job1 dag_bootstrap.sh
SCRIPT DEFER 4 600 PRE Job2 dag_bootstrap.sh
SCRIPT DEFER 4 900 PRE Job3 dag_bootstrap.sh
SCRIPT DEFER 4 1200 PRE Job4 dag_bootstrap.sh
SCRIPT DEFER 4 1500 PRE Job5 dag_bootstrap.sh
SCRIPT DEFER 4 1800 PRE Job6 dag_bootstrap.sh
SCRIPT DEFER 4 2100 PRE Job7 dag_bootstrap.sh
...
SCRIPT DEFER 4 30000 PRE Job100 dag_bootstrap.sh
SCRIPT DEFER 4 30300 PRE Job101 dag_bootstrap.sh
SCRIPT DEFER 4 30600 PRE Job102 dag_bootstrap.sh
...
```
- so the time when each PreJob runs, and hence when the actual job is submitted to the global pool, is predefined at task start
- if DAGMAN is restarted for any reason (machine reboot, schedd restart, etc.), all PreJobs for non-completed jobs are executed again. Those "past due" are then submitted immediately (the code in the PreJob finds out that there is no need to defer), while the ones still to be submitted are deferred again by the amount initially specified, but now counted from the current (latest) DAGMAN start (see the sketch below)
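To make the above concrete, here is a minimal sketch of the deferral decision, with made-up names; this is not the real `PreJob.py`, just the logic it implements:

```python
import time

DEFER_STATUS = 4  # the exit status DAGMAN treats as "run this PreJob again later",
                  # matching the "DEFER 4 <seconds>" directives in RunJobs.dag

def prejob_exit_status(task_start_time, job_number, release_timeout):
    """Sketch of the PreJob decision: 0 lets the job be submitted now,
    DEFER_STATUS asks DAGMAN to re-run this PreJob after its configured delay."""
    scheduled = task_start_time + job_number * release_timeout
    if time.time() >= scheduled:
        # "past due" (e.g. after a DAGMAN restart): no need to defer, submit now
        return 0
    return DEFER_STATUS
```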