Error handling of failed jobs #286

simmonspaul · 2020-02-27T20:51:53Z

I use crunz to schedule a number of jobs that rely on third party servers. Some are job dependent, others I schedule to minimise resource contention but are otherwise independent.

For the record, the first, interim, and last steps of the jobs that I run update a db table (eg "job status") which provides the job name, status (start, step1..., end), and start, last update, and end times. This provides me a dashboard.

An issue I have is that the host will sometimes disconnect. I am trying to get to the bottom of it, but suffice to say that the error is difficult to trap so that the job can be closed gracefully. Jobs hang...

What this results in is a long running job. As it does not end and I have not yet been able to raise an error, the crunz schedule is effectively blocked and no further jobs run. (Similar outcome was raised in #260 and error enabled unlock #261).

Due to this issue, I have started to use monit (https://mmonit.com/monit/) to monitor my system and these jobs. I think monitoring should be left to another tool, and monit allows jobs to be monitored, stopped, and restarted based on criteria (eg long running, load, cpu, memory, etc). This might be an alternative solution to #193.

A dependency for these monitoring tools is that the Process ID or PID is captured. Typically this is achieved by a system process creating a .pid file in the /var/run directory. This .pid file will have a single line with the process id. When the process ends the .pid file is removed.

(It is also possible to monitor log files with monit.)

**I think that ideally, when crunz runs a task, it should create a .pid file. Indeed, when crunz itself runs it should produce a .pid file. This enables both to be monitored externally.**

I couldn't figure out how to achieve this within crunz schedule files so I added the following to each of my jobs to achieve this.

define('LOCK_FILE', '/var/run/' . basename(__FILE__) . '.pid');
if (tryLock()) {
    die("Already running.\n");
}
# Remove the lock file on exit
register_shutdown_function('unlink', LOCK_FILE);

function tryLock()
{
    # If the lock file exists, report the job as locked and return true.
    # (Checking whether an existing lock is stale is not implemented here.)
    # Otherwise create the lock file holding this process's PID and return false.

    if (file_exists(LOCK_FILE)) {
        echo 'locked';
        return true;
    }

    echo LOCK_FILE;
    # Write (not append) the PID so the file always holds a single line
    file_put_contents(LOCK_FILE, getmypid());
    echo 'not locked';
    return false;
}
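
The comment in tryLock() mentions checking whether an existing lock is stale, but the snippet never does so. A minimal sketch of one way this could be done, assuming the posix extension is available. The function names pidIsAlive() and acquirePidFile() are hypothetical and are not part of crunz:

```php
<?php
// Hypothetical helpers for illustration; nothing here is a crunz API.

// Returns true if a process with this PID appears to be running.
function pidIsAlive(int $pid): bool
{
    if ($pid <= 0) {
        return false;
    }
    if (!function_exists('posix_kill')) {
        // Without the posix extension we cannot tell, so err on the side of "alive".
        return true;
    }
    // Signal 0 delivers nothing; it only checks that the process can be signalled.
    // Caveat: this also returns false for a live process owned by another user.
    return posix_kill($pid, 0);
}

// Returns true if the pid file was written for the current process,
// false if another live process already holds it.
function acquirePidFile(string $pidFile): bool
{
    if (is_file($pidFile)) {
        $pid = (int) trim((string) file_get_contents($pidFile));
        if (pidIsAlive($pid)) {
            return false;
        }
        // The recorded process is gone, so the file is stale.
        unlink($pidFile);
    }

    return file_put_contents($pidFile, (string) getmypid()) !== false;
}
```

A job could call acquirePidFile() with its pid file path in place of tryLock() and die when it returns false, keeping the register_shutdown_function('unlink', ...) cleanup. The check-then-write sequence is not atomic, so two jobs starting at the same instant could both pass; something like flock() would be needed to close that gap.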

I intend to monitor my jobs and if they hit a resource limit then I will stop these processes using the monitoring tool. This will send an alert via email.

I will continue to rely on crunz to restart these jobs based on the task schedule criteria.

Should recording pids be embraced by crunz or left independent?

Thanks
Paul

PabloKowalczyk · 2020-03-09T10:47:08Z

Hello,
Do I understand correctly that your problem is that failed jobs don't end (which should be fixed by #261)?

simmonspaul · 2020-03-09T22:42:09Z

I'm using v2.1.x-dev and my jobs still lock - was not resolved by #261.

(Nothing to do with crunz... but the reason for my errors are disconnects from host:
Fatal error: Uncaught GuzzleHttp\Exception\RequestException: cURL error 56: Recv failure: Connection reset by peer
or
Fatal error: Uncaught GuzzleHttp\Exception\ServerException: Server error: ''resulted in a 502 Bad Gateway response:)

PabloKowalczyk · 2020-03-10T06:19:34Z

> Nothing to do with crunz... but the reason for my errors are disconnects from host

Is it a long-running script that does not exit after a disconnect?

simmonspaul · 2020-03-10T15:26:50Z

Pablo, yes that is correct. It is long running... because it does not error when a disconnect occurs. I have circumvented this by stopping and resetting the job via a monitoring tool. In the next crunz schedule, the job starts again and continues from where it left off (was disconnected).

Having said that, my greater point is that I believe that the scope of crunz should be starting tasks as part of a configurable schedule.

I do not believe that crunz is or should be a monitoring tool. eg Monit is a good monitoring solution. Crunz is a good scheduler. Therefore crunz should not be stopping the resulting jobs that have started as the result of a crunz task being triggered.

In order to best navigate ps commands and for php monitoring tools to work (eg monit) the process number or pid file (containing the process number) is required.

Currently when crunz runs, it does not generate a pid file.
Currently when individual crunz tasks are triggered, they do not produce pid files.
The name of pid files that are generated should be as configurable as the name of log files that crunz allows.

I believe that this can be handled with integration to the symfony/process as per #189.
Symfony's default is to place pid files in the cache folder (https://symfony.com/doc/current/configuration/override_dir_structure.html), and this can be overridden by using the --pidfile option (symfony/symfony#29160). I don't think crunz uses these features or the folder structure within its code.

Several issues that have been raised against crunz would be resolvable using a monitoring tool instead. Monitoring tools can read and react to the output of log files which crunz already produces and are better suited to stop, reset, or restart jobs as required.

Issues that are potentially in scope for monitoring tools: #281 (ie read the task log file and react), #260 (ie a monitoring tool should be used to stop jobs that breach criteria), #193 (a monitoring tool should detect failure and reset).

Issues that can be handled via pid file processing: #200 (ie check for the presence of a pid file before running).

As it is easy, I implemented pid processing directly in my jobs and achieved monitoring. IMHO, if crunz adopts pid processing, similar to how logging options are adopted, then crunz's scope is better contained.

PabloKowalczyk · 2020-03-11T15:12:25Z

I see now and you are 100% right, Crunz was never designed to keep or monitor long running scripts. IMO you should definitely use external tools to monitor status of your connection inside your script and not rely on Crunz at all.

simmonspaul · 2020-03-11T16:40:48Z

So as a feature request: crunz to produce pids when it starts and for the tasks that it is managing.

Crunz is powerful because it provides a core 'one stop shop' for features like providing log files for triggered scripts, and the ability to email script output and errors, rather than having to rebuild this in each user script.

Generating pids is core php and would be an improvement for crunz to embrace as then monitoring tools are fully enabled 'out of the box'.

This is not critical, as the workaround is to create pids in the scripts themselves as required.

default crunz pid directory location (customisable similar to log files).
crunz pid to be generated when crunz is active (allows crunz process to be monitored)
task pids to be generated when they are triggered in the schedule (allows task processes to be monitored).
default pid name should be the task name (customisable similar to log files).
Crunz should check for an existing pid before running a task again, or else provide a mechanism for this to be ignored (forced) in the task. See "Way to check when one job ran, make sure other job doesn't run before x time has passed" (#200).

I think this could be achieved through tighter integration with the already embraced symfony/process.

PabloKowalczyk · 2020-03-11T16:58:32Z

@simmonspaul having creation of PIDs in the core of Crunz is not a good idea IMO. With a plugin system this could be moved to an external package, which would be very good, but the plugin system is only in my head, not in code.
Actually Crunz itself doesn't need PIDs: everything is in memory, and a process can succeed or fail (end with code 0 or another code). Again: Crunz is not designed for long-running scripts, and adding this functionality is beyond the scope of this package.
