Error handling of failed jobs #286
Comments
Hello,
I'm using v2.1.x-dev and my jobs still lock; this was not resolved by #261. (Nothing to do with crunz... but the reason for my errors is disconnects from the host:
Is it a long-running script that does not exit after a disconnect?
Pablo, yes, that is correct. It is long running because it does not raise an error when a disconnect occurs. I have worked around this by stopping and resetting the job via a monitoring tool. On the next crunz schedule, the job starts again and continues from where it left off (where it was disconnected). Having said that, my greater point is that I believe the scope of crunz should be starting tasks as part of a configurable schedule. I do not believe that crunz is or should be a monitoring tool. E.g. Monit is a good monitoring solution; Crunz is a good scheduler. Therefore crunz should not be stopping the jobs that were started as the result of a crunz task being triggered. To navigate ps output easily, and for monitoring tools (e.g. monit) to work with PHP jobs, the process number or a pid file (containing the process number) is required.
I believe that this can be handled by integrating with symfony/process, as per #189. Several issues that have been raised against crunz would be resolvable using a monitoring tool instead. Monitoring tools can read and react to the log files that crunz already produces, and are better suited to stopping, resetting, or restarting jobs as required.

Issues that are potentially in the scope of a monitoring tool:
- #281: read the task log file and react
- #260: a monitoring tool should be used to stop jobs that breach criteria
- #193: a monitoring tool should detect the failure and reset

Issues that can be handled via pid file processing:
- #200: check for the presence of a pid file before running

As it is easy, I implemented pid processing directly in my jobs and achieved monitoring. IMHO, if crunz adopts pid processing, similar to how logging options are adopted, then the scope of crunz is better contained.
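The "check for the presence of a pid file before running" idea could look roughly like the sketch below. It is not Crunz code and not the code from the jobs described here; `isJobRunning` is a hypothetical helper, and it assumes the pid file holds a single process ID on one line.

```php
<?php
declare(strict_types=1);

/**
 * Return true if $pidFile exists and names a process that is still alive.
 * Hypothetical helper; assumes the file holds one process ID on one line.
 */
function isJobRunning(string $pidFile): bool
{
    if (!is_file($pidFile)) {
        return false;
    }

    $pid = (int) trim((string) file_get_contents($pidFile));
    if ($pid <= 0) {
        return false; // Empty or malformed file: treat as not running.
    }

    // With the posix extension, signal 0 tests whether the process exists
    // without delivering a signal. This also returns false when the process
    // exists but belongs to another user (permission denied).
    if (function_exists('posix_kill')) {
        return posix_kill($pid, 0);
    }

    // Without the posix extension, fall back to trusting the file.
    return true;
}
```

A job would call this at the top of the script and exit early when it returns true. The liveness check matters because a job that hangs or is killed can leave a stale pid file behind.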
I see now, and you are 100% right: Crunz was never designed to keep or monitor long-running scripts. IMO you should definitely use external tools to monitor the status of your connection inside your script and not rely on Crunz at all.
So, as a feature request: crunz should produce pids for itself when it starts and for the tasks that it is managing. Crunz is powerful because its core provides a 'one stop shop' for features like log files for triggered scripts and emailing script output and errors, rather than having to rebuild these in each user script. Generating pids is core PHP and would be an improvement for crunz to embrace, as monitoring tools would then be fully enabled 'out of the box'. This is not critical, as the workaround is to create pids in the scripts themselves as required.
I think this could be achieved through tighter integration with the already embraced Symfony Process component.
@simmonspaul, creating PIDs in the core of Crunz is not a good idea IMO. With a plugin system this could be moved to an external package, which would be very good, but the plugin system exists only in my head, not in code.
I use crunz to schedule a number of jobs that rely on third party servers. Some are job dependent, others I schedule to minimise resource contention but are otherwise independent.
For the record, the first, interim, and last steps of the jobs that I run update a db table (eg "job status") which provides the job name, status (start, step1..., end), and start, last update, and end times. This provides me a dashboard.
An issue that I have is that sometimes the host will disconnect. I am trying to get to the bottom of it, but suffice to say that the error is difficult to trap so that the job can be closed gracefully. Jobs hang...
What this results in is a long-running job. As it does not end and I have not yet been able to raise an error, the crunz schedule is effectively blocked and no further jobs run. (A similar outcome was raised in #260, and unlocking on error was enabled in #261.)
Due to this issue, I have started to use monit for monitoring my system and these jobs: https://mmonit.com/monit/. I think monitoring should be left to another tool, and monit allows jobs to be monitored, stopped, and restarted based on criteria (e.g. long running, load, cpu, memory, etc). This might be an alternative solution to #193.
A dependency for these monitoring tools is that the Process ID or PID is captured. Typically this is achieved by a system process creating a .pid file in the /var/run directory. This .pid file will have a single line with the process id. When the process ends the .pid file is removed.
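A minimal sketch of that pid file lifecycle in PHP is shown below. This is an illustration only: it is not part of Crunz and not the code used in the jobs described here, and `writePidFile` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Write the current process ID to $pidFile and remove the file on exit.
 * Hypothetical helper, not part of Crunz.
 */
function writePidFile(string $pidFile): void
{
    // getmypid() returns the current PHP process ID, or false on error.
    $pid = getmypid();
    if ($pid === false) {
        throw new RuntimeException('Could not determine the process ID.');
    }

    // A .pid file conventionally holds the process ID on a single line.
    if (file_put_contents($pidFile, $pid . PHP_EOL, LOCK_EX) === false) {
        throw new RuntimeException("Could not write PID file: {$pidFile}");
    }

    // Remove the file when the script finishes or calls exit().
    // Shutdown functions do not run if the process is killed by a signal
    // it does not handle, so a stale file can be left behind.
    register_shutdown_function(static function () use ($pidFile): void {
        if (is_file($pidFile)) {
            unlink($pidFile);
        }
    });
}
```

A job would call something like `writePidFile('/var/run/my-job.pid');` as its first step, where the path is an example only. Writing to /var/run usually requires elevated permissions, so a directory writable by the job's user may be a better choice.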
(It is also possible to monitor log files with monit.)
**I think ideally, when crunz runs a task, it should create a .pid file. Indeed, when crunz itself runs it should produce a .pid file. This enables both to be externally monitored.**
I couldn't figure out how to achieve this within crunz schedule files so I added the following to each of my jobs to achieve this.
I intend to monitor my jobs and if they hit a resource limit then I will stop these processes using the monitoring tool. This will send an alert via email.
I will continue to rely on crunz to restart these jobs based on the task schedule criteria.
Should recording pids be embraced by crunz or left independent?
Thanks
Paul