Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown ScheduledExecutorService at runtime termination #272

Open
farhin23 opened this issue Dec 6, 2024 · 10 comments
Open

Shutdown ScheduledExecutorService at runtime termination #272

farhin23 opened this issue Dec 6, 2024 · 10 comments
Assignees
Labels
bug_report Suspected bugs, awaiting triage

Comments

@farhin23
Copy link

farhin23 commented Dec 6, 2024

Bug Report

The ScheduledExecutorService (responsible for recursive crawling) does not shutdown when the EDC BaseRuntime gets shutdown.

Describe the Bug

In the current implementation of FederatedCatalog, the ExecutionPlan creates a ScheduledExecutorService that repetitively executes the crawling task. However, during runtime shutdown, the ExecutorService is not terminated as expected, causing the tasks to keep running indefinitely.

Expected Behavior

The ScheduledExecutorService should terminate gracefully during runtime shutdown. This includes canceling all future tasks, ensuring a proper shutdown, and cleaning up resources.

Observed Behavior

The ExecutorService does not terminate and continues executing tasks after the runtime shutdown is processed.

Steps to Reproduce

FcRecurringExecutionPlanTest.java contains tests designed to reproduce the issue.

Steps to reproduce the bug:

  1. Add the file FcRecurringExecutionPlanTest.java to project directory ./system-tests/end2end-test/e2e-junit-runner/src/test/java/org/eclipse/edc/end2end/ of FederatedCatalog.
  2. From project root execute the command: ./gradlew test -DincludeTags="EndToEndTest" -PverboseTest=true
  3. Review the logs of catalog runtimes, as they continue to execute the crawling task even after being shut down.

Logs Review

Track the logs of, Task :system-tests:end2end-test:e2e-junit-runner:test, starting from the following log,

Task :system-tests:end2end-test:e2e-junit-runner:test
FcRecurringExecutionPlanTest >

In this test class we have two catalogs: catalog-one and catalog-two. Let's say, catalog-one starts execution first. So it goes through the following phases:

  • catalog-one runtime boot

    [catalog-one] INFO 2024-12-06T18:14:05.821516008 52 service extensions started

    [catalog-one] INFO 2024-12-06T18:14:05.82173214 Runtime 2d1ab650-9bad-41df-b2e0-b4dcc31d9b36 ready

    INFO 2024-12-06T18:14:05.821937528 Runtime catalog-one started

  • Starts crawling. This can be observed in the logs through entries similar to the following,

    [catalog-one] DEBUG 2024-12-06T18:14:05.905056092 [ExecutionManager] Run pre-execution task

    [catalog-one] DEBUG 2024-12-06T18:14:05.905575718 [ExecutionManager] Loaded 1 work items from storage

    [catalog-one] DEBUG 2024-12-06T18:14:05.905815223 [ExecutionManager] Crawler parallelism is 1, based on config and number of work items

    [catalog-one] DEBUG 2024-12-06T18:14:05.909222619 [ExecutionManager] Crawler-8312f7a8-a0f3-4142-8672-da19f6c0fced: WorkItem acquired

  • Shutdown catalog-one runtime

    [catalog-one] DEBUG 2024-12-06T18:14:08.412094779 Shutdown Core Default Services

    [catalog-one] DEBUG 2024-12-06T18:14:08.412149472 Shutdown Federated Catalog Default Services

    [catalog-one] INFO 2024-12-06T18:14:08.412202229 Shutdown complete

In ideal case, the catalog-one crawler thread should get terminated after the shutdown. However, even after the runtime gets terminated, the scheduled thread keeps running. And so we can observe the following logs through out rest of the execution (through out the execution of catalog-two test).

[catalog-two] DEBUG 2024-12-06T23:09:41.312296155 Registered Web API context alias: default

[catalog-two] DEBUG 2024-12-06T23:09:41.474718829 Registered Web API context alias: protocol

[catalog-one] DEBUG 2024-12-06T23:09:41.528775262 [ExecutionManager] Run pre-execution task

[catalog-one] DEBUG 2024-12-06T23:09:41.529318959 [ExecutionManager] Loaded 1 work items from storage

[catalog-one] DEBUG 2024-12-06T23:09:41.529418432 [ExecutionManager] Crawler parallelism is 1, based on config and number of work items

[catalog-one] DEBUG 2024-12-06T23:09:41.529636797 [ExecutionManager] Crawler-425f25b6-6afc-41c5-936a-d4f6eb0c9aeb: WorkItem acquired

[catalog-two] DEBUG 2024-12-06T23:09:41.611729624 Registered Web API context alias: management

[catalog-two] DEBUG 2024-12-06T23:09:41.662468043 Registered Web API context alias: catalog

Context Information

  • Used version: EDC 0.10.1

Detailed Description

If applicable, add screenshots and logs to help explain your problem.

Possible Implementation

Proper shutdown method in FederatedCatalogCacheExtension to trigger the shutdown of the ExecutorService should solve this issue. This will ensure that all the scheduled tasks are terminated properly at the time of runtime shutdown.

@farhin23 farhin23 added bug_report Suspected bugs, awaiting triage triage all new issues awaiting classification labels Dec 6, 2024
@farhin23 farhin23 self-assigned this Dec 6, 2024
Copy link

github-actions bot commented Dec 6, 2024

Thanks for your contribution 🔥 We will take a look asap 🚀

@paullatzelsperger
Copy link
Member

please provide a proper, self-contained reproduction guide that doesn't involve other repositories, preferably in the form of a test, e.g. as Gist.

Do you mean that a thread keeps running after the host runtime (= java process) is killed?

@farhin23
Copy link
Author

farhin23 commented Dec 6, 2024

please provide a proper, self-contained reproduction guide that doesn't involve other repositories, preferably in the form of a test, e.g. as Gist.

Sure. I will work on that.

Do you mean that a thread keeps running after the host runtime (= java process) is killed?

By runtime I mean the EDC BaseRuntime. The thread keeps running after the BaseRuntime shutdown has been processed.

@farhin23 farhin23 assigned farhin23 and unassigned farhin23 Dec 6, 2024
@farhin23
Copy link
Author

farhin23 commented Dec 6, 2024

please provide a proper, self-contained reproduction guide that doesn't involve other repositories, preferably in the form of a test, e.g. as Gist.

I have provided a test class FcRecurringExecutionPlanTest.java which contains the tests designed to reproduce the bug. The updated "Steps to Reproduce" section includes detailed guidelines to replicate the issue.

@paullatzelsperger
Copy link
Member

paullatzelsperger commented Dec 10, 2024

I copied the code, and ran it, but unfortunately it does not illustrate the claimed defective behaviour at all - both were green. All you did was more or less copy another e2e test and duplicate it in a nested test suite.

Unless you can provide a test setup to clearly show the defective behaviour in a test with assertions, I'm afraid I'll have to close this issue. Here, what you will have to prove is that exiting the process doesn't actually kill all its threads.

In Java, all Threads that belong to a certain process (i.e. the java programm) are killed when the process exists. Shutting down the runtime is exactly that, terminating the process.

I would also recommend reading up on the use of the @Nested annotation.

@farhin23
Copy link
Author

I copied the code, and ran it, but unfortunately it does not illustrate the claimed defective behaviour at all - both were green.

Both of the tests will pass. The behavior is that even after the responsible BaseRuntime gets terminated, the task keeps running. And this can be tracked in the logs.

All you did was more or less copy another e2e test and duplicate it in a nested test suite.
I would also recommend reading up on the use of the @Nested annotation.

The issue occurs when multiple runtimes are booted. To demonstrate this behavior, I used that test example as a baseline.

Unless you can provide a test setup to clearly show the defective behaviour, I'm afraid I'll have to close this issue.

The issue is that tasks continue to run even after their runtimes have ended. While this does not cause the tests to fail, it can be observed in the logs. Although it doesn't impact the current tests, it leads to resource shortages in the GitHub pipeline, affecting other tests.

@paullatzelsperger
Copy link
Member

paullatzelsperger commented Dec 10, 2024

yeah, but what you created isn't a test. A (JUnit) test consists of three parts: arrange (set up test data), act (make the system under test do something), assert (verify a certain outcome). You created another, very complicated way to start a runtime, that uses the JUnit framework in the wrong way.

In any event. I think what you actually mean is this: The executor service in the RecurringExecutionPlan is never getting shut down.

If that is the case, then this is more of a feature request, rather than a bug, because all threads die when the owning process dies, so it is not faulty behaviour. Terminating the java process (= runtime) will definitely stop all the threads (= executor service).

As such, I would propose to close this issue, and start with a discussion first.
One way to approach this would be introducing an ExecutionPlan#stop() method which would probably solve things, but I'm not 100% convinced that it is the right solution. Mostly, because there could be execution plans that aren't recurring, so "stopping" them wouldn't make any sense semantically.
Another, possibly better solution would be to make ExecutionPlan#run() return a handle of some sort, which can then be used to forward the shutdown signal. For non-recurring plans this would just be a no-op, or maybe return null. Its a subtle but important difference IMO.

@farhin23
Copy link
Author

In any event. I think what you actually mean is this: The executor service in the RecurringExecutionPlan is never getting shut down.

Yes, that's precisely what I intended to say in the issue.

If that is the case, then this is more of a feature request, rather than a bug,
As such, I would propose to close this issue, and start with a discussion first.

We can proceed as you’ve suggested. Should I open a discussion for the feature request?

One way to approach this would be introducing an ExecutionPlan#stop() method which would probably solve things, but I'm not 100% convinced that it is the right solution. Mostly, because there could be execution plans that aren't recurring, so "stopping" them wouldn't make any sense semantically.

I was thinking of the solution more in this direction. This way, execution plans using the executor services for recurring tasks can override the stop() method to adjust to their own shutdown process.

Another, possibly better solution would be to make ExecutionPlan#run() return a handle of some sort, which can then be used to forward the shutdown signal.

For ScheduledExecutorService it is possible to return a ScheduledFuture instance which later can be used to cancel the future tasks.

@paullatzelsperger
Copy link
Member

Yes please open a discussion for a feature request

@farhin23
Copy link
Author

Yes please open a discussion for a feature request

I have created a discussion here: #273

@ndr-brt ndr-brt removed the triage all new issues awaiting classification label Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug_report Suspected bugs, awaiting triage
Projects
None yet
Development

No branches or pull requests

3 participants