
Indexing using multiple threads #5367

Merged (12 commits, Nov 4, 2022)

Conversation

@thomaslow (Collaborator) commented Sep 27, 2022

This pull request improves the batch indexing performance (triggered via the System/Indexing page) by using multiple threads for indexing.

A full re-indexing of 81,000 processes and 4,000 tasks took 12 minutes 12 seconds on my machine (8-core CPU, SSD, default batch size of 250). With these changes, indexing performance improves as follows (speed-up factors relative to the best-case scenario of the current master branch with a batch size of 2500):

| batch size | master | PR threads=1 | PR threads=4 | PR threads=8 |
| --- | --- | --- | --- | --- |
| 250 | 0.57 (12m:12s) | 1.04 (6m:40s) | 3.39 (2m:03s) | 4.85 (1m:26s) |
| 500 | 0.76 (9m:12s) | 0.95 (7m:19s) | 3.50 (1m:59s) | 5.09 (1m:22s) |
| 2500 | 1.00 (6m:57s) | 1.08 (6m:27s) | 3.26 (2m:08s) | 4.74 (1m:28s) |

There are still a bunch more problems with the indexing, but at least now you will notice them much quicker ;-)

Also, the indexing should be more resilient to intermittent errors. I restarted ElasticSearch during the indexing process and the indexing still finished successfully.

The default number of threads is set to 4 and can be configured via the elasticsearch.threads parameter.
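For illustration, here is a minimal, self-contained sketch of the general pattern behind these two parameters: a fixed pool of worker threads, each indexing one batch of objects per task. The class and method names are made up for this sketch and are not the actual Kitodo implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexingSketch {

    public static void main(String[] args) throws InterruptedException {
        int threads = 4;      // corresponds to the elasticsearch.threads parameter
        int batchSize = 250;  // corresponds to the elasticsearch.batch parameter
        int totalObjects = 81_000;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int offset = 0; offset < totalObjects; offset += batchSize) {
            final int batchOffset = offset;
            // each task loads one batch of objects and sends one bulk request
            pool.submit(() -> indexBatch(batchOffset, batchSize));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void indexBatch(int offset, int limit) {
        // placeholder for the real work: read objects from the database,
        // map them to documents and send them to ElasticSearch in one bulk request
        System.out.printf("indexing objects %d to %d%n", offset, offset + limit - 1);
    }
}
```

With threads=1 the pool processes the batches sequentially, which corresponds to the "PR threads=1" column in the table above.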

@henning-gerhardt (Collaborator):

I tried your changes, but you removed the workaround for #5179 (refreshing the page to prevent a high access rate), and within a short time gigabytes of log entries are created in the application proxy in front of the ElasticSearch system.

@thomaslow (Collaborator, Author) commented Oct 5, 2022

You are probably logging HTTP-requests that go to your ElasticSearch server, right?

You can limit the number of HTTP-requests used for indexing via the elasticsearch.batch parameter. A higher batch size will reduce the number of HTTP-requests required to index all objects, roughly number_of_objects / batch_size.

The HTTP-requests that are used to refresh the current number of already indexed objects (issue #5179) are triggered by the user interface and can be avoided by navigating away from the System page (not just switching to another tab within the System page), logging out of Kitodo, or closing the browser window. The indexing should not stop. As mentioned in #5181, there is currently no button to cancel the indexing process. Alternatively, we could increase the refresh interval (e.g. to 5 seconds or so?).

Another approach would be to set up log-rotation in your application proxy to avoid huge log files.

@henning-gerhardt (Collaborator):

> You are probably logging HTTP-requests that go to your ElasticSearch server, right?

Correct. We need and use this application proxy to limit access to the ElasticSearch instance.

> You can limit the number of HTTP-requests used for indexing via the elasticsearch.batch parameter. A higher batch size will reduce the number of HTTP-requests required to index all objects, roughly number_of_objects / batch_size.

We have already increased the value of the batch parameter to 2,500 to lower the number of transfers between both systems.

> The HTTP-requests that are used to refresh the current number of already indexed objects (issue #5179) are triggered by the user interface and can be avoided by navigating away from the System page (not just switching to another tab within the System page), logging out of Kitodo, or closing the browser window. The indexing should not stop. As mentioned in #5181, there is currently no button to cancel the indexing process.

The index page refreshing is the cause of the large number of log entries, not the indexing itself. I will try it and switch to another page after the indexing is started, as a possible new workaround for the open issue with the page refreshing.

> Alternatively, we could increase the refresh interval (e.g. to 5 seconds or so?).

This could be a way, but it should be separated into its own pull request, as it is "fixing" another issue. I never understood why the refreshing is done 10 times per second, but this should be discussed in the other issue.

> Another approach would be to set up log-rotation in your application proxy to avoid huge log files.

There is already log rotation, which works fine for normal use, but the indexing page refresh is the reason for so many log entries.

@henning-gerhardt (Collaborator) left a comment

This is only a code review with small adjustments. I will add some notes regarding the speed improvement later.

Comment on lines 90 to 91
logger.error("stop indexing after maximum amount of attempts");
throw new RuntimeException("stop indexing after maximum amount of attempts");
Collaborator:

Maybe use a string constant to avoid the duplicated text message? I am not sure whether logging and throwing an exception should happen in the same place, and a RuntimeException must be caught explicitly. Maybe @solth should decide here what a good solution is.

Collaborator Author:

The RuntimeException was used to break out of the while-loop. I added a boolean "failed" to gracefully exit the while loop and thread.
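For illustration, a minimal sketch of this pattern; names such as maxAttempts and indexBatch are illustrative and not the actual IndexWorker code.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class RetryLoopSketch {
    private static final Logger logger = LogManager.getLogger(RetryLoopSketch.class);
    private static final int maxAttempts = 10;

    void runIndexing(int offset, int batchSize) {
        boolean failed = false;
        boolean done = false;
        int attempt = 1;
        while (!failed && !done) {
            try {
                indexBatch(offset, batchSize);
                done = true;
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    logger.error("stop indexing after maximum amount of attempts", e);
                    failed = true; // leave the loop gracefully and let the thread finish
                } else {
                    attempt++;
                }
            }
        }
    }

    private void indexBatch(int offset, int limit) {
        // placeholder for the actual bulk indexing call
    }
}
```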

Comment on lines 67 to 68
Helper.setErrorMessage(e.getLocalizedMessage(), IndexingService.getLogger(), e);
logger.error(e);
Collaborator:

Maybe first log and then show the error message, like a few lines above? If showing the error message itself raises an exception or error, then the original error will not get logged; instead, the exception or error from displaying the message will be logged, and the original error gets lost.

Collaborator Author:

I removed the Helper.setErrorMessage call because a Thread.sleep interrupt is not really a problem.
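For illustration, a minimal sketch of the adjusted handling, assuming the sleep happens between indexing attempts; the class and method names are illustrative, not the actual code.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class SleepHandlingSketch {
    private static final Logger logger = LogManager.getLogger(SleepHandlingSketch.class);

    void waitBeforeNextAttempt(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // an interrupted sleep is not a real problem here, so it is only logged
            // instead of being shown as an error message in the user interface
            logger.error(e);
            Thread.currentThread().interrupt(); // restore the thread's interrupt status
        }
    }
}
```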

// by iterating over all indexed documents in elastic search
try {
while (offset < amountInIndex) {
// TODO: actually iterate over all elastic search documents
Collaborator:

When will this TODO be resolved? Is there already a change ready, or will this be fixed in a mid- or far-future change?

Collaborator Author:

This code was not changed by this pull request. I just added a comment that this is a problem. There is a new issue #5379 for this.

Collaborator:

Thanks, then this TODO / issue will not get lost.

searchService.removeLooseIndexData(searchService.findAllIDs(offset, indexLimit));
offset += indexLimit;
}
} catch (Exception e) {
Collaborator:

You are catching all kinds of exceptions. Is this really expected?

Collaborator Author:

This exception is actually an ElasticSearch exception. Unfortunately, the ElasticSearch Java library is not available in this class. I changed it to a RuntimeException, which is the most specific exception type available at this point.

@henning-gerhardt (Collaborator) commented Oct 7, 2022

The good thing is that your changes did not lower the indexing speed, if you use the workaround and switch to another page or close the tab as with the old implementation. But I cannot confirm the speed improvements at this large scale.

I did a lot of benchmarks; speed improvements in my local environment are visible, but not in a way that breaks records. I did all my benchmarks with a batch limit of 2,500 and an index limit of 2,500. I closed the tab or switched to another non-System page after the indexing of a single type had started (the workaround!), and before an indexing run was started, the corresponding index was cleared and the application restarted (clearing the index was done between application stop and restart).

Indexing around 560,000 processes including metadata files and properties (around 13 million property table entries in the database) needs 1:45 hours on a single core and only 1:15 hours on four cores, so not really faster on four cores than on a single core.

Indexing around 4.4 million tasks takes 6:20 hours on one core and 2:36 hours on four cores. That is an acceptable speed improvement, but not as fast as your timings.

Staying on the System / Indexing page has a massive negative impact on indexing speed, as it slows indexing down a lot: I aborted the process indexing after 3 hours, and only a quarter or a third of all data had been indexed.

The application proxy in front of the ElasticSearch service has little or no influence: the process indexing on four cores goes down to 1:10 hours.

@thomaslow (Collaborator, Author):

@henning-gerhardt Thank you for your detailed feedback. I will have a look at your code review soon. I'm sorry if I raised too high expectations for this pull request. I suppose your setup (multiple virtual machines?) has different bottlenecks than my setup (multiple Docker containers on one non-virtual machine). Also, my database consists mostly of fake data (empty processes, no metadata, only a few METS files, etc.). Maybe there are other bottlenecks with proper real-world data (like disk I/O) or a multi-server setup (like network delays).

Did you see full CPU utilization on each of your virtual machines (Database, ElasticSearch, Tomcat)? With 16 threads, my CPU is fully utilized at nearly 100%. Both the Tomcat server (~48%) and MariaDB (~40%) use the majority of the CPU. ElasticSearch is using only ~10% of the CPU.

If you do not see full CPU utilization, I would suggest trying more than 4 threads, even if your Tomcat server only has 4 CPU cores. There could be other bottlenecks, e.g. network delay or disk I/O delay, that a higher number of threads could compensate for. Try increasing the number of threads until there is no further increase in CPU utilization and no further improvement in indexing time. For example, 8 or even 16 threads should be no problem at all.

Also, at some point (I don't know exactly when), your batch size could be too high. However, I tested your batch size (2500) on my machine and got the same duration of ~2 min with 4 threads. A high batch size might cause additional network delays, because large data packets are sent over the network and the indexing cannot proceed while the data is being transferred. Maybe this could be a problem, I don't know.

Unfortunately, it is not easy to find a good balance for these parameters.

@henning-gerhardt (Collaborator):

Some information about the local test environment I used for benchmarking: a 4-core processor (8 threads with hyper-threading), an NVMe-based SSD and 32 GB RAM. IDEA starts the Tomcat server (9.0.56) and deploys the application inside this Tomcat server. The database (MariaDB 10.5.15) and ElasticSearch (7.17.6) are started on system boot and run as normal system processes. Connections to the database, Tomcat and ElasticSearch go over localhost or the local ethernet card, without any real data transfer over the network. A locally running Apache service acts as application proxy for the Kitodo.Production application and ElasticSearch, but the application proxy has only little influence on the indexing.

Using 4 threads for indexing keeps my system at around 50 - 60% total usage, unless I do not close the indexing page, in which case the CPU usage goes to 100% on all 8 cores. Using 8 or more cores is not really practicable, as I run these benchmarks in the background while doing other important daily work.

If I tried your changes in our real-world system (VMware virtualisation with many VMs: 4-core application, 4-core ElasticSearch, 2-core database, and metadata files accessed through NFS on the application server), benchmarking would not be easy to compare, as network I/O, disk I/O (database + ElasticSearch), VMware host CPU usage and many other factors would have a (big) influence on timing. Maybe I can convince my team mates to stay away from our system for a week or more so I can do some benchmarking in this environment.

I chose a higher batch limit and a similar indexing limit, as lower values need more time for single-core indexing with our data.

@thomaslow (Collaborator, Author):

Ok. If your system is similar to mine and your CPU usage is already at 100% (or as high as you want it to be), then I don't think there is anything that can be improved by tuning the batch or threads parameters.

Btw. I did my tests while keeping the indexing page active (including the automatic refresh every 1s). I also enabled database logging, which prints every SQL statement that is executed. So, I could probably gain a few seconds by optimizing my setup. This is why I wasn't expecting that your performance would be worse than mine. Weird.

I suppose the most likely factor is that I use very simple test data.

@thomaslow (Collaborator, Author):

I found the change that caused the previous workaround for stopping the automatic refresh (polling) to no longer work. The previous behavior is now back (a page refresh will stop polling).

@henning-gerhardt (Collaborator) left a comment

I found only some small remarks; in general, these changes are approved.


@thomaslow (Collaborator, Author):

I refactored the failure logic a bit. Previously, a failure would not stop the indexing of subsequent object types (if the indexing was started via the "index everything" button). I also added a simple cancel link to the UI, since the cancel implementation was very similar to the failure handling.

@solth (Member) commented Oct 10, 2022

> I also added a simple cancel link to the UI, since the cancel implementation was very similar to the failure handling.

Does this resolve #5181 ?

@thomaslow (Collaborator, Author):

The cancel link looks like this:

[screenshot of the cancel link in the indexing user interface]

If this is sufficient, then it will resolve #5181.

@henning-gerhardt (Collaborator):

  1. Adding new functionality after a review is nearly done, when the new function was not part of any review request or other requirement, is not so good in my opinion. I must now find some time in the next days to check the new and existing functions again, as no (Selenium-based) tests were added.
  2. If I may express a wish: the new "cancel" button should appear like the other buttons on this page and should only be visible if an indexing run has been started.

@thomaslow (Collaborator, Author) commented Oct 10, 2022

I'm sorry I mixed features in one pull request. If it helps, I could revert this pull request to the last reviewed commit and add the recent changes to a new pull request.

The cancel link is only visible when the indexing is currently in progress.

@henning-gerhardt (Collaborator):

No, leave these changes in. I will test this tomorrow or the day after tomorrow.

@henning-gerhardt (Collaborator):

Canceling the indexing is not working properly:

I started the indexing of processes with a batch size of 2,500 and an indexing limit of 2,500, and interrupted the indexing in the first batch. In the UI, the state of the process indexing is set to failed, but in the background the process indexing still starts a second attempt:

[INFO] 2022-10-11 10:54:27,851 [IndexManagementThread] org.kitodo.production.services.index.IndexingService - start 1 threads for indexing PROCESS
[INFO] 2022-10-11 10:54:27,857 [pool-6-thread-1] org.kitodo.production.helper.IndexWorker - index PROCESS with offset 0 and attempt 1/10
[TRACE] 2022-10-11 10:54:28,483 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[TRACE] 2022-10-11 10:54:28,484 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[TRACE] 2022-10-11 10:54:28,523 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[INFO] 2022-10-11 10:54:28,533 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 7/meta.xml


... (Now hitting the cancel button in the UI) ...

[INFO] 2022-10-11 10:54:55,802 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 1455/meta.xml
[TRACE] 2022-10-11 10:54:55,846 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1456/meta.xml
[TRACE] 2022-10-11 10:54:55,846 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1456/meta.xml
[TRACE] 2022-10-11 10:54:55,869 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1456/meta.xml
[INFO] 2022-10-11 10:54:55,869 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 1456/meta.xml
[INFO] 2022-10-11 10:54:55,871 [IndexManagementThread] org.kitodo.production.services.index.IndexingService - indexing of PROCESS failed, cleaning up
[TRACE] 2022-10-11 10:54:55,899 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1457/meta.xml
[TRACE] 2022-10-11 10:54:55,899 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1457/meta.xml
[TRACE] 2022-10-11 10:54:55,900 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 1457/meta.xml
[INFO] 2022-10-11 10:54:55,900 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 1457/meta.xml

... more indexing is done until the end of the current indexing batch ...

[INFO] 2022-10-11 10:55:17,612 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 2702/meta.xml
[ERROR] 2022-10-11 10:55:40,661 [pool-6-thread-1] org.kitodo.production.helper.IndexWorker - java.lang.InterruptedException
org.elasticsearch.ElasticsearchException: java.lang.InterruptedException
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2695) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:2171) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:2137) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:2105) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        at org.elasticsearch.client.RestHighLevelClient.bulk(RestHighLevelClient.java:620) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        at org.kitodo.data.elasticsearch.index.IndexRestClient.addTypeSync(IndexRestClient.java:110) ~[kitodo-data-management-3.4.4-SNAPSHOT.jar:?]
        at org.kitodo.data.elasticsearch.index.Indexer.performMultipleRequests(Indexer.java:117) ~[kitodo-data-management-3.4.4-SNAPSHOT.jar:?]
        at org.kitodo.production.services.data.base.SearchService.addAllObjectsToIndex(SearchService.java:210) ~[classes/:3.4.4-SNAPSHOT]
        at org.kitodo.production.services.data.ProcessService.addAllObjectsToIndex(ProcessService.java:331) ~[classes/:3.4.4-SNAPSHOT]
        at org.kitodo.production.helper.IndexWorker.indexObjects(IndexWorker.java:98) ~[classes/:3.4.4-SNAPSHOT]
        at org.kitodo.production.helper.IndexWorker.run(IndexWorker.java:69) ~[classes/:3.4.4-SNAPSHOT]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1343) ~[?:?]
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:243) ~[elasticsearch-7.17.6.jar:7.17.6]
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75) ~[elasticsearch-7.17.6.jar:7.17.6]
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2692) ~[elasticsearch-rest-high-level-client-7.17.6.jar:7.17.6]
        ... 15 more
[INFO] 2022-10-11 10:55:42,661 [pool-6-thread-1] org.kitodo.production.helper.IndexWorker - index PROCESS with offset 0 and attempt 2/10
[TRACE] 2022-10-11 10:55:42,862 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[TRACE] 2022-10-11 10:55:42,862 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[TRACE] 2022-10-11 10:55:42,863 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 7/meta.xml
[INFO] 2022-10-11 10:55:42,863 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 7/meta.xml

... now indexing again the indexing batch until end of the batch ...

[TRACE] 2022-10-11 10:56:15,523 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 2702/meta.xml
[TRACE] 2022-10-11 10:56:15,523 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 2702/meta.xml
[TRACE] 2022-10-11 10:56:15,547 [pool-6-thread-1] org.kitodo.filemanagement.FileManagement - Found 2702/meta.xml
[INFO] 2022-10-11 10:56:15,547 [pool-6-thread-1] org.kitodo.production.services.dataformat.MetsService - Reading 2702/meta.xml

... no more log entries ...

After the second attempt is "done", indexing is not executed further, but in the UI the number of indexed processes is set to 2,500, so one batch was submitted to the index.

Maybe it should be documented that a running indexing batch is executed until the end of the current batch and is even submitted to ElasticSearch.

@thomaslow (Collaborator, Author):

Yes, after the user cancels the indexing, currently active batches are still being indexed. If you use threads=4, the 4 currently active batches are still indexed and submitted to ElasticSearch. In case of a smaller batch size (e.g. the default 250), the user will most likely not notice any delay.

Actually interrupting (or killing) the indexing of a single batch could cause various problems, e.g. partial HTTP requests being submitted to ElasticSearch, network connections not being closed, files not being closed. I wouldn't recommend that.
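For illustration, a minimal sketch of the described semantics (not the actual implementation of this pull request): a shared flag is checked only between batches, so a batch that has already started always runs to completion and is submitted to ElasticSearch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class CancelableIndexingSketch {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() {
        cancelled.set(true); // called when the user clicks the cancel link
    }

    void indexAll(int totalObjects, int batchSize) {
        // the flag is only checked before a new batch starts
        for (int offset = 0; offset < totalObjects && !cancelled.get(); offset += batchSize) {
            indexBatch(offset, batchSize); // once started, a batch runs to completion
        }
    }

    private void indexBatch(int offset, int limit) {
        // placeholder for loading one batch of objects and sending one bulk request
    }
}
```

A consequence of checking such a flag only between batches is that cancelling never interrupts a bulk request that is already in flight.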

When indexing all objects via the index everything button, there is a small time window (in between indexing of two subsequent object types) where the cancel button will currently have no effect and not cancel anything.

As I said, this is a very simple implementation that basically follows the failure logic. I didn't really put much thought into the user experience side of things. If this is not "better than nothing", I would suggest to revert this pull request to a previous commit.

@henning-gerhardt (Collaborator) commented Oct 11, 2022

> Yes, after the user cancels the indexing, currently active batches are still being indexed. If you use threads=4, the 4 currently active batches are still indexed and submitted to ElasticSearch. In case of a smaller batch size (e.g. the default 250), the user will most likely not notice any delay.

This is okay and I'm fine with that. My point was, or is, that the current interruption logic triggers a second index attempt.

> Actually interrupting (or killing) the indexing of a single batch could cause various problems, e.g. partial HTTP requests being submitted to ElasticSearch, network connections not being closed, files not being closed. I wouldn't recommend that.

That is true, and this is not what I want, as we already have enough issues to solve and should not introduce new ones.

> When indexing all objects via the index everything button, there is a small time window (in between indexing of two subsequent object types) where the cancel button will currently have no effect and not cancel anything.

Will the cancel request be forgotten, so that the user must hit the cancel button again after the indexing has started? I did not test this, as I only do single-type indexing.

> As I said, this is a very simple implementation that basically follows the failure logic. I didn't really put much thought into the user experience side of things. If this is not "better than nothing", I would suggest to revert this pull request to a previous commit.

I will add some of my notes to the cancel indexing issue so that it is clear what to expect and what not. Edit: see #5181 (comment)

Whether the refactoring commit, and thus the cancel interaction commit, should be reverted should be decided by @solth.

@solth (Member) commented Oct 11, 2022

> Whether the refactoring commit, and thus the cancel interaction commit, should be reverted should be decided by @solth.

I am a fan of incremental improvements. Even if this is not the optimal solution, having the current simple link is still better than having no way to cancel the indexing process at all.

I would suggest leaving the commit in this pull request but not closing issue #5181. Instead, the issue description could be updated to reflect that the "cancel" functionality is now available but could (and should) be improved upon in a future pull request.

@henning-gerhardt (Collaborator):

@solth @thomaslow:
The changes are okay except for the newly introduced issue, where an additional batch indexing run is executed after the cancel button is clicked. I suggest removing the cancel button from the UI, so that the latest changes (which introduce the cancel mechanic) do not have to be reverted but can be corrected in a separate pull request.
