-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing using multiple threads #5367
Indexing using multiple threads #5367
Conversation
…tch elastic search error when checking already indexed documents due to max_result_window limit.
Kitodo/src/main/java/org/kitodo/config/enums/ParameterCore.java
Outdated
Show resolved
Hide resolved
I tried your changes but you killed the workaround for #5179 (refreshing the page to prevent high access rate) and in a short time gigabyte of log entries are created in application proxy in front of the elastic search system. |
You are probably logging HTTP-requests that go to your ElasticSearch server, right? You can limit the number of HTTP-requests used for indexing via the The HTTP-requests that are used to refresh the current number of already indexed objects (issue #5179) are triggered by the user interface and can be avoided by navigating away from the Another approach would be to set up log-rotation in your application proxy to avoid huge log files. |
Correct. We need / use this application proxy to limit the access to the ElasticSearch instance.
We have already increased the value of
The index page refreshing is the issue for the big amount of log entries not the indexing itself. I will try it and switch to an other page after the indexing is started to get a possible new workaround for the open issue with the page refreshing
This could be a way but should be separated into its own pull request as this is "fixing" an other issue. I never understand why the refreshing is done 10 times per second but this should be discussed in the other issue.
There is already a log rotation which works fine for normal use but the indexing page refresh is the issue for so many log entries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a code review with small adjustments. I will add some notice regarding to the speed improvement later.
Kitodo/src/main/java/org/kitodo/production/helper/IndexWorker.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/helper/IndexWorker.java
Outdated
Show resolved
Hide resolved
logger.error("stop indexing after maximum amount of attempts"); | ||
throw new RuntimeException("stop indexing after maximum amount of attempts"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use a string to avoid duplicated text message? I don't know if logging and throwing an exception should happen on one place and RuntimeException
must be catched explicit. Maybe @solth should decide here what is a good solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RuntimeException was used to break out of the while-loop. I added a boolean "failed" to gracefully exit the while loop and thread.
Helper.setErrorMessage(e.getLocalizedMessage(), IndexingService.getLogger(), e); | ||
logger.error(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe first log and then show the error message like a few lines above? If showing the error message raise itself an exception / error then this error will not get logged instead the exception / error of the displaying will be logged and this error get lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the Helper.setErrorMessage
call because a Thread.sleep interrupt is not really a problem.
// by iterating over all indexed documents in elastic search | ||
try { | ||
while (offset < amountInIndex) { | ||
// TODO: actually iterate over all elastic search documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will this TODO be solved? Is there is already changed ready or will this be fixed in a middle / far future away change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was not changed by this pull request. I just added a comment that this is a problem. There is a new issue #5379 for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, then will this TODO / Issue not got lost.
Kitodo/src/main/java/org/kitodo/production/services/index/IndexingService.java
Outdated
Show resolved
Hide resolved
searchService.removeLooseIndexData(searchService.findAllIDs(offset, indexLimit)); | ||
offset += indexLimit; | ||
} | ||
} catch (Exception e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You catching all kind of exceptions. Is this is really expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception is actually an elastic search exception. Unfortunately, the elastic search java library is not available in this class. I changed it to a RuntimeException, which is the most specific exception type available at this point.
Good thing is, that your changes did not lower the indexing speed - if you use the workaround and switch to an other page or close the tab like the old implementation. But I can not confirm the speed improvements in this large scale. I did a lot of benchmarks but speed improvements in my local environment are visible not not in a such way that the break records. I did all my benchmarks with a batch limit of 2.500 and index limit of 2.500, I close the tab or switch to an other non system page after the single type indexing has started (workaround!) and before a indexing is started the corresponding index was cleared and the application restarted (cleaning index was done between application stop and restart). Indexing around 560.000 processes including meta data files and properties (around 13 million property table entries in the database) need 1:45 hours on a single core. Only 1:15 hours on four cores. Not really faster on four cores instead of using only a single core. Indexing around 4.4 million tasks take 6:20 hours on one core and 2:36 hours on four cores. Here is a acceptable speed improvement but not so fast like on your timings. Massive negative impact on indexing speed is to stay on the system / indexing page as this speed down indexing a lot - I broke the process indexing after 3 hours and only a quarter or third of all data was indexed. No or only less influence has the used application proxy in front of the elastic search service. The process indexing goes down on four cores to 1:10 hours. |
@henning-gerhardt Thank you for your detailed feedback. I will have a look at your code review soon. I'm sorry if I raised too high expectations for this pull request. I suppose your setup (multiple virtual machines?) has different bottlenecks than my setup (multiple docker container on one non-virtual machine). Also, my database consists mostly of fake data (empty processes, no meta data, only a few mets files, etc.). Maybe there are other bottlenecks with proper real-world data (like disk i/o) or a multi-server setups (like network delays). Did you see full CPU utilization on each of your virtual machines (Database, ElasticSearch, Tomcat)? With 16 threads, my CPU is fully utilized at nearly 100%. Both the Tomcat server (~48%) and MariaDB (~40%) use the majority of the CPU. ElasticSearch is using only ~10% of the CPU. If you do not see full CPU utilization, I would suggest to try more than 4 threads, even if you Tomcat server only has 4 CPU cores. There could be other bottlenecks, e.g. network delay, disk i/o delay, that could improve with a higher number of threads. Try to increase the number of threads until there is no more increase in CPU utilization or indexing time. For example, 8 or even 16 threads should be no problem at all. Also, at some point (I don't know exactly when), your batch size could be too high. However, I tested your batch size (2500) on my machine and got the same duration of ~2min for 4 threads. A high batch size might cause additional network delays, because large data packets are sent over the network, and the indexing can not proceed while the data is sent over the network. Maybe this could be a problem. I don't know. Unfortunately, it is not easy to find a good balance for these parameters. |
Some information about my local test environment which I used for benchmarking: 4 core or 8 core with hyper-threading processor, nvme based ssd and 32 gb ram. IDEA is starting the Tomcat server (9.0.56) and deploying the application inside this tomcat server. Database (MariaDB 10.5.15) and ElasticSearch (7.17.6) are started on system up and are running as normal system processes. Connection to database, Tomcat and Elastic search are over localhost or local ethernet card without any real data transfer over network. A local running Apache service is acting as application proxy for the Kitodo.Productoin application and ElasticSearch but the application proxy has only less influence on the indexing. Using 4 threads for indexing let my system stay at all 50 - 60% total usage except I did not close the indexing page then the cpu usage go to 100% on all 8 cores. Using 8 or more cores is not really practicable as I run this benchmarks in the background while doing other import daily work. If I would try your changes in our real world system (VMWare virtualisation with many VMs (4 core application, 4 core elastic, 2 core database and accessing the meta data files through nfs on the application server) then benchmarking is not difficult to compare as network i/o, disk i/o (database + elastic search), VMware host cpu usage and many other factors will have a (big) influence on timing. Maybe I can convince my team mates to stay away from our system for a week or more so I can do some benchmarking in this environment. I choose a higher batch limit size and a similar indexing limit as lower values need more time on a single core indexing with our data. |
Ok. If, your system is similar to mine and your CPU usage is already at 100% (or as high as you want it to be) than I don't think there is anything that can be improved by tuning the Btw. I did my tests while keeping the indexing page active (including the automatic refresh every 1s). I also enabled database logging, which prints every SQL statement that is executed. So, I could probably gain a few seconds by optimizing my setup. This is why I wasn't expecting that your performance would be worse than mine. Weird. I suppose the most likely factor is that I use very simple test data. |
I found the change that caused that the previous workaround for stopping the automatic refresh (polling) didn't work anymore. Now the previous behavior is back (page refresh will stop polling). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did found only some small notices - in general are this changes approved.
// by iterating over all indexed documents in elastic search | ||
try { | ||
while (offset < amountInIndex) { | ||
// TODO: actually iterate over all elastic search documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, then will this TODO / Issue not got lost.
Kitodo/src/main/java/org/kitodo/production/services/index/IndexingService.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/services/index/IndexingService.java
Outdated
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/helper/IndexWorkerStatus.java
Show resolved
Hide resolved
Kitodo/src/main/java/org/kitodo/production/helper/IndexWorkerStatus.java
Show resolved
Hide resolved
…l indexing via simple link. Fix status images that were not displayed.
I refactored the failure logic a bit. Previously, a failure would not stop the indexing of subsequent objects types (if the indexing was started via the "index everything" button). I also added a simple cancel link to the UI, since the cancel implementation was very similar to the failure handling. |
Does this resolve #5181 ? |
The cancel link looks like this: If this is sufficient, then it will resolve #5181. |
|
I'm sorry I mixed features in one pull request. If it helps, I could revert this pull request to the last reviewed commit and add the recent changes to a new pull request. The cancel link is only visible when the indexing is currently in progress. |
No, let this changes in. I will test this tomorrow or the day after tomorrow. |
Canceling the indexing is not working properly: I started the indexing of processes with a batch size of 2.500 and indexing limit of 2.500 and interrupted the indexing in the first batch. In the UI the state of process indexing is set to failed but in the background process indexing is still a second attempt:
After second attempt is "done" indexing is not executed further but in the UI the amount of indexed processes is set to 2.500, so indexing of one batch size was submitted to the index. Maybe it should be documented that the running indexing process is executed until the end of the current batch and even submitted to the elastic search. |
Yes, after the user cancels the indexing, currently active batches are still being indexed. If you use Actually interrupting (or killing) the indexing of a single batch could cause various problems, e.g. partial HTTP requests being submitted to ElasticSearch, network connections not being closed, files not being closed. I wouldn't recommend that. When indexing all objects via the As I said, this is a very simple implementation that basically follows the failure logic. I didn't really put much thought into the user experience side of things. If this is not "better than nothing", I would suggest to revert this pull request to a previous commit. |
This will be okay and I'm fine with that. My point was or is, that the current interruption logic is raising a second index attempt.
That is true and this is not what I want as we have already enough issues to solve and should not introduce new ones.
Will the cancel request be forgotten and the user must hit the cancel button again after the indexing has started? I did not test this as I do only single indexing.
I will put some of my notes to the cancel indexing issue so that it should be clear what to be expect and what not. Edit: see #5181 (comment) If the refactoring commit and so the cancel interaction commit should be reverted should be decided by @solth . |
I am a fan of incremental improvements. Even if this is not the optimal solution, having the current simple link is still better than having no way to cancel the indexing process at all. I would suggest to leave the commit in this pull request but not close issue #5181. Instead, the issue description could be updated to reflect that the "cancel" functionality is now available, but could (and should) be improved upon in a future pull request. |
@solth @thomaslow : |
…improved in the future.
This pull request improves the batch indexing performance (triggered via the System/Indexing page) by using multiple threads for indexing.
A full re-indexing of 81.000 processes and 4.000 tasks took 12 min 12 seconds on my machine (8-core CPU + SSD, default batch size of 250). With these changes, the indexing improves as follows (relative to the best case scenario of the current master branch with a batch size of 2500):
There are still a bunch more problems with the indexing, but at least now you will notice them much quicker ;-)
Also, the indexing should be more resilient to intermittent errors. I restarted ElasticSearch during the indexing process and the indexing still finished successfully.
The default number of threads is set to 4 and can be configured via the
elasticsearch.threads
parameter.