Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hibernate-search] Implement indexing and search #6283

Open
wants to merge 28 commits into
base: hibernate-search
Choose a base branch
from

Conversation

matthias-ronge
Copy link
Collaborator

@matthias-ronge matthias-ronge commented Oct 29, 2024

Issue #5760 2c), 2d), 2e) and 2f)

Follow-up pull request to #6218 (immediate diff)

The filters work. Metadata is indexed and searchable. The metadata search syntax works as documented in the wiki.

@BartChris
Copy link
Collaborator

BartChris commented Oct 29, 2024

Great work! Can you elaborate (short high level overview) how the metadata is indexed now in general? (For reference: #4266)

Map<String, String> parameters = FacesContext.getCurrentInstance().getExternalContext()
.getRequestParameterMap();
if (parameters.containsKey("input") && StringUtils.isBlank(filterInEditMode)) {
filterInEditMode = parameters.get("input");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are taking over any argument from the outside without validating that the given value is correct or at least is in range of the expectations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the user can enter any string into the search slot. The input is not “checked”. But of course, if the user has entered nonsense, the search will not find any results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this clarification but I expected something different as I read the method, method content and the used written class variable filterInEditMode. Handling of this class variable filterInEditMode is strange in this class but this is a different story.

@matthias-ronge
Copy link
Collaborator Author

Can you elaborate (short high level overview) how the metadata is indexed now in general?

The metadata is stored in a text field that is indexed. The search is later carried out on the text field.

"Pseudo words" are used to search on certain metadata fields, i.e. if the word Berlin is in the TitleDocMain field in the metadata, a pseudo word "titledocmainqberlin" is also indexed. If a user later enters "TitleDocMain:Berlin" in the search field, this is converted into the pseudo word and searched for.

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

@matthias-ronge matthias-ronge force-pushed the 5760_2c branch 2 times, most recently from c052fc4 to 5bee9cb Compare October 30, 2024 11:30
@BartChris
Copy link
Collaborator

BartChris commented Oct 30, 2024

Pseudo words are generated for metadata keys, translated metadata keys according to the rule set, and domains according to the ruleset. The user can therefore also search for "Haupttitel:Berlin" (if the translation in the ruleset is this).

Thanks a lot, this is helpful! Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

<key id="docType" use="docType">
           <label>Document Type</label>
           <label lang="de">Dokumenttyp</label>
</key>

And those combined terms (LABEL+VALUE) are all stored in the index? What happens if somebody changes the german label of a metadata key in the ruleset?

Comment on lines 344 to 367
* Creates the keywords for searching in correction messages. This is a
* double-truncated search, i.e. any substring.
*
* @param comments
* the comments of a process
* @return keywords
*/
private static final Set<String> initCommentKeywords(List<Comment> comments) {
Set<String> tokens = new HashSet<>();
for (Comment comment : comments) {
String message = comment.getMessage();
if (StringUtils.isNotBlank(message)) {
for (String splitWord : splitValues(message)) {
String word = normalize(splitWord);
for (int i = 0; i < word.length(); i++) {
for (int j = i + 1; j <= word.length(); j++) {
tokens.add(word.substring(i, j));
}
}
}
}
}
return tokens;
}
Copy link
Collaborator

@BartChris BartChris Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the reasoning here? Do i read this correctly and you are indexing every possible substring of all the words in a comment? Is this really necessary? Normal indexing should make it possible to search by word. Is this not enough?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wiki page requires:

Eine Suche nach Meldung findet alle Vorgänge, die einen Kommentar mit einer Nachricht wie Das ist eine Korrekturmeldung! haben.

To achieve this, we need both-side subword search. I only follow the expected behaviour here. I would prefer a simple-word search, too, but this is what the requirement actually looks like. Maybe we can discuss this, if it is really needed, as it inflates the index. I don’t know a clever way of splitting a German compound-word to its single words, to index just these. (This is possible, but it would be a sophisticated task of computational linguistics, typically involving a dictionary file, too, that I don’t have.)

@matthias-ronge
Copy link
Collaborator Author

Do i understand you correctly that you make the fields searchable by their german and english label as well? So for example "Dokumenttyp:Monograph" as a possible search with the german label?

Yes.

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

@BartChris
Copy link
Collaborator

BartChris commented Oct 31, 2024

What happens if somebody changes the german label of a metadata key in the ruleset?

The search will only find the process with the old label, until the process is saved again, or all processes are re-indexed. Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

I understand. I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label. I do not want to increase your workload too much, but would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

@BartChris
Copy link
Collaborator

BartChris commented Oct 31, 2024

Since the index is now only used for filters, re-indexing (or a missing index) won’t “break” the application any more, as it did in the past, it should be possible to run re-indexing in an environment where people work on without breaking something (but it will still make it slow meanwhile!)

A more general remark here. I have not really tested anything so far, but it appears to me that almost everything gets indexed. My assumption would be that functionality like displaying list of processes (with all the necessary icons, which require lookups) is faster from the index since we can store data there in denormalized form. It is nice, that we can run Kitodo without a running index for certain functionality but my intuition would be to always use the index if that gives us an edge over database-based retrieval. Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

This is not necessarily adressed at the current PR, but more a general remark which can be considered.

@matthias-ronge
Copy link
Collaborator Author

I fear a situation, where correcting a typo in a label would require a reindex of everything, otherwise old data is only found with the misspelled label.

In my experience, rulesets are rarely changed at all during a running project, except perhaps the addition of a field; and I know librarians are very good at not making typos.

would it be the option to allow the label-based-filters in the UI, but only index and search under the common key in the index? By e.g. having a mapping function, which maps the labels to the actual key and searches for that.

  1. The metadata is also indexed under the key string defined in the rule set, so you can always use the key name for the search.
  2. Yes, of course, that would be possible, but I think it goes against the purpose of using search engines. It would be possible to implement this, but it would significantly slow down search queries because all relevant rulesets would have to be determined each time and checked to see whether a field name needs to be translated. I don't think that would be useful.

Perhaps it would be interesting to discuss this case among users. Such a change can be introduced later.

@matthias-ronge matthias-ronge marked this pull request as ready for review October 31, 2024 13:53
@matthias-ronge
Copy link
Collaborator Author

it appears to me that almost everything gets indexed.

Yes, a lot of indexing is still going on at the moment and I don't think that's necessary. But the task here was not to make more than necessary changes, but to keep the application as most as it was before. But I think that in a separate development, it will be possible to clean this up so that in the end only processes and tasks are indexed, as only these contain metadata. The other objects are never searched, so they don’t do something sensible in the index by now.

My assumption would be that functionality like displaying list of processes […] is faster from the index since we can store data there in denormalized form. […] Maybe i am overlooking something, but it might be worth considering to rely more on the index to have better performance when displaying 100 processes at once, which is very slow in Kitodo 3.7. (which probably uses a mixture of index and database-based retrieval)

No, it's exactly the other way round: In Production 3.7, the display is taken from the index alone, as you describe here as a wish. And it is slow. In this version, this behavior was exactly inverted and now everything except the keyword search is taken from the database. On my developer laptop, the application has become significantly faster as a result.

Don't ask me why it behaves like this. This is another example of how performance is less about well-intentioned programming than about actual measurements. I suspect that both database queries and restoring Java in-memory objects from database rows are so much more sophisticated, having been researched for several decades longer, that it beats out indexing.

Copy link
Collaborator

@henning-gerhardt henning-gerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a code only reading review. I'm not sure if the word splitting for any kind of indexing type and there content is really needed to be done on our side - this should be done on an internal way in ES/OS itself.

}
if (parameter instanceof Collection) {
int size = ((Collection<?>) parameter).size();
if (size > 12) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a value greater than 12 bad? Should not all parameters be used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have based this on the written language: Numbers up to twelve are written out, from 13 onwards the number is written. Yes, a list of 500,000 IDs in a query makes the logging unreadable, which is why only the number of elements is output in the debug output for longer lists. I think 80% of users will be happy with this logging format, so no parameterization is required for log output.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation. Can you add a comment with this explanation as this can lead to questions in future.

Kitodo-DataManagement/src/test/resources/log4j2.xml Outdated Show resolved Hide resolved
@@ -158,7 +158,7 @@ public void runNotExistingScriptAsync() throws InterruptedException {
String commandString = scriptPath + "not_existing_script" + scriptExtension;
CommandService service = new CommandService();
service.runCommandAsync(commandString);
Thread.sleep(1000); // wait for async thread to finish;
Thread.sleep(2000); // wait for async thread to finish;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change and similar changes increase the test execution time for everyone :-( and the test execution itself is already really long.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to discuss the one second. This is just a single test being run. Yes, on a slower laptop, 1 second sometimes doesn't work and the build aborts.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One second at one point, an other second an other point. Yes, this is not much at a single place but in common this could be extend the execution time. But for now I'm fine with this change.

* the title of the process
* @return keywords
*/
private static Set<String> initTitleKeywords(String processTitle) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method and similar methods are really needed? Should this not be done by ElasticSearch/Opensearch in a internal way without bothering every one who is using it? This will increase the execution time and the data size on ElasticSearch/OpenSearch server without knowing that a internal handling is better.

Maybe the submitted data must provided in an other way to get this handled by ElasticSearch/Opensearch itself instead of implementing and maintaining it on our side?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's possible that you can build this into OpenSearch, but then you'll have to compile your own OpenSearch binaries and you won't be able to just install OpenSearch from the distribution. Maybe that makes sense if we have several of them, but for administrator convenience, it would be more convenient to install the web application and have it just work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken @henning-gerhardt is talking about configuring OpenSearch/ElasticSearch via options in standard configuration files in yml, xml, etc. instead of changing the implementation via the source code files, as you imply.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @solth already written: I thought that this could be configured on the ES / OS side or through the type of data so no change is needed explicit on ES / OS side itself instead of implementing this on our side. Maybe my imagination is wrong but this would blow up the data size of the ES / OS index a lot if you have many processes and big meta data (task data, comment data, data from the meta data files itself, ...) entries per process. But we will see how the handling and outcome is.

Copy link
Collaborator Author

@matthias-ronge matthias-ronge Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am completely in agreement if we can solve this in a simplified way in the future in line with the requirements of the community board. For a first functional requirement, however, I am happy if the code does what it is supposed to.

@henning-gerhardt, if you know of a way to achieve this functionality using annotations alone, I am welcome to do so. The following is required for the title search field:

  • character string separation at non-word boundaries, where a word consists of letters and numbers
  • every separated group can be searched using right truncation, only groups with at least three characters need to be indexed AND
  • every separated group can be searched using left truncation, only groups with at least three characters need to be indexed
  • but not both in combination

Another question would be, how do you pass the text to the annotations? And, it also has to be in the free search, so there is no point in solving this in the annotations for the search field, if it is not included in the basic search field; so what do you suggest?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry I have not the knowledge to do this as this is not my part of my work. I should look into the changes and I should comment them if I'm unsure about the changes. I don't know which exact requirements are made as I'm not involved in this discussions. The solutions looks complex and bad to maintain and I still thinking that this is not really necessary to do this required work every time a ES / OS should be used. Maybe this is limited because of the way of hibernate-search or through other limitations. Maybe I'm overrating what should be done by ES / OS on default and this stuff must be done by everyone how is using ES / OS. Maybe using the binding and brigdes could be a start of writing self defined bridges in a more clean way if they are a possible solution.

Copy link
Collaborator

@BartChris BartChris Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might need to revisit the requirements, but from my understanding, the main source of complexity seems to be supporting multi-term searches like "digi abc" that would match titles such as digi_1234_abc, ignoring characters in between. This feels like overkill and might go beyond what’s necessary.

A simpler approach could be focusing on searches based on three or more consecutive characters/tokens, which would still allow matching relevant titles.

For example, searching for "abc" or "1234" should match the title digi_1234_abc if the search logic considers tokens generated by an ngram tokenizer with min_gram: 3. This would generate tokens such as:

"digi_1234_abc" → ["dig", "digi", "igi", "gi_", "i_1", "123", "234", "abc"]
With this setup:

  • A search for "abc" would match because "abc" exists as a token.
  • A search for "igi" would match because "igi" is also generated as a token.

Will consult with the @kitodo/kitodo-board

For customization see:
https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#backend-elasticsearch-analysis-analyzers

Copy link
Collaborator

@BartChris BartChris Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am completely in agreement if we can solve this in a simplified way in the future in line with the requirements of the community board.

For me the best option would be, if we can keep most configuration inside the code and the Hibernate
configuration (which comes preconfigured with a release), so that the user does not have to deal with even more configuration files in Kitodo. I have not really tested it but a way to get a more customized tokenization without custom Java processing seems to involve the following:

  1. Define a Custom Analyzer class, to register custom tokenizers or analyzers. I would suppose that if we are using standard tokenizers like ngram this would also work with OpenSearch (which Hibernate Search conceptually treats similar to Elasticsearch i suppose)
package org.kitodo.data.database.beans;

import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // Custom n-gram analyzer for titles
        context.analyzer("title_ngram_analyzer")
                .custom()
                .tokenizer("title_ngram_tokenizer")
                .tokenFilters("lowercase");

        context.tokenizer("title_ngram_tokenizer")
                .type("nGram")
                .param("min_gram", "3")   // Minimum length of matching term
                .param("max_gram", "8"); // Maximum length for flexibility
    }
}

This should use the default ngram-tokenizer coming with OpenSearch and Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

  1. Annotate the field in the domain model with the custom analyzer
  /**
     * When indexing, outputs the index keywords for searching in title.
     * 
     * @return the index keywords for searching in title
     */
    @FullTextField(analyzer = "title_ngram_analyzer")
    public String getProcessTitle() {
       return this.processTitle;
    }
  1. Register the ElasticSearchAnalysisConfigurer with the hibernate.properties
hibernate.search.enabled=true
hibernate.search.backend.hosts=localhost:9200
hibernate.search.backend.protocol=http
# Register Custom Analyzer Configurer
hibernate.search.backend.analysis.configurer=org.kitodo.data.database.beans.MyElasticsearchAnalysisConfigurer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the programming way in Java, it is also possible to define the same in custom index settings which could even used for other general or index specific adjustments.

@solth
Copy link
Member

solth commented Dec 2, 2024

I get an exception when navigating to the processes list (after rebuilding the index via the updated "indexing" page):

Edit: I tested on an outdated version of this branch. With the latest commits the exception on the process list page disappears.

Instead, I now get an exception when trying to edit a project, though:

Bildschirmfoto 2024-12-02 um 13 20 43

@BartChris
Copy link
Collaborator

All the filters i enter are always duplicated in the filter menu, which also duplicates the conditions in the database query.
image

The SQL query generated has two IN-conditions with the same process Ids.

@@ -470,6 +472,7 @@ public List<Process> loadData(int offset, int limit, String sortField, SortOrder
.collect(Collectors.toList());
query.restrictToProjects(projectIDs);
query.defineSorting(SORT_FIELD_MAPPING.get(sortField), sortOrder);
query.performIndexSearches();
Copy link
Collaborator

@BartChris BartChris Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation: The strategy here is to retrieve the process IDs for which a specific condition e.g. stepopen:Archivierung is met from the Search Index. Those IDs are used to extend the sql query with a IN-statement containing all the IDs retrieved from the search index. This can lead to REALLY large queries when you have 10.000 processes in your database and 9000 of them are in the state "Strukturierung". If you search for that exact stage the SQL IN clause will contain 9000 process IDs. With more processes in Kitodo the queries can become really large, maybe exhausting MySQL here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation: The strategy here is to retrieve the process IDs for which a specific condition e.g. stepopen:Archivierung is met from the Search Index. Those IDs are used to extend the sql query with a IN-statement containing all the IDs retrieved from the search index

This was necessary in the past or with ES / OS as ES / OS can only retrieve JSON based data which are no objects and are not so good accessible. For this reason the ES / OS result was given to Hibernate to retrieve the objects from the database which could contain more information and accessing them was a lot easier. As the used ES / OS API could never result more than 10.000 hits there was a limit to the Hibernate query itself.

This is not needed anymore and should be removed. As after the query is send to Hibernate / Hibernate-Search the results are objects which are easy accessible and hold all information which are needed.

Copy link
Collaborator

@BartChris BartChris Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a situation, where some information (metadata, but also filter settings like "stepopen") can only be queried using the index. So we need the index to answer the query and therefor we have to query the index somewhere in the process. In my review i will try to concentrate on functionality so that we can focus on performance later, but the general remark i have is this:

We have seen in different places (e.g. #4378) that raw SQL / dedicated index queries often outperforms Hibernate standard queries, which do sometimes hundred of subsequent queries to fetch data which could be fetched in one go. (If Hibernate supports the optimized SQL query)

The process list search is one of the most used features and it should be optimized as much as possible (within the constraints of Hibernate) in my oppinion. When scrolling through the process list i still see A LOT of SQL queries. And queries like here

styleClass="#{ProcessForm.createProcessAsChildPossible(process) ? '' : 'ui-state-disabled'}"

are executed for every single process leading to the classical n+1 query.

I am wondering wether we can not construct something like a ProcessListDTO Object (or ProcessListObject, this is not about the DTO concept), which contains only the information needed to construct the list. (and the statistics). The objects in the list could already have the necessary properties like hasChildren precomputed, so there is no reason to do additional database queries to calculate them hundred of times.

And since we are already using the index, why not store the necessary properties in the index? We do not have to store all projects associated with a process in the index, but we should store the information which is needed for display like hasChildren. If we have this, we ideally only need one query insted of hundreds to get all the data for the list and have good performance. This will also make issues like #6348 less annoying, although we can do something on usability there, too.

The necessary features of Hibernate Search have already been referenced in other places

https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/#projections
https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/index.html#search-mapping-associated
https://docs.jboss.org/hibernate/search/7.1/reference/en-US/html_single/index.html#mapping-reindexing-derivedfrom

I think it is fantastic that we now reduce indexing time by only index the processes and using the Hibernate abstractions simplifies a lot of logic. But when necessary for performance it would be great if we could use dedicated custom logic to achieve the perfomance in places where it is mostly needed.

@BartChris
Copy link
Collaborator

BartChris commented Dec 17, 2024

Requirement checklist; as discussed in the Community Board:

  1. Blue search box in the top left jumps to the normal process list and not a special list with less selectable actions

Works in general. Some problems

a) Searching in the top left box does not reset the process list filter, but appends. A search should reset the current search.

image

  1. Searching for processes

a) Searching for process:11 throws an exception. It seems to happen when i just type process:11 and press enter, without selecting process from the dropdown.

The same thing sometimes happens when searching for 11 in the top left search bar

image

After that it is not really possible to return to the process list without logging out.

b) The requirements outlined above seem to be working

  • character string separation at non-word boundaries, where a word consists of letters and numbers
  • every separated group can be searched using right truncation, only groups with at least three characters need to be indexed AND
  • every separated group can be searched using left truncation, only groups with at least three characters need to be indexed
  • but not both in combination

@BartChris
Copy link
Collaborator

BartChris commented Dec 17, 2024

I get an exception when navigating to the processes list (after rebuilding the index via the updated "indexing" page):

Edit: I tested on an outdated version of this branch. With the latest commits the exception on the process list page disappears.

Instead, I now get an exception when trying to edit a project, though:

Bildschirmfoto 2024-12-02 um 13 20 43

I get the same exception when trying to add a new process through the project list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants