Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize more entity filter expressions #6539

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

grzesiek2010
Copy link
Member

@grzesiek2010 grzesiek2010 commented Dec 2, 2024

Closes #6498

Why is this the best possible solution? Were any other approaches considered?

I decided to come up with a versatile solution right away, one that supports complex expressions with multiple AND/OR conditions, so we don’t waste time addressing this in two separate pull requests. The way I pass filters to the database repository using the Query class is a result of the discussion below.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Testing should focus on filtering entities, ensuring that both simple and complex expressions are handled correctly. You can use the form attached below as a starting point and modify it to include a variety of edge cases, such as nested conditions, combinations of AND/OR operators, etc. This will help identify potential risks, such as incorrect query logic, unexpected behavior with complex filters, or performance issues when processing large datasets.

Do we need any specific form for testing your changes? If so, please attach one.

I used this one: people.zip

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No.

Before submitting this PR, please make sure you have:

  • added or modified tests for any new or changed behavior
  • run ./gradlew connectedAndroidTest (or ./gradlew testLab) and confirmed all checks still pass
  • added a comment above any new strings describing it for translators
  • added any new strings with date formatting to DateFormatsTest
  • verified that any code or assets from external sources are properly credited in comments and/or in the about file.
  • verified that any new UI elements use theme colors. UI Components Style guidelines

@grzesiek2010
Copy link
Member Author

The issue outlines four filter expressions that need optimization.
The first one (label = <value>) is already supported.
I have added support for the last one (id/label/version/property != <value>).

It seems we need to discuss how to handle the following cases:

  • <id/property/label> = <value 1> and <id/property/label> = <value 2>
  • <id/property/label> = <value 1> or <id/property/label> = <value 2>

Specifically:

  • Should we support only combining two expressions, or was the example just a demonstration, with the solution needing to be more flexible?
  • To write tests, we must first merge Improved parsing attributes in tests javarosa#808, as wiring expressions with and/or will fail without this.

Another question, not directly related to any of the filter types:
we are calling [replacePartialElements](https://github.com/getodk/collect/blob/master/entities/src/main/java/org/odk/collect/entities/javarosa/filter/LocalEntitiesFilterStrategy.kt#L52C36-L52C58). In which scenario would we encounter issues if we do not call it? I removed this call, and all tests passed, which was strange to me. I believe I understood its purpose, but now that all tests are passing without it, I feel a bit disoriented.

@seadowg, feel free to not only answer these questions but also take over if you don't have other pressing tasks. You might have intended to work on this yourself, and I took it on without asking. If that's the case, I hope at least what I’ve done will be helpful.

@seadowg
Copy link
Member

seadowg commented Dec 3, 2024

The first one (label = ) is already supported.

It's supported, but not optimized (compared to id = <value>). The current implementation will use the DB to find the tree references, but it does that by calling getEntities and then filtering them in memory which requires the whole entity list (from the DB) in memory (at least temporarily). We want to let SQLite handle the filtering here (like we do with ID) so we don't need to take that hit.

Should we support only combining two expressions, or was the example just a demonstration, with the solution needing to be more flexible?

For this issue, I'd only intended that we add support for those exact expressions. If it feels easier to you at this point to make the solution more general though, that'd be a bonus.

To write tests, we must first merge getodk/javarosa#808, as wiring expressions with and/or will fail without this.

Interesting! I'll have a look at that as soon as I can.

Another question, not directly related to any of the filter types:
we are calling [replacePartialElements](https://github.com/getodk/collect/blob/master/entities/src/main/java/org/odk/collect/entities/javarosa/filter/LocalEntitiesFilterStrategy.kt#L52C36-L52C58). In which scenario would we encounter issues if we do not call it? I removed this call, and all tests passed, which was strange to me. I believe I understood its purpose, but now that all tests are passing without it, I feel a bit disoriented.

Everything will "work" without replacePartialElements: the FilterStrategy chain will return the correct references and then later those will get materialized into elements via the secondary instance's resolveReference method. Without replacePartialElements however, the call to resolveReference will end up forcing a full parse of the secondary instance as the reference will point at a partial element. replacePartialElements essentially fills in the gaps in the partially parsed instance (from the representation) as we make filters so we only load the things into memory that we need. There are specific tests that check that we don't "fallback" like this (example here).

@seadowg, feel free to not only answer these questions but also take over if you don't have other pressing tasks. You might have intended to work on this yourself, and I took it on without asking. If that's the case, I hope at least what I’ve done will be helpful.

No I think it's actually a good idea to swap over on this one as I'm worried I'm a silo on how the entities stuff all works (especially the optimizations). It'll be good to spread the knowledge out a bit!

@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 2 times, most recently from 809b104 to 56489ec Compare December 4, 2024 16:02
@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 2 times, most recently from e80f541 to 9e0e041 Compare December 12, 2024 10:48
@grzesiek2010
Copy link
Member Author

@seadowg
I think I need to rethink the approach we started with, so I wanted to discuss it with you.

If you take a look at the first couple of commits, you'll see that I added support for label = <value> and id != <value> by continuing with our initial approach. Specifically, I updated the queryEq method and implemented a new queryNotEq

While this worked well initially, I found that adding new methods like getBySomething for every case ended up generating a lot of code. The real challenge came when I started thinking about how to handle combinations of = and != with AND/OR. I realized that our current approach would make it extremely difficult, if not impossible, to support versatile expressions involving multiple AND/OR conditions.

I believe the best solution to address this, while also reducing the amount of code, would be to generate SQLite selection and selection arguments directly in LocalEntitiesFilterStrategy.
This approach offers several advantages:

  • It significantly reduces boilerplate code.
  • It automatically optimizes other expressions like version = <value> and version != <value>, eliminating the need to add support for each case individually.
  • It simplifies handling combinations of expressions using AND/OR.

Please take a quick look to make sure you agree with it before I continue.

@seadowg
Copy link
Member

seadowg commented Dec 13, 2024

While this worked well initially, I found that adding new methods like getBySomething for every case ended up generating a lot of code. The real challenge came when I started thinking about how to handle combinations of = and != with AND/OR. I realized that our current approach would make it extremely difficult, if not impossible, to support versatile expressions involving multiple AND/OR conditions.

I do think you're right that we're going to need something more versatile to support all the kinds of expressions we want to optimize, and I think you're going in the right direction. It is probably the right time to consider generalizing so we don't end up with loads of methods on EntitiesRepository we replace later (as you say). However, I'd like to avoid passing SQL through as much as possible so any actual string query construction is kept to one place (DatabaseEntitiesRepository) and so that the API is still pretty rigid so we could go further. I think we're probably aiming to build our own fairly limited query language here to expand on what you've already got (which is going to be interesting). I'm just going to think out loud about how that could work here:

entitiesRepository.query(Eq("age", "34"))
entitiesRepository.query(NotEq("age", "52"))
entitiesRepository.query(And(Eq("age", "34"), NotEq("name", "Callum"))

I guess that would require a type hierarchy like:

sealed interface Query {
    class Eq(val column: String, val value: String): Query
    class NotEq(val column: String, val value: String): Query
    class And(val query1: Query, val query2: Query): Query
    class Or(val query1: Query, val query2: Query): Query
}

This would mean we'd support any length/depth of and or or XPath expressions though (and that's not something we need right now), so we could simplify it initially:

sealed interface Query {
    class And(val comparison1: Comparison, val comparison2: Comparison): Query
    class Or(val comparison1: Comparison, val comparison2: Comparison): Query

    sealed interface Comparison : Query {
        class Eq(val column: String, val value: String): Comparison
        class NotEq(val column: String, val value: String): Comparison
    }
}

What do you think?

@grzesiek2010
Copy link
Member Author

What do you think?

It looks nice but I'm not sure if generating expressions with multiple AND/OR would be easy. I will try to add support for it with the current approach and then introduce such an interface at the end of the process.

@seadowg
Copy link
Member

seadowg commented Dec 13, 2024

It looks nice but I'm not sure if generating expressions with multiple AND/OR would be easy.

Do you mean generating Query or creating the SQL from it wouldn't be easy?

I will try to add support for it with the current approach and then introduce such an interface at the end of the process.

That's fine, but I really do think we should make sure we end up with something structured rather than passing string SQL queries to the repository (but it doesn't have to be this type based tree representation approach - that's just one possibility). I'd be tempted to start with just replacing with the current getByX methods with something general and then build on top of that to add the new expressions. Remember we don't need to deal with mixed or nested and/or expressions at the moment as well.

@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 5 times, most recently from 5a2d685 to bd4b88e Compare December 13, 2024 23:55
@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 3 times, most recently from 1a9ffe4 to d1c2405 Compare December 16, 2024 02:23
@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 5 times, most recently from 676585f to e315c73 Compare December 17, 2024 17:59
@grzesiek2010 grzesiek2010 marked this pull request as ready for review December 17, 2024 21:03
@grzesiek2010 grzesiek2010 requested a review from seadowg December 17, 2024 21:03
@grzesiek2010
Copy link
Member Author

Do you mean generating Query or creating the SQL from it wouldn't be easy?

Yes but never mind I figured it out.

That's fine, but I really do think we should make sure we end up with something structured rather than passing string SQL queries to the repository (but it doesn't have to be this type based tree representation approach - that's just one possibility).

I agree and it has been implemented.

Remember we don't need to deal with mixed or nested and/or expressions at the moment as well.

Yes but I didn't want to waste time and this pull request should handle complex cases as well.

Copy link
Member

@seadowg seadowg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really cool! I wanted to dig into Query mainly at first. Two things:

  1. I think we should make sure to separate the Query "language" from the "compiler" that turns it into SQL(ite) here. Query could live outside the sqlite package and then in sqlite (or somewhere else) we could have the logic that translates Query object to SQL.
  2. We could theoretically get rid of the getByX methods on EntitiesRepository now right? They can just be replaced with query as far as I can see. Maybe that's better as a follow-up though?

One more thing is that it looks to me like you've added support for nested expressions here (like value = thing and (value != blah and value != other)) due to the recursion in LocalEntitiesFilterStrategy#xPathExpressionToQuery. That's very cool, but not something we need right now so if it makes 1 harder, I'd happily drop it for now.

@grzesiek2010 grzesiek2010 force-pushed the COLLECT-6498 branch 4 times, most recently from 4ad9e03 to 3766a57 Compare December 19, 2024 20:57
@grzesiek2010
Copy link
Member Author

I think we should make sure to separate the Query "language" from the "compiler" that turns it into SQL(ite) here. Query could live outside the sqlite package and then in sqlite (or somewhere else) we could have the logic that translates Query object to SQL.

Done.

We could theoretically get rid of the getByX methods on EntitiesRepository now right? They can just be replaced with query as far as I can see. Maybe that's better as a follow-up though?

I think it would be better to do that in a separate pull request.

One more thing is that it looks to me like you've added support for nested expressions here (like value = thing and (value != blah and value != other)) due to the recursion in LocalEntitiesFilterStrategy#xPathExpressionToQuery. That's very cool, but not something we need right now so if it makes 1 harder, I'd happily drop it for now.

That doesn't seem to be a problem and I still think it's easier to implement a complete and versatile solution in one pull request.

@grzesiek2010 grzesiek2010 requested a review from seadowg December 21, 2024 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize more entity filter expressions
2 participants