Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce LogicalPlan invariants, begin automatically checking them #13651

Merged

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Dec 5, 2024

Which issue does this PR close?

Part of #13652

Rationale for this change

The original discussion included implicit changes which can cause problems when trying to upgrade DF. One class of bugs are related to user-constructed LPs, and the mutations of these LPs. This PR is a first step to programmatically enforce the rules of what should, and should not, be done.

What changes are included in this PR?

  • defined what must always be true for an LP.
    • This applies to all LPs, including those constructed by user (e.g. own language frontends).
  • define the contract for semantic mutation.
    • The LP may not be executable before the Analyzer passes.
    • The LP must be executable after the Analyzer passes.
  • define what further can be mutated during the LP optimizer passes.
    • The LP must be executable before, and after, the optimizer passes.
    • The LP schema cannot be mutated by the optimizer.

Are these changes tested?

By existing tests, and are benchmarked for impact to planning time.

Are there any user-facing changes?

There is a new LogicalPlan::check_invariants public api.

@github-actions github-actions bot added the optimizer Optimizer rules label Dec 5, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I really like this idea

I filed an issue for this idea here: #13652

@findepi @jonahgao or @Omega359 I wonder if you have any thoughts on this basic idea?

datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
@Omega359
Copy link
Contributor

Omega359 commented Dec 5, 2024

I love the idea of having additional testing like this! It's definitely out of my area of expertise but a few questions:

  • Should invariants be pluggable? For example, Influxdb might have additional invariants that another derivative of DF does not (or vice versa)?
  • Is the invariant testing overhead acceptable for production or should it there be the capability to disable it (enabled by default of course!) ?
  • Is there an equivalent set of invariants elsewhere in the core besides the LP (maybe physical plan? Just guessing here as I'm definitely no expert here) where this approach could be duplicated ?

// verify invariant: equivalent schema across union inputs
// assert_unions_are_valid(check_name, plan)?;

// TODO: trait API and provide extension on the Optimizer to define own validations?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a mention of the extensibility of invariants. Options include:

  • for general invariants:
    • defined as being checked before/after each OptimizerRule, and applied here in check_plan() (or equivalent code)
    • we could provide Optimizer.invariants = Vec<Arc<dyn InvariantCheck>> for user-defined invariants
  • for invariants specific for a given OptimizerRule:
    • we could provide OptimizerRule::check_invariants() such that certain invariants are only checked for a given rule (instead of all rules)
    • for a user-defined OptimizerRule, users can also check their own invariants

Ditto for the AnalyzerRule passes. Altho I wasn't sure about how much is added complexity and planning time overhead - as @Omega359 mentions we could make it configurable (e.g. run for CI and debugging in downstream projects).

This WIP is about proposing different ideas of what we could do. 🤔

Copy link
Member

@jonahgao jonahgao Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have example use-case for user-defined plan invariants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files) that would be great to be able to encode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it can be controlled through environment variables, similar to RUST_LOG or RUST_BACKTRACE. Enable it for debugging when problems are encountered or during an upgrade.

We could also add it as a debug_assert! after each optimizer pass and call the real validation

  1. After analyze
  2. After all the optimizer passes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some special invariants for our SortPreservingMerge replacement, ProgressiveEval (related to time ranges of parquet files) that would be great to be able to encode

is this about LogicalPlan::Extension? I agree it makes sense to support validation of these if we validate the overall plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this to the followup items list: #13652 (comment)

datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
/// This invariant is subject to change.
/// refer: <https://github.com/apache/datafusion/issues/13525#issuecomment-2494046463>
fn assert_unique_field_names(plan: &LogicalPlan) -> Result<()> {
plan.schema().check_names()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is check_names also called whenever creating new DFSchema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, on every creation. But not after every merge. Which should be ok if no bug is introduced in the merge -- altho I would prefer to add the check there.

@findepi
Copy link
Member

findepi commented Dec 5, 2024

I like the verification. The only concern is the overhead of doing it, especially for large plans. Every optimizer pass gets a new verification of all plan nodes. i hope one day we go towards more iterative optimization and then we should be able to verify plan invariants locally. Which would guarantee that the verification overhead is linearly proportional to number of modifications applied to the plan. This would require eg that get_type is a constant operation (#12604), since plan verification should also include that the types do match.

@alamb
Copy link
Contributor

alamb commented Dec 6, 2024

I think it would also be great if we could consider moving the invariant check directly into LogicalPlan -- for exmaple
LogicalPlan::check_invariants

This would make this function easier to discover and for others to understand what the invariants are.

…sus assertions made after a given analyzer or optimizer pass
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Dec 16, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Dec 16, 2024

The updates focus on dividing out LP invariants, vs analyzer pass checks, vs optimizer pass checks (including invariants listed as the analyzer's responsibility). I added in reference links to invariants in the docs, in an attempt to delineate an invariant vs a valid plan. But I'm muddy on this boundary.

I'll make it configurable (or debug only) after I get feedback on this^^. 🙏🏼

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @wiedld -- thank you

I think the basic structure of assert_invariants is looking good.

I had a suggestion for how to combine the two different invariant checks into a single API.

In terms of planning:

  1. Perhaps we can begin creating an epic for "Enforcing Invariants" and list items there to follow up on (like "Define the invariants for Union plans" for example)

In terms of this PR, once we solidify the API the only remaining concern I have is with potential performance slowdown of re-checking the invariants. When we have the code ready we can run the sql_planning benchmark to make sure this PR doesn't slow down planning. If it does, we can perhaps only run the invariant checks in debug builds

datafusion/optimizer/src/analyzer/mod.rs Outdated Show resolved Hide resolved
datafusion/expr/src/logical_plan/plan.rs Outdated Show resolved Hide resolved
datafusion/expr/src/logical_plan/plan.rs Outdated Show resolved Hide resolved
datafusion/expr/src/logical_plan/plan.rs Outdated Show resolved Hide resolved
datafusion/expr/src/logical_plan/mod.rs Show resolved Hide resolved
@wiedld wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 651f5d3 to e52187e Compare December 17, 2024 23:35
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 17, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Dec 18, 2024

Ran all sql_planner benchmarks once, and then re-ran anything showing 1.09 to 1.24 ratio difference (max 3.3ms change in 17ms total).

all benchmarks

c3d-standard-8 debian-12

group                                         main                                   with_invariants_check
-----                                         ----                                   ---------------------
logical_aggregate_with_join                   1.00   1015.8±9.56µs        ? ?/sec    1.00  1015.9±10.30µs        ? ?/sec
logical_select_all_from_1000                  1.00      4.7±0.03ms        ? ?/sec    1.01      4.8±0.01ms        ? ?/sec
logical_select_one_from_700                   1.00    762.6±7.60µs        ? ?/sec    1.00    763.6±5.10µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    729.3±5.35µs        ? ?/sec    1.00    732.4±4.83µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    712.5±4.97µs        ? ?/sec    1.00    714.2±5.51µs        ? ?/sec
physical_intersection                         1.00   1729.6±9.00µs        ? ?/sec    1.08   1872.4±9.49µs        ? ?/sec
physical_join_consider_sort                   1.00      2.4±0.02ms        ? ?/sec    1.19      2.9±0.02ms        ? ?/sec
physical_join_distinct                        1.00    702.1±8.19µs        ? ?/sec    1.01    706.8±6.23µs        ? ?/sec
physical_many_self_joins                      1.00     13.8±0.09ms        ? ?/sec    1.24     17.1±0.16ms        ? ?/sec
physical_plan_clickbench_all                  1.00    187.3±0.61ms        ? ?/sec    1.00    187.4±0.46ms        ? ?/sec
physical_plan_clickbench_q1                   1.01      2.7±0.02ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q10                  1.01      3.6±0.04ms        ? ?/sec    1.00      3.6±0.03ms        ? ?/sec
physical_plan_clickbench_q11                  1.00      3.6±0.04ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q12                  1.00      3.8±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
physical_plan_clickbench_q13                  1.00      3.4±0.02ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_clickbench_q14                  1.00      3.6±0.02ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q15                  1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q16                  1.00      3.0±0.02ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
physical_plan_clickbench_q17                  1.00      3.1±0.03ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q18                  1.00      2.9±0.02ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q19                  1.00      3.6±0.02ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_clickbench_q2                   1.01      2.9±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q20                  1.00      2.7±0.02ms        ? ?/sec    1.01      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q21                  1.00      2.9±0.02ms        ? ?/sec    1.01      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q22                  1.00      3.7±0.02ms        ? ?/sec    1.01      3.7±0.02ms        ? ?/sec
physical_plan_clickbench_q23                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q24                  1.00      4.9±0.02ms        ? ?/sec    1.04      5.0±0.02ms        ? ?/sec
physical_plan_clickbench_q25                  1.00      3.2±0.02ms        ? ?/sec    1.02      3.2±0.02ms        ? ?/sec
physical_plan_clickbench_q26                  1.00      2.9±0.01ms        ? ?/sec    1.02      3.0±0.02ms        ? ?/sec
physical_plan_clickbench_q27                  1.00      3.2±0.02ms        ? ?/sec    1.02      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q28                  1.00      3.8±0.02ms        ? ?/sec    1.02      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q29                  1.00      4.8±0.03ms        ? ?/sec    1.01      4.9±0.02ms        ? ?/sec
physical_plan_clickbench_q3                   1.02      2.9±0.04ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
physical_plan_clickbench_q30                  1.00     16.0±0.06ms        ? ?/sec    1.00     16.0±0.07ms        ? ?/sec
physical_plan_clickbench_q31                  1.00      3.9±0.02ms        ? ?/sec    1.01      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q32                  1.00      3.9±0.02ms        ? ?/sec    1.01      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q33                  1.00      3.5±0.02ms        ? ?/sec    1.00      3.5±0.02ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.1±0.01ms        ? ?/sec    1.01      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q35                  1.00      3.3±0.02ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q36                  1.01      4.4±0.03ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q37                  1.00      4.4±0.02ms        ? ?/sec    1.01      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q38                  1.00      4.4±0.02ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q39                  1.00      4.0±0.02ms        ? ?/sec    1.01      4.0±0.02ms        ? ?/sec
physical_plan_clickbench_q4                   1.02      2.7±0.03ms        ? ?/sec    1.00      2.7±0.02ms        ? ?/sec
physical_plan_clickbench_q40                  1.00      4.5±0.02ms        ? ?/sec    1.00      4.5±0.02ms        ? ?/sec
physical_plan_clickbench_q41                  1.00      4.3±0.02ms        ? ?/sec    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q42                  1.00      4.1±0.02ms        ? ?/sec    1.01      4.1±0.02ms        ? ?/sec
physical_plan_clickbench_q43                  1.00      4.1±0.02ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      2.8±0.02ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      2.8±0.01ms        ? ?/sec    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q46                  1.01      3.3±0.02ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_clickbench_q47                  1.00      3.9±0.02ms        ? ?/sec    1.00      3.9±0.02ms        ? ?/sec
physical_plan_clickbench_q48                  1.00      4.4±0.02ms        ? ?/sec    1.00      4.4±0.02ms        ? ?/sec
physical_plan_clickbench_q49                  1.00      4.6±0.03ms        ? ?/sec    1.00      4.6±0.02ms        ? ?/sec
physical_plan_clickbench_q5                   1.01      2.9±0.02ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_clickbench_q6                   1.01      2.9±0.02ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q7                   1.02      3.4±0.03ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec
physical_plan_clickbench_q8                   1.01      3.1±0.03ms        ? ?/sec    1.00      3.1±0.02ms        ? ?/sec
physical_plan_clickbench_q9                   1.02      3.4±0.03ms        ? ?/sec    1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpcds_all                       1.00   1178.8±3.25ms        ? ?/sec    1.00   1183.7±0.95ms        ? ?/sec
physical_plan_tpch_all                        1.00     73.1±0.31ms        ? ?/sec    1.00     73.1±0.21ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.6±0.01ms        ? ?/sec    1.01      3.6±0.01ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.01ms        ? ?/sec    1.01      3.1±0.01ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.5±0.01ms        ? ?/sec    1.01      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q13                        1.00   1833.9±8.83µs        ? ?/sec    1.00  1830.2±10.09µs        ? ?/sec
physical_plan_tpch_q14                        1.00      2.2±0.01ms        ? ?/sec    1.01      2.2±0.01ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.2±0.01ms        ? ?/sec    1.01      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.9±0.01ms        ? ?/sec    1.01      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.2±0.01ms        ? ?/sec    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q19                        1.00      5.4±0.02ms        ? ?/sec    1.01      5.4±0.02ms        ? ?/sec
physical_plan_tpch_q2                         1.00      6.3±0.03ms        ? ?/sec    1.01      6.3±0.02ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.8±0.01ms        ? ?/sec    1.00      3.8±0.01ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.02ms        ? ?/sec    1.00      5.1±0.01ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.9±0.01ms        ? ?/sec    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.5±0.01ms        ? ?/sec    1.01      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q4                         1.00  1995.3±10.65µs        ? ?/sec    1.01      2.0±0.01ms        ? ?/sec
physical_plan_tpch_q5                         1.00      3.5±0.01ms        ? ?/sec    1.02      3.6±0.01ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1364.7±12.81µs        ? ?/sec    1.01   1375.4±8.17µs        ? ?/sec
physical_plan_tpch_q7                         1.00      4.7±0.02ms        ? ?/sec    1.02      4.8±0.02ms        ? ?/sec
physical_plan_tpch_q8                         1.00      5.6±0.02ms        ? ?/sec    1.01      5.7±0.03ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.3±0.02ms        ? ?/sec    1.01      4.4±0.01ms        ? ?/sec
physical_select_aggregates_from_200           1.00     24.6±0.09ms        ? ?/sec    1.01     24.7±0.09ms        ? ?/sec
physical_select_all_from_1000                 1.00     44.8±0.44ms        ? ?/sec    1.03     46.3±0.27ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec    1.15      2.9±0.02ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.01ms        ? ?/sec    1.17      3.2±0.02ms        ? ?/sec
physical_unnest_to_join                       1.00      2.5±0.01ms        ? ?/sec    1.09      2.7±0.02ms        ? ?/sec
with_param_values_many_columns                1.02    206.4±1.18µs        ? ?/sec    1.00    202.4±0.89µs        ? ?/sec
confirmation run
group                                         main                                   with_invariants_check
-----                                         ----                                   ---------------------
physical_join_consider_sort                   1.00      2.4±0.01ms        ? ?/sec    1.19      2.9±0.01ms        ? ?/sec
physical_many_self_joins                      1.00     13.9±0.09ms        ? ?/sec    1.23     17.0±0.11ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec    1.15      2.9±0.02ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.02ms        ? ?/sec    1.15      3.2±0.01ms        ? ?/sec
physical_unnest_to_join                       1.00      2.4±0.01ms        ? ?/sec    1.10      2.7±0.02ms        ? ?/sec

The performance change was minimized because we didn't add that many extra timepoints of checking. Specifically:

  • (new check): the subset of InvariantLevel::Always checks occurs once before the analyzer starts
  • (existing/main check): the full InvariantLevel::Executable check occurs once after analyzer is done.
  • (existing/main check): each optimizer pass only does a check that the schema is not changed
  • (new check): the full InvariantLevel::Executable check occurs once before all optimizer passes start
  • (existing/main check): the full InvariantLevel::Executable check occurs once after all optimizer passes start

I propose that we add another full InvariantLevel::Executable between each optimization pass, that only runs in debug mode (since that will be a performance impact as @findepi mentioned).

I can also make some of these check listed above into debug only, if the current performance change is unacceptable.

@wiedld wiedld marked this pull request as ready for review December 18, 2024 05:58
@wiedld
Copy link
Contributor Author

wiedld commented Dec 18, 2024

(existing/main check): each optimizer pass only does a check that the schema is not changed

We could also reduce the time on this one by making it compile time with Arc::ptr_eq, but that means (a) a stricter contract (metadata & nullability cannot change) and (b) it requires code changes to chase down the current pointer_eq failures. 🤔

@alamb
Copy link
Contributor

alamb commented Dec 20, 2024

I propose that we add another full InvariantLevel::Executable between each optimization pass, that only runs in debug mode (since that will be a performance impact as @findepi mentioned).

Perhaps we can do this as a follow on PR

I can also make some of these check listed above into debug only, if the current performance change is unacceptable.

Yes, I think this is important for two reasons:

  1. I think we can't introduce a 25% planning performance regression
  2. Adidng a check to help development that will generate errors in plans that used to run to completion will likely be seen as a regression for users

I am checking out the rest of this PR now as well

@alamb alamb changed the title Define LogicalPlan invariants, delineated by invariant type. Introduce LogicalPlan invariants, begin automatically checking them Dec 20, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld - I think this is a nice step towards keeping Datafusion stable for building other systems. Other than the performance regression I think this PR is ready to go.

datafusion/expr/src/logical_plan/invariants.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/analyzer/mod.rs Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Show resolved Hide resolved
@wiedld wiedld marked this pull request as draft December 23, 2024 20:15
@wiedld wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 10cd5d5 to 1164a7b Compare December 23, 2024 20:26
…ter all optimizer passes, except in debug mode it runs after each pass.
@wiedld
Copy link
Contributor Author

wiedld commented Dec 24, 2024

Fixed the performance regression. It wasn't where we thought it was.

The problem was a recursive check (down the LP) of the check_fields within the assert_unique_field_names(). I've removed the recursive nature of this check.

Final numbers, no regression
$ critcmp main invariants__no_recursive_namecheck__schema_check_per_optimizer_pass
group                                         invariants__no_recursive_namecheck__schema_check_per_optimizer_pass    main
-----                                         -------------------------------------------------------------------    ----
logical_aggregate_with_join                   1.00    987.0±5.66µs        ? ?/sec                                    1.02  1007.8±10.29µs        ? ?/sec
logical_select_all_from_1000                  1.04      5.0±0.04ms        ? ?/sec                                    1.00      4.8±0.04ms        ? ?/sec
logical_select_one_from_700                   1.00    739.4±5.38µs        ? ?/sec                                    1.01    743.5±5.70µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    703.2±5.00µs        ? ?/sec                                    1.02    718.2±4.64µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00    685.1±5.28µs        ? ?/sec                                    1.02    699.6±3.40µs        ? ?/sec
physical_intersection                         1.00  1726.0±10.72µs        ? ?/sec                                    1.00  1726.6±15.43µs        ? ?/sec
physical_join_consider_sort                   1.00      2.4±0.02ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
physical_join_distinct                        1.00    677.2±5.94µs        ? ?/sec                                    1.02    689.2±7.02µs        ? ?/sec
physical_many_self_joins                      1.00     13.6±0.07ms        ? ?/sec                                    1.02     13.9±0.07ms        ? ?/sec
physical_plan_clickbench_all                  1.00    181.7±0.39ms        ? ?/sec                                    1.00    181.5±0.36ms        ? ?/sec
physical_plan_clickbench_q1                   1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q10                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q11                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q12                  1.00      3.6±0.01ms        ? ?/sec                                    1.01      3.7±0.22ms        ? ?/sec
physical_plan_clickbench_q13                  1.00      3.3±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q14                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_clickbench_q15                  1.00      3.4±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q16                  1.00      3.0±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_clickbench_q17                  1.00      3.0±0.01ms        ? ?/sec                                    1.00      3.0±0.01ms        ? ?/sec
physical_plan_clickbench_q18                  1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q19                  1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q2                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q20                  1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q21                  1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q22                  1.00      3.6±0.01ms        ? ?/sec                                    1.01      3.6±0.01ms        ? ?/sec
physical_plan_clickbench_q23                  1.00      4.0±0.01ms        ? ?/sec                                    1.00      4.0±0.02ms        ? ?/sec
physical_plan_clickbench_q24                  1.01      4.8±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_clickbench_q25                  1.01      3.2±0.02ms        ? ?/sec                                    1.00      3.1±0.01ms        ? ?/sec
physical_plan_clickbench_q26                  1.01      2.9±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q27                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_clickbench_q28                  1.00      3.7±0.01ms        ? ?/sec                                    1.00      3.7±0.02ms        ? ?/sec
physical_plan_clickbench_q29                  1.00      4.7±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_clickbench_q3                   1.01      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q30                  1.01     15.5±0.41ms        ? ?/sec                                    1.00     15.3±0.04ms        ? ?/sec
physical_plan_clickbench_q31                  1.00      3.8±0.02ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q32                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q33                  1.00      3.4±0.01ms        ? ?/sec                                    1.00      3.4±0.01ms        ? ?/sec
physical_plan_clickbench_q34                  1.00      3.1±0.01ms        ? ?/sec                                    1.00      3.1±0.01ms        ? ?/sec
physical_plan_clickbench_q35                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_clickbench_q36                  1.00      4.2±0.01ms        ? ?/sec                                    1.00      4.2±0.02ms        ? ?/sec
physical_plan_clickbench_q37                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q38                  1.00      4.3±0.02ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_plan_clickbench_q39                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q4                   1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_clickbench_q40                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.02ms        ? ?/sec
physical_plan_clickbench_q41                  1.00      4.1±0.01ms        ? ?/sec                                    1.00      4.1±0.02ms        ? ?/sec
physical_plan_clickbench_q42                  1.00      4.0±0.02ms        ? ?/sec                                    1.00      4.0±0.01ms        ? ?/sec
physical_plan_clickbench_q43                  1.00      4.0±0.01ms        ? ?/sec                                    1.00      4.0±0.01ms        ? ?/sec
physical_plan_clickbench_q44                  1.00      2.7±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q45                  1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q46                  1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q47                  1.00      3.8±0.01ms        ? ?/sec                                    1.00      3.8±0.01ms        ? ?/sec
physical_plan_clickbench_q48                  1.00      4.3±0.01ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_plan_clickbench_q49                  1.00      4.5±0.02ms        ? ?/sec                                    1.00      4.5±0.01ms        ? ?/sec
physical_plan_clickbench_q5                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.7±0.01ms        ? ?/sec
physical_plan_clickbench_q6                   1.00      2.8±0.01ms        ? ?/sec                                    1.00      2.8±0.01ms        ? ?/sec
physical_plan_clickbench_q7                   1.01      3.3±0.01ms        ? ?/sec                                    1.00      3.3±0.01ms        ? ?/sec
physical_plan_clickbench_q8                   1.00      3.0±0.01ms        ? ?/sec                                    1.00      3.0±0.01ms        ? ?/sec
physical_plan_clickbench_q9                   1.00      3.2±0.02ms        ? ?/sec                                    1.00      3.3±0.02ms        ? ?/sec
physical_plan_tpcds_all                       1.00   1160.9±2.22ms        ? ?/sec                                    1.00   1161.8±5.74ms        ? ?/sec
physical_plan_tpch_all                        1.00     72.5±0.23ms        ? ?/sec                                    1.00     72.7±0.20ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.01ms        ? ?/sec                                    1.00      2.6±0.01ms        ? ?/sec
physical_plan_tpch_q10                        1.00      3.5±0.01ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_tpch_q11                        1.00      3.1±0.01ms        ? ?/sec                                    1.01      3.2±0.17ms        ? ?/sec
physical_plan_tpch_q12                        1.00      2.5±0.01ms        ? ?/sec                                    1.00      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q13                        1.00   1832.8±8.27µs        ? ?/sec                                    1.00   1832.2±7.35µs        ? ?/sec
physical_plan_tpch_q14                        1.00      2.2±0.01ms        ? ?/sec                                    1.00      2.2±0.01ms        ? ?/sec
physical_plan_tpch_q16                        1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q17                        1.00      2.9±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q18                        1.00      3.2±0.01ms        ? ?/sec                                    1.00      3.2±0.01ms        ? ?/sec
physical_plan_tpch_q19                        1.00      5.3±0.01ms        ? ?/sec                                    1.00      5.3±0.02ms        ? ?/sec
physical_plan_tpch_q2                         1.01      6.4±0.03ms        ? ?/sec                                    1.00      6.3±0.02ms        ? ?/sec
physical_plan_tpch_q20                        1.00      3.9±0.01ms        ? ?/sec                                    1.00      3.8±0.02ms        ? ?/sec
physical_plan_tpch_q21                        1.00      5.1±0.01ms        ? ?/sec                                    1.00      5.1±0.02ms        ? ?/sec
physical_plan_tpch_q22                        1.00      2.9±0.01ms        ? ?/sec                                    1.00      2.9±0.01ms        ? ?/sec
physical_plan_tpch_q3                         1.00      2.5±0.01ms        ? ?/sec                                    1.00      2.5±0.01ms        ? ?/sec
physical_plan_tpch_q4                         1.00   1994.6±9.01µs        ? ?/sec                                    1.00   1993.7±9.04µs        ? ?/sec
physical_plan_tpch_q5                         1.01      3.6±0.02ms        ? ?/sec                                    1.00      3.5±0.01ms        ? ?/sec
physical_plan_tpch_q6                         1.00   1350.1±7.99µs        ? ?/sec                                    1.00   1349.8±7.07µs        ? ?/sec
physical_plan_tpch_q7                         1.01      4.7±0.02ms        ? ?/sec                                    1.00      4.7±0.01ms        ? ?/sec
physical_plan_tpch_q8                         1.01      5.6±0.03ms        ? ?/sec                                    1.00      5.6±0.02ms        ? ?/sec
physical_plan_tpch_q9                         1.00      4.3±0.02ms        ? ?/sec                                    1.00      4.3±0.01ms        ? ?/sec
physical_select_aggregates_from_200           1.00     24.3±0.09ms        ? ?/sec                                    1.00     24.3±0.05ms        ? ?/sec
physical_select_all_from_1000                 1.01     43.2±0.35ms        ? ?/sec                                    1.00     42.6±0.41ms        ? ?/sec
physical_select_one_from_700                  1.00      2.5±0.01ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
physical_theta_join_consider_sort             1.00      2.7±0.02ms        ? ?/sec                                    1.02      2.8±0.01ms        ? ?/sec
physical_unnest_to_join                       1.00      2.4±0.01ms        ? ?/sec                                    1.02      2.5±0.01ms        ? ?/sec
with_param_values_many_columns                1.00    183.9±0.62µs        ? ?/sec                                    1.00    183.2±0.56µs        ? ?/sec

The release vs debug mode only has a single difference in the checks. The debug mode will run a full InvariantLevel::Executable check per each optimizer pass (instead of after all pass).

@wiedld wiedld marked this pull request as ready for review December 24, 2024 04:10
…cks without impa ct:

* assert_valid_optimization can run each optimizer pass
* remove the recursive cehck_fields, which caused the performance regression
* the full LP Invariants::Executable can only run in debug
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- this looks like a great improvement

I left some comments but I don't they are required

In my opinion this is needed before merging:

  1. I will wait 24 to allow time for anyone else to comment
  2. I am also double-checking the benchmarks

If you have time I do think some of my comments would improve the usability of this PR as well (e.g. the message changes)

Also, shall we file some follow on tickets for follow on work?

datafusion/expr/src/logical_plan/invariants.rs Outdated Show resolved Hide resolved
datafusion/expr/src/utils.rs Show resolved Hide resolved
datafusion/optimizer/src/analyzer/mod.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/analyzer/mod.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
})?;

// run LP invariant checks only in debug
#[cfg(debug_assertions)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion/optimizer/src/optimizer.rs Show resolved Hide resolved
@@ -529,7 +567,9 @@ mod tests {
let err = opt.optimize(plan, &config, &observe).unwrap_err();
assert_eq!(
"Optimizer rule 'get table_scan rule' failed\n\
caused by\nget table_scan rule\ncaused by\n\
caused by\n\
check_optimizer_specific_invariants after optimizer pass: get table_scan rule\n\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is nicer

.and_then(|tnr| {
assert_schema_is_the_same(rule.name(), &starting_schema, &tnr.data)?;
// run checks optimizer invariant checks, per pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what did you mean exactly here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct statement is // run checks optimizer invariant checks, per optimizer rule applied. And then multiple rules are applied per pass. Statement is fixed. Thanks!.

@alamb
Copy link
Contributor

alamb commented Dec 24, 2024

My performance benchmarks show also show now difference ✅

++ critcmp main 13525_invariant-checking-for-implicit-LP-changes
group                                         13525_invariant-checking-for-implicit-LP-changes    main
-----                                         ------------------------------------------------    ----
logical_aggregate_with_join                   1.00  1498.7±38.07µs        ? ?/sec                 1.00  1494.0±18.89µs        ? ?/sec
logical_select_all_from_1000                  1.00      5.3±0.05ms        ? ?/sec                 1.00      5.3±0.05ms        ? ?/sec
logical_select_one_from_700                   1.00  1169.9±26.47µs        ? ?/sec                 1.01  1179.5±14.41µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00  1154.9±13.48µs        ? ?/sec                 1.00  1158.7±16.27µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00  1140.9±18.36µs        ? ?/sec                 1.00  1140.3±14.97µs        ? ?/sec
physical_intersection                         1.00      2.5±0.02ms        ? ?/sec                 1.00      2.5±0.02ms        ? ?/sec
physical_join_consider_sort                   1.00      3.3±0.02ms        ? ?/sec                 1.00      3.3±0.04ms        ? ?/sec
physical_join_distinct                        1.00  1130.5±14.83µs        ? ?/sec                 1.00  1133.2±15.97µs        ? ?/sec
physical_many_self_joins                      1.01     17.5±0.10ms        ? ?/sec                 1.00     17.4±0.08ms        ? ?/sec
physical_plan_clickbench_all                  1.00    227.8±1.70ms        ? ?/sec                 1.01    230.0±3.38ms        ? ?/sec

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍, thanks @wiedld

datafusion/expr/src/logical_plan/invariants.rs Outdated Show resolved Hide resolved
datafusion/optimizer/src/optimizer.rs Outdated Show resolved Hide resolved
@wiedld wiedld force-pushed the 13525/invariant-checking-for-implicit-LP-changes branch from 1884564 to 9bca470 Compare December 24, 2024 21:24
@wiedld
Copy link
Contributor Author

wiedld commented Dec 24, 2024

Thank you for the reviews.
I made the requested changes to the terminology and error messages.

.

Also, shall we file some follow on tickets for follow on work?

I have a task lists here for follow up items. I was planning to dig more into code to assess viability (& I can make tickets then too) as I start implementing.

@alamb
Copy link
Contributor

alamb commented Dec 25, 2024

Looks like Ci is failing on some tests

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I think this is a nice step forward in helping avoid bugs in DataFusion ❤️

Thanks @jonahgao @findepi @Sl1mb0 and @berkaysynnada

I'll merge this PR now and we can address any other suggestions as a follow on PR

@alamb alamb merged commit cf8f2f8 into apache:main Dec 26, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants