Add sqlite test files, progress bar, and automatic postgres container management into sqllogictests #13936
Related: apache/datafusion-testing#2
THANK YOU @Omega359 -- this looks awesome
I think there are two things that we should fix prior to merge:
- The submodule issue (details below)
- "UnexpectedToken" issues (though I think it could also be fine to fix this in a follow-on PR)
Once we get this PR merged, I think the next obvious thing to do is to start running the suite in CI and actively tend the tickets for fixing issues found (the bugs listed on #13811, which will now be much easier to reproduce)
"UnexpectedToken label-XXX" errors:
When I ran this branch with
INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
I got an error
External error: task 27341 panicked with message "called `Result::unwrap()` on an `Err` value: ParseError { kind: UnexpectedToken(\"label-1\"), loc: Location { file: \"../../datafusion-testing/data/sqlite/random/select/slt_good_21.slt\", line: 47, upper: None } }"
- This looks like something that came in via sqllogictests 0.25 yesterday: Update sqllogictest requirement from 0.24.0 to 0.25.0 #13917
Perhaps we could downgrade / revert to 0.24 and file a ticket upstream 🤔
progress reporting
This is pretty neat 🎉
cargo test --test sqllogictests
git submodule issues
Initially I tried to run this locally and had some problems
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
Finished `release-nonlto` profile [optimized] target(s) in 0.43s
Running bin/sqllogictests.rs (target/release-nonlto/deps/sqllogictests-19127caafe5284e5)
Error: Execution("Error reading directory \"../../datafusion-testing/data/\": Not a directory (os error 20)")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`
Caused by:
process didn't exit successfully: `/Users/andrewlamb/Software/datafusion2/target/release-nonlto/deps/sqllogictests-19127caafe5284e5` (exit status: 1)
This seems to be related to not having the datafusion-testing submodule checked out
However, git submodule init didn't seem to work:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git submodule init
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git status
On branch sqllogictest_with_sqlite
nothing to commit, working tree clean
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ls datafusion-testing
datafusion-testing*
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ cat datafusion-testing
e2e320c9477a6d8ab09662eae255887733c0e304(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$
I found I could fix it by running with --force:
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git rm datafusion-testing
rm 'datafusion-testing'
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git submodule add --force https://github.com/apache/datafusion-testing.git
Reactivating local git directory for submodule 'datafusion-testing'
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ls datafusion-testing/
LICENSE.txt NOTICE.txt README.md data/
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ git status
On branch sqllogictest_with_sqlite
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
typechange: datafusion-testing
use std::ffi::OsStr;
use std::fs;
#[cfg(feature = "postgres")]
Perhaps as a follow on PR the code that manages the postgres container could be moved into its own module (like postgres_container.rs or something) so we only needed one #[cfg(feature = "postgres")]. I suspect this would also make the code a bit easier to reason about.
It is a bit of a mess eh? Yeah, I think that is a good idea
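A minimal sketch of what that follow-up could look like, assuming a hypothetical inline `postgres_container` module (in the real crate it would more likely be its own file). The `describe` function is purely illustrative and stands in for the container start/stop logic. The only point is that the feature gate sits on the module, with one matching fallback, instead of on every item:

```rust
// Everything Postgres-specific lives in one module behind one feature gate.
#[cfg(feature = "postgres")]
mod postgres_container {
    /// Hypothetical stand-in for the real container start-up logic.
    pub fn describe() -> &'static str {
        "postgres container management enabled"
    }
}

// Fallback with the same interface, so callers need no cfg attributes at all.
#[cfg(not(feature = "postgres"))]
mod postgres_container {
    /// Same signature as the gated version; does nothing Postgres-specific.
    pub fn describe() -> &'static str {
        "postgres feature disabled"
    }
}

fn main() {
    // Compiles and runs whether or not the `postgres` feature is enabled.
    println!("{}", postgres_container::describe());
}
```

Whether a fallback module is wanted at all is a design choice; the alternative is to gate the single call site instead.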
Git submodules - for me this worked (as documented in the description above):
Submodules with branches should be easier to update than ones without, but it still isn't a normal flow like the rest of git. I had found this stackoverflow but not all the suggestions actually worked for me.
This is the slt in question and yes, I think it's a bug. I'll downgrade for now and file an issue with the sqllogictest-rs project later today:
🤔 -- it still doesn't work for me.
I also tried it on an entirely new checkout:
andrewlamb@Andrews-MacBook-Pro-2 Downloads % git clone https://github.com/Omega359/arrow-datafusion.git
Cloning into 'arrow-datafusion'...
remote: Enumerating objects: 125653, done.
remote: Counting objects: 100% (29331/29331), done.
remote: Compressing objects: 100% (1500/1500), done.
remote: Total 125653 (delta 28229), reused 27839 (delta 27831), pack-reused 96322 (from 1)
Receiving objects: 100% (125653/125653), 242.40 MiB | 10.17 MiB/s, done.
Resolving deltas: 100% (97365/97365), done.
andrewlamb@Andrews-MacBook-Pro-2 Downloads % cd arrow-datafusion
andrewlamb@Andrews-MacBook-Pro-2 arrow-datafusion % git submodule init
Submodule 'parquet-testing' (https://github.com/apache/parquet-testing.git) registered for path 'parquet-testing'
Submodule 'testing' (https://github.com/apache/arrow-testing) registered for path 'testing'
andrewlamb@Andrews-MacBook-Pro-2 arrow-datafusion %
Thank you @Omega359 -- I think this looks great 🙏
It would be nice to fix the submodule thing before we merge to main, but I also don't think it is necessary (we can fix it afterwards)
Also, it would be nice to file a ticket / follow on to clean up / modularize the postgres container management code in the sqllogictest runner. I can do that if you don't have a chance.
I'll plan to leave this PR open for at least one more day while approved to let others have a chance to comment / test if they want.
Thank you again for helping drive this forward 🙏
Readd submodule
# Conflicts: # datafusion/sqllogictest/Cargo.toml
…qllogictest_with_sqlite
runner.with_normalizer(value_normalizer);
runner.with_validator(validator);

let res = runner
    .run_file_async(path)
Using run_multi_async can parse the file only once, maybe try it in a follow-up PR.
// Parse the file once, then reuse the parsed records for both counting and running.
let records = parse_file(&path).unwrap();
let count = get_record_count2(&records, "Datafusion");
let res = runner
    .run_multi_async(records)
    .await
    .map_err(|e| DataFusionError::External(Box::new(e)));

// Count the query/statement records that will actually run for the given engine label.
fn get_record_count2(
    records: &[Record<<DataFusion as AsyncDB>::ColumnType>],
    label: &str,
) -> usize {
    // A condition allows the record unless it is a skipif for this label
    // or an onlyif for a different label.
    fn runnable(cond: &Condition, label: &str) -> bool {
        match cond {
            Condition::SkipIf { label: l } => l != label,
            Condition::OnlyIf { label: l } => l == label,
        }
    }
    records
        .iter()
        .filter(|rec| match rec {
            Record::Query { conditions, .. } => {
                conditions.iter().all(|c| runnable(c, label))
            }
            Record::Statement { conditions, .. } => {
                conditions.iter().all(|c| runnable(c, label))
            }
            // Comments, control records, etc. are not counted.
            _ => false,
        })
        .count()
}
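The counting rule in this suggestion can be checked in isolation. The sketch below assumes simplified local stand-ins for `Condition` and `Record` (they are not the sqllogictest crate's definitions, which carry more fields and variants), and `count_runnable` and `sample_records` are hypothetical names used only for illustration:

```rust
// Simplified local stand-ins for the sqllogictest types (assumptions, not the crate's API).
#[derive(Debug)]
enum Condition {
    SkipIf { label: String },
    OnlyIf { label: String },
}

#[derive(Debug)]
enum Record {
    Query { conditions: Vec<Condition> },
    Statement { conditions: Vec<Condition> },
    Comment,
}

// A condition allows the record unless it is a skipif for this label
// or an onlyif for a different label.
fn runnable(cond: &Condition, label: &str) -> bool {
    match cond {
        Condition::SkipIf { label: l } => l != label,
        Condition::OnlyIf { label: l } => l == label,
    }
}

// Count queries and statements whose conditions all allow the given label.
fn count_runnable(records: &[Record], label: &str) -> usize {
    records
        .iter()
        .filter(|rec| match rec {
            Record::Query { conditions } | Record::Statement { conditions } => {
                conditions.iter().all(|c| runnable(c, label))
            }
            Record::Comment => false,
        })
        .count()
}

// Hypothetical sample: one unconditional query plus three conditional records.
fn sample_records() -> Vec<Record> {
    vec![
        Record::Query { conditions: vec![] },
        Record::Query {
            conditions: vec![Condition::SkipIf { label: "Datafusion".to_string() }],
        },
        Record::Statement {
            conditions: vec![Condition::OnlyIf { label: "Datafusion".to_string() }],
        },
        Record::Statement {
            conditions: vec![Condition::OnlyIf { label: "postgres".to_string() }],
        },
        Record::Comment,
    ]
}

fn main() {
    let records = sample_records();
    println!("Datafusion: {}", count_runnable(&records, "Datafusion"));
    println!("postgres: {}", count_runnable(&records, "postgres"));
}
```

With this sample, two records are runnable under the "Datafusion" label and three under "postgres", which is the per-engine total a progress bar would need.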
Thanks @Omega359 for this great work. I think we can merge it as it is and make further improvements with follow-up PRs.
}

#[cfg(feature = "postgres")]
static POSTGRES_IN: Lazy<Channel<ContainerCommands>> = Lazy::new(channel);
We can use std::sync::LazyLock without introducing the dependency on once_cell.
I was under the impression that it wasn't stable yet in an MSRV that DF has, but apparently I am wrong. I'll see if I can find time to change this today, or file an issue to improve it otherwise.
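For reference, a minimal sketch of the std-only version, assuming Rust 1.80 or later (where `std::sync::LazyLock` is stable). The `Channel` struct and `channel` helper here are stand-ins built on `std::sync::mpsc`, not the PR's actual definitions, and the `ContainerCommands` variants other than `FetchHost` are hypothetical:

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::{LazyLock, Mutex};

#[derive(Debug, PartialEq)]
enum ContainerCommands {
    FetchHost,
    Stop, // hypothetical extra command, for illustration only
}

// Stand-in for the PR's channel wrapper: a sender plus a lockable receiver.
struct Channel<T> {
    tx: Sender<T>,
    rx: Mutex<Receiver<T>>,
}

fn channel<T>() -> Channel<T> {
    let (tx, rx) = mpsc::channel();
    Channel { tx, rx: Mutex::new(rx) }
}

// Same shape as `Lazy::new(channel)` from once_cell, but using only std.
static POSTGRES_IN: LazyLock<Channel<ContainerCommands>> = LazyLock::new(channel);

fn main() {
    // The channel is created on first access and shared afterwards.
    POSTGRES_IN.tx.send(ContainerCommands::FetchHost).unwrap();
    POSTGRES_IN.tx.send(ContainerCommands::Stop).unwrap();

    let rx = POSTGRES_IN.rx.lock().unwrap();
    println!("{:?}", rx.recv().unwrap());
    println!("{:?}", rx.recv().unwrap());
}
```

The swap is mechanical: the static's type and initializer change, and the call sites stay the same.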
});

POSTGRES_IN.tx.send(FetchHost).unwrap();
let db_host = POSTGRES_HOST.rx.lock().await.recv().await.unwrap();
Here we need to wait for Postgres to start, so I wonder if we can call start_postgres directly in the current thread, without using thread::spawn and channels.
Sure, but you need access to the host/port and container elsewhere. You could return that info of course, but this code was inspired by my test code in another project where that wasn't feasible.
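One possible shape for the "return that info" variant, as a rough synchronous sketch only. `PostgresInfo`, `postgres_info`, and the body of `start_postgres` are hypothetical; the real start-up is async and also has to keep the container handle alive, so an async-aware cell (for example tokio's `OnceCell`) would likely be needed in practice:

```rust
use std::sync::OnceLock;

// Hypothetical connection details that other parts of the runner need.
#[derive(Debug, Clone, PartialEq)]
struct PostgresInfo {
    host: String,
    port: u16,
}

// Filled in once, on first use, then readable from anywhere.
static POSTGRES_INFO: OnceLock<PostgresInfo> = OnceLock::new();

/// Hypothetical stand-in: the real function would start the container
/// and report the host and port it is actually reachable on.
fn start_postgres() -> PostgresInfo {
    PostgresInfo {
        host: "localhost".to_string(),
        port: 5432,
    }
}

/// Starts Postgres on the first call (blocking the caller until it is up)
/// and returns the cached details on every later call.
fn postgres_info() -> &'static PostgresInfo {
    POSTGRES_INFO.get_or_init(start_postgres)
}

fn main() {
    let info = postgres_info();
    println!("{}:{}", info.host, info.port);
}
```

This trades the command channel for a single shared value, which only works if nothing needs to send further commands (such as a stop request) to the container.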
})
.unwrap_or_else(|| "default_schema".to_string())
.to_string_lossy()
.to_string()
nit: Calling to_string() seems unnecessary.
Which issue does this PR close?
Closes #13812
Rationale for this change
Add most of the sqlite test suite to Datafusion sqllogictests. Note: THESE TESTS DO NOT CURRENTLY PASS! Any test result where Datafusion returns a result that matches neither sqlite nor Postgresql was left as-is.
What changes are included in this PR?
This PR includes a number of changes, many of which are part of the test files in the datafusion-testing repo (5,711,125 select statements, of which 78,437 fail outright in Datafusion). The list below includes both the changes in this direct PR as well as the process to generate the files in datafusion-testing/data/sqlite/
- evidence and the index/delete folders
- control resultmode valuewise added to the beginning to allow the sqllogictest runner to properly be able to compare the results from Datafusion (and Postgresql) to the results in the .slt file
- skipif Datafusion and/or skipif postgres. For example:
- git submodule update --init --remote --recursive to get it added to an existing checkout of datafusion.
- PG_URI is set.
Are these changes tested?
Indeed, yes. To run the tests locally, check out this branch, update the git submodules, then run
INCLUDE_SQLITE=true cargo test --profile release-nonlto --test sqllogictests
Be aware that the tests can take quite a long time to run, especially if you do not run with the release or release-nonlto profiles.
Are there any user-facing changes?
No.