
[DISCUSSION] More SqlLogicTest test coverage for queries, including join queries #13470

Closed
Tracked by #13525 ...
findepi opened this issue Nov 18, 2024 · 38 comments


@findepi
Member

findepi commented Nov 18, 2024

During SDF's upgrade to DataFusion 43 we found a bug, #13425. This was possible thanks to the extensive test coverage for certain query shapes that we have internally, which is good. The bad part is that the bug could have been caught at the PR stage, or before the release, had similar tests existed in the DataFusion project.

Would it be useful if there were more SLT query tests, including ones that would have caught #13425?

cc @alamb @schulte-lukas @gliga

@alamb
Contributor

alamb commented Nov 19, 2024

Yes, 100% I think adding additional test coverage in all areas would be super helpful

We / I also regularly hit bugs that are caught by InfluxData's test suite during upgrades of DataFusion but were not caught during the DataFusion PR

Other Areas that would benefit from improved testing from my experience:

  1. Metadata handling (like attaching field level metadata, etc)
  2. UNION coverage (which we use extensively)
  3. Dictionary handling (which we also use a lot)

@Dandandan
Contributor

Dandandan commented Nov 22, 2024

One way to extend join coverage is to have some more "answer checking" for different benchmarks, such as:

(Doing this earlier for TPC-H found many issues in join handling, decimal handling, etc.)

We can do the same for the imdb join order benchmark, clickbench, etc.
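A rough sketch of what such answer checking could look like (all names here are hypothetical; a real benchmark harness would execute the query against the engine and load the expected answers from files):

```python
# Hypothetical sketch of benchmark "answer checking": normalize result rows
# (sort them, round floats) and compare against a stored expected answer, so
# tiny ordering/precision drift between releases is not a false mismatch.
def normalize(rows, float_places=4):
    def norm(v):
        return round(v, float_places) if isinstance(v, float) else v
    return sorted(tuple(norm(v) for v in row) for row in rows)

def answers_match(actual_rows, expected_rows):
    return normalize(actual_rows) == normalize(expected_rows)
```

The normalization step matters in practice: exact float comparison across engine versions (or vs another engine such as Postgres) produces spurious failures.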

@Omega359
Contributor

I have been looking over the sqllogictests this weekend and I think, first of all, that they need a good reorganization: clean them up, split them into smaller chunks that each test a specific portion of functionality, and reduce the size of some of the files.

I also have started looking at other projects' tests to see how they've done their testing. I am unsure about actually incorporating some of those tests directly into DF (duckdb, for example, is MIT licensed vs Apache), but they could be used as a basis for seeing where DF's test suite is falling short and for expanding the test suite some more.

@Omega359
Contributor

I came across this paper which I found interesting for join testing. It won't work for DF yet since DF doesn't support query hints (at least as far as I know).

@schulte-lukas

I would also love to see a separate test-runner (maybe nightly) of the official SQLite SLTs (where applicable).

At this point DF seems to be able to pass the majority without modification (except delete). Since they are (a) pre-vetted by SQLite and (b) test foundational capability, it would make sense to me that they are included in DataFusion

@Omega359
Contributor

Omega359 commented Dec 1, 2024

I would also love to see a separate test-runner (maybe nightly) of the official SQLite SLTs (where applicable).

At this point DF seems to be able to pass the majority without modification (except delete). Since they are (a) pre-vetted by SQLite and (b) test foundational capability, it would make sense to me that they are included in DataFusion

I actually am part way through porting those slt's to something that df's slt test runner can run

@alamb
Contributor

alamb commented Dec 2, 2024

Can't wait to see what happens here

@comphead
Contributor

comphead commented Dec 2, 2024

We / I also regularly hit bugs that are caught by InfluxData's test suite during upgrades of DataFusion but were not caught during the DataFusion PR

I'm wondering what the test strategies at Influx are. Is it something we can reuse in DF as well to avoid regressions?

@alamb
Contributor

alamb commented Dec 2, 2024

I'm wondering what the test strategies at Influx are. Is it something we can reuse in DF as well to avoid regressions?

We have a bunch of our own basic tests for end to end functionality in our system (some is SQL based, some is "InfluxQL" based)

We have often seen issues where features we use heavily (metadata, and UnionExec, for example) are not as heavily covered in the existing DataFusion test suites 🤔

@comphead
Contributor

comphead commented Dec 2, 2024

I'm wondering what the test strategies at Influx are. Is it something we can reuse in DF as well to avoid regressions?

We have a bunch of our own basic tests for end to end functionality in our system (some is SQL based, some is "InfluxQL" based)

We have often seen issues where features we use heavily (metadata, and UnionExec, for example) are not as heavily covered in the existing DataFusion test suites 🤔

Are these real-world use cases?

If they're not, I found something used for MariaDB and can investigate it: https://mariadb.com/kb/en/random-query-generator-tests/

@Omega359
Contributor

Omega359 commented Dec 2, 2024

After the slt work I was thinking of taking a look at the work that @2010YOUY01 has done with sqlancer to see if there are areas that I could add into it. I noticed that the datafusion-contrib for that was merged into sqlancer main a few months back - https://github.com/sqlancer/sqlancer/tree/main/src/sqlancer/datafusion.

sqlancer has quite a few options wrt test oracles that might be useful to look at. The one I was looking at (DQP) would, I think, need support for query hints (or corresponding set options) to function, but if we could get it to work it would allow a lot of join testing to happen.

@alamb
Contributor

alamb commented Dec 2, 2024

I agree -- I think using sqlancer is a great idea -- I suspect it simply needs someone to maintain and file tickets

@Omega359
Contributor

Omega359 commented Dec 3, 2024

I would also love to see a separate test-runner (maybe nightly) of the official SQLite SLTs (where applicable).
At this point DF seems to be able to pass the majority without modification (except delete). Since they are (a) pre-vetted by SQLite and (b) test foundational capability, it would make sense to me that they are included in DataFusion

I actually am part way through porting those slt's to something that df's slt test runner can run

I've gotten many of them to run so far after removing or changing troublesome features (create index, valuesort, etc). There are differences between SQLite and DataFusion in how SQLite handles integers vs real numbers that I'm still working through slowly. I do see a number of failures that I believe are real, or at least point to limitations of DF. I'm hoping to have a full 'clean' run sometime in the next week that only fails on likely valid queries.

Here is an example of something that is currently failing:

CREATE TABLE tab0(pk INTEGER PRIMARY KEY, col0 INTEGER, col1 FLOAT, col2 TEXT, col3 INTEGER, col4 FLOAT, col5 TEXT);
INSERT INTO tab0 VALUES(0,1164,1149.64,'eyxra',1147,1174.8,'fpigl');
INSERT INTO tab0 VALUES(1,1165,1150.72,'zqxbf',1148,1175.52,'zmukh');
INSERT INTO tab0 VALUES(2,1166,1154.12,'itglg',1149,1176.1,'cfmkm');
INSERT INTO tab0 VALUES(3,1167,1156.44,'xmaca',1150,1177.85,'ggpiv');
INSERT INTO tab0 VALUES(4,1168,1157.98,'lewrx',1151,1178.59,'hjscv');
INSERT INTO tab0 VALUES(5,1169,1158.81,'enubf',1152,1180.13,'apmjh');
INSERT INTO tab0 VALUES(6,1171,1159.96,'olnov',1153,1181.89,'smeyx');
INSERT INTO tab0 VALUES(7,1172,1160.51,'yngfz',1154,1182.90,'vakuy');
INSERT INTO tab0 VALUES(8,1174,1161.72,'awjqq',1155,1183.93,'qzqaw');
SELECT pk, col0 FROM tab0 WHERE col0 IN (SELECT col3 FROM tab0 WHERE col4 > 73.18) ORDER BY 2 DESC;

SanityCheckPlan
caused by
Error during planning: Plan: ["SortPreservingMergeExec: [col0@1 DESC]", "  CoalesceBatchesExec: target_batch_size=8192", "    HashJoinExec: mode=Partitioned, join_type=RightSemi, on=[(col3@0, col0@1)]", "      CoalesceBatchesExec: target_batch_size=8192", "        RepartitionExec: partitioning=Hash([col3@0], 20), input_partitions=20", "          RepartitionExec: partitioning=RoundRobinBatch(20), input_partitions=1", "            CoalesceBatchesExec: target_batch_size=8192", "              FilterExec: CAST(col4@1 AS Float64) > 73.18, projection=[col3@0]", "                MemoryExec: partitions=1, partition_sizes=[9]", "      SortExec: expr=[pk@0 DESC], preserve_partitioning=[true]", "        CoalesceBatchesExec: target_batch_size=8192", "          RepartitionExec: partitioning=Hash([col0@1], 20), input_partitions=1", "            MemoryExec: partitions=1, partition_sizes=[9]"] does not satisfy order requirements: [col0@1 DESC]. Child-0 order: [[pk@0 DESC]]

@gliga

gliga commented Dec 4, 2024

It would be great if there was a common infrastructure across
databases for slt. High level shown below.

------------     --------------   ----------------
|  DF slt  |     | SQLite slt |   | whatever slt |
------------     --------------   ----------------
    |                 |                 |
 ----------------------------------------------    --------------
 |     slt transpiler                         |<---| slt bank   |
 ----------------------------------------------    --------------
    |                 |                 |
    v                 v                 v
------------     --------------   ----------------
|  DF slt  |     | SQLite slt |   | whatever slt |
------------     --------------   ----------------

Boxes on the top are existing slt tests available in various
databases/repositories, e.g., SQLite (which is mostly discussed in
this thread).

slt bank is a set of slt tests independent of any dialect (or even
potentially dialect specific) that is contributed by third party
groups, e.g., researchers.

slt transpiler takes existing slt tests and transpiles to the
desired target (e.g., DF). Translation steps could be:

  • Take one record at a time
  • Transpile SQL to the target dialect (using one of the existing tools)
  • Adjust output (likely most challenging)
  • Output the record: if it works, keep it; if it does not, comment it out in the output

The transpiler would also be a repository of links to existing slts.

Boxes at the bottom are the resulting slt tests.

Ideally, the transpilation would not be a one off process, but it
would be done for example every time slt tests are run. Benefit of
such an approach would be to ensure that any updates made to existing
tests (e.g., new tests added to SQLite) are reflected in the target
run (e.g., DF runs).
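The record-at-a-time translation loop described above could be sketched roughly like this (the `transpile` and `runs_ok` callbacks are placeholders for a real SQL dialect translator and an actual test execution step on the target engine):

```python
# Rough sketch of the proposed record-at-a-time slt transpilation step.
# `transpile` and `runs_ok` are hypothetical callbacks standing in for a
# real dialect-translation tool and for executing the query on the target.
def transpile_records(sql_records, transpile, runs_ok):
    out = []
    for sql in sql_records:          # take one record at a time
        translated = transpile(sql)  # transpile SQL to the target dialect
        if runs_ok(translated):      # keep records that work on the target
            out.append(translated)
        else:                        # comment out the ones that do not
            out.append("# " + translated)
    return out
```

Keeping the failing records commented out (rather than dropping them) preserves a visible record of what the target engine cannot yet run.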

@gliga

gliga commented Dec 4, 2024

@Omega359 I see you went quite far and already had some good findings.
Would you need any help at this point?

@Omega359
Contributor

Omega359 commented Dec 4, 2024

@Omega359 I see you went quite far and already had some good findings. Would you need any help at this point?

Right now I'm just working through issues with the sqlite test files. My biggest issue actually is that the test runner seems to get very very very very slow sometimes (most of the time actually). I don't know why atm - I forked off the rust sqllogictest repo and tried a few things to try and automate converting value-oriented results to row-oriented results and at one point it ran super fast but otherwise it's 'fire off a run and come back in 1/2 hour'.

I'll push up my changes to my repo on github in the next day or so. I suspect it's a good few days of solid work going through all the tests and fixing/changing things.

@Omega359
Contributor

Omega359 commented Dec 4, 2024

It would be great if there was a common infrastructure across databases for slt.

Boxes on the top are existing slt tests available in various databases/repositories, e.g., SQLite (which is mostly discussed in this thread).

slt bank is a set of slt tests independent of any dialect (or even potentially dialect specific) that is contributed by third party groups, e.g., researchers.

slt transpiler takes existing slt tests and transpiles to the desired target (e.g., DF).

The transpiler would also be a repository of links to existing slts.

Ideally, the transpilation would not be a one off process, but it would be done for example every time slt tests are run. Benefit of such an approach would be to ensure that any updates made to existing tests (e.g., new tests added to SQLite) are reflected in the target run (e.g., DF runs).

I am absolutely not against this at all - but it's a pretty large undertaking.

Currently the definition of the slt file format is .. fluid. The original format from sqlite isn't directly compatible with the rust crate, which has been a large part of my work. Fixing that would be step 1.

Having a base set of slt tests that works on almost every DB would be nice; however, given what I've seen so far, those would be very basic tests. Even things like the field type of a Float column aren't stable across DBs (or at least across sqllogictest runner implementations). Support for create index is not in DF (yet). MySQL uses DIV for integer division vs / (MySQL/Oracle treat / as always producing a float result, IIRC). Just a few basic examples.

@2010YOUY01
Contributor

After the slt work I was thinking of taking a look at the work that @2010YOUY01 has done with sqlancer to see if there are areas that I could add into it. I noticed that the datafusion-contrib for that was merged into sqlancer main a few months back - https://github.com/sqlancer/sqlancer/tree/main/src/sqlancer/datafusion.

sqlancer has quite a few options wrt test oracles that might be useful to look at. The one I was looking at (DQP) I think would need support for query hints (or corresponding set options) to function but if we could get it to work would allow a lot of join testing to happen.

I think adding test oracles like DQP to sqllogictests is also a project worth doing.

SQLancer is a fuzzer implementation with a query generator + test oracles (for property tests), and these two components are loosely coupled: test oracles don't necessarily have to be implemented with SQLancer's query generator; the invariants can be tested as long as there is a valid query. Maybe the best strategy is to implement them both in existing test cases (slt) and in SQLancer's randomly generated queries.

For DQP, I remember there is a configuration we can turn off 🤔 so that there is no join reordering; join order can then be controlled by writing the table order differently in the SQL queries.

@alamb
Contributor

alamb commented Dec 11, 2024

I would also love to see a separate test-runner (maybe nightly) of the official SQLite SLTs (where applicable).
At this point DF seems to be able to pass the majority without modification (except delete). Since they are (a) pre-vetted by SQLite and (b) test foundational capability, it would make sense to me that they are included in DataFusion

I actually am part way through porting those slt's to something that df's slt test runner can run

@Omega359 I was looking for the official SQLite SLTs -- are you looking at the ones here: https://www.sqlite.org/sqllogictest/dir?ci=tip&name=test ?

@2010YOUY01
Contributor

It would be great if there was a common infrastructure across databases for slt. High level shown below.

------------     --------------   ----------------
|  DF slt  |     | SQLite slt |   | whatever slt |
------------     --------------   ----------------
    |                 |                 |
 ----------------------------------------------    --------------
 |     slt transpiler                         |<---| slt bank   |
 ----------------------------------------------    --------------
    |                 |                 |
    v                 v                 v
------------     --------------   ----------------
|  DF slt  |     | SQLite slt |   | whatever slt |
------------     --------------   ----------------

Boxes on the top are existing slt tests available in various databases/repositories, e.g., SQLite (which is mostly discussed in this thread).

slt bank is a set of slt tests independent of any dialect (or even potentially dialect specific) that is contributed by third party groups, e.g., researchers.

slt transpiler takes existing slt tests and transpiles to the desired target (e.g., DF). Translation steps could be:

  • Take one record at a time
  • Transpile SQL to the target dialect (using one of the existing tools)
  • Adjust output (likely most challenging)
  • Output the record: if it works, keep it; if it does not, comment it out in the output

The transpiler would also be a repository of links to existing slts.

Boxes at the bottom are the resulting slt tests.

Ideally, the transpilation would not be a one off process, but it would be done for example every time slt tests are run. Benefit of such an approach would be to ensure that any updates made to existing tests (e.g., new tests added to SQLite) are reflected in the target run (e.g., DF runs).

https://arxiv.org/pdf/2410.21731 They did this slt bank for SQLite, PostgreSQL, and DuckDB and found multiple bugs; the authors said integrating them into a new system is easy:

Implementing them typically requires a low effort, as they are implemented in 33 LOC on average for the DBMSs in our experiments.

The implementation is at https://github.com/suyZhong/SQuaLity

@Omega359
Contributor

@Omega359 I see you went quite far and already had some good findings. Would you need any help at this point?

YES! See my post in DF's discord channel https://discord.com/channels/885562378132000778/1314317688150949978/1316455175376081028

@Omega359
Contributor

I would also love to see a separate test-runner (maybe nightly) of the official SQLite SLTs (where applicable).
At this point DF seems to be able to pass the majority without modification (except delete). Since they are (a) pre-vetted by SQLite and (b) test foundational capability, it would make sense to me that they are included in DataFusion

I actually am part way through porting those slt's to something that df's slt test runner can run

@Omega359 I was looking for the official SQLite SLTs -- are you looking at the ones here: https://www.sqlite.org/sqllogictest/dir?ci=tip&name=test ?

Yes, though that wasn't where I grabbed them from. I think I grabbed the files from https://github.com/hydromatic/sql-logic-test

@Omega359
Contributor

Omega359 commented Dec 11, 2024

Copied from Discord:

I've finally gotten to the point of having most of the sqlite slt test files running in the df sqllogictest suite. After attempting to run and fix things manually (for days ... I'm an idiot sometimes) I smartened up and hacked up a version of the sqllogictest crate to do most of the work for me. The result is that everything runs; however, a LOT of queries are skipped for DF for various reasons.

I'm going to push all of this to a df branch in my github account. I would really really like some assistance with evaluating the files and submitting tickets for any issues that were uncovered.

There are apparently almost 5.7 million queries, of which 78,444 were marked to be skipped by DataFusion because of errors or differences in results when compared to sqlite or postgres. If DF results were the same as postgres, however (even attempting equality of floats), the results in the slt were updated. This is mostly because sqlite is rather simple and just doesn't seem to handle floats, etc. correctly.

Because of the format differences between the slt files in DF vs sqlite a custom version of sqllogictest dependency is required. I'm hoping to push a PR to that crate's repo at some point to cover the additional change required to run this. run command is cargo test --test sqllogictests -- --include-sqlite

changes are @ https://github.com/Omega359/arrow-datafusion/tree/feature/sqllogictest_add_sqlite

Also I've been looking at duckdb's tests - found a Conway's Game of Life impl in SQL. It doesn't even parse in df 😦

https://gist.github.com/janickr/58fab629ee3ea7e5638a

The largest issue I had with porting the tests was that DF is just missing some sql support compared to sqlite. No delete. No create index. No duplicate column names. No scalar subqueries. Type conversion doesn't handle null all that well. And it seems that CTE expressions have a lot of limitations after playing with that for the last hour.

Here is a gist of the most common errors. Note that the delete errors will not show up for anyone who looks at my branch linked above - I just removed that portion of tests rather than try and deal with each file given DF's lack of support

@findepi
Member Author

findepi commented Dec 12, 2024

thank you @Omega359 for your work on this!

The result is that everything runs; however, a LOT of queries are skipped for DF for various reasons.

that's totally understandable!
what about having them with query error msg instead of skip, so that they act as "executable todo" and thus we don't need to remember to find and enable tests when corresponding functionality is fixed?

Because of the format differences between the slt files in DF vs sqlite a custom version of sqllogictest dependency is required.

i know you mentioned this earlier (#13470 (comment)), but i unfortunately don't know what the difference is. How big is it?
Can we avoid the custom dependency with e.g. a regexp replace?
cc @gliga

@Omega359
Contributor

The result is that everything runs; however, a LOT of queries are skipped for DF for various reasons.

that's totally understandable! what about having them with query error msg instead of skip, so that they act as "executable todo" and thus we don't need to remember to find and enable tests when corresponding functionality is fixed?

I was torn about this because it would mean that the expected query types and results would then be lost.

Because of the format differences between the slt files in DF vs sqlite a custom version of sqllogictest dependency is required.

i know you mentioned this earlier (#13470 (comment)), but i unfortunately don't know what the difference is. How big is it? Can we avoid the custom dependency with e.g. a regexp replace? cc @gliga

The primary issue is that those slt files are in what I call 'valuewise' ordering, meaning that results are in a single column. For example:

query II rowsort
SELECT ALL + col1 AS col0, col2 AS col2 FROM tab1
----
14
96
47
68
5
59

Every other use of slt's uses what I call 'rowwise', where the results are in proper columns. Both have their uses - valuewise better handles strings with leading/trailing spaces, rowwise is easier to parse.
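For comparison, the same result in rowwise form would look like this (assuming the valuewise values above pair up row by row, i.e. row-major order):

```
query II rowsort
SELECT ALL + col1 AS col0, col2 AS col2 FROM tab1
----
14 96
47 68
5 59
```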

The main change I made to the sqllogictest-rs dep is to add support for control resultmode valuewise, which you will see at the top of all the sqlite test files, to allow that lib to compare the results correctly.

I also stripped out everything related to mysql and mssql from those files. This is the series of regex I used to clean up the files:

find ./ -type f -name "*.test" -exec rename 's/\.test/\.slt/' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'hash-threshold 8\n' 'hash-threshold 8\ncontrol resultmode valuewise\n' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mysql.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'skipif mysql.+\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'skipif postgresql # PostgreSQL requires AS when renaming output columns\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'onlyif mssql.+\n.+\n.+\n\n' '' {} \;
find ./ -type f -name "*.slt" -exec sd -f i 'skipif mssql # not compatible\n' '' {} \;
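As an aside, the fixed-length `onlyif mysql.+\n.+\n…` patterns above (one per possible record length) could instead be handled by a small script that treats the file as blank-line-separated records and drops whole records regardless of their length. A sketch, not tested against the actual corpus:

```python
import re

# Sketch: split an slt file into blank-line-separated records and drop any
# record carrying an `onlyif mysql` / `onlyif mssql` directive, whatever
# its length. (`skipif` lines would still be removed line-by-line, since
# those records should run on other engines.)
def strip_foreign_records(text):
    records = text.split("\n\n")
    kept = [r for r in records
            if not re.search(r"^onlyif (mysql|mssql)", r, re.MULTILINE)]
    return "\n\n".join(kept)
```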

@Omega359
Contributor

After thinking about it a bit I think I could find a way to switch the results by using the column type count in query II to determine how many columns should exist and then do the rewrite. It wouldn't just be a regex (well, I'm sure someone could write some to deal with this but I'm not going there ... I've been burnt by regex far far too often in the past) but could be done with a small script.
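A minimal sketch of that rewrite idea (assuming one type letter per column in the `query` directive, so `II` means two integer columns):

```python
# Sketch of the column-count rewrite: the type string in a `query II`
# directive tells us how many columns each row has, so a valuewise result
# (one value per line) can be regrouped into rowwise lines.
def valuewise_to_rowwise(type_string, values):
    ncols = len(type_string)  # e.g. "II" -> 2 columns, "ITR" -> 3
    if len(values) % ncols != 0:
        raise ValueError("value count does not divide evenly into rows")
    return [" ".join(values[i:i + ncols])
            for i in range(0, len(values), ncols)]
```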

@crepererum
Contributor

Did anyone check the license of the sqlite test suite? As far as I can tell, https://www.sqlite.org/sqllogictest/dir?ci=tip doesn't specify a license, and https://www.sqlite.org/copyright.html doesn't explicitly mention the test corpus (which one could argue is a separate artifact). Hence you cannot just copy that data (i.e. it would be a copyright violation). Even reading it during tests over the internet might be a bit of a gray zone.

@alamb
Contributor

alamb commented Dec 12, 2024

I'm going to push all of this to a df branch in my github account. I would really really like some assistance with evaluating the files and submitting tickets for any issues that were uncovered.

Yes, I will be 100% willing to help both review the errors and triage/file issues (and organize the effort)

@alamb
Contributor

alamb commented Dec 12, 2024

I will look into the copyright issue

@alamb
Contributor

alamb commented Dec 12, 2024

https://github.com/hydromatic/sql-logic-test

Yes, though that wasn't where I grabbed them from. I think I grabbed the files from https://github.com/hydromatic/sql-logic-test

I didn't see the actual .test files, so I am not 100% sure what files we are referring to.

Official Git mirror of the SQLite source tree

https://github.com/sqlite/sqlite

https://github.com/sqlite/sqlite/blob/master/LICENSE.md explicitly says that this is in the public domain:

All of the test cases and testing code in the test/ directory

So TLDR I didn't find any compelling evidence either way related to licensing. When we have a proposal I will take a closer look

@alamb
Contributor

alamb commented Dec 12, 2024

Even reading it during tests over the internet might be a bit of a gray zone.

At least according to my understanding of US copyright law, reading public files from the internet is not a problem. Redistributing them is where there can become a problem, depending on the license.

@Omega359
Contributor

https://github.com/hydromatic/sql-logic-test

Yes, though that wasn't where I grabbed them from. I think I grabbed the files from https://github.com/hydromatic/sql-logic-test

I didn't see the actual .test files, so I am not 100% sure what files we are referring to.

https://github.com/hydromatic/sql-logic-test/tree/main/src/main/resources/test

As far as I could tell the files there are just copies of the ones from sqlite however I didn't have to deal with some oddball version control system to download them.

@Omega359
Contributor

All of the test cases and testing code in the test/ directory

The sqllogictest files are not in that test directory. As best as I can tell they have no license attached to them at all if the license from sqlite doesn't cover them. The C source files do have licenses though.

@Omega359
Contributor

Did anyone check the license of the sqlite test suite? Because if https://www.sqlite.org/sqllogictest/dir?ci=tip doesn't specify a license and https://www.sqlite.org/copyright.html doesn't explicitly mention the test corpus (which one could argue is a separate artifact). Hence you cannot just copy that data (i.e. it would be a copyright violation). Even reading it during tests over the internet might be a bit of a gray zone.

I asked the sqlite guys and they added a COPYRIGHT.md file to the sqllogictest repo: https://sqlite.org/sqllogictest/file?name=COPYRIGHT.md&ci=tip

It's about as open as it's possible to make it :)

@Omega359
Contributor

After thinking about it a bit I think I could find a way to switch the results by using the column type count in query II to determine how many columns should exist and then do the rewrite. It wouldn't just be a regex (well, I'm sure someone could write some to deal with this but I'm not going there ... I've been burnt by regex far far too often in the past) but could be done with a small script.

Oops, not so quick. I forgot about valuesort, which is one of the result sorting methods. It isn't implemented in sqllogictest-rs (yet), and just switching it to rowsort doesn't work (it causes errors).

I'll see about pushing a PR to that crate later today and then hopefully if merged we can just use the next version.

@Omega359
Contributor

Omega359 commented Dec 14, 2024

I've updated my test branch with the latest changes including adding skipif for Postgresql queries that fail to run. What is left is at least 19 legitimate query result mismatches where tickets were filed.

I've submitted a PR to the sqllogictest-rs repo for my changes but so far they haven't responded. If they do not get around to merging it for whatever reason we have two choices:

  1. Rewrite the .slt files to be rowwise vs valuewise and change ordering to only be rowsort, removing the valuesort. I've actually written a lua script to handle the valuewise -> rowwise conversion; the second part will require some more work.
  2. Fork the repo and use that.

For now I'm going to slow down work on this.

@alamb
Contributor

alamb commented Dec 17, 2024

Thank you so much @Omega359

I created an epic to track the remaining work to complete the sqllogictest migration. I am hoping we can rally the community to help us finish it

I also filed a ticket to track completing the integration work:

@alamb
Contributor

alamb commented Dec 17, 2024

I am closing this issue for now so we can focus on completing #13811 -- please let me know if there are other items that should be tracked and I will file tickets

Thanks again

@alamb alamb closed this as completed Dec 17, 2024