Support nuances of the SELECT syntax: WITH, UNION, subqueries etc. #106
Perhaps the Rewriter model is no longer useful for this approach. Here's an alternative idea:

```php
/**
 * Processes the SELECT statement (https://dev.mysql.com/doc/refman/8.0/en/select.html)
 * SELECT
 *     [ALL | DISTINCT | DISTINCTROW ]
 *     [HIGH_PRIORITY]
 *     [STRAIGHT_JOIN]
 *     [SQL_SMALL_RESULT] [SQL_BIG_RESULT] [SQL_BUFFER_RESULT]
 *     [SQL_NO_CACHE] [SQL_CALC_FOUND_ROWS]
 *     select_expr [, select_expr] ...
 */
protected function consume_select_query() {
	$token = $this->next_token();
	$token->assert_is_keyword('SELECT');

	$token = $this->next_token();
	if($token->is_keyword(['ALL', 'DISTINCT', 'DISTINCTROW'])) {
		$this->append( $token );
		$token = $this->next_token();
	}
	if($token->is_keyword('HIGH_PRIORITY')) {
		$token = $this->next_token();
	}
	// ...keep going token by token, explicitly handling each case,
	// instead of just skipping over things with a while loop like we do now...
	if($is_subquery) { // Detecting subqueries is left out of this sketch.
		$this->consume_sub_query();
	}
	// Inevitably, at some point:
	if($token->is_keyword('UNION')) {
		$this->consume_select_query();
	}
}
```

Importantly, we cannot just operate on strings and translate the MySQL query into the SQLite query. Some transformations require acting on the SQLite database and executing SELECTs, INSERTs, etc. Therefore, the input may be a MySQL query string, but the output should be the query result, not the corresponding SQL query for SQLite.
Intuitively, it feels like we need a thorough but lightweight syntax tree and an interpreter that explicitly handles each case. I've started poking at that idea but don't yet have a working example to share.
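Here's a minimal sketch of that interpreter idea – the `Node` class, the tree shape, and the helper names are all invented for illustration, not code from the plugin:

```php
<?php
// Hypothetical node shape for a lightweight syntax tree.
class Node {
	public $type;
	public $children; // Nested Nodes (subqueries) or raw SQL strings.

	public function __construct( $type, array $children = array() ) {
		$this->type     = $type;
		$this->children = $children;
	}
}

// Sketch of an interpreter: instead of returning a rewritten query
// string, it executes against SQLite and returns the result rows.
function execute_node( PDO $sqlite, Node $node ) {
	if ( 'select' !== $node->type ) {
		throw new Exception( "Unsupported node type: {$node->type}" );
	}
	$sql = '';
	foreach ( $node->children as $child ) {
		if ( $child instanceof Node ) {
			// Execute the subquery first, then substitute its
			// scalar result into the outer query.
			$rows = execute_node( $sqlite, $child );
			$sql .= $sqlite->quote( (string) current( $rows[0] ) );
		} else {
			$sql .= $child;
		}
	}
	return $sqlite->query( $sql )->fetchAll( PDO::FETCH_ASSOC );
}

// Usage: SELECT (SELECT 2) AS x – the inner SELECT runs first.
$sqlite = new PDO( 'sqlite::memory:' );
$tree   = new Node( 'select', array( 'SELECT ', new Node( 'select', array( 'SELECT 2' ) ), ' AS x' ) );
var_dump( execute_node( $sqlite, $tree ) );
```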
I've explored this and don't see any way other than switching to a full MySQL parser like the PHPMyAdmin one where we originally sourced the tokenizer from. This is a recursive transformation problem where we need to know the subquery tree – either to rewrite the subqueries in place, or to execute them and substitute each subquery for its results. Here's a fairly complex query that would be rather difficult to process without a parsed query tree:

```sql
WITH monthly_sales AS (
SELECT
employee_id,
MONTH(sale_date) AS sale_month,
YEAR(sale_date) AS sale_year,
SUM(sale_amount) AS total_sales
FROM
sales
GROUP BY
employee_id, sale_month, sale_year
),
ranked_sales AS (
SELECT
employee_id,
sale_month,
sale_year,
total_sales,
RANK() OVER (PARTITION BY sale_year, sale_month ORDER BY total_sales DESC) AS sales_rank
FROM
monthly_sales
)
SELECT
e.employee_id,
e.first_name,
e.last_name,
e.department_id,
d.department_name,
s.sale_date,
s.sale_amount,
m.manager_name,
r.total_sales,
r.sales_rank,
(select 2 as a from dual) as x
FROM
employees e
INNER JOIN sales s ON e.employee_id = s.employee_id
LEFT JOIN departments d ON e.department_id = d.department_id
LEFT JOIN managers m ON e.manager_id = m.manager_id
INNER JOIN ranked_sales r ON e.employee_id = r.employee_id
AND MONTH(s.sale_date) = r.sale_month
AND YEAR(s.sale_date) = r.sale_year
WHERE
e.status = 'active'
AND r.sales_rank <= 3
AND s.sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY
e.employee_id,
s.sale_date
HAVING
SUM(s.sale_amount) > 1000
ORDER BY
r.sale_year DESC,
r.sale_month DESC,
r.sales_rank ASC;
UNION
SELECT
1
FROM DUAL
FOR UPDATE
LOCK IN SHARE MODE
```

The PHPMyAdmin parser can process it just fine:

```php
$parser = new PhpMyAdmin\SqlParser\Parser($query1);
foreach($parser->statements as $stmt) {
// Get the first subquery in the first SELECT
var_dump($stmt->cteStatementParser->statements[0]->expr[10]);
// Parse it (the Parser class is lazy)
$subquery = $stmt->cteStatementParser->statements[0]->expr[10]->expr;
$subparser = new PhpMyAdmin\SqlParser\Parser($subquery);
// Get all the structured information we need:
print_r($subparser);
}
```
This would also make it easier to clearly reject any unsupported queries. We'd only expect values to show up in specific places of the parsed data structure, and if we get more than that, then it's an unsupported query and we can bail out with a clear error message.
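For illustration, a rough sketch of that kind of guard using phpmyadmin/sql-parser – the exact statement properties would need to be verified against the library, so treat them as assumptions:

```php
<?php
use PhpMyAdmin\SqlParser\Parser;
use PhpMyAdmin\SqlParser\Statements\SelectStatement;

function assert_supported_select( $query ) {
	$parser = new Parser( $query );
	if ( 1 !== count( $parser->statements ) ) {
		throw new Exception( 'Multi-statement queries are not supported.' );
	}
	$stmt = $parser->statements[0];
	if ( ! $stmt instanceof SelectStatement ) {
		throw new Exception( 'Only SELECT statements are supported here.' );
	}
	// Only allow values in the places we know how to handle; anything
	// else is an unsupported query. ($union is an assumed property name.)
	if ( ! empty( $stmt->union ) ) {
		throw new Exception( 'Unsupported syntax: UNION.' );
	}
	return $stmt;
}
```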
I failed to parse the following query using it:

```sql
WITH mytable as (select 1 as a, b from dual)
SELECT HIGH_PRIORITY DISTINCT
col_a,
UPPERCASE(z),
(SELECT MAX(col_b) FROM mytable),
(SELECT MAX(col_b) FROM mytable) as max_C
FROM mytable, (SELECT hgjba, 997482686 FROM mytable) as subquery
FORCE INDEX (idx_department_id)
LEFT JOIN (select a_column_yo from mytable) as t2
ON (t2.id = mytable.id AND t2.id = 1) AND (select 1 from mytable) = 1896
WHERE 1 = 3
AND date_column BETWEEN (SELECT '2024-01-01') AND CONCAT('2024', '-12-31')
GROUP BY col_a, col_b
HAVING 1 = 2
UNION SELECT * from table_b
ORDER BY col_a DESC, col_b ASC
FOR UPDATE
```

The output only goes to a certain level of granularity and not further. I had much more luck with the greenlion/php-sql-parser package:

```sh
composer require greenlion/php-sql-parser
```

```php
$parser = new PHPSQLParser();
$parsed = $parser->parse($query);
print_r($parsed);
```

It exposes a full AST and seems to understand every syntactic nuance, including WITH, index hints, subqueries in random places, etc. From here, it's all about recursively rewriting the MySQL AST into a SQLite query and bailing out whenever we encounter an unsupported syntax element. This should be much more reliable, as we would never accidentally retain any MySQL-specific syntax. We'll still need some of the logic already shipped with the SQLite integration plugin, e.g. to rewrite function calls.
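Sketching that recursive rewrite over the array structure PHPSQLParser returns (nodes with `expr_type`, `base_expr`, and `sub_tree` keys) – the specific type names below are assumptions, not verified against the library:

```php
<?php
require_once 'vendor/autoload.php'; // greenlion/php-sql-parser

function rewrite_tree( array $nodes ) {
	foreach ( $nodes as $i => $node ) {
		if ( is_array( $node ) ) {
			$nodes[ $i ] = rewrite_node( $node );
		}
	}
	return $nodes;
}

function rewrite_node( array $node ) {
	$type = strtolower( isset( $node['expr_type'] ) ? $node['expr_type'] : '' );

	// Example rewrite: map a MySQL-only function name to a SQLite one.
	if ( 'function' === $type && 'ascii' === strtolower( $node['base_expr'] ) ) {
		$node['base_expr'] = 'unicode';
	}

	// Bail out loudly on anything we don't know how to translate.
	// ('index_hint' is an assumed expr_type name – check the real output.)
	if ( 'index_hint' === $type ) {
		throw new Exception( 'Unsupported MySQL syntax: index hint' );
	}

	// Recurse into subqueries and nested expressions so that no
	// MySQL-specific syntax can accidentally survive.
	if ( ! empty( $node['sub_tree'] ) && is_array( $node['sub_tree'] ) ) {
		$node['sub_tree'] = rewrite_tree( $node['sub_tree'] );
	}
	return $node;
}

$parser = new PHPSQLParser();
$parsed = $parser->parse( 'SELECT ASCII(name) FROM mytable' );
$sqlite_tree = array();
foreach ( $parsed as $clause => $nodes ) {
	$sqlite_tree[ $clause ] = is_array( $nodes ) ? rewrite_tree( $nodes ) : $nodes;
}
// A PHPSQLCreator could then serialize $sqlite_tree back into a SQL string.
```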
# Proposal: Switching to AST for SQLite Integration

I believe the most viable path forward for the SQLite integration project is to switch from processing a token stream to processing an Abstract Syntax Tree (AST). The best way to generate an AST appears to be by adapting the MySQLParser.g4 grammar to create a small, performant non-ANTLR parser.

## Why Switch from Token Parsing to AST Processing?

### Current Approach: Token Stream Processing

Right now, the SQLite database integration plugin processes a stream of MySQL tokens. This method makes the code complex, as each change requires interpreting the tokens in the surrounding context. The challenges include:
My guess is that we currently support only a small percentage of the MySQLParser.g4 syntax. While this may be sufficient for running WordPress core, there's a long tail of features we don't support (e.g., …).

### Proposed Approach: AST Processing

Switching to AST processing would simplify and enhance our approach by providing:
Here’s an example of what a declarative transform could look like using an AST:

```php
function MySQLFunctionToSQLite($name, $args) {
switch (strtoupper($name)) {
case 'ASCII': return ['UNICODE', $args];
case 'CHAR_LENGTH':
case 'LENGTH':
return ['LENGTH', $args];
case 'CONCAT':
$concatenated = [new Token("(")];
foreach ($args as $k => $arg) {
$concatenated[] = $arg;
if ($k < count($args) - 1) {
$concatenated[] = new Token(" || ");
}
}
$concatenated[] = new Token(")");
return new Expression($concatenated);
}
}
```

Supporting another function means adding another `case` to that switch. Compare this to the linear token-based rewriting we do today (which is also spread over multiple locations).

## Parsing MySQL queries into an AST

The MySQL Workbench ANTLR grammar yields the only MySQL query parser I found that checks all these boxes:
On the upside, that grammar is official and exhaustive. On the downside, it's too large, slow, and memory-hungry to use as-is. However, we can still reuse it!

## Issues with the ANTLR-generated MySQL query parser

ANTLR can generate code for PHP, TypeScript, and other languages. However, the mysql-workbench grammar depends on a few …

### ANTLR parser performance

The PHP parser needs significant performance improvements before it can be used effectively, and the TypeScript parser isn't great yet, but it seems acceptable for a Playground-centric experiment.
### ANTLR parser caveats

The generated TypeScript parser needed some manual adjustments, e.g., adding a missing …

A larger problem with the TypeScript parser was the parsing speed. My test query took 3.3 seconds to parse! I tracked it down to a general pitfall of all predictive parsers …

The TypeScript parser is almost in good enough shape to use in Playground. That doesn't help WordPress core, though. For a reusable PHP WordPress plugin, we'd need a more performant version of the PHP parser.

## Other parsers

I tested all the trustworthy-looking MySQL query parsers I could find, both PHP and non-PHP ones. Almost none could parse this valid MySQL query:

```sql
WITH mytable as (select 1 as a, b from dual)
SELECT HIGH_PRIORITY DISTINCT
CONCAT("a", "b"),
UPPER(z),
col_a,
(SELECT MAX(col_b) FROM mytable),
(SELECT MAX(col_b) FROM mytable) as max_C
FROM
mytable FORCE INDEX (idx_department_id),
(SELECT hgjba, 997482686 FROM mytable) as subquery
LEFT JOIN (select a_column_yo from mytable) as t2
ON (t2.id = mytable.id AND t2.id = 1) AND (select 1 from mytable) = 1896
WHERE 1 = 3
GROUP BY col_a, col_b
HAVING 1 = 2
FOR UPDATE
UNION SELECT * from table_cde
ORDER BY col_a DESC, col_b ASC
```

Here's a list of some of the parsers I tried (yes, there were more!):
## Next steps

I think generating a smaller and faster PHP parser is the only viable path forward.
### How to generate another parser from an ANTLR grammar?

The ANTLR MySQL grammar is the best resource we can lean on. It seems to be the only complete MySQL grammar definition available. The two approaches I can think of are:
We'd also need a comprehensive test suite – I wonder if we could reuse the one from …

CC @aristath @dmsnell @schlessera @OllieJones @brandonpayton @bgrgicak @griffbrad @akirk off the top of my head – feel free to loop in more folks as needed :-)
MySQL Workbench also has an 8-year-old ANTLR 3 grammar that can be used to generate an LL parser – perhaps that would do the trick?
Also, surfacing my response to @schlessera's comment from WP-CLI/db-command:

## Inputs

What would be the use-case for parsing SQLite queries? Also, if we get different parsers for MariaDB and MySQL, then we also need different parsers for MySQL 5.7, 8.0, 8.1, etc. My hope is the same parser could handle all of these. As far as I understand, MariaDB syntax is mostly compatible with MySQL, and the official list of incompatible features doesn't seem too bad at the query parsing level.

## Parsing model

I don't think a …

In functional terms, transpilation would be a pure function like …

Here's just a few examples:
That's just the tip of the iceberg. There's …

Related: SQLGlot is a SQL -> SQL transpiler built in Python. Their parser is more powerful than the one in the https://github.com/WordPress/sqlite-database-integration plugin, and yet it won't handle the MySQL -> SQLite transition for …
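In code, the distinction between the two models might look like this (interface names invented for illustration):

```php
<?php
// What a pure transpiler would be: (MySQL string) -> (SQLite string).
// The argument above is that this signature cannot work in general.
interface QueryTranspiler {
	public function transpile( string $mysql_query ): string;
}

// What we actually need: a driver that owns the SQLite connection and
// may execute intermediate queries (schema lookups, emulating
// SQL_CALC_FOUND_ROWS, etc.) before producing the final result.
interface QueryDriver {
	/** @return array<int, array<string, mixed>> Result rows. */
	public function query( string $mysql_query ): array;
}
```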
Somewhat against my own recommendation, I took a stab at manually prototyping a PHP parser, but I quickly hit a wall. There are a lot of branching and decision points in that grammar. I still think it can be parsed with minimal lookaheads, but covering any useful subset of that SQL grammar with a manually-built PHP parser seems like a task for at least a few hundred hours. This issue now hinges on finding a semi-automated workflow to convert the ANTLR grammar to a parser that can be refactored by a human and does minimal lookaheads. Two paths I'd like to explore are:
The former seems intuitively easier, as the LLM doesn't need to know how the PHP parser is built to process consecutive grammar chunks. I had some successes asking GPT to convert subsets of the ANTLR grammar to …
From there we'd learn and decide what's next.

EDIT: Uploading a file to Gemini 1.5 Pro alongside the following prompts is looking promising so far:

System prompt:
Prompt:
The model really insisted it wouldn't do it, but it finally gave in to the strong language and capitalization. I feel bad for scolding the model. Weird times.
^ @JanJakes, since you've been making fixes to the existing translator, I want to share this discussion with you.
Personal thoughts: I like the idea, but on the other hand, I'm not sure it's something worth investing a couple of years of my life in. 😅 I'm also not sure we should strive to support every weird MySQL-specific syntactic quirk, though.
@adamziel @brandonpayton Thanks for inviting me to this discussion. I am a newbie to this codebase, so my views on this may be quite uneducated, but I'll try to share my thoughts anyway.

## AST

Firstly, I think that switching to AST processing from the current token-based approach will be very beneficial, no matter whether we try to support all the nuances of MySQL queries or not. With an AST, a more robust and better-architected translation layer could be built – one that is easier to maintain and extend, and less prone to the complex errors that changing a flat token stream can cause, especially with advanced syntaxes, subqueries, etc. With that in mind, I think it would be a great improvement in any case.

As for the particular parser, I would see the main requirements as 1) covering (close to) 100% of MySQL syntax, and 2) being actively maintained so that we can easily integrate new syntaxes with new versions of MySQL. Performance is, of course, also important. I guess this makes the official MySQL parser and the mysql-workbench grammar the best candidates, at least from what I understood from the proposal. I've noticed that meanwhile there has been some very interesting progress on this, which is great news.

## WASM

Without having much of a bigger context about all the possible motivations behind this effort, there is a question I have to ask. From the perspective of WordPress Playground, why not invest some effort into a WASM build of MySQL? That is definitely not an easy task, but looking at what we're trying to achieve right now, it may be something worth considering. Even if we can parse 100% of MySQL grammar, translating all the specific semantics into SQLite may take years, and even then, it's possible it will never cover 100% of the edge cases.

As for creating a WASM build of MySQL, we can get some inspiration from Postgres. First, Supabase created a postgres-wasm project that runs Postgres in an x86 emulator. This approach should be relatively easy to apply to MySQL. In fact, it appears someone had already done so some time ago, and there is a working demo. Later, a WASM-native Postgres build was created in the form of PGlite, which is the ideal (but also much harder) solution. While the first option doesn't seem too hard, I have no idea what it would take to create a WASM-native MySQL build. However, looking at the task of creating a stable, full-featured, and highly compatible SQL translation layer, isn't even that a task worth looking into?

I suppose that the motivations for translating SQL might be bigger than WordPress Playground, and that could explain why the MySQL-WASM way might not be the ideal solution. But even then, if MySQL in WASM is doable, WordPress Playground would benefit from it a lot. We should also keep in mind that even if we don't do that, it may, one day, happen.
+1. I'd love a WebAssembly version of MariaDB! We don't have that today and, since we're such a small team, I'd rather not dive into building the MariaDB server as WASM. PHP is already challenging enough, and we're going to get a lot for free by just waiting – the WASM standard keeps evolving and making difficult tasks easy. Someone will bundle that sooner or later. Until we're there, Studio may explore using …

TiDB looks like an interesting alternative. It claims to be MySQL-compatible and has a WASM build. I don't expect this to become the default option for Playground anytime soon, as it's a 50MB download, but to ever get to a reasonable download size, we have to start somewhere.
Going back to the parser generator idea, I've managed to convert the ANTLR grammar to EBNF notation, which unlocked translating it further to formats supported by existing parser generators. Unfortunately, none of the 16 parser generators I've tried produced a fast, dependency-free, relatively compact parser that outputs an AST. I tried Rust, C, JS, PHP, and more. I've learned so much about parsing! But I won't invest any more time into exploring other parser generators.

Here's a different idea: build a dedicated parser generator just for this. A recursive descent parser is an easy starting point. ChatGPT built a nice prototype that can actually parse a simplified SQL syntax and generates a PHP parser: https://chatgpt.com/share/365596de-389a-4e01-a137-223c72a7094a

Next step: run that code on the full MySQL grammar.
That's basically it. Running WordPress with SQLite is a long-standing dream of many WordPress devs, WordPress Playground aside. Playground just happens to be the main motivator to pursue it, but the benefits are much wider.
# The custom parser idea worked – check it out!

The parser is astonishingly simple. The parse() method is only 50 lines of code. All of the parsing rules are provided by the grammar. It still uses the Gemini-generated lexer. Perhaps we could switch to the PHPMyAdmin one that's already ported to this repo – TBD.

Here's a few parse trees it generated: … These are the same examples that derailed almost every MySQL parser I've tried in any language. And now we can parse all of them 🎉 🎉 🎉 !

Beyond MySQL queries, that parser can be used with any BNF grammar whatsoever. We could parse JSON, XML, or any other language that can be described by a BNF grammar. See the README for more technical details.

There's a few technical challenges to overcome from here:

## Ambiguity

I had to move the …

## File size

The grammar file is 1.2MB (100kb gzipped). This already is a "compressed" form where all the rules and tokens are encoded as integers. I see two ways to reduce the size:
## Speed

The parser can handle about 50 complex SELECT queries per second on a MacBook Pro. We need to do better than that. First, we need to profile and see where most of the time is spent. I suspect it's backtracking – the parser tries matching all the rules in the current branch and trashes the results if it fails. Here's a few optimization ideas:
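Here's a toy sketch of the grammar-driven, first-full-match approach described above – a deliberately simplified stand-in for the real parse() method, with invented array shapes:

```php
<?php
// Grammar as PHP arrays: each rule maps to a list of alternatives;
// each alternative is a sequence of terminals or rule references.
$grammar = array(
	'query'  => array( array( 'SELECT', 'fields', 'FROM', 'IDENT' ) ),
	'fields' => array(
		array( 'IDENT', ',', 'fields' ), // right-recursive repetition
		array( 'IDENT' ),
	),
);

// Returns array( $parse_tree, $next_position ), or null on failure.
function match_rule( array $grammar, string $rule, array $tokens, int $pos ) {
	foreach ( $grammar[ $rule ] as $alternative ) {
		$children = array();
		$p        = $pos;
		foreach ( $alternative as $symbol ) {
			if ( isset( $grammar[ $symbol ] ) ) { // Non-terminal: recurse.
				$match = match_rule( $grammar, $symbol, $tokens, $p );
				if ( null === $match ) {
					$children = null; // Backtrack: try the next alternative.
					break;
				}
				$children[] = $match[0];
				$p          = $match[1];
			} elseif ( isset( $tokens[ $p ] ) && $tokens[ $p ] === $symbol ) {
				$children[] = $tokens[ $p++ ]; // Terminal: consume the token.
			} else {
				$children = null;
				break;
			}
		}
		if ( null !== $children ) {
			// The first alternative that matches in full wins; later
			// alternatives are never tried ("first match" semantics).
			return array( array( 'rule' => $rule, 'children' => $children ), $p );
		}
	}
	return null;
}

// Token types for: SELECT a, b FROM t
$tokens = array( 'SELECT', 'IDENT', ',', 'IDENT', 'FROM', 'IDENT' );
list( $tree, ) = match_rule( $grammar, 'query', $tokens, 0 );
```

The real grammar is, of course, enormously larger, and the real parser adds the optimizations discussed below.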
I won't be able to make more progress here in the short term, so feel free to pick this one up. Speed is my main concern – once we can make it faster, I think we're good to start migrating the SQLite integration plugin. The file size isn't perfect, but we can figure that out in the longer term.
This definitely sounds like it could be evidence of a bug in the parser. Something isn't finding the longest match. Could this be an issue in converting the grammar to EBNF form? I found an ambiguity in the way the IMAP grammar is written, for example, and it's only resolved by careful backtracking. There's a production like this:

```
resp-text      = [ "[" resp-text-code "]" ] string
resp-text-code = "ALERT" / astring
```

and I found that …
I'm curious if it would be easy to quickly prune these from the grammar. That is, set their rules to produce a terminal like "UNSUPPORTED" and see if we can eliminate more of the parse. Or is it more the case that we have everything in the lexer and all of the rules that could be reached after entering a …
It's not a bug, it's a feature 😅 The parser is looking for the first rule that matches in full. We'll happily discard the current branch if the 10th token conflicts, but once a rule matches in full, we won't keep checking other branches. This is to avoid geometric complexity. Matching other rules could multiply the parsing time, especially early in the token stream. The underlying assumption is that the grammar is (or can be) factored to yield unambiguous results with this method. I think this is true if the MySQL grammar is context-free, which it seems to be. I'm not 100% certain about my assumptions here, but so far they seem to hold. We'll know once we bring in a large test suite.
I think we could easily do that, yes! :-)
We do have everything in the lexer, but we could have a few pruning stages:
And then, even if the lexer remains aware of all tokens and we only prune the parser grammar, that's still a huge win.
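Here's a sketch of what one of those pruning stages could look like, assuming the grammar is a PHP array of rule => alternatives, as in the toy example above:

```php
<?php
// Hypothetical pruning pass: replace the bodies of rules we never
// intend to support with a sentinel terminal that can never match,
// then drop every rule that is no longer reachable from the start rule.
function prune_grammar( array $grammar, array $unsupported, string $start ) {
	foreach ( $unsupported as $rule ) {
		$grammar[ $rule ] = array( array( 'UNSUPPORTED' ) ); // always fails
	}
	$keep  = array();
	$queue = array( $start );
	while ( $queue ) {
		$rule = array_pop( $queue );
		if ( isset( $keep[ $rule ] ) || ! isset( $grammar[ $rule ] ) ) {
			continue;
		}
		$keep[ $rule ] = $grammar[ $rule ];
		foreach ( $grammar[ $rule ] as $alternative ) {
			foreach ( $alternative as $symbol ) {
				if ( isset( $grammar[ $symbol ] ) ) {
					$queue[] = $symbol; // Follow rule references.
				}
			}
		}
	}
	return $keep;
}
```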
I don't think this is how EBNF grammars work. At least, parsing this grammar as written would require lookahead or backtracking to ensure that it makes the right choice when those choices are present in a production ruleset. Note that Parsing Expression Grammars (PEGs) explicitly denote ordered choice in the way you talk about here, but I don't believe this is a property of BNF grammars. When the grammar doesn't prevent this situation, the responsibility is on the parser to resolve the ambiguities and conflicts.
Maybe I shouldn't be talking about EBNF so much. For us, it's just an intermediate form that we use to go from the ANTLR g4 grammar to PHP arrays. ANTLRv3, which the original MySQL Workbench grammar was written in, used the first matching rule. ANTLRv4, which is what the latest MySQLParser.g4 grammar uses, seems to have some adaptive rules and switches between the first match and the longest match depending on how you construct the rule. Personally, I'd stick with the first-match semantics until we're certain we have to do longest match. Using the first match simplifies a lot.
I kept pushing after all and got good enough speed to start integrating this parser with the SQLite database integration plugin! It can now parse 700 complex SELECT queries per second, and way more for simpler queries. Here's what I did:
I also tried:
To improve the speed even more, we could experiment with factoring the grammar, other types of lookahead tables, and memoizing the matching results per token. However, I don't think we need to do that in the short term.
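For reference, per-token memoization would be a small packrat-style wrapper over a matcher like the toy match_rule() above – a sketch, not the plugin's code:

```php
<?php
// Cache every (rule, position) result – including failures – so that
// backtracking never re-parses the same span twice. The cache must be
// reset for every new token stream.
$memo = array();

function match_rule_memoized( array $grammar, string $rule, array $tokens, int $pos ) {
	global $memo;
	$key = $rule . ':' . $pos;
	if ( ! array_key_exists( $key, $memo ) ) {
		$memo[ $key ] = match_rule( $grammar, $rule, $tokens, $pos );
	}
	return $memo[ $key ];
}
```

The trade-off is memory: the cache can grow to one entry per (rule, position) pair, so the speed win has to be weighed against the footprint.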
This is funny, because I thought that strings were often faster than integers for comparison, due to interning and reference equality vs. needing to cast through numeric types.
Interesting! Well 🤷 I was as surprised about unwinding recursion, I thought removing another function call would make it faster but it didn't. I assume that PHP internals, in all their function calling slowness, are still faster than maintaining a call stack using PHP code. |
how did you remove the recursion? did you use a standard trampoline? did you return an array?

```php
function recurse( $input ) {
return $should_recurse ? recurse( $rest_of_input ) : $result;
}
function trampoline( $input ) {
$rest = $input;
while ( null !== $rest ) {
		$result = step( $rest ); // step() takes $rest by reference and sets it to null when done.
}
return $result;
}
```

there's still the same number of user-space function calls here. if the recursion is internalized in a loop then I was just thinking of a different problem. you can also try a …
I inlined the entire … in adamziel/parser-generator-explorations@3bf8b7a. I suspect the slowdown may be caused by creating …
I wrapped up all my explorations in #157. The grammar size problem is solved there. I replaced automatic left-recursion factoring with hand-crafted right-recursion rules, and the grammar.php file size went down from 1.2MB to 70kb. It also increased the parsing speed by ~26%, reduced the memory footprint, and produced a more useful parse tree.
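For illustration, hand-crafting the right recursion for a repetition could be encoded like this in grammar.php-style PHP arrays (the shape is invented for this sketch; the real file also replaces string names with integers):

```php
<?php
// Before factoring: columns ::= column (',' column)*
// After factoring into right recursion:
//   columns    ::= column columns_rr
//   columns_rr ::= ',' column columns_rr | ε
$grammar = array(
	'columns'    => array( array( 'column', 'columns_rr' ) ),
	'columns_rr' => array(
		array( ',', 'column', 'columns_rr' ),
		array(), // ε – matches nothing, terminating the repetition
	),
	'column'     => array( array( 'IDENT' ) ),
);
```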
## Context

This PR ships an exhaustive MySQL **lexer** and **parser** that produce a MySQL query AST. This is the first step to significantly improve MySQL compatibility and expand WordPress plugin support on SQLite. It's a simpler, more stable, and easier-to-maintain method than the current token processing. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.

This PR is part of the [Advanced MySQL support project](#162). See the [MySQL parser proposal](#106 (comment)) for additional context.

## This PR ships

1. A **MySQL lexer**, adapted from the AI-generated one by @adamziel. It's over 3x smaller and close to 2x faster.
2. A **MySQL grammar** written in ANTLR v4 format, adapted from the [MySQL Workbench grammar](https://github.com/mysql/mysql-workbench/blob/8.0.38/library/parsers/grammars/MySQLParser.g4) by adding and fixing some cases and reordering some rules.
3. A **script to factor, convert, and compress the grammar** to a PHP array.
4. A **dynamic recursive parser** implemented by @adamziel.
5. A **script to extract tests** from the MySQL repository.
6. A **test suite of almost 70k queries**.
7. A WIP **SQLite driver** by @adamziel, a demo and foundation for the next phase.

At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.

## Running tests

The lexer & parser test suite is not yet integrated into the CI and existing test commands. To run the tests, use:

```sh
php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php
```

This will lex / lex & parse all the ~70k queries.

## Implementation

### Parser

A simple recursive parser to transform `(token stream, grammar) => parse tree`. In this PR, we use MySQL tokens and the MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have some specific properties). The `parse_recursive()` method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.

### run-mysql-driver.php

A quick and dirty implementation of what a `MySQL parse tree ➔ SQLite` database driver could look like. It easily supports `WITH` and `UNION` queries that would be really difficult to implement in the current SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand, and maintain than the current implementation. I stand by this, even though the temporary `ParseTreeTools`/`SQLiteTokenFactory` API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse:

```php
function translateQuery( $ast, $rule_name = null ) {
    if ( is_token( $ast ) ) {
        $token = $ast;
        switch ( $token->type ) {
            case MySQLLexer::EOF:
                return new SQLiteExpression( [] );
            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue( $token )
                );
            default:
                return SQLiteTokenFactory::raw( $token->text );
        }
    }
    switch ( $rule_name ) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's skip them.
            return null;
        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular FROM DUAL
            // statement, as FROM mytable, DUAL is a syntax error.
            if ( ParseTreeTools::hasChildren( $ast, MySQLLexer::DUAL_SYMBOL )
                && ! ParseTreeTools::hasChildren( $ast, 'tableReferenceList' ) ) {
                return null;
            }
            break;
        case 'functionCall':
            $name = $ast[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall( $name, $ast[0]['udfExprList'] );
    }
}
```

## Technical details

### MySQL Grammar

We use the [MySQL workbench grammar](https://github.com/mysql/mysql-workbench/blob/8.0/library/parsers/grammars/MySQLParser.g4), manually adapted, modified, and fixed, and converted from the ANTLR4 format to a PHP array. The grammar conversion pipeline is done by `convert-grammar.php` and goes like this:

1. Parse the MySQLParser.g4 grammar into a PHP tree.
2. Flatten the grammar so that any nested rules become top-level and are referenced by generated names. This factors compound rules into separate rules, e.g. `query ::= SELECT (ALL | DISTINCT)` becomes `query ::= select %select_fragment0` and `%select_fragment0 ::= ALL | DISTINCT`.
3. Expand `*`, `+`, `?` modifiers into separate, right-recursive rules. For example, `columns ::= column (',' column)*` becomes `columns ::= column columns_rr` and `columns_rr ::= ',' column | ε`.
4. Compress and export the grammar as a PHP array. It replaces all string names with integers and ships an int->string map to reduce the file size.

The `mysql-grammar.php` file is ~70kb, which is small enough.

The parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1000/second. There's a lot of further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term. If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most of the time spent on dealing with ambiguities.

## Known issues

There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:

1. A very special case in the lexer is not handled — while identifiers can't consist solely of numbers, in the identifier part after a `.` this is possible (e.g., `1ea10.1` is a table name & column name). This is not handled yet, and it may be worth checking if all cases in the identifier part after a `.` are handled correctly.
2. Another very special case in the lexer — while the lexer does support version comments, such as `/*!80038 ... */`, and nested comments within them, a nested comment within a non-matched version is not supported (e.g., `SELECT 1 /*!99999 /* */ */`). Additionally, we currently support only 5-digit version specifiers (`80038`), but 6 digits should probably work as well (`080038`).
3. Version specifiers are not propagated to the PHP grammar yet, and versions are not applied in the grammar yet (only in the lexer). This will be better to bring in together with version-specific test cases.
4. Some rules in the grammar may not have version specifiers, or they may be incorrect.
5. The `_utf8` underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this by running `SHOW CHARACTER SET` on different MySQL versions.
6. The PHPized grammar now contains array indexes of the main rules, while previously they were not listed. It seems there are numeric gaps. It might be a regression caused when manually parsing the grammar. I suppose it's an easy fix.
7. Some components need better test coverage (although the E2E 70k query test suite is pretty good for now).
8. The tests are not run on CI yet.
9. I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it.

This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist.

## Updates

Since the thread here is pretty long, here are quick links to the work-in-progress updates:

- [First update with a MySQL query test suite.](#157 (comment))
- [Quick update, focusing on lexer.](#157 (comment))
- [Custom grammar conversion script, preserving versions, fixes, and more.](#157 (comment))
- [Wrap up](#157 (comment)).

## Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

* Bring in a comprehensive MySQL queries test suite, similar to [WHATWG URL test data](https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json) for parsing URLs. First, just ensure the parser either returns null or a parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures.
* Create a `MySQLOnSQLite` database driver to enable running MySQL queries on SQLite. Read [this comment](#106 (comment)) for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea that we may build a `MySQLOnPostgres` driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
* Port MySQL features already supported by the SQLite database integration plugin to the new `MySQLOnSQLite` driver. For example, the `SQL_CALC_FOUND_ROWS` option or the `INTERVAL` syntax.
* Run the SQLite database integration plugin test suite on the new `MySQLOnSQLite` driver and ensure it passes.
* Rewire this plugin to use the new `MySQLOnSQLite` driver instead of the current plumbing.

---------

Co-authored-by: Jan Jakes <[email protected]>
Let's go through the MySQL documentation pages and make sure even complex SELECT queries are supported by the SQLite integration plugin:
This likely means rewriting `execute_select` as more of a grammar parser or a state machine and reasoning about each encountered token. In contrast, the current approach is to consume all the tokens unless a tactical adjustment applies. This way, we could reuse the SELECT logic for WITH, UNIONs, subqueries, etc. Currently we cannot, because the `execute_select` method assumes it acts on an entire query, not on a part of it.

The implementation could look like this:
For starters, just migrating to a state machine approach would be more than enough, as it would unlock support for UNIONs and easy ignoring of tokens like `HIGH_PRIORITY` or `SQL_SMALL_RESULT`.