

GitHub - jamii/hytradboi-jam-2022
source link: https://github.com/jamii/hytradboi-jam-2022

Goal: Pass all 4259065 tests in sqllogictest. From scratch. No dependencies. No surrender.
I wrote the test harness in advance, so my starting point on Monday morning was 4.2 million test failures.
> zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
error.Unimplemented => 4259065,
)
passes => 0
Serious databases write their own parsers, but serious databases aren't finished in a week. We're going to need a shortcut. Let's grab this bnf extracted from the SQL 2016 spec.
Only 9139 lines long.
Also impossible to parse, because they don't escape tokens. Sometimes a `[` is just a `[`, but sometimes it opens an optional group.
<left bracket> ::=
[
<less than operator> ::=
<
<identifier body> ::=
<identifier start> [ <identifier part>... ]
Also check this out:
<national character large object type> ::=
NATIONAL CHARACTER LARGE OBJECT [ <left paren> <character large object length> <right
paren> ]
| NCHAR LARGE OBJECT [ <left paren> <character large object length> <right paren> ]
| NCLOB [ <left paren> <character large object length> <right paren> ]
Yeah, they line-wrapped to 90 columns in the middle of `<right paren>`, so I have to strip excess whitespace from names after parsing them.
So parsing the bnf is kind of a mess, but I only have to parse this one bnf and not bnfs in general so I just mashed in a bunch of special cases.
I don't really have a way of looking at the result yet, but I at least have the right number of definitions.
So now I just have to use the bnf to parse sql. I don't really know how to do that because I always just hand-write recursive descent parsers whenever I need to parse something.
The sql grammar is left-recursive so I can't just go top-down or it will potentially loop forever. But there is a neat trick that I heard when working with xtdb earlier this year. It's tricky to explain, but the intuition is similar to how iterative deepening depth-first search avoids getting stuck in infinitely deep search trees.
So whenever we hit a rule that might be left-recursive, we make an entry in the cache that says "parse failed". That forces it to explore other branches. If it manages to find some base case, we put that in the cache and try again to see if we can parse a bigger chunk of the input. If we ever get a result that is not better than the result already in the cache then there isn't any point exploring deeper trees.
Here's the core of that loop. It took a while to debug the logic, but I think it makes sense now that I've worked through it.
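The shape of it is roughly this (a simplified sketch with made-up names, not the actual code from the repo):

```zig
const std = @import("std");

// Simplified sketch of the seed-and-grow loop for left-recursive rules.
// The memo maps (rule, start position) to the longest successful parse so far.
const MemoKey = struct { rule: u32, pos: u32 };
const Result = struct { end: u32 };

const Parser = struct {
    pos: u32,
    memo: std.AutoHashMap(MemoKey, ?Result),

    fn parseLeftRecursive(self: *Parser, rule: u32, start: u32) !?Result {
        const key = MemoKey{ .rule = rule, .pos = start };
        // Seed the cache with "parse failed" so that when the rule recurses
        // into itself it bottoms out instead of looping forever.
        try self.memo.put(key, null);
        var best: ?Result = null;
        while (true) {
            self.pos = start;
            // parseOnce consults the memo whenever it hits `rule` at `start`
            // again, so the first iteration can only succeed via a base case.
            const result = (try self.parseOnce(rule)) orelse break;
            // If re-parsing doesn't consume more input than before, stop.
            if (best) |b| {
                if (result.end <= b.end) break;
            }
            best = result;
            try self.memo.put(key, best);
        }
        self.pos = if (best) |b| b.end else start;
        return best;
    }

    fn parseOnce(self: *Parser, rule: u32) !?Result {
        // Stand-in for the real rule dispatch.
        _ = self;
        _ = rule;
        return null;
    }
};

test "seed and grow terminates" {
    var parser = Parser{
        .pos = 0,
        .memo = std.AutoHashMap(MemoKey, ?Result).init(std.testing.allocator),
    };
    defer parser.memo.deinit();
    // With the stub parseOnce (which never matches), the loop stops immediately.
    try std.testing.expectEqual(@as(?Result, null), try parser.parseLeftRecursive(0, 0));
}
```

The key invariant is that each iteration either consumes strictly more input than the last or stops, so the loop always terminates.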
It took a lot more debugging to start successfully parsing actual sql:
- My bnf parser had the wrong binding power for `|`. I had to turn it into a slightly Pratt parser to fix it.
- The bnf from the spec doesn't describe actual sql that well (eg `select 1;` is invalid in the spec despite being accepted by every database since forever). So I've had to start tweaking the bnf to get the sqlite tests to parse.
Here's where the scoreboard stands at the end of the day.
> time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
error.Unimplemented => 172471,
error.ParseError => 4086594,
)
passes => 0
________________________________________________________
Executed in 94.84 secs fish external
usr time 94.33 secs 2.63 millis 94.33 secs
sys time 0.53 secs 0.00 millis 0.53 secs
Tomorrow I'll start working through those parse failures. Probably I'll need to build a little gui debugger to help, because trying to read through printlns of all the productions of that massive grammar is not time efficient.
I realized that I unthinkingly made the parser greedy. It will backtrack within an `either`, but if the first branch succeeds it will never go back to explore the second branch. That means when it hits something like `avg(c)` it parses the `avg` as an identifier and then can't backtrack to try to parse it as a function call later.
SQL can be parsed with this kind of minimal backtracking (that's how the materialize parser works), but the bnf grammar is not designed with that in mind and it would be way too much work to edit it by hand.
I read a bunch about table driven parsers for arbitrary context-free grammars (because who knows what class of grammar this bnf is). Eventually I decided it would be safer to just add backtracking to the parser I already have, since that's a method I already understand.
I flailed for a long while trying to figure out how to combine this with the memoized loop used for handling left-recursive definitions. Eventually I came up with what I think is a workable design, where the memoized value is stored on the parse stack and the loop works by adding backtrack points whenever a recursive definition completes successfully.
I started on the implementation but at this point I'm just staring at the screen and not making forward progress, so I'm going to stop early and get some rest.
My original plan looked like this:
- sql parser (3 days)
- name resolution (1 day)
- type inference and lowering (1 day)
- optimization and execution (1 day)
- grind out the long tail of bugs (1 day)
I estimated 3 days to get the sql parser working because I've been to sql-land before and I know what horrors lurk there, but I'm a little worried now that it will take even longer.
Yesterday I was trying to be all fancy and make the backtracking actually backtrack (which is much harder when you don't have generators or lazy lists built into the language). Today I decided to just get on with it and generate every subparse. I got this working really quickly.
Unfortunately, I then discovered that the bnf from the spec is wildly ambiguous. Check out the 7 different parses for `select 1;`.
I ran the thing against the whole test suite just to see how badly I'm doing:
HashMap(
error.NoParse => 3971842,
error.Unimplemented => 275660,
error.AmbiguousParse => 11563,
)
passes => 0
________________________________________________________
Executed in 31.88 mins fish external
usr time 31.77 mins 0.00 micros 31.77 mins
sys time 0.10 mins 734.00 micros 0.10 mins
It's really slow because every complex query produces a bazillion parse trees. And it's still not parsing most of the tests correctly.
Plus, even if I get this whole thing working, I have to then turn those awful parse trees into something sane.
So... this is not going to work.
I didn't want to roll my own parser because we did that at materialize and while it worked out fine, it's more than 10kloc of rust. If I was just directly transcribing it I would have to type at 5 characters per second for 24 straight hours. I looked around at some other databases and even the ones that use parser generators are huge:
> scc cockroachdb-parser/pkg/sql/scanner/ cockroachdb-parser/pkg/sql/parser/
-------------------------------------------------------------------------------
Language Files Lines Blanks Comments Code Complexity
-------------------------------------------------------------------------------
Go 6 2305 202 356 1747 403
AWK 4 243 27 73 143 0
Bazel 2 183 11 9 163 0
Shell 2 49 13 8 28 1
Happy 1 14658 843 0 13815 0
Markdown 1 284 65 0 219 0
gitignore 1 8 1 2 5 0
-------------------------------------------------------------------------------
Total 17 17730 1162 448 16120 404
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $500,413
Estimated Schedule Effort (organic) 10.572043 months
Estimated People Required (organic) 4.205190
-------------------------------------------------------------------------------
Processed 506614 bytes, 0.507 megabytes (SI)
-------------------------------------------------------------------------------
Except sqlite, which has a tiny little grammar definition.
> scc sqlite/src/parse.y sqlite/tool/lemon.c sqlite/tool/lempar.c
-------------------------------------------------------------------------------
Language Files Lines Blanks Comments Code Complexity
-------------------------------------------------------------------------------
C 2 6961 356 1195 5410 916
Happy 1 1928 179 0 1749 0
-------------------------------------------------------------------------------
Total 3 8889 535 1195 7159 916
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $213,397
Estimated Schedule Effort (organic) 7.647269 months
Estimated People Required (organic) 2.479133
-------------------------------------------------------------------------------
Processed 285597 bytes, 0.286 megabytes (SI)
-------------------------------------------------------------------------------
I only have to parse the sqlite grammar, and even then probably only a fraction of it. So it looks like writing a parser generator might be plausible, and by now it's really my only option if I want to get any tests passing at all.
So here's my grammar so far. It's parsed by GrammarParser. In theory I could do this at comptime, but until we get a comptime allocator that's going to be a yak shave and I am way too many yak shaves deep already. So I laboriously but reliably write out all the rules into grammar.zig.
The Tokenizer is written by hand and seems to be basically done - it runs without complaint on the entire test set. I might discover bugs there later, of course.
The Parser is fun. There is a single parse function, but because it reads the rules from grammar.zig at compile time it gets specialized for each parse rule. Basically I got the same result as hand-generating the parser code, without having to splice a bunch of strings together. After the jam maybe I'll have some yak shave time to cut out grammar.zig entirely and do all the grammar stuff at comptime, and then it'll be a pretty sweet system.
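To give the flavour (a hypothetical sketch, not the actual Parser; the real rules live in grammar.zig), passing the rule as a comptime parameter is enough to make the compiler stamp out a specialized body per rule:

```zig
const std = @import("std");

const Rule = union(enum) {
    token: []const u8,
    seq: []const Rule,
};

// Because `rule` is comptime-known, each distinct rule value gets its own
// specialization of this function, as if the code had been hand-generated.
fn parse(comptime rule: Rule, input: []const u8, pos: *usize) bool {
    switch (rule) {
        .token => |text| {
            if (std.mem.startsWith(u8, input[pos.*..], text)) {
                pos.* += text.len;
                return true;
            }
            return false;
        },
        .seq => |children| {
            // `inline for` unrolls at compile time, so a sequence compiles
            // down to straight-line calls to its children's specializations.
            inline for (children) |child| {
                if (!parse(child, input, pos)) return false;
            }
            return true;
        },
    }
}

const select_one = Rule{ .seq = &[_]Rule{
    .{ .token = "SELECT " },
    .{ .token = "1" },
} };

test "specialized parse" {
    var pos: usize = 0;
    try std.testing.expect(parse(select_one, "SELECT 1;", &pos));
}
```

The real thing returns rich tree types like the ones below instead of a bool.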
The best part about this is that I get really nice parse trees by generating rich types from the input grammar. Eg:
```zig
pub const anon_18 = ?distinct_or_all;
pub const anon_19 = ?from;
pub const anon_20 = ?where;
pub const anon_21 = ?group_by;
pub const anon_22 = ?having;
pub const anon_23 = ?window;
pub const anon_24 = ?order_by;
pub const anon_25 = ?limit;

pub const select = struct {
    distinct_or_all: *anon_18,
    result_columns: *result_columns,
    from: *anon_19,
    where: *anon_20,
    group_by: *anon_21,
    having: *anon_22,
    window: *anon_23,
    order_by: *anon_24,
    limit: *anon_25,
};
```
I still can't parse many of the tests because I haven't implemented expressions, operator precedence etc. So I'm still stuck in parsing land. But I can see the light at the end of the tunnel now.
And at least it's fast:
> time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt)
HashMap(
error.ParseError => 4257825,
error.Unimplemented => 1240,
)
passes => 0
________________________________________________________
Executed in 11.85 secs fish external
usr time 11.49 secs 0.44 millis 11.49 secs
sys time 0.39 secs 1.79 millis 0.39 secs
I'm going to spend tomorrow fleshing out as much of the parsing as possible and then switch to trying to analyze and execute some of the simpler tests.
I lost a lot of time in the morning to segfaults in the zig compiler. I eventually worked around them by moving more stuff from comptime to codegen.
I fleshed out the parser and grammar a bunch more. I added memoization again to handle left recursion, as well as a ton of debugging output.
I was making pretty good progress through parsing more and more queries when something I added caused some queries to loop seemingly forever in the parser. I've been banging my head against this for hours and getting nowhere.
I suspect something in the interaction between memoization and the stack of expr_* types that handle precedence in the grammar. Maybe I'll figure it out tomorrow.
It's not looking good for passing any tests at all, let alone all of them. The light at the end of the tunnel was a train.
I figured it out at home.
I was barking up totally the wrong tree. The `expr` grammar is actually not left-recursive after the changes I made, which means the memo is not being applied. But the memo was covering up the fact that I wrote the grammar in a silly way that required exponential time to parse without memoization.
Which means I actually don't need memoization (which I spent most of today working on) at all for my grammar.
This whole left-recursive memoization thing has been a total dead end. I should have stuck with techniques I'm familiar with from the start. This is not the time to be trying experimental new things.
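For the record, the kind of grammar shape that blows up like that looks something like this (a generic illustration, not necessarily the exact rules I wrote):

<expr add> ::=
<expr mul> <plus sign> <expr add> | <expr mul>
<expr mul> ::=
<expr atom> <asterisk> <expr mul> | <expr atom>

A naive backtracking parser tries the first alternative of <expr add>, parses a whole <expr mul>, fails to find the plus sign, and then re-parses the same <expr mul> from scratch for the second alternative. Stack a few precedence levels like that and the work on the innermost atom doubles at every level, which is exactly the blowup the memo had been hiding.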
The silver lining is this:
HashMap(
error.ParseError => 3725149,
error.Unimplemented => 533916,
)
passes => 0
________________________________________________________
Executed in 50.07 secs fish external
usr time 49.83 secs 1.75 millis 49.82 secs
sys time 0.23 secs 1.90 millis 0.23 secs
That's about 12.5% of the tests parsing!
I made some changes to the parser generator, added a whole bunch of debugging tools to the parser itself and then sat down and cranked on the grammar till I can parse 100% of the tests. I'm finally out of the damn tunnel.
Then I sketched out a framework for planning and evaluation and filled out just enough to handle scalar-only queries.
HashMap(
error.NoPlan => 4066003,
)
skips => 3869887
passes => 193062
________________________________________________________
Executed in 121.77 secs fish external
usr time 120.77 secs 0.02 millis 120.77 secs
sys time 0.96 secs 1.01 millis 0.96 secs
4.5% of tests are passing!
This is how I expected the week to go. Not death by bnf.
The odds of hitting 100% now are very low - there is a long tail of weird behavior that I expected to have much more time to implement. But I think in the next two days I might be able to get to something respectable.
Full steam ahead:
total => 5939714
HashMap(
error.TypeError => 198542,
error.NoPlan => 2345772,
error.StatementShouldError => 2,
error.ParseError => 2,
error.TestExpectedEqual => 9379,
error.BadColumn => 42749,
)
skips => 127880
passes => 3215388 (= 54.13%)
________________________________________________________
Executed in 581.30 secs fish external
usr time 258.12 secs 660.00 micros 258.12 secs
sys time 342.46 secs 324.00 micros 342.46 secs
What are we looking at here?
total => 5939714. Some tests are marked with tags like `onlyif sqlite` or `skipif sqlite`. Before I was only looking at tests with no tags (ie generic sql), but today I found out that tags are often used to split out setup code like:
onlyif oracle
statement ok
CREATE TABLE foo(...funky oracle workaround...)
skipif oracle
statement ok
CREATE TABLE foo(...regular sql...)
I was skipping both, which left me with no table. So now I'm also running any statements tagged with `skipif`, which leaves me with a lot more tests to pass.
passes => 3215388 (= 54.13%). That's a lot of progress.
skips => 127880. Most of the slt files start with some statements to set up some data, followed by a bunch of queries. If I know that I screwed up one of those setup statements, I still try to execute the queries but I assign any errors to the `skip` category so I'm not debugging the wrong sql.
error.NoPlan => 2345772. Things that my planner does not support yet.
error.TypeError => 198542. Sqlite has some hilarious implicit type casts. I didn't implement these, so sometimes my queries fail when sqlite tries to eg add a boolean to a string. Some of these might also be from the planner miscounting columns though.
sqlite> select TRUE + "foo";
1
(The string doesn't look like a number so it coerces to 0, TRUE is 1, and 1 + 0 = 1.)
error.BadColumn => 42749. Sometimes I generate plans that refer to non-existent columns. I catch these during eval so that the bounds check doesn't kill my test run.
error.TestExpectedEqual => 9379. I produced the wrong output for the test. Surprisingly few of these.
error.ParseError => 2. I added a bunch of new tests, so I had to expand the grammar some more. I didn't catch everything.
error.StatementShouldError => 2. Some statement was supposed to abort with an error but it didn't. This will probably screw up other tests in the same file.
The major changes I made today were supporting most statements, almost all scalar expressions, SELECT/FROM/WHERE and some forms of non-correlated subqueries.
I also added memory limits (via std.heap.GeneralPurposeAllocator) and cpu limits (via counting backwards branches) to evaluation, to prevent the odd bad plan from trashing my test run.
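The cpu limit amounts to a counter that every backwards branch pays into; something like this (a minimal sketch with made-up names):

```zig
const std = @import("std");

// Minimal sketch of the cpu limit: every backwards branch in the evaluator
// burns one unit of "juice", so a runaway plan fails with error.OutOfJuice
// instead of hanging the whole test run.
const Evaluator = struct {
    juice: u64,

    fn useJuice(self: *Evaluator) error{OutOfJuice}!void {
        if (self.juice == 0) return error.OutOfJuice;
        self.juice -= 1;
    }
};

test "runaway evaluation gets cut off" {
    var evaluator = Evaluator{ .juice = 2 };
    try evaluator.useJuice();
    try evaluator.useJuice();
    try std.testing.expectError(error.OutOfJuice, evaluator.useJuice());
}
```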
With one day left to go I have to be strategic about what I support next. A few days ago I dumped a list of grammar rules sorted by how many test cases they occur in. The list is probably a bit out of date with the newly added tests, but hopefully directionally correct.
The big ones I'm missing are:
- ORDER BY = 804266
- JOIN = 143153
- GROUP = 108561
- subquery = 97229
- function call = 57144
- VIEW = 57144 (this is probably responsible for most of the skips though)
- UNION/INTERSECT/EXCEPT = 20440
- DELETE = 19877 (again, probably responsible for a lot of skips and/or wrong results)
- CASE = 16811
- HAVING = 13375
So tomorrow I'll probably just try and crank through features in that order and see where I end up.
I completely rewrote the way name resolution works to handle the addition of all the weird edge cases in select_body. I can now handle stuff like `select a, count(a+1) as count, a+2 as c from foo where c > 0 order by b, count`. (Think about what order those scalar expressions have to get executed in.) I'll go into depth about that in the final writeup.
Then I just cranked. I added way more detail to the error reports so I could priority sort missing features and then I typed a lot of code.
> git diff 3b25276a871f752a700e7a4942d51fd50ce63f5 --stat nix-shell
README.md | 7 +
lib/sql.zig | 22 +-
lib/sql/Evaluator.zig | 286 +++++++++++++++++++---
lib/sql/Planner.zig | 1097 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
lib/sql/grammar.txt | 2 +-
lib/sql/grammar.zig | 21 +-
test/slt.zig | 9 +-
7 files changed, 1188 insertions(+), 256 deletions(-)
Funnily enough, it looks like I've been averaging about 1000 lines per day:
> scc ./lib nix-shell
-------------------------------------------------------------------------------
Language Files Lines Blanks Comments Code Complexity
-------------------------------------------------------------------------------
Zig 8 6476 185 106 6185 708
Plain Text 3 390 17 0 373 0
-------------------------------------------------------------------------------
Total 11 6866 202 106 6558 708
-------------------------------------------------------------------------------
Estimated Cost to Develop (organic) $194,627
Estimated Schedule Effort (organic) 7.384347 months
Estimated People Required (organic) 2.341581
-------------------------------------------------------------------------------
Processed 268876 bytes, 0.269 megabytes (SI)
-------------------------------------------------------------------------------
I implemented ORDER BY, GROUP BY (without grouped columns), functions, aggregates, CASE, CAST, correlated IN, INNER/CROSS JOIN and various edge cases in sqlite's implicit coercions and comparisons.
I got pretty damn close.
> zig build generate_grammar && time zig build test_slt -Drelease-safe=true -- $(rg --files deps/slt | sort -h)
...
total => 5939714
HashMap(
error.NoPlanOther => 133170,
error.TestExpectedEqual => 16820,
error.NoPlanExprAtom => 2848,
error.OutOfJuice => 2162,
error.WrongNumberOfColumnsReturned => 1681,
error.NoPlanCompound => 1000,
error.NoPlanConstraint => 550,
error.NoPlanLeft => 191,
error.BadEvalAggregate => 59,
error.NoPlanStatement => 32,
error.ParseError => 2,
error.StatementShouldError => 2,
)
skips => 123015
passes => 5658182 (= 95.26%)
________________________________________________________
Executed in 18.64 mins fish external
usr time 462.87 secs 0.00 micros 462.87 secs
sys time 658.11 secs 891.00 micros 658.11 secs
Man, if I hadn't spent more than half the week in that parsing rabbit hole I really think I might have made it to 100%. But this'll do.