
Ask HN: How do you test SQL?

source link: https://news.ycombinator.com/item?id=34602318

459 points by pcarolan | 231 comments
I've been looking for resources for our data team to apply best practices for testing SQL pipelines (we use DBT) but have not found anything. How do you test SQL pipelines? What patterns, tools and best practices would you recommend? Any good reference material you know of?
Try to write any complex SQL as a series of semantically meaningful CTEs. Test each part of the CTE pipeline with an in.parquet and an expected_out.parquet (or in.csv and expected_out.csv if you have simple datatypes, since that works better with git). And similarly test larger parts of the pipeline with 'in' and 'expected_out' files.

If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast
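Here's a minimal sketch of that idea with pytest and DuckDB; the file names and the query stand in for one stage of the pipeline and are purely illustrative:

    import duckdb

    def test_stage_one():
        con = duckdb.connect()  # in-memory
        # DuckDB can read parquet files as if they were tables
        actual = con.sql(
            "SELECT customer_id, sum(amount) AS total "
            "FROM 'in.parquet' GROUP BY customer_id ORDER BY customer_id"
        ).fetchall()
        expected = con.sql(
            "SELECT * FROM 'expected_out.parquet' ORDER BY customer_id"
        ).fetchall()
        assert actual == expected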

One challenge if you're using Spark is that tests can be frustratingly slow to run. One possible solution (that I use myself) is to run most tests using DuckDB, and only e.g. the overall test using Spark SQL.

I've used the above strategy with pytest, but conceptually I don't think it's particularly sensitive to the programming language/test runner you use.

Also I have no idea whether this is good practice - it's just something that seemed to work well for me.

The approach with csvs can be nice because your customers can review these files for correctness (they may be the owners of the metric), without them needing to be coders. They just need to confirm in.csv should result in expected_out.csv.

If it makes it more readable, you can also inline the 'in' and 'expected_out' data, e.g. as a list of dicts, and pass it into DuckDB as a pandas DataFrame.

One gotcha is that SQL does not guarantee row order, so you need to sort somehow or otherwise ensure your tests are robust to this.

If your database supports it, unit tests are an absolutely ideal use case for temporary tables or global temporary tables. A global temporary table can be defined with the same schema as the real table and will "exist" for the purpose of view/CTE definitions, but any data inserted into it is only ever visible from that specific session. The exact rules depend on the database, but basically the data exists either until the session is closed or until the session commits, and the semantics will (theoretically) be optimized for one session doing rapid inserts and clears.

If you build your unit tests so that a thread will not commit until it's finished, or clear it before/after, it's pretty ideal for that. You can feed in data for that test and it won't be visible anywhere else.

Potentially, if your table creates are scripted, you could add a rule that a GLOBAL TEMPORARY TABLE version gets generated from the other table definitions, but only for unit tests. Or just update both in parallel.

The other useful use case I've found for this is an unlimited-size "WHERE myId IN (:1, :2 ...)": insert the IDs into a GTT and express the filter as "INNER JOIN myGtt ON myId = myGtt.myId". This has the advantage of not thrashing the query planner with a different literal SQL text for every length of myList (which will fill up your query plan cache!). Since it's one literal string, it always hits the same cached plan.
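A rough sketch of that join-a-temp-table-instead-of-IN(...) pattern; sqlite3 is used here only so the snippet runs standalone, while the comment above is about Oracle-style GTTs where the table definition is permanent and the rows are session-private:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 11), (3, 12)])

    wanted_ids = [10, 12]  # could be tens of thousands of values

    # The TEMP table is visible only to this connection (the "session")
    conn.execute("CREATE TEMP TABLE wanted (customer_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in wanted_ids])

    # One fixed SQL text regardless of how many IDs we filter on,
    # so the cached query plan can be reused.
    rows = conn.execute(
        "SELECT o.id FROM orders o INNER JOIN wanted w ON o.customer_id = w.customer_id"
    ).fetchall()
    print(rows)  # [(1,), (3,)]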

I'm told this has problems in Oracle's OCI when using PDBs; apparently GTTs bloat the redo-log tablespace (the WAL equivalent). But so do many, many things; PDBs are apparently a very different, much less capable, and incredibly immature implementation that seems essentially abandoned / not a focus, from what I've been told. I was very surprised, having originally written that code against regular Oracle without problems, but PDBs just aren't the same (right down to nulls being present in indexes!).

Off-topic incoming (sorry).

I used this trick (join temporaryFoo instead of WHERE foo IN ...) in production fifteen years ago, using MySQL. The gain was really astonishing. Several operations can be optimized using joins on specially crafted tables (LIMIT, for instance).

This is one of the worst drawbacks of ORMs everywhere: nobody even seems to think about those optimisations anymore.

Also, views can be mapped to ORM objects/POJOs. They can be read-only, some views are "trivially remappable" (if there is a 1:1 mapping from view columns to table columns), or you can even use INSTEAD OF INSERT/UPDATE triggers to take writes on that view and do something completely different with them. ;)

I've used that to do dumb things like levering a table schema into an ORM mapping it wasn't really designed for, while having to maintain fallback compatibility with the existing tables.

Views are really a DB-level "interface" implementation that few people exploit fully: here is the definition, here is the implementation. And yes, global temporary tables are such a cute cheat for unlimited-size query-specific data ;)

One caution with PostgreSQL: CTEs under some circumstances (and in all circumstances prior to PostgreSQL 12) act as optimization barriers. Specify `NOT MATERIALIZED` in the CTE definition to ensure it is optimized the same way a sub-SELECT would be.
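For where the keyword goes, a tiny sketch (assuming a reachable local Postgres 12+ and an "events" table, both of which are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        cur.execute("""
            WITH recent AS NOT MATERIALIZED (
                SELECT * FROM events WHERE created_at > now() - interval '7 days'
            )
            SELECT count(*) FROM recent WHERE status = 'error'
        """)
        print(cur.fetchone()[0])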
This is a good point, although it’s worth noting that there are definitely cases where you want the sun table to be materialised (esp. when that table is small and referenced many times)
Yeah, it's definitely one of those things where it is worth reading the docs [0], and not just adopting an "always do it this way" rule of thumb.

[0] https://www.postgresql.org/docs/current/queries-with.html#id...

Is 'sun table' a particular concept here? A typo?
CTE = common table expression, i.e. a WITH clause (possibly recursive) before your SELECT (or other) statement. https://learnsql.com/blog/what-is-common-table-expression/
Yes, I didn't know what it was either, so I had looked it up before I saw yours. The definition I found is below:

A common table expression, or CTE, is a temporary named result set created from a simple SQL statement that can be used in subsequent SELECT, DELETE, INSERT, or UPDATE statements.

I knew about WITH clauses, just didn't know the name.
> Try and write any complex SQL as a series of semantically meaningful CTEs

I find this helps a heck of a lot with maintainability + debugging as well

It can encourage a more procedural way of thinking rather than a set-based one, though. Not always; it's just something to pay attention to, along with pushing filters early / explicitly. It often asks more of the optimizer, and that's where a bunch of the cautions come into play.
I really like this answer because it tests CTEs in a modular way. One question I have is: how do you test a CTE with an in.csv and out.csv without altering the original CTE? Currently I have a chain of multiple CTEs in a sequence, so how would I be able to take the middle CTE and "mock" the previous one without altering the CTE actually used in production? I prefer not to maintain a CTE for the live query and the same CTE adapted for tests.
A few quick tips to help with slow Spark tests.

1. Make sure you're sharing a session across the test suite, so that spin-up only has to occur once per suite and not per test. This has the drawback of not allowing dataframe name reuse across tests, but who cares.

2. If you have any kind of magic number N in the Spark SQL or dataframe calls (e.g. coalesce(N), repartition(N)), make N a parameter and set it to 1 for the tests.

3. Make sure the master isn't set to anything more than 'local[2]', or less depending on your workstation.
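A sketch of those three tips as a pytest fixture; the fixture, app, and table names are only illustrative:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        session = (
            SparkSession.builder
            .master("local[2]")                           # tip 3: keep the master small
            .appName("sql-tests")
            .config("spark.sql.shuffle.partitions", "1")  # tip 2: shrink magic numbers
            .getOrCreate()
        )
        yield session                                     # tip 1: one session per suite
        session.stop()

    def test_totals(spark):
        df = spark.createDataFrame([(1, 10), (1, 5)], ["customer_id", "amount"])
        df.createOrReplaceTempView("orders")
        out = spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")
        assert out.collect()[0]["total"] == 15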

> Try and write any complex SQL as a series of semantically meaningful CTEs.

Could you or anyone else on the post provide an example?

Let's say you are doing a paystub rollup - a department has multiple employees, an employee has multiple paystubs.

If you are storing fully denormalized, concrete data about the salary/medical/retirement amounts, both pre- and post-tax, that were actually paid in each pay period (because this can vary!), then you can define a view that computes salary-per-employee (taxable, untaxable, etc.), and then a view that rolls up employees-per-department. And you can write unit tests for all of those.
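A loose sketch of that layering with DuckDB, where the table and column names are made up for illustration and each view can get its own tests:

    import duckdb

    con = duckdb.connect()
    con.sql("CREATE TABLE paystubs (employee_id INT, department_id INT, gross INT, tax INT)")
    con.sql("INSERT INTO paystubs VALUES (1, 10, 1000, 200), (1, 10, 1000, 200), (2, 10, 500, 50)")

    # One view per "level", each independently testable
    con.sql("""
        CREATE VIEW pay_per_employee AS
        SELECT employee_id, department_id, SUM(gross) AS gross, SUM(gross - tax) AS net
        FROM paystubs GROUP BY employee_id, department_id
    """)
    con.sql("""
        CREATE VIEW pay_per_department AS
        SELECT department_id, SUM(gross) AS gross, SUM(net) AS net
        FROM pay_per_employee GROUP BY department_id
    """)

    assert con.sql("SELECT net FROM pay_per_employee WHERE employee_id = 1").fetchone()[0] == 1600
    assert con.sql("SELECT gross FROM pay_per_department").fetchone()[0] == 2500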

That's a super contrived example, but basically once group aggregates, window functions, and other complex sub-queries come into the picture, it becomes highly desirable to write those as their own views, and you can write some simple unit tests for those views. There are tons of weird SQL quirks that come from nullity etc., and you can have very specific sum(mycol where condition)-style logic and other non-trivial sub-subquery logic. It's simple to just write an expression that you think is true and validate that it works like you think it does, and that all output groups (including empty/null groups) that you expect to be present or absent actually are or aren't, etc.

I'm not personally advocating for CTEs specifically as a design goal in preference to views; personally I'd rather write views where possible. But recursive CTEs are the canonical approach for certain kinds of queries (particularly node/tree structures), and at minimum a CTE is certainly a "less powerful context" than a generalized WHERE EXISTS (select 1 from ... where myVal = outerVal) or a scalar subquery. It's desirable to have that isolation from the outer SQL query cursor, IMO (and depending on what you're asking, it may optimize to a different plan).

Writing everything as a single query, where the sub-sub-query needs to be 100% sure not to depend on the outer-outer cursor, is painful. What even is DENSE_RANK() in the context of this particular row/window? If you've got some bizarre RANK(myId ORDER BY timestamp) or whatever, does it really work right? Etc. It's just a lot easier conceptually to write each "function" as a level with its own unit tests. Same as any other unit-testable function block: it's ideal if it's Obviously Correct, and then you define compositions of Obvious Correctness with their own provable correctness.

And if it's not Obviously Correct, then you need the proof even more. Encapsulate whatever dumb stuff you have to do to make it run correctly and quickly into a subquery and just do an "inner join where outerQuery.myId = myView.myId". Hide the badness.

> Try and write any complex SQL as a series of semantically meaningful CTEs

I used this exact same method. Not only does it help me, but also those who come after, trying to understand what's going on.

My ignorance of the topic (and experience with a mostly unrelated one) is showing, but all I could think of when you said CTE was "chronic traumatic encephalopathy". This made a lot more sense when you generalized answering the question as if it's a given that Python is necessary (I know that's not your intent, but that's how it comes off).

Not much more to say, just observing, sorry if this is irrelevant commentary.

CTE == Common Table Expression. It's not specific to any particular language. It's basically like a view, except you define them in the same place as you use them, and they can be recursive.
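For the recursive part, a small sketch (runnable with DuckDB; the table and data are invented for illustration):

    import duckdb

    con = duckdb.connect()
    con.sql("CREATE TABLE employees (id INT, manager_id INT)")
    con.sql("INSERT INTO employees VALUES (1, NULL), (2, 1), (3, 2)")

    # Walk up the management chain starting from employee 3
    rows = con.sql("""
        WITH RECURSIVE chain AS (
            SELECT id, manager_id, 0 AS depth FROM employees WHERE id = 3
            UNION ALL
            SELECT e.id, e.manager_id, c.depth + 1
            FROM employees e JOIN chain c ON e.id = c.manager_id
        )
        SELECT id, depth FROM chain ORDER BY depth
    """).fetchall()
    print(rows)  # [(3, 0), (2, 1), (1, 2)]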
Thank you for the clarification, that actually makes sense in context!

... Although also now I'm a little put off, because you reminded me of working with DOS-based SAP interfaces and Oracle's Java-based attempts to make "interactive views" circa 2006 :/

Sometimes using databases feels just like repeatedly slamming your head into your desk, so that fits.
I used to think that, but once you start thinking about them in the right way, the relational model is pretty nice.
Agreed, but it doesn’t solve every problem. And then there are the purely operational issues; for example a few days ago autovacuum was never able to completely finish vacuuming one particular table, but then the problem mysteriously went away and now it’s fine. Wonderful.
>> I could think of when you said CTE was "chronic traumatic encephalopathy"

The title is "Ask HN: How do you test SQL"

Probably not that relevant for a data team, but this is what we do as a backend team:

We use Microsoft SQL's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.

Then we have our homegrown migration file runner that will compute a hash of the migrations, make a database template_a5757f7e, and run the hundreds of migrations on it, whenever we add a new SQL migration (todo: make one template build on the previous).

Then we use the BACKUP command to dump the db to disk (within the docker image)

Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.

So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.

(And..our test suite is written in Go, with a lot of embedded SQL strings, even if a lot of our business logic is in SQL)

We spin up a docker container running the DB technology we use, run our DB migration scripts on it, and then run integration tests against it. You get coverage of your migration scripts this way too.
This is what we did at my last job. You can catch DB-specific issues that a fake implementation wouldn't show and make sure all your code paths work as expected.

Every time new issues cropped up, we would add data designed to reproduce them to the test dataset. Every edge case we ran into.

It provided so much confidence because it would catch and trigger so many edge cases that testing with mocks or by hand would miss.

Edit: also, it’s great for test/dev environments. You don’t have to worry about losing important data or filling new environments with data. Just start with the full test data and you’re good to go. It’s got stuff for all the corner cases already. Something got screwed up? Blow it away and reload, it’s not precious.

Same here. Every backend service with a SQL datastore runs a real ephemeral DB instance during CI, runs the normal migrations against it, and uses it for tests of the direct data access/persistence code. Everything else in the service with a dependency on the data-access stuff gets a mocked version, but testing the actual data access code against a real datasource is non-negotiable IMO.
That's a huge amount of work and money.

at my company they just told us to stop reporting edge cases. much easier, much cheaper.

Sounds like a place that's not really interested in solving customer needs. :(

Most places aren't like that, at least the ones I've seen. :)

I've used this same approach as well. Testcontainers is a nice way to help with this!

https://www.testcontainers.org/

What is Testcontainers? Reading the first page, I don't understand what benefits it buys me:

> Testcontainers for .NET is a library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions. The library is built on top of the .NET Docker remote API and provides a lightweight implementation to support your test environment in all circumstances.

Edit: Ok, example helps. https://dotnet.testcontainers.org/examples/aspnet/

I can declare docker infrastructure from my code, right?
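Roughly, yes. A hedged Python equivalent of the same idea, using testcontainers-python and SQLAlchemy and assuming Docker is available locally (table and data are invented):

    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    def test_migration_and_query():
        # Starts a throwaway Postgres container and tears it down afterwards
        with PostgresContainer("postgres:15") as pg:
            engine = sqlalchemy.create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                conn.execute(sqlalchemy.text("CREATE TABLE users (id INT PRIMARY KEY, name TEXT)"))
                conn.execute(sqlalchemy.text("INSERT INTO users VALUES (1, 'ada')"))
                name = conn.execute(sqlalchemy.text("SELECT name FROM users WHERE id = 1")).scalar()
            assert name == "ada"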

Yep, instead of starting the container in the build script, it's nice to start it in the test code file itself.
I rolled my own with Docker for a few years and recently made the switch to Testcontainers. So far so good - but if you're in an environment or language where Testcontainers is difficult, rolling your own really isn't too hard. It also keeps you honest with maintaining good migration scripts.
We do this too for PostgreSQL. To ensure the tests are really fast:
    - we create a template database using the migrations
    - for *every* integration test we do `CREATE DATABASE test123 TEMPLATE test_template;`
    - we tune the PostgreSQL instance inside Docker to speed up things, for example disabling synchronous_commit
On a successful test, we drop the test123 database. On a failed test, we keep the database around, so we can inspect it a bit.

The really great thing about this approach (IMHO), is that you can validate certain constraint violations.

For example, exclusion constraints are great for modelling certain use cases where overlapping ranges should be avoided. In our (go) code, the test cases can use the sqlstate code, or the constraint name to figure out if we hit the error we expect to hit.

This approach is pretty much as fast as our unit tests (your mileage may vary), but it prevents way more bugs from being merged into our codebase.
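A sketch of the template-database trick with psycopg2; it assumes a local Postgres that already has a migrated test_template database, and all names here are illustrative:

    import time
    import psycopg2

    def fresh_database():
        admin = psycopg2.connect("dbname=postgres user=postgres host=localhost")
        admin.autocommit = True  # CREATE DATABASE cannot run inside a transaction
        name = f"test_{int(time.time() * 1e6)}"
        with admin.cursor() as cur:
            cur.execute(f'CREATE DATABASE "{name}" TEMPLATE test_template')
        admin.close()
        return name

    def drop_database(name):
        admin = psycopg2.connect("dbname=postgres user=postgres host=localhost")
        admin.autocommit = True
        with admin.cursor() as cur:
            cur.execute(f'DROP DATABASE "{name}"')
        admin.close()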

Out of curiosity, how fast is really fast?
Just did a sequential run (to get some better measurements), and this is an excerpt of the things happening in the PostgreSQL instance inside the Docker container, for creating and dropping the databases:
    08:25:37.114 UTC [1456] LOG:  statement: CREATE DATABASE "test_1675239937111796557" WITH template = test_template
    [noise]
    08:25:48.002 UTC [1486] LOG:  statement: DROP DATABASE "test_1675239947937354435"

Start time of first test: 2023-02-01 08:25:03.633 UTC

Finish time of last test: 2023-02-01 08:26:13.861 UTC

82 tests, or 0.856 seconds per test (sequentially).

In parallel, we take 6.941 seconds for 82 tests, or 0.085 seconds per test.

Running SQLite in memory as a test DB speeds up your test runner like crazy. You can do this if you use an SQL query builder library, because it can translate your queries for the specific database.
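A minimal sketch of that query-builder angle with SQLAlchemy Core: the same query object can run against in-memory SQLite in tests and against Postgres in production (table and data are invented for illustration):

    import sqlalchemy as sa

    metadata = sa.MetaData()
    users = sa.Table("users", metadata,
                     sa.Column("id", sa.Integer, primary_key=True),
                     sa.Column("active", sa.Boolean))

    # The query is built once; the engine decides the dialect
    query = sa.select(sa.func.count()).select_from(users).where(users.c.active)

    engine = sa.create_engine("sqlite:///:memory:")  # in prod: create_engine("postgresql://...")
    metadata.create_all(engine)
    with engine.begin() as conn:
        conn.execute(sa.insert(users), [{"id": 1, "active": True}, {"id": 2, "active": False}])
        assert conn.execute(query).scalar() == 1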
This can be a good fast/local test, or maybe a sanity test... but there are definitely differences between databases that need to be accounted for, so you want to take that green test pass with a bit of skepticism. You always want to test on the same DB engine that runs your prod workloads. If your surface area is small, you can get by with the approach you mentioned, but it would need to be marked as tech debt that should be paid down as soon as possible, or as soon as that bug that manifests in Postgres but not SQLite appears :)
On Postgres you can run

SET SESSION synchronous_commit TO OFF;

(Update: I just looked in our test code, you can also replace the CREATE TABLE commands with "CREATE UNLOGGED TABLE" to disable write-ahead logging.)

There are possibly other tricks?

I slightly disagree with the tech debt comment, though. If you get a huge speed up, it may be worth paying for the occasional bug, depending on the circumstances. Or you could do both, and only run the test on Postgres occasionally.

SQLite is the most popular database in the world by a large margin. While you are correct for those who use other databases, for the majority case you are wrong.

Of course most people who have complex queries are probably not using sqlite and so may not care about testing the database.

You can get the 'in-memory' speed advantage by putting the datastore on a ramdisk (or even just disabling fsync, which is pretty easy to do in postgresql).
There's also the eatmydata program, which does something similar by using LD_PRELOAD to intercept IO calls.
The problem is that it's not always compatible with features you use on your production database.
You can split the test regime so that as much as possible is covered with SQLite, and then have a second test phase with a heavyweight DB only if the first phase passes. So code errors, malformed SQL, etc. cause it to fail fast and early, and you only test with the real DB once you know everything else is working.

Or along similar lines you could divide it such that developers can test things locally on their machines with SQLite, but once it gets pushed into CI (and passes code review etc.) it's tested against the heavy db.

That still doesn't fix the compatibility issues. Postgres has features/syntax that SQLite does not have, so sometimes you can't test Postgres syntax with SQLite.
So you test against a different database technology than the one your software uses? I understand why that works, but it seems odd.
A really nice thing about Postgres and MySQL is that in the JVM world the H2 and HSQLDB engines have broad compatibility modes, so you can use them in-JVM for unit-test speed in many cases. It doesn't help with developing the SQL, but it does help with testing.

Snowflake, on the other hand, is just special.

Why should this be faster than a local postgres instance with no traffic?
Because you have neither IPC nor IO, whereas with a local postgres server instance you have both?
Yup, same. Last time I set this up I used Sqitch¹ for migrations, which encourages you to write tests for each migration; caught a lot of bugs early that way, all in a local-first dev environment. Worked especially well for Postgres, since plpgsql makes it easy to write tests more imperatively.

¹: https://sqitch.org/

I resented writing the verify scripts for my migrations, after writing unit and integration tests, but yes it is valuable
Yeah, that's it. I have seen this method done with sample data (low fidelity) and/or against production copies of an RDS DB (high fidelity). It's still hard to catch some migrations or other operations under workload, but you can emulate that as well, to a certain point.
Same (though I don't use Docker). Did TDD for a while like this, and it wasn't perfect, but it worked better than anything else. I didn't even know how to run the frontend; that was a separate team.
bonus points if you add test data for integration tests with sad paths too
Absolutely this. I stood up a negative test suite for continuous DB queries late in 2020 and it's caught many potential show stopper integration issues since; about 45% more YoY than pre-suite.

Took about a week of duplicating happy path tests, investigating gaps, and switching inputs/assertion types to get everything passing, but less than a week later we had our first verifiable test failure.

This is a great way to test for backwards-incompatible changes if your fleet is running canaries or instances on different versions backed by a singleton database. You apply the migrations, check out the app at the old version, and then re-run your test suite. Any failures are a reasonably high signal that some backwards-incompatible migration was introduced.
Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?
Sorry for the digression first. (If anyone has different definitions for the ideas here, I would love to learn.)

I think the answers would depend on the types of tests that the term "this" encompasses. From how I understand it, calling something a unit test or an integration test depends on the context of what is being tested. For example, if a developer is writing a unit test for an HTTP handler, and the handler implementation includes calls to an external database, then one would have to use a mock for the database, in order for the test to be deemed a true unit test [1]. If the developer is writing an integration test for the same HTTP handler, then the database would have to be a real database implementation [2].

On other hand, if the developer were testing SQL queries for syntactical or logical correctness against a database, these tests would want to use a real database implementation. Note that though the test uses a real database, it is still a true unit test [3]. Additionally, note that using a mocked database here would not serve the purpose of the tests, which is to catch syntactical and logical errors against a database engine. This can, of course, only be achieved by using a real database—or, if you insisted on using a "mock", then, by implementing an entire SQL engine, with the exact same quirks as the real database's engine, inside the mock!

On the original question:

> Do you do this in place of unit tests (where you have to mock/stub the DB interactions) or do you do both?

I guess the answer would be: it would depend on the objectives and types of the tests. Do both of them, because some tests, such as unit tests on the HTTP handler, would use mocks, while other tests, such as the SQL query correctness tests, would use the real database.

[1] A true unit test is one that has no external interactions, neither directly nor transitively, besides interactions with the system under test (SUT). The SUT here is the HTTP handler.

[2] An integration test should include external system interactions, if any. That's what integration means.

[3] The SUT is each SQL query and the its interaction with the database. There are no interactions in the test with systems outside the SUT, so it is still a true unit test.

I was wondering the other day how to classify tests that use a test double/fake like pg-mem, which isn't returning stubbed results but isn't the Dockerized test DB either:

https://github.com/oguimbal/pg-mem

would love to do this, but how does one spin up a redshift cluster inside of a docker container?
try not to cry, cry a lot, and then resolve not to vendor lock yourself to a black box data store next time.

Jokes aside, Redshift is based on PG^1; you can try an older Postgres version to get some semblance of it running locally.

1. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-an...

And this is why you don’t rely on a closed source stack if you have any alternative.
Redshift speaks the postgres protocol so you might be able to use postgres. There are a few purpose-built docker images (googleable) that may replicate Redshift slightly better than just `docker run -p 127.0.0.1:5439:5432 postgres:latest`, but if you're at the point of having a test suite for your data warehouse code, you're likely using Redshift-specific features in your code and postgres won't suffice.

I have seen teams give each developer a personal schema on a dev cluster, to ensure their Redshift SQL actually works. The downside is that now your tests are non-local, so it's a real tradeoff. In CI you probably connect to a real test cluster.

Well the first step is to get Amazon to part with their lucrative closed source software.
Would redshift serverless help at all? I realize that it may not have parity with the existing functionality, just a thought.
There's pgTAP for Postgres [1], the same approach probably is valid for other databases.

Here's [2] a slide deck by David Wheeler giving an introduction into how it works.

[1] https://pgtap.org/

[2] https://www.slideshare.net/justatheory/unit-test-your-databa...

Unfortunately, I'm not surprised people test queries in their applications' unit tests. What they're actually testing is the ORM/query builder. Instead, with pgTAP, you can test your queries specifically.
I write mostly batch ETL stuff. All plain psql and bash. We don’t have a good testing setup to be honest. What we do use instead:

Plenty of constraints, uniques and foreign keys and not nulls. Enum types.

Visuals, dump to csv and plot some graphs. Much easier to find gaps and strange distributions visually.

Asserts in DO blocks, mostly counts being equal.

Build tables in a _next suffix schema and swap when done.

Never mutating the source data.

Using psqls ON_ERROR_STOP setting.

Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.

“Wasting” machine resources and always rebuilding from scratch when feasible. CREATE TABLE foo AS SELECT is much simpler than figuring out which row to UPDATE. Also ensures reproducibility, if you’re always reproducing from scratch it’s always easy. State is hard.

Overall I'm quite happy with the workflow and very rarely do we make mistakes that unit tests would have caught. Our source data is complex and not always well understood (10+ years of changing business logic), so writing good tests would be very hard. Because we never touch the raw source data, any errors we inevitably make are recoverable.

This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4

I admit it's not the best, but I do almost the same as you. Graphs are definitely helpful.

I also have a few 'test' queries that insert their results into a 'test_results' table. Most of the queries check the cardinality of the table, since to me wrong cardinality is where the biggest errors come from. I do something like:

    insert into test_results
    select
      case when count(*) = count(distinct users) then 'pass' else 'fail' end as result,
      'test_cardinality_temporary_table_a' as test_name
    from temporary_table_a

Google internally has developed a SQL dialect, and some investments have been made to provide first-class language support for SQL. It has several interesting capabilities that make SQL nice for ETL use cases.
  * It has language-level module support, similar to other languages. Thus SQL functions are reusable across multiple codebases without depending on code generation tricks. One of the major blockers for SQL adoption has been complex domain-specific business logic, and now the situation is better.
  * It has official unit test support. Google uses Blaze (known externally as Bazel), so adding a unit test for SQL code is as simple as adding a SQL module (and its test input) dependency to a SQL test target, then writing a test query and its expected output in an approval-testing style. Setting up the DB environment is all handled by the testing framework.
  * It has official SQL binary support. That's just a fancy name for handling lots of the tedious stuff involved in running a SQL query (e.g. putting everything needed into a single package, performing type checks, handling input parameters, managing native code dependencies for FFI, etc.).
None of these is technically very sophisticated, at least in theory, but combined they become pretty handy. Now I can write a simple SQL module that mostly depends on another team's SQL module, write a simple unit test for it, then run a SQL binary just like in other languages. I haven't worried a single time about how to set up a DB instance. This loop is largely focused on OLAP, so it's a bit different for OLTP, which has its own established testing patterns.
One easy method is just to test SQL the way you test anything else:

1)Set up a test db instance with controlled data in it as the basis for your test cases. Ideally this data is taken from real data that has caused pipeline problems in the past but scrubbed for PII etc. You can also use or write generators to pad this out with realistic-looking fake data. If you do this the same dataset can be used for demos (once you add data for your demo paths).

2)Write test cases using whatever test framework you use in your main language. Say you code in python, you write pytest cases, java -> junit etc. You can help yourself by writing a little scaffolding that takes a sql query and a predicate, runs the query and asserts the predicate over the result. If you don't have a "main language", just write these test cases in a convenient language.

3)Consider resetting the state of the database (probably by reloading a controlled dump before each test batch) so any tests which involve inserts/deletes etc work. You may actually want to create an entirely new db and load it before each test run so that you can run multiple test batches concurrently against different dbs without contention messing up your results. Depending on your setup you may be able to achieve a similar effect using schemas or (sometimes but not always) transactions. You want each test run to be idempotent and isolated though.

Doing it this way has a number of benefits because it's easy to add your SQL test cases into your CI/CD (they just run the same as everything else).
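As a sketch of the scaffolding from step 2 (pytest and DuckDB here, but any driver works; the table, data, and predicate are only illustrative):

    import duckdb
    import pytest

    @pytest.fixture
    def db():
        con = duckdb.connect()        # step 1: a controlled dataset
        con.sql("CREATE TABLE orders (id INT, status TEXT)")
        con.sql("INSERT INTO orders VALUES (1, 'shipped'), (2, 'returned')")
        yield con
        con.close()                   # step 3: every run starts from scratch

    def check(con, query, predicate):
        """Run `query` and assert `predicate` holds over the result rows."""
        rows = con.sql(query).fetchall()
        assert predicate(rows), f"predicate failed for: {query}"

    def test_no_unknown_statuses(db):
        check(db,
              "SELECT count(*) FROM orders WHERE status NOT IN ('shipped', 'returned')",
              lambda rows: rows[0][0] == 0)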

GitLab has their guide up; I love it and use it all the time. I've been doing data engineering in a small team for about 4 years, helping hospitals with data and hopefully making it easier to understand.

Something that is overlooked or undervalued, in my opinion: have a stark distinction separating technical tests from business-logic tests. It makes it easier to communicate what's happening when something is 'wrong' vs wrong, and it's easier to bring a non-technically-inclined team member up to speed.

Also, I think it's good to learn from the SaaS side of things and not bloat up or over-engineer the infrastructure just because data engineering is the latest development flavour. Keep it simple. Great Expectations is a great tool, but I think small teams should take a really hard look at their needs and see if a simple orchestration engine and SQL testing are enough. A centralized source for testing is great, but infrastructure isn't free, and even when it is, you are paying for it with your and your team's time.
I have to admit, like others in this comment section, I was searching for ways to automate checking that our team's written SQL code is working, whether by its technical or business-logic requirements. But there's a lot of good insight on data management in general in that handbook, and I appreciate now knowing about it.
If you're using dbt, dbt tests are a good start: https://docs.getdbt.com/docs/build/tests

You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.

Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)

SQL being relatively pure, functional, algebraic and whatnot, doesn't require the same rigor of automated test coverage that more "systems" programming languages do. (By "systems" I include all languages that people use to integrate the various parts of a software system - i.e. regular programming languages like Java, C#, TypeScript, C++ and so on. Not just low level languages.)

Stored procedures are a different beast though. Having significantly struggled to debug stored procedures running in MSSQL on a Macbook (on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL), if I was building an application based on them I'd definitely try to spin up some kind of testing framework around them. I guess what I'd probably do is have a temporary database and some regular testing framework that nukes the db, then calls the stored proc(s) with different inputs and checks what's in the tables after each run. Sounds slow and clunky?

> on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL

Good news! Since SSMS18 you're SOL on Windows too, as Microsoft completely removed that feature :)

I believe you are expected to use Visual Studio to manage all this crap... and debugging in it was awful years ago; SSMS was slightly less bad, IMHO.
I think that when it comes to testing databases... most people just don't.

Look at this JetBrains survey: https://www.jetbrains.com/lp/devecosystem-2021/databases/

Around half of the people never debug stored procedures. Three quarters of people don't have tests in their databases. Only half of the people version their database scripts.

Personally: the answer is containers. Spin up a database in a container (manually or on CI server) and do whatever you need with it. Seed it with some test data, connect an app to it, check that app tests pass when writing to and reading from a live database (as opposed to mock data stores or something like H2), then discard the container.

Even if you don't have a traditional app, throwaway instances of the real type of DB that you'll be using are great, both for development and testing.

Write a toolbox for your tests that is able to spin up and tear down an instance of the database that you use (e.g. on a random port with a random password). Make sure that the new instance is really independent, especially storing its data in a separate (temporary) directory.

During a test:
    - At the start of the test (fixture), run a new DB instance.
    - Apply the DB schema.
    - Possibly: remove constraints that would disturb your tests (e.g. unimportant foreign keys).
    - Possibly: add default values for columns that are not important for your test (but do this with caution).
    - Run your test.
    - Assert results (maybe also directly via access to the database, or via a dump of tables).
    - Tear down the database, possibly removing all data (except error logs).

I used this pattern to test software that uses MySQL or MariaDB server. For Microsoft SQL server it may be enough to create a new database instead of running a new instance (possible but not as easy as for MySQL/MariaDB).

On CI server this can be used to run tests against all required DB server types and versions.

DBT supports testing right out of the gate, you can write little queries and macros to test and validate every column.
In SQL-heavy ETL pipelines, I normally don't test the SQL queries by themselves, but do black box & performance testing to verify that the output of a certain batch job matches what I expect (automated acceptance testing).

This is easier if you have the same input every time the tests run, like a frozen database image, because then you can basically have snapshot tests.

We do something similar: we run some validation tests against the output (file size and line count) but not the actual data.
Two ideas here:

1) The same way you'd write any other tests. Use your favourite testing framework to write fixtures and tests for the SQL queries:

  - connect to the database
  - create tables
  - load test data
  - run the query
  - assert you get the results you expect
For insert or update queries, that assertion step might involve running another query.

2) DBT has support for testing! It's quite good. See https://docs.getdbt.com/docs/build/tests
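For a flavour of what those dbt tests look like, here is a hedged example of a schema.yml using the built-in generic tests; the model, column, and accepted values are all made up:

    version: 2
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']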

I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

If so:

1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as "this column should have values in the list [A,B,C]", as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages on top of dbt tests [2]. The problem with assertion testing in general, though, is that for a moderately complex data pipeline it's infeasible to achieve test coverage that would cover most possible failure scenarios.

2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding an integration with dbt soon which will make diffing as part of dbt development workflow one command away [2]

We make money by selling a Cloud solution for teams that integrates data diff into Github/Gitlab CI and automatically diffs every pull request to tell you the how a change to SQL affects the target table you changed, downstream tables and dependent BI tools (video demo: [3])

I’ve also written about why reliable change management is so important for data engineering and what are key best practices to implement [4]

[1] https://docs.getdbt.com/docs/build/tests

[2] https://github.com/calogica/dbt-expectations

[3] https://github.com/datafold/data-diff/pull/364

[4] https://www.datafold.com/dbt

[5] https://www.datafold.com/blog/the-day-you-stopped-breaking-y...

The hard part about testing SQL is decoupling from infrastructure and big data sources. We use DuckDB and pandas dataframes to mock data sources and unit test SQL. Python testing frameworks (or simple assert statements) can be used to compare inputs and outputs.

When the tests pass, we can change from DuckDB to Spark. This helps decouple testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute resources during the iteration process.

This setup requires an abstraction layer to make the SQL execution agnostic to platforms and to make the data sources mockable. We use the open source Fugue layer to define the business logic once, and have it be compatible with DuckDB and Spark.

It is also worth noting that FugueSQL will support warehouses like BigQuery and Snowflake in the near future as part of their roadmap. So in the future, you can unit test SQL logic, and then bring it to BigQuery/Snowflake when ready.

For more information, there is this talk on PyData NYC (SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s

Fugue project repo: https://github.com/fugue-project/fugue/

FWIW, you can share a Spark session between unit tests, and even persist a Spark session throughout the day so your tests run against a hot session.

Straight TDD with spark is perfectly fine if you know what you're doing. I'm not saying it's easy or there's an easy guide somewhere, but it's possible.

If you're using Pyspark via the API, it's likely an incredibly important part of your process.

Fair enough, agreed. It is tricky to “mock” as you said.

Our CI/CD platform and its owners get unhappy if we spawn an ad hoc Spark session for testing purposes.

There is also a general expectation that unit tests are self-contained and portable, so you can run them on macOS, Linux, and ARM without much effort.

Another point is that we need to make this mocking / test setup easy, because data scientists and ML modellers are ideally the main people writing these tests.

So mocking the data source with an abstraction layer and passing pandas dataframes worked reasonably well for our use case.

Test the queries your application is making. I wouldn't put much effort into this. You have to trust that other people test their stuff anyway, so why treat the database differently? I'd much rather test that your backups work.

And that can be done by dumping the database (possibly verifying the content of that dump), taking a backup, restoring the backup to a fresh container, then comparing dump of that freshly restored database to the one you took at the start.

He is the one trying to test his stuff though, hence the question.
Use a testing framework to load some mock data into the DB, run your queries, and verify the result. Make sure you have a variety of data in your tests to fully exercise the surface of logic you expect to hit.

Basically, treat the query and database as a black-box for testing like you would another third party API call.

I would strongly suggest having a layer of code in your application that is exclusively your data access and keeping any logic you can out of it. Data level tests are pretty onerous to write in the best circumstances and the more complexity you allow to grow around the raw SQL the worse of a time you'll have - swapping out where clauses and the like dynamically is a cost you'll need to eat, and sometimes having a semi-generic chunk that you reuse with some different joins can be more efficient than writing ten completely different access functions with completely different internal logic so judgement is required.

At the end of the day a database is like any other third party software component - data goes in, data comes out... the nice thing is that SQL is well defined and you've got all the definitions so it's easier to find the conditional cases you need to really closely tests... but databases are complex beasties and it'll never be easy.

I think the tricky part is that the DB itself might get funny as time goes by; downstream logic changes might invalidate top-level queries that were built on now-dated assumptions.

Overall data integrity is hard.

The best tool nowadays has to be to spin up a database and execute the queries against it. If you are on a database setup where spinning up an instance takes a long time, consider Docker.

Be wary of too many techniques that are supposed to make it easier to test, but also make it hard for you to leave a query pipeline. In particular, SQL should be very easy to test in the "with these as our base inputs, we expect these as our base outputs" style. Trying to test individual parts of the queries is almost certainly doomed to massive bloat of the system and will cause grief later.

My favorite interview question. No, I mean when I'm being interviewed. The sheepish grins let me know I'm not alone.

Best ideas IMO (no particular order):

- make SQL dumber, move logic that needs testing out of SQL

- use an ORM that allows composing, disconnect composition & test (ie EF for .NET groups, test the LINQ for correct filtering etc, instead of testing for expected data from a db) (I see this has already been recommended elsewhere)

* edited formatting

Interesting. We have totally different philosophies here.

1. Moving logic out of SQL can break transactionality and hurt performance (more data leaves the DB)

2. ORMs hide the SQL from you, making all sorts of other things harder

My favoured approach is to test my application against a real local database, built from a fresh snapshot each time.

I do not change my application code to make testing easier. The application code is optimised for maintainability - so simplicity and ease of reading.

EF core makes it simple to test against a real sqlite db. It is super fast to run the tests and gives a reasonably realistic outcome.
> using interfaces

Kinda, but personally I describe it as using LINQ queries. The dbcontext just isn't hooked up. It's a method that takes in an IQueryable<T> (there's the interface, I suppose) and outputs a filtered IQueryable<T>. The unit test (see my next response) provides a test collection and expects a certain result.

> and integration tests

No, unit tests

Just as a real world (somewhat) counterpoint to this - you need to be very careful about performance metrics in particular.

Functionality is pretty easy to verify with LINQ / mock data sets (though it is also easy to mock things in a way that aren't representative of real data by mistake) but the performance characteristics of a real SQL engine vs. unit testing in this way can lead to some real gotchas once it's deployed. Particularly if you're using an ORM, multiple enumerations over database queries, strange ORM choices in joins/exists/whatever can absolutely murder performance.

I would still recommend having integration and/or performance testing downstream of these types of unit tests though I agree they are useful.

Great point, and agreed - actually, it argues even better against my point #1 about moving business logic out of SQL queries.
The teams I've been working on have resorted to data tests instead of code tests. That means that the data produced by your code is tested against a certain set of expectations - in stark contrast to code being tested _before_ its execution.

We've written our own tool to compare different data sources against each other. This allows, for example, to test for invariants (or expected variations) between and after a transformation.

The tool is open source: https://github.com/QuantCo/datajudge

We've also written a blog post trying to illustrate a use case: https://tech.quantco.com/2022/06/20/datajudge.html

Datafold (https://www.datafold.com/) is solving exactly that problem and has integration with dbt.
Related - how is any declarative language tested?

A quick web search confirms suspicions: it is not easy.

https://www.metalevel.at/prolog/testing

Your link has the answer: declarative testing. You need to think about why you wrote that code and declare some outputs for given inputs that give you confidence in your code.

The other tests from your link are just silly. You did not write code to terminate or to provide an answer for all inputs, you wrote code to provide the right answer.

To test some HTML for instance, think about what information you want the page to convey, load up a browser, and check that the info is displayed. Easy as that.

> You did not write code to terminate or provide an answer for all inputs...

Definitely not an expert. But to my eyes checking for non-termination seems very close to things that happen in SQL all the time.

For example, a parent query joins two subqueries that both map data, and then maps on that data. To test, I could just test one specific scenario with specific values in all columns. That would be more of a concrete test case.

But I might want something more robust checking the outer bounds of acceptable data (like a min and a max). That seems much closer to the non-termination test.

Of course, I'm open to being corrected on this.

SQL is not a declarative language. It is a functional language, with structured-language extensions on top. HTML is a declarative language.
> Although SQL is essentially a declarative language (4GL), it also includes procedural elements.

https://en.wikipedia.org/wiki/SQL

-3 votes shows how this is a common misunderstanding.

If SQL were a declarative language, there would not be expressions in it, for example SELECT (1+2), only SELECT 3.

These categories are based on the approach to programming, not the underlying technology.

Imperative languages (high level): a step-by-step description of a process, like giving orders to someone/something in a sequence, organized further with loops and conditions.

Analogue: cooking recipe. Languages: C, Pascal, Basic, etc (they can be structured or object oriented, still imperative)

Functional language (not the mathematical definition; forget stateless, immutable, etc. for a moment): it is an approach that solves the cardinality problem of the imperative approach. There is nothing which is "one", only "one of them". It's a little confusing how I'm trying to define this, but that is the most important aspect. We only filter and transform elements of a set, and even if this is (from a non-mathematical angle) nothing but syntax sugar, it gives the programmer a safe way to do things without caring about the details of enumerating things, null checks, etc. Or from another angle: functional programming eliminates loops, puts the sequence in the data, and keeps conditions. It is really handy, to be honest. Purely functional languages are mostly experiments, because making something purely functional is kind of self-motivated, but other languages will pick up more functional elements over time, which is great.

Analogue: an assembly line in a factory. Things come along and are changed (like a car body getting painted) or removed (like in quality control). Languages: LINQ, SQL, F#

Declarative language: the approach here is that instead of describing how to do something, you only describe what you want as a result. CSS is a perfect example (although it is picking up non-declarative elements nowadays; originally it was only capable of describing what a font or a paragraph looks like). HTML is also declarative. Instead of actually drawing a rectangle, you just say: this width, this height, at this position, etc.

Analogue: you order a coffee in a coffee shop. You don't care about any of the details, just the parameters of the coffee you want to drink. Languages: HTML, CSS

Of course there are blurred lines; you could say all high-level languages are declarative, because you never go down to the hardware level to do everything manually. But I think what matters is the approach, how you start to solve the problem, and seen from that point of view, SQL cannot be a declarative language. It was designed to be as close to English as possible so that non-IT people could do programming, but it successfully failed that purpose: it is a pretty good language for programmers, but it was never designed to be declarative.

> If SQL was a declarative language, there would not be expression in it, for example SELECT (1+2) , only SELECT 3

SQL is, I think, generally considered not entirely declarative, but that is not an example that shows it. Is there any declarative language that satisfies that, i.e. that has no addition (or operators? or addition of constants?)? I'm not really clear on what about it you think makes it not declarative.

By this rule, Prolog and HCL/Terraform are not declarative either.

One way of seeing this: everything that is a high-level language is declarative, but if we do that, we make this classification meaningless. It was just an example; an operator, which is an "activity", is not (usually) part of a declarative mindset. You can twist the idea of SQL to be declarative, but in that case we go back to my first sentence, and the distinction has no purpose.

I'll give you other angles.

Imperative has structures like: sequence (in code), loops, and conditions.

Functional:

** has no loops; the loop itself is the cardinality, which is always "multiple"

** has no sequence; it's encoded in the data (edit: actually it does have one)

** has (of course) conditions

So basically we redefine the fundamental elements of the imperative model.

Purely declarative languages have no sequence/loops or conditions (in the programming sense).

Answering you: HTML (especially earlier versions) is pretty close to purely declarative; it does not have loops or conditions. However, this is not 100% true, but close. I don't know the languages you mentioned, so I can't have an opinion about them.

> Declarative language: your approach here is instead of describing how to do something, you only describe what you want as a result. CSS is a perfect example, however it is picking up non-declarative elements nowadays, original it was only capable to describe how a font a paragraph look like. HTML is also declarative. Instead of actually drawing a rectangle, you just say, this width, height on this position, etc.

SQL fits this perfectly. For example JOINs don't describe actions, they describe relationships. Have you ever looked at what the query planners come up with to satisfy those relationships?

You can force SQL to fit this, but that way everything becomes a declarative language. The approach to solving problems is the difference.

SQL does not fit my definition, because you reach your goal through multiple transformations and filterings; you define the way, the process, not the end result. There are some comments below where I speak about this.

No.

SQL is almost the _textbook definition_ of a declarative language.

It does not define the way. It defines the end result.

You make a statement without reasoning or examples: an empty argument. You are not arguing any more, you just want to be right.

> This is the "textbook definition"

The textbook definition of gravity is F = G x ((M1 x M2) / r^2).

According to Newton's textbook. Ask today's physicists about this.

I mean, I get what you are aiming for. HTML, though, is a markup language. You can call it declarative, but you don't get anything other than the HTML that you create. That is, it is not generating anything. You type what you get. Put differently, it is not a program.

SQL is far and away understood as a declarative language for what data you want out of a relational database. I challenge you to find any literature that does not describe it as a declarative language.

Now, can we munge definitions and pull in an odd true scotsman argument about it not being a "true declarative" language? I mean, yeah. But, this is like arguing that LISP is not a functional language by some specific modern view of that term. Certainly true, but far from useful. And almost certainly not what anyone you would talk to expects from those terms.

"don't get anything other than the HTML that you create" - that's the point of declarative languages.

"That is, it is not generating anything." If you're talking about code generation, like code-behind in VS, it has nothing to do with the paradigm. If you're talking about the underlying technology, it does not matter. In the case of declarative languages, the framework determines what the language is capable of. And that's the point of them: they generalize the solution to common problems. The browser does a lot of things in the background when it draws a rectangle, the same way the SQL server does a lot of things when you execute a query, but these things have nothing to do with the programming paradigm, which can be understood without computers.

Imperative: cooking recipe

Functional: assembly line in a factory

Declarative: your assistant

(compare the real world examples to the things i mentioned)

> SQL is far and away understood as a declarative language for what data you want out of a relational database. I challenge you to find any literature that does not describe it as a declarative language.

Ok, and I challenge you to compare HTML and SQL. Are they the same by nature? Are the steps, the approach to problem solving, the same? Because if SQL is declarative, then it is really similar to HTML.

> Now, can we munge definitions and pull in an odd true scotsman argument about it not being a "true declarative" language? I mean, yeah. But, this is like arguing that LISP is not a functional language by some specific modern view of that term. Certainly true, but far from useful. And almost certainly not what anyone you would talk to expects from those terms.

We are speaking about semantics and taxonomy here. These kinds of questions are always debatable, so I can't really say anything about this, because you are speaking about ways of discussion. What I'm saying is that my taxonomy makes the most sense to me, and if someone can truly challenge it, that will change my mind, but "the literature describes it differently" is not an argument that makes me change my mind.

s.gif
Comparing to html is nonsensical. Period. It is not instructions on how to do anything, but markup of a document. Would be akin to asking if a bitmap is declarative. It is literally the data. It is not a declaration of what you want, but is a definition of what you have. In that vein, you would need a new taxonomy of "definitive" languages.

And again, I get where you are wanting to go. But you are literally arguing against all literature on sql. And your definition of declarative is restricted to only markup languages, evidently. Prolog would also not be declarative, by this view. Which is akin to arguing lisp is not functional. Technically true. The least useful way of being true.

If you want to know why sql is declarative, consider, how is a join performed? Table scan, hash join, merge? Yes, there are many ways to poke at the query plan, but that is like claiming c is assembly because there are extensions.

Even for DML and DDL statements, how is an insert actually performed into the backing store? When you create an index, what is created? How is the data copied to it? How is the data stored at all? A B-tree? Something else?

s.gif
"Comparing to html is nonsensical" No, absoltely not. We are speaking about programming paradigms. We can speak about those without computers, we never have to speak about HTML. A food recipe is a program, what you are executing when you cook, and it is imperative. Going to the restaurant you can achieve the same (the dish) but in a declarative way.

But I compare SQL and HTML for you. In both cases you are creating a text/a script in order to achieve something. The way how you achieve this can be really different in HTML and SQL. You can create a chess algo in SQL, but you cannot in HTML (purely), because in case of declarative language the framework around it strongly limits what the language could do. How would you explain this, if SQL and HTML are the same by nature? Focus on the text you write, because that is your product. Where is a condition is HTML, and why there is not?

A bitmap (like all data: XML, CSV, whatever) is not a language in the classical sense, but if you want to count it as one, yes, it is purely declarative.

"It is not a declaration of what you want, but is a definition of what you have" Nope, I can define this without having a computer:

imperative: cooking recipe

functional: assembly line in a factory

declarative: your assistant. Or, to make SQL declarative: you ask ChatGPT to write a query, and you define that query in English

"declarative is restricted to only markup languages" no, the form of the language is irrelevant. CSS is not markup, but still declarative.

Elements of languages cannot be declarative or not, you can only describe the entire language as such, because we are speaking about programming paradigms.

C is assembly at the end of the day, as everything else for the computer itself, but this is irrelevant here. From purely imperative, which is basically machine code to purely declarative (maybe chatgpt?) it's a scale, you put everything somewhere. Functional programming is syntax sugar over iterative, but it's really important one, as it changes how your mind try to solve problems. So basically what I'm saying everything is an abstraction over another one, down below the machine code, but this is just how thing works, not how you use your brain to solve things. I'm saying all this, because not the technical details, the implementations behind is the point when we speak about a purely theoretical classification which programming paradigm is.

s.gif
You keep retreading the same ground. Have you actually gone out and researched why so many people are telling you that SQL is declarative? Right now, it is reading as though you are being willfully ignorant of the entire field.

Worse than just ignoring the entire field. You are caught up in a strict taxonomy that just doesn't work. Cooking recipes are, amusingly, mixed. They are a declaration of all ingredients and supplies you will need, with an often hybrid list of instructions on how to mix them. Would I declare them as fully declarative? No, but nor would I declare that they are imperative only. That is a part.

And again, your definition of functional versus imperative versus declarative will get you in some amusing historical binds. With how strict you are trying to be, literally no programming language is declarative. Even Haskell has do notation.

Please give some effort to understand why so many people are telling you you are wrong. You are not wrong, in that if you expand the concept of SQL to everything that is possible with the language today, there are concepts that are not declarative in it. You are sadly misguided in thinking that makes your point. Would be like my sneaking in the "script" tag to laugh at how HTML is not declarative. It is a markup language, not a programming one. And people have done a lot to add to it so that it can do more than it originally could. Often out of necessity to get stuff done.

s.gif
You are taking languages apart, saying this part is imperative and this part is declarative. That does not make sense; only entire languages can be described like that.

But if you want: in the recipe, the ingredients can be like "int y;", a declaration, or something else like malloc(). Stating that I will need something is not declarative programming. Declarative means defining the end result. So neither the ingredients nor the declaration are declarative programming (if you want to turn a recipe declarative, you go to a restaurant), but as I said, you can't do this with elements of languages; it just does not make sense.

In practice, language paradigms can overlap in concrete products. You can use (and actually should use) C# for functional programming, but that does not make the language functional; the fundamental design is imperative. In the future, everything will be declarative (ChatGPT is the first glimpse), so in some way we're just wasting our time here. Also, speaking of AI, model training does not fit any of the current paradigms; it's going to be a new one, and maybe the only one, who knows...

s.gif
You tell SQL what you want, not how to get it. That's declarative.
  SQL : CSV :: GraphQL : JSON :: React : HTML
s.gif
Really?

SELECT CASE WHEN employee.type = 'contract' THEN salary WHEN employee.type = 'full-time' THEN salary + benefit_costs END FROM employee

Tell me how is this declarative?

- SQL is functional
- CSV is not a language, it's a data format
- GraphQL: I'm not familiar with it
- JSON is literally executable JavaScript code, arguably
- React is a JavaScript framework; binding is declarative in nature (if it has it, I don't know it too well), but it is not a language, it's a framework
- HTML is absolutely, 100% declarative, yes

s.gif
I'm not the person you were replying to but you've misunderstood their comment - they were saying that CSV is to SQL what JSON is to GraphQL (and HTML to React) these being declarative languages for records, objects, and markup

A "case" statement in SQL (or an "iif") is still declarative, how else would you express specific cases when necessary?

s.gif
Oh, I see, I didn't get it. However, I don't understand how "You tell SQL what you want, not how to get it. That's declarative" relates to that list.

iif and case by themselves are not declarative or imperative; only an entire language can be described as such.

s.gif
SQL is declarative in the sense that you dont specify the strategy to employ in retrieving the data (ie which index to access and how, whether to sort and merge or build a hash table etc etc) however you DO need to specify how to represent your data as records (which could entail specific cases for when certain values go in the same cell as you pointed out).
s.gif
You don't care how the SQL server gets the data, the same way you don't care what happens in the background when you use File.ReadAllText. The technology behind it does not define the programming paradigm. You can speak about these paradigms without computers.

Imperative programming: really detailed cooking recipe

Functional programming: assembly line of a car factory

Declarative programming: a robot that can do what it is programmed for, and sometimes can work with really unspecific instructions (a file containing only the line "Hello world" is not valid HTML, but the browser still renders it)

s.gif
I think what you're saying is accurate for describing what style of programming you might be doing; after all, we can write C++ in a declarative style. If you're coding in SQL then you're doing functional programming, but that doesn't change the fact that SQL is a declarative language for describing recordsets. You wouldn't refer to SQL as a functional programming language because it's not general purpose.
s.gif
No, programming style can be "hacky", "script kiddie", "architect-style design" or whatever. I'm speaking about "programming paradigms".

"functional programming, but that doesn't change the fact that SQL is a declarative language for describing recordsets"

Ok, I'm describing a recordset for you: this set has 4223 records. I'm thinking of something else, but I can't really describe a recordset with anything else. Describing is not what you do when you create software. You have a problem and you literally write a solution, like an author writes a book. And as you said, with SQL you use a functional approach, the way you do when you use a lambda or LINQ. Therefore SQL is functional.

SQL is Turing complete with the structured extensions, so in that way it can be general purpose, but whether something is general purpose does not decide the paradigm. Usually languages become more restrictive as you move toward the declarative side of things, but that is by design. The point is that you write less text.

s.gif
I’ve not encountered your definition of declarative before. SQL is often cited in cs texts as an example of a declarative language. That said, SQL does have a lot of imperative features, but those features are used to declare the result of the sql dml query.
s.gif
SQL does not make sense to me as a declarative language. How can you say that SQL and HTML are the same by nature? Also, I can't see why we can't argue about this against the literature. Saying "you are wrong because a 20-year-old textbook says you are wrong" is not an argument for me.
s.gif
Does the presence of conditional expressions mean that the language which permits them is necessarily not declarative?
s.gif
no, the approach makes the difference.

when you create HTML, you don't care how the rectangle is drawn; you focus on defining the result itself.

if SQL were declarative, then the code of the SQL above would be: GIVE ME ALL THE COSTS OF THE EMPLOYEES -- I DON'T CARE ABOUT THE DETAILS

the details, which in this example are the difference between full-time and contract employees, would be handled by the framework (SQL Server), which would already have a concept of the two kinds of employee, the same way HTML can have rounded or angular corners on a rectangle, because the framework already understands these ideas

an SQL query itself (if no structured language is used) is just a big function applied to a set. It never has a concept of the things you are doing, like building an accounting solution for a company. HTML has the concept of paragraphs, buttons, etc., all the things you want to do. But with SQL you define what you want step by step, through conditions and transformations.

or, another way to separate them: imperative is a combination of sequence/condition/loop

functional: no loop (the cardinality is always multiple), sequences are encoded as data, conditions <- yes, it has conditions

s.gif
> if sql was declarative, than the code of the SQL above would be: GIVE ME ALL THE COST OF THE EMPLOYEES --, I DONT CARE ABOUT THE DETAILS

This is exactly what SQL is. There are exceptions, but they're exceptions.

> an SQL query itself (if no structured language used) is just a big function applied on a set.

I don't think you have a correct model of SQL. SQL is a language that expresses definitions, not functions. It is absolutely declarative.

Some extensions of SQL can provide functional capabilities, yes, but they are exceptional.

s.gif
>This is exactly what SQL is. There are exceptions, but they're exceptions.

no, I meant the query is literally this: GIVE ME ALL THE COSTS OF THE EMPLOYEES

or, if we want to formalize it more: ALL COSTS OF ALL EMPLOYEES

the point here is that in a declarative mindset EMPLOYEE is a word that SQL Server understands, like HTML understands a paragraph <p> or a table with rows and cells <table><tr><td>

>I don't think you have a correct model of SQL. SQL is a language that expresses definitions, not functions. It is absolutely declarative.

that's just not true. Without defining the how, through transformations and conditions, the only thing you get out of the SQL server is a table as it is stored. Until you reach a certain complexity you are right, it can be seen as declarative, but my point is that the approach to problem solving is what separates them. You don't even have to speak about computers:

a cooking recipe VS assembly lines in factories VS a robot with AI, which can do whatever it is programmed to, but in certain cases works out the details by itself when they are not given specifically (not the best example, but I'm starting to get exhausted)

Also, how would you say that HTML and SQL are the same by nature?

s.gif
it is possible to reduce your query with case statements to essentially "all cost of all employees": if you model every type of employee and cost relationally (and then add some view for all-employees which is the union of each type, projecting the appropriate cost and so on), then your query gets to just ask for records again without having to "tell how" those things are to become records.

I don't think HTML is anything like SQL (only in the sense that all markup languages are also declarative), but a comparable situation would be if you were trying to render some ASCII art in HTML and had to use pre tags in order to tell the layout engine how to show it.

s.gif
It is not possible today, but it is the future; you already get a glimpse with ChatGPT creating SQL for you. Now that is real, 100% declarative SQL, no debate about that, as everything will be at some point in the future.

What you are describing here is the abstraction you create on top of a primitive system (relative to the system you design), introducing ideas like "employee", "unions", etc. If you do a good job, when you program you combine general-purpose systems and introduce new ideas in a way that another human can understand (words), but in a certain way you have "made the computer system understand" those things too. That's your job. So basically you teach SQL Server that a union employee is a thing, with the dbo.IsEmployeeUnion function.

In the case of HTML you are not introducing new things. You actually can, because the browser is really flexible (for instance, I make non-functional quasi-elements a lot, to make the HTML code itself more understandable), but HTML is not designed for you to introduce new elements/ideas (originally; nowadays kind of, with CSS). Anyway, this is how you can see the difference from another angle.

I keep trying to tell all of you that the technical details, the implementation, do not matter, only the language itself, when we speak about paradigms, but it seems I'm failing.

By the way, I think training AI models does not fit any of the paradigms we have here; it's going to be (or already is?) a new way to solve problems, and that also has nothing to do with files or any kind of implementation. What matters is what you see on the screen and what you do.

s.gif
You do not understand SQL. HTML and SQL are essentially unrelated. It would be to your benefit to spend some time reading and learning on these topics.
s.gif
Of course they are unrelated; however, they are both languages. And you classify them into the same category, so they should have some similarities, right?
s.gif
The most important requirement for a functional language is that functions are first-class values. i.e. Not SQL.

On the other hand, the details of query execution are left to the planner and optimizer.

What's the case that it's functional, but not declarative?

s.gif
You are speaking about things called delegates: typed functions, defined by the parameters they take and the data they return. Using delegates does not make a language functional. If you really want to speak about the underlying technology, then every high-level language is declarative. You can draw a triangle by drawing three sides, or you can just put this code in a function like drawtriangle(v1,v2,v3), and ta-da, your language is declarative. The underlying tech is not important; the approach of the developer is important. See my longer answer above.
s.gif
Your comments in this thread have made for a pretty tough read, but I think your angle is finally made clear here.

Supposedly, “every high level language is declarative”.

It’s an opinion I suppose, but I doubt it’s one you’re going to find much support for. What I think this feels like to most people is that things aren’t what people think they are because you’ve decided to reimagine the commonly used definitions of the words used to describe those things.

Not sure if it’s a straw man exactly, but I feel like I’m in a straw man’s garden.

s.gif
I'm sorry, I'm speaking with multiple people on multiple threads here and don't have much time to express myself perfectly. "Every high-level language is declarative" is an interpretation that can make sense from a certain angle: when you use an imperative language, you create mini-libraries for yourself, and those are already steps toward a declarative variation, but this does not change the fundamental design of the language.

But this is not my angle; I say the borders are drawn by the way you, as a programmer, approach solving a problem. My take is that it is declarative when you define the *end result* you want to get. You certainly don't do this with SQL, because in that case your product (what you write in the query) would be the end result set, the data itself. Of course, in the case of HTML you don't literally draw the rectangle in the code, but you do describe the end result. This is the difference.

The other arguments you wrote are out of scope for me in this conversation.

s.gif
That's quite the slippery slope though.

You don't literally draw a rectangle with HTML, but you also don't tell the database's query planner how to do its job.

You're also not comparing things on equal terms. HTML isn't any kind of programming or scripting language.

If I understand you correctly, SQL would only be declarative if the user literally wrote the entire result set of the query. This would then obviate the need for SQL. The result set of a query is called "data". It's not a declarative language.

You're free to redefine terms, but as I mentioned, you're going to have a hard time finding people to agree with you.

s.gif
Technical and implementation details again. When we speak about a programming paradigm, we speak about the language alone, not about any software system; this is a theoretical discussion. It does not matter how it is executed; what matters is how you express yourself in that language to achieve your goals.

There are no bits, bytes, and CPUs in this conversation. This is classification and taxonomy of artificial languages created for a purpose other than communication between two persons.

In SQL you give instructions in a functional way of thinking. Functional programming is a restricted variation of imperative programming, for a purpose, the same way object-oriented programming is basically structured programming with restrictions imposed on the programmer (for a purpose again, and a very good reason). So fundamentally SQL is imperative, but the details of that are completely hidden and therefore irrelevant.

In HTML you describe the result itself. So "SQL would only be declarative if the user literally wrote the entire result set of the query" is not completely, but somewhat, true. The way you define a result can vary, but SQL definitely does not do that. The way I could imagine SQL as declarative is if I "program by example" or ask ChatGPT to translate my natural language and create the query. But these are forced examples, because IMO SQL is absolutely not declarative, and can't really be.

Whether HTML is a programming language or not is debatable, but I think it's enough that it is a language (programming or not) which was designed to describe a flowing text layout (like a Word document) in its first versions, and then other visible elements, so I think the question is not completely, but somewhat, irrelevant from the viewpoint of paradigm.

s.gif
> this is a theoretical discussion

With all due respect, I don’t understand what your aim is with this discussion, and at this point I think I’m more confused than when we started. What is your aim here?

> If HTML is a programming language or not, it's debatable

I suppose to some degree, everything is debatable. But this point isn’t somewhere where we’re going to find common ground. Another commenter said it well — with HTML, you get what you write. It’s not a programming language.

In any case, I appreciate you taking the time to expand on your reasoning. I’m just still struggling to make heads or tails of it.

s.gif
SQL is absolutely a declarative language. It is practically the textbook definition of a declarative language.
s.gif
Okay, I'll grant it's not pure declarative.

But the part that is declarative is sure difficult to test.

s.gif
everything is partly declarative. You can draw a rectangle line by line, or you can put that code into a function or a procedure and call it declarative, because it is something like that.

I didn't speak about testing; however, what a declarative language can do is defined by the framework under it, so you test that, I guess?

I don't think there's any perfect universal answer for this. I only have a few things I've done that work.

Rails for Ruby comes with some pretty nice setups for testing the database code. There's a test DB by default with the same schema as Production, and the usual test frameworks (FactoryBot and RSpec) make it easy to set up some data in the actual DB for each spec, run model code that makes actual SQL queries, and assert against the results.

I would have hoped most other web frameworks would make as much effort to make it straightforward to test your database code, but it doesn't really seem to be the case.

In Rust, there's a very handy crate called sqlx. What it does is, at compile time, it runs all of the SQL in your codebase against a copy of your database to both validate that it runs without errors and map the input and output types to typecheck the Rust code.

When it comes to stuff like validating that your queries are performant against production datasets or that there isn't any unexpected data in production that breaks your queries, well I pretty much got nothing. Maybe try a read replica to execute against?

Testing databases with Docker, dummy data, etc. can be very slow so it’s a big win to use Bazel as the test executor. This enables caching between runs and between machines. Saves us about 20 mins of CI time every build.
GitHub Actions trivialized this for us. Spawning a Postgres database for testing is easy and carefree: the workflow spawns it, runs operations, and evaluates the state of the database after each operation.

We have two flavours of test: one that drops the transaction each time, ensuring a clean, known state. And one that doesn’t, allowing your tests to avoid lots of overhead by “walking through a series of incremental states”.

Yes, some might call the latter heresy. But it works great.
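A minimal pytest sketch of the first flavour, assuming a Postgres instance is already running; the DSN and the `users` table are placeholders:

    import psycopg2
    import pytest

    @pytest.fixture
    def conn():
        # Connect to the test database spawned by CI (the DSN is an assumption).
        connection = psycopg2.connect("dbname=test_db user=postgres host=localhost")
        connection.autocommit = False
        yield connection
        # Roll back whatever the test did, restoring the clean, known state.
        connection.rollback()
        connection.close()

    def test_insert_is_visible_inside_transaction(conn):
        with conn.cursor() as cur:
            cur.execute("INSERT INTO users (name) VALUES (%s)", ("alice",))
            cur.execute("SELECT count(*) FROM users WHERE name = %s", ("alice",))
            assert cur.fetchone()[0] == 1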

On a related note (though it does touch upon testing) mitchellh open sourced Squire[1] recently (today?). Note though, that it's PostgreSQL specific.

[1] https://github.com/mitchellh/squire

You get a lot of bang for your buck with an expected input/output setup. In basically every database it's trivial to set up a few tiny tables, and it's cheap to run a query on small input. Pick a few edge cases and a few non-trivial representative examples, and any passing query written by a real person will likely express the logic you care about or will expose an additional input/output pair to add to the tests. Combine the high efficacy with the ease of writing and understanding such a test, and it becomes hard to argue against having at least a few.
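For instance, a minimal sketch with Python's built-in sqlite3 (the table and column names are made up for illustration):

    import sqlite3

    # The query under test: revenue per customer, which must handle
    # customers with no orders (one of the edge cases picked deliberately).
    QUERY = """
    SELECT c.id, COALESCE(SUM(o.amount), 0) AS revenue
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
    """

    def test_revenue_per_customer():
        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE customers (id INTEGER)")
        con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
        con.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
        con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (1, 5.0)])

        # Customer 2 has no orders and must still appear with revenue 0.
        assert con.execute(QUERY).fetchall() == [(1, 15.0), (2, 0)]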
I regret building an integration in SQL Server Integration Services, with some nice add-on tools that can make HTTP requests and such. In the end it is an untestable, hard-to-follow solution.

Anyone have any recommendations on testing SSIS ?

Well, I used to work at a place where they used https://tsqlt.org/ for testing, and it worked great.

I don't know the technical details of how to set it up; it was already set up when I worked there.

But basically, we wrote SQL scripts that included statements to:

1. create the db structure, tables or views

2. insert test data (you can insert corner cases etc.)

3. run the function or procedure

4. run an assert to confirm the results match our expectations

The test scripts were run by the CI/CD process.

To be honest.

I would love testing to work.

Have set up and maintained several unit test suites in Jest.

Wrote several large e2e test suites in Cypress.

I don't think anyone saved time compared to simply having a manual checklist and testing manually.

Maybe me and my former teammates are doing it wrong. Talking 8+ teams, from corporate to startup.

But I'd love to be proven wrong. E2E tests definitely caught the most issues.

s.gif
I'm not sure I understand your comment, you seem to say that automating testing isn't better than manual testing, yet the e2e tests caught most issues?
Side note: I believe it is good form to always have unit tests that test the "up" and "down" for every single migration in your app. It's not always possible but if you're strict about it you can avoid a lot of bad patterns and have a much healthier set of migrations
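A sketch of what that can look like with Alembic migrations (the config path is an assumption; other migration tools have equivalent up/down commands):

    from alembic import command
    from alembic.config import Config

    def test_migrations_up_down_up():
        # Points at the project's alembic.ini; the path is an assumption.
        cfg = Config("alembic.ini")

        # Apply every migration, revert them all, then apply again. This catches
        # "down" steps that were never written or that drifted from the "up" steps.
        command.upgrade(cfg, "head")
        command.downgrade(cfg, "base")
        command.upgrade(cfg, "head")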
In GraphJin, an automatic GraphQL-to-SQL compiler, we use the gnomock library: it starts up a database instance (Docker), then creates the schema and test data, and finally our code connects to it and runs a series of tests. We run these across MySQL, Postgres and a few other DBs; Gnomock supports a wide range of them. Right now we don't take down the DB for every test, only between test runs, but it's fast enough that we could. This whole thing runs off a simple `go test -v .` command, and we run it on every commit using a githook. https://github.com/dosco/graphjin/blob/master/tests/dbint_te...
.NET Shop using SQL Server here, but I think something similar to what we do can apply to any stack. We use TestContainers [1] to spin up a container with SQL Server engine running on it. Then use FluentMigrator [2] to provision tables and test data to run XUnit integration tests against. This has worked remarkably well.

[1] https://dotnet.testcontainers.org/

[2] https://fluentmigrator.github.io/

Realistically, most of my tests are integration / end to end tests. They typically get written only when it comes to patch time, where you first want proof that the old system works before you tear it apart and rebuild it. I think that’s probably the only SQL testing I’ve ever done and honestly, if they are fast enough, that kind of integration testing is all you will need too.

As the meme says: App worked before. App works afterwards. Can't explain that.

I started in the software engineering space and moved into data engineering, and I was floored by the complete lack of tooling. There is a HUGE gap between software engineering and data engineering when it comes to both tooling and practice. Even the simplest "unit" test of "is the SQL statement valid?" is not all that common in frameworks and tooling, but in practice that accounts for something like 90% of the production failures I've seen.

Starting with a framework that is programming-language-first (e.g. Spark) can help you build your own tooling so you can actually write unit tests. It's frustrating, though, that this isn't common across other ETL tooling.

We use this and take an example-based tests approach for any non-trivial DBT models: https://github.com/EqualExperts/dbt-unit-testing

More trivial example:

    {%
        call dbt_unit_testing.test(
            'REDACTED',
            'Should replace nullish values with NULL'
        )
    %}
        {% call dbt_unit_testing.mock_source('REDACTED', 'REDACTED', opts) %}

            "id" | "industry"
            1    | 'A'
            2    | 'B'
            3    | ''
            4    | 'Other'
            5    | 'C'
            6    | NULL

        {% endcall %}

        {% call dbt_unit_testing.expect(opts) %}

            "history_id" | "REDACTED"
            1            | 'A'
            2            | 'B'
            3            | NULL
            4            | NULL
            5            | 'C'
            6            | NULL

        {% endcall %}
    {% endcall %}
Here's a nifty python module for testing against postgres that a friend wrote. It creates a temporary database on the fly for testing and tears it down at the end. You can then populate the database and run all kinds of integration tests against it.

https://github.com/ugtar/pg_temp

Phoenix/Elixir/Ecto seems to handle this really well. When you write a test in Phoenix, you can use `DataCase` which automatically creates sandboxed transactions [1] for each test. This makes writing integration tests a breeze.

[1] https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.html#module-sq...

If I were building a tool to test SQL, then I'd try to load the SQL into a dataframe, then test it by mocking the tables and the output. This is a tough problem to solve. If testing is important, possibly move away from SQL and towards ORMs.
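A sketch of that dataframe idea with pandas over an in-memory SQLite mock (the `invoices` table is hypothetical):

    import sqlite3
    import pandas as pd

    def run_query_on_mock(query: str) -> pd.DataFrame:
        # Mock the table(s) the query reads from, then capture its output as a frame.
        con = sqlite3.connect(":memory:")
        mock = pd.DataFrame({"id": [1, 2], "status": ["paid", "refunded"]})
        mock.to_sql("invoices", con, index=False)
        return pd.read_sql_query(query, con)

    def test_only_paid_invoices_are_selected():
        out = run_query_on_mock("SELECT id FROM invoices WHERE status = 'paid'")
        assert out["id"].tolist() == [1]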
I use a fake object in place of a database connection which gives fake responses when the correct SQL query is sent to it.

Example:

db = Fake().expect_query("SELECT * FROM users", result=[(1, 'Bob'), (2, 'Joe')])

Then you do:

db.query("SELECT * FROM users")

and get back the result.

In Python if you do this in a context manager, you can ensure that all expected queries actually were issued, because the Fake object can track which ones it already saw and throw an exception on exit.

The upside of this is, you don't need any database server running for your tests.

update: This pattern is usually called db-mock or something like this. There are some packages out there. I built it a few times for companies I worked for.
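A minimal sketch of such a fake in Python (the names are illustrative, not a specific package):

    class Fake:
        """In-memory stand-in for a DB connection: canned results, no server needed."""

        def __init__(self):
            self._expected = {}   # query text -> canned result
            self._seen = set()

        def expect_query(self, sql, result):
            self._expected[sql] = result
            return self

        def query(self, sql):
            if sql not in self._expected:
                raise AssertionError(f"unexpected query: {sql!r}")
            self._seen.add(sql)
            return self._expected[sql]

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc, tb):
            # On clean exit, verify every expected query was actually issued.
            missing = set(self._expected) - self._seen
            if exc_type is None and missing:
                raise AssertionError(f"expected queries never issued: {missing}")

    # Usage, as in the example above:
    with Fake().expect_query("SELECT * FROM users",
                             result=[(1, "Bob"), (2, "Joe")]) as db:
        assert db.query("SELECT * FROM users") == [(1, "Bob"), (2, "Joe")]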

s.gif
This is testing the code that emits the SQL code, not the SQL code itself. Sometimes that's what you want, but it's not what OP asked for.
I experienced this issue in many companies and found there to be no real solution, especially when not testing with the real data. This was one of the reasons I created DevDB [1], as it allows you to store a copy of the data and very quickly spin up containers to test large databases in CI, and verify that the output, performance, etc. are what you expect.

You can achieve the same thing by "docker commit"-ing data into Docker images of your DB engine of choice and firing your queries at them, but that only really works with smaller datasets.

[1] https://devdb.cloud

s.gif
Sorry for the offtopic remarks, but DevDB is relevant to my interests since I've been working on a similar/related and potentially synergistic tech as an on-again/off-again side project for a few years, and would love to chat with you about it. I've got the same username on GitHub and Twitter as here if you want to reach out through either of those channels.
From an SQL database implementation perspective, in my toy barebones Python SQL database that barely supports inner joins (https://github.com/samsquire/hash-db), I tested by running the same query on PostgreSQL and checking whether my query with two joins produces the same results.

I ought to write unit tests that prove the tuples from each join operation produce the correct dataset. I've only ever tested with 3 join operations in one query.

From a user's perspective, I guess you could write some tooling that loads example data into a database and does an incremental join with each part of the join statement added.

s.gif
Personally I like using dimensions from each set and seeing if the business-logic side lines up. For example, measure sales month over month from 2 different sources, and/or have the full join count important fields on matches, so that what makes it into and out of the combination acts as a scoreboard.
assuming you are asking about sql select statements, the problem is knowing what the correct answer is so you can test against it. for most data, you don't, and probably cannot know this.

not a unique problem with sql, btw.

s.gif
You can know what the answer is against a small test dataset though. Obviously the challenge is ensuring it's representative, that you hit the edge cases of real data etc. But it's better than nothing
s.gif
i have never worked with a small dataset. mocking one doesn't work, for the reasons you suggest, and others.

basically you cannot test queries against a big database. you just have to hope for the best.

s.gif
Just, what? This sounds terrible. Why can't you run a test query against a large dataset? If need be, anonymize your data set into an integ environment, and go nuts. Could there be a data combination that surprises you someday? Certainly. Most queries, though, are much much simpler than that and can, in fact, be tested.
So I'm not an expert, but for simplistic use cases I merely make use of https://github.com/oguimbal/pg-mem

It's a lot faster and easier than dealing with containers and the like.

I really love (as with lots of things) how Ecto from the Elixir community handles this: you have an extra database, and because Postgres has awesome transactions you can even run all these tests of your whole data layer in parallel, including any SQL. Ecto is largely a domain-specific language for writing modular SQL, so that helps test things too.
SQL pipelines can be tested pretty easily. If they look congested, use a snake (Python, if you like) to try to knock the CRUD out of them.
Surprised that only a single comment in this thread has mentioned the "explain" keyword that exists in MySQL.

Excellent for checking delete queries before running them.

If your app is API driven, I would create test data with the APIs, and test the queries against what that generates. APIs tend to change less often than the internal data representation, and the app developers can usually figure out how to fix the tests when they do.
We test it implicitly with integration tests across the API for the most part. We occasionally have unit tests (in C#) that directly call stored procs.

The backbone for this is that we spin up a DB per unit test, so we don't have to worry about shared state.

I try to do some kind of compile-time query checking. I really like sqlx with Rust, and other languages have some kind of equivalent (although maybe not as nice) like JOOQ. If you can store the queries in some kind of configuration, like SQL files, then this is easy no matter the language.
I've been using IntegreSQL and it works pretty well
s.gif
Happy to hear that! When it comes to testing services that depend on PostgreSQL, this is still my preferred solution.

https://github.com/allaboutapps/integresql

disclaimer: author

Abstraction layer between the query you write and the one that gets executed. This way you can mock the schema, run the query on the mock to assert some condition x.

A ref() concept like dbt's is sufficient. When testing, have ref output a different (test-x) name for all your references.
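A toy sketch of that ref() indirection in Python (the environment flag and model names are made up):

    import os

    # When set, every reference is redirected to a test-prefixed table.
    TEST_MODE = os.environ.get("SQL_TEST_MODE") == "1"

    def ref(name: str) -> str:
        """Resolve a logical model name to the physical table the query should read."""
        return f"test_{name}" if TEST_MODE else name

    # The query is written against logical names only; under test it reads from
    # test_orders / test_customers, which the test harness populated beforehand.
    query = f"""
    SELECT c.country, SUM(o.amount) AS revenue
    FROM {ref('orders')} o
    JOIN {ref('customers')} c ON c.id = o.customer_id
    GROUP BY c.country
    """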

Part of the solution could be using tooling which can compile time check SQL is valid like https://github.com/launchbadge/sqlx
in production, on a friday before I leave for vacation

for real though I love tools like SequelPro or TablePlus that let me work out a query before I bake logic or stuff into my apps. Also sometimes I use it to work out the data needed for reports. I am working with salesforce for the first time in my life and apparently there are tools that let me treat it like I'm used to SequelPro.

Creating a separate isolated environment is totally the way to go: migrate the data from your existing database into it. This way the data becomes "dummy" data, but it provides you with something to conduct testing with.
I always did that with integration tests. Put some data in the db, use the repository (aka your SQL), and validate the results.

Most of the time there is a layer around your SQL (a repository, a bash script or whatever) that you can use for integration testing.

In production ;)

But my app is for six users at one site, it’s not mission critical, and the sqlite DB is backed up hourly.

Life’s too short for (unnecessary) testing.

I've been wanting to put together a property testing framework that lets you specify properties as queries.
Disclosure. I'm CEO of Neon (neon.tech).

One of the premises that we have is the ability to instantly create a test environment by creating a branch. I'd love to hear what you think about it.

Disclaimer: I am building a data platform called Blast, and part of our focus is to make SQL pipelines easier to maintain at every aspect: easier to write, easier to test, easier to ensure it is correct before they are deployed. Link in my bio if you are interested, happy to have a conversation anyway. This is a problem we have been thinking and providing solutions for quite some time.

There are a few types of tests one would like from a SQL pipeline, each with a different value add:

- Quality assurance tests: these are things like DBT tests; they mainly test the accuracy of the result after the tables are produced. Examples would be tests like "this column should not contain any `null` values" or "it should contain only X, Y and Z values". They are valuable checks, but the fact of the matter is that in many cases running these sorts of tests after the data is produced is a bit too late.

- Integration tests: specify an input table and your expected output, and run your queries against it; the end result must match the expectations at all times. This is useful for running them regularly, and they serve as "integration tests" for your SQL assets. They allow validating the logic inside the query, provided that the input covers the cases that need to be covered, and they can be executed in CI/CD pipelines. We are exploring a new way of doing this with Blast CLI, effectively running a BigQuery-compatible database in-memory and running tests against every asset in the pipeline locally.

- Validation tests: these tests aim to ensure that the query is syntactically correct on the production DWH, usually using tricks like `EXPLAIN` or dry-run in BigQuery. These sorts of tests would ensure that the tables/fields referenced actually exist, the types are valid, the query has no syntax errors, etc.. These are very useful for running in CI after every change, effectively allowing catching many classes of bugs.

- Sanity checks: these are similar to the quality assurance tests described above, but with a bigger focus on making sense out of the data. They range from "this table has no more rows than this other table" to business-level checks such as "the conversion rate for this week cannot be more than 20% lower compared to last week". They are executed after the data is produced as well, and they would serve as an alerting layer.

There is no silver bullet when it comes to testing SQL, because in the end what is being tested is not just the SQL query but the data asset itself, which makes things more complicated. The fact that SQL has no standardized way of testing things and the language has a lot of dialects make this harder than it could have been. In my experience, I have found the combination of the strategies above to have a very good coverage when it comes to approximating how accurate the queries are and how trustworthy the end result is, provided that a healthy mix of them is being used throughout the whole development lifecycle.
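As one concrete example of the validation tests above, here is a minimal sketch that runs `EXPLAIN` over a directory of queries against Postgres with psycopg2 (the DSN and file layout are assumptions; on BigQuery you would use a dry-run job instead):

    import glob
    import psycopg2

    def validate_queries(dsn="dbname=warehouse", pattern="models/*.sql"):
        """Collect queries that reference missing tables/columns or have syntax errors."""
        failures = []
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            for path in sorted(glob.glob(pattern)):
                sql = open(path).read()
                try:
                    # EXPLAIN plans the query without executing it, so missing
                    # relations, unknown columns and syntax errors surface here.
                    cur.execute("EXPLAIN " + sql)
                except psycopg2.Error as exc:
                    failures.append((path, str(exc).strip()))
                    conn.rollback()  # clear the aborted transaction state
        return failures

    if __name__ == "__main__":
        for path, err in validate_queries():
            print(f"{path}: {err}")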

Relational databases rely on math. You may test your implementation, or the input to your statements.
This approach doesn't use an ORM and runs the tests concurrently against the same database.

I follow these steps in my pipeline:

Every time I commit changes, the CI/CD pipeline follows these steps, in this order:

- I use sqitch for the database migration (my DB is postgresql).

- Run the migration script `sqitch deploy`. It runs only the items that haven't been migrated yet.

- Run the `revert all` feature of sqitch to check if the revert action works well too.

- I run `sqitch deploy` again to test if the migration works well from scratch.

- After the schema migration has been applied, I run integration tests with Typescript and a test runner, which includes a mix of application tests and database tests too.

- If everything goes well, then it runs the migration script to the staging environment, and eventually it runs on the production database after a series of other steps on the pipeline.

I test my database queries from Typescript in this way:

- In practice I'm not strict about separating the tests for the database queries from the application code. Instead, I test the layers as they are being developed, starting with simple inserts into the database, where I test the application CRUD functions being developed, plus the fixture generators (the code that generates synthetic data for my tests) and the deletion and test-cleanup capabilities.

- Having that boilerplate code, I then start testing the complex queries, and if a query is large enough (and assuming there are no performance penalties from using CTEs in those cases), I write my large queries in small chunks as CTEs, like this (replace SELECT 1 with your queries):

    export const sql_start = `
    WITH dummy_start AS (
        SELECT 1
    )`;

    export const step_2 = `${sql_start},
    step_2 AS (
        SELECT 1
    )
    `;

    export const step_3 = `${step_2},
    step_3 AS (
        SELECT 1
    )
    `;

    export const final_sql_query_to_use_in_app = `${step_3},
    final_sql_query_to_use_in_app AS (
        SELECT 1
    )

    SELECT * FROM final_sql_query_to_use_in_app
    `;

Then in my tests I can quickly pick any step of the CTE and test it:

    import {step_2, step_3, final_sql_query_to_use_in_app} from './my-query';

    test('my test', async t => {

        //
        // here goes the code that load the fixtures (testing data) to the database
        //

        //this is one test, repeat for each step of your sql query
        const sql = `${step_3} 
            SELECT * FROM step_3 WHERE .....
        `;
        const {rows: myResult} = await db.query(sql, [myParam]);
        t.is(myResult.length, 3);

        //
        // here goes the code that cleanup the testing data created for this test
        // 
    });


and on my application, I just use the final query:
        import {final_sql_query_to_use_in_app} from './my-query';

        db.query(final_sql_query_to_use_in_app)

The tests start with an empty database (sqitch deploy has just run on it), then each test creates its own data fixtures (this is the most time-consuming part of the test process) with UUIDs as synthetic data, so I don't have conflicts between each test's data. That makes it possible to run the tests concurrently, which is important for detecting bugs in the queries too. Also, I include a cleanup process after each test, so after finishing the tests the database is empty of data again.

For SQL queries that are critical pieces, I was able to develop thousands of automated tests with this approach, in addition to combinatorial approaches. In cases where a column of a view is basically an operation over states, if you write the logic in SQL directly, you can test the combinations of states from a spreadsheet (each column is a state): combining the states, you can fill in the expectations directly in the spreadsheet and give it to the test suite to run the scenarios and expectations by consuming the CSV version of your spreadsheets.

If you are interested on more details just ping me, I'll be happy to share more about my approach.

s.gif
This is a beautiful and powerful approach (and great explanation), IMO. Personally, I'd keep all CTEs (your steps) as units and combine them separately. But you've commented on that in your other comment.
s.gif
how do you know that what your select statements return is correct? on a real database?
s.gif
Yes, it always runs on a real database, without mocks, as integration tests where you can test each CTE's auxiliary statement separately, each acting as a step of our SQL pipeline. So the initial data is inserted into the tables, and then I can run the assertions for the SQL queries or views against the real database. In theory it would be easy to rearrange the concatenation of the CTE strings above so a step can be tested as a unit, by materializing the previous auxiliary step as a temporary table, but I never had the need for that since the integration tests are simpler and work really well for me. The essential part of my approach is to treat the SQL code as concatenated pieces of strings, and to call slices of that with just the right concatenation to exercise the test for that slice on the real database, which works because the query will always be valid.

There is another pattern too. When I implemented RLS-based multi-tenancy with RBAC support, which needed a relatively large SQL codebase and had to be battle-tested because it was critical, I split a big part of the SQL code into SQL functions instead of views to test the code units or integrations (using something similar to dependency injection, but for the data, to switch the tenants' RBAC contexts), because with SQL functions I can pass different PostgreSQL configuration parameters to test different tenants, for example.

Excellent question. Not sure why it's getting no traction.

For my own use-cases, I usually test this at the application level and not the DB level. This is admittedly not unit-testing my SQL (or stored procs or triggers) but integration-testing it.

s.gif
With ORMs you can get pretty close to this being unit testing for the DB though.
s.gif
I haven't seen an ORM that handles analytical queries well. I'd rather write raw SQL than use SQLAlchemy for complex queries with multiple joins, aggregations, and window functions.
s.gif
I agree about the limitations of ORMs.

However I have had great luck with using an ORM to load up the database and data, and then having a unit test that calls the function which does raw SQL in the middle. And now the raw database tests are integrated with the unit tests for the rest of the environment in a way that keeps them synchronized with the application code that also interacts with the same database.

And, of course, the limits of ORMs do not change the fact that they have use cases.

s.gif
There’s ORMs and there’s ORMs - at one end you have the (reprehensible) Active Record anti-pattern, at the other end you have EF Core extended with one’s own build-time type generation - they’re both “ORMs” to everyone involved, but they’re totally incomparable.

…not to say they that EF Core doesn’t have flaws (it does, and they’re legion) but the ORMs of today are nothing like the ORMs of the 1990s… or even like 2010’s NHibernate.

s.gif
The problem with writing raw SQL (which I do personally prefer myself, too) is now you need to generate types and/or mappings for each distinct query’s resultset schema - doing that by-hand is tedious and error-prone (or use untyped dict objects for every row, ew) - so what you really need is a project build-step that finds every query in your project and runs it against a prototype database instance in order to get schema result typing info, then generates the strong-types/mappings code for you before everything else gets compiled…

…and it works - but now you have possibly thousands of classes/structs that are all-so-similar but also subtly different - namely disjoint members (so they can’t exist in an inheritance hierarchy, e.g. NewUser won’t have a UserId value, result-types would be immutable, unless they need to be mutable, etc…). It’s all such a huge pain. In a C# project of mine that does something like this, it means that every business-entity typically has at least ~5 actual class/struct/interface types associated with it: e.g. NewUser, IReadOnlyUser, IWritableUserValues, struct UserKey, MutableUser, UpdateUserValues, etc.

…surely there’s a better way?

s.gif
I like the approach of just using reflection to do the mapping and throw an error when things don't quite match up like with https://stackoverflow.com/a/21956222/7608007. Combined with something like record types or Lombok, it's not that much effort to create a bean class for each result mapping you care about.
s.gif
The problem with using something like runtime reflection is you lose compile-time type-safety (i.e.: the guarantees that a particular resultset contains a particular column, its type and nullability is correct, etc) .
s.gif
For non-complex data, sure. I think SQL testing gets much harder once you have hierarchies and many-to-many relationships piling up.
I used to work on SQL pipelines, sadly I didn’t get to implement some of these suggestions, but we did implement others.

For testing:

Run your query/pipeline against synthetic/manual data that you can easily verify the correctness of. This is like a unit test.

Run your query/pipeline on sampled actual data (eg 0.1% of the furthest upstream data you care about). This is like an integration test or a canary. Instead of taking 0.1% of all records you might instead want to sample 0.1% of all USERID so that things like aggregate values can be sanity checked.

Compare the results of the new query to the results from the old query/pipeline. This is like a regression test. You may think this wouldn’t help for many changes because the output is expected to change, but you could run this only on e.g. a subset of columns.
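A sketch of that regression check (the engine, table names, and sampling predicate are placeholders):

    import duckdb  # or any DB-API connection to your warehouse

    # Compare only columns expected to be unchanged, on a small sample of users.
    OLD_QUERY = "SELECT user_id, total FROM daily_totals_v1 WHERE user_id % 1000 = 0"
    NEW_QUERY = "SELECT user_id, total FROM daily_totals_v2 WHERE user_id % 1000 = 0"

    def test_new_pipeline_matches_old_on_sample():
        con = duckdb.connect("warehouse.duckdb")
        # Sort so that row order (which SQL does not guarantee) cannot fail the test.
        old_rows = sorted(con.execute(OLD_QUERY).fetchall())
        new_rows = sorted(con.execute(NEW_QUERY).fetchall())
        assert old_rows == new_rows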

Take the output of the new query (or sampled query, or the manual query) and feed it to whatever is downstream. This is like a conformance test.

For reliability:

If the cost is not prohibitive, consider persisting temporary query results (eg between stages of your pipeline) for 1-2 weeks. This way if you catch a bug from a recent change you only need to rerun the part of your pipeline after the breakage. May not make sense to do if your pipeline is not big

If the cost is not prohibitive you could also run both the new and old versions of the pipeline for ~a week so that you can quickly “rollback”. Ofc whether this is viable depends on what you’re doing.

The big failure modes with SQL pipelines IME are

1. unexpected edge cases and bad data causing queries to fail (eg you manually test the new query and it works fine, but in production it fails when handling Unicode)

2. not having a plan for what to do when a bug gets caught after the fact

3. barely ever noticing bugs or lost data because nobody is validating the output (for example, if you have a pipeline that aggregates a user’s records over a day, any USERID that’s in the input data for that day should also be in the output data for that day).

4. This can be very hard to solve depending on your circumstances, but upstream changes in data are the most annoying and intractable to solve. The best case here is you either spec out the input data closely OR have some kind of testing in place that the upstream folks run before shipping changes.

To address these, you need to take the approach of expecting things to fail, rather than hoping they don’t. This is common practice in many SWE shops these days but the culture in the data world hasn’t quite caught up. I think part of the problem is that automating this testing usually requires at least some scripting/programming which is outside the comfort zone for many people who “just write SQL.”

Have small databases.

Learn to use IMPORT TABLESPACE in MySQL or just dump and import SQL.

Every time you run a test you set up the mock databases again.
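A sketch of that reset step with pytest and the mysql client (the database name and dump path are assumptions):

    import subprocess
    import pytest

    @pytest.fixture(autouse=True)
    def fresh_test_db():
        # Recreate the schema and seed data from a dump before every test.
        with open("fixtures/test_dump.sql", "rb") as dump:
            subprocess.run(["mysql", "test_db"], stdin=dump, check=True)
        yield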

It depends a lot on your use case. In my case we have SQL running against tables with trillions of rows, so we need to look at every single SQL query in the code that runs for more than tens of milliseconds or often enough to be significant. There is no automating a good DBA looking at an execution plan; I heard about a guy at some financial company whose job for the past 10 days was to tune the same ~10 queries to death. But if your app works with a database that could be hosted comfortably on a smartphone, none of this is needed.
Two days ago, everyone: “You should use SQL because everyone knows it and modern SQL is pretty good.

Other languages are too complicated. :(“

Everyone today: “tries using sql

Oh wow, the tooling is quite basic, and you can’t express complex data structures and imperative code. :(“

What did you expect?

Look, I spent 4 years in this rabbit hole, and here’s my advice:

Don’t try to put the square peg in the round hole.

You want easy to write, simple code and pipelines? Just use sql.

Have a dev environment and run everything against that to verify it.

Do not bother with unit testing your CTEs, it’s hard to do, there are no good tools to do it.

If you want Strong Engineering TM, use python and spark and all the python libraries that exist to do all that stuff.

It won’t be as quick to write, or make changes to, but it will be easier to write more verifiably robust code.

If you treat either as something it is not (eg. Writing complex data structures and frameworks in sql) you’re using the wrong tool for the outcome you’re trying to achieve.

It’ll take longer and “feel bad”, not because the tool is bad, but because you’re using it in a bad way.
