Ask HN: Does anyone else think SQL needs help?

		Ask HN: Does anyone else think SQL needs help?
	31 points by fny 5 hours ago \| hide \| past \| favorite \| 67 comments
	I have been writing a decent amount of SQL for three years now, and while the paradigm is extremely powerful, the language lends itself to unmaintainable spaghetti code quickly. Fundamentally, SQL lacks meaningful abstraction, and there's no sane way to package very common operations. Say you want to find elements in a row that correspond to some maximum value when grouped by date. Today, you'd need to write something like this EVERY SINGLE TIME: ``` SELECT sd1.sale_person_id, sd1.sale_person_name, sd1.no_products_sold, sd1.commission_percentage, sd1.sales_department FROM sales_department_details sd1 INNER JOIN (SELECT sale_person_id, MAX(no_products_sold) AS no_products_sold FROM sales_department_details GROUP BY sale_person_id) sd2 ON sd1.sale_person_id = sd2.sale_person_id AND sd1.no_products_sold = sd2.no_products_sold; ``` Wouldn't something like this be nicer? ``` SELECT sale_person_id, max(no_products_sold) as max_sold, link(sale_person_name, max_sold), FROM sales_department_details ``` Frankly, it seems like some sort of macro system is needed. Perhaps the SQL compiles into the above?

Wouldn't

SELECT * FROM

(SELECT DENSE_RANK() OVER ( PARTITION BY sd1.sale_person_id ORDER BY sd1.no_products_sold DESC) r ,sd1.sale_person_id, sd1.sale_person_name, sd1.no_products_sold, sd1.commission_percentage, sd1.sales_department FROM sales_department_details sd1) sd

Where sd.r = 1

do the same without the inner join?

Agreed that SQL lends itself to spaghetti. However after working with it for 25 years both as a developer but also collaborating with analysts and data scientists, I have to say I appreciate its basic declarative nature and remarkable portability as it has expanded into distributed databases.

For me the value of being able to walk up to an unfamiliar database of widely varying tech and start getting value out of it immediately is the killer app. Macros or other DB-specific extensions would be useful at times, but I’m not sure how much they would enable solving the messiness problem in a generic way that reshapes the paradigm more broadly. My instinct is the messiness is a consequence of the direct utility and trying to make it nicer from a programmerly aesthetics point of view might be hard to sell. It’s not like SQL doesn’t have affordances for aesthetics already (eg. WITH clauses).

SQL is masquerades as declarative. It's relational algebra in disguise. This results in...

- Queries that are difficult to interpret/verify.

- Non trivial amount of effort to go from question to query: I often feel like I have to jump through hoops to obtain something trivial. I have to think, and I don't want to think.

I can't believe that at time of this writing, only a few other comments are mentioning VIEWs! Like that is this dude's exact complaint!

smh...

Views, especially Materialized Views if you’re on Postgres, are about as close as you can come to a “one neat trick that solves all your problems” in modern software development imo.

Let's say you have 14 derived facts (eg aggregations, computations involving joined tables, etc) about a SalesPerson. Do create 14 single-column views? Or have one view with 14 columns? Either way the composability is terrible. If you go with 14 views, the syntax is awful and not very resistant to refactoring. If you go with one view, you will eventually end up paying a performance cost for computations you actually aren't using in certain places (barring incredible and maybe impossible advancements in query optimization).

> If you go with one view, you will eventually end up paying a performance cost for computations you actually aren't using in certain places (barring incredible and maybe impossible advancements in query optimization).

This is outdated; most modern DBs treat views as they would CTEs. If you don’t reference a defined field downstream, it doesn’t need to get calculated.

It's more about the fact that a join can change the shape of the result set. Even if the columns being projected aren't surfaced, the DB still has to process the joins to make sure the result set has the correct shape. For example, a human might know that a join will always be one:one or one:zero-or-one, but the DB has no choice but to make sure. Perhaps using subqueries instead of joins would work, but that gets ugly too.

(Or maybe my knowledge is outdated and the optimizers have gotten way better than they were 3-4 years ago.)

You can SELECT just the columns you want with one view, and IIRC that will exclude other columns from being retrieved unless they are the source of a derived value.

The bigger problem is views are tightly coupled to the underlying data.

As others have said, any decent optimiser will prune out values that aren't SELECT'ed. I don't think it's that complex either - find the unused columns, remove them from subviews, recurse all the way down. Try it on MSSQL/Postgres and you can see it being done.

I find CTEs (the WITH statement) greatly improved my ability to run more complex queries because they offer an abstraction I can use to name and compose queries together.

SQL views are useful for that kind of thing too.

The only issue I see with CTE's and have encountered, is poor query performance. The query planner needs to improve in this regard. At one point I noticed some of my CTE with queries being written to a temp file on disk and used to join against, when an inlined query did not do this producing the same results.

>Frankly, it seems like some sort of macro system is needed

Views, materialized views, CTEs.

No, these are all highly coupled to the underlying data structure. Even stored procedures tend to know a bit too much.

Say I have a table with the following properties:

- It has a column that corresponds to an id

- It has a column that corresponds to a category

- It has a column that corresponds to a value

- It has a column that corresponds to a timestamp

- It has a bunch of other columns

I want to have a general transformation that allows me to:

Select an aggregate min, max over a period of time for a given category along with the other columns linked to min/max.

The stored procedure ends up being a mashup of string concatenation. Meanwhile, it just feels something is missing to make the language expressive enough.

>Select an aggregate min, max over a period of time for a given category along with the other columns linked to min/max.

The min/max aggregations over a period of time per category is pretty straight forward (aggregating rows), but what do you mean by "along with the other columns linked to min/max"? How would other columns come into play after aggregating rows?

> No, these are all highly coupled to the underlying data structure

Difficult to be sure what you mean, but I think you're wrong. You can select from any table source.

> Select an aggregate min, max over a period of time for a given category along with the other columns linked to min/max.

   select min(category) as mincat, max(category) as maxcat,
      min(value) as minval, -- etc
   where ts between <start> and <end>

I am not sure what you're asking for if the above is not it. Can you give an example please?

You could take a look at EdgeDB. It's built on postgres, but uses its own query language and data model. Its data model has build in support for links and polymorphism. The query language EdgeQL makes it easy to follow links and output embedded arrays.

Agreed, EdgeDB[0] looks to be trying to solve a lot of SQL's perceived flaws with EdgeQL[1]. It's not the first attempt and I suspect it won't be the last but composability is explicitly one of the challenges it addresses.

[0] https://www.edgedb.com/ [1] https://www.edgedb.com/showcase/edgeql

It sounds like you just reinvented VIEWs?

VIEWs and generated columns should be able to help with these problems?

https://www.postgresql.org/docs/14/ddl-generated-columns.htm...

> A generated column is a special column that is always computed from other columns. Thus, it is for columns what a view is for tables.

Unfortunately with Postgres, generated columns are always stored on disk, so you lose some storage space.

Finally, something for mariadb fans to hold on to in the flame war.

Haha I guess :-D. Though I have a workaround for complicated generated columns that I don't wish to store: using a function. So my select would look like:

  SELECT a, b, something_complicated(table) FROM table.

Of course you have to remember to do that but hey it's something.

I think every dialect has an idiomatic way to do what you're asking. e.g., in Snowflake it's `select * from sales_department_details qualify no_products_sold = max(no_products_sold) over (partition by sales_person_id)`, PSQL `select distinct on (sales_person_id) * from sales_department_details order by sales_person_id, no_products_sold desc`, ...

Postgres can do the"function over (partition by col_a order by col_b) as well.

Agree, SQL can already do what the OP proposes in almost the exact same syntax.

and together with CTEs, intermediate temp tables or views, you can totally avoid the spaghetti and break down the queries in manageable chunks.

postgres has window functions but it does not have the `qualify` clause.

There are a number of query languages to address this type of reuse and composability for analytical query workloads.

- MDX: created by Microsoft to provide a dimensional query language. The language is incredibly powerful, but depends on your understanding of dimensional modeling (go read Kimball is the best starting point for learning MDX). There are several tools, both commercial and open source, which implement an MDX interface to data.

- DAX: Microsoft's attempt to make MDX more approachable. A relational, functional language built on top of an in-memory columnstore relational storage engine, and used in several Microsoft products (Power BI, Analysis Services Tabular mode).

- Qlik has its own expression language whose name I am not sure of.

- Looker has LookML

There are a lot of BI tools out there. Not all have an expression language to support reusable measures/calculations, but many do. You may want to look into one.

Thanks! MDX and DAX look powerful but also unreasonably complicated. LookML seems even more constrained. I'll keep digging.

Such languages offload more work onto the data model than SQL does.[0] For example, in the Tabular model (where you would use DAX), relationships are defined in the model (and supported by an in-memory data structure that you can shortcut-think of as a pre-computed hash-join) and implicit in (most) expressions you write.

If you think of SQL as a general-purpose query language, these BI-centric languages are DSLs for analytical queries. They are most useful for workloads where you have many queries that all use the same data model and are reasonably served by a pre-defined set of calculations/metrics.

[0] Do not mistake me, good data modeling is important for SQL as well.

I am using JooQ now at work to write SQL statements in a Java DSL. Java’s type system mostly works as intended and is helpful, particularly in conjunction with an IDE. Many kinds of metaprogramming are possible with JooQ.

In the semantic web, RDF can put every table in the universe in a unique namespace and OWL can bind together multiple tables and do inference at a high level, then you can query it with SPARQL which is quite similar to SQL at heart.

The Department of Defense was asked by Congress to answer “Where does the money go?” and tried to use OWL and SPARQL queries across all the relational databases they owned but they couldn’t get it to work.

I can’t help but think something along those lines could present as a ‘low code’ data tool.

You can accomplish something similar with views, foreign tables, stored procedures, user defined functions, and other mechanisms which are common in databases like PostgreSQL, Microsoft SQL, etc.

I find PostgreSQL pretty handy because it supports embedded data structures such as lists in columns, supports queries over JSON documents in columns, etc.

My favorite DB for side projects is ArangoDB which uses collections of JSON objects like tables and the obvious algebra over them which lets you query and update like the best relational dbs but you don’t have the programmability of views, stored procedures, etc.

> Frankly, it seems like some sort of macro system is needed.

Check out stored procedures and views.

Here's a great summary: https://stackoverflow.com/a/5195020

Pine [1] and jq [2] are combinator query languages.

Every expression represents a filter of some kind.

In SQL (and in relational algebra), every expression represents a dataset.

That means composing SQL expressions is composing data, not composing operations on data.

Since there is a somewhat low, backwards-compatible barrier to entry to build a combinator language on top of SQL (Pine is an example), whether or not trading the increased learning curve of using combinators for better composability is a good idea can stand the test of time.

[1]: https://github.com/pine-lang/pine [2]: https://stedolan.github.io/jq/

Check my project:

https://tablam.org

Exist 2 major ways to solve this: You do a transpiler (like most ORM are, actually) or you go deeper and fix from the base.

The second could yield much better results. It will sound weird at first, but not exist anything that block the idea of "a SQL" language work for make the full app, with UI and all that.

SQL as-is is just too limited, in some unfortunate ways even for the use-case of have a restricted language usable for ad-hoc queries, but that is purely incidental.

The relational model is much more expressive than array, functional models (because well, it can supersede/integrate both) and with some extra adjustment you can get something like python/ML that could become super-productive.

create view max_products_sold as select sale_person_id, max(no_products_sold) AS no_products_sold from sales_department_details group by sale_person_id;

select sale_person_id, sale_person_name, no_products_sold, commission_percentage, sales_department from sales_department_details natural join max_products_sold;

This is a very poorly written SQL.. Try some modern syntax: select DISTINCT sale_person_id, sale_person_name, max(no_products_sold) over (partition by sale_person_id) AS max_products_sold, commission_percentage, sales_department from no_products_sold;

The problem with SQL is that it's all about transforming data, and that would better be modelled as a pipeline.

https://prql-lang.org/

Put your example on https://www.db-fiddle.com/ or http://sqlfiddle.com/ or some other sql fiddle then we can all experiment with it and come up with leaner solutions.

Just asking for mooar build in stuff just leads to bloated software.

Yes, I can't believe that the awesome power of relational DBs is still hindered by the major flaws of SQL. Here is my attempt at something better: https://docs.racket-lang.org/plisqin/Read_Me_First.html The feature I am most pleased with is that joins are values (or expressions, if you prefer) which can be then refactored using normal techniques.

If you make the joining keys have the same name, you can use USING(col_foo) instead of table1.col_foo = table2.col_foo. It's one of the reasons I always use tablename_id as primary key name and foreign key name. Doesn't always work (for example 2 foreign keys linking to the same table but represent something different like created_by and modified_by).

Check out dbt (not dbt-cloud), great open-source SQL tool, macro system is included in the ever-growing set of features. It helps managing SQLs a lot.

Feel you. I've long moved to jupyter notebooks as my de-facto database tool. I use python functions that generate the SQL, which I then execute in jupyter and load the results into pandas.

The one that drives me nuts is

   select this,count(*) from that group by this

which is the most common data exploration query of them all which makes you type ‘this’ twice.

Tested this out on Postgres and it works:

  SELECT
    name, count(1)
  FROM 
    (VALUES ('Kev',12), ('Meg', 14), ('Kev', 14)) 
  people(name, age)
  GROUP BY
    1

Produces: [[Kev, 2], [Meg, 1]].

But I don't get it. Could someone explain how this works?

Most SQL dialects accept a positional or a direct reference to the column name. “1” represents the first column created by the select statement, “name”.

Ah I see. Is that how `count(1)` also works? Or is that different.

Please never allow this into production. I believe it is deprecated, it certainly is fragile in general.

I actually find this less readable. i have to go back and scan the select to see what column this index corresponds to. Totally breaks the flow of reading.

What would you prefer to see instead, bearing in mind that anything you will suggest is likely to bsave minimal typing but introduce new complexity into sql which everyone agrees is already creaking like a constipated elephant.

something like a function or macro

   CREATE MACRO count_values(this, that) WITH DEFINITION
       SELECT this,COUNT(*) FROM that GROUP BY this

People build composable and reusable abstractions on top of sql all the time.

Examples would help. It's an assertion people frequently make. I have struggled with homegrown not-particularly-good ones. One problem is finding a good set of abstractions for the things I really need to abstract on. So there are probably different abstractions to note.

In the same vein that different languages provide higher-level abstractions for machine code or JVM bytecode.

Stored procedures, user-defined functions, views…

There is a database library called Diesel that I use which has good support for composition. The easiest example to point to is on their own webpage: https://diesel.rs/guides/composing-applications.html

The Django ORM is one of my favourite abstractions on top of SQL.

With dynamic sql you can write your own. I know you don't have time, just throwing it out there.

Recommend

Does anyone else experience FOMO?

Ask HN: Anyone else getting a 500 on their Google Calendars?

Question - Does anyone else think the S22 Ultra is a really bad phone? | Page 7...

Anyone else think the 8Gb 24” iMac is a slow old dog?

Does anyone else love the 11” MacBook Air?

Does anyone else feel like the M1 MBA is now an even better deal?

Does anyone else find the trackpad on their MacBook Pro more convenient to use t...

Does anyone else have mixed feelings with remote work?

Ask Slashdot: Does Anyone Still Use Ogg Vorbis Format? - Slashdot

Does Anyone Else Feel the Crypto Vibe Shift?

About Joyk