10

A Cool SQL Problem (And Why It Is Also a Bullshit SQL Problem)

 3 years ago
source link: https://ryxcommar.com/2019/06/24/a-cool-sql-problem-and-why-it-is-also-a-bullshit-sql-problem/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

The Problem

Your office needs to figure out the minimum number of rooms required to organize meetings for any particular day. To accomplish this task, you have a table with the following information: a meeting ID, a start time, and an end time:

CREATE TABLE meetings (
id  INTEGER UNIQUE,
start_time  TIME,
end_time    TIME
);

For any table of that format, figure out the number of rooms you need. Your solution must be in SQL. Assume no meetings go past midnight. Meeting times are scheduled in minute increments.

You have 50 minutes to solve this problem.

Example:

Consider the following table:

INSERT INTO meetings VALUES
(1, '07:00:00', '11:00:00'),
(2, '12:30:00', '17:00:00'),
(3, '11:30:00', '14:00:00'),
(4, '09:30:00', '11:30:00'),
(5, '10:00:00', '15:00:00'),
(6, '09:00:00', '13:30:00');

You would need 4 rooms to schedule all these meetings:

image-3.png

If you’d like to solve this problem, don’t scroll down. Otherwise, read on.


Why This Problem is Hard

  • This problem blurs the line between a SQL problem and a more “programmatic” problem. At the end of the day, it can be considered a SQL problem with a clean SQL solution, but I wouldn’t blame someone for wanting to do this in Python or Scala. In fact, there are probably lots of smart people who can’t solve this problem in SQL in 50 minutes, but can solve this problem in another language more suitable for actual analysis. So in a sense, you’re testing not just someone’s coding abilities but specifically their SQL proficiency.
  • This is a bit of a general algorithms-y problem, and it is biased toward people who have done problems like this recently either in their career or at job interviews.
  • (Minor spoiler) There are two correct solutions to this problem, and one of those solutions is butt-ugly. The good solution that’s clean and not butt-ugly is incredibly hard to come up with!
  • (Minor spoiler) The correct solution is not intuitive– in fact, I’d say the problem is designed in a way to deliberately mislead about what either of the correct solutions look like. The minimum reproducible example confirms this: it spits back a “correct” answer for the most plausible incorrect solution.
  • As a note: I can find this problem on StackOverflow and literally none of the solutions provided are the good one. The answer to the SO question with a green checkmark and 4 upvotes is a pseudo-code version of the butt-ugly solution. Nobody in the thread provided the good solution.

All told, this problem is a hard (but fair) problem. If you used it as an interview question, it would filter out a candidate pool massively. And maybe that’s what your company wants. With that said…

This Problem Was Bullshit (In Context)

This is a 6-figure problem. By that I mean anyone who not only gets this problem correct in the allotted 50 minutes, but provides the most optimized solution and avoids the butt-ugly solution, should be making 6-figures. Anyone who can do the elegant solution and isn’t making 6-figures is selling themselves short.

This is a problem I encountered during a code screening for an “analyst” position paying about $80k/year.

In other words, there is a mismatch here between the difficulty of the problem and the salary expectation. Once you’ve filtered your candidate pool down to people who are answering problems like this correctly (never mind elegantly) and without cheating, you’re competing for candidates who can make over $100,000/year if they really try.

Also of note, because this problem is algorithms-y, you’ll filter out a lot of smart and otherwise qualified candidates, and it’s especially a slap in the face if the job itself doesn’t require doing this sort of work on a consistent basis. And the types of people who do this stuff consistently are making more than $80k/year.

That all leads us to the real issue here. The only people who are going to get this problem correct and also settle for a $80k salary are cheaters and wusses, but mostly cheaters. They’ll search for the StackOverflow solution, transcribe it, and land an interview.

Companies that try to pay mid-tier salaries for top-tier candidates will not consistently get top-tier candidates. I don’t understand why companies think they can keep doing this when unemployment is below 4%. Companies that pull stunts like this will mostly attract mid-tier hucksters, frauds, and charlatans because honest top-tier candidates can do better. And the few honest, top-tier wusses that take the job offer will eventually be driven mad vis-à-vis working with a bunch of hucksters, frauds, and charlatans.

This is how a hard prescreening problem plus a mismatched, low salary leads to a horrible company culture down the line. And from what I’ve heard from people who have worked at this company, that’s exactly what they have.

My suggestion for this company would be to be reasonable. It’s OK to hire mediocre people, honest to god, and chances are most of your employees are in fact mediocre. Your interview questions can reflect that. Mediocre people do so much important work in the economy. Not every part of a goddamn interview process needs to be about filtering down to only candidates who graduated from Stanford summa cum laude with a computer science degree. It’s also about matching the right people to the right jobs, which is contingent on salary. Get a grip.

The Solution

Before I mention the real solution, I want to debunk the solution that a lot of people will think of, which is to do a Cartesian join on (m1.start_time < m2.start_time AND m1.end_time > m2.start_time) OR (m1.start_time < m2.end_time AND m1.end_time > m2.end_time) OR (m1.start_time > m2.start_time AND m1.end_time < m2.end_time) and take MAX(COUNT(id)). In fact, one of the reasons I both like this problem as a challenge and hate it in the context of an “analyst” position is because that actually looks like the solution, it’s what a lot of people will think is the solution, and you’ll spend a lot of time trying to get the ON condition “correct” instead of going for the actual correct solution.

There is a simple counterexample that demonstrates why this is wrong: Imagine one 8-hour meeting that lasts the whole day, and a bunch of 1-hour meetings that are back-to-back. The method above will tell you that you need 9 rooms, but obviously you only need 2. So that can’t be the correct solution. (Note that the solution to the example is 4, and the method just described also returns 4. Sneaky bastards!) To test if your solution really works, the solution query should return 2 on this:

DELETE FROM meetings;
INSERT INTO meetings VALUES
(1, '09:00:00', '17:00:00'),
(2, '09:00:00', '10:00:00'),
(3, '10:00:00', '11:00:00'),
(4, '11:00:00', '12:00:00'),
(5, '12:00:00', '13:00:00'),
(6, '13:00:00', '14:00:00'),
(7, '14:00:00', '15:00:00'),
(8, '15:00:00', '16:00:00'),
(9, '16:00:00', '17:00:00');

Onto the real solutions. The trick is to think of the “time” as the domain, not meetings.id. Time is not a field in the data, so you have to make your own time field.

An inelegant solution that leads to the correct answer is to create your own table of 60*24 = 1,440 rows of single minute times ranging from 00:00 to 23:59. Then join on when the meeting is in that minute interval, you get the idea. Yes, this is the pseudo-code StackOverflow solution.

There’s a much better solution which takes advantage of the fact that, if nothing is happening at 4:30am, then you don’t need a row there. Same for 4:31am, 4:32am, and so on. What matters is not every single minute but the endpoints. In other words, all that needs to be represented in the table are start_times and end_times. There are two columns that you have to combine with UNION to actually get the domain of all times represented, and from there you do the rest. With that said, here is what the good solution looks like:

SELECT MAX(overlap) as solution FROM (
SELECT COUNT(t) as overlap FROM (
SELECT * FROM (
SELECT start_time AS t
FROM meetings
UNION SELECT end_time AS t
FROM meetings
) as times
JOIN meetings AS m
ON times.t >= m.start_time AND times.t < m.end_time
)
GROUP BY t
);

And there’s that. Just make sure the JOIN condition uses a half-open interval to account for back-to-back meetings that can be in the same room.

Is It Ethical to Post This?

I’ve done about a dozen of these sorts of interview problems the past few months, a dozen more throughout my lifetime (notwithstanding Codewars), and I’ve not felt compelled to post about any of them. I wanted to talk about this problem specifically for a handful of reasons, including because I liked the problem and also because encountering the problem for a $80k job pissed me off.

A few friends told me not to work at this place to begin with; nevertheless I was rejected by the company that provided this problem, so I never got a choice in the matter. Since I won’t be working there, it’s not like they could rescind my employment as punishment for posting this and finding the problem.

If someone uses this information for their own leverage in a job interview, it’s not my responsibility if someone cheats. They can already cheat and get a passable solution via a simple Google search. There is value in sharing this question because it’s interesting, and hopefully it helps others prepare for similar problems. Code challenges, especially these really brief challenges, are also an ethical and reasonable way for companies to not waste their time interviewing idiots.

A better question would be, though: is it ethical to underpay your employees? The answer seems to be no, and posting both the problem and solution online provides a way to remedy underpayment (in addition to simply being valuable information to people who may never see this problem in the wild). After all, this is a 6-figure problem pulled from a 5-figure job posting. I’ve done much easier problems for actual 6-figure job posts (in fact, I’m doing one such problem later this evening). If someone needs to cheat to get the answer to this problem, there is a possibility they could be a 5-figure employee. In which case, upon accepting the job offer that they cheated their way into, they’d be perfectly matched to the perfect job for them.

Cheers, and happy job hunting to both the 3.6% of unemployed labor force participants and other job hoppers.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK