11

New grad vs senior dev

 4 years ago
source link: https://ericlippert.com/2020/03/27/new-grad-vs-senior-dev/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

A student who I used to tutor in CS occasionally sent me a meme yesterday which showed “NEW GRAD vs SENIOR DEVELOPER”; the new grad is all caps yelling

NO! YOU CAN’T JUST USE BRUTE FORCE HERE! WE NEED TO USE SEGMENT TREES TO GET UPDATE TIME COMPLEXITY DOWN TO O(LOG N)! BREAK THE DATA INTO CHUNKS AT LEAST! OH THE INEFFICIENCY!!!

and the senior developer responds

Ha ha, nested for loops go brrrrrrrrrrrr…

OK, that’s silly and juvenile, but… oh no, I feel a flashback coming on.

It is 1994 and I am a second-year CS student at my first internship at Microsoft on the Visual Basic compiler team, reading the source code for InStr for the first time. InStr is the function in Visual Basic that takes two strings, call them source and query , and tells you the index at which query first appears as a substring of source , and the implementation is naive-brute-force.

I am shocked to learn this! Shocked, I tell you!

Let me digress slightly here and say what the naive brute force algorithm is for this problem.

Aside: To keep it simple we’ll ignore all the difficulties inherent in this problem entailed by the fact that VB was the first Microsoft product where one version worked everywhere in the world on every version of Windows no matter how Windows was localized; systems that used Chinese DBCS character encodings ran the same VB binary as systems that used European code pages, and we had to support all these encodings plus Unicode UTF-16. As you might imagine, the string code was a bit of a mess. (And cleaning it up in VBScript was one of my first jobs as an FTE in 1996!)

Today for simplicity we’ll just assume we have a flat, zero-terminated array of chars, one character per char as was originally intended.

The extremely naive algorithm for finding a string in another goes something like this pseudo-C algorithm:

bool starts(char *source, char *query)
{
  int i = 0;
  while (query[i] != '\0')
  {
    if (source[i] != query[i])
      return false;
    i = i + 1;
  }
  return true;
}
int find(char *source, char *query)
{
  int i = 0;
  while (source[i] != '\0')
  {
    if (starts(source + i, query))
      return i;
    i = i + 1;
  }
  return -1;  
}

The attentive reader will note that this is the aforementioned nested for loop ; I’ve just extracted the nested loop into its own helper method. The extremely attentive reader will have already noticed that I wrote a few bugs into the algorithm above; what are they?

Of course there are many nano-optimizations one can perform on this algorithm if you know a few C tips and tricks; again, we’ll ignore those. It’s the algorithmic complexity I’m interested in here.

The action of the algorithm is straightforward. If we want to know if query “banana” is inside source “apple banana orange” then we ask:

  • does “apple banana orange” start with “banana”? No.
  • does “pple banana orange” start with “banana”? No.
  • does “ple banana orange” start with “banana”? No.
  • does “banana orange” start with “banana”? Yes! We’re done.

It might not be clear why the naive algorithm is bad. The key is to think about what the worst case is. The worst case would have to be one where there is no match, because that means we have to check the most possible substrings. Of the no-match cases, what are the worst ones? The ones where starts does the most work to return false.  For example, suppose source is “aaaaaaaaaaaaaaaaaaaa” — twenty characters — and query is “aaaab”. What does the naive algorithm do?

  • Does “aaaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
  • Does “aaaaaaaaaaaaaaaaaaa” start with “aaaab”? No, but it takes five comparisons to determine that.
  • … and so on.

In the majority of attempts it takes us the maximum number of comparisons to determine that the source substring does not start with the query . The naive algorithm’s worst case is O(n*m) where n is the length of source and m is the length of the query .

There are a lot of obvious ways to make minor improvements to the extremely naive version above, and in fact the implementation in VB was slightly better. The implementation in VB was basically this:

char* skipto(char *source, char c)
{
  char *result = source;
  while (*result != '\0' && *result != c)
    result = result + 1;
  return result;
}
int find(char *source, char *query)
{
  char *current = skipto(source, query[0]);
  while (*current != '\0;)
  {
    if (starts(current, query))
      return current - source;
    current = skipto(current + 1, query[0]);
  }
  return -1;
}

(WOW, EVEN MORE BUGS! Can you spot them? It’s maybe easier this time.)

This is more complicated but not actually better algorithmically; all we’ve done is moved the initial check in starts that checks for equality of the first letters into its own helper method. In fact, what the heck, this code looks worse . It does more work and is more complicated . What’s going on here? We’ll come back to this.

As I said, I was a second year CS student and (no surprise) a bit of a keener; I had read ahead and knew that there were string finding algorithms that are considerably better than O(n*m). The basic strategy of these better algorithms is to do some preprocessing of the strings to look for interesting features that allow you to “skip over” regions of the source string that you know cannot possibly contain the query string.

This is a heavily-studied problem because, first off, obviously it is a “foundational” problem; finding substrings is useful in many other algorithms, and second, because we genuinely do have extremely difficult problems to solve in this space. “Find this DNA fragment inside this genome”, for example, involves strings that may be billions of characters long with lots of partial matches.

I’m not going to go into the various different algorithms that are available to solve this problem and their many pros and cons; you can read about them on Wikipedia if you’re interested.

Anyways, where was I, oh yes, CS student summer intern vs Senior Developer .

I read this code and was outraged that it was not the most asymptotically efficient possible code, so I got a meeting with Tim Paterson, who had written much of the string library and had the office next to me.

Let me repeat that for those youngsters in the audience here, TIM FREAKIN’ PATERSON. Tim “QDOS” Paterson, who one fine day wrote an operating system, sold it to BillG, and that became MS-DOS, the most popular operating system in the world. As I’ve mentioned before, Tim was very intimidating to young me and did not always suffer foolish questions gladly , but it turned out that in this case he was very patient with all-caps THIS IS INEFFICIENT Eric. More patient than I likely deserved.

As Tim explained to me, first off, the reason why VB does this seemingly bizarre “find the first character match, then check if query is a prefix of source ” logic is because the skipto method is not written in the naive fashion that I showed here. The skipto method is a single x86 machine instruction. ( REPNE SCASB, maybe? My x86 machine code knowledge was never very good. It was something in the REP family at least.) It is blazingly fast. It harnesses the power of purpose-built hardware to solve the problem of “where’s that first character at?”

That explains that; it genuinely is a big perf win to let the hardware do the heavy lifting here. But what about the asymptotic problem? Well, as Tim patiently explained to me, guess what? Most VB developers are NOT asking if “aaaab” can be found in “aaaaaaa…”. The vast majority of VB developers are asking is “London” anywhere in this address, or similar problems where the strings are normal human-language strings without a lot of repetitions, and both the source and query strings are short .  Like, very short. Less than 100 characters short. Fits into a cache line short.

Think about it this way; most source strings that VB developers are searching have any given character in them maybe 2% of the time, and so for whatever the start character is of the query string, the skipto step is going to find those 2% of possible matches very quickly . And then the starts step is the vast majority of the time going to very quickly identify false matches. In practice the naive brute force algorithm is almost always O(n + m). 

Moreover, Tim explained to me, any solution that involves allocating a table, preprocessing strings, and so on, is going to take longer to do all that stuff than the blazingly-fast-99.9999%-of-the-time brute force algorithm takes to just give you the answer. The additional complexity is simply not worth it in scenarios that are relevant to VB developers. VB developers are developing line-of-business solutions, and their line of business is not typically genomics; if it is, they have special-purpose libraries for those problems; they’re not using InStr .

And we’re back in 2020. I hope you enjoyed that trip down memory lane.

It turns out that yes, fresh grads and keener interns do complain to senior developers about asymptotic efficiency, and senior developers do say “but nested for loops go brrrrrrr ” — yes, they go brrrrrr extremely quickly much of the time, and senior developers know that!

And now I am the senior developer, and I try to be patient with the fresh grads as my mentors were patient with me.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK