5

How can we help fix replies on lobste.rs?

 3 years ago
source link: https://lobste.rs/s/j4luhv/how_can_we_help_fix_replies_on_lobste_rs
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
How can we help fix replies on lobste.rs?

Currently, going to lobste.rs/replies will give you this message:

2020-12-03: We recently migrated the database and this feature hammers this db config, so it's unavailable until we can sort that out. Sorry for the hassle. - Peter

Having looked at the code, the query in question is definitely a hefty one.

I’ve looked at ways of optimizing it and all that, as have I’m sure people smarter than I. I’m sure it’s doable, but it’s a matter of how much time it would take and how much of the rest of the codebase would need to change.

No rush, as I realize we all enjoy this community for free, but is there an update on the plan to get it fixed? I (and I’m sure others) would be happy to help out in whatever way we could.

PS> And, again, thank you to everyone involved in keeping lobste.rs running. It is without a doubt the best tech link aggregator and discussion site out there.

  1. cadey

    15 hours ago

    | link

    This is a huge hot take but I actually like it this way. It makes lobste.rs a lot more calm. If I really want to come back to a conversation I can go to the “My Threads” view.

    1. Because nobody has directly disagreed and this is currently the top comment, I just want to say: I disagree. I miss the notifications. It’s a lot more work to check for replies on mobile.

    2. zge

      11 hours ago

      | link

      Not noticing that someone didn’t reply might be good for civility when it comes to heated topics, but it’s just one side effect of not noticing if someone responds to you. It’s pretty easy to not notice a reply or a question between larger threads, and that seems like a loss too.

      If the issue is fixed, maybe reply-notifications could be turned of in debate-threads, as a kind of cooling measure.

      1. peter

        7 hours ago

        | link

        You can also use pushover or email notifications, both work great for me.

    3. How do you use “my threads” to make it useful? I find it’s always full of stuff I would have to sift through to find what I wanr

      1. cadey

        11 hours ago

        | link

        That sifting being required is the thing I find useful. It makes it harder for me to go back to that conversation, so it makes it easier for feelings in the moment to pass.

      2. Do a ‘Find in page’ for your username.

    4. Personally, I fear it will make lobste.rs more HN. Meaningful responses which made me learn various things and challenge my assumptions, often changing my view, were extremely valuable to me on lobste.rs. Now, I’m very afraid of this becoming “throw my $0.02 and forget” both for me and for others. I noticed I’m already missing valuable responses when I once accidentally went through “my threads” further than I planned.

    5. I don’t. I usually don’t go back to threads after I read through them except to reply to people who have replied to me. Now I have to read through my threads all the time….

  2. For the curious, here is the SQL which generates a view, used by this rails controller and model.

    1. Out of interest, why is this generated on demand? When you post a reply, you must be looking up the parent post already (or have its unique identifier from the generated HTML). Can’t you just look up the author of that post and add a new row to a replies table with the {parent ID, post ID}? If you have that table with an index for parent ID, it’s easy to select all replies to posts from a single person. Given the small number of posts any given individual does, it should then be fairly cheap to do a join to look up the dates of the posts and sort.

    2. It seems that there is hell lot of missing indices that should greatly improve performance of that query.

      For example indices on:

      • is_deleted
      • is_moderated
      • is_following
      • do partial index on score instead of full index

      These should provide some speedup.

      Alternatively try to create materialized view (I know that MySQL do not have them per se, but there are workarounds) or just switch to PostgreSQL.

      1. It seems that there is hell lot of missing indices that should greatly improve performance of that query.

        I don’t know about the internals of the database, but I’m guessing that the majority of comments are not deleted/moderated so the DB might still choose to do a full scan. is_following seems promising, but this comment mentions that the predicate isn’t getting pushed into the view so it may just be doing the joins for every user at once.

    3. Wowser. Looks like fun SQL.

      The normal answer would be views, but it appears from reading other comments that this isn’t an option, so we’re left with the classic cliche answers, any of which might work: RAM-up, brute-force, change engines, shard, and so forth.

      The trick probably is figuring out which of these is the easiest to try first. I’m not a Rails guy, so I don’t know the implications of switching engines to Postgres, but that intuitively feels like the right place to start playing around.

      ADD: Forgot temp/semi-permanent tables. Sometimes you can use cron and temp tables to work the same as views would.

      ADD2: Yeah, come to think of it, maybe some kind of temp shim is the way to go. You’re not going to solve the problem, but you can presort out the where clauses such that the indexes will allow the outer query to return faster. You’d need to work with it a bit to be sure, tho. A lot depends on memory size, how much data we’re talking about, and how often you need to hit it.

      1. peter

        7 hours ago

        | link

        I don’t think any of these are great solutions. The real answer is figuring out the query optimizer issue, and fixing it. Since the issue isn’t fixable within a single query (MySQL not pushing down a key predicate into the view), the next step is to perform multiple queries to work around the query optimizer. The predicate in question filters 300k rows down to ~4 (mentioned elsewhere in the thread), so the app should run that initial query, and then a second query using those results.

        For some reason people tend to avoid working around query optimizer issues with multiple queries. I can’t imagine why.

        switching engines to Postgres

        The Postgres optimizer has problems too. It can get confused about lateral joins—or queries that should be optimized to lateral joins—and end up doing the same thing as here. I’ve seen Postgres queries perform full table scans on huge tables to build up one side of a hash join, even though it estimates the other side of the join will filter to ~1 row.

    4. A reasonable compromise might be to time bound reply notifications. If the story is 1-3 months old, do you necessarily need a reply notification? How many people are replying to stories that are that old? (As a proportion of total comments.)

      With some covering indexes it might be a good enough strategy. Definitely better to have even 90% of notifications than 0%. (Unless your @cadey.)

      1. I really enjoy receiving late replies to topics or commenting on old posts. Newer responds I catch more often just by rereading the comments.

        1. I do too, but if we can get most of the value with minimal changes then I think it’s a worthwhile compromise. Other options like migrating the database, setting up a VPS or changing lots of business logic are a bigger ask for the maintainers. Plus, an okay solution gives some breathing room for more permanent solution. (Or removes all impetus to make a permanent solution.)

    5. e12e

      6 hours ago

      | link

      Thank you for the links. I think I would probably try to make a view for user stories and comments (my stories, my comments - filter out negative scores, unfollowed), then try to look for children (replies), and filter/rank those. Possibly that would entail changing the relationship to easier be able to query children/replies - but I’m guessing the current “where parent_id…” should work fine.

      It would probably mean four queries, but they might be easier to benchmark/optimize.

      Normally I’d fight strongly for fewer/one query per controller method - but obviously in this case that’s been tried.

      Personally would probably consider moving to pg - but doing that because of a single query/rails view is silly.

  3. Here is a relevant comment and discussion:

    MySQL doesn’t push the where clause down into the view, so does a full table scan of all 300k comments instead of using an index to look at ~4 rows

    1. I just winced because I was @’ed but I never saw it, due to the problem.

    2. agent281

      edited 8 hours ago

      | link

      If the view isn’t cutting it, would it be challenging to work with the query directly? Alternatively, it could go into a stored procedure parameterized with the user_id. I haven’t really worked with Rails so I don’t know how palatable either option would be.

      ** Edit ** At work we have a stored procedure that’s around 8k LOC, builds up a large XML structure and joins dozens of tables. The primary table it joins against is about as big as the comments table. The query has lots of covering indexes and can run in about 50-100 ms. It’s a monster query, but it works for our relatively low volume environment. (Internal company services for a midsize company.) Despite being very large the fact that we can push the predicates in as a sproc makes it pretty performant.

  4. The code was written back when the site was hosted using MariaDB not MySQL. If the underlying databases pushes predicates down the performance is fine.

    The two fixes are either:

    • Shift the site to Postgres and deal with the sister site downstream issues
    • Migrate off of DigitalOcean’s hosted MySQL and into a MariaDB instance we self manage
    1. I’m shocked that in this day and age any host is still offering paying customers MySQL that is not mariadb. That just smacks of irresponsability on the part of DigitalOcean.

      What is the main overhead for “self manage” going to be here? Backups?

      1. arp242

        12 hours ago

        | link

        MySQL is still actively developed, right? Although I’m not sure what the MySQL vs. MariaDB status is these days, and MariaDB definitely seems to have bigger mindshare.

        1. MySQL is still actively developed, by Oracle. Guess the rest. MariaDB took most of the community developers with them.

      2. soc

        12 hours ago

        | link

        Sounds like DO is optimizing for billable use …

    2. soc

      12 hours ago

      | link

      I could ask in my hackerspace regarding hosting.

      How much space/CPU/bandwidth do you need?

    3. soc

      12 hours ago

      | link

      Perhaps it would be an option to ask the oster, whether they have plans to offer other hosted DBs in the future?

    4. I‘m not familiar with Ruby, but wouldn‘t it be the best solution to use an ORM to make the implementation vendor independent? Something like SQLAlchemy for Ruby.

      1. That tends to be a good way to add performance problems, not remove them.

      2. arp242

        12 hours ago

        | link

        Rails comes with ActiveRecord, which is “SQLAlchemy for Ruby” (or rather, it’s probably the reverse: SQLAlchemy is “ActiveRecord for Python”).

        These kind of complex queries don’t tend to do well in AR though. There’s a reason it offers a “raw SQL escape hatch”.

        1. jfb

          4 hours ago

          | link

          SQLAlchemy isn’t really an active record pattern ORM though – it’s not nearly that broken.

  5. snej

    11 hours ago

    | link

    I remember reading about similar perf issues at Twitter early on (2008?) when they still used Rails and MySQL. The general issue is “PubSub is a bitch to scale.” I’ve always suspected that big joined queries are the wrong way to do this … it’s just that they’re the hammers relational DBs give you.

    What if instead you materialized the pending-replies queue for each user? When a comment is being posted, jut insert a reference into the reply queue of the parent comment’s author. Then when someone goes to the replies page, read out their replies queue and drop the rows you just displayed. Both of these are trivial SQL statements that operate on a single “replies” table and require only a simple index on userid to optimize.

  6. Nei

    17 hours ago

    | link

    sometimes with MySQL it can help to do the processing in the code and use multiple smaller SQL queries. Not sure if this would be possible or make the task easier here


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK