Launch HN: Rootly (YC S21) – Manage Incidents in Slack

source link: https://news.ycombinator.com/item?id=31653985

86 points by kwent 7 hours ago | 60 comments
Hi HN, Quentin and JJ here! We are the co-founders of Rootly (https://rootly.com/), an incident management platform built on Slack. Rootly automates the manual admin work during incidents, like creating Slack channels, Jira tickets, Zoom rooms, and more. We also surface data on your incidents and help automate postmortem creation.

We met at Instacart, where I was the first SRE and JJ was on the product side owning ~20% GMV on the enterprise and last-mile delivery business. As Instacart grew from processing hundreds to millions of orders, we had to scale our infrastructure, teams, and processes to keep up with this growth. Unsurprisingly, this led to our fair share of incidents (e.g. checkout issues, site outages, etc.) and a lot of restless nights while on-call.

This was further compounded by COVID-19 and the first wave of lockdowns. Our traffic surged 500% overnight as everyone turned to online grocery, which stressed every element of our incident management process and highlighted the need for something better. Our manual ways of working across Slack, PagerDuty, and Datadog simply weren't enough. At first we figured this was an Instacart-specific problem, but luckily realized it wasn't.

A few things stood out. Our process lacked consistency: it varied greatly depending on who was responding and how much incident experience they had. After declaring an incident, most companies rely on a buried-away runbook on Confluence or Google Docs and try to follow a lengthy checklist of steps. That is hard to find, difficult to follow accurately, slow, and stress-inducing, especially after you've been woken up by a page at 3am. We started working on how to automate this.

Fast forward to today: companies like Canva, Grammarly, Bolt, Faire, Productboard, OpenSea, and Shell use Rootly for their incident response. We think of ourselves as part of the post-alerting workflow. Tools like PagerDuty and Datadog act like a smoke alarm that alerts you to an incident, then hand off to Rootly so we can orchestrate the actual response.

We've learned a lot along the way. The majority of our customers use roughly the same six tools (Slack, PagerDuty, Jira, Zoom, Confluence, Google Docs) and follow roughly the same high-level incident response process (create incident → collaborate → write postmortem), but the details of that process vary dramatically from company to company, and changing those processes is hard.

Our focus in the early days was building a hyper-opinionated product to help teams follow what we believe are best practices. Now our product direction centers on configuration and flexibility: how can we plug Rootly into your existing way of working and automate it? This has helped our larger enterprise customers succeed by automating the processes they already have.

Our biggest competition is not PagerDuty/Opsgenie (in fact, 98% of our customers use them) or other startups. It's the internal tooling companies have built out of necessity, often because tools like Rootly didn't exist yet. Stripe (https://www.youtube.com/watch?v=fZ8rvMhLyI4) and GitLab (https://about.gitlab.com/handbook/engineering/infrastructure...) are good examples of this.

Our journey is just getting started as we learn more each day. Would love to hear any feedback on our product or anything you find frustrating about incident response today.

Leaving you with a quick demo: https://www.loom.com/share/313a8f81f0a046f284629afc3263ebff

PagerDuty has a product called Rundeck, which sounds similar to your offering. Do you view them as competition?
I will echo the other comments on no upfront pricing. Even though this could be potentially useful for my team, I won't "contact" you for pricing. I am sick of having to deal with salespeople who want to know a ton of info about your business so they can gouge every last cent out of you and then some.

I would gladly pay a little extra just to have clear pricing and sign up with a credit card. I have got an engineering org to run, don't have time to moonlight as a procurement officer.

Congratulations on your launch!

The incident response space is brutally competitive and there are so many players all providing the same functionality.

I think the main problem you would have with your customers is their inertia. If an enterprise has its tools and processes set up, even if you provide a better tool at a lower price point, it's not worth their time switching to a new provider if whatever they have is working just fine.

Thank you!

And yes, the space is heating up for sure. Really good awareness and attention are developing as a result, though. We are also noticing monitoring companies starting to move into this space (e.g. Datadog https://www.datadoghq.com/blog/incident-response-with-datado...). But by far our fiercest competition is companies that are still building a subset of what we offer internally. Depending on complexity and incident volume, we've seen many cases where that's good enough, as you mentioned.

Inertia and change management are the #1 barrier to adoption. Companies have ways of working (right or wrong) that are ingrained and established. To come in, rip it all up, and say "this is the right way to manage incidents" is a tough pill to swallow. Even the inability to manage the tool via IaC or integrate with a specific tool can cause quite a bit of friction. The technical setup of any of these tools is quite easy; the real home run is whether the tool helps you drive adoption.

Congrats on the launch, I would use this for my smaller team. Why have a pricing page with no pricing on it though?

Because they want to look at Crunchbase, find out how much funding you got, then charge you based on that.

I wish companies would boycott companies doing contact us for pricing

Well, then they would charge me pennies because the startup I work for has $0 funding :)

> I wish companies would boycott companies doing contact us for pricing

A lot of companies do, by simply never contacting them for pricing.

Thank you for the kind words and feedback.

If you email [email protected] we can get you some pricing ASAP. It'll be dependent on number of users and level of support/onboarding/custom feature development required.

But if you're a small team, Rootly is powerful and ready to go out of the box with defaults and you might not require some of the custom stuff.

To answer your question directly, we customize each package for each individual customer on a bunch of variables.

I'm sorry, but if your pricing is too complex to write out on the pricing page so I can know upfront if it's even worth talking to you, I won't reach out to you. It also suggests that your pricing is simply too complex, and that would hurt me as a paying customer as well, as I wouldn't fully understand the invoices I receive from you.

Building a calculator that lets potential customers input their variables and see what the pricing would be solves that problem fully, and takes a minimal amount of time to implement.

That you haven't spent that short amount of time in order to be transparent looks shady to me (and to others, some of whom say the same in this comment thread), as otherwise you'd surely display your pricing upfront.

Thank you for the feedback, a calculator is a good suggestion! We realize there are a fair number of people that will be turned away by this, we'll see what we can do for a better middle ground.

If there is no pricing, I will likely never return. If you cannot get your money right, how can I expect you will get anything else right?

Watching the demo, how is this more than a glorified Slack workflow to a generic ticketing system? The only additions I can see are lifecycle events that update your ticket and the postmortem reminder?

This could easily be done by an engineer/tech following a runbook/doc that's linked in the alert. Cutting tickets, creating a Slack channel, escalating, and/or setting up a bridge can be done in minutes (and most alerting systems can automate that).

IME, defining and enforcing the process is more important than the automation.
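For a sense of what that baseline automation involves, here is a minimal sketch of spinning up an incident channel with Slack's Web API via slack_sdk. It is purely illustrative, not Rootly's or any alerting system's implementation; the token variable, the `inc-` naming convention, and the user IDs are placeholders.

```python
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # placeholder bot token

def open_incident_channel(incident_id: str, summary: str, responders: list[str]) -> str:
    """Create a dedicated incident channel, invite responders, and post the alert context."""
    channel = client.conversations_create(name=f"inc-{incident_id}")["channel"]
    client.conversations_invite(channel=channel["id"], users=",".join(responders))
    client.chat_postMessage(channel=channel["id"], text=f":rotating_light: {summary}")
    return channel["id"]

# Example: open_incident_channel("2022-06-07-checkout", "Checkout error rate above 5%", ["U123ABC"])
```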

Appreciate the feedback!

You could certainly go through a checklist/runbook attached to the incident; that's the setup most companies we speak with have. However, compliance, consistency, and speed in following it during a stressful incident tend to be quite poor. Tasks like creating Slack channels and Zoom bridges aren't difficult, I agree, but they're not things expensive engineers should be focused on. We want them to focus on putting out the fire, not the admin.

A few examples of things a simple doc won't be able to accomplish:

- automatically track incident metrics

- set recurring reminders to e.g. update the statuspage

- auto-archive channels after periods of inactivity

- create incident timeline without copy-pasting

- update Jira/Asana/Linear tickets with incident metadata and action items

- automatically invite responders to the incident channel

None of these are impossible to do manually; it's just a question of how much time you'd like people spending on them (a rough sketch of the auto-archive piece is below). If you only have a few incidents a year, I agree this is likely overkill.
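To make the auto-archive item concrete: it is only a few Slack Web API calls, though keeping it running reliably is the real cost. A minimal sketch with slack_sdk, assuming incident channels share an `inc-` prefix and a 7-day inactivity threshold (both are assumptions for this sketch, not Rootly's actual behavior):

```python
import os
import time

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])  # placeholder bot token
INACTIVITY_DAYS = 7  # assumed threshold for this sketch

def archive_stale_incident_channels() -> None:
    """Archive incident channels whose most recent message is older than the threshold."""
    cutoff = time.time() - INACTIVITY_DAYS * 86400
    for channel in client.conversations_list(types="public_channel", limit=200)["channels"]:
        if not channel["name"].startswith("inc-") or channel.get("is_archived"):
            continue
        messages = client.conversations_history(channel=channel["id"], limit=1)["messages"]
        last_activity = float(messages[0]["ts"]) if messages else channel["created"]
        if last_activity < cutoff:
            client.conversations_archive(channel=channel["id"])
```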

I’ve always wondered about building a startup on another startup’s back. What happens if they cut you off? Is getting bought up by Slack the end goal here? Seems like a big risk, one whim at Slack and you’re toast.

We have seen Slack start investing in this area with the Workflow Builder they announced last year. One of the big use cases they highlighted was incident response. We haven't run into any customers trying to leverage it just yet, though, as there's still a lot of heavy lifting required.

IMO what makes Slack so powerful is their app ecosystem. We aren't too worried about them shutting that down or competing with us. We see the awesome folks on the Slack Platform team continue to invest heavily there.

But if Slack wants to seriously compete in this space, we'd welcome it. The more attention and competition the better. With most accounts we approach, we've found they didn't know off-the-shelf solutions existed!

It’s actually pretty common to start on the back of an incumbent. As a startup gains success, it can reduce the dependency by building more of the end-to-end experience and distributing risk across more than one partner.

That is spot on. We're investing in our web platform, which offers the exact same experience, and quite a few companies run incidents from there.

But the Slack ecosystem has been great for us so far. It's easy to develop on and fairly flexible in terms of what we can do. I think the most challenging part is going through the review cycle, which can take longer than expected when you're constantly shipping new features!

Let's hope Slack itself doesn't start using Rootly for incident response :)

Although this post is largely focused on our integration with Slack, we have a standalone web platform that does exactly the same thing. We have quite a few companies (especially MS Teams shops) that run incidents solely from there.

This also serves as a backup when Slack goes down: users can continue using Rootly (one of the top 5 most common questions we get).

I don't get the pricing on things like this and Pingdom. This stuff seems like it should be cheap, like $5/user/mo. But everyone seems to go expensive.

There are other industries that are similar, but this always stood out to me as an industry where the pricing never felt right.

I appreciate the feedback.

We tried to think of our pricing in terms of the value a user would get out of the tool: the amount of time and headache we'd save them when actually responding to an incident. We've found a lot more pushback from smaller companies (e.g. <20 engineers), but have also realized their challenges in managing incidents are less pronounced.

Just curious for my own learning: is the thinking behind "should be cheap" motivated by the idea that not everyone would need access to monitoring, alerting, and response tools?

For non-technical purchasers, it makes sense to price by the value the organisation will get out of the tool. However for tools that technical users are involved in you have to fight against the "I could build this myself" factor.

There are lots of tools that are basically CRUD apps, or maybe CRUD with a chat interface in this case, which are fundamentally straightforward to build a first-pass version of. I'm sure the product here is far better than a first-pass version, but it's an uphill battle to justify that when the pricing is on the value to the user, rather than based on the cost to build.

Another complicating factor for this market in particular is that there are often two types of users: regular and infrequent. In my experience tools like this would be used heavily by the engineering team, but there was value in having everyone in the org have access to the tool. There may be 10x the number of non-engineers, but they're often worth 1/10th or less to have on the platform. Each person isn't worth having by themselves but having everyone there is worth something. Nickel-and-diming customers on the basis of lots of users who rarely use the platform isn't great.

Edit: also, don't have a pricing page with no pricing on it.

This was an awesome response and you absolutely nailed it.

The vast majority of our customers actually have some sort of internal bot built. It's usually quite limited and largely focused on the "creation" portion of an incident: most commonly creating an incident channel, creating a Zoom, linking to a few integrations, and that's about it.

But depending on who you speak with, that can be enough, which is totally fair, especially when complexity and incident volumes are low.

And yes, agreed: not every user on the platform will value it the same. An SRE and someone in legal will get different value out of it. Our goal is to make this accessible to the entire org, not just engineering teams.

Thank you for the feedback on the pricing page as well!

One last thought: if engineering teams paid "what it was worth" for a tool for every piece of their stack, they'd have no money left for engineers or for building an actual product. It is very easy to spend the entire personnel budget again on tools, and most companies just can't do this. At that point, you're competing with your competitors for business, and with other unrelated service providers for budget.

Not the OP, but I'm at one of the smaller companies (~30 people / ~12 engineers) you describe that's currently looking for a tool like this.

I can't actually see the pricing because it's behind a nebulous "contact us" link, but if this is more than about $5/user/month I would definitely balk at the price.

Larger companies already have dedicated platform and tooling teams with enough technical talent and bandwidth to build this sort of solution (I've seen something eerily similar at a previous employer with about ~75 engineers). IMHO it's the small companies that need off-the-shelf incident management, because they have very few people to dedicate to solving this problem and need a way to manage the communication chaos that incidents can cause.

Thank you for the feedback, we'll make some changes to the page.

Agreed on those companies having the technical talent; in fact, the vast majority of our customers came to us with their own bot built out that resembles some of our features. It really comes down to the age-old question: build vs. buy? The maintenance cost and feature enhancements for something owned in-house can be burdensome, but I am obviously biased here.

We've seen it usually takes an engineer at least a whole quarter to stand up the basics of creating incidents properly with some basic automation.

But the times customers decide building it themselves isn't worth it are usually when a) the person who owned it has left the company, or b) it's become a big distraction from their core focus / a full-time job to maintain.

You'd balk at more than $5/user/mo? For 12 engineers, that's only $60/mo (and for all 30 employees, $150/mo). I'd guess that you're not a C-level since you said you'd balk at that price, which is incredibly cheap. I pay more for my business's status page.

firehydrant.com has a free tier that allows people to open incidents from Slack. It also includes the service catalog, runbooks, and status pages.

Congrats on the launch. How is this different than FireHydrant?

I've been a victim of this too and it sucks. People will flat out copy your business, and while they're at it, they'll go and copy and paste your painstakingly-written documentation as well. This really leaves a bad taste in my mouth.

I always try to go with the OG when I find these types of instances, since they're likely the one actually innovating (I know nothing of FireHydrant, so this is just conjecture).

FireHydrant was my worst experience out of every incident manager I experimented with - literally nothing worked during our tests - and after two months of asking they're still refusing to remove our account; we still get a weekly email dashboard.

Yikes! Never like seeing that. Can you email me directly and I will get this sorted? [email protected]

Tickets #1452, #1454, #1588. Was told "the account has been removed" on March 22nd, but I continue to receive "Last week on FireHydrant" emails specific to our org, most recently on May 29.

Thanks, emailed you as well. Sorry you ran into this!

Sorry to hear that. I'll do my best to resist asking if you'd like to chat instead ;).

We try to take a very partnership-centric approach. What that looks like day-to-day is our engineering/success/leadership teams collaborating in a shared Slack Connect channel on new features. For a lot of our customers, we get deep into the problem and bring in outside speakers from the industry to do workshops, AMAs, etc. that align with their challenges.

This is the fun part about the job!

>Sorry to hear that. I'll do my best to resist asking if you'd like to chat instead ;).

Wow - you're looking for dirt on a company in the EXACT thread that calls out some shady things you've done against them.

Thank you!

Great question. There are quite a few differences, namely in our product design focus. We've taken a more configurable and flexible approach that focuses on plugging into a company's existing stack and process. Oftentimes we'll have customers send us their entire playbook of what they do now and ask us to automate it as a starting point (e.g. rename Slack channels to match my Jira number for incidents, etc.). We do this to reduce the amount of change required when a new tool is brought in. As a result, we focus on features such as our Workflows engine, which allows for this customization. Another big area of focus for us is, unsurprisingly, Slack; we think of the other areas of Rootly, such as our web platform, as the backend that powers it.

FH does a lot of things well and has great customers too. They have a sleek UI, a strong security posture, and more. Their approach is more opinionated in guiding you through incident best practices. There is no wrong answer here; we hear plenty from customers who want both.

Why did you or your team copy & paste the FireHydrant docs?

Congrats on the launch! This is really cool. I remember having to join several incidents and it was always a mess, especially with people being left out and others being added who shouldn't have been there in the first place.

What happens when the incident is over? Where does all that data live, and can there be some fancy data analytics that could potentially address bigger issues that keep recurring?

Many thanks!

So glad you asked. Once the incident is resolved, we'll prompt you to edit your postmortem. This can be done inside Rootly, but most commonly we'll auto-generate a Confluence page or Google Doc. There you'll have all your incident metadata, a template to fill out, and most importantly your incident timeline (no copy-paste required).

From there we can help you do things like automatically scheduling your postmortem meeting with everyone that was involved.

We also want to help you improve your process and response. We'll prompt anyone involved in the incident for feedback (they can submit anonymously too) and collect important metrics.

There are top-line metrics like incident count, MTTX, and outstanding action items, but also finer-grained ones, which I think is what you might be hinting at. For example, you can automatically visualize which services are being impacted the most; that might be an indicator of an area to focus on.

We try to keep the garden walls on the product quite low, allowing you to export any of the data out of Rootly and into your own analytics engines.
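As a rough illustration of the kind of top-line metric mentioned above: once incident data is exported, something like mean time to resolution is a short script away. The records and field names below are invented for the example, not Rootly's export schema.

```python
from datetime import datetime
from statistics import mean

# Hypothetical exported incident records; field names are illustrative only.
incidents = [
    {"service": "checkout", "started_at": "2022-05-01T03:12:00", "resolved_at": "2022-05-01T04:02:00"},
    {"service": "search", "started_at": "2022-05-09T14:30:00", "resolved_at": "2022-05-09T14:55:00"},
]

def mttr_minutes(records) -> float:
    """Mean time to resolution, in minutes, across a set of resolved incidents."""
    return mean(
        (datetime.fromisoformat(r["resolved_at"]) - datetime.fromisoformat(r["started_at"])).total_seconds() / 60
        for r in records
    )

print(f"incident count: {len(incidents)}, MTTR: {mttr_minutes(incidents):.1f} min")
```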

What do you see as different to incident.io?

Great question.

Incident.io is likely our closest competitor, and they are doing amazing work over there. They have a strong team and smart founders who have been in the trenches before. From a product perspective, they are my favourite of our competitors if I had to pick one to use myself.

Our differences are largely driven by the customer segments we serve and the needs of SMB vs. enterprise. We've found that by going upmarket to enterprises such as Canva, Shell, and Bolt, it is quite difficult to develop an opinionated platform based only on industry best practices, since each organization's approach to incidents (even with the same tooling) varies greatly. You'll find Rootly to be a lot more pluggable and customizable; you can turn any knob to make the product work for you if needed. The reason is that we find change is hard, even small process tweaks. We want to reduce as much of that as possible when a new tool is brought in.

We've been around longer, so naturally our product maturity is further along: Workflows (https://rootly.com/changelog/2021-11-30-workflows-2-0), 30+ integrations, a Terraform/Pulumi provider, and security (https://rootly.com/security).

Again, they do a great job. Just different needs and requirements for startups vs. enterprise.

If helpful, customers have written reviews for us on our respective G2 pages: https://www.g2.com/products/rootly-manage-incidents-on-slack...

Hey folks. Co-founder of incident.io here.

Firstly, thanks for the kind words Quentin – appreciate it, and congrats on your progress so far.

incident.io is pitched slightly differently to Rootly, insomuch as we're not building an engineering product, but instead something that's designed to work for entire organisations. I saw first hand what this looks like at Monzo – a bank here in the UK – where incidents weren't just declared when Cassandra fell over (ahem, https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...), but were also declared for things like excessive customer support queries and not enough people to serve them, regulatory issues, or a customer threatening staff in reception. All of these things require teams to form quickly, communicate well, and follow a process. We're building for this.

In terms of market and customer segments, we're working with a wide range of companies with up to 6k employees. That said, we're a perfect fit right now for folks in the 200-1500 people range.

By all means reach out if you have any questions.

Oh and if you're not in the market to buy something, I open sourced the tool I originally wrote at Monzo: https://github.com/monzo/response

Took a quick look at your loom. Very cool. In practice, we prefer just looking through a single channel with threads, but perhaps that's because we're a small team.

Thank you!

You're not alone; I've run into quite a few cases where working from a thread will suffice (at big and small companies). We are still team dedicated-incident-channel, but there are a few things that threads do better.

For example, the number of incident channels that get spun up can get out of hand, and threads are a lot cleaner for that. So we built a workflow that lets you specify an auto-archive behaviour.

We also don't want you to lose the context of your threads or conversations: you can run /incident convert in Slack and we'll pull that context over. Lastly, we often see people working from threads in a primary #incidents or #outage channel for better visibility. We'll let you specify exactly who you want to notify whenever an incident gets opened or closed.

But generally we've found we can do more powerful things in dedicated channels, around integrations or even assigning roles, if that's important :)

Happy to chat more about your use case though!

Great stuff. Good luck with your product. I imagine we'll need it later but for the moment, I think we're too small to be worth your time.

Would have loved to have run into you guys when/if you were looking for seed investment though

Appreciate the kind words and agree on fit.

Happy to compare notes and stay in touch if you'd like to connect, of course: [email protected].

Hi Quentin! This is Jade

The loom link at the end 404s for me.

Fixed, thank you for catching that Jade!

From our experience, we still see them in a number of deals. I think their focus has shifted more towards SLOs and Microsoft Teams, though, areas we aren't investing in right now.

Just built out Rootly and it's a fantastic product. On top of that, JJ and team are responsive and heavily engaged with customers, listening to feedback and continually implementing improvements.

We trialed Rootly against competing products, and Rootly won out on customer engagement. The competition just wasn't responsive.

Implementing easy incident management tooling also saw the volume of incidents increase rapidly, as teams started to notice and handle issues as actual incidents. While the metrics increase looks bad on the face of it, actually addressing far more issues that were previously just ignored is fantastic, and has led to increased stability through better incident and problem management.

Thank you for the kind words, Alex. The whole team over at Gemini has been such a treat to work with.

The collaboration on building new features together is what really excites us. Appreciate you always pushing us to be better.

I feel like I should get you a t-shirt now that says "more incidents = better" because I couldn't agree more.
