
Skyflow raises $30 million to bring data privacy to enterprise LLMs

source link: https://diginomica.com/skyflow-raises-%2430m-data-privacy-enterprise-llms


By Phil Wainewright

April 5, 2024


(Data privacy vault concept image © putilich via Canva.com)

Among the many concerns that enterprises have when adopting generative AI, data privacy comes high on the list. Whether they're using third-party Large Language Models (LLMs) or building their own, one of the top priorities is putting in effective guardrails to prevent confidential or sensitive information being inadvertently exposed where it shouldn't be seen. This market need has sparked a new surge in demand for Skyflow, whose privacy vault technology is designed to safeguard Personally Identifiable Information (PII), and which last week announced a new $30 million funding round led by storied VC firm Khosla Ventures, an early big backer of OpenAI. I caught up with Anshu Sharma, CEO and co-founder, to find out more.

While there are established mechanisms for protecting sensitive data in traditional databases and data warehouses, LLMs bring new challenges. The existing mechanisms, such as role-based access control and encryption, already have limitations, according to Skyflow, which uses a technique called polymorphism in tandem with encryption and tokenization to protect PII and other confidential data more securely. The limitations are greater in an LLM, as there's no predetermined structure for traditional mechanisms to latch onto. Sharma elaborates:

How do you turn a large language model app into an enterprise-grade system with all the controls we built in the last 30 years around databases and data warehouses? ...

The same things that you worry about, when you have data going into BigQuery, or Snowflake or Databricks, you have the same worry when you're building models, except with those kinds of things, you have potentially other solutions, legacy solutions... You can do role-based access control and encryption in a database. Our way of doing polymorphic encryption and tokenization is better, but you can at least do something natively.

When it comes to large language models, you can't even say, '[An individual's] date of birth should be deleted.' There's no rows, there's no columns, and there's no delete. So essentially, you have to build at the same time, if you're going to deploy LLMs in the enterprise — you can't go to production without controls in place.

Protecting PII in an LLM

The lack of structure to help identify PII within the LLM is not the only challenge. While this can be overcome by using encryption to obscure the PII, or better still, substituting tokens in its place, doing so strips out the meaning that the LLM would otherwise use to make sense of the data. This is where Skyflow's approach of adding polymorphism comes into its own. It ensures that each token still resembles the data type it replaces — for example, representing someone's first name and last name with tokens that still look like a first and last name. This makes it possible to train a model and run inference on meaningful data that nevertheless protects any PII. Sharma explains:

In the old days, if you're building a database application, and you want to test something, you can just create synthetic fake data. I can just generate random numbers. But if you feed random numbers to a model, whether you're fine-tuning it, or you're building your own, or you're building your RAG, or vector, whatever, it can't make any sense of it. LLM is sensemaking. You can't make sense of synthetic data or fake data. We can turn real data — retain its shape — but turn it anonymous and pseudo-anonymous.
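To make the "retain its shape" idea concrete, here is a minimal Python sketch of shape-preserving tokenization. It illustrates the general technique rather than Skyflow's own algorithm: the surrogate name lists, the hashing scheme and the in-memory dictionary standing in for the vault are all assumptions made for this example.

```python
# A minimal sketch of shape-preserving tokenization (illustration only, not
# Skyflow's implementation). Real names are swapped for name-shaped surrogates
# and the mapping is kept in a separate "vault" outside the model.
import hashlib
import random

FIRST = ["Alice", "Brian", "Chloe", "Derek", "Elena", "Felix"]
LAST = ["Abbott", "Barnes", "Chandra", "Delgado", "Eriksen", "Foster"]

vault: dict[str, str] = {}  # surrogate -> original value (the stand-in vault)

def tokenize_name(real_name: str) -> str:
    """Replace a real name with a deterministic, name-shaped surrogate."""
    # Seed from a hash of the input so the same person always maps to the
    # same surrogate, keeping training and inference data consistent.
    # (Collisions are ignored for brevity in this sketch.)
    seed = int(hashlib.sha256(real_name.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    surrogate = f"{rng.choice(FIRST)} {rng.choice(LAST)}"
    vault[surrogate] = real_name
    return surrogate

def detokenize(text: str) -> str:
    """Swap surrogates back to real values for authorized consumers only."""
    for surrogate, original in vault.items():
        text = text.replace(surrogate, original)
    return text

prompt = f"Summarize the account history for {tokenize_name('James Bond')}."
print(prompt)              # the model sees a realistic-looking but fake name
print(detokenize(prompt))  # the application can restore the real identity
```

Because the surrogate is derived deterministically from the original, the same customer always appears under the same fake name, so a model can still learn relationships across records without ever seeing the real value.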

The Skyflow API effectively acts as a data privacy trust layer, storing the original PII in its privacy vault and only exposing as much as is needed for each step of an application. When PII is needed by the LLM for training or inference purposes, it can be provided in the form of polymorphic tokens. When an answer needs to be shown to a user or processed by the application, the developer can then call a redaction algorithm or tokenization to control exactly how much or how little of the PII is displayed, depending on the access required — a user might be shown just part of a name or number, or a token might be passed to a payment processor to invoke a card transaction. All of these steps can be logged for later audit. Sharma sums up:

Because we have a vault along with our ability to intercept and [mask or tokenize], we can put that information back. So from an inference perspective, your query can be with real data, but that real data never touches the model. The model generates a response thinking you are James Bond, and I am George Clooney. And then we put back just initials, just that last name, first name, whatever. And so, essentially, your inference, your chat application, whatever you're building, behaves as if you were building it on a regular enterprise-grade database.
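The end-to-end flow Sharma describes can be sketched roughly as follows. Every function here is a simplified stand-in invented for the example (the detection, masking and audit logic are placeholders, not Skyflow API calls), but the shape of the pipeline matches the description above: tokenize before the model sees the prompt, restore or mask afterwards depending on the caller's role, and log what was revealed.

```python
# Illustrative request flow for a privacy "trust layer" sitting between an
# application and an LLM. All helpers are simplified stand-ins for this sketch.
import re
from typing import Dict, Tuple

def tokenize_pii(text: str) -> Tuple[str, Dict[str, str]]:
    """Swap obvious PII (here: just email addresses) for surrogate tokens."""
    token_map: Dict[str, str] = {}
    def _swap(match: re.Match) -> str:
        surrogate = f"user{len(token_map) + 1}@example.com"  # email-shaped token
        token_map[surrogate] = match.group(0)
        return surrogate
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", _swap, text), token_map

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; echoes back the surrogate it was given."""
    return f"Sure - I have updated the account for {prompt.split()[-1]}"

def redact(text: str, token_map: Dict[str, str]) -> str:
    """Reveal only a masked form of each original value (e.g. j***@acme.com)."""
    for surrogate, original in token_map.items():
        masked = original[0] + "***" + original[original.index("@"):]
        text = text.replace(surrogate, masked)
    return text

def detokenize(text: str, token_map: Dict[str, str]) -> str:
    """Restore the original values for fully authorized callers."""
    for surrogate, original in token_map.items():
        text = text.replace(surrogate, original)
    return text

def handle_chat(user_prompt: str, caller_role: str) -> str:
    safe_prompt, token_map = tokenize_pii(user_prompt)   # model never sees real PII
    raw_answer = call_llm(safe_prompt)                   # inference on surrogate data
    if caller_role == "support_agent":
        answer = redact(raw_answer, token_map)           # partial disclosure
    else:
        answer = detokenize(raw_answer, token_map)       # full disclosure
    print(f"audit: role={caller_role} tokens={list(token_map)}")  # audit trail
    return answer

print(handle_chat("Close the account for jane.doe@acme.com", "support_agent"))
```

The design point the sketch captures is that the vault, not the model, is the system of record for real identities, so disclosure decisions can be made per request and per role rather than being baked into the model's training data.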

Rising demand

The demand to apply Skyflow's solution to LLMs has ramped quickly since it was introduced last year, Sharma tells me. The first wave of customers were largely digital-native software vendors, along with some established companies running on public cloud platforms. While the largest enterprise technology vendors have the resources to build many of these capabilities in-house, it's a big undertaking to do from the ground up. Sharma comments:

If you don't already have 15 of those things, then you have to build it from scratch... All the last-generation SaaS companies have to adopt a Salesforce-like trust layer. We can give it to them in a day.

There's since been a noticeable uptick in enquiries from traditionally tech-laggard industries. Generative AI's ability to work with unstructured data has given these industries an opportunity to automate processes that sit outside of traditional databases and haven't been viable to automate in the past. Sharma says:

There's a crossing-of-the-chasm effect going on. [Some] companies are very forward. They are obviously jumping on this very fast because they are tech forward. But weirdly enough, there are companies that have traditionally [said], 'I'm buried in PDFs, what am I going to do buying more databases?' They are finally able to build interesting things. Healthcare, health insurance, insurance in general. Almost any industry that's customer-service-heavy. Those are the areas where I've seen sudden and increased traction, in terms of market.

Investor interest

This expansion of the opportunity is one of the factors that has caught the interest of investors, he adds. As data sources broaden, first from transactional databases out to cloud data warehouses and now into data lakes that encompass unstructured data too, the challenges of protecting PII become ever more complex. He says:

One of our larger customers that's building a model in the healthcare space, they're processing 100+ million health records, which are in PDFs, and voice files, and images.

When I was talking to investors, I was [saying], 'Look, if there is a billion rows of data in a company that's structured, then there is 10 billion rows of data that's semi-structured, and there's 100 billion words of data that's sitting around in emails and PDFs and images. So it gets to orders of magnitude.'

It's also more dangerous, because, in your HR application or CRM application, you know which column has a phone number. But if you're ingesting four million emails to build a model for better customer service, there are phone numbers and account numbers [anywhere].

There's some interesting problems by the way. One of them, I call the Paris Hilton problem. If you see the words Paris and Hilton, do you assume it's Paris Hilton herself? Or is it the Hilton in Paris? The answer is, you have to derive from the context whether something is a name of a person, a city, a hotel. Those problems don't exist in database applications.
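The ambiguity Sharma describes is a classic named-entity recognition problem. As a quick illustration, the snippet below runs both readings through the open-source spaCy library (chosen purely for this example; the article does not say what tooling Skyflow uses) and shows how the same words can receive different entity labels depending on context.

```python
# A quick look at the "Paris Hilton problem" using spaCy's small English
# pipeline. This is an illustrative assumption, not Skyflow's tooling.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

for sentence in [
    "Paris Hilton spoke at the conference.",
    "We booked two nights at the Hilton in Paris.",
]:
    doc = nlp(sentence)
    print(sentence)
    for ent in doc.ents:
        print(f"  {ent.text!r} -> {ent.label_}")
# In the first sentence the model should tag "Paris Hilton" as a PERSON;
# in the second, "Hilton" typically comes out as an ORG or FAC and "Paris"
# as a GPE. Whether a span counts as PII depends on context, not a column name.
```

A small general-purpose model will not always get such cases right, which is exactly why detecting PII in free text is harder than pointing at a known phone-number column.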

Khosla Ventures led the latest $30 million funding round, with participation from existing investors. Sharma told TechCrunch that Skyflow had positioned the fundraise as an extension to its Series B round, which raised $45 million back in 2021, to appeal to "early growth" investors. He tells me that the firm, led by tech industry luminary Vinod Khosla, was on his shortlist from the outset. He says:

Vinod has bet on many companies that have become large public companies, and he's always the guy who's looking out for the next big thing ... I knew him and his style of investing, which is take big, bold bets and build large companies. I was quite sure I wanted to go that route too.

