Source: https://venturebeat.com/2022/06/04/follow-the-information-the-real-data-problem/

Follow the information (The real data problem)

[Image: A robot wearing a dunce hat sits with its head in its hand against a futuristic circuit backdrop. Credit: Donald Iain Smith/Getty]

Step right up! Come one come all! Welcome to the highest stakes game of Three Card Monte that the world has ever seen.

Deep learning is facing The Data Problem: the demand for labeled data is nearly infinite, and, arguably, the lack of labeled data in the enterprise is the key bottleneck to progress.

Let’s find the answer.

First, we’re going to pick from the staggering number of techniques that have emerged in the last several years to address The Data Problem at the core of artificial intelligence. The cards are all laid out in front of us and, surely, under one of them is the secret to the next slew of unicorns and decacorns.

Unsupervised learning, foundation models, weak supervision, transfer learning, ontologies, representation learning, semi-supervised learning, self-supervised learning, synthetic data, knowledge graphs, physical simulations, symbol manipulation, active learning, zero-shot learning and generative models.

Just to name a few.

The concepts bob and weave and join and split in bizarre and unpredictable ways. There’s not a single term in that long list that has a universally agreed-upon definition. Powerful tools and overhyped promises overlap, and the dizzying array of techniques and tools is enough to throw even the savviest customers and investors off-balance.

So, which do you pick?

All data, no information

The problem, of course, is that we never should have been watching the cards in the first place. It was never a question of which magical buzzword was going to eliminate The Data Problem because the problem was never really about data in the first place. At least, not exactly.

By itself, data is useless. In fewer than a hundred keystrokes, I can set my computer to generate enough random noise to keep a modern neural network rolling through instability until the heat death of the universe. With a little more effort and a single picture from a 10-megapixel phone, I could black out every combination of three pixels and create more data than exists on the internet today.
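As a back-of-the-envelope sketch of that second claim (my own illustration, with rough assumptions about per-image size and the total volume of data on the internet):

    # Back-of-the-envelope: how much raw "data" one photo can generate.
    # Assumptions are illustrative: ~3 MB per saved variant and ~120
    # zettabytes of data on the internet (a rough 2022 estimate).
    import math

    pixels = 10_000_000                      # a 10-megapixel photo
    variants = math.comb(pixels, 3)          # every way to black out 3 pixels
    bytes_per_variant = 3_000_000            # ~3 MB per image
    total_bytes = variants * bytes_per_variant
    internet_bytes = 120e21                  # ~120 zettabytes

    print(f"{variants:.2e} distinct images")                    # ~1.7e+20
    print(f"{total_bytes / internet_bytes:.0f}x the internet")  # thousands of times over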

Data is just a vehicle. Information is what it’s carrying. It’s critical not to confuse the two.

In the examples above, there is plenty of data, but almost no information. In massively complex, information-rich systems like loan approvals, industrial supply chains, and even social media analysis, the problem is reversed. Rivers of thought and galaxies of human expression are boiled down into reductive binaries. Like trying to mine a mountain with a pickaxe.

This is the heart of The Data Problem. It’s an unfathomable bounty of information — a billion cars on the road — that’s somehow both tangible and inaccessible. It’s thousands of people and billions of dollars carrying scant loads of tailings and gravel back and forth in captcha tests and classification datasets.

That’s where the tsunami of buzzwords comes in. For all of the hundreds of papers and the complexity of the methods themselves, the motivations and core principles are simple. The best and simplest explanation is one that I credit to Google’s Underspecification paper.

Molding neural networks

Imagine every possible neural network as a massive, fuzzy space. It can do nearly anything, but naively it does nothing.

There is something that we want this neural network to do, but we're not yet sure what. It is like unmolded clay with infinite possibilities. It's an unconstrained mess, filled to bursting with Shannon entropy, a mathematical formalization of possibility: the amount of freedom left in a system, or, equivalently, the amount of information and work we would need to add to the system to eliminate those possibilities.
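To make that formalization concrete, here is a minimal sketch (my own toy example, not from the article) of how adding constraints, which is to say information, drives Shannon entropy down:

    # Minimal sketch: Shannon entropy H(p) = -sum(p * log2(p)) measures how
    # much freedom (possibility) is left in a system.
    import math

    def entropy(probs):
        """Shannon entropy, in bits, of a discrete distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Before supervision: eight candidate behaviors, all equally plausible.
    unconstrained = [1 / 8] * 8
    print(entropy(unconstrained))   # 3.0 bits of freedom

    # After supervision rules most of them out: only two remain plausible.
    constrained = [0.5, 0.5, 0, 0, 0, 0, 0, 0]
    print(entropy(constrained))     # 1.0 bit of freedom

    # The two-bit drop is exactly the information the supervision had to supply.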

Today, we are principally interested in mimicking humans. So that information, and that work, must come from humans.

So, to progress, humans have to make decisions. There must be a winnowing down of that massive space: a reduction in Shannon entropy. It is like finding the perfect drop of water in an ocean of possibility, and it's exactly as impractical as you imagine. More practically, it's like finding the right swath of ocean. This is the equivalence set: an infinite subset of the infinitely large ocean where every option is equally optimal.

As far as you can tell.

Supervision, information captured in data, is how we winnow the ocean. It is how we say: "out of everything that you could do, this is what you should do." That is the key to cutting through the noise. There's no free lunch here, and in the blizzard of techniques and mathematics flowing at you, the information flows are what you need to focus on.

Where is new information entering the system?

Nvidia's Omniverse Replicator is a wonderful example. It is a synthetic data platform, but in truth, that label tells you very little. It describes the data; the information comes from the physics simulations. That makes it completely different from other synthetic data platforms like statice.ai, which focus on using generative models to convert information trapped in personally identifiable data into non-identifiable synthetic data that carries the same information.

Another case study is Tesla's unique active learning approach. In traditional active learning, the key source of information is the data scientist: by specifying an active learning strategy well-suited to the task, they ensure that each new training example cuts down the equivalence set further than usual. In one of his recent talks on the subject, Andrej Karpathy explains how Tesla improves significantly on this technique. Rather than having data scientists craft a single optimal active learning strategy, they combine several noisy strategies and use further human selection to identify the most impactful examples.
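A minimal sketch of that general pattern (my own reconstruction of the idea, not Tesla's actual pipeline; the scoring functions are illustrative placeholders) might blend several cheap, noisy acquisition strategies and route only the top candidates to human reviewers:

    # Sketch: blend several noisy active-learning strategies, then shortlist
    # candidates for human review. Illustrative only, not Tesla's pipeline.
    import numpy as np

    def uncertainty_score(probs):
        """Higher when the model's top two class probabilities are close."""
        sorted_p = np.sort(probs, axis=1)
        return 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])

    def disagreement_score(probs_a, probs_b):
        """Higher when two models (or heads) predict different classes."""
        return (probs_a.argmax(axis=1) != probs_b.argmax(axis=1)).astype(float)

    def rarity_score(embeddings):
        """Higher for examples far from the mean embedding (crude novelty proxy)."""
        dists = np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1)
        return dists / (dists.max() + 1e-9)

    def shortlist(probs_a, probs_b, embeddings, k=100):
        """Blend the noisy strategies and return candidates for human review."""
        combined = (
            uncertainty_score(probs_a)
            + disagreement_score(probs_a, probs_b)
            + rarity_score(embeddings)
        )
        return np.argsort(combined)[::-1][:k]  # humans pick the most impactful of these

The specific scores are stand-ins; the design point is that each cheap strategy is one more noisy channel of information, and the final human pass catches what the strategies miss.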

Unintuitively, they improve overall system performance by adding additional human intervention. Traditionally, this would be seen as a regression: more intervention means less automation, which, through the traditional lens, is worse. Seen through the lens of information, however, this approach makes perfect sense. You've dramatically increased the information bandwidth into the system, so the rate of improvement accelerates.

This is the name of the game. The explosion of buzzwords is frustrating, and, without doubt, a huge number of the people who have co-opted those buzzwords have misunderstood their promise. Nonetheless, the buzzwords are indicative of real progress. There are no magic bullets, and we've explored these fields for long enough to know that. However, each of these fields has delivered benefits in its own right, and research continues to show that there are still significant gains to be made by combining and unifying these supervision paradigms.

It’s an era of incredible possibility. Our ability to use information from previously untapped sources continues to accelerate. The biggest problems we face now are an embarrassment of riches and a bewilderment of noise. When it all seems like too much, and you have trouble sorting fact from fiction, just remember:

Follow the information.

Slater Victoroff is founder and CTO of Indico Data.
