A Massive Opportunity Exists To Build ‘Picks And Shovels’ For Machine Learning

Many multi-billion-dollar companies have been built by providing tools to make software development easier and more productive. Venture capitalists like to refer to businesses like these as “pick and shovel” opportunities, a reference to the line often attributed to Mark Twain: “When everyone is looking for gold, it’s a good time to be in the pick and shovel business.”

Atlassian, which offers a suite of software development and collaboration tools, has a public market capitalization above $30B. GitHub, a code repository, was acquired for $7.5B by Microsoft in 2018. Pivotal, which accelerates app development and deployment, was valued at $2.7B in VMware’s acquisition last year. Many more of today’s hottest high-growth startups—LaunchDarkly, GitLab, HashiCorp—offer tools for software development.

All of these companies’ tools are built for “traditional” software engineering. In recent years, an entirely new paradigm for software development has burst onto the scene: machine learning.

Building a machine learning model is radically different from building a traditional software application. It involves different activities, different workflows and different skillsets. Correspondingly, there is a need—and opportunity—to build a whole new generation of software tools. The reward for developing this next wave of “picks and shovels”: many billions of dollars of enterprise value.

How exactly do traditional software development and machine learning differ? In traditional software development, the core task is writing code. The human programmer’s job is to craft an explicit set of instructions to tell the software program what to do given different contingencies. For software programs of any sophistication, the volume of human-written code can be immense. The Internet browser Google Chrome has 6.7 million lines of code; the operating system Microsoft Windows 10 reportedly has 50 million.

On the other hand, the fundamental premise of machine learning (as its name suggests) is that the program learns for itself how to act, by ingesting and analyzing troves of data. Human programmers need not write large volumes of rules (code) to guide the software’s actions.

In this regime, the core set of tasks for software engineers is completely different. Their primary activities become, instead, to prepare datasets for the machine learning model to ingest and learn from; to establish the overall parameters that will guide the model’s learning process; and to evaluate and monitor the model’s performance once it has been trained.
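To make the contrast concrete, here is a minimal sketch of that workflow in Python using scikit-learn. The dataset, model choice and hyperparameter values are illustrative assumptions, not a recommendation; the point is that the engineer’s code prepares data, sets learning parameters and evaluates results, rather than spelling out explicit decision rules.

```python
# Minimal supervised-learning workflow sketch (illustrative only):
# prepare a dataset, set the parameters that guide learning, train, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Prepare the dataset the model will learn from.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Establish the overall parameters that guide the learning process.
model = LogisticRegression(C=1.0, max_iter=1000)

# 3. Let the model learn from the data (no hand-written rules).
model.fit(X_train, y_train)

# 4. Evaluate and monitor the trained model's performance.
print(f"Held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```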

The developer tools—the “picks and shovels”—to streamline and enhance this set of activities will look quite different from those built to support traditional software engineering.

At present, a mature ecosystem of machine learning developer tools simply does not exist; machine learning itself remains a nascent discipline, after all. As a result, ML practitioners generally manage their workflows in ad hoc ways: in documents saved on their local hard drives, in sequentially numbered file names, even by hand. These methods are not sustainable or scalable for production-grade machine learning deployments.

This market gap represents a massive opportunity. In the years ahead, billions of dollars of enterprise value will be created by providing tools for the machine learning development pipeline. Below, we walk through a few key categories in which these tools will be needed.

Data Labeling

The dominant ML approach at present is known as supervised learning, which requires a label to be attached to each piece of data in a dataset in order for the model to learn from it. (Think, for instance, of a cat photo accompanied by a text label that says “cat”.) The process of creating these labels is tedious and time-consuming.

A crop of startups has emerged to handle the unglamorous work of affixing labels to companies’ corpuses of data. These startups’ business models often rely on labor arbitrage, with large forces of workers labeling data by hand in low-cost parts of the world like India. Some players are working on technology to automate parts of the labeling process.

The most prominent company in this category is Scale AI, which focuses on the autonomous vehicle sector. Scale recently raised $100M from Founders Fund at a ~$1B valuation. Other data-labeling players include Labelbox, DefinedCrowd and Figure Eight (now owned by Appen).

It is unclear how durable these businesses will be over the long term. As has been previously argued in this column, the need for massive labeled datasets may fade as the state of the art in AI races forward.

Dataset Augmentation

More toward the cutting edge of machine learning, researchers and entrepreneurs are working on a set of innovations to reduce the amount of real-world data needed to train models and to enhance the value of existing datasets.

One of the most promising of these is synthetic data, a technique that allows AI practitioners to artificially fabricate the data that they need to train their models. As synthetic data increases in fidelity, it will make machine learning dramatically cheaper and faster, opening up myriad new use cases and business opportunities.
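As a toy illustration of the concept (not any particular vendor’s approach), the snippet below fabricates a labeled dataset programmatically with scikit-learn’s make_classification and trains a model on it. Production synthetic-data pipelines, such as simulated driving scenes, are far more sophisticated, but the economics are the same: no real-world collection or human labeling is required.

```python
# Toy illustration of synthetic data: fabricate labeled training examples
# programmatically instead of collecting and hand-labeling real-world data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fabricate 10,000 labeled examples with known structure (no human labeling).
X_syn, y_syn = make_classification(
    n_samples=10_000, n_features=20, n_informative=10, random_state=0
)

# Train on data that never had to be collected or annotated.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_syn, y_syn)
print(f"Training accuracy on synthetic data: {model.score(X_syn, y_syn):.3f}")
```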

The first commercial use case to which synthetic data has been applied at scale is autonomous vehicles; startups focusing here include Applied Intuition, Parallel Domain and Cognata. Other companies, like recently-launched Synthesis AI, are seeking to build synthetic data toolkits for computer vision more broadly.

A related category can be thought of as “data curation”: tools that evaluate and modify datasets pre-training to optimize the cost, efficiency and quality of model training runs. Gradio and Alectio are two promising early-stage startups pursuing this opportunity.
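Below is a minimal sketch of what such a curation step might look like, assuming a tabular dataset in a pandas DataFrame with a hypothetical “label” column. Real curation tools go much further (quality scoring, active-learning-style selection), but the goal is the same: a smaller, cleaner dataset before any expensive training run.

```python
# Minimal pre-training data-curation sketch (illustrative assumptions only):
# shrink and clean a dataset before paying for an expensive training run.
import pandas as pd

def curate(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    before = len(df)
    df = df.drop_duplicates()           # remove exact duplicate rows
    df = df.dropna(subset=[label_col])  # drop examples with missing labels
    print(f"Curated dataset: {before} -> {len(df)} rows")
    return df

# Hypothetical usage (file name is a placeholder):
# raw = pd.read_csv("training_data.csv")
# clean = curate(raw)
```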

A final set of data tools likely to become increasingly valuable relates to “semi-supervised learning”, an emerging technique that trains models by leveraging a small amount of labeled data together with large volumes of unlabeled data. Semi-supervised learning holds much promise because unlabeled data is vastly cheaper and easier to come by than labeled data. Snorkel AI, spun out of Stanford University, is one project generating buzz in this space.
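One simple flavor of this idea is “self-training”, or pseudo-labeling, sketched below under illustrative assumptions (synthetic data, a 0.95 confidence threshold): train on the small labeled set, let the model label the unlabeled pool where it is confident, then retrain on the combined data.

```python
# Simplified self-training (pseudo-labeling) sketch of semi-supervised learning.
# The data, split sizes and confidence threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_lab, y_lab = X[:200], y[:200]   # small, expensive labeled set
X_unlab = X[200:]                 # large, cheap unlabeled pool

# Step 1: train an initial model on the labeled data only.
model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Step 2: pseudo-label the unlabeled examples the model is confident about.
proba = model.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.95
pseudo_y = model.classes_[proba.argmax(axis=1)][confident]

# Step 3: retrain on labeled plus pseudo-labeled data.
X_all = np.vstack([X_lab, X_unlab[confident]])
y_all = np.concatenate([y_lab, pseudo_y])
model = LogisticRegression(max_iter=1000).fit(X_all, y_all)
print(f"Pseudo-labeled {int(confident.sum())} of {len(X_unlab)} unlabeled examples")
```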

This article originally appeared on forbes.com; the full article and accompanying images are available there.

Nastel Technologies uses machine learning to detect anomalies, behavior and sentiment; accelerate decisions; satisfy customers; and innovate continuously. To answer business-centric questions and provide actionable guidance for decision-makers, Nastel’s AutoPilot® for Analytics fuses:

  • Advanced predictive anomaly detection, Bayesian Classification and other machine learning algorithms
  • Raw information handling and analytics speed
  • End-to-end business transaction tracking that spans technologies, tiers, and organizations
  • Intuitive, easy-to-use data visualizations and dashboards