How To Develop Successful Machine Learning Projects On A Budget
It is a natural goal of any organization to maximize the return on investment of its machine learning (ML) projects. To do this, an organization must be aware of a number of pitfalls that can diminish that return. Many technical leaders are new to the ML field and face challenges in appraising the advice and technologies offered by third parties. Lack of direct experience with ML project planning can lead to cost overruns, and misunderstanding data quality requirements can lead to poor results.
A typical mistake is inadvertently incentivizing data scientists toward longer delivery timelines. The nature of their work is ambiguous, and it is easy to explore many technical ideas before settling on an approach. Delivery decisions are often driven by achieving technical state of the art (SOTA) rather than by the business value of marginal increases in accuracy. This tendency is pervasive in data science because of the rapid pace at which new techniques appear.
It is important for leadership to establish a framework for evaluating the success of new machine learning projects per dollar invested. Below, I suggest such a model.
In this scenario, your goal is to develop an internally facing machine learning model that provides a significant improvement over an existing business process. Let’s set your investment threshold at $250,000 USD. Beware: this is a small investment. Entry-level data science salaries are around $95,000 USD per year, putting the fully burdened cost of 12 months of a junior data scientist at roughly half your total capital.
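The back-of-the-envelope math here can be made explicit. The sketch below assumes a fully burdened cost of roughly 1.3x base salary (a common rule of thumb covering benefits, taxes, and tooling; the multiplier is my assumption, not a figure from the article):

```python
# Illustrative budget arithmetic for the scenario above.
BUDGET = 250_000          # total investment threshold (USD)
BASE_SALARY = 95_000      # entry-level data scientist salary (USD/year)
BURDEN_MULTIPLIER = 1.3   # assumed overhead multiplier (benefits, taxes, tooling)

fully_burdened = BASE_SALARY * BURDEN_MULTIPLIER   # 123,500 USD
share_of_budget = fully_burdened / BUDGET          # ~0.49 -- roughly half

print(f"Fully burdened cost: ${fully_burdened:,.0f}")
print(f"Share of budget: {share_of_budget:.0%}")
```

With any plausible burden multiplier between 1.25x and 1.4x, a single junior hire consumes 47–53% of the capital before any data exists, which is the point of the warning.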
Necessary But Not Sufficient
Let’s assume your organization doesn’t have prior artificial intelligence (AI) work. Therefore, the first member of your team is not a data scientist. Rather, your first hire is a data engineer. A data scientist’s role is to investigate data for insights. A data engineer’s role is to create the data set.
This person is responsible for two foundational pieces of our investment. The first is the generation of your compounding dataset and the second is the creation of your data pipeline.
I described the importance of the compounding dataset in a previous article. To summarize, a compounding dataset is: “a dataset that, in conjunction with a machine-learning model, continually yields complementary data… These datasets are important because they prevent other companies from effectively making similar products. I draw a distinction between general big data and a compounding dataset. The distinction of the latter is that it grows in importance exponentially rather than purely in size.”
The data pipeline is the engineering infrastructure that supports the collection, processing and storage of your compounding data set. The data engineer is the person who will create the data pipeline and allow you to quickly generate multiple datasets for testing. By the time this is complete, in my experience, you will have spent around $140,000 USD.
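In code terms, the pipeline the data engineer builds reduces to a collect, process, store loop. This is a minimal sketch under my own assumptions (records as dictionaries, a CSV sink, stubbed source data); it illustrates the shape of the work, not a prescribed architecture:

```python
import csv
from typing import Iterable, Iterator

def collect() -> Iterator[dict]:
    """Pull raw records from a source system (stubbed with sample rows here)."""
    yield {"user_id": "1", "action": "click", "value": " 3 "}
    yield {"user_id": "2", "action": "click", "value": "bad"}

def process(records: Iterable[dict]) -> Iterator[dict]:
    """Validate and normalize records; silently drop rows that fail checks."""
    for rec in records:
        try:
            rec["value"] = int(rec["value"].strip())
        except ValueError:
            continue  # skip malformed rows (a real pipeline would quarantine them)
        yield rec

def store(records: Iterable[dict], path: str) -> int:
    """Write clean records to a CSV sink; return the number of rows stored."""
    rows = list(records)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user_id", "action", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

if __name__ == "__main__":
    n = store(process(collect()), "clean_events.csv")
    print(f"stored {n} clean rows")
```

The value of this structure is that swapping the stubbed `collect` for a real source lets you regenerate datasets quickly for testing, which is exactly the capability the data engineer is hired to provide.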
This raises the question: how can you evaluate the business value of the data you are collecting? Typically, this is done by a data scientist. However, betting your investment on an engineer and a junior data scientist seems like a questionable decision. You cannot be confident that a junior data scientist will create a machine-learning model that improves on an existing process.
Is there a creative solution you can use instead?
This article originally appeared on forbes.com.