The dos and don’ts of machine learning research
Machine learning is becoming an important tool in many industries and fields of science. But ML research and product development present several challenges that, if not addressed, can steer your project in the wrong direction.
In a paper recently published on the arXiv preprint server, Michael Lones, Associate Professor in the School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, provides a list of dos and don’ts for machine learning research.
The paper, which Lones describes as “lessons that were learnt whilst doing ML research in academia, and whilst supervising students doing ML research,” covers the challenges of different stages of the machine learning research lifecycle. Although aimed at academic researchers, the paper’s guidelines are also useful for developers who are creating machine learning models for real-world applications.
Here are my takeaways from the paper, though I recommend anyone involved in machine learning research and development to read it in full.
Pay extra attention to data
Machine learning models live and thrive on data. Accordingly, across the paper, Lones reiterates the importance of paying extra attention to data across all stages of the machine learning lifecycle. You must be careful of how you gather and prepare your data and how you use it to train and test your machine learning models.
No amount of computation power and advanced technology can help you if your data doesn’t come from a reliable source and hasn’t been gathered in a reliable manner. And you should also use your own due diligence to check the provenance and quality of your data. “Do not assume that, because a data set has been used by a number of papers, it is of good quality,” Lones writes.
Your dataset might have various problems that can lead to your model learning the wrong thing.
For example, if you’re working on a classification problem and your dataset contains too many examples of one class and too few of another, then the trained machine learning model might end up learning to predict every input as belonging to the stronger class. In this case, your dataset suffers from “class imbalance.”
While class imbalance can be spotted quickly with data exploration practices, finding other problems needs extra care and experience. For example, if all the pictures in your dataset were taken in daylight, then your machine learning model will perform poorly on dark photos. A more subtle example is the equipment used to capture the data. For instance, if you’ve taken all your training photos with the same camera, your model might end up learning to detect the unique visual footprint of your camera and will perform poorly on images taken with other equipment. Machine learning datasets can have all kinds of such biases.
The quantity of data is also an important issue. Make sure your data is available in enough abundance. “If the signal is strong, then you can get away with less data; if it’s weak, then you need more data,” Lones writes.
In some fields, the lack of data can be compensated for with techniques such as cross-validation and data augmentation. But in general, you should know that the more complex your machine learning model, the more training data you’ll need. For example, a few hundred training examples might be enough to train a simple regression model with a few parameters. But if you want to develop a deep neural network with millions of parameters, you’ll need much more training data.
Another important point Lones makes in the paper is the need to have a strong separation between training and test data. Machine learning engineers usually put aside part of their data to test the trained model. But sometimes, the test data leaks into the training process, which can lead to machine learning models that don’t generalize to data gathered from the real world.
“Don’t allow test data to leak into the training process,” he warns. “The best thing you can do to prevent these issues is to partition off a subset of your data right at the start of your project, and only use this independent test set once to measure the generality of a single model at the end of the project.”
In more complicated scenarios, you’ll need a “validation set,” a second test set that puts the machine learning model into a final evaluation process. For example, if you’re doing cross-validation or ensemble learning, the original test might not provide a precise evaluation of your models. In this case, a validation set can be useful.
“If you have enough data, it’s better to keep some aside and only use it once to provide an unbiased estimate of the final selected model instance,” Lones writes.
Know your models (as well as those of others)
Today, deep learning is all the rage. But not every problem needs deep learning. In fact, not every problem even needs machine learning. Sometimes, simple pattern-matching and rules will perform on par with the most complex machine learning models at a fraction of the data and computation costs.
But when it comes to problems that are specific to machine learning models, you should always have a roster of candidate algorithms to evaluate. “Generally speaking, there’s no such thing as a single best ML model,” Lones writes. “In fact, there’s a proof of this, in the form of the No Free Lunch theorem, which shows that no ML approach is any better than any other when considered over every possible problem.”
The first thing you should check is whether your model matches your problem type. For example, based on whether your intended output is categorical or continuous, you’ll need to choose the right machine learning algorithm along with the right structure. Data types (e.g., tabular data, images, unstructured text, etc.) can also be a defining factor in the class of model you use.
One important point Lones makes in his paper is the need to avoid excessive complexity. For example, if you’re problem can be solved with a simple decision tree or regression model, there’s no point in using deep learning.
Lones also warns against trying to reinvent the wheel. With machine learning being one of the hottest areas of research, there’s always a solid chance that someone else has solved a problem that is similar to yours. In such cases, the wise thing to do would be to examine their work. This can save you a lot of time because other researchers have already faced and solved challenges that you will likely meet down the road.
“To ignore previous studies is to potentially miss out on valuable information,” Lones writes.
Examining papers and work by other researchers might also provide you with machine learning models that you can use and repurpose for your own problem. In fact, machine learning researchers often use each other’s models to save time and computational resources and start with a baseline trusted by the ML community.
“It’s important to avoid ‘not invented here syndrome,’ i.e., only using models that have been invented at your own institution, since this may cause you to omit the best model for a particular problem,” Lones warns.
Know the final goal and its requirements
Having a solid idea of what your machine learning model will be used for can greatly impact its development. If you’re doing machine learning purely for academic purposes and to push the boundaries of science, then there might be no limits to the type of data or machine learning algorithms you can use. But not all academic work will remain confined in research labs.
“[For] many academic studies, the eventual goal is to produce an ML model that can be deployed in a real world situation. If this is the case, then it’s worth thinking early on about how it is going to be deployed,” Lones writes.
For example, if your model will be used in an application that runs on user devices and not on large server clusters, then you can’t use large neural networks that require large amounts of memory and storage space. You must design machine learning models that can work in resource-constrained environments.
Another problem you might face is the need for explainability. In some domains, such as finance and healthcare, application developers are legally required to provide explanations of algorithmic decisions in case a user demands it. In such cases, using a black-box model might be impossible. For example, even though a deep neural network might give you a performance advantage, its lack of interpretability might make it useless. Instead, a more transparent model such as a decision tree might be a better choice even if it results in a performance hit. Alternatively, if deep learning is an absolute requirement for your application, then you’ll need to investigate techniques that can provide reliable interpretations of activations in the neural network.
As a machine learning engineer, you might not have precise knowledge of the requirements of your model. Therefore, it is important to talk to domain experts because they can help to steer you in the right direction and determine whether you’re solving a relevant problem or not.
“Failing to consider the opinion of domain experts can lead to projects which don’t solve useful problems, or which solve useful problems in inappropriate ways,” Lones writes.
This article originally appeared on venturebeat.com, to read the full article, click here.
Nastel Technologies is the global leader in Integration Infrastructure Management (i2M). It helps companies achieve flawless delivery of digital services powered by integration infrastructure by delivering tools for Middleware Management, Monitoring, Tracking, and Analytics to detect anomalies, accelerate decisions, and enable customers to constantly innovate, to answer business-centric questions, and provide actionable guidance for decision-makers. It is particularly focused on IBM MQ, Apache Kafka, Solace, TIBCO EMS, ACE/IIB and also supports RabbitMQ, ActiveMQ, Blockchain, IOT, DataPower, MFT, IBM Cloud Pak for Integration and many more.
The Nastel i2M Platform provides:
- Secure self-service configuration management with auditing for governance & compliance
- Message management for Application Development, Test, & Support
- Real-time performance monitoring, alerting, and remediation
- Business transaction tracking and IT message tracing
- AIOps and APM
- Automation for CI/CD DevOps
- Analytics for root cause analysis & Management Information (MI)
- Integration with ITSM/SIEM solutions including ServiceNow, Splunk, & AppDynamics