Designing AI systems: Fundamentals of AI software and hardware
Artificial intelligence is already solving problems in many aspects of our lives, from animated filmmaking and space exploration to fast-food recommendation systems that improve ordering efficiency. These real-world examples of AI systems are just the beginning of what is possible in an AI Everywhere future, and they are already testing the limits of compute power. Tomorrow’s AI solutions will require optimization up and down the stack, from hardware to software, including the tools and frameworks used to implement end-to-end AI and data science pipelines.
AI math operations need powerful hardware
A simple example can help illustrate the root of the challenge. In Architecture All Access: Artificial Intelligence Part 1 – Fundamentals, Andres Rodriguez, Intel Fellow and AI Architect, shows how a simple deep neural network (DNN) that identifies handwritten digits requires over 100,000 weight parameters for just the first layer of multiplications. This is for a simple DNN that processes 28×28 black-and-white images.
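To make the arithmetic concrete, here is a minimal sketch of where a figure above 100,000 could come from. The hidden-layer width of 128 units is an assumption chosen for illustration (the article does not state it), and `dense_layer_weights` is a hypothetical helper name.

```python
def dense_layer_weights(num_inputs: int, num_units: int) -> int:
    """Weights in a fully connected layer: one per (input, unit) pair, biases excluded."""
    return num_inputs * num_units


pixels = 28 * 28          # one input value per pixel of a 28x28 grayscale image
hidden_units = 128        # ASSUMPTION: illustrative width, not taken from the article

first_layer_weights = dense_layer_weights(pixels, hidden_units)
print(pixels)               # 784
print(first_layer_weights)  # 100352
```

Each of those weights takes part in one multiplication per input image, which is why even this small network already implies a six-figure number of multiplications in its first layer.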
Today’s AI solutions process image frames of 1280×720 and higher, with separate channels for red, green and blue. The neural networks are also more complex, as they must identify and track multiple objects from frame to frame, or extract meaning from arrangements of words whose order may or may not change that meaning. We are already seeing models surpassing trillions of parameters that require multiple weeks to train. Tomorrow’s AI solutions will be even more complex, combining multiple types of models and data.
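The jump in input size can be estimated with simple arithmetic. The sketch below only counts input values per image; the variable names are illustrative, and it says nothing about how a real network would connect those inputs to its layers.

```python
# Input values per image: width * height * channels.
small_image_inputs = 28 * 28 * 1      # 28x28, single black-and-white channel
hd_frame_inputs = 1280 * 720 * 3      # 1280x720 with red, green and blue channels

ratio = hd_frame_inputs / small_image_inputs
print(small_image_inputs)  # 784
print(hd_frame_inputs)     # 2764800
print(round(ratio))        # 3527
```

A single 1280×720 color frame therefore carries roughly 3,500 times as many input values as the 28×28 example, before any increase in model complexity is considered.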
AI application development is an iterative process, so speeding up compute-intensive tasks lets a developer explore more options or simply finish the job more quickly. As Andres explains in the Part 1 video, matrix multiplications often make up the bulk of the compute load during the training process.
In Architecture All Access: Artificial Intelligence Part 2 – Hardware, Andres compares the different capabilities of CPUs, GPUs and various specialized architectures. The AI-specific devices, and many new GPUs, have systolic arrays of multiply-accumulate (MAC) units that can parallelize the matrix multiplications inherent to the training process.
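As a rough illustration of why MAC units matter, the sketch below writes a matrix multiplication as explicit multiply-accumulate steps and counts them. It is plain sequential Python with a hypothetical function name (`matmul_mac`), not a model of any particular device; a systolic array performs many of these inner steps at the same time.

```python
def matmul_mac(a, b):
    """Multiply matrices a (rows x inner) and b (inner x cols), counting MAC operations."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
            out[i][j] = acc
    return out, macs


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, mac_count = matmul_mac(a, b)
print(result)     # [[19.0, 22.0], [43.0, 50.0]]
print(mac_count)  # 8
```

The MAC count is rows × inner × cols, and every `(i, j)` output can be computed independently of the others, which is the parallelism that dedicated MAC arrays exploit.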
Size and complexity of AI systems require hardware heterogeneity
As neural networks become more complex and dynamic (for instance, those with a directed acyclic graph (DAG) structure), they become harder to parallelize, and their irregular memory access patterns call for low-latency memory access. CPUs can be a good fit for these requirements due to their flexibility and higher operating frequencies.
With increasing network size, even larger amounts of data need to be moved between compute and memory. Given the growth of MACs available in hardware devices, memory bandwidth and the bandwidth between nodes within a server and across servers are becoming the limiting factors for performance.
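A back-of-the-envelope estimate shows how data movement can dominate. The sketch below compares the time to compute one layer for a single input vector with the time to read that layer's weights from memory. Every number is a hypothetical round figure chosen for illustration, not a specification of any real device, and `layer_time_estimates` is an illustrative helper name.

```python
def layer_time_estimates(num_weights, bytes_per_weight, macs_per_second, bytes_per_second):
    """Return (compute_seconds, memory_seconds) for one pass over a layer's weights.

    Assumes one MAC per weight (a single input vector) and that every
    weight is read from memory exactly once.
    """
    compute_s = num_weights / macs_per_second
    memory_s = num_weights * bytes_per_weight / bytes_per_second
    return compute_s, memory_s


compute_s, memory_s = layer_time_estimates(
    num_weights=100_000_000,   # ASSUMPTION: hypothetical layer size
    bytes_per_weight=4,        # 32-bit floating-point weights
    macs_per_second=1e13,      # ASSUMPTION: hypothetical MAC throughput
    bytes_per_second=1e11,     # ASSUMPTION: hypothetical memory bandwidth
)
bound = "memory" if memory_s > compute_s else "compute"
print(bound)  # memory
```

Under these assumed figures the weights take far longer to fetch than to multiply, so adding more MAC units would not help; only more bandwidth or less data per weight would.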
Once a network is trained and ready for deployment as part of an application, a new set of hardware requirements typically arises. AI inference often requires low latency, that is, producing an answer as quickly as possible, whether that means keeping up with city traffic in real time, inspecting parts on a production line, or providing timely fast-food recommendations to reduce wait times. Additional requirements, such as cost, form factor and power profile, tend to be more application-specific. CPUs are often used for deployment because one core can be dedicated to AI inference, leaving the other cores available for the application and other tasks.
The trend is toward more heterogeneous computing, combining general-purpose CPU-like compute with dedicated AI-specific resources. The degree to which these devices are specialized for training versus inference may differ, but they share common features that improve AI processing. The bfloat16 (BF16) data type provides floating-point dynamic range with shorter word lengths, reducing the size of data and enabling more parallelism. 8-bit integer (INT8) data types enable further optimization but limit the dynamic range, and thus require the AI developer to make some implementation tradeoffs. Dedicated MAC-based systolic arrays parallelize the heavy computation loads associated with training and inference. And high-bandwidth memory provides wide highways to speed data between compute and memory. But these hardware features require software that can take advantage of them.
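The BF16 and INT8 tradeoff can be sketched in a few lines of standard-library Python. This is an illustration under simplifying assumptions: BF16 is emulated by truncating a 32-bit float to its top 16 bits (hardware typically rounds rather than truncates), and INT8 uses one symmetric scale for the whole list. The function names are hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def quantize_int8(values):
    """Symmetric INT8 quantization with a single scale for all values."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# BF16 keeps FP32's exponent range but has fewer mantissa bits.
print(to_bfloat16(3.14159))  # close to pi, but coarser than FP32
print(to_bfloat16(3.0e38))   # very large values remain representable

# INT8 has only 255 levels here, so small values next to large ones collapse to zero.
weights = [0.5, -1.0, 0.25, 0.001]
codes, scale = quantize_int8(weights)
print(codes)
print(dequantize(codes, scale))
```

In this example the value 0.001 quantizes to 0 because it is far smaller than the largest weight, which is the kind of dynamic-range limit that forces the implementation tradeoffs mentioned above.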
Unifying and optimizing the software stack
A key focus of this trend toward heterogeneous computing is software. oneAPI is a cross-industry, open, standards-based unified programming model that delivers performance across multiple architectures. The oneAPI initiative encourages community and industry collaboration on the open oneAPI specification and compatible oneAPI implementations across the ecosystem.
We have already outlined how a given AI system may require different types of hardware for training and for inference. Even before training, while preparing the dataset and exploring network options, a data scientist will be more productive when extract, transform and load (ETL) tasks respond faster. Different systems will also have different hardware requirements depending on what type of AI is being developed.
AI is typically part of a specific application, for instance the filmmaking, space exploration or fast-food recommendation examples listed earlier. This top layer of the stack is developed using middleware and AI frameworks.
There is no shortage of AI tools and frameworks available for creating, training and deploying AI models. Developers choose among them according to their task (for instance PyTorch*, TensorFlow* or others for deep learning, and XGBoost, scikit-learn* or others for machine learning) as well as their experience, preferences or need for code reuse. This layer of the stack also includes the libraries used during application and AI development, such as NumPy, SciPy or pandas. These frameworks and libraries are the engines that automate the data science and AI tasks.
But with all the innovation in hardware to accelerate AI, how can a framework or library know how to take advantage of whatever hardware resources it’s running on? That is the job of the bottom layer of the software stack, which lets all the software in the upper layers interact with the specific hardware it’s running on without anyone having to write hardware-specific code. As Huma Abidi, Senior Director of Artificial Intelligence and Deep Learning at Intel, explains in Architecture All Access: Artificial Intelligence Part 3 – Software, this layer is created by developers who understand the capabilities and instruction sets available on a given device.
This article originally appeared on venturebeat.com, where the full article is available.