Best practices for bolstering machine learning security
Nearly 75% of the world’s largest companies have already integrated AI and machine learning (ML) into their business strategies. As more companies, and their customers, gain value from ML applications, organizations should consider new security best practices to keep pace with the evolving technology landscape.
Companies that utilize dynamic or high-speed transactional data to build, train, or serve ML models today have an important opportunity to ensure their ML applications operate securely and as intended. A well-managed approach that takes into account a range of ML security considerations can detect, prevent, and mitigate potential threats while ensuring ML continues to deliver on its transformational potential.
Machine learning security is business critical
ML security has the same goal as all cybersecurity measures: reducing the risk of sensitive data being exposed. If a bad actor interferes with your ML model or the data it uses, that model may output incorrect results that, at best, undermine the benefits of ML and, at worst, negatively impact your business or customers.
“Executives should care about this because there’s nothing worse than doing the wrong thing very quickly and confidently,” says Zach Hanif, vice president of machine learning platforms at Capital One. And while Hanif works in a regulated industry—financial services—requiring additional levels of governance and security, he says that every business adopting ML should take the opportunity to examine its security practices.
Devon Rollins, vice president of cyber engineering and machine learning at Capital One, adds, “Securing business-critical applications requires a level of differentiated protection. It’s safe to assume many deployments of ML tools at scale are critical given the role they play for the business and how they directly impact outcomes for users.”
Novel security considerations to keep in mind
While best practices for securing ML systems are similar to those for any software or hardware system, greater ML adoption also presents new considerations. “Machine learning adds another layer of complexity,” explains Hanif. “This means organizations must consider the multiple points in a machine learning workflow that can represent entirely new vectors.” These core workflow elements include the ML models, the documentation and systems around those models and the data they use, and the use cases they enable.
It’s also imperative that ML models and supporting systems are developed with security in mind right from the start. It is not uncommon for engineers to rely on freely available open-source libraries developed by the software community, rather than coding every single aspect of their program. These libraries are often designed by software engineers, mathematicians, or academics who might not be as well versed in writing secure code. “The people and the skills necessary to develop high-performance or cutting-edge ML software may not always intersect with security-focused software development,” Hanif adds.
According to Rollins, this underscores the importance of sanitizing open-source code libraries used for ML models. Developers should consider using confidentiality, integrity, and availability as a framework to guide information security policy. Confidentiality means that data assets are protected from unauthorized access; integrity refers to the quality and security of data; and availability ensures that the right authorized users can easily access the data needed for the job at hand.
Additionally, ML input data can be manipulated to compromise a model. One risk is inference manipulation: essentially, changing data to trick the model. Because ML models interpret data differently than the human brain, data could be manipulated in ways that are imperceptible to humans, but that nevertheless change the results. For example, all it may take to compromise a computer vision model is changing a pixel or two in an image of a stop sign used in that model. The human eye would still see a stop sign, but the ML model might not categorize it as a stop sign. Alternatively, one might probe a model by sending a series of varying input data, thus learning how the model works. By observing how the inputs affect the system, Hanif explains, outside actors might figure out how to disguise a malicious file so it eludes detection.
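To make inference manipulation concrete, here is a minimal sketch in Python. It assumes a toy linear classifier rather than a real computer vision model, and every name and number in it (WEIGHTS, BIAS, perturb_against, the feature values) is hypothetical. Each input feature is nudged by at most a small epsilon in the direction that lowers the model's score.

```python
def score(weights, bias, features):
    """Linear score: a positive value means class 1, otherwise class 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(weights, bias, features):
    return 1 if score(weights, bias, features) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def perturb_against(weights, features, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * sign(w) for w, x in zip(weights, features)]


# Hypothetical model parameters and input, chosen only for illustration.
WEIGHTS = [1.0, -2.0, 0.5]
BIAS = 0.1
original = [0.2, 0.1, 0.1]
adversarial = perturb_against(WEIGHTS, original, epsilon=0.05)

print(predict(WEIGHTS, BIAS, original), predict(WEIGHTS, BIAS, adversarial))
```

In this toy setup no feature moves by more than 0.05, yet the predicted class flips from 1 to 0. Real attacks on image models are more involved, but the principle of small, targeted changes is the same.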
Another vector for risk is the data used to train the system. A third party might “poison” the training data so that the machine learns something incorrectly. As a result, the trained model will make mistakes—for example, automatically identifying all stop signs as yield signs.
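One simple way to look for this kind of tampering is to compare the label mix of incoming training data against a trusted baseline. The Python sketch below is an illustration under stated assumptions, not a method described by Hanif or Rollins: the function names and the 10% tolerance are hypothetical, and a careful attacker could poison data without shifting label shares at all.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction between 0 and 1."""
    total = len(labels)
    if total == 0:
        return {}
    return {label: count / total for label, count in Counter(labels).items()}


def drifted_labels(trusted_labels, new_labels, tolerance=0.10):
    """Flag labels whose share moved by more than `tolerance` (an assumed threshold)."""
    trusted = label_shares(trusted_labels)
    new = label_shares(new_labels)
    flagged = {}
    for label in sorted(set(trusted) | set(new)):
        before = trusted.get(label, 0.0)
        after = new.get(label, 0.0)
        if abs(after - before) > tolerance:
            flagged[label] = (before, after)
    return flagged


# Hypothetical example: stop signs nearly vanish from a new batch of labels.
baseline = ["stop"] * 50 + ["yield"] * 50
incoming = ["stop"] * 5 + ["yield"] * 95
print(drifted_labels(baseline, incoming))
```

A flagged label is a prompt for human review, not proof of an attack; legitimate changes in data collection can shift the mix as well.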
Core best practices to enhance machine learning security
Given the proliferation of businesses using ML and the nuanced approaches for managing risk across these systems, how can organizations ensure their ML operations remain safe and secure? When developing and implementing ML applications, Hanif and Rollins say, companies should first use general cybersecurity best practices, such as keeping software and hardware up to date, ensuring their model pipeline is not internet-exposed, and using multi-factor authentication (MFA) across applications.
After that, they suggest paying special attention to the models, the data, and the interactions between them. “Machine learning is often more complicated than other systems,” Hanif says. “Think about the complete system, end-to-end, rather than the isolated components. If the model depends on something, and that something has additional dependencies, you should keep an eye on those additional dependencies, too.”
Hanif recommends evaluating three key things: your input data, your model’s interactions and output, and potential vulnerabilities or gaps in your data or models.
Start by scrutinizing all input data. “You should always approach data from a strong risk management perspective,” Hanif says. Look at the data with a critical eye and use common sense. Is it logical? Does it make sense within your domain? For example, if your input data is based on test scores that range from 0 to 100, numbers like 200,000 or 1 million in your input data would be red flags.
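The test-score example translates directly into a range check. The following Python sketch assumes scores arrive as a plain list; the function name find_invalid_scores and its default bounds are illustrative choices rather than part of any particular library.

```python
import math


def find_invalid_scores(scores, low=0, high=100):
    """Return (index, value) pairs for entries that are not numbers within [low, high]."""
    invalid = []
    for index, value in enumerate(scores):
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or math.isnan(value) or not (low <= value <= high):
            invalid.append((index, value))
    return invalid


# Hypothetical batch containing the red flags mentioned above.
batch = [88, 200000, 42.5, 1_000_000, None]
print(find_invalid_scores(batch))
```

Rejecting or quarantining the flagged rows before they reach training or inference keeps obviously implausible values out of the pipeline.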
Next, examine how the model interacts with a variety of data and what kind of output it produces. Hanif suggests testing models in a controlled environment with different kinds of data. “You need to test the components of the system, like a plumber might test a pipe by running a small amount of water through it to check for leaks before pressurizing the entire line,” he says. Try feeding a model poor data and see what happens. This may reveal gaps in coverage; if so, you can build guardrails to secure the process.
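A small harness can serve as the "small amount of water" in this analogy. The Python sketch below assumes a hypothetical toy_model that maps a 0 to 100 test score to a probability; the model, the list of poor inputs, and the status strings are all invented for illustration.

```python
import math


def toy_model(score):
    """Hypothetical model: maps a 0-100 test score to a pass probability."""
    return 1.0 / (1.0 + math.exp(-(score - 50.0) / 10.0))


# Deliberately poor inputs: missing, wrong type, not-a-number, out of range.
POOR_INPUTS = [None, "ninety", float("nan"), -5, 200000, 1e9]


def probe(model, inputs):
    """Feed each input to the model and record how the model reacts."""
    findings = []
    for value in inputs:
        try:
            output = model(value)
        except Exception as error:
            findings.append((value, "raised " + type(error).__name__))
            continue
        if isinstance(output, float) and math.isnan(output):
            findings.append((value, "returned NaN"))
        else:
            findings.append((value, "accepted silently"))
    return findings


for value, status in probe(toy_model, POOR_INPUTS):
    print(repr(value), "->", status)
```

Inputs that are "accepted silently" despite being out of range are the gaps in coverage worth noting: they point to where a guardrail, such as input validation in front of the model, is missing.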
Query management provides an added security buffer. Rather than letting users query models directly, which might open a door by which outsiders can access or introspect your models, you can create an indirect query method as a layer of protection.
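As one possible shape for such an indirect query layer, the Python sketch below wraps a model in a gateway that limits how often each client can query it and returns only a coarse decision instead of the raw confidence score. The class ModelGateway, the limits, and the "approve" and "review" labels are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class ModelGateway:
    """Sits between callers and a model: rate-limits queries and hides raw scores."""

    def __init__(self, model, max_queries, window_seconds, clock=time.monotonic):
        self._model = model
        self._max_queries = max_queries
        self._window = window_seconds
        self._clock = clock
        self._history = defaultdict(deque)  # client_id -> timestamps of recent queries

    def query(self, client_id, features):
        now = self._clock()
        recent = self._history[client_id]
        # Forget queries that have aged out of the sliding window.
        while recent and now - recent[0] >= self._window:
            recent.popleft()
        if len(recent) >= self._max_queries:
            raise PermissionError("query limit reached for this client")
        recent.append(now)
        confidence = self._model(features)
        # Return a coarse label only; the exact confidence never leaves the gateway.
        return "approve" if confidence >= 0.5 else "review"


def hypothetical_model(features):
    """Stand-in for a real model: the mean of the features, clamped to [0, 1]."""
    return min(1.0, max(0.0, sum(features) / len(features)))


gateway = ModelGateway(hypothetical_model, max_queries=2, window_seconds=60)
print(gateway.query("client-a", [0.9, 0.8]))
```

Limiting query volume and withholding exact scores both make it harder for an outsider to learn how the model works by observing its responses to varied inputs.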
Finally, consider how and why someone would target your models or data, whether intentionally or not. Rollins notes that any assessment of attacker motivations must include the insider threat perspective. “The privileged data access that machine learning developers have within an organization can be attractive targets to adversaries,” he says, which underscores the importance of safeguarding against exfiltration events both internally and externally.
How might that targeting change something that could throw off the whole model or its intended outcome? In the scenario of an external adversary interfering with a computer vision model used in autonomous driving, for instance, the goal might be to trick the model into recognizing yellow lights as green lights. “Think about what happens to your system if there is an unethical individual on the other end,” says Hanif.
Tech community rallies around machine learning security
The tech industry has become very sophisticated very quickly, so most ML engineers and AI developers have adopted good security practices. “Integrating risk management into the fabric of machine learning applications—just as any business would for critical legacy applications, like customer databases—can set up the organization for success from the outset,” noted Rollins. “Machine learning presents unique and novel approaches for thinking about security in more thoughtful ways,” agreed Hanif. Both are encouraged by a recent surge of interest and effort in improving ML security.
In 2021, for example, researchers from 12 organizations, including Microsoft and MITRE, published the Adversarial ML Threat Matrix. The matrix aims to help organizations secure their production ML systems by better understanding where those systems are exposed or vulnerable to bad actors, as well as trends in data poisoning, model theft, and adversarial examples. The AI Incident Database (AIID), created in 2021 and maintained by leading ML practitioners, collects community incident reports of attacks and near-attacks on AI systems.
Although ML systems introduce complexities that require novel security approaches, companies that thoughtfully implement best practices can better ensure long-term stability and positive outcomes. “As long as ML practitioners are aware of the complexity, account for it, and can detect and respond if something goes wrong, ML will remain an incredibly valuable tool for businesses and for customer experiences,” says Hanif.
This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.