Hacking AI/ML: H2O Exposes Entire Filesystem

H2O-3 is a popular open source AutoML tool for creating machine learning models.
It exposes your entire file system remotely by design; no exploit is needed.
As of writing, no fix exists for the current version, 3.42.0.1.

What is H2O-3?

H2O-3 is a low-code tool that abstracts away most of the details of creating a machine learning model. It is the most popular open source repository from h2o.ai with more than 500,000 downloads a month. Users either download the standalone .jar file and run the server or import the h2o library in Python or R and initialize the server with h2o.init().

Vulnerabilities

File Path Exposure

H2O allows users to see every file and directory path on the filesystem through its Typeahead API call.

Local File Inclusion

H2O allows users to read any file on the system.

Arbitrary File Overwrite

H2O allows users to overwrite any file on the filesystem.

Via these built-in vulnerabilities, attackers have a direct and stealthy path to stealing Secure Shell (SSH) keys, API keys, models, data, and anything else that interests them on the H2O server. Through the file overwrite vulnerability, an attacker can also cause a denial of service or poison datasets with malicious data.
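Because the file read runs with the privileges of the H2O process, one rough way to gauge exposure is to list which sensitive files that account can read. The sketch below is a defensive audit using only the Python standard library. The function names and the candidate paths are illustrative assumptions, not an exhaustive list and not part of H2O.

```python
import os
from pathlib import Path


def readable_by_current_user(paths):
    """Return the subset of paths that exist as files and are readable
    by the account running this script."""
    found = []
    for p in map(Path, paths):
        if p.is_file() and os.access(p, os.R_OK):
            found.append(str(p))
    return found


def default_candidates():
    """Illustrative (assumed) locations of common secrets on Linux/macOS."""
    home = Path.home()
    return [
        home / ".ssh" / "id_rsa",
        home / ".ssh" / "id_ed25519",
        home / ".aws" / "credentials",
        home / ".config" / "gcloud" / "application_default_credentials.json",
    ]


if __name__ == "__main__":
    # Run this as the same OS user that runs the H2O server.
    for path in readable_by_current_user(default_candidates()):
        print("readable by this account:", path)
```

Running the H2O server under a dedicated low-privilege account shrinks this list, although it does not remove the underlying design issue.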

The H2O maintainers were contacted multiple times about these vulnerabilities over the course of several months with no response.

Vulnerable by Design

Note that no actual exploitation was needed. H2O was (and is, at the time of this writing) insecure by design. Initially, we thought the H2O documentation must contain some kind of warning that the tool is designed to be deployed only on the localhost interface and never exposed remotely. That would largely mitigate these vulnerabilities, since a user would only reach files they can already access as a local user. This is not the case, however: H2O implements various methods of authentication, which implies the server is designed to be used by multiple users. Additionally, there does not appear to be any safe way of running H2O from the h2o.jar file.

The H2O quickstart guide:

Anyone who follows this guide and runs H2O from the h2o.jar file exposes their entire file system to all remote computers on the network. Even passing arguments such as java -jar h2o.jar -ip 127.0.0.1 still exposes the server’s file system to the entire network. It seems the only safe way of running H2O is to use the programmatic interface and not override the bind_to_localhost argument in h2o.init().
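Since a command-line flag may not restrict the listener as expected, it is safer to test reachability than to trust the configuration. The sketch below starts H2O through the Python interface with bind_to_localhost left at True, and provides a small TCP probe. The helper names are hypothetical, and port 54321 is assumed to be the H2O default; adjust it if your server uses another port.

```python
import socket

H2O_PORT = 54321  # assumed default H2O port


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_local_h2o():
    """Start H2O via the Python API, keeping it bound to localhost.

    h2o is a third-party package; it is imported lazily so the probe
    above can be used on machines where h2o is not installed.
    """
    import h2o
    h2o.init(bind_to_localhost=True)  # True is already the default


def exposed_beyond_localhost(lan_ip, port=H2O_PORT):
    """Probe the server on a non-loopback address (lan_ip is supplied
    by the caller, e.g. the machine's address on the office network)."""
    return port_open(lan_ip, port)
```

The probe is most meaningful when run from a second machine on the same network, because a local firewall can hide a listener that is bound to every interface.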

Practical Attack Paths

A realistic attack path of an outsider exploiting this vulnerability in an average organization is as follows:

1. Machine learning engineer starts an H2O server on their work laptop.
2. Intern clicks an attachment in a phishing email despite a few spelling errors.
3. Attacker logs into intern’s laptop and runs EyeWitness to screenshot the homepage of all web servers on the internal network. Attacker sees a screenshot of H2O Flow.
4. Attacker steals everything from models and data to the ML engineer’s GCP/Azure/AWS access keys through the H2O Flow web server.

A realistic attack path of an outside attacker exploiting this vulnerability in a security-hardened organization is as follows:

1. Attacker sends 500 phishing emails to employees with an industry-specific Word document attached. Embedded in the Word document is a malicious macro which gives the attacker access to the employee’s computer. 2% of recipients open the document.
2. Attacker logs into one of the victims’ computers. They collect other users’ passwords by dumping them in plaintext from memory via mimikatz or by running Responder to capture passwords and hashes traveling across the network.
3. Attacker either relays captured password hashes using ntlmrelayx or sprays them across the network using CrackMapExec’s stealthy modules to log in to other machines on the network and collect more passwords.
4. Attacker gains access to 3 more servers using the above method. One of these servers has access to the cloud network which stores the H2O server.
5. Attacker scans cloud network with nmap and discovers a web server running.
6. The attacker promptly downloads all the organization’s data, models, and private access keys through the vulnerable API calls shown above.

Further Research into Potential for Remote Code Execution

We have noticed a couple of patterns in the security research we’ve conducted on machine learning tools. One, overly permissive file system access is common. Two, sanitization during the loading of data and models is often overlooked. Both problems can lead to remote code execution, and we feel there is strong potential for H2O to contain such a vulnerability.

There are two likely paths to remote code execution. The first is abuse of one of the many file save functions to either install a backdoor or overwrite a file used for remote login, such as SSH keys. H2O has many such functions that write remote user input to disk. However, so far in our research there has usually been a constraint on the format of the output file. An example is the arbitrary file overwrite vulnerability above, which writes a CSV-formatted file rather than the raw user input that was parsed into the frame. Further exploration is needed to confirm or rule out whether any of these file saving functions can write raw data to disk remotely.

The second is a common issue in the machine learning world: the use of insecure model serialization formats. Python’s widely used pickle module, notable as the default format PyTorch uses for saving model weights, can execute arbitrary code when loading a file.
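To make the pickle risk concrete, here is a minimal and deliberately harmless sketch. The class returns a callable from __reduce__, and pickle invokes that callable while loading. The expression "6 * 7" stands in for whatever an attacker would run, and the names Payload, NoGlobalsUnpickler, and safe_loads are illustrative. The restricted unpickler follows the find_class override pattern described in the Python documentation.

```python
import io
import pickle


class Payload:
    """An object whose pickle tells the loader to call a function."""

    def __reduce__(self):
        # pickle stores "call eval with this argument"; the call happens
        # at load time. A harmless expression is used here on purpose.
        return (eval, ("6 * 7",))


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global (class or function)."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "global %s.%s is forbidden" % (module, name)
        )


def safe_loads(data):
    """Load a pickle that contains only plain built-in containers and values."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # the expression is evaluated during loading
```

Here pickle.loads(blob) returns the value of the evaluated expression rather than a Payload instance, while safe_loads(blob) raises pickle.UnpicklingError. Blocking all globals is very strict; real model files usually need an allowlist of specific classes instead.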

H2O doesn’t use pickle; it uses a few custom model formats: MOJO, POJO, and binary. If a model could be loaded with a command injection payload, uploaded to H2O, and used for inference, the result would be remote code execution.

MOJO (Model Object Optimized) models can be uploaded remotely to the server and used for inference, but this format is largely just model metadata along with the model weights, which makes it an unlikely injection target. The files included in a MOJO zip file can be seen below.
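Readers who want to inspect a MOJO archive themselves can list its entries with Python’s standard zipfile module. The sketch below is a generic zip listing, not an H2O API. The entry names in the in-memory demo archive are hypothetical placeholders and do not reproduce the contents of a real MOJO.

```python
import io
import zipfile


def list_archive(source):
    """Return (entry name, uncompressed size in bytes) for each zip entry.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


# Build a tiny stand-in archive in memory (placeholder entry names).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("model.ini", "[info]\n")
    zf.writestr("weights/part_0.bin", b"\x00" * 16)
demo.seek(0)

entries = list_archive(demo)
for name, size in entries:
    print(name, size)
```

To examine a real model, pass the path of the downloaded MOJO zip file to list_archive instead of the in-memory buffer.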

Binary models use an H2O-specific Java serialization format called Iced and can also be uploaded to the server and used for inference. This is the most likely candidate for code injection given the large number of Java deserialization attacks seen in the past. Although we were unable to inject into these models in the time we had to investigate, this area of H2O security research warrants a deeper dive. Code injection in the binary model would lead directly to remote code execution.

Last, POJO models are Plain Old Java Objects. One can embed a code execution payload in these models in a way that doesn’t affect the model’s ability to predict. Below is an example of forcing the model to run an arbitrary command before it prints its predictions:

While this is a useful attack when delivered as an innocuous-looking attachment in a phishing email to a machine learning engineer, POJO models can’t be uploaded to the H2O server, and converting one to MOJO format does not preserve the arbitrary code execution.

Join the MLSecOps Security Research Community

Machine learning tool developers are under pressure to deploy as fast as possible because of the speed at which the industry is evolving. That speed comes at the expense of investing time in secure development practices. Protect AI’s goal is to help secure the AI/ML field, present and future, by uncovering novel risks and promoting security that is baked in from the beginning. You can join us in this effort through our AI/ML bug bounty platform at huntr.mlsecops.com or by joining the huntr community on Discord.

AI/ML Hacking Resources

H2O Exposes Entire Filesystem