H2O comes with many features. This second part of the H2O in practice series proposes a protocol to combine AutoML modeling with a traditional modeling and optimization approach. The goal is to define a workflow that can be applied to new use cases to gain efficiency and shorten delivery time.
In collaboration with EDF Lab and ENEDIS, our overall goal is to assess the difficulty of working with the H2O platform, understand how it works, and find its strengths and weaknesses in the context of a real-life project.
In the first article, Real Experience with H2O, the challenge was to build a model using AutoML and compare it to a reference model built using a traditional approach. For the second challenge, we were given a ready-to-use database from another business problem (still related to preventive maintenance) and five days to produce the best model with H2O. To prepare for this, we developed an operational protocol, which we present in this article. It helped us produce a baseline model comparable to the existing one in just two days.
This protocol provides guidance on how to combine AutoML modeling with custom model algorithms to improve performance. The duration of the process is analyzed for two examples to give a global picture of what to expect.
Environment, datasets and project descriptions
Project 1: identification of sections of low-voltage underground cables in need of replacement. The train and test sets each had over 1 million rows and 434 columns. In this use case we used Sparkling Water, which combines H2O and Spark and distributes the load over the Spark cluster.
- H2O_cluster_version: 184.108.40.206
- H2O_cluster_total_nodes: 10
- H2O_cluster_free_memory: 88.9 GB
- H2O_cluster_allowed_cores: 50
- H2O_API_Extensions: XGBoost, Algos, Amazon S3, Sparkling Water REST API Extensions, AutoML, Core V3, TargetEncoder, Core V4
- Python_version: 3.6.10 final
Project 2: preventive maintenance, confidential. The train and test sets had about 85,000 rows each; 605 columns were used for training. In this case we used a non-distributed version of H2O running on a single node.
- H2O_cluster_version: 220.127.116.11
- H2O_cluster_total_nodes: 1
- H2O_cluster_free_memory: 21.24 GB
- H2O_cluster_allowed_cores: 144
- H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
- Python_version: 3.5.2 final
In both cases, the task was to classify highly imbalanced data (Class 1 < 0.5%). AUCPR was used as the optimization metric during training. For the final evaluation of the models, two business metrics were calculated, both representing the number of failures on feeders of two different cumulative lengths. 5-fold cross-validation was used to validate the models. In both projects, the challenge was to compare the best H2O model to an internal model that had already been optimized.
Modeling protocol with H2O
We combined AutoML features with custom modeling algorithms to get the best of both worlds. After trying different approaches, we found the proposed protocol to be the shortest and simplest.
- Depending on the available time:
- If you have enough time, run AutoML to build many models (30+) with no time limit to get the longest and most accurate training (`max_models`).
- If you are short on time, or only looking for an approximate result to gauge baseline performance, set a maximum training time (`max_runtime_secs`).
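As a sketch, the two launch modes might look as follows in the H2O Python API. The file path, frame name, target column, and time budget are placeholder assumptions, not the project's actual values:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical training frame with a binary target column
train = h2o.import_file("train.csv")
train["target"] = train["target"].asfactor()
x = [c for c in train.columns if c != "target"]

# Enough time: let AutoML build many models with no time limit
aml_full = H2OAutoML(max_models=30, max_runtime_secs=0,
                     sort_metric="AUCPR", stopping_metric="AUCPR",
                     nfolds=5, seed=42)

# Short on time: cap the total training time instead (here 10 minutes)
aml_quick = H2OAutoML(max_runtime_secs=600,
                      sort_metric="AUCPR", nfolds=5, seed=42)

aml_quick.train(x=x, y="target", training_frame=train)
print(aml_quick.leaderboard.head())
```

Sorting the leaderboard by AUCPR matches the optimization metric used in both projects.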
For each model, calculate a business metric (and/or additional metrics) and compile them into a custom leaderboard.
Combine the AutoML leaderboard and the custom leaderboard and check:
- Which family of models scores the highest?
- How are business and statistical indicators related?
- Are the models performing so poorly that we don’t want to spend extra time optimizing them?
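Once the business metric has been computed for each model, merging the two leaderboards can be as simple as a join on the model id. The model names and numbers below are invented for illustration; in practice the AutoML leaderboard would come from `aml.leaderboard.as_data_frame()`:

```python
import pandas as pd

# AutoML leaderboard (statistical metric); values are made up
auto_lb = pd.DataFrame({
    "model_id": ["XGBoost_1", "GBM_2", "DeepLearning_3"],
    "aucpr":    [0.081, 0.074, 0.035],
})

# Business metric computed separately for each model, e.g. the number
# of failures detected on feeders of a given cumulative length
biz_lb = pd.DataFrame({
    "model_id":          ["XGBoost_1", "GBM_2", "DeepLearning_3"],
    "failures_detected": [42, 38, 11],
})

# Combined custom leaderboard, ranked by the business metric
combined = (auto_lb.merge(biz_lb, on="model_id")
                   .sort_values("failures_detected", ascending=False))
print(combined)
```

A side-by-side view like this makes the questions above (which family wins, how the metrics correlate) easy to answer at a glance.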
Due to project confidentiality, we describe the results only to illustrate the example. The XGBoost family performed much better than any other algorithm; deep learning performed the worst, unsurprisingly for tabular data. For example, the best XGBoost model detected 3-4 times more incidents than the best deep learning model, depending on the business metric. The second most efficient family was GBM, which at best detected about 90% of the XGBoost cases. In both projects, the models with the highest AUCPR also had the highest business performance, but overall the correlation was not strong.
Run an AutoML grid search on the most successful families of algorithms (`include_algos`) to test multiple models. Select the models you want to optimize (models of interest) and save them.
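A minimal sketch of this step, assuming an H2O frame `train` with a binary `target` column and predictor list `x` (hypothetical names), and restricting AutoML to the winning families as in Project 1:

```python
import h2o
from h2o.automl import H2OAutoML

# Re-run AutoML restricted to the best-performing families
aml_best = H2OAutoML(max_models=50, include_algos=["XGBoost", "GBM"],
                     sort_metric="AUCPR", nfolds=5, seed=42)
aml_best.train(x=x, y="target", training_frame=train)

# Save a model of interest for later optimization (path is illustrative)
best_xgb = aml_best.leader
h2o.save_model(model=best_xgb, path="./models", force=True)
```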
Print the actual parameters of the models of interest (`actual_params`).
Use these parameters as a basis for manually defining the models. Set the grid search hyperparameters (`H2OGridSearch`). If you want to test many combinations or are short on time, use the random grid search strategy. Otherwise, build models from all combinations of hyperparameters (Cartesian strategy).
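A hedged sketch of such a grid search, again assuming the hypothetical `train` frame, `target` column, and predictor list `x`; the hyperparameter ranges are placeholders, not the project's actual grid:

```python
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.xgboost import H2OXGBoostEstimator

# Ranges centered on the values AutoML found (actual_params); illustrative only
hyper_params = {
    "max_depth":   [6, 8, 10],
    "learn_rate":  [0.05, 0.1],
    "sample_rate": [0.8, 1.0],
}

# RandomDiscrete caps the number of combinations tried;
# use "Cartesian" to exhaustively test every combination instead
search_criteria = {"strategy": "RandomDiscrete", "max_models": 20, "seed": 42}

grid = H2OGridSearch(
    model=H2OXGBoostEstimator(nfolds=5, stopping_metric="AUCPR", seed=42),
    hyper_params=hyper_params,
    search_criteria=search_criteria,
)
grid.train(x=x, y="target", training_frame=train)

# Rank the grid models by AUCPR before computing business metrics on them
print(grid.get_grid(sort_by="aucpr", decreasing=True))
```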
Compute additional metrics on the grid search models and optionally inspect variable importance. Models with similar scores may rely on different variables, and some of those variables may matter more in the business context than the score alone.
Choose the best model(s) from each family and save them.
Build stacked ensembles and recalculate additional metrics if desired.
How long does it take?
We used two datasets of different sizes (1 million rows × 434 columns and 85,000 rows × 605 columns). Because they were processed in two different environments, we cannot directly compare processing times. Still, we would like to give you an idea of what to expect, with some rough estimates of the duration of each phase.
The work process in Project 1:
- build 40 AutoML models → ~ 8.5 h
- extract the best model parameters for each family and set the optimization hyperparameters (XGB and GBM in this case)
- random search:
- 56 XGB models → ~ 8 hours (+ business metrics → ~ 4.5 hours)
- 6 GBM models → ~ 1 hour (+ business metrics → ~ 0.5 hours)
- keep the best models
Given that the dataset was not very large (< 200 MB), we were surprised that AutoML took so long to complete 40 models. Perhaps the cluster resources were not well tuned for the job. Our preferred solution was to launch the long computations at the end of the day and collect the results in the morning.
The work process in Project 2:
- run AutoML for 10 minutes (fixed time; + business metrics ~ 15 minutes)
- build 30 GBM models with AutoML → ~ 0.5 h (+ business metrics → ~ 1 h)
- extract the parameters of the best model and define the grid search hyperparameters
- random search: 72 GBM models → ~ 1 hour (+ business metrics → ~ 2 hours)
- keep the best model
The second dataset was quite small, and we managed to complete all the steps in one working day.
How good were the models?
The comparison was based solely on the business metric values. We did not perform any formal statistical tests on the results, so the comparison should be read as an observation rather than a verdict. In both cases, we managed to produce models with roughly the same performance as the references, and usually we had several candidates with such scores. Most importantly, we achieved this in only a fraction of the time required for the reference model.
We demonstrated on two real-life problems how to reduce the time to build a good baseline model using automatic machine learning with H2O. After we spent time understanding the advantages and limitations of the platform, we were able to build a model comparable to the reference model in just a few days. This is a significant speedup compared to the traditional approach. In addition, a user-friendly API reduces coding time and makes code easier to maintain.
Collaborators who contributed to this work are: