Optimizing App Predictions with AutoML Tables

December 16, 2019
Mikalai Tsytsarau, PhD, GCP Professional Data Engineer, DELVE

AutoML Tables automatically builds and deploys powerful machine learning models from tabular data containing feature vectors. It scales model complexity and topology with the input data size, applying regression to simpler datasets and more advanced models, such as ensembles and deep learning, to more complex ones. Under the hood, AutoML runs on Google Cloud’s infrastructure, which supports model training, deployment, and serving with low latency and high scalability regardless of workload volume, with pricing based only on the resources consumed. Moreover, a trained model can be deployed immediately for batch and online prediction using SQL-like or API queries.
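
To give a flavor of what this looks like in practice, here is a minimal sketch of online and batch prediction calls using the google-cloud-automl Python client’s Tables helper (automl_v1beta1.TablesClient); the project, region, model name, BigQuery URIs, and feature values are placeholders, and parameter names should be checked against the client version you use:

    from google.cloud import automl_v1beta1 as automl

    client = automl.TablesClient(project='my-gcp-project', region='us-central1')

    # Online prediction: score a single feature vector via the API.
    response = client.predict(
        model_display_name='payer_prediction_model',
        inputs={'country': 'US', 'language': 'en', 'device_price': 399.0},
    )
    for result in response.payload:
        print(result.tables.value, result.tables.score)  # predicted class and its score

    # Batch prediction: read rows from a BigQuery table and write results back.
    client.batch_predict(
        model_display_name='payer_prediction_model',
        bigquery_input_uri='bq://my-gcp-project.app_data.users_to_score',
        bigquery_output_uri='bq://my-gcp-project',
    ).result()  # blocks until the batch job completes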

AutoML reduces the time needed to build usable machine learning models from months to days (sometimes less); speeding up the development stage saves money and minimizes time-to-market.

The lifecycle of machine learning in AutoML consists of three major steps: connecting data and selecting features, training and evaluating a model, and finally, deploying the model for serving predictions, as schematically shown in Figure 1. Each step has a dedicated configuration page that presents the available options and parameters. For common data types, however, it’s entirely possible to keep the default settings and let the system do the bulk of the configuration work.
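
The same lifecycle can also be driven programmatically. Below is a rough sketch of the three steps with the Tables client; the dataset name, BigQuery source table, and training budget are hypothetical, and each call accepts many more options than shown here:

    from google.cloud import automl_v1beta1 as automl

    client = automl.TablesClient(project='my-gcp-project', region='us-central1')

    # Step 1: connect data -- create a dataset and import rows (here from BigQuery).
    dataset = client.create_dataset(dataset_display_name='app_users')
    client.import_data(
        dataset=dataset,
        bigquery_input_uri='bq://my-gcp-project.app_data.training_users',
    ).result()  # wait for the import to finish

    # Select the column to predict (schema details are discussed in the next section).
    client.set_target_column(dataset=dataset, column_spec_display_name='payer_status')

    # Step 2: train and evaluate -- the budget is expressed in milli node hours.
    model = client.create_model(
        'payer_prediction_model',
        dataset=dataset,
        train_budget_milli_node_hours=1000,  # 1 node hour
    ).result()

    # Step 3: deploy the trained model so it can serve online predictions.
    client.deploy_model(model=model).result()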

In the following sections, we will walk through configuring, training, and evaluating a model on simple native-App features, and share some important details that can help improve both the process and the end result for your App.

Figure 1: AutoML lifecycle (source: https://cloud.google.com/automl-tables/)

Training a Model

The AutoML visual interface is conveniently structured into several tabs that guide you through the major machine learning lifecycle steps. AutoML provides detailed feedback on estimated parameters and selected options throughout each step of the configuration, enabling you to troubleshoot as necessary. The whole process of feature selection, however, can be automatic, with AutoML recognizing a wide range of data types out of the box (e.g., numbers, categorical features, strings, timestamps, and structures). All this helps avoid routine data preparation and model configuration problems and lets analysts focus on feature engineering. An example of features from our sample user dataset, as recognized by AutoML, is shown in Figure 2:

Figure 2: An example of feature recognition in AutoML

As can be seen in the above figure, we selected a categorical column named ‘payer_status’ as the prediction target, aiming to predict which users will make an in-App purchase. This feature takes one of two possible values, ‘payer’ and ‘non_payer’, making it a classic binary classification problem. Picking other types of prediction targets is also possible, as is excluding unnecessary columns, such as the user ‘ID’. The figure also shows that, upon evaluating the training data, the system automatically determines whether any of the features can be empty. At this point, we can override the system’s choice if we know the properties of our data better or expect other values in the future. We can then proceed to create a training dataset.
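
These schema-level choices can also be made through the client, reusing the client and dataset objects from the sketch above; ‘device_brand’ is a hypothetical column used only to illustrate the nullability override, and the column-exclusion parameter name should be verified against your client version:

    # Make 'payer_status' the prediction target for this dataset.
    client.set_target_column(dataset=dataset, column_spec_display_name='payer_status')

    # Override AutoML's nullability guess when we know a column may be empty.
    client.update_column_spec(
        dataset=dataset,
        column_spec_display_name='device_brand',  # hypothetical column
        nullable=True,
    )

    # Exclude columns that should not be used as features, such as the user 'ID'.
    model = client.create_model(
        'payer_prediction_model',
        dataset=dataset,
        train_budget_milli_node_hours=1000,
        exclude_column_spec_names=['ID'],
    ).result()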

An AutoML training dataset must meet a few minimum and maximum requirements to be useful for machine learning; the major ones relate to the size of the data:

– A dataset must have 1,000 to 100,000,000 rows
– The dataset schema must contain between 1 and 1,000 features
– At least 50 rows (instances) should be present for each class
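
Before uploading, a quick local check against these limits can save an import round-trip. Here is a minimal sketch with pandas, assuming the training data sits in a local CSV with a ‘payer_status’ target column:

    import pandas as pd

    df = pd.read_csv('training_users.csv')  # hypothetical export of the training table

    n_rows = df.shape[0]
    n_features = df.shape[1] - 1  # every column except the target

    assert 1_000 <= n_rows <= 100_000_000, f'row count {n_rows} is outside AutoML limits'
    assert 1 <= n_features <= 1_000, f'feature count {n_features} is outside AutoML limits'

    # Each class needs at least 50 rows (instances).
    class_counts = df['payer_status'].value_counts()
    print(class_counts)
    assert (class_counts >= 50).all(), 'some classes have fewer than 50 rows'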

Usually, 10,000 to 100,000 rows of data are sufficient to start learning and obtain meaningful results for prototyping. However, the more unique training data you push through the production model, the better it will perform, since AutoML adapts the model’s architecture and parameters to the data. If your problem is smaller in scope than the minimum requirements, duplicating training data to meet them is possible but not recommended, since the model won’t reach its potential performance. Below is a list of a few tips that can improve training time and model performance:

Tips for Improved Training

We believe that AutoML facilitates model training and engineering, but it does not eliminate the need for feature engineering, which is always beneficial for training, no matter how big or small the problem is.

Tips for Evaluating a Model

After setting up the dataset and creating your model, AutoML provides detailed feedback on the model’s performance. Let’s take a look at example model statistics for our test data, shown in Figure 3:

Figure 3: AutoML model evaluation details (simplified)

In the above figure, the main parameter to look at is called ‘Accuracy’. This is the percentage of evaluated samples for which the model predicted the correct class. However, this statistic doesn’t take the class distribution into account and can therefore be misleading if we have unbalanced data containing more samples of one class than another. In such a case, even a ‘dumb’ predictor that assigns all records to the same (prevailing) class can appear to yield good results. Therefore, we should mainly look at more balanced evaluation metrics, such as the two described below (and illustrated with a short example after their descriptions):

AUC measures the area under the ROC curve, which plots the true positive rate against the false positive rate, as shown in Figure 4. The curve shows how many ‘true’ positive results can be predicted if we allow a certain level of ‘false’ positive errors. For instance, if our model reaches a 100% true positive rate with 0% errors, then AUC equals 1.0 and we have a perfectly accurate model. Note that this measure characterizes the model as a whole and does not depend on the particular operating point we pick (represented by the blue dot in the image). We will get back to this parameter later on.

F1 is another measure for evaluating prediction performance; it is the harmonic mean of the model’s Precision and Recall. Since there is a trade-off between the proportion of positive predictions that are correct (Precision) and the proportion of actual positive samples that are detected (Recall), this combined measure also reflects the model’s discriminatory performance in general, regardless of the relative proportion of class samples.
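
To make the accuracy pitfall and these two metrics concrete, here is a small scikit-learn sketch on hypothetical imbalanced data, comparing a ‘dumb’ majority-class predictor with a simulated score-based model:

    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    # Hypothetical imbalanced labels: 5% payers (1), 95% non-payers (0).
    rng = np.random.default_rng(0)
    y_true = np.array([1] * 50 + [0] * 950)

    # A 'dumb' predictor labels everyone 'non_payer': high accuracy, zero recall.
    y_dumb = np.zeros_like(y_true)
    print(accuracy_score(y_true, y_dumb))  # 0.95 -- looks impressive
    print(recall_score(y_true, y_dumb))    # 0.0  -- it never finds a payer

    # A simulated model outputs payer probabilities; AUC ignores the threshold.
    y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.15, size=y_true.size), 0, 1)
    print(roc_auc_score(y_true, y_score))

    # F1 is the harmonic mean of Precision and Recall at a chosen threshold.
    y_pred = (y_score >= 0.5).astype(int)
    print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
          f1_score(y_true, y_pred))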

Figure 4: AutoML model evaluation ROC curve

Figure 5: Class confusion matrix

Finally, AutoML outputs a class confusion matrix, shown in Figure 5. This matrix indicates how many samples of one class were predicted as another and vice versa, depending on your selected decision boundary. By default, the model picks a boundary that maximizes our target metric, which is F1 in this example. However, there are situations where you’ll want to override this choice, for example when false positives and false negatives carry different business costs, or when you need to favor Precision over Recall (or vice versa).
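
To see how moving the decision boundary reshapes the matrix before committing to an override, you can recompute it locally from the model’s predicted scores; here is a brief scikit-learn sketch on hypothetical data:

    from sklearn.metrics import confusion_matrix

    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                  # 1 = payer, 0 = non_payer
    y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]  # model's payer probability

    for threshold in (0.3, 0.5, 0.7):
        y_pred = [1 if s >= threshold else 0 for s in y_score]
        print('threshold =', threshold)
        print(confusion_matrix(y_true, y_pred))  # rows: actual class, columns: predicted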

Tips for Feature Engineering

Feature engineering has many benefits even in the context of automated learning. First, by applying domain knowledge, we can engineer more meaningful features than would be possible with automated methods alone. Second, engineered features have considerably smaller data sizes, reducing training time and costs. Finally, such features can be reused for regular analytics, such as statistical reporting or user segmentation. However, feature engineering has its own challenges, mainly associated with the time and cost of establishing data pipelines and feature processing.

When designing a model for user predictions, it’s important to test various sets of features and note their performance. Sometimes it is possible to trade off some of the more valuable, but variable, features for those that depend less on user demographics and more on user engagement (e.g., country and language for usage statistics). This way we can design a model that is less dependent on user acquisition sources and train it on readily available, static data. Overall, we recommend testing a variety of features, even ones that may not seem obviously useful.
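
As a concrete illustration of an engagement-based feature pack, the sketch below aggregates a hypothetical raw event log into per-user features with pandas; the table and column names (‘events.csv’, ‘user_id’, ‘event_name’, ‘event_time’) are made up for this example:

    import pandas as pd

    # Hypothetical raw event log with one row per in-App event.
    events = pd.read_csv('events.csv', parse_dates=['event_time'])

    features = events.groupby('user_id').agg(
        sessions=('event_name', lambda s: (s == 'session_start').sum()),
        distinct_event_types=('event_name', 'nunique'),
        active_days=('event_time', lambda t: t.dt.date.nunique()),
        days_since_last_event=('event_time',
                               lambda t: (pd.Timestamp.now() - t.max()).days),
    ).reset_index()

    # Join with static attributes (country, language, device) and export for AutoML.
    features.to_csv('user_features.csv', index=False)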

Feature importance provides valuable feedback on how the data impacts the resulting predictions. For instance, it can show which features were more or less useful for discriminating target values, as illustrated in Figure 6. In this case, we found that country, language, and device price had a bigger impact on the prediction than a device’s brand or its screen size. In another scenario, when adding App statuses and event features to the mix, we found that they, in turn, became more important than user demographic and geographic features.

Figure 6: Feature importance
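
The same global importances shown in the UI can also be read from the trained model’s metadata, which is handy for logging them across experiments. This is a sketch under the assumption that the v1beta1 Tables model metadata exposes per-column importance; field names may differ between client versions:

    # Assumes the 'client' from the earlier sketches and an already trained model.
    model = client.get_model(model_display_name='payer_prediction_model')

    columns = model.tables_model_metadata.tables_model_column_info
    for column in sorted(columns, key=lambda c: c.feature_importance, reverse=True):
        print(column.column_display_name, round(column.feature_importance, 3))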

Tips for pLTV predictions


Ready to take your ads, and your business, to the next level? Get in touch with the DELVE team today.