Efficient Feature Engineering of Event Data in BigQuery

December 13, 2019

Mikalai Tsytsarau, PhD GCP Professional Data Engineer

Most App developers are familiar with Firebase – a powerful platform and Software Development Kit backed by Google, which provides a secure database, analytics, messaging, and everything else needed for app development, all in one seamless package.

Firebase provides convenient user profile and event logging functionality, which facilitates app analytics, user attribution, and churn analysis. A distinct feature of Firebase is its ability to log an unlimited number of events comprised of 500 user-defined kinds, and it also provides free analytics.

Firebase also generates predictions and organizes user segmentation based on an event stream (event occurrences): all with full integration.

However, when we base predictions only on event occurrences, it’s possible to miss the context of each event or user. Ideally, we need to analyze both event occurrences and aggregated (‘meta’) event data (i.e., engineer new features).

Figure 1 shows the differences between Event Analytics and Feature Analytics:

In-app events arrive in an ordered sequence, analyzed for churn or causality patterns (e.g., using funnels for Event Analytics)

Next, we analyze events of the same kind as a collection with statistical methods (e.g., revealing the distribution of their values for Feature Analytics)

Let’s discuss the benefits feature engineering can bring to the table:

Benefits of Feature Engineering

However, there are also some challenges with feature engineering:

Challenges of Feature Engineering

Additionally, the nested storage format of event and user parameters in Firebase makes it challenging to construct efficient aggregation queries in any database and makes data extraction even more complicated. Luckily, Google’s BigQuery can speed-up feature engineering and analytics, which makes the overall process much more manageable.

Understanding BigQuery

BigQuery is Google’s enterprise analytical data warehouse, which runs blazing-fast SQL queries on gigabytes and petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it is processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, and as long as analysts de-normalize and unnest repeated Firebase data. 

Understanding Firebase Data

After setting up a daily export to BigQuery, Firebase creates a new table in specified BigQuery dataset, or in separate ones, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format. This provides user data, such as demographic and campaign data, in a nested structure named user_dim and event data in a structure named event_dim. 

The user_dim-event_dim tuples display as repeated rows in small batches per user, which we can easily unroll. However, Firebase also supports custom app events, which can be associated with up to 25 parameters as key-value pairs. Unpacking and analyzing the parameters is a desired target for app analysts, but it’s also a complicated task.

Let’s take a look at the example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows. Additionally, all rows in a batch are associated with a contemporary user profile that has a set of its own parameter key-value records (not shown). This requires writing rather inefficient SQL constructs when you need to unroll and aggregate specific event parameters with the UNNEST function (see Figure 3).

Efficient Feature Engineering: Step-By-Step

First, we unpack all user and event properties from repeated rows to serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, to focus on events and profile aggregation.

Then, we extract and store all events and associated profiles from batch records on the same denormalized row structure, which contains a JSON field that stores each property. This row structure creates query event data for users and analyzed features on-demand. It also streams user events and constructs features continually.

BigQuery’s UDF forms the basis for the corresponding transform query. This allows us to define the JavaScript function that transforms repeated key-value rows into JSON. Figure 4 shows this query:

After this transformation, we define aggregation queries that construct features for every user in the database and their relevant events – all in one single pass.

Since the source table stores the complete set of data, the massive parallelism of BigQuery is, in this case, extremely efficient. For example, we can execute complex queries, (shown in Figure 5), by creating complete user and event features, readied for Data Science or other types of advanced analytics:

Stay tuned- in a future blog, we’ll focus on how to extract maximum use from collected features and learn how to apply machine learning that reveals prediction power and other vital insights. 

Are you looking for more information on the Google Cloud Platform and Google Big Query? Check out our blog post here, and learn how we leverage the power of Google Cloud Platform to elevate your digital marketing strategy.

Ready to take your ads, and your business, to the next level? Get in touch with the DELVE team today.