August 27, 2020

Using Machine Learning in Google BigQuery for Marketing Analytics (part 2)


BigQuery’s Machine Learning (BQML) capabilities allows for the development of models with ease using Standard SQL. BQML can carry out the training, testing and predicting of the model without the need of data transfer to another place. This allows marketers to uncover and predict what will work in their future digital marketing campaigns based upon prior data. With this invaluable insight, BQML allows companies to increase the effectiveness of their marketing campaigns by a staggering 79% and tailor these campaigns to their relevant goals. This article will demonstrate the tremendous capabilities of BigQuery’s Machine Learning to marketers and why it is a must-use for every digital marketer today. 


In this article, we will demonstrate 4 use-cases of BigQuery’s Machine Learning in the analysis of 2019 and 2020 Google Ads Data for a specific company:

  1. Using Standard SQL in BigQuery ML to develop Machine Learning Models easily and make more informed decisions on allocation of your marketing budget
  1. Using a logistic regression model to predict which variables (Day Of The Week, Device, Location, etc.) have the highest conversion rates 
  1. Building a cluster model to predict which Bidding Strategy Type has the lowest cost per conversion and lowest cost per click
  1.  Building a clustering model to compare 2020 Google Ads performance to 2019 by grouping Ads into ‘high’ and ‘low’ performance groups based on key metrics 

Use Case #1: Using Standard SQL in BigQuery ML to develop Machine Learning Models and easily gain vital insights 

When exploring whether or not a conversion takes place on a company’s digital platform, it is interesting to see if this is influenced by clickstream data. Are advertisements with greater interactions more likely to lead to a conversion? Does the average position of the advertisement influence whether or not a conversion takes place? Answers to these questions can be pivotal in informing future marketing decisions taken by a company. For instance, if clicks are a strong predictor of conversions, then it would be advisable to target AdWords that have a high click through rate. 

In order to investigate these questions, we shall use standard SQL in BigQuery to easily develop a machine learning logistic regression model to determine whether clicks and impressions are good predictors of whether a conversion takes place or not. 

To do this, first we have to divide the data into training data (which used to train the model) and testing data (which is used to evaluate the model). Since we have more than 50,0000 rows of Ad data, BigQuery will automatically split into 10,000 rows of testing data and the rest will be training data. 

Figure 1: ROC Curve Generated by Model for Training Data

Figure 1 from BigQuery illustrates the ROC curve generated by the model for the training data. As can be seen, the Area Under the Curve (AUC) is 0.9133. This helps to indicate the models certainty and performance as the AUC is near to 1.

Now we will evaluate this model using the testing data in order to see whether it has a similar AUC and F1 score. 

Summary of Model Evaluation using Test Data

In the table, we can see a summary of performance metrics for the model when imposed onto the test data. This suggests that the model is quite robust as in this table we can see the AUC staying quite high at 0.923, while the f1 score is quite close to 0 at 0.0157. 

Using this model we can predict the likelihood of whether or not an advertisement will lead to a conversion based solely on the number of clicks and impressions it has. To demonstrate this, I have used 10 random pieces of our data to predict whether or not they convert based on the model as well as the actual outcome. 

Predictions by the Model

The table above demonstrates the capabilities of the machine learning model in its prediction of whether or not an ad will convert. It is also important to note that models are not perfect. While they give us insight on which ads are likely to convert or not, they may not predict this correctly for every single piece of data. An example of this can be seen in rows 3 and 4, where the model predicts that an ad will not convert, but in reality they did end up converting. 

Finally, we can also take a look at the model weights to see whether or not clicks and impressions are strong predictors of a conversion.

Model Weights for Clicks and Impressions

The model weights shown in the table for the clicks and impressions suggest a very weak relationship between the two variables and conversions. This is because they are very close to 0. It may come as no surprise that they are both positively correlated with conversions since the more we see or click an advertisement, the more likely we are to convert. However, the weak model weights suggest that clicks and impressions are poor predictors of a conversion.

Key Insights 

Overall, we’ve developed a model to investigate whether clicks and impressions are good predictors of a conversion. This model seems to excel in terms of its performance metrics on both the tested and trained data. This suggests that it is a good representation of the real relationship between clicks, impressions and conversions. The model weights suggest that clicks and impressions are positively correlated with conversions. However, this positive correlation appears to be very weak. This implies that it may be a waste of time and money for the company’s future marketing campaigns to try and maximize clicks and impressions if their overall goal is to maximize conversions. Other factors such as location, devices and day of week may be a better focus of the company’s marketing efforts in order to achieve this goal; We will investigate these next.  Therefore, BigQuery’s ability to develop machine learning models with ease using SQL allows for invaluable insights on where this company should allocate its marketing budget in order to maximize its conversions

Use Case #2: Using a logistic regression model to predict which variables (Day Of The Week, Device, Location, etc.) have the highest conversion rates 

There are many different variables that can impact whether or not a conversion takes place. For the sake of simplicity, we will analyze three of these variables and try to determine which of these are optimal for maximizing conversions as well as the relative impacts each variable has on whether a conversion will be made. 

We shall develop a logistic regression model using BigQuery’s machine learning capabilities. This model will analyze the variables of location, device and day of the week in order to see whether these factors can be used to predict whether a conversion takes place. Another important thing to note is that we have only included countries in our model where at least 1 conversion has taken place.

Summary of ML Model Weights 

From the summary of each variable’s weights in our logistic regression model, we can see that the strongest positive correlations occur when the location of the advertisement is in China or India. In addition to this, there are strong positive correlations present when the device is a desktop and the day of the week is either Saturday or Sunday. This suggests that if the company wants to maximize its conversions it should target its campaigns to desktop users and drop the campaign on or just before the weekend. The strongest negative correlations were the category of Tablet and Singapore, which suggests potential markets that the company can improve on in its future ad campaigns.

Key Insights

Overall the Google Ads data for the 2019 and 2020 campaigns provides some valuable insights for future marketing plans. First, we can see that there needs to be an increased focus on improving the mobile/tablet version of the website in order to boost average conversions. In addition to this, our investigation has revealed the importance of the day of the week as a predictor of whether or not a conversion takes place. Finally, this analysis reveals locations such as Singapore, where the company may need to change its current marketing strategy in order to boost conversions. All in all, this helps to reinforce which areas of the marketing are working and which need improvement for future campaigns.  

Use Case #3: Building a cluster model to predicting which Bidding Strategy Type has the lowest cost per conversion and lowest cost per click

We can analyze which bidding strategy type has the lowest cost per conversion and lowest cost per click very quickly and easily using BigQuery’s Machine Learning by building a cluster model of the data.

The idea of implementing cluster analysis using BQML (BigQuery Machine Learning) is to see whether the raw data can be grouped into 3 different groups based upon its cost per click and cost per conversion statistics. We can then analyze the groups that minimize these statistics to try and see if they are made up of a specific bidding strategy type. We plan to group the data into 3 different groups in order for the ML to run smoothly. This is because we need to ignore the groups (Maximize Clicks and Null) that have 0 conversions since dividing by 0 causes problems on BigQuery ML. 

Figure 2: Metrics and Model devised for Clustering Bidding Strategy Model

From our data pre-processing and clustering model as represented in Figure 2, it can be seen that 15 pieces of data in our dataset fit the group that minimizes both cost per click and cost per conversion. In addition to this, the figure presents a relatively low Davies-Bouldin Index (DBI). The lower this metric is, the greater the performance of the model. If we look at the advertisements assigned to group 1 as well as their bidding strategies, we may get a better idea on the best bidding strategy method. 

Breakdown of Cluster Group 1 

This table shows that the group that minimizes both cost per click and cost per conversion is largely made up of Ads with a Bidding Strategy of “Target CPA”. This suggests that this may be the best bidding strategy to minimize both these metrics. 

Key Insights

Despite the ‘optimal’ bidding strategy varying depending on business goals, we have attempted to narrow down the best strategy by measuring overall performance using two key metrics: Cost Per Click and Cost Per Conversion. Through comparison of all bidding strategies using the key metrics through a machine learning cluster model, we can quickly and easily conclude that the best Bidding Strategy is the “Target CPA”. Without BigQuery and its valuable capabilities, we would not be able to conclude which bidding strategy we should use for future campaigns.   

Use Case #4: Comparing 2020 Google Ads performance to 2019 by building a clustering model that groups Ads into ‘high’ performance and ‘low’ performance based on key metrics 

In order to compare the performance of the Google Ad campaigns from 2019 and 2020, we can develop a cluster model using BQML, where we split the data into two clusters based upon three key metrics: Clicks, Conversions and Impressions. Ideally, we will look at the group that maximizes these metrics and classify them as the ‘high’ performance group. 

Figure 3: Cluster Model By Year

From Figure 3, we can see that our model has divided the data into a smaller cluster with high clicks, conversions and impressions and a larger cluster that is lower in these metrics. Now we shall look at the breakdown of each cluster with particular focus on cluster 1 to see if there is an overwhelming majority that comes from a particular year.

Cluster Model Breakdown by Year

At first glance it can be observed that there are more Ad Groups from 2020 in the desired group (1) than those from 2019. However, this is an unfair comparison as it fails to account for the fact that there are greater total Ad Groups in 2020 than in 2019. A better comparison can be done by looking at the percentage of each year’s Ad Group that is assigned the good performance group. Using this method we can see that 2.23% of Ad Groups in 2020 are in the desired group, which is slightly smaller than the 3.31% of Ad Groups in 2019. Given the small percentages of each group, it may be argued that the year of the campaigns may have not played a pivotal role in Ad performance, but rather that the performance may be attributed to other factors such as keywords. Furthermore, even though our earlier analysis suggested the success of the 2020 campaigns in generating conversions at low costs, this model suggests that there were few Ad Groups in 2020 that carried the entire performance of the year since only 2.23% of all Ad Groups in 2020 are in the top performance cluster. 

Key Insights

Overall, our machine learning model that has clustered Ad Group data into high performance and low performance metrics has suggested that only a small portion of each year's Advertisements were considered to be ‘high’ performance. Therefore, BigQuery provides the marketer a key insight that they should focus on only a select few of its campaigns from each year to achieve its relevant goals.

Data Theory Major at UCLA I am an inquisitive individual who constantly strives for improvement and to increase my depth of knowledge. I am a strong believer of hard-work and one of my favourite quotes is "The only place where success comes before work is in the dictionary". My primary philosophy is to continue to work hard and grow to become a better person in all aspects of my life.

Our latest articles.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolores.