Over the past few months, Belly has worked on an in-house email delivery and optimization service that makes it a lot easier for us to run tests and quickly optimize campaign performance over time. We use a variety of real-time and batch processing technologies to make this happen, and I’d like to quickly dive into one of the more exciting things this new system enables.
Emails in our system are broken down into three high level delivery types: Transactional, Scheduled, and One-Off (marketing). Examples of transactional emails in our system include User Registration via iPad, Registration via Mobile, First Visit At Business, and Forgotten Password Requests. Scheduled emails are delivered in batches or via a State Machine and include things like New Merchant Day 3/15/28, Merchant Offline iPad reminders, Monthly Member Digests, and New User Drip Campaigns. One-off Marketing emails in our system have by far the highest variability in open rates, call to action click throughs, and unsubscribes and therefore have the most to gain from an intelligent optimization system.
All emails we send out are versioned with templates maintained in our database and use Shopify Liquid to support dynamic content blocks (including a Business’s logo, a list of rewards, nearby businesses, etc.), Zurb Ink to be responsive on mobile devices, and custom template markers to support user-defined variables. User-defined variables allow us to test multiple variations of the same email version simultaneously to test the effectiveness of call to action buttons, subject lines, banner images, copy, and completely different blocks of HTML entirely.
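As a rough illustration of how these pieces fit together (the [[name: value | value]] marker syntax here is invented for this post, not our exact production format), a versioned template mixing Liquid content blocks with user-defined variables might look like:

```liquid
{% comment %} Liquid powers dynamic content; [[...]] is a hypothetical user-defined variable marker {% endcomment %}
Subject: [[subject: "Your rewards are waiting" | "You have rewards nearby"]]

<img src="{{ business.logo_url }}" alt="{{ business.name }}" />
[[header: "<h1>Hi {{ user.first_name }}!</h1>" | "<h1>Your rewards update</h1>"]]

{% for reward in business.rewards %}
  <p>{{ reward.name }}: {{ reward.points }} points</p>
{% endfor %}

[[cta: "<a href='#'>See My Rewards</a>" | "<a href='#'>Visit Now</a>"]]
```

Three variables with two values each give 2x2x2 possible variations.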
The example template above would permutationally generate 2x2x2 = 8 variations on build and pick an enabled one at random on send. As the number of sends for a version grows over time, the system can automatically disable weak variations and optimize itself until one variation emerges as the most successful. Variations can also be disabled manually, either because a certain combination doesn't make sense (ex: Subject A and Header B contradict) and would cause confusion, or because discretion is used with the backing of statistical confidence. Any number of values for each variable can be used to test the effectiveness of different subject lines, copy, or buttons, so long as the total number of variations doesn't exceed 8 (tests would take too long otherwise).
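The build step can be sketched as a Cartesian product of the variable values. The function name and data shapes here are illustrative, not our production code:

```javascript
// Build every permutation of user-defined variable values.
// A template with 2 subjects x 2 headers x 2 buttons yields 8 variations.
function buildVariations(variables) {
  return Object.entries(variables).reduce(
    function (acc, entry) {
      var name = entry[0], values = entry[1];
      var next = [];
      acc.forEach(function (partial) {
        values.forEach(function (value) {
          var variation = Object.assign({}, partial);
          variation[name] = value;
          next.push(variation);
        });
      });
      return next;
    },
    [{}] // start from a single empty variation
  );
}

var variations = buildVariations({
  subject: ['Subject A', 'Subject B'],
  header:  ['Header A', 'Header B'],
  button:  ['Button A', 'Button B']
});
// variations.length === 8; each starts enabled and one is picked at random on send
```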
Sometimes the difference in performance (defined as open rate or click-through rate on a desired call to action) is so small that the number of sends required to reach statistical significance is very large, and the test is ended prematurely so a new test can run. The system can look at previous versions of a given type of email to recommend whether a rollback should be considered.
A variety of additional factors can also impact performance for scheduled and marketing emails, including easily testable things like time of day and day of week, and things that require a bit more finesse like perceived value, list targeting sensitivity, and sending domain strength.
Algorithm In Practice
When a new version of a type of email is made active, all variations are enabled by default. Any variations that are confusing (subject and call to action button conflict) can be manually disabled right away, but the system waits for data to be collected before making any statistical conclusions. Once enough data is collected to start drawing conclusions (1,000 sends per variation), we iterate through each variation and disable those with less than a minimum threshold odds of being higher than the best performing variation.
To calculate a variation’s probability of being higher than another variation, we find the standard error and use it to find the ZScore, which is the number of standard deviations between the control and test variation mean values.
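Concretely, treating open or click rates as proportions, this is a standard two-proportion Z-test. A sketch (the signature matches the zscore(m1, n1, m2, n2) call in the sample code; the implementation is ours for illustration):

```javascript
// Z-score between two variations' conversion rates (e.g. open rates).
// m = observed rate (mean of a 0/1 outcome), n = number of sends.
function zscore(m1, n1, m2, n2) {
  // Standard error of the difference between two proportions
  var standardError = Math.sqrt(
    (m1 * (1 - m1)) / n1 +
    (m2 * (1 - m2)) / n2
  );
  return (m1 - m2) / standardError;
}

// Example: a 20% open rate vs a 15% open rate over 1,000 sends each
var z = zscore(0.20, 1000, 0.15, 1000);
// roughly 2.95 standard deviations apart
```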
We use the ZScore in a probability lookup table to compare the “chance to be different” between the two variations. The minimum threshold odds are calculated as 40% / (number of enabled variations). Any variation with odds below this threshold gets disabled, ensuring that versions will eventually converge to an optimal one given a large enough sample size.
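The lookup table itself isn't shown here, so this sketch substitutes a standard normal CDF approximation (Abramowitz & Stegun erf approximation) and shows the threshold rule alongside it:

```javascript
// Approximate the standard normal CDF; stands in for the probability lookup table.
// Accurate to roughly 1e-7 (Abramowitz & Stegun 7.1.26 erf approximation).
function probability(z) {
  var sign = z < 0 ? -1 : 1;
  var x = Math.abs(z) / Math.SQRT2;
  var t = 1 / (1 + 0.3275911 * x);
  var poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  var erf = 1 - poly * Math.exp(-x * x);
  return 0.5 * (1 + sign * erf);
}

// A variation is disabled when its odds fall below 40% / (enabled variations).
function minimumThreshold(enabledVariations) {
  return 0.40 / enabledVariations;
}
// With 8 enabled variations, the floor is 0.40 / 8 = 0.05 (5%)
```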
The odds calculation isn't exact given the added complexity of non-mutually-exclusive ZScore calculations for multinomial distributions (as opposed to the binomial distributions behind standard A/B test baseline comparisons), since there are N simultaneous independent trials, but it works well for our purposes.
// Odds that each variation is the best of the enabled set.
// data is an array of { mean, samples } for each enabled variation;
// zscore() and probability() are the calculations described above.
function calculateOdds(data) {
  var scores = [];
  var total = 0;

  data.forEach(function (row, i) {
    var score = 1; // worst pairwise "chance to be better" seen so far
    var m1 = row.mean;
    var n1 = row.samples;

    data.forEach(function (comp, j) {
      if (i === j) return; // don't compare a variation against itself

      var m2 = comp.mean;
      var n2 = comp.samples;
      var temp = probability(zscore(m1, n1, m2, n2));
      if (temp < score) score = temp;
    });

    scores.push(score);
    total += score;
  });

  // Rescale scores to percentages of the total
  return scores.map(function (score) {
    return score / total;
  });
}
Sample code for the Odds Calculations is included in this JS fiddle Demo.
Areas for Improvement
The approach of treating each variation as completely independent doesn't take into account that each variation is made up of dependent factors shared between variations. Using logistic regression, we could isolate each variable (subject, banner, button) to determine its individual impact on the outcome rate, telling us when to disable all variations with a given value, or when a certain value has no significant impact on the desired result (for example, button color having no impact on email open rate).
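A rough sketch of that idea, using synthetic data and a hand-rolled gradient descent purely for illustration: encode each send's variable values as indicator features, fit logistic regression weights, and read each variable's coefficient as its individual impact on opens.

```javascript
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// Fit logistic regression weights by batch gradient descent.
// X: rows of [bias, subjectA, bannerA] indicators; y: 1 = opened, 0 = not.
function fitLogistic(X, y, steps, learningRate) {
  var w = X[0].map(function () { return 0; });
  for (var s = 0; s < steps; s++) {
    var grad = w.map(function () { return 0; });
    for (var i = 0; i < X.length; i++) {
      var p = sigmoid(X[i].reduce(function (acc, x, k) { return acc + x * w[k]; }, 0));
      for (var k = 0; k < w.length; k++) grad[k] += (p - y[i]) * X[i][k];
    }
    for (var k2 = 0; k2 < w.length; k2++) w[k2] -= learningRate * grad[k2] / X.length;
  }
  return w;
}

// Synthetic sends: subject A lifts opens from 30% to 50%; the banner does nothing.
var X = [], y = [];
[0, 1].forEach(function (subjectA) {
  [0, 1].forEach(function (bannerA) {
    var opens = subjectA ? 50 : 30;
    for (var i = 0; i < 100; i++) {
      X.push([1, subjectA, bannerA]);
      y.push(i < opens ? 1 : 0);
    }
  });
});

var w = fitLogistic(X, y, 5000, 0.5);
// w[1] (subject) comes out clearly positive while w[2] (banner) stays near zero,
// suggesting the subject line matters and banner variations can be collapsed.
```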
In its first few weeks of operation, our new system has already improved click-through and open rates by 15-40% for the emails we’ve migrated. We expect this to have an even more profound impact across the business as our remaining emails are sent and improved by this system. As successful variations become the baseline for future tests and our emails “evolve,” we will only continue to learn more from our data.