Building a Serverless Analytics App to Capture and Query Clickstream Data

May 17, 2019

,
Register for
Index Conference

Hear talks on search and AI from engineers at Netflix, DoorDash, Uber and more.

The best way to answer questions about user behavior is often to gather data. A common pattern is to track user clicks throughout a product, then perform analytical queries on the resulting data, getting a holistic understanding of user behavior.

In my case, I was curious to get a pulse of developer preferences on several divisive questions. So, I built a simple survey and gathered tens of thousands of data points from developers on the Internet. In this post, I will walk through how I built a web app that:

  • collects free-form JSON data
  • queries live data with SQL
  • has no backend servers

To stay focused on collecting click data, we'll keep the app's design simple: a single page presenting a series of binary options, on which clicking will record the visitor's response and then display live aggregate results. (Spoiler alert: you can view the results here.)

binary-survey

Creating the static page

Keeping with the spirit of simplicity, we'll use vanilla HTML/CSS/JS with a bit of jQuery to build the app's frontend. Let's start by laying out the HTML structure of the page.

<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <title>The Binary Survey</title>
    <script src="//code.jquery.com/jquery-3.3.1.min.js"></script>
    <script src="script.js"></script> 
  </head>
  <body>
    <div id="header">
      <h1>The Binary Survey</h1>
      <p>Powered with ❤️ by <b><a href="https://rockset.com">Rockset</a></b></p>
      <h3>Settle the debate around very important developer issues!<br><br>We've surveyed <span id="count">...</span> developers. Now it's your turn.</h3>
    </div>
    <div id="body"></div>
  </body>
</html>

Note that we left the #body element empty—we'll add the questions here using Javascript:

// [left option, right option, key]
QUESTIONS = [
  ['tabs', 'spaces', 'tabs_spaces'],
  ['vim', 'emacs', 'vim_emacs'],
]

function loadQuestions() {    
  for (var i = 0; i < QUESTIONS.length; i++) {
    $('#body').append(' \
      <div id="q' + i + '" class="question"> \
        <div id="q' + i + '-left" class="option option-left">' + QUESTIONS[i][0] + '<div class="option-stats"></div></div> \
        <div class="spacer"></div> \
        <div class="prompt"> \
          <div>⟵ (press h)</div> \
          <div class="centered">vote to see results</div> \
          <div>(press l) ⟶</div> \
        </div> \
        <div class="results"> \
          <div class="bar left"><div class="stats"></div></div> \
          <div class="bar right"><div class="stats"></div></div> \
        </div> \
        <div id="q' + i + '-right" class="option option-right">' + QUESTIONS[i][1] + '<div class="option-stats"></div></div> \
      </div> \
    ');

    $('#q' + i + '-left').click(handleClickFalse(i));
    $('#q' + i + '-right').click(handleClickTrue(i));
  }
}

function handleClickFalse(index) {
  // ...
}

function handleClickTrue(index) {
  // ...
}

By adding the questions with Javascript, we only have to write the HTML and event handlers once. We can even adjust the list of questions at any time by just editing the global variable QUESTIONS.

Collecting custom JSON data

Now, we have a webpage where we want to track user clicks—a classic case of product analytics. In fact, if we were instrumenting an existing web app instead of building from scratch, we would just start at this step.

First, we'll figure out how to model the data we want to collect as JSON objects, and then we can store them in a data backend. For our data layer we will use Rockset, a service that accepts JSON data and serves SQL queries, all over a REST API.

Data model

Since our survey has questions with only two choices, we can model each response as a boolean—false for the left-side choice and true for the right-side choice. A visitor may respond to any number of questions, so a visitor who prefers spaces and uses vim should generate a record that looks like:

{
  'tabs_spaces': true,
  'vim_emacs': false
}

With this model, we can implement the click handlers from above to create and send this custom JSON object to Rockset:

let vote = {};
const ROCKSET_SERVER = 'https://api.rs2.usw2.rockset.com/v1/orgs/self';
const ROCKSET_APIKEY = '...';

function handleClickFalse(index) {
  return () => { applyVote(index, false) };
}

function handleClickTrue(index) {
  return () => { applyVote(index, true) };
}

function applyVote(index, value) {
  vote[QUESTIONS[index][2]] = value;
  saveVote();
}

function saveVote() {
  // Save to Rockset
  $.ajax({
    url: ROCKSET_SERVER + '/ws/demo/collections/binary_survey/docs',
    headers: {'Authorization': 'ApiKey ' + ROCKSET_APIKEY,
    type: 'POST',
    data: JSON.stringify(vote)
  });
}

In practice, ROCKSET_APIKEY should be set to a value obtained by logging into the Rockset console. The Rockset collection which will store the documents (in this case demo.binary_survey) can also be created and managed in the console.

Updating existing responses

Our code so far has a shortcoming: consider what happens when a visitor clicks “spaces” then clicks “vim.” First, we will send a document with the response for the first question. Then we'll send another document with responses for two questions. These get stored as two separate documents! Instead we want the second document to be an update on the first.

With Rockset, we can solve this by giving our documents a consistent _id field, which is treated as the primary key of a document in Rockset. We'll generate this field as a random identifier for the visitor on page load:

function onPageLoad() {
  vote['_id'] = 'user' + Math.floor(Math.random() * 2**32);
}

Now let's run through the previous scenario again. When the web page loads, the "vote" object gets seeded with an ID:

{
  "_id": "user739701703"
}

When the visitor clicks a choice for one of the questions, a boolean field is added:

{
  "_id": "user739701703",
  "tabs_spaces": true
}

The visitor can continue to add more responses:

{
  "_id": "user739701703",
  "tabs_spaces": false,
  "vim_emacs": true
}

Or even update previous responses:

{
  "_id": "user739701703",
  "tabs_spaces": true,
  "vim_emacs": true
}

Every time the response changes, the JSON is stored as a Rockset document and, because the _id field matches, any previous response for the current visitor is overwritten.

Saving state across sessions

We'll add one more enhancement to this: for visitors who leave the page and come back later, we want to keep their responses. In a full-blown app we may have an authentication service to establish sessions, a users table to persist IDs in, or even a global frontend state to manage the ID. For a splash page that anyone can visit, such as the survey we're building, we may not have any previous context for the user. In this case, we'll just use the browser's local storage to maintain the visitor's ID.

Let's modify our Javascript code to implement this mechanism:

const ROCKSET_SERVER = 'https://api.rs2.usw2.rockset.com/v1/orgs/self';
const ROCKSET_APIKEY = '...';

function handleClickFalse(index) {
  return () => { applyVote(index, false) };
}

function handleClickTrue(index) {
  return () => { applyVote(index, true) };
}

function applyVote(index, value) {
  let vote = loadVote();
  vote[QUESTIONS[index][2]] = value;
  saveVote(vote);
}

function loadVote() {
  let vote;

  // Handle and reset malformed vote
  try {
    vote = JSON.parse(localStorage.getItem('vote'));
  } catch {
    vote = null;
  }

  // Set _id if unassigned
  if (!vote || !vote['_id']) {
    vote = {};
    vote['_id'] = 'user' + Math.floor(Math.random() * 2**32);
  }

  return vote;
}

function saveVote(vote) {
  // Save to local storage
  localStorage.setItem('vote', JSON.stringify(vote));

  // Save to Rockset
  $.ajax({
    url: ROCKSET_SERVER + '/ws/demo/collections/binary_survey/docs',
    headers: {'Authorization': 'ApiKey ' + ROCKSET_APIKEY,
    type: 'POST',
    data: JSON.stringify(vote)
  });
}

Data-driven app: aggregations on the fly

At this point, we've created a static page and instrumented it to collect custom click data. Now let's put it to use! This generally takes one of two forms:

  • an internal dashboard informing product decisions or triggering alerts around unusual behavior
  • a user-facing feature to enhance a data-driven product

Our survey's use case falls under the latter: as an incentive to answer questions for curious visitors, we'll reveal the live results of each question upon clicking a choice.

To implement this, we'll write Javascript code to call Rockset's query API. We want to send a SQL query that looks like:

SELECT 
    ARRAY_CREATE(COUNT_IF("tabs_spaces"), COUNT("tabs_spaces")) AS q0, 
    ARRAY_CREATE(COUNT_IF("vim_emacs"), COUNT("vim_emacs")) AS q1, 
    # ...
    count(*) AS total 
FROM demo.binary_survey

The response will be a JSON object with counts for each question (count of “true” responses and total count of responses), along with a count of unique visitors.

{
  "q0": [
    102,
    183
  ],
  "q1": [
    32,
    169
  ],
  "q2": [
    146,
    180
  ],
  ...
  "total": 212
}

We can parse this data and set attributes on HTML elements to relay the results to the visitor. Let's write this out in Javascript:

const ROCKSET_SERVER = 'https://api.rs2.usw2.rockset.com/v1/orgs/self';
const ROCKSET_APIKEY = '...';
const QUERY = '...';

function refreshResults() {
  $.ajax({
    url: ROCKSET_SERVER + '/queries',
    headers: {'Authorization': 'ApiKey ' + ROCKSET_APIKEY},
    type: 'POST',
    success: function (data) {
      results = data[0];

      // set the visitor count in the header
      $('#count').html(results['total']);

      // for each question, display the count and % for each side (text + bar graph)
      for (var i = 0; i < QUESTIONS.length; i++) {
        let left_count = results['q' + i][1] - results['q' + i][0];
        let right_count = results['q' + i][0];
        let left_pct = (left_count / (left_count + right_count) * 100).toFixed(2) + '%';
        let right_pct = (right_count / (left_count + right_count) * 100).toFixed(2) + '%';
        $('#q' + i + ' .left').width(left_pct);
        $('#q' + i + ' .right').width(right_pct);
        $('#q' + i + ' .left .stats').html('<b>' + left_pct + '</b> (' + left_count + ')');
        $('#q' + i + ' .right .stats').html('(' + right_count + ') <b>' + right_pct + '</b>');
        $('#q' + i + ' .option-left .option-stats').html('(' + left_pct + ')');
        $('#q' + i + ' .option-right .option-stats').html('(' + right_pct + ')');
      }
    }
  });
}

Even with tens of thousands of data points, this AJAX call returns in around 20ms, so there is no concern executing the query in real time. In fact, we can update the results, say every second, to give the numbers a live feel:

setInterval(refreshResults, 1000);

Finishing touches

Access control

We've written all the logic for sending data to and retrieving data from Rockset on the client side of our app. However, this exposes our fully privileged Rockset API key publicly, which of course is a big no-no. It would give anyone full access to our Rockset account and also possibly allow a DoS attack. We can achieve scoped permissions and request throttling in one of two ways:

  • use a restricted Rockset API key
  • use a lambda function as a proxy

The first is a feature still-in-development at Rockset, so for this app we'll have to use the second.

Let's move the list of questions and the logic that interacts with Rockset to a simple handler in Python, which we'll deploy as a lambda on AWS:

import json
import os
import requests

APIKEY = os.environ.get('APIKEY') if 'APIKEY' in os.environ else open('APIKEY', 'r').read().strip()
WORKSPACE = 'demo'
COLLECTION = 'binary_survey'
QUESTIONS = [
    ['tabs', 'spaces', 'tabs_spaces'],
    ['vim', 'emacs', 'vim_emacs'],
]

def questions(event, context):
    return {'statusCode': 200, 'headers': {'Access-Control-Allow-Origin': '*'}, 'body': json.dumps(QUESTIONS)}

def vote(event, context):
    vote = json.loads(event['body'])
    print({'data': [vote]})
    print(json.dumps({'data': [vote]}))
    r = requests.post(
        'https://api.rs2.usw2.rockset.com/v1/orgs/self/ws/%s/collections/%s/docs' % (WORKSPACE, COLLECTION),
        headers={'Authorization': 'ApiKey %s' % APIKEY, 'Content-Type': 'application/json'},
        data=json.dumps({'data': [vote]})
    )
    print(r.text)
    return {'statusCode': 200, 'headers': {'Access-Control-Allow-Origin': '*'}, 'body': 'ok'}

def results(event, context):
    query = 'SELECT '
    columns = [q[2] for q in QUESTIONS]
    for i in range(len(columns)):
        query += 'ARRAY_CREATE(COUNT_IF("%s"), COUNT("%s")) AS q%d, \n' % (columns[i], columns[i], i)
    query += 'count(*) AS total FROM %s.%s' % (WORKSPACE, COLLECTION)
    r = requests.post(
        'https://api.rs2.usw2.rockset.com/v1/orgs/self/queries',
        headers={'Authorization': 'ApiKey %s' % APIKEY, 'Content-Type': 'application/json'},
        data=json.dumps({'sql': {'query': query}})
    )
    results = json.loads(r.text)['results']
    return {'statusCode': 200, 'headers': {'Access-Control-Allow-Origin': '*'}, 'body': json.dumps(results)}

Our client-side Javascript can now just make calls to the lambda endpoints, which will act as a relay with the Rockset API.

Adding more questions

A benefit of the way we've build the app is we can arbitrarily add more questions, and everything else will just work!

QUESTIONS = [
    ['tabs', 'spaces', 'tabs_spaces'],
    ['vim', 'emacs', 'vim_emacs'],
    ['frontend', 'backend', 'frontend_backend'],
    ['objects', 'functions', 'object_functional'],
    ['GraphQL', 'REST', 'graphql_rest'],
    ['Angular', 'React', 'angular_react'],
    ['LaCroix', 'Hint', 'lacroix_hint'],
    ['0-indexing', '1-indexing', '0index_1index'],
    ['SQL', 'NoSQL', 'sql_nosql']
]

Similarly, if a visitor only answers a subset of the questions, no problem—the client-side app and Rockset can handle missing values gracefully.

In fact, these circumstances are generally common with product analytics, where you may want to start tracking an additional attribute on an existing event or if a user is missing certain attributes. Since we've built this app using a schemaless approach, we have the flexibility to handle those situations.

Rendering and styling

We haven't fully covered the logic yet for rendering and styling elements on the DOM. You can see the full completed source code here if you're curious, but here's a summary of what's left to do:

  • add some JS to show/hide results and prompts as the visitor progresses through the survey
  • add some CSS to make the app look nice and adapt the layout for mobile visitors
  • add in a post-survey-completion congratulatory message

And voila, there we have it! End to end, this app took just a few hours to set up. It required no spinning up servers or pre-configuring databases, and it was easy to adapt while developing as there was it was just recording free-form JSON. So far over 2,500 developers have submitted responses and the results are, if nothing else, interesting to look at.

Results, as of the writing of this blog, are here. And the source code is available here.