Big Data, Small Data: The Frame Makes The Picture

Beware the Era of the Statistician

Bernard Marr recently published a wildly popular post in the Big Data channel that highlighted some of the best jobs in the growing industry of Big Data. His article scared me a little – it seems from his perspective that Big Data will employ mostly technically proficient individuals (which makes sense) and statisticians (or statisticians disguised as data scientists).

But what about the storytellers? What about having the ability to properly frame questions and derive real value by telling the story that your Big Data is trying to tell?

The Age Old Issue of Lies, Damned Lies and Statistics


My company recently performed an in-depth analysis of the Top Posts in all of the Pulse channels. We really wanted to try and get some insights into how the Pulse algorithm selects articles for the purpose of featuring them (with some very revealing results). One of the metrics we used to assess the success of a post is the correlation between audience interaction and the day on which a post was published.

In our review of other research we found the following:

  1. One researcher found that readers are more likely to share an article on a Tuesday. From this it follows that you should publish a post on a Tuesday if you want it to be shared more often.
  2. Another researcher (who assessed 3000 of the most successful posts) found that readers are more likely to view an article on a Thursday. From this it follows that publishing on a Thursday will give you more views.

Statistically, if you follow their advice your post should have a higher chance of success. But the statistics here are quite misleading. We couldn’t help but notice two issues: firstly, the researchers had different definitions for success (views versus shares) and secondly, neither of them provided a proper context that relates to how Pulse actually selects posts.

The frames they created for their research were too broad in terms of context, and too narrow in terms of success factors. Moreover, they relied on correlation (higher views correlate with a Thursday) rather than causation (the underlying drivers that produce higher views on a Thursday, and that might also be present on other days).

Your Big Data Answer Comes From Your Big Data Question

This is perhaps basic research theory, but we often forget its value – in order to get an answer, we must first understand the question! When we performed the study on LinkedIn Pulse, we asked three key questions:

  1. Context: Where will we find a successful post?
  2. Metrics: How do we measure a successful post?
  3. Boundaries: How should we limit our analysis to get a reliable result?

We took some time to manually assess a number of successful posts and found that their context is almost always the Top Posts section of a Pulse channel. Using this context, we were able to compare the success of posts in different channels, which adds a critical dimension to understanding why certain featured posts do better than others.

We also noticed a pattern that seemed to relate to all of the possible ways in which the audience can interact with a post. We realised that LinkedIn doesn’t measure success by views alone (a view can be a fraction of a second after all), nor by likes or shares alone. LinkedIn’s algorithm seems to use a number of ratio metrics that measure success in terms of overall audience interaction a) over time (we called this velocity) and b) in comparison to other posts in the channel (we called this fame).
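To make the two ratio metrics concrete, here is a minimal Python sketch. The exact formulas are our illustration only, not LinkedIn's actual algorithm: it assumes that likes, comments and shares count equally as interaction, that velocity is interaction per hour since publication, and that fame is a post's interaction relative to the channel average. The `Post` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Post:
    title: str
    likes: int
    comments: int
    shares: int
    hours_live: float  # hours since the post was published


def interactions(post: Post) -> int:
    # Assumption: every kind of interaction carries the same weight.
    return post.likes + post.comments + post.shares


def velocity(post: Post) -> float:
    # Interaction over time: total interactions per hour since publication.
    if post.hours_live <= 0:
        raise ValueError("hours_live must be positive")
    return interactions(post) / post.hours_live


def fame(post: Post, channel: list[Post]) -> float:
    # Interaction compared to other posts: ratio to the channel average.
    channel_average = mean(interactions(p) for p in channel)
    if channel_average == 0:
        return 0.0
    return interactions(post) / channel_average


# Made-up numbers, purely to show how the ratios behave.
channel = [
    Post("Post A", likes=30, comments=10, shares=20, hours_live=12),
    Post("Post B", likes=10, comments=5, shares=5, hours_live=10),
]
for post in channel:
    print(post.title, velocity(post), fame(post, channel))
```

A fame value above 1 means the post attracts more interaction than the average post in its channel. The point of using ratios is that a post in a small channel can be compared fairly with a post in a large one.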

Finally, our research revealed that posts become successful in a number of ways: they are published by an Influencer, by a LinkedIn Editor, by a News Agency (like CNN) or analysed by the Pulse algorithm for inclusion. Influencers and Editors seem to be able to select channels manually, and by definition have access to essentially the whole of the LinkedIn community through preferred broadcast channels. News Agencies publish off-site, which means that their content isn't analysed in the same way as member-published posts. A study that includes posts from these sources will therefore give a very wrong picture of what it takes to succeed! For that reason, we excluded posts from these sources in our study.
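In practice, this boundary is a filter applied before any metric is calculated. The Python sketch below shows the idea, assuming each post record carries a `source` label. The label values and the `member_posts` helper are hypothetical names chosen for illustration, and the sample records are made up.

```python
# Sources whose posts reach the audience through a different route
# (hypothetical labels, chosen for this example).
EXCLUDED_SOURCES = {"influencer", "editor", "news_agency"}


def member_posts(posts: list[dict]) -> list[dict]:
    # Keep only posts published by ordinary members, so that every
    # post in the sample was selected in the same way.
    return [post for post in posts if post["source"] not in EXCLUDED_SOURCES]


# Made-up sample records.
sample = [
    {"title": "Post A", "source": "member"},
    {"title": "Post B", "source": "influencer"},
    {"title": "Post C", "source": "news_agency"},
    {"title": "Post D", "source": "member"},
]

for post in member_posts(sample):
    print(post["title"])
```

Filtering first keeps the comparison fair: whatever success metric is applied afterwards, it only compares posts that had to earn their place in the same way.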

Our study revealed something interesting: The most successful member posts (as measured in terms of a ratio between all audience activities) were actually published on a Wednesday. But our study also revealed that the day on which a post is published is significantly less important than how a post aligns with what we deduced to be a channel’s tag cloud.

How Framing Affects Your Big Data Implementation

There is very little value in spending large sums of money, time and effort on setting up a Big Data solution that will be used to spit out junk answers to junk questions. Any data analysis tool, however large its sample (because Big Data really only refers to large samples from disparate sources), is worthless if it doesn’t serve a well-defined research frame. If you want clear answers, you should create a good frame by determining the context, metrics and boundaries of your question.
