There is something magical about a spreadsheet – it doesn’t matter who you are or what you are trained to do, the moment you click on those cells and make your first shopping list, you somehow believe that you are smarter than you were before. Perhaps it is the ease with which any person can organise information into rows and columns, or perhaps it is the ability to do formulas that automatically update with changes – whatever the reason, spreadsheets are still tremendously popular.
Spreadsheets have been used for decades and companies have amassed large amounts of data using them. For the most part a typical spreadsheet is formatted like the make-up on a Vegas drag queen and most of them have formulas that contain at least seven IF statements (of which half are broken). They are seriously messy, but they are also seriously important data sources for companies investing in Big Data solutions.
Spreadsheets are to Big Data as Napkins are to Innovation
The role of spreadsheets in companies is changing – the emphasis used to be on using them as a blank canvas for creating various types of reports and charts, or for developing financial or data models. There is certainly still a lot of this going on, but as the market for advanced data tools matures, spreadsheets cannot hold their position as the data tool of choice anymore.
Within the realm of data science, creating a spreadsheet is an increasingly popular way to “sketch out” a data concept and to create small-scale proof-of-concept versions of Big Data tools. In this way data scientists can rapidly prototype new ways of managing and analysing large data sets.
But what about those millions of gigabytes worth of spreadsheets clogging up the company servers? Are they now simply defunct archival files that have mostly anthropological value?
Some Advances in Big Data Spreadsheets
For a company like Microsoft, the market's move towards Big Data solutions has not gone unnoticed. New analytics tools are making it easier to manage large data sets, whilst Excel and other spreadsheet solutions are still hampered by practical limitations (in spite of their theoretical abilities).
However, the industry has responded and some players are pushing for spreadsheet-like analyses to integrate with Big Data solutions. Here are some examples:
- Microsoft created an add-in for Excel called “Microsoft Power Query” that allows users to connect Excel to a variety of sources, including Hadoop.
- IBM developed “BigSheets”, which is essentially a web-based front-end for Hadoop and allows for DIY analytics.
- Datameer and 1010data developed technologies for spreadsheet-like analytics on up to a trillion rows.
The Value of Semi-structured Data
Spreadsheets are often classified as unstructured data because the information they contain is not tied to a specific database schema. However, I view spreadsheets as semi-structured data with two very valuable attributes, namely:
- The structural aspects of the spreadsheet: over time your employees may have developed unique ways of dealing with data that are not available in standard tools. Understanding these novel approaches will enrich the efforts of your data scientists.
- The actual content: As messy as it may be, the data contained in spreadsheets can be very useful, especially for historical analyses.
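To make those two attributes a little more concrete, here is a minimal Python sketch of how one might pull them apart. It assumes a sheet that has been exported to CSV with its formulas kept as text (cells starting with "="); the sample data, the function name and the column-letter shortcut are all hypothetical and only there for illustration.

```python
import csv
import io

# Hypothetical sample: a small sheet exported as CSV, formulas kept as text.
SAMPLE_SHEET = '''Region,Q1,Q2,Total
North,100,120,=B2+C2
South,80,,"=IF(C3>0,B3+C3,B3)"
'''


def split_structure_and_content(csv_text):
    """Separate formula cells (structure) from plain values (content)."""
    formulas = {}  # cell reference -> formula text
    records = []   # one dict per data row, formulas and blanks left out
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Data starts on spreadsheet row 2, because row 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        record = {}
        for col_index, (name, cell) in enumerate(zip(header, row)):
            if cell.startswith("="):
                # Shortcut: single-letter columns only (A to Z).
                ref = f"{chr(ord('A') + col_index)}{row_number}"
                formulas[ref] = cell
            elif cell != "":
                record[name] = cell
        records.append(record)
    return formulas, records


formulas, records = split_structure_and_content(SAMPLE_SHEET)
```

In this sketch, `formulas` holds the home-grown logic your employees built (the structure), while `records` holds the raw values (the content). A real workbook would need a proper spreadsheet reader and far more cleaning than this.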
Embrace The Mess
Whether you use spreadsheets as proof-of-concept tools or harvest them for mounds of semi-structured historical data, they remain valuable in companies. It is more than likely that spreadsheets will evolve as the market demands reliable systems that can deal with bigger data sets while keeping the ease of use that spreadsheet solutions have always provided. I encourage you to embrace the mess!