This is a visualization of Wikipedia edits made by a single bot.

What is Big Data?

If you’ve been paying any attention to the tech industry in the past year or so, you’ve probably seen almost as many references to “big data” as you’ve seen to “the cloud.” Both are ridiculous and fairly meaningless umbrella terms for things that actually matter. More on the cloud later, but as I mentioned yesterday, big data is actually important and has the potential to meaningfully change the way we do a lot of things.

Business Intelligence

Big data starts with BI. Although the term was first coined in 1958, business intelligence (BI) was all the rage in the 1990s and into the early 2000s. I was even responsible for a project to link some BI concepts in a balanced scorecard methodology to, hopefully, improve performance in one of the commands where I served. The basic idea of BI was this: your business generates lots of data. Shouldn’t you use it to make decisions?

Whether you are a Deming disciple or just a business leader who would like to make informed decisions, there is a set of theories, methods and technologies that help you turn data into informed decisions. It’s a great idea and, like everything else in data science, works great if… If you have the data you need to make the decision. If that data is formatted in a useful way. If you can formulate a question that can actually be answered with the data.

Lots of ifs. Lots of opportunities to create more work just to get to the answers you need (as you might have to start collecting data you don’t already have or format or reformat data to meet this new need) and lots of opportunities to succumb to “garbage in, garbage out” if the data quality isn’t very good.

Oh, then there are the practical concerns. Storage, computing, tools, sizes of datasets. If you had lots of interesting data, your ability to answer interesting questions became greater in theory, but much harder (or at least more expensive) in practice.

So, we sort of limped along for the better part of a decade until someone developed a new set of tools.


Bigger Data

Google ran into this problem first. Or, at least, the problems they were working on and the scale that they were working on made it clear that they were going to have to do things differently. If you are indexing the entire internet to make it searchable, there is no way a conventional database is going to do that for you. When you are answering 5,000,000,000 queries per day from all over the world on really poorly structured data that was never designed to be in a database in the first place, and your users expect not just an answer, but the best answer, in tenths of a second, the existing tools just don’t cut it.

So Google built their own tools. They built their own file system (GFS) to allow them to duplicate data all over the world and manage it on piles of (cheaper) commodity hardware instead of expensive purpose-built hardware. They invented MapReduce and then, and this is the important part, published papers describing how both worked, from which came Hadoop, released to the masses under an open source* license.
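To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch of the pattern in Python, using the classic word-count example. It is not Google's or Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. It only shows the shape of the computation: map each input to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = []
    # On a real cluster, each document could be mapped on a different machine.
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle(pairs)
    # Likewise, each key could be reduced on a different machine.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data is data"])
print(counts)
```

The point of splitting the work this way is that neither the map step nor the reduce step needs to see the whole dataset at once, which is what lets the work be spread over piles of commodity hardware.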

At the same time, Facebook was coming of age. You probably don’t think about it, but behind the Facebook page that you interact with on a regular basis lives a truly giant database. This database, however, doesn’t need to follow the constraints of older classical databases. For instance, in a normal “relational” database, it is important that two people can’t modify the same data element at the same time and that queries always produce the singular correct response.

Well, that’s hard; database design is its own science. It’s not that hard if we are talking about a single computer or cluster running a single database instance in a single location. It’s somewhat harder when we try to distribute that cluster over several locations, grow very large or add a significant number of users. It’s nearly intractable (at present, at least) to pull that off at the scale of a billion global users.

If, however, we relax some of those constraints, things get much easier. There is now an entire family of databases, called NoSQL databases, that relax some of those constraints (known as the ACID properties: atomicity, consistency, isolation and durability) or remove some of the functionality that is common to relational databases.

Have you ever seen your Facebook ticker update with something that should be on your feed and yet it takes some seconds or minutes for it to appear there? That’s because the two queries resulted in two different answers for a while. They were asking different parts of the database that held different answers at that time. In some applications, that would be terrible, catastrophic, even. On Facebook, it’s just not a big deal and it allows them to scale in a way that no relational database could.
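Here is a toy sketch, in Python, of how two reads can briefly disagree. It is emphatically not how Facebook's systems are built; the class and method names (`EventuallyConsistentStore`, `write`, `read`, `replicate`) are hypothetical, and the "replicas" are just lists in memory. The only assumption is that a write lands on one copy of the data right away and reaches the other copies later.

```python
class EventuallyConsistentStore:
    """Toy store: writes hit replica 0 immediately and the others later."""

    def __init__(self, n_replicas=2):
        self.replicas = [[] for _ in range(n_replicas)]
        self.pending = []  # (replica_index, post) pairs not yet delivered

    def write(self, post):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[0].append(post)
        for index in range(1, len(self.replicas)):
            self.pending.append((index, post))

    def read(self, replica_index):
        # A read only sees whatever that one replica currently holds.
        return list(self.replicas[replica_index])

    def replicate(self):
        # Deliver the queued writes; afterwards all replicas agree again.
        for index, post in self.pending:
            self.replicas[index].append(post)
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("new photo")

ticker_view = store.read(0)      # this replica already has the post
feed_view = store.read(1)        # this replica has not caught up yet
store.replicate()
feed_view_later = store.read(1)  # now both replicas agree

print(ticker_view, feed_view, feed_view_later)
```

A relational database would typically refuse to return the stale `feed_view`; relaxing that guarantee is the trade that buys the extra scale.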

Big Data

Which brings us to our destination.

So, what is big data? In the business world, it is the application of BI to really big datasets made possible by things like Hadoop, NoSQL and MapReduce. In general, it is the use of really big datasets to answer questions.

What, though, is big? Well, that depends. Any dataset larger than a petabyte is certainly “big” (that’s 1,000 TB or 1,000,000 GB), but some multi-terabyte datasets can easily overwhelm conventional databases and conventional tools. There are even exabyte (1,000 petabytes!) datasets out there.
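If the unit arithmetic is hard to keep straight, this small Python sketch spells it out. It assumes decimal (powers of 1,000) units, which is the convention the numbers above use; the constant and function names are only for illustration.

```python
# Decimal storage units: each step up is a factor of 1,000.
GB_PER_TB = 1_000
TB_PER_PB = 1_000
PB_PER_EB = 1_000

def petabytes_to_gigabytes(petabytes):
    """Convert petabytes to gigabytes by way of terabytes."""
    return petabytes * TB_PER_PB * GB_PER_TB

one_pb_in_gb = petabytes_to_gigabytes(1)
one_eb_in_gb = petabytes_to_gigabytes(PB_PER_EB)

print(f"1 PB = {one_pb_in_gb:,} GB")
print(f"1 EB = {one_eb_in_gb:,} GB")
```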

That’s it. It’s not actually magic; it just appears that way because these new tools allow us to do things we couldn’t do before. We can do them faster, and we can analyze datasets that were either intractable due to their size or, just as likely, really poorly formatted. Big data tools are much less cranky when it comes to what’s called unstructured data.

So, that’s “big data” in a nutshell.

Unfortunately, we still have three problems that big data can’t “magically” fix.

  1. You’ve still got to have the data in the first place,
  2. You’ve still got to have the right data for the question you’re trying to answer, and
  3. You’ve still got to know what question you’re trying to answer.

More on those questions in a future post.


* I’ll do a post on Free and Open Source software in the near future.

The featured image at the top is a visualization of a single user’s (actually, a bot’s) edits on Wikipedia. Such a visualization is made possible by big data analysis.

