A big data glossary

Big data is serious. Really serious. But at Big Data Week it became clear that we need to be able to laugh at ourselves… So here is my initial attempt at an alternative glossary for big data. Thanks to the many contributors (intentional and not), and apologies to anyone who disagrees. Enjoy.

[Image: Wordle word cloud of the big data glossary]

Agile analytics: Fast, iterative, experimental analysis often performed by people who aren’t.

Apache: Shorthand for the Apache Software Foundation, home to a collection of 100 open source projects. Not funny, but important.

Big brother: 1) Notional leader of Oceania in dystopian future novel Nineteen Eighty-Four. 2) Regular data-related leader on front page of the Daily Mail in dystopian present 2013.

Big data: Any data problem that I’m working on. See also small data.

Cassandra: Open source, distributed data management system developed at Facebook. See Evil Empire.

Clickstream data: Data logged by client or web server as users navigate a website or software system. As in “In case of death please delete my clickstream”.

Data <insert profession>: Way of making a profession sound up to 22.7 times more sexy. Examples: data journalist, data scientist, data philanthropist. Not yet tried: data accountant, data politician.

Data journalist: Journalist who lets facts get in the way of a story.

Data model: The output of data modelling. Note that this is not a model who wants to sound more sexy.

Data modelling: Evil way of understanding how a process is reflected in data. Frowned upon.

Data miner: A data scientist who isn’t interested in a pay rise. Note that this is not a miner who wants to sound more sexy.

Data mining: The queen of sciences. Alternatively a business process that discovers patterns, or makes predictions from data sets using machine learning approaches.

Data scientist: A magical unicorn. With a good salary.

Dilbert test: Simple test for separating data scientists from IT people. If you take two equally funny cartoon strips, one Dilbert, the other XKCD, then a data scientist will prefer XKCD. An IT professional will prefer Dilbert. If anyone prefers Garfield they are in completely the wrong profession.

Elephants: Obligatory visual or textual reference by anyone involved in Hadoop. These are not the only animals in the zoo.

ETL: Extract, transform and load (ETL) – software and process used to move data.  Not to be confused with FTL, ELT, BLT, or more importantly DLT.

Evil Empire: Large software/hardware/service vendor of choice. For example, Apple, Google, Facebook, IBM, Microsoft etc… The tag is independent of their evil-doing capabilities.

Facebook: Privacy-wrecking future Evil Empire. Source of interesting graph data.

Fail fast: Low-risk, experimental approach to big data innovation where failure is seen as an opportunity to learn. Also excellent excuse when analysis goes terribly, terribly wrong.

Fail slow: Really, really bad idea.

Fork: Split in an open source project when two or more people stop talking to each other.

Google: Privacy-wrecking future Evil Empire. Source of interesting search data.

Google BigQuery: See Evil Empire.

Hadoop: Open source software controlled by the Apache Software Foundation that enables the distributed processing of large data sets across clusters of commodity servers. Often heard from senior managers: “Get me one of those Hadoops that everyone is talking about.”

Hadoop Hive: SQL interface to Hadoop MapReduce. Because SQL is evil. Except when it isn’t.

In-Memory Database: Trying to remember what you just did rather than writing it down. See also fail fast.

Internet of things: Stuff talking to other stuff. See singularity.

Java: Dominant programming language developed in the 1990s at Sun Microsystems. It was later used to build Hadoop and other big data technologies. More importantly, a type of coffee.

Key value pairs: A way of avoiding evil data modelling.

MapReduce: Programming paradigm that enables scalability across multiple servers in Hadoop, essentially making it easier to process information on a vast scale. Sounds easy, doesn’t it?
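
For anyone who wants to see how easy it sounds, here is a minimal single-machine sketch of the paradigm using the traditional word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not Hadoop's API; a real cluster would run the map and reduce steps on many servers and handle the shuffle for you.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (the framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "small data is small"])
```

The hard part in practice is everything this sketch leaves out: distributing the work, moving the data and coping with servers that fail halfway through.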

MongoDB: NoSQL open source document-oriented database system developed and supported by 10gen. Its name derives from the word ‘humongous’. Seriously, where do they get these names?

NoSQL: No SQL. It’s evil. Do not use it, use something else instead.

NOSQL: Frequently (and understandably) confused with NoSQL. It actually means “not only SQL”, or that you forgot to take off caps lock.

No! SQL!: Phrase frequently heard as a badly written query takes down your database/drops your table.  See Bobby Tables.
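
A small sketch of how to keep Bobby Tables out of your database, assuming Python's built-in `sqlite3` module and a hypothetical `students` table: pass user input as a query parameter rather than pasting it into the SQL string.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# The sort of name that causes the phrase above to be shouted.
name = "Robert'); DROP TABLE students;--"

# Parameterised query: the driver treats the whole string as a value, not as SQL.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

# The table survives and the name is stored exactly as typed.
rows = conn.execute("SELECT name FROM students").fetchall()
```

Building the same statement with string formatting is where the trouble starts, so the `?` placeholder is the part worth copying.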

Open data: Data, often from government, made freely available to the public. See also .pdf

Open source: An expensive way to not pay for something.

Python: 1) Dynamic programming language, first developed in the late 1980s. 2) Dynamic comedy team, first developed in the late 1960s.

Pig: High-level platform for creating MapReduce programmes used with Hadoop, also from Apache.

Real time: Illusory requirement for projects.

Relational database management systems (RDBMS): Big ol’ databases.

Singularity: When the stuff talking to other stuff finds our conversation boring.

Small data: (pejorative) Data that you are working on.

Social Media: Facebook, Twitter etc… Sources of trivial data in large volumes. Will save the world somehow.

Social Network: 1) A collection of people and how they interact. May also be on social media. 2) What data scientists really hope they will be part of.

SQL: Standard, structured query language specifically designed for managing data held in Big ol’ databases.

Twitter: Privacy-wrecking future #EvilEmpire. Source of interesting data in less than 140 characters.

Unstructured data: 1) The crap that just got handed to you with a deadline of tomorrow. 2) The brilliant data source you just analysed way ahead of expectations. Often includes video and audio. You weren’t watching YouTube for fun.

V: Letter mysteriously associated with big data definitions. Usually comes in threes.  No one really knows why.

Variety: The thing you fail to get in big data definitions. “I didn’t see much variety in that definition!”

Velociraptors: Scary dinosaurs that really should be part of the definition of big data, terribly underused.

Velocity: The speed at which a speaker at a conference moves from the topic to praising their own company.

Volume: The degree to which a business analyst’s hair exceeds expectations.

XKCD: Important reference manual for big data.

ZooKeeper: Another open source Apache project, which provides centralised infrastructure and services that enable synchronisation across a cluster. Given all the animals in big data you can see why this was needed.