A data lake is the latest business intelligence solution. Wikipedia states the idea of a data lake is:
…to have a single store of all data in the enterprise ranging from raw data (which implies exact copy of source system data) to transformed data which is used for various tasks including reporting, visualization, analytics and machine learning. The data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio, video) thus creating a centralized data store accommodating all forms of data.
So a data lake is all of your data. Nothing new there. But the key to understanding a data lake is to also understand a data swamp. According to Wikipedia, a data swamp is:
…a deteriorated data lake, that is inaccessible to its intended users and provides little value.
Most non-internet companies have a data swamp. They keep web logs, but those logs are available only to IT staff and are used only to support the web applications. There are application databases all over the place, but most users of those databases are there to do data entry. Even if an analyst is allowed into an application database, it is usually up to the analyst to find out that the database exists and to request access to it.
So a data lake is about access and providing value. I would add that it is also about organization of your data. I would also add that cloud storage and cloud technology have come a long way towards helping a data lake provide value.
But we still have the question: do you need one? According to a blog post by Chris Campbell, the users of a data lake can be broken down into the following groups:
- Operational – 80% of the users of a data lake are operational. They are looking for key performance metrics or a slice of the same data every day. An established data warehouse will provide everything these users need.
- Shallow analysts – 10% of the users are analysts who perform shallow analysis, primarily using the data warehouse and occasionally using source system data when what they need is not available in the data warehouse.
- Deep analysts – the remaining 10% of the users are analysts who perform deep analysis. They create new data sources based on research and often use many sources of data to answer questions. These users include data scientists, who commonly use big data technology and machine learning.
If your company has deep analysts or is planning on hiring deep analysts, then you need to turn your data swamp into a data lake. Given the trends in competition, if you haven’t done this already, you probably should. My next post will cover the ways a data lake can save you money. Even if your company isn’t planning on doing deep analysis, the cost savings of a formal cloud-based data lake are something to consider.