Building a Data Lake: An Agile Approach


Immense economic growth, ever-expanding cloud-storage capacity and increasing network connectivity have produced an exponentially growing ocean of data. Data flows into companies’ data stores in countless structured and unstructured formats from a vast variety of sources.

Despite technological advancements aimed at easing data collection, storage, processing and analytics, business and IT leaders often remain overwhelmed by the volume, variety and velocity of the data at their disposal. The biggest challenge remains harnessing meaningful insights from this mass of data.

This is where data lakes can provide value.

What is a Data Lake?

A Data Lake is a centralised storage repository that allows organisations to store data in any format, from any source, and at any scale.

Key capabilities include:

  • Stores relational and non-relational raw data at the lowest level of granularity.
  • Leverages cheap, scale-out storage to retain data for as long as required.
  • Acts as a staging layer for structured and unstructured data analysis.

The overall aim of a data lake is to improve data analytics performance and native integration, which has positive effects across the enterprise. According to an Aberdeen survey, organisations that use a data lake outperform their competitors by 9% in organic revenue growth. Data lakes enabled these companies to perform new types of analytics, such as machine learning over new sources stored in the lake, including log files, social media data and data from internet-connected devices. As a result, they could identify opportunities for overall business growth and boost their revenue.

What value does a data lake offer?

Though a data lake is often used alongside an enterprise data warehouse (EDW), it costs significantly less to operate than an EDW. Data lakes allow companies to use affordable, readily available hardware. Data is stored in its native format and can be reconfigured as and when needed.

The ability to harness more data from more sources in less time empowers businesses to analyse data in different ways for faster decision making.

Here’s how a data lake offers value to organisations:

A comprehensive and holistic solution to big data

Today, data-rich companies hold a variety of data types from numerous sources, stored in the cloud and in on-premises storage. Their data is snowballing and is used for countless use cases and workloads.

Modern data lakes aim to help companies address these challenges with a comprehensive, scalable and flexible big data solution.

Improved customer interactions

A data lake can combine customer data from a CRM platform with data from a marketing platform. This combination helps businesses understand their most compelling customer cohorts and the potential rewards that will increase loyalty and improve customer interactions.
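As a minimal sketch of this idea, the snippet below joins hypothetical CRM and marketing extracts on a shared customer key and flags a high-value, highly engaged cohort. All column names, thresholds and values are illustrative assumptions, not part of any real platform’s schema.

```python
import pandas as pd

# Hypothetical extracts from a CRM and a marketing platform,
# both landed in the data lake in their raw form.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "lifetime_value": [1200.0, 350.0, 800.0],
})
marketing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email_opens_90d": [14, 2, 9],
})

# Join the two sources on the shared customer key, then flag the
# most engaged, highest-value cohort for a loyalty campaign.
combined = crm.merge(marketing, on="customer_id", how="inner")
combined["loyalty_candidate"] = (
    (combined["lifetime_value"] > 500) & (combined["email_opens_90d"] > 5)
)
print(combined.loc[combined["loyalty_candidate"], "customer_id"].tolist())
```

Because the raw extracts sit in the lake in native format, the join keys and thresholds can be revisited as the business question changes, without re-ingesting anything.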

Improve R&D innovation choices

A data lake helps R&D teams test hypotheses, refine assumptions and assess results – such as understanding customers’ willingness to pay for different attributes, or choosing the right material for a product design to improve performance.

Increased operational efficiencies

The Internet of Things (IoT) provides organisations with many ways to collect real-time data from internet-connected devices. A data lake makes it easy to run analytics on machine-generated IoT data, identifying ways to reduce operational costs and increase quality.
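To make this concrete, here is a small sketch of the kind of analytics that might run over machine-generated readings in the lake. The device names, readings and temperature threshold are all invented for illustration.

```python
from statistics import mean

# Hypothetical machine-generated IoT readings landed in the lake:
# (device_id, temperature_c) tuples streamed from factory sensors.
readings = [
    ("pump-1", 61.0), ("pump-1", 63.5), ("pump-1", 88.0),
    ("pump-2", 58.2), ("pump-2", 59.2),
]

# Aggregate per device and flag any unit whose peak reading exceeds
# a safe operating threshold -- a candidate for preventive maintenance.
THRESHOLD_C = 80.0
by_device = {}
for device, temp in readings:
    by_device.setdefault(device, []).append(temp)

report = {
    device: {"avg": round(mean(temps), 1), "alert": max(temps) > THRESHOLD_C}
    for device, temps in by_device.items()
}
print(report)
```

Spotting an over-temperature unit early, as the alert flag does here, is exactly the kind of operational saving the paragraph above describes.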

Indeed, a well-maintained and well-governed data lake is a goldmine for data scientists, enabling them to build a robust advanced-analytics program. It also allows companies to establish self-service options so that business users can generate their own analyses and reports as and when required.

However, combining data lakes with the organisation’s technology architecture, establishing rules surrounding data lakes, identifying requirements to deploy data lakes and realising business benefits from them can be challenging.

Instead of falling back on tried-and-true methods for designing data lakes and updating technology architectures, companies should apply an agile approach to the design and rollout of their data lakes.

What is an agile approach for a data lake?

An agile approach allows IT and business leaders to address design and technology questions related to data lakes jointly. Many organisations design their data lakes using a “business back” approach: they identify the business units that could gain the most value from the data lake, then factor these into the design of the storage solution and rollout decisions.

Then they incrementally populate the data lake with data for specific use cases. Instead of going all in on one designated solution, they pilot two or three final candidates from different providers to assess real-world performance, scalability and ease of integration.

Benefits of this agile approach are:

  • Implementation and performance challenges surface at an early stage.
  • Feedback is collected from business units throughout the rollout.
  • Agile development teams can continually improve processes and data-governance protocols.

A successful agile data lake is a pattern-based, metadata-driven business data repository that accounts for data security and governance. Data in the lake should be accurate and timely. Under an agile approach, metadata management, master data management and data profiling become accessible and practical.
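Data profiling, in particular, can start very simply. The sketch below profiles a hypothetical raw file before it is promoted out of the staging area; the table, columns and checks are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw extract landed in the lake's staging area.
# Profiling it before promotion catches quality issues early.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "country": ["AU", "AU", None, "NZ"],
})

# A minimal profile: row count, duplicate keys, and null rates.
profile = {
    "rows": len(df),
    "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
    "null_country_pct": float(df["country"].isna().mean() * 100),
}
print(profile)
```

Even a profile this small gives governance teams something concrete to gate on, such as rejecting a load whose duplicate-key count or null rate exceeds an agreed tolerance.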

To build a successful agile data lake, you should keep the following principles in mind:

  • A well-implemented ecosystem, architecture, data models, and methodologies.
  • Incorporating exceptional data processing, governance, and security.
  • The deliberate use of job design best practices.

An agile data lake must have a managed lifecycle so that it can augment business decisions and enhance domain knowledge.

Three critical phases of this lifecycle are:

  • Ingestion: extracting and accumulating raw data in a staging area for downstream processing and archival purposes.
  • Adaptation: loading and transforming raw data into usable formats for further processing and/or use by business users.
  • Consumption: performing data aggregations, analytics, data mining and machine learning, operational system feedback, visualisation and reporting.
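The three lifecycle phases above can be sketched as a tiny pipeline. This is a toy model with in-memory stand-ins for the staging area and curated zone; the record fields and transformations are invented for illustration.

```python
def ingest(raw_records):
    """Ingestion: accumulate raw data, untouched, in a staging area."""
    return list(raw_records)  # keep the lowest-granularity originals

def adapt(staging):
    """Adaptation: transform raw records into a usable, typed format."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in staging
        if r.get("amount") is not None  # in practice, quarantine bad rows
    ]

def consume(curated):
    """Consumption: aggregate the curated data for reporting."""
    return {
        "rows": len(curated),
        "total_amount": sum(r["amount"] for r in curated),
    }

staged = ingest([
    {"user": " Alice ", "amount": "10.5"},
    {"user": "BOB", "amount": "4.5"},
])
summary = consume(adapt(staged))
print(summary)
```

Keeping the phases as separate steps mirrors the managed lifecycle described above: raw data is preserved on ingestion, shaped during adaptation, and only aggregated at consumption time.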

To start building a data lake, you only need one use case, such as offloading ETL workloads from an EDW or running A/B tests. Approaching it in an agile, iterative manner ensures excellent long-term benefits for the business.

Data lakes bring enormous benefits to organisations by creating an enterprise data hub and enabling data discovery, real-time analytics, and machine learning.

Microsoft Azure offers Azure Data Lake, a scalable cloud service comprising data storage and analytics. As an integrated solution for building data lakes and managing big data, Azure Data Lake can maximise business productivity and surface valuable insights from your data. It makes it easy for users to store and process data across platforms, and it integrates seamlessly with operational stores and data warehouses for simplified data management. If you’d like to know more about data lakes and Microsoft Azure’s approach to data analytics, our Inside Info Microsoft BI consultants will be able to assist. Contact Inside Info
