Data Management

Creating a Bespoke Data Lake


Profusion needed to consolidate its many different data silos on one platform (a data lake) to save time in extracting and analysing the data and to offer more services to its clients. However, no platform existed that would meet all of Profusion’s clients’ needs. As a result, Profusion decided to build its own with the goal of being better able to create pioneering data science projects for marketers.

Previously, Profusion’s data science team had to write complicated code to combine data sets from several different sources including MySQL, Postgres and Redshift. This took up a lot of resources that could be better spent on analysing data and supporting clients.


The data lake developed by Profusion takes an innovative approach: it is implemented entirely with open-source software. Many companies would be quick to purchase proprietary alternatives to build an end-to-end solution. By using open-source tools such as Hadoop, Spark, Hive, NiFi and Elasticsearch, Profusion was able to build the data lake itself while investing in the continuing development of the open-source community. That community is growing fast and constantly creating new tools. Because the data lake supports open-source technology, Profusion can adopt new open-source tools as they are released, instead of being restricted by the storage solutions it was previously using.

The data lake is modular, meaning it can be scaled in response to client need and demand. Because it can store any type of data, including videos, images and other binary data, it can potentially be used for many different services and products as the need for them arises.

A team consisting of a Data Engineer, a Data Scientist and a DevOps Engineer built the data lake in three months, ensuring that the finished product is fully scripted and reusable. This means that, if needed, Profusion can rebuild the data lake at the click of a button. The data lake has also been built to comply with Profusion’s strict security protocols, which were put in place to meet the security needs of Profusion’s clients, including a global bank, and to future-proof the platform as much as possible against incoming data protection legislation.

To build the complex data pipelines, Profusion’s team followed an extract, transform, load (ETL) process: extracting data from the source (client data, device data, etc.), transforming it (applying anonymisation and pseudonymisation where required to comply with data protection rules), and loading it into Profusion’s new Hadoop-based data lake.
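The extract, transform and load steps described above can be illustrated with a minimal Python sketch. It is not Profusion’s implementation: the field names, the sample rows, the secret key and the in-memory list standing in for the data lake are all hypothetical, and a keyed hash (HMAC-SHA256) is only one possible way to pseudonymise an identifier.

```python
import hashlib
import hmac

# Hypothetical secret key; a real pipeline would load this from a secrets store.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical field names, chosen purely for illustration.
PSEUDONYMISE_FIELDS = {"email"}   # replaced with a keyed hash
DROP_FIELDS = {"full_name"}       # removed entirely


def extract():
    """Stand-in for reading from a source system; returns hard-coded sample rows."""
    return [
        {"email": "a@example.com", "full_name": "A Person", "device": "mobile", "dwell_seconds": 42},
        {"email": "b@example.com", "full_name": "B Person", "device": "desktop", "dwell_seconds": 17},
        {"email": "a@example.com", "full_name": "A Person", "device": "tablet", "dwell_seconds": 8},
    ]


def pseudonymise(value):
    """Return a keyed hash so the same input always maps to the same token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def transform(record):
    """Drop direct identifiers and pseudonymise the configured fields."""
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in PSEUDONYMISE_FIELDS:
            out[field] = pseudonymise(value)
        else:
            out[field] = value
    return out


def load(records, sink):
    """Stand-in for writing to the data lake; appends to an in-memory list."""
    sink.extend(records)


def run_pipeline():
    sink = []
    load([transform(r) for r in extract()], sink)
    return sink
```

Because the hash is keyed and deterministic, records belonging to the same person can still be joined after the transform step, while the original identifier is not stored in the sink.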

As part of the data lake solution, and to relieve reporting pressure on Profusion’s data science team, a semantic layer was created over the data in the data lake. In other words, the data is visualised in an easy-to-understand dashboard that is accessible to other Profusion employees. For example, Profusion’s consultants and CRM teams can now easily see how campaigns are performing in terms of engagement, dwell time, and the devices and operating systems used by the audience.
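As a rough illustration of the kind of summary such a dashboard might present, the Python sketch below groups campaign events by a chosen dimension (device or operating system) and reports the event count and average dwell time. The event structure, field names and sample values are assumptions made for the example and are not taken from Profusion’s system.

```python
from collections import defaultdict

# Hypothetical sample events; field names are illustrative only.
SAMPLE_EVENTS = [
    {"device": "mobile", "os": "Android", "dwell_seconds": 40},
    {"device": "mobile", "os": "iOS", "dwell_seconds": 20},
    {"device": "desktop", "os": "Windows", "dwell_seconds": 10},
]


def summarise_by(events, dimension):
    """Group events by one dimension and compute count and average dwell time."""
    totals = defaultdict(lambda: {"events": 0, "dwell_seconds": 0})
    for event in events:
        bucket = totals[event[dimension]]
        bucket["events"] += 1
        bucket["dwell_seconds"] += event["dwell_seconds"]
    return {
        key: {
            "events": value["events"],
            "avg_dwell_seconds": value["dwell_seconds"] / value["events"],
        }
        for key, value in totals.items()
    }


by_device = summarise_by(SAMPLE_EVENTS, "device")
by_os = summarise_by(SAMPLE_EVENTS, "os")
```

The same function serves any dimension present in the events, which is one simple way to let non-specialist users slice the same data without writing new queries.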


By creating a data lake, Profusion improved team efficiency by 70%. The data lake also allows Profusion to scale products and services to meet client needs, use structured and unstructured data (such as videos, images and social media transcripts) and utilise open-source technology.

Before implementing the data lake, Profusion could offer only basic campaign analysis. It can now offer services such as churn analysis, experimental design and multi-channel testing. The data lake has also enabled Profusion to combine structured and unstructured data, including videos, images and social media, greatly expanding its product and service offering to include Single Customer View, Attribution Modelling, Image Recognition, Recommendation Engines and Sentiment Analysis. Likewise, Profusion can now use open-source technology to provide more advanced data science services for its clients and prospects. Data in the data lake can be retained indefinitely, allowing Profusion to compare historical data with current data sets and find trends and patterns over long periods. The solution also allows Profusion to undertake real-time data analysis, an essential offering for many retailers and marketers and one in growing demand in the industry.