Five things to consider when building a data warehouse
Our Senior Infrastructure Engineer, Paul Barrett, shares the considerations you need to weigh when building a data warehouse.
In recent months we have received several enquiries about building a data warehouse or ‘data platform’. A data warehouse is a central repository for an organisation's data; it handles the collection, cleansing, transformation, and application of that data to generate business insights.
There is no one-size-fits-all approach to building a data warehouse. If you are considering this for your organisation, we recommend talking to an expert who can walk you through the complexities involved and the options open to you.
We asked our very own expert, Paul Barrett, to give us five points organisations should consider when thinking about implementing a data warehouse. Paul's experience comes from building data platforms that enable our analysts to produce the valuable insights that answer business questions.
1. Public cloud vs on-premise
One of the biggest considerations is where to host the data and the processes that maintain and monitor it. In recent years there has been a big push towards the public cloud, but any system or business that has grown up over time is likely to have a mixture of resources across different types of hosting. This adds complexity and challenges to forming a consistent data warehouse.
The advantages of using public cloud are considerable. Products like Microsoft's Azure Data Factory provide several out-of-the-box services for the pipelining and command-and-control functions of a data platform, which can drastically reduce the effort required to implement this critical aspect.
However, the most common argument against using a public cloud provider is cost. The public cloud 'pay-as-you-go' commercial model means the cost of a data platform will vary from month to month as data volumes and pipeline executions evolve. Compare this to a fully on-premise implementation, where the compute required to operate the platform is finite and most likely purchased through capital expenditure, meaning there is little to no variance in ongoing running costs.
This all means there is no one-size-fits-all approach, and analysis of the options and existing investments is needed to find the best fit for each implementation.
2. Data security
Data security is one of the most critical aspects of a data warehouse to get right. Data from many different sources, content creators, and consumers flows through these platforms, so ensuring that the right audience has access to what they need, while maintaining the key security principle of least privilege, is quite the challenge.
Data security must be kept at the forefront of your mind while building out the platform. This means ensuring sensitive data from source systems is only ingested if it is required and data protection consent is in place to allow its use. It also means filtering data both vertically (by column) and horizontally (by row) based on the role of the individual accessing it.
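To make that concrete, here is a minimal sketch of role-based filtering. The role names, columns, and region rule are invented for illustration; in a real warehouse this enforcement typically lives in the platform itself (for example row-level security policies or column masking) rather than in application code.

```python
# Minimal sketch of vertical (column) and horizontal (row) filtering.
# The roles, columns, and region rules below are illustrative assumptions.

ROLE_COLUMNS = {
    "analyst": ["order_id", "region", "order_value"],          # no personal data
    "finance": ["order_id", "region", "order_value", "email"], # consented use only
}

ROLE_ROW_FILTER = {
    "analyst": lambda row: row["region"] == "EMEA",  # analysts see their region only
    "finance": lambda row: True,                     # finance sees all rows
}

def filter_for_role(rows, role):
    """Apply least privilege: drop rows and columns the role may not see."""
    columns = ROLE_COLUMNS[role]
    keep = ROLE_ROW_FILTER[role]
    return [{col: row[col] for col in columns} for row in rows if keep(row)]

orders = [
    {"order_id": 1, "region": "EMEA", "order_value": 120.0, "email": "a@example.com"},
    {"order_id": 2, "region": "APAC", "order_value": 80.0,  "email": "b@example.com"},
]

print(filter_for_role(orders, "analyst"))
# [{'order_id': 1, 'region': 'EMEA', 'order_value': 120.0}]
```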
3. Consumers and data-driven decisions
The design of the data warehouse requires close collaboration with the intended consumers of the reports or the insights that will be generated from it. This is to ensure that their needs and the business needs are met from the outset.
The aim here is to ensure that the relevant data, both statistical and dimensional, is captured in a manner that works for the tooling the consumers will be using, while also allowing relationships and comparisons to be made between the multiple sources to answer those complex questions. This, after all, is the reason for the data warehouse to exist in the first place.
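As a small illustration of that dimensional point, the sketch below joins a fact table to two dimension tables so that a measure can be sliced by the attributes consumers ask about. All table and column names are invented for the example.

```python
import pandas as pd

# Illustrative star-schema fragment: one fact table (sales) plus two
# dimension tables (date, product). All names are invented for the example.
fact_sales = pd.DataFrame({
    "date_key":    [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "units_sold":  [10, 4, 7],
    "revenue":     [100.0, 60.0, 70.0],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month":    ["2024-01", "2024-01"],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category":    ["Widgets", "Gadgets"],
})

# Consumers ask questions in business terms ("revenue by category per month"),
# which the dimensional model answers with simple joins and aggregations.
report = (
    fact_sales
    .merge(dim_date, on="date_key")
    .merge(dim_product, on="product_key")
    .groupby(["month", "category"], as_index=False)["revenue"].sum()
)
print(report)
```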
4. Pipeline design and data freshness
The pipeline is the mechanism used to extract, transform, and load data into the warehouse. Key aspects to consider include:
- How large is the source data?
- Are there quality issues with the data that need addressing as it is processed?
- How fresh does the data in the warehouse need to be?
- What format is the data in?
- Where is it located?
- Is it accessible directly by the pipeline, or are intermediaries required?
All of this shapes the choice of technologies and techniques employed to build the pipeline and populate the warehouse.
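As a minimal sketch of such a pipeline, the example below extracts from a CSV source, addresses a simple quality issue in the transform step, and loads into a SQLite table standing in for the warehouse. The file name, columns, and the choice of SQLite are assumptions made for illustration.

```python
import csv
import sqlite3

# Minimal extract-transform-load sketch. The source file, column names,
# and SQLite "warehouse" are stand-ins chosen for illustration.

def extract(path):
    """Extract: read raw rows from a CSV source."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: address a simple quality issue (missing or blank values)."""
    clean = []
    for row in rows:
        if not row.get("customer_id"):   # drop rows failing a basic check
            continue
        row["order_value"] = float(row.get("order_value") or 0.0)
        clean.append(row)
    return clean

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned rows into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (customer_id TEXT, order_value REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(r["customer_id"], r["order_value"]) for r in rows],
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    # How often this runs determines the freshness of the warehouse data.
    load(transform(extract("orders.csv")))
```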
5. Cost estimating
Estimating the cost of a data warehouse is perhaps one of the trickier aspects, and it can look deceptively simple from a high level. As mentioned above, there are big differences between the commercial models of the public cloud and on-premise.
The easier of the two to estimate is the on-premise implementation, where using existing compute or purchasing dedicated resources provides a consistent and predictable estimate for running the platform.
Compare that with the public cloud, where most tooling and resources are billed at a very granular level: the amount of storage used, the number of pipeline executions, the compute required to run each execution, and data transit into and out of the platform.
This means there are many variables that can influence the cost of operating the platform. For example, simply increasing the frequency of pipeline executions changes several of the above variables, each with a different impact on cost. This gives an indication of the complexity involved in estimating running costs before the data has been analysed and the pipeline designed.
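To show why, here is a toy cost model. Every unit price and volume in it is an invented placeholder rather than real provider pricing, but it demonstrates how one change (doubling pipeline frequency) moves several billable variables at once.

```python
# Toy public-cloud cost model. All unit prices and volumes below are
# invented placeholders, not real provider pricing.

PRICE_PER_GB_STORED  = 0.02   # per GB-month
PRICE_PER_EXECUTION  = 0.005  # per pipeline run
PRICE_PER_COMPUTE_HR = 0.30   # per hour of pipeline compute
PRICE_PER_GB_EGRESS  = 0.08   # per GB leaving the platform

def monthly_cost(storage_gb, runs_per_day, hours_per_run, egress_gb):
    """Sum the main billable variables for one month (30 days assumed)."""
    runs = runs_per_day * 30
    return (
        storage_gb * PRICE_PER_GB_STORED
        + runs * PRICE_PER_EXECUTION
        + runs * hours_per_run * PRICE_PER_COMPUTE_HR
        + egress_gb * PRICE_PER_GB_EGRESS
    )

# Doubling pipeline frequency changes several variables at once:
# executions and compute scale directly, and fresher data often means
# more egress to downstream consumers too.
print(monthly_cost(storage_gb=500, runs_per_day=4, hours_per_run=0.5, egress_gb=50))
print(monthly_cost(storage_gb=500, runs_per_day=8, hours_per_run=0.5, egress_gb=100))
```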
In conclusion
Hopefully this gives some useful insight into what to consider when talking to suppliers or internal teams about a data platform project. There really is no one-size-fits-all approach, so it is worthwhile talking to someone who has worked with a variety of implementations, as they can bring options and opinions to your project that might make the difference between cost-effective success and expensive failure.