Does Your Data Scientist Hate Your Data?

“This isn’t the right fit for you” - Anonymous Manager -

Those are the words delivered by my manager right after he asked, “How was your weekend?” The lack of emotional intelligence and professional tact aside, the time had come for me to leave my position. This, after eight months of working in three separate relational database systems, generating visualizations from hobbled together data sets, and writing scripts that attempted to fix underlying issues in the data.

While reflecting on my experiences at the company, it would have been easy for my confidence to be shaken. I could have falsely accepted that the problem lay not in the data structures but in my accused ‘lack of attention to detail’. After we agreed that I was not set up for success, I left that day proud of what I had accomplished, but disappointed about what I didn’t. I was excited to build many data products that would drive decision making, but had my hands tied due to the lack of a data governance program and accompanying data warehouse or data vault. Sound familiar?

Data Warehouse?! Can’t I just hire a data scientist?

The Data Scientist, the sexiest job of the 21st century…if I had a nickel. To be clear, my role at the company was that of an analyst and not of a data scientist. However, I did have opportunities to work on advanced analysis and models. This work was largely complicated by the need to script queries and joins across multiple databases, then clean and relabel the data, and aggregate certain metrics before finally moving on to building a model. The initial results from the model didn’t make sense, and researching those issues helped identify more complications in the data set construction. All problems that a data governance and data warehouse program would have solved.

Can your $150,000 dollar a year resource script in python to clean your data? Sure. Can she empty the trash cans and vacuum the floors on each Friday? Probably. However, both scenarios are not an optimal use of time. Your Data Scientist should be focused on building models, hyper-parameter tuning and optimizing outputs. The hours spent cleaning and creating a data set for model input is not their job. Ideally, your Data Scientists should be able to access specific information marts that relate to the business problem they are trying to solve.

Types of Data Warehouse

Data Warehouses are not a one size fits all solution. There are different methodologies and approaches that work in various scenarios. Because of that, you can’t sign in to your AWS or Azure portal, click “create data warehouse using my data” and get going. But maybe with the rise of AI we can put in that feature request. The flip side of that coin is that it doesn’t have to take years of requirements gathering and analysis to construct a data warehouse either.

Data Warehouses gained notoriety in 1992 when Bill Inmon published Building the Data Warehouse. His was a top down approach that aimed to stage data from disparate sources, load that data into the data warehouse and then provide it to end users via data marts. This approach should feel familiar to those experienced with databases, as it is a fully normalized structure. The diagram below shows an Inmon diagram utilizing AWS resources. It’s important to note that the end users are NOT directly accessing data in the Data Warehouse. Instead, dimensional marts are created that end users access.

alt text

In 1996 Ralph Kimball published the data warehouse toolkit, which is a dimensional approach to data warehousing. In the dimensional approach, you are concerned with facts and dimensions, built in a star schema and are NOT worried about normalizing the data. Further, in the Kimball layout, end users and analytical tools are allowed to access the warehouse directly. This is because the warehouse is de-normalized and easily accessible. In both the Inmon and Kimball approach, business rules are applied between the staging areas and the Data Warehouse. If your business rules should change, then you will need to update your data in the warehouse and any reporting previously done becomes obsolete. A Data Vault methodology solves for this and applies business rules after the data is in the warehouse.

alt text

Consider a Data Vault

The Inmon and Kimball approach are very functional for mature enterprises who’s business rules and structures are unlikely to change. Once the enterprise has been evaluated, subject areas defined and business rules in place, they are fairly straightforward to implement. That is why they have been a dominating force in the market. However, those approaches are not designed to be flexible to changes in needs, requirements and lines of business. To solve for that, Dan Linstedt created the Data Vault framework and released it in 2000. The data vault allows for the enterprise to warehouse specific units or functions of the business one at a time. Often, this is done in accordance with Agile practices which makes the methodology even more desirable. As mentioned, business rules are applied between the data warehouse and the marts or directly by end users. This allows the enterprise to store all the source data and makes it easy if business rules should change.

alt text

Which one is right for you?

Whether you are a mature enterprise or a young startup, it takes asking the right questions to decide which approach to take. Framing the right questions to the right parts of the business will help you decide on a methodology and implementation strategy.