The Data Warehouse Toolkit by Kimball was recommended to me as "The Source" for DW. I just started reading it, so no experience yet.
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_apa_i_LZ-7CbHQTXGRM
Ah man... My Google Fu deserted me. I couldn't find a version called a "Definitive guide...". I wonder, are they possibly published with different titles in different territories?
Hang on... This is the version I have. It seems I may be splitting heirs over titles and subtitles (what a douche 🤦♂️) ...I guess this puts me firmly in the reject pile 😭
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.co.uk/dp/1118530802/ref=cm_sw_r_apan_i_VAWAXBY7TRS6VYKPM5FD
https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802
I saw a similar question in a post and this was recommended and like the poster said it changes how the poster sees databases design and I totally agree. Great book.
FYI, if you’re still in school or recently graduated see if they have access to O’Reilly Media and you can read this and other books for free.
Great resource for exactly what you mentioned.
Have you read The Data Warehouse Toolkit? It's kinda-sorta considered the kinda-sorta bible on this. Regardless, I think it's great reading for anyone dealing with relational data. I linked to Amazon, but the Intl version is available much cheaper on Ebay.
Aside from that, you might also post in /r/BusinessIntelligence, which is probably a bit better of a community for this question (though I probably would have cross-posted here as well).
>data warehouse toolkit
Can you confirm that this is the book you are referring to? Amazon Link
Is this a class assignment you don't want to complete? If you are getting paid to do this and have to ask this sub you do it for you.. Wow.
There is a chapter on purchasing use cases in the bible :
https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_apa_i_B64hFbEWCEBND
I've also hired & trained several DE teams over the years and currently lead a large data & analytics practice that is 100% Azure Data Engineering & Power BI. Also, I'm the mod over at r/PowerBI (please subscribe 😊).
Some thoughts -
Correct. To be specific this
Look up the DMBOK and Ralph Kimball’s The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_api_r4XMBbY0729K9 .
As someone who has also made the DS → DE transition, your focus will (broadly speaking) shift from "how can I use this data?" to "how can I help others use this data?"
To help answer that question, I got the most bang for my buck studying dimensional modeling. Classics include The Data Warehouse Toolkit and Agile Data Warehouse Design.
Which specific tools/libraries you should learn depends on your new role. Given that you're entering an AWS shop, I assume DMS and Glue are on the table.
Amazon has some books that you should get:
Also, learn about Database normalization from YouTube
The star schema is the jazz of database design--pretty every classical rule of database design that you were told to follow are going to be broken in the design of a star schema. (It's all about knowing how & when to break the rules properly to carry the music analogy one step further...)
Ralph Kimball's The Data Warehouse Toolkit is widely considered the bible of data warehouse/star schema design. While there are other schools of thought on star schemas, he is considered to be the starting point for jumping into star schema design.
Yes, the fact table is made up of FKs to the dimension tables & measures--typically counts & financial elements. And the measures will contain previously calculated and/or derived values so that when it comes to generating results in your queries, there will be very, very few calculations performed--it's more about retrieving rows & performing few/any aggregate functions to display your results.
You'll realize the potential power of dimension tables when it comes to dates alone. Think of what it would take for you to create a report with your transactional tables that compare data from only the first Monday of every month. Finding the date & determining whether or not it's a Monday...ugh. But as a dimension, dates can be accessed like this (let's say I want the first Monday for every month in 2015):
SELECT sum(total_sales) , sum(product_cost) , sum(gross_margin) , avg(gross_margin) , sum(widgets_sold) --or could be count(widgets_sold) depending on how you store the value FROM FACT_TRANSACTIONS INNER JOIN DATE_DIM on FACT_TRANSACTIONS.date_id = DATE_DIM.id WHERE DATE_DIM.year = 2015 and DATE_DIM.day_of_week = 'Monday' and DATE_DIM.week_of_month = 'First' GROUP BY DATE_DIM.month;
And based off that query, it should be easy to see how you could easily filter those results on particular fund(s) by just adding a join to the FUNDS_DIM dimension table & adding the specific fund(s) in the WHERE clause.
Have a nice “coachable” read.
https://www.amazon.com.br/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802
A 3 edição nao tem em ptbr. Só a segunda ed.
https://www.holistics.io/blog/how-to-read-data-warehouse-toolkit/
Devido a isso recomendo a 3ª
the classic multi-dimensional storage/analysis:
Data Vault:
Datalake, if you plan to be in a MS Stack/Azure (less useful for any other big cloud vendors, i feel):
If you want to get into Snowflake:
First:
Then:
I don't think so...
You should give "The Data Warehouse Toolkit" a read.
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.com/dp/1118530802/ref=cm_sw_r_apan_i_XM6AVWWD293EBY2GAGZZ
No problem, that was the series I went off of years ago and it was enough for me to get a dashboard put together in a few days or so.
Depending on how far you're wanting to get into it/your base knowledge for data I'd also recommend these books:
Data Warehouse Toolkit - Definitive guide for dimensional modeling and creating a star schema (will help you understand how tables should be structured)
The Definitive Guide to Dax - Basically the bible for understanding DAX, how filter context/calculations work in Power BI.
Information Dashboard Design - Great resource for visual design when making a dashboard
Resources on the internet:
Guy in a Cube (Youtube) A lot of great beginner tutorial videos, as well as many cool tips and tricks to make the most out of PBI's functionality and take your reports to the next level
DAX.guide - Detailed explanations of every DAX function with examples, made by the same guys who wrote the DAX book mentioned above.
DAX Patterns - Articles and walkthroughs for solutions to common business problems using DAX (also same guys who wrote the DAX Book)
SQLBI (Youtube) - A lot of both quick and in-depth examples of DAX functions (same guys who wrote the DAX book)
SQLBI (Website) - More in depth articles going into the intricacies of DAX, as well as examples for how to solve both common and complex issues (same guys who wrote the DAX book)
There's actually a set of 3 books, but the most relevant one imo is The Data Warehouse Toolkit
The Data Warehouse Toolkit - Kimball & Ross (https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802)
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.com/dp/1118530802/ref=cm_sw_r_cp_api_glt_fabc_XW8XAKS43J8PQV0604ZB
Data Architect here, this is what I do for a living - so perhaps I can answer this very large and broad topic in a semi-succinct manner.
> as they're absolutely nonsense with cultists proposing they exist but then giving no fundamental direction on how to build one.
This is objectively false.
> but then giving no fundamental direction on how to build one.
This is for two reasons. 1) It's really dependent on the size of the company, the data, and the data literacy of the company at the time.
On Data Literacy
You need to determine at what stage a company is at with their data.
All of these questions will influence how the Data Warehouse is built, and the surrounding Data Architecture that will feed it.
On Ralph Kimball
There are two major methodologies that are generally referred to for designing a traditional data warehouse - the Kimball method and the Inmon method.
The Kimball methodology is laid out in the canonical book The Data Warehouse Toolkit.
The Inmon methodology is laid out in the book Building the Data Warehouse
Both of these books will get someone in the headspace of what designing a data warehouse entails, however they're older now and can be more comparable to K&R's C Programming book then on a way to implement modern data warehouses.
On Data Warehouse Technology
To simply state that a Data Warehouse is a database is false. It's an entire underlying infrastructure for handling, processing and managing large analytical workloads.
Examples of Data Warehouses include: SQL Server Analysis Services, Amazon Redshift, Azure Synapse, Google BigQuery, Oracle EDW, etc.
Data Warehouses generally store their data in-memory, using column store indexes that index the "tables" based at the column-level instead your normal RDBMS that indexes at row-level. This makes performing calculations and aggregations on them super fast. Modern data warehouses also employee MPP (Multi Parallel Processing) technologies for even further enhancement of performance for analysis workloads, by distributing the compute.
The schema you use to model your data for Data Warehouses will vary based on the data and what you plan to do with it. Just know this is completely different from how you'd model data for a transactional database/RDBMS. The two major schemas used, and their sub-variations, are Snowflake and Star. Each have their advantages and disadvantages.
In Summary
I wanted to provide just a bit of technical depth, some stuff to research on your own, and a general explanation. This is seriously only the start of this topic. I could talk on it for hours. Feel free to ask any specific questions and I can try and provide at least a direction for you to look.
> I need help understanding how a DW is conceived.
Everyone will most likely reference Kimball's book Data Warehouse Toolkit, or Inmon's Building the Data Warehouse. These are both theoretical in nature, are good reads, but are older and everything won't be directly applicable.
There are, unfortunately, not a lot of practical examples of implementation that I've found apart from really simplistic designs out there. One exception may be Building a Data Warehouse with Examples in SQL Server - which offers a full-scope view of an analytics data architecture.
So why is this? Well, it's kind of like how companies use various different processes and models for SCRUM/Agile/Waterfall/Kanban/etc. - very rarely does the theoretical implementation actually fit a real-world company, and each are unique to the company, industry, size, variety of data, goals, etc.
> Does it make sense to start with a cloud solution like Snowflake for a brand new company?
This depends. What are their goals? What are they trying to achieve? How much data do they currently have? If it's a smaller company that's expecting smaller growth patterns over the first few years, then it may be more fiscally conservative to just implement their "data warehouse" within an RDBMS that's much cheaper and then optimize/migrate when the needs arise.
If this is a tech startup that has money to sling at whatever they want and is more concerned about rapid innovation and their next round of funding - instead of keep their books in the black - then Snowflake may be a perfectly viable option.
Books (databricks were giving the 2nd one away for free for a bit, worth checking there first)
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.co.uk/dp/1118530802/ref=cm_sw_r_apan_glt_fabc_K7B3F8XG8002K8RWXMYD
Spark – The Definitive Guide: Big data processing made simple https://www.amazon.co.uk/dp/1491912219/ref=cm_sw_r_apan_glt_fabc_6AHPCXB5T8625VHQ53EV
Udemy (Black Friday deals are good)
https://www.udemy.com/course/taming-big-data-with-apache-spark-hands-on/
https://www.udemy.com/course/master-datawarehouse-concepts-step-by-step-from-scratch/
https://www.reddit.com/r/dataengineering/
We don't bite.
There are a few questions that you need to answer before settling on a solution:
From my perspective, I would first ask where the data is coming from, and whether there might be a better way of retrieving it than the existing system.
Essentially, you're dealing with technical debt right now (I like the analogy of dirty dishes, you don't HAVE to clean them up right after cooking but it sure is easier... you're dealing with crusty, moldy, week old dishes here by the sound of it... it's gross and a lot of work).
In some cases, it makes more sense to rebuild from scratch rather than be beholden to an archaic system that seemed good at the time 30 years ago.
It sounds like what you need is a pipeline, there are two main options ETL and ELT.
ETL (Extract Transform Load) is the more traditional method of grabbing the data, processing and parsing it into a usable form and THEN pushing it into a database.
ELT (Extract Load Transform) is a more recent innovation (not universally better, mind) where you grab the data, throw it in a pile (i.e. data lake) and decide how to process it AFTER wrangling everything in one place.
If you are capable of gathering requirements and saying "The business really only needs XYZ data and we'll make a data warehouse to make the data accessible" then ETL is probably the way to go.
If you have no idea what of that data will be most useful, I would first dump it into a standardized format and then start building out warehouses for specific use cases.
This is somewhat overly general advice, you might want to grab a copy of:
To help you learn the ropes (highly recommended book in DE).
I can probably give some more specific advice if you answer points 1,2 and 3 above.
The holy scripture - https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802/
OLAP is about analytical workloads. Usually aggregating large amounts of data, filtered on a few data points.
To run these workloads efficiently and make the queries easier to build and read, you organize your data into a dimensional model. E.g. a star/snowflake schema. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling is pretty much the Bible on this topic.
Yes, you'd often use SQL to interact with your analytical database, same as any OLTP system. Much the same as we can use English both to write a poem and the user manual for a toaster.
This is the bible for that subject:
https://www.amazon.com/dp/1118530802/ref=cm_sw_em_r_mt_dp_arjKFbY5PMZVD
>I would like to hear your experiences, techniques and how/where the end users can use the dimensional models (how can they connect to them, do you use SSAS or are there other techniques available?)
As for experience and techniques, I have learned them on the job plus a few classes in our corporate academy. All the theory I know comes from The Bible, which I never read cover to cover TBH. I am aware this won't be a satisfactory answer but finding good material for new hires is something I genuinely struggle with all the time (maybe there's an opportunity there to be taken!).
How do users connect to data models? You have two main approaches: reporting tool connects straight to the data mart (aka SQL database) or connect to an in intermediary in-memory model. The choice comes down to performance requirements and budget. Different vendors have different products that match nicely, but more often than not they also work with 3rd party software. Here's a quick list:
​
Vendor | Backend | Frontend |
---|---|---|
Microsoft | SQL Server (cloud or on-prem), SSAS (mostly tabular these days) | Power BI, Excel |
Oracle | Oracle DB (again cloud or on prem) | OBIEE aka Oracle Analytics Cloud |
AWS | Redshift | Quicksight |
BigQuery | Looker |
The table above is by no means exhaustive. The point I am trying to make is all backends share one common factor: dimensional modelling. Most BI tools work nicely with them and you should really focus on being able to model your data regardless of the specific vendor. That's IMHO your biggest value as a data engineer.
​
>What advise can you give me to start modeling? Maybe you have some project or resources that I can work on/try out?
Have you searched university websites? This is the first result I found, but I bet there's more.
​
>I need someone that can show me the ropes by working together on a project. Please let me know if that's possible!
This guy started a new project recently, maybe you are interested. I can't vouch for it, but he seems a nice guy, far from the scammers you usually find online. Also you could search on fiverr for mentors
I recently got promoted to Data Architect from a Lead Data Engineer role. Definitely more soft skills involved, such as persuasion, presentation, and workflow/process mapping versus writing data pipeline code.
I’ll definitely second learning business processes. Luckily we just went through a process of mapping all the activities in our key business processes, like quote to cash, hire to retire, incident to resolution, etc. It really helps when stepping into data warehousing and dimensional modeling. Having context on what the business needs goes a long way to delivering useful results.
I also recommend reading the Data Warehouse Toolkit and Agile Data Warehouse Design.
Read this book:
http://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802
Seriously.