I’m one of the domain experts where I work, so I usually don’t need to interrogate too hard, but I am reading this excellent book on gathering requirements for BI projects.
As a BA I'd expect them to be responsible for requirements gathering. To that end, I'd recommend this:
https://www.amazon.co.uk/dp/0956817203/ref=cm_sw_r_cp_apa_fabc_31q3Fb1TT3YQZ
I'd second the idea of reading Kimball too. Maybe not the whole thing from a technique perspective, but certainly the first 2 chapters which are more high level.
Honestly though, I sometimes feel that data nerds are wired differently and you either get it, or you don't. I've worked with IT professionals and programmers who I'd consider far more gifted and qualified than I am, but data concepts really confuse them, whereas I've always picked them up without much effort.
There is also Agile Data Warehouse Design which is written by ex-Kimball staff applying an Agile approach.
I recommend Kimball first. Corr’s book is very good at teaching you to ask questions of subject-matter experts, so it's well suited to consultants or people new to an industry or employer.
I would recommend this book if you want to understand the creation process better https://smile.amazon.co.uk/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203/ref=sr_1_1?crid=3BXBGVUZN8KZF&dchild=1&keywords=agile+data+warehouse+design&qid=1596777778&sprefix=agile+data+wa%2Caps%2C142&sr=8-1
At a very high level, you would have a set of data sources; the degree to which they can be considered structured determines the ETL processes you need to carry out.
You would need to identify the "fact", i.e. the entity you want to measure, which would be a table containing a set of attributes:
sales_fact(sales_fact_id, sum_sales, gross_profit, total_cost, net_profit)
The sales fact would be affected by a set of dimensions; these represent a point of view on the fact you want to measure:
sales_area(country_id, country, region, city); sales_representative(representative_id, name, country); promotions(promotion_id, promotion_name, ... date)
As you can see, these are denormalized: country is present in both sales_area and sales_representative.
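To make the shorthand concrete, here's what those tables might look like as actual DDL. This is only a sketch under my own assumptions: the column types, surrogate keys, and foreign keys are mine, since the shorthand leaves them implicit (promotions omitted for brevity).

```sql
-- Hypothetical DDL for the shorthand above; types and keys are assumptions.
CREATE TABLE sales_area (
    sales_area_id INT PRIMARY KEY,
    country_id    INT,
    country       VARCHAR(100),
    region        VARCHAR(100),
    city          VARCHAR(100)
);

CREATE TABLE sales_representative (
    representative_id INT PRIMARY KEY,
    name              VARCHAR(100),
    country           VARCHAR(100)  -- deliberately repeated from sales_area
);

CREATE TABLE sales_fact (
    sales_fact_id     INT PRIMARY KEY,
    sales_area_id     INT REFERENCES sales_area (sales_area_id),
    representative_id INT REFERENCES sales_representative (representative_id),
    sum_sales         DECIMAL(12,2),
    gross_profit      DECIMAL(12,2),
    total_cost        DECIMAL(12,2),
    net_profit        DECIMAL(12,2)
);
```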
The stages would be:
Connect to the data source(s), perform the needed transformations, and load the results into the destination tables. If your data needs a lot of cleaning I would use Talend, Pentaho or similar.
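As a sketch of the transform-and-load step in plain SQL, assuming the extract phase has already landed rows in a staging table (every table and column name here is hypothetical):

```sql
-- Hypothetical load step: transform staged rows and insert into the fact table.
INSERT INTO sales_fact (sales_fact_id, sales_area_id, representative_id,
                        sum_sales, gross_profit, total_cost, net_profit)
SELECT s.sale_id,
       a.sales_area_id,
       r.representative_id,
       s.amount,
       s.amount - s.cost,              -- gross profit derived during transform
       s.cost,
       s.amount - s.cost - s.overhead  -- net profit after overhead
FROM stg_sales s
JOIN sales_area a ON a.city = s.city
JOIN sales_representative r ON r.representative_id = s.rep_id;
```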
For one course I did, we used the AirBnB Berlin dataset: we had to explore the data, identify the relevant dimensions, create the logical and physical designs for the tables, then design the ETL processes using Pentaho.
Finally we had to create cubes in Visual Studio to perform OLAP analysis and use Power BI to create visualizations (in my case I also connected the database to R to perform further data exploration).
I hope this helps
Thanks for the reply!
Thanks for all the sweet links! It doesn't look like the aggregation Designer link works but I'll poke around to find it.
Does the workbench build a schema for you? I'm looking through the docs atm
Thanks again!
Generally, if you want to analyze data, you want to put it into dimensional form, which is highly denormalized.
Ideally, this would end up on its own server machine - your data warehouse - which you would then query with Power BI or what have you.
But you can also do the same in your regular database server, using separate tables or materialized views. Only do this if your regular database server is not heavily used.
You don't see much regression analysis in-db, but it's certainly an option. For analyzing data, you want to focus on analytic SQL: GROUP BY, aggregate functions, HAVING, window functions. You also want to learn how to load Slowly Changing Dimensions Type 2, which can be tricky.
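For a taste of that analytic SQL, here's a minimal sketch against a hypothetical sales table (all names are mine): monthly revenue per region, with a window function ranking regions within each month.

```sql
-- Monthly revenue per region, plus each region's rank within the month.
SELECT region,
       DATE_TRUNC('month', sold_at) AS month,
       SUM(amount)                  AS revenue,
       RANK() OVER (PARTITION BY DATE_TRUNC('month', sold_at)
                    ORDER BY SUM(amount) DESC) AS region_rank
FROM sales
GROUP BY region, DATE_TRUNC('month', sold_at)
HAVING SUM(amount) > 0;  -- drop empty region/month combinations
```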
Agile Data Warehouse Design is a good book to start with
https://www.amazon.ca/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203 is a good book regarding collaborative dimensional schema design
This is a good book. Shorter than the Kimball book, anyway:
https://www.amazon.ca/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
This is a good answer on the DBA stack exchange:
https://dba.stackexchange.com/questions/45655/what-are-measures-and-dimensions-in-cubes/45669#45669
I'm also a beginner to data engineering (currently a senior software engineer), but here's what I've learned so far -- there are 2 layers in your data stack where you can add SCD; each solves different problems and requires a different implementation.
For all of the below, I will discuss Type 2 SCD, which (like others in this thread have said) seems like the most standard way to add "history" to your data.
1) SCD for your data model (i.e. dimension tables). This helps answer questions like "our revenue per user is X, how does that compare with last month? last year?" Let's say you run your data models in a daily batch; then you'll append 1 new row per day and update the `end_at` timestamp on the previous row (a sketch of this pattern follows below).
I usually argue "don't overengineer something; only build it when you need it", but in this area most people can predict that these historical questions will come, so it's worth adding now (because it's very difficult to add later). Lawrence Corr's Agile Data Warehouse Design book pretty much says the same thing.
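A minimal sketch of that daily close-and-append pattern, assuming a dim_users dimension with start_at/end_at columns and a staging snapshot stg_users (all names, and the zipcode attribute, are hypothetical):

```sql
-- Close out the current row for users whose tracked attributes changed today.
UPDATE dim_users d
SET    end_at = CURRENT_DATE
FROM   stg_users s
WHERE  d.user_id = s.user_id
  AND  d.end_at IS NULL
  AND  d.zipcode <> s.zipcode;   -- compare whichever attributes you track

-- Append a fresh "current" row for users with no open row left
-- (both newly-changed users and brand-new users).
INSERT INTO dim_users (user_id, zipcode, start_at, end_at)
SELECT s.user_id, s.zipcode, CURRENT_DATE, NULL
FROM   stg_users s
LEFT JOIN dim_users d
       ON d.user_id = s.user_id AND d.end_at IS NULL
WHERE  d.user_id IS NULL;
```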
2) SCD for your source/raw models (i.e. the data coming from your transactional database). This helps you answer questions like "when X event happened, what was the data in Y tables?"
A better example: "I wonder how our purchases break down by zipcode." If you have Type 1 SCD data from your MySQL/Postgres database, all you can answer is "what is each user's zipcode right now?" If the purchase was made 2 months ago, you have no idea what zipcode that user lived in 2 months ago.
This is often harder to implement and is typically done by an "ingestion vendor". Fivetran has "History mode" and Airbyte has "Incremental Sync - Deduped History". There are slight differences in how they implement them, but they are very similar. This is often implemented with "Change Data Capture" (CDC), where each change to a database row means a new row in your SCD table. For example: if a user changes their "users.first_name" 5 times in one minute, you would get 5 more rows in the "users" SCD table. A point-in-time query against such a table is sketched below.
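Here's what that point-in-time query might look like, assuming a users_history table with start_at/end_at validity columns maintained by the ingestion vendor (all names hypothetical):

```sql
-- For each purchase, pick the user row that was valid when the purchase happened.
SELECT u.zipcode,
       COUNT(*) AS purchases
FROM purchases p
JOIN users_history u
  ON u.user_id = p.user_id
 AND p.purchased_at >= u.start_at
 AND (p.purchased_at < u.end_at OR u.end_at IS NULL)  -- NULL = current row
GROUP BY u.zipcode;
```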
Note: DataCoral adds columns which are very similar to Type 2 SCD, but is missing the "end_at" column which tells you when one row was superseded by another. I haven't tried to use these columns as a Type 2 SCD (to do "point in time" queries, PIT), but I *think* you should be able to replicate the "end_at" PIT logic with a window function.
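That window-function approach would look roughly like this, assuming the vendor gives you one row per change with a change timestamp (all table and column names hypothetical):

```sql
-- Derive a synthetic end_at: each row ends where the next version of the
-- same user begins (NULL for the current row).
SELECT user_id,
       zipcode,
       changed_at AS start_at,
       LEAD(changed_at) OVER (PARTITION BY user_id
                              ORDER BY changed_at) AS end_at
FROM users_changes;
```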
Orthogonal: when talking about "history", it is also possible your data warehouse has "time travel" support. Snowflake allows you to pass a specific timestamp to a query, and the query will return results as the data looked at that time. While this is interesting, I label it as orthogonal because it solves a different set of problems. On the standard plan, time travel only goes back 1 day, and on the enterprise plan it goes back 90 days. It's very likely any "historical query" will require you to look back further than 90 days, so time travel isn't the right solution for that problem.
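For reference, a Snowflake time-travel query looks roughly like this (the table name is made up; AT(TIMESTAMP => ...) is the Snowflake clause as I understand it):

```sql
-- Query the table as it looked at a specific point in time.
SELECT *
FROM orders
AT(TIMESTAMP => '2022-06-01 00:00:00'::timestamp_tz);
```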
Hope this helps!
I had good experience with this book:
https://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
It also introduces some Kimball ideas and basic star schema and snowflake schema concepts.
Is this the one you mean? https://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
OLAP is all about analyzing data.
OLAP is a set of operations you can do against a set of data, such as slicing, dicing, pivoting.
You don't need a database to do OLAP; for example, you can do OLAP operations against a flat file using MS Excel PivotTables.
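If it helps to see those operations outside of Excel, here's a sketch in SQL against a hypothetical sales table (all names are mine): a slice is just a filter on one dimension, and dicing/rolling up is a GROUP BY.

```sql
-- "Slice": fix one dimension (year = 2020), then "dice" across the rest.
SELECT region,
       product,
       SUM(amount) AS total_sales
FROM sales
WHERE year = 2020
GROUP BY ROLLUP (region, product);  -- subtotals per region, plus a grand total
```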
Some stackexchange answers here:
http://dba.stackexchange.com/questions/45655/what-are-measures-and-dimensions-in-cubes
http://stackoverflow.com/questions/18916682/data-warehouse-vs-olap-cube
This is a good book: http://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203