Not strictly DE, but https://www.amazon.com/Schema-Complete-Reference-Christopher-Adamson/dp/0071744320/ref=nodl_ . I found the idea of dimensional modeling/star schema to be something people would talk about in the abstract, but never anything real. This book helped me out a lot. Perhaps not as useful as it was 10 years ago, but there are still some gems in there.
Take this with a grain of salt, I'm not a JDE but currently training to be one.
You're likely feeling imposter syndrome. You're downplaying your own ability and throwing your confidence off. Companies hiring JDEs will know you're far from perfect and don't expect you to excel at everything straight away. They will help train you and familiarise you with the relevant software / languages. I'm guessing you have a team / someone more senior to shadow?
Ask questions. Understanding why something works is a big part of growing your knowledge.
Practise. I'm not sure what languages you are using; however, it may be a good idea to practise during your downtime at home. You can find some amazing free resources on sites such as Udacity. The more you practise, the more fluent you will be with a language.
Finally, it is normal to feel like this; pretty much everyone I know with a job in the tech sector felt useless when they first started. Data engineering has so many different software tools and applications that it will feel daunting in the beginning; however, it is all about finding your stride.
Let me know if you have any more questions!
There should be 2 ways of handling this, depending on your app.
Either you handle writing to the Gsheet using the Sheets API. It can get complicated here because you'll have to handle which spreadsheet to write to, depending on the user and your FE workflow... can be tedious!
Or you use - what seems to be the go-to solution for this kind of case - Google Apps Script and make a macro that the user can run from the Gsheet to extract the data. If you need more flexibility you can even make an "Add-On" and build a form so that the user can extract what they need from the database.
I would also recommend looking for existing vendors that built this kind of "connector" for Gsheet, like https://workspace.google.com/marketplace/app/api_connector/95804724197
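If you go the Sheets API route, a rough sketch of the write side might look like this (Python, assuming the google-api-python-client library and a service account; the credentials file, spreadsheet ID and range are placeholders):

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Placeholder credentials file and spreadsheet ID
    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",
        scopes=["https://www.googleapis.com/auth/spreadsheets"],
    )
    sheets = build("sheets", "v4", credentials=creds)

    rows = [["order_id", "amount"], ["1001", "42.50"]]
    # Append rows to the chosen sheet; picking the right spreadsheet per user is the tedious part
    sheets.spreadsheets().values().append(
        spreadsheetId="YOUR_SPREADSHEET_ID",
        range="Sheet1!A1",
        valueInputOption="RAW",
        body={"values": rows},
    ).execute()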
Hey - good for you to take the plunge.
Kinda' a different situation but I went from DA to DS on my own, saw this post this morning and thought my two cents fit here:
I used Udemy.com for some broad topics that interested me, and their professors were extremely friendly and supportive. Just make sure to professionally format all correspondence, because the professors experience (sorry if being toxic here) a lot of 'why doesn't import pandas as pd' kind-of whiny questions.
The second avenue - and I did not do this - is informational interviews via LinkedIn. Reach out to Big Wigs in the industry, explain your situation, ask to meet with them to learn more about their job/day-to-day. Do not straight up just ask for a mentor; it's selfish and impolite imo. If there's an organic connection then see if you can move the relationship forward with correspondence via slack/email/whatever.
It's going to be a grind, you're going to get a lot of people ignoring you but it's all about finding that one connection so keep at it.
"to learn you need to take chances, make mistakes, and get messy"
Openstreetmap!! Think of it as the Wikipedia for maps. There are edits globally every second. You could also try using publicly available satellite imagery such as Landsat or Sentinel.
Nobody usually recommends this book, but it's an awesome book. All the SQL (and the process) in this book should eventually be familiar to you:
https://www.amazon.co.uk/gp/product/B08WGSM9CJ/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1
Learn about Lambda Architecture. What DB or technology to use in which circumstances. A good book by Nathan Marz:
https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343/ref=nodl_
Hi u/ryanblumenow, no. The project simulates building a data pipeline given an already existing data model.
Enterprise data arch involves a lot of data modeling, consolidating with multiple teams, planning, etc. The book https://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247 goes over this in detail. Hope this helps.
I’ve recently been reading this and it’s great!
Data Pipelines Pocket Reference: Moving and Processing Data for Analytics https://www.amazon.com/dp/1492087831/ref=cm_sw_r_cp_api_glt_fabc_AR2KW90AHAH9DG9NV8VS?_encoding=UTF8&psc=1
I read this book Data Pipelines Pocket Reference recently. It's a pretty good book, though it doesn't cover Spark-based ETL stuff.
A modern Postgres (version > 11) can do anything MongoDB or ElasticSearch can; IF you use the right extensions and spend some time tuning for your workload.
MongoDB is great if you just want to start doing stuff and don't want to be constrained by schemas. But please for the love of everything that's good don't expose it to the internet directly...
ElasticSearch is great if you need to index a bunch of documents but if you find yourself wanting to use half the search operators on most of your queries ask yourself if your workload would be better served by an SQL database like PostgreSQL. And ES is not the right choice if you want joins or window functions.
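To make that concrete, here's a rough sketch of doing both "document storage" and keyword search in plain Postgres (assuming a local instance and the psycopg2 driver; the DSN, table and field names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
    cur = conn.cursor()
    # MongoDB-style schemaless storage: a JSONB column
    cur.execute("create table if not exists docs (id serial primary key, body jsonb)")
    cur.execute(
        "insert into docs (body) values (%s::jsonb)",
        ['{"title": "postgres full text search", "tags": ["db", "fts"]}'],
    )
    # ElasticSearch-style keyword search: tsvector/tsquery (add a GIN index to make it fast)
    cur.execute(
        """
        select id, body->>'title'
        from docs
        where to_tsvector('english', body->>'title') @@ plainto_tsquery('english', 'full text')
        """
    )
    print(cur.fetchall())
    conn.commit()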
Since you're on Google, BigQuery is your best bet. What I would do is:
Another option if you have 100% control over the API that writes to Firestore. What I like to do is send Cloud Logging logs with 'event'-type records in JSON that represent e.g. 'record-changed', with the data that was changed. Your code becomes something like (Python pseudo-code, I don't remember the Firestore API by heart):
firestore.write(data)
event = Event.from_data(data)
special_logger.info(event.json())
Then, you can set up a log sink to BigQuery that will stream these records directly to BigQuery. With this architecture, you get a pretty much zero-cost (streaming inserts and cloud logs) streaming pipeline. Latency is usually on the order of ~30-60 seconds on my side.
If your DBT models are views, you get SQL transformations directly on top of that streaming data and can have real-time Data Studio report for example.
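For the logging half, a minimal sketch with the google-cloud-logging client might look like this (the logger name and payload are made up; the log sink to BigQuery is configured separately in GCP):

    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()
    logger = client.logger("record-events")  # hypothetical log name, matched by the sink filter

    # Structured payload lands in BigQuery as jsonPayload once the sink is in place
    logger.log_struct({
        "event_type": "record-changed",
        "record_id": "abc123",
        "changed_fields": {"status": "shipped"},
    })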
I like the Udacity DE Nanodegree.
It will help you get some ideas on how to structure a project. They have a Data Architecture Nanodegree too, but I think it's only worth it after doing the DE one first.
They have a list of requirements over at https://www.udacity.com/course/data-engineer-nanodegree--nd027
So basically you need intermediate skills in Python and SQL beforehand. And that's also needed for a DE job, especially SQL. You might get away with basic Python skills at some companies, but I'd highly recommend getting some decent Python skills. I'd recommend a basic Python course and one for data analysis. Of course, you could learn it on your own, as you prefer.
Not sure what skill level you are aiming for, but I found the data architecture nanodegree on Udacity to be quite useful: https://www.udacity.com/course/data-architect-nanodegree--nd038. Even though the course recommends a 4 month training window, I am pretty sure it can be completed in 2-3 months. While the course is not extensive, it does help explain the fundamentals of OLTP/OLAP systems, data modeling (star and snowflake schema) etc.
Here's the list of projects:
- Data Modeling with Postgres and Apache Cassandra
- Data Infrastructure on the Cloud (AWS)
- Big Data with Spark
- Data Pipelines with Airflow
and the last project is a capstone, in which you combine all that you have learned during the course. You will gather data from different data sources and perform ETL to create a clean database for analysis.
You can find more details in the link below:
https://www.udacity.com/course/data-engineer-nanodegree--nd027
Look into Udacity’s Data Engineering Nano degree https://www.udacity.com/course/data-engineer-nanodegree--nd027. However, if you are looking to switch careers, you should play to your strengths; namely sql, data modeling, pushing and pulling data.
With your experience I believe you only need structure in your knowledge. I totally recommend you take the Udacity Data Engineering Nanodegree program, which will let you advance on your path to becoming a Data Engineer. I am doing it right now and it is Fu$&@ing awesome.
I just started taking this one:
https://www.udacity.com/course/data-engineer-nanodegree--nd027
It costs a lot, but because there were basically no other resources that seemed more comprehensive than this, I went for it. I've heard good things from current students also.
https://www.influxdata.com/products/influxdb-cloud/
There's a free tier which should work for a POC. No manual shard balancing, and the Cloud offering is elastic - I think they have an open source version as well.
The official kubernetes docs are great.
I recommend using kubeadm to bootstrap your own cluster (docs here). From there you will have an environment to use for learning.
In terms of use cases, Kubernetes is a Docker container orchestrator, so I also recommend getting familiar with Docker, as it will be how you actually run things on Kubernetes. It manages your compute for you, allowing you to simply define an application declaratively and then let Kubernetes take care of restarting it if it crashes, storing the logs for you to access, etc. You can use it to host your databases and also run batch compute jobs which can communicate with the databases internally. The best part of it, though, is that it scales to thousands of machines, and thus any time spent on your homelab will be directly applicable to a production environment.
If you are interested in data warehousing and cloud, I think kubernetes is the most important framework/technology you can learn. I'm a bit biased since I have drank the kool-aid though.
If you have any specific questions for setting up a homelab I can send you some of my config files / scripts.
As someone who has also made the DS → DE transition, your focus will (broadly speaking) shift from "how can I use this data?" to "how can I help others use this data?"
To help answer that question, I got the most bang for my buck studying dimensional modeling. Classics include The Data Warehouse Toolkit and Agile Data Warehouse Design.
Which specific tools/libraries you should learn depends on your new role. Given that you're entering an AWS shop, I assume DMS and Glue are on the table.
This is a very practical book that outlines all of the components of a successful business intelligence strategy: Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Edit: https://www.amazon.com/dp/0201784203/ref=cm_sw_r_cp_apa_glt_fabc_GSNMV3STKH0FZ5SMR3MK?_encoding=UTF8&psc=1
It depends on the data warehouse architecture you will be implementing.
I recommend these two books, which also serve as reference manuals for DEs.
Building a Scalable Data Warehouse with Data Vault 2.0
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling
If you want to beef up your infrastructure knowledge specifically around maintaining servers, this is the book my systems administration class used in college. Check to make sure it's been updated to cover infrastructure as a service if you feel inclined, but it has really good coverage on what RAID is, how and when to do backups, etc. It also provides suggestions on general IT infrastructure stuff you might not be aware of.
It should have been embedded in my text, but in case I failed at mobile, this is the generation I had. It looks like there's a newer one as well. https://www.amazon.com/dp/0321492668/ref=cm_sw_r_cp_apa_i_u2SNCb0DNJY9B
I don't know of any common practice for this scenario, and it may depend on how you are deploying things. One way you might do this is by having a main repo for your dags and adding the others as submodules in that repo. That way you can include all of those in a single CI/CD process to deploy the dags.
I have not tried this, but in theory it seems sound.
Hacktoberfest is about to start tomorrow. I'd say join up for that, and also I believe Udacity has a free course. Might be a little stale now, but it should still give you the gist. :-) https://www.udacity.com/course/version-control-with-git--ud123
But if you don't fully buy my completely honest interpretation of the 100% truth, I didn't hate Udacity's Data Engineering NanoDegree. It's not amazing and it deserves a lot of the hate it gets on here. But it's not bad either and I did learn a few things.
Check out https://www.udacity.com/course/big-data-analytics-in-healthcare--ud758. They have a sunlabs bootcamp tutorial on using Spark to wrangle healthcare data.
From there, you could then check out datasets from Kaggle. I would suggest picking a dataset that is relevant to the industry you want to be working in, e.g. stocks data for finance, healthcare data for healthcare.
Note that Spark is written in Scala, and there's PySpark, which is a Python API to Spark. So pick wisely.
Mentoring is great fun, you can't easily fail at that since it's always playing to your strengths. If it's something way out of your comfort zone, start and lead a reading group of e.g. DDIA and you can learn and teach at the same time. Beyond that, asking here for experience with specific systems will usually net you some good advice of what pitfalls there are and what tools would pair beautifully with your needs.
As a fellow europoor, I'm also in need of a salary correction.
Tbh the only resource I know is this book on Amazon
Not a Udemy course, but I've heard good things about this book: Data Engineering with AWS. There is a GitHub repo where you can follow along with the book too.
This is very cool actually; not so long ago I found this book:
It covers in detail sustainable, decoupled "microservice" architecture. If you find it interesting, it's worth having a look.
Maxime Beauchemin, the creator of Apache Airflow, explains this way more eloquently than I ever could:
>There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.
My company used Informatica for over 10 years. My first role was "ETL Developer", for which I was hired for Informatica skills first and foremost. That said, I saw its relevance wane as we migrated from on-prem, RDBMS-like row-based databases to columnar, MPP data warehouses, and then the migration to cloud became the nail in the coffin.
What Maxime says above rings so true - in Informatica's (PowerCenter) GUI, if you want to have the same workflow run 10 times, each with slight variations (perhaps delivering the same data to 10 different destinations), you end up having to Copy/Paste the workflow in the GUI and then click through a series of dialog boxes before you can change the one parameter that is different about those 10 workflows. Later, when someone else comes to look at those workflows, they have no idea that they are exactly the same except for that one parameter. In Airflow, using Python, you can simply write a 'for' loop, iterating through each of the parameters, and creating a task for each. 5 lines of code can replace thousands of copy/pasted GUI components.
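For illustration, a loop like that in Airflow 2.x might look roughly like this (the DAG name, destinations and the callable are made up):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def deliver(destination, **context):
        print(f"delivering the extract to {destination}")

    with DAG("deliver_extracts", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        # One task per destination - the only thing that varies is the parameter
        for destination in ["team_a", "team_b", "team_c"]:
            PythonOperator(
                task_id=f"deliver_to_{destination}",
                python_callable=deliver,
                op_kwargs={"destination": destination},
            )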
I'd highly recommend the entire post, The Rise of the Data Engineer by Maxime, for a manifesto that still mostly holds true, even 4 years after it was written.
This book is super helpful
Just a little more info about the author: he is a Microsoft SQL Server MVP. I read his two other books when I was first learning SQL. T-SQL Fundamentals is great for someone who has never written SQL before.
However, THIS book, T-SQL Querying, brought my understanding of SQL Server, and of how RDBMSs work in general, up to a very advanced level very quickly. If you are at all interested in optimizing queries, understanding query plans and effectively using indexes, I can't recommend a better book. It gets into very advanced topics like creating CLR stored procedures with C#, using hash indexes, system-versioned (temporal) tables and way, way more.
This new book on Window Functions is the 2019 updated version of the original 2012 book, which I never read. I am ordering it because window functions have so many uses in DE and DA/DS. The book is specifically for T-SQL and Microsoft SQL Server (including Azure), but most RDBMS implementations support window functions - I believe they're part of the ANSI SQL standard as of 2008. The optimization parts are specific to SQL Server, but the use cases and functionality of these window functions are platform neutral.
They've got a whole book on Window Functions now. I read the author's other two books and always saw this book, but it was an older one based on SQL Server 2012. This is the 2nd edition, published in 2019, so I think I'm gonna buy it to add to my collection:
If you are working with SQL Server, then this book. That's basically the entire topic of the book (optimizing queries, which operators to use for best performance, understanding query plans and which types of indexes are best in certain situations).
As for anything else like Oracle or MySQL I don't know as they are entirely different engines/SQL language extensions.
If you're a professional in the field, you should have been aware of CI/CD for a long while.
CI/CD has been considered a pretty basic feature of software dev pipelines for 10+ years.
The Fowler book on it was published in 2010, but it existed way before that already. https://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912/ref=asc_df_0321601912/
Hands on experience with GCP is definitely going to make the exam a lot easier, but I wouldn't say it's required if you know how to study well. I studied for about 2 months straight, about an hour or two each day. I mainly used this book as it includes pretty much everything you need to know as well as lots of practice questions. There's also this youtube playlist that goes in depth into the logic behind solving exam questions, and I highly recommend it
Just to say I use https://joplinapp.org/ similar to Obsidian. I keep each technology on a different note. Since it's md I can add syntax, screenshots, links of the websites I learnt something from. Just keep track of my learning like we used to do with notebooks back in school.
The biggest plus for me is that I stay on track, it doesn't take ages to get my head around something all over again because I record my progress.
USA, I'd say do the Designing data intensive applications but that's the long route.... try this book: https://www.amazon.com/System-Design-Interview-insiders-Second/dp/B08CMF2CQF/ref=sr_1_1_sspa?keywords=system+design&qid=1660526006&sprefix=system+des%2Caps%2C128&sr=8-1-spons&psc=1
A data lakehouse is the warehouse 2.0. Separate your data into layers and different end users interact with different layers.
https://www.amazon.com/Building-Data-Lakehouse-Bill-Inmon/dp/1634629663
While this isn't a structured path, I enjoyed the new Fundamentals of Data Engineering book by two consultants in the space who know what they're talking about.
The book is exceptionally good at synthesizing how DE concepts fit together into a coherent system, without getting into specific tools (that might change over time).
Too Big to Ignore, Simon. It's important to think of the business uses for technology -- how do they add value to the business, or towards customers? This business-focused book does that, in pretty plain terms.
I think so, yes. They presume a certain level of knowledge around programming in general, and around SQL, etc. but the approach is very fundamental. They focus on the DE lifecycle as a whole, and not on any individual product like Airflow or whatever. Of course that tends to be the case with all O'Reilly books that I have read.
Depending on your cloud choice, https://www.amazon.com/dp/1492079391?psc=1&ref=ppx_yo2ov_dt_b_product_details this is also an excellent book. It focuses on AWS, and goes farther into ML and AI than you probably want to as just a DE, but to get to the point of ML and AI you have to learn all the DE stuff too, so it covers that excellently.
Hey! Read this book: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - https://www.amazon.com/-/es/Martin-Kleppmann/dp/1449373321
This is THE BOOK for data engineering. It will help you to understand more about what's going on under the hood and a lot more. You can then ask questions about the system of your company and maybe even identify opportunities for improvement.
Nice... I work for one, but it's really good. I will def use it for future interviews.
> Designing Data Intensive Applications
I've looked at the title and checked the book description - it has nothing to do with DE. "This book is for software engineers, software architects, and technical managers who love to code."
Jesse Anderson has written a book about data teams. You might wanna check that out to get a more holistic view on data teams and where data engineering fits.
I remember this feeling! Here’s a great book if you want to feel even more seen:
Free version: https://github.com/ms2ag16/Books/blob/master/Designing%20Data-Intensive%20Applications%20-%20Martin%20Kleppmann.pdf
Not free version: https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=nodl_
Not sure about a cheat sheet; understanding which tool to use when requires some understanding of distributed systems and their limitations, such as the CAP theorem. This book goes deep on how databases work, getting into the nitty gritty on things like b-trees and index implementation and eventually zooming out to distributed databases. It's a grind but it's an amazingly thorough walk-through (at least for someone like me who only had working knowledge of databases prior):
Database Internals: A Deep Dive into How Distributed Data Systems Work https://www.amazon.com/dp/1492040347/ref=cm_sw_r_awdo_BV61XFFBK9HS97W061HG
The canonical Designing Data Intensive Applications by Martin Kleppman is a bit easier to get through and gives a really great base understanding to work from with regards to distributed systems, and examines many different distributed technologies with discussions on their tradeoffs.
>Designing Data Intensive Applications
I guess you mean the book by Martin Kleppmann ?
I read A Beginner's Guide to Scala
I don't think so...
You should give "The Data Warehouse Toolkit" a read.
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition https://www.amazon.com/dp/1118530802/ref=cm_sw_r_apan_i_XM6AVWWD293EBY2GAGZZ
I would aim higher than 120k honestly, given your background. You could still turn that specific call around too if it goes on. You could say something after the next round like:
> Having heard more about the position, needed skills, and responsibilities, I would need $xyz-k salary.
This is an excellent book on negotiation if you need more support in this area: Never Split the Difference
Nothing beats Code, which starts with the true basics of signaling and then takes you to binary logic, logic gates, and all the way to building up a whole computer. The bottom-up approach is great and really helps you understand computers at every level of abstraction.
A bit off the beaten path, but I've been toying with the idea of using Node Red as a job scheduler. There are timer plugins to launch flows at specified times, email plugins that can catch flow errors, automatic version control plugins, and an exec command with support for stdout, stderr, and error codes (you could build a subflow to do the logging automatically). For reading logs you could maybe create a "dashboard" but I've never used that feature. It does support multiple users, but I don't know if authentication is as granular as the job/flow level.
What’s the masters degree in? Doesn’t matter, don’t do it. If you want a CS masters you can do this for way cheaper — http://www.omscs.gatech.edu/home
If you want to do something specifically related to DE, do this https://www.udacity.com/course/data-engineer-nanodegree--nd027
Since your company is paying for it, i recommend browsing through these https://www.udacity.com/course/data-engineer-nanodegree--nd027
>there are only 2 courses we have which would give you some exposure to Hadoop/Cloud technologies
Seems to be these ones: - https://www.udacity.com/course/data-analysis-and-visualization--ud404 - https://www.udacity.com/course/big-data-analytics-in-healthcare--ud758
I'm not too familiar with udacity, but are the contents the same as the actual classes in OMSA? Do you know?
I paid $899.10, after a 10% discount for students who enrolled in the inaugural class, for a 5-month term.
It seems the pricing model has changed and you can pay monthly ($399) or 5 months in advance ($1795). If you dedicate full time to it, you can cover it in 1 or 2 weeks.
Here's the pricing page: https://www.udacity.com/course/data-engineer-nanodegree--nd027
It’s a website to help you prepare you for technical interviews. A lot of questions from faang companies interview can be found there. It has both free and paid version if I am not wrong. https://leetcode.com/
Thanks for the link although that's for S3 to BQ. Here's a decent link for BQ to S3: https://www.freecodecamp.org/news/how-to-import-google-bigquery-tables-to-aws-athena-5da842a13539/#3af9
What I'm a tad worried/confused about is that we have some Python/Chalice/Lambda code that I presume simply is moving data from BQ to S3. I'm not sure if this is super necessary if we can simply load the data to GCS and then from GCS to S3 in fewer lines of code...so not sure if I'm missing the point in involving Chalice and Lambda functions.
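For comparison, a hedged sketch of the "GCS in the middle" approach (assuming the google-cloud-bigquery, google-cloud-storage and boto3 clients; the project, dataset and bucket names are placeholders):

    import boto3
    from google.cloud import bigquery, storage

    # 1) BigQuery -> GCS: a server-side export job, no data flows through your machine
    bq = bigquery.Client()
    bq.extract_table(
        "my-project.my_dataset.my_table",
        "gs://my-gcs-bucket/exports/my_table-*.csv",  # wildcard lets BQ shard big tables
    ).result()

    # 2) GCS -> S3: copy the exported files across
    gcs_bucket = storage.Client().bucket("my-gcs-bucket")
    s3 = boto3.client("s3")
    for blob in gcs_bucket.list_blobs(prefix="exports/"):
        s3.put_object(Bucket="my-s3-bucket", Key=blob.name, Body=blob.download_as_bytes())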
First of all, congrats on not staying where you are! I just want to give you some recommendations overall, not only related to the DE role, because you need to understand that you will be required to learn new stuff your whole life. So:
1/ From school, I believe biology can be a technical field. But please ask yourself: what is your mindset? Is it more technical or more humanitarian? I mean, what do you love to do more - playing with tech, math, doing bio formulas etc., or are you more interested in history, languages or nature? How comfortable are you learning DE now?
2/ If your answer is tech - any SE position will fit you (DE as well, of course).
3/ But if the answer is no, I would not recommend choosing the DE role. I would recommend choosing some product role (product manager and later owner, business analyst, etc.). There is a bunch of product roles where you can find yourself: https://www.aha.io/roadmapping/guide/product-management/what-makes-up-the-product-team From a business analyst you might later become a data analyst if you want. Nobody, even you, knows your path now. ;)
4/ Don't just study. Please apply your knowledge right away in several directions: by doing a pet project, by involving yourself in some open source project (fix and push some bugs there, join the community, help others there). Try to find a job asap. Some junior one, even a volunteer one, but do something for production right now.
The market is overloaded now. I am (as head of engineering at an adtech SaaS service) looking for one more DE for our team right now, and the same goes for most companies. I think you will find a job easily, but you need to prove your willingness and ability to switch.
Good luck to you. And if you have any questions, I will gladly help. ;)
So currently for a demo there are 2 options:
- one is to run the tool locally (just Docker needed), load some sample data into Postgres (or just have it monitor itself), and run a scan on that.
- the second option: I'm definitely happy to show it to you :) If that works, feel free to pick a date here :) https://calendly.com/mateuszklimek/30min?month=2021-03
And yea, there is no live hosted demo - we will be working on that ;)
If the company is well-known, there should be other groups with data scientists.
More to the point, it sounds like your company doesn't have data engineers and that's the real problem. You shouldn't be asking data scientists how to put ML, etc into production. You should be asking data engineers about how to do it. You can read some of this in books but the real knowledge comes from experience. These data engineers should have the experience that you could leverage.
I just premiered a new talk covering this point. See slide 20 for the relevant quote.
Tableau is free for a year for students. https://www.tableau.com/academic/students
I kept my AWS running for 2-3 months after I was done with my project while I was interviewing. The cost wasn't that much, but that depends on your pipeline, services and the machines you used. Like I said, if you're a student, check whether your school offers AWS/GCP credits.
> I keep feeling that my SQL skills just aren’t good enough to move on?
How far along are your skills? What can you do? Create tables, insert data, create complex queries, etc.? What's the most complex query you've made (you can post pseudocode)?
On the other hand: sometimes you just hit a wall where you don't know how to continue. I usually just pick up something else and let the things that I've just learned sink in for a month or three. When I then come back (to SQL), I can more easily recognize the things I don't know and thus find more info to learn them.
Does that make sense? It's OK to temporarily stop learning thing A, so you can let things sink in and focus on thing B, only to come back at a later time.
> Has anyone been in this position?
Definitely, yes. For context: I'm a Junior data engineer that's a Software Engineer by education.
PS: Do you also use something to query just MySQL? Something like DBeaver? If so, then creating a connection should already be familiar to you - should make it easier to start in Python :)
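For what it's worth, the Python side is not much more than this (a sketch assuming the mysql-connector-python package; the host, credentials and query are placeholders):

    import mysql.connector

    # Same connection details you'd put into DBeaver
    conn = mysql.connector.connect(
        host="localhost", user="me", password="secret", database="shop"
    )
    cur = conn.cursor()
    cur.execute("SELECT id, name FROM customers LIMIT 5")
    for row in cur.fetchall():
        print(row)
    conn.close()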
Read this instead if you can: Star Schema: The Complete Reference (https://www.amazon.com/dp/0071744320/), but also make sure you understand dimensional modeling, the difference between facts and dimensions, and slowly changing dimensions. It's also useful to know what a cube or a data mart is. Both books are written not for a cloud data warehouse audience but for an on-prem data warehouse audience. So is the job cloud or not? If cloud, add The Informed Company: How to Build Modern Agile Data Stacks that Drive Winning Insights (https://www.amazon.com/dp/1119748003/). You don't have to read all 3 books; just skimming one of the three looking for the concepts mentioned should help.
I am in the middle of reading through the commonly referenced Martin Kleppmann's [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321). I had just gotten through the Graph-Like Data models portion which feels like it could be a great solution to your problem. Perhaps looking up the Cypher Query Language for the Neo4j graph database could be helpful and an opportunity for you to implement a new solution adding it to your toolbelt.
I can't speak to whether this is even used anymore or is worthwhile. The book was published in 2017 and I haven't ever used this type of database before myself.
If you are committed to the SQL route I am guessing you will need to have a comment table which can reference other comments in the table. From there you can do a self-join to build up your data set.
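A toy sketch of that self-join idea, using sqlite3 just to keep it self-contained (the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table comments (id integer primary key, parent_id integer, body text);
        insert into comments values
            (1, null, 'root comment'),
            (2, 1, 'first reply'),
            (3, 2, 'nested reply');
    """)
    # Self-join: pair every comment with its parent via parent_id
    rows = conn.execute("""
        select child.id, child.body, parent.body as parent_body
        from comments child
        left join comments parent on child.parent_id = parent.id
    """).fetchall()
    print(rows)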
Does your company have a self service BI tools that would alleviate some of the issues with scheduling? Tableau, Power BI?
These tools can help centralize some of the “reporting” stuff and provide some automation.
What’s the real issue? Data quality? Automation? Data accessibility? Reusability? Sharing?
Here’s a great book in case that’s all your looking for… get at least through the first section.
Star Schema The Complete Reference https://www.amazon.com/dp/0071744320/ref=cm_sw_r_cp_api_glt_i_RBKV51H6C2SPVXB4XHRA
I like this approach. In the newer Spark API they introduced a transform method for DataFrames, which makes consistent method chaining easier but can also lead to complex code when overdone.
Introducing too much separation will end up with small functions that make you jump around the code just to figure out what's going on, increasing the overall cognitive load when reading the code. There are a lot of interesting arguments against shallow methods in A Philosophy of Software Design
My rule is: if it's complex logic that you want to test separately and can encapsulate in a separate function, then extract it and unit-test it; otherwise chain things together.
    def compute(df):
        return (
            df
            .select()
            .filter()
            .withColumn()
            .transform(extract_isbn)
        )

    def extract_isbn(df):
        return ...
Ideally I would like to just read the compute function and know what's going on.
Unit tests are also IMO a bit tricky when working with Spark, as you don't want to end up unit-testing Spark's API; I put more emphasis on e2e/functional tests.
You might have to check out deep-dive books on whichever specific database engine you're actually using, as the implementation is different even if the 'language' is basically the same. My workplace is an MS shop and this book is a huge resource for me.
I actually recognized this from a Hard Leetcode problem:
https://leetcode.com/problems/tournament-winners/
This was my solution:
    with cte as (
        select first_player as player_id, first_score as score from Matches
        union all
        select second_player as player_id, second_score as score from Matches
    )
    select group_id, player_id
    from (
        select
            cte.player_id,
            group_id,
            dense_rank() over (
                partition by group_id
                order by sum(cte.score) desc, cte.player_id asc
            ) as score_rank
        from cte
        join Players p on cte.player_id = p.player_id
        group by cte.player_id, group_id
    ) a
    where score_rank = 1
Steps:
My team does - but DE is such a broad area, it really just boils down to what "kind" of DE you are. I found this helpful when first defining structure for my team: https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/; I like to categorize DE into
7 and 8 I consider fair game here and there, and they come with the territory in some cases, but obviously don't expect them to be core functions.
Really small teams may spend cycles on different areas (all, some, or just one) - depending on the nature of the team (embedded DE on DS/Analytics vs. DE on Data Infrastructure/Data Platform teams).
My team spends most of our time on 1, 5, and 6, so we index more towards at least some comfort with backend and web dev, especially because we support lots of data viz and reporting platform apps and the familiarity helps to simplify solutions. I imagine a DE focused more on warehousing and fleet management wouldn't have to think as much about API/DB connectors since there's more focus on the data itself.
I've been using Visual Studio Code https://code.visualstudio.com/ for a few years now; it has a decent Python plugin (as well as others!) and is fairly simple. It's a lot more stripped back compared to its big brother - I could never stand actual Visual Studio as it was always just too much for me.
I actually got into Data Engineering because someone needed a Linux Engineer to help with their Apache Spark/Hadoop cluster, which means I'll write code on pretty much anything that has a vim plugin, so I'm never sure if I should be trusted on what IDE to use, although VSCode does get a lot of praise online.
However, if you're going to write a load of R I would stick with RStudio because it's such a good bit of software, you really can't go wrong with it.
Awesome! The Slack channel is super helpful and there's an offer on the homepage for a free one-on-one integration session in return for some user testing.
https://calendly.com/great-expectations-1/great-expectations-integration-one-on-one
If you truly want to learn SQL and be able to solve any DE problem like cleansing, joining and transforming etc I suggest learning Set Theory. T-SQL Querying by Itzik Ben-Gan is one of the best books on the subject. Check the reviews.
https://www.amazon.com/T-SQL-Querying-Developer-Reference-Ben-Gan/dp/0735685045
It teaches SQL and its origins (set theory) with some focus on the T-SQL variant, but the basic principles are applicable to ANSI SQL, and hence to any major relational database. I aced all my SQL interviews after studying it and using the book as a reference for 6 months in my first DE job.
I have read sections of this book and it’s a pretty good overview of the kinds of questions you might be asked and the kind of answers you should give: System Design Interview – An insider's guide, Second Edition https://www.amazon.com/dp/B08CMF2CQF/ref=cm_sw_r_cp_api_glt_fabc_FA1Q8NMYREK027DS5S6Z
The particulars of the exact questions aren’t super important imo. Try to demonstrate that you know the sorts of high level things to think about and are comfortable with making thoughtful trade offs.
Yes, I'm using a modified version of this template: https://www.overleaf.com/latex/templates/plushcv/jybpnsftmdkf . I have built my CV on Overleaf (using LaTeX because I knew how to do it and it is simple to keep updated), but you could probably find a similar template for Word, etc. I could help you by reviewing your CV if you would like, but please note that I do not have any experience with this and have never interviewed a candidate before. Best of luck!
At least, this is what I am doing.
I'm actually doing this at the moment. I'm spinning up containers using MiniKube. Still learning the basics of k8s as I'm used to Docker, but seems straightforward.
I would rate SQL the least important. Data structures and algorithms should really be on the same bullet; they go hand in hand and are the most important for cracking the online assessment. There are tons of reviews of the online assessment online. Check LeetCode.
Use Elasticsearch. It has a host of support for keyword searches using the Query DSL.
Preferably use Elasticsearch 7.x, since you could also use the dense_vector type to store vectors, which could help support semantic similarity.
Refer https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
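A minimal keyword-search sketch with the official Python client (assuming a local 7.x cluster; the index name, field and query text are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index a couple of documents into a made-up index
    es.index(index="articles", id=1, body={"title": "intro to data engineering"})
    es.index(index="articles", id=2, body={"title": "semantic search with vectors"})
    es.indices.refresh(index="articles")  # make the docs searchable immediately

    # Query DSL keyword search on the title field
    resp = es.search(index="articles", body={"query": {"match": {"title": "vectors"}}})
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit["_source"]["title"])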
Also, a word of caution, you will get a lot of rejection. That's just what looking for a job is like now. Also, go on hackerrank and leetcode and get used to solving problems like that in short amounts of time in front of people. It's the worst and the only way to crack it is just to do it over and over and over.
Doing these types of problems in an interview is in no way related to anything you'll do in the job, but it's just super common to weed out people. Hang in there and don't stop applying to jobs.
I like GitLab's approach.
https://about.gitlab.com/company/culture/all-remote/handbook-first-documentation/
Culture is a huge enabler/blocker; if the will fails at any level it becomes very hard to start/maintain.
GitLab does their data engineering in the open.
https://about.gitlab.com/handbook/business-ops/data-team/organization/engineering/
Well, if you have a manager or a direct person to talk to, I think sounding out the idea of an up-to-date, organic knowledge base for onboarding new members might be a good way to get involved in the discussion, at least as a silent listener/observer. Documentation is a boon for most organizations; the fact that you can't find it inside the code base, and how that affects your progress towards understanding it, should be a valid reason for improvements. You were a data analyst, virtually their client, so you might have a valuable perspective on your data team's goal.
https://about.gitlab.com/handbook/business-ops/data-team/organization/engineering/
Probably the best wiki you'll find is going to be here https://about.gitlab.com/handbook/business-ops/data-team/
Gitlab has a lot of information on how to run a Data Engineering Dept on their public wiki.
A First Course in Database Systems by Widom and Ullman
That book had chapters on data modeling and how to design relational databases so it might answer your questions or at least be a good start. There are PDFs online you can find.
Here is the Amazon link: https://www.amazon.com/First-Course-Database-Systems-3rd/dp/013600637X/ref=nodl_
Yep. You can use Grafana to visualize Prometheus metrics. With regards to alerts, you can use Alertmanager https://prometheus.io/docs/alerting/latest/overview/ to fire alerts based on your rules - e.g. you can fire an alert when a failure count has increased, which will produce one alert, or you can require that there are 0 failures at all times, which will produce alerts until the task is marked as success. Alertmanager has lots of features like grouping, silencing etc. At my previous job the SRE team hooked it up with Slack and OpsGenie (a service for incident handling with escalation rules, the ability to send SMS and voice mails etc).
If you have any knowledge of docker + kubernetes you can use: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
Though MediaWiki and Dokuwiki have been good players in the market for some time, nowadays some modern tools like Document360 work better for product documentation and user guides.
Document360 is more user-friendly: authors can use both Markdown and a WYSIWYG editor with a live preview panel, and it offers a Google Drive-like store to capture and hold digital assets, advanced analytics to understand article performance, lots of integration extensions, versioning support, a smart search function, article tagging, and multi-language support.
If you are looking for diagramming support, you can create diagrams with tools like Visio or SmartDraw and import them as images into the text in Document360.
I am sure there are a lot of great open source documentation tools, but if you want a single tool for good text input AND diagramming, I would mention archbee.io in this thread.
It has native and mermaid diagrams and a block based editor that supports markdown shortcuts. You can check the gifs on the homepage to get a feel of how it works and what it does.