I've been reading Designing Data-Intensive Applications by Martin Kleppmann and I would recommend it to all backend developers out there who want to step up their game.
(I also love that it's a language-agnostic book.)
Backend->Distributed is a logical progression.
They may be out there, but I’m unaware of “Junior Distributed Systems” roles as a category. Alternatively you could look at DevOps roles. I strongly recommend Designing Data Intensive Applications, although you are going to need experience prior to diving in.
Anyone interested in the above should read Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann.
Here are a few ideas:
* Pagination in SQL without using skip/limit or similar.
* A function that checks every X interval whether a page is online. It emits an alert if the page has been down for longer than Y, repeats that alert as a reminder every so often, sends another alert when the page recovers, etc.
* Some function that uses a probabilistic data structure. E.g. use a Bloom filter to get a counter that tolerates some false positives in exchange for using less memory.
I recommend this book: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
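For the Bloom filter idea, here is a minimal sketch of what such a counter could look like. The class names, the bit-array size, and the salted SHA-256 hashing are all assumptions made for illustration, not something from the comment above. The counter only counts items the filter has not seen, so a false positive makes it undercount slightly.

```python
import hashlib


class BloomFilter:
    """Membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 4096, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class ApproxUniqueCounter:
    """Counts first-time items using a Bloom filter instead of a full set."""

    def __init__(self, bloom: BloomFilter) -> None:
        self.bloom = bloom
        self.count = 0

    def observe(self, item: str) -> bool:
        if item in self.bloom:
            return False  # seen before, or a false positive
        self.bloom.add(item)
        self.count += 1
        return True


counter = ApproxUniqueCounter(BloomFilter())
for visitor in ["ana", "bo", "ana", "cy", "bo"]:
    counter.observe(visitor)
print(counter.count)  # at most 3; lower only if a false positive occurred
```

Memory use is fixed by `num_bits` no matter how many items are observed, which is the trade-off the idea is about.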
Mentoring is great fun, you can't easily fail at that since it's always playing to your strengths. If it's something way out of your comfort zone, start and lead a reading group of e.g. DDIA and you can learn and teach at the same time. Beyond that, asking here for experience with specific systems will usually net you some good advice of what pitfalls there are and what tools would pair beautifully with your needs.
As a fellow europoor, I'm also in need of a salary correction.
I interview people for distributed systems roles and also work as a backend engineer at a big company. What I look for in a candidate is not whether they come up with a perfect solution but how they reason about different solutions and how they find the problems in them. For example, say you have to decide which database to use for an application. You could start by deciding between SQL and NoSQL. You should consider the number of updates per second, and if the data is not big you could go with a Postgres DB. If the number of requests grows over time and you start having connection problems, you can always increase your DB instance size while still keeping a single instance and a single source of truth. What if this is not enough? Well, if you are using AWS, you can use Aurora and maybe have more than one replica... So basically you have to think about the problems: scale, concurrency, and consistency. Unfortunately, this is something I learned by working on it, and I sucked at this in my interviews when I did not have any experience.
If you want, I'm in EU timezone, we can have a chat sometime:)
This is a very good book https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Hey! Read this book: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems - https://www.amazon.com/-/es/Martin-Kleppmann/dp/1449373321
This is THE BOOK for data engineering. It will help you to understand more about what's going on under the hood and a lot more. You can then ask questions about the system of your company and maybe even identify opportunities for improvement.
The word you're looking for is infrastructure.
I strongly recommend you read Designing Data Intensive Applications to get a better idea of what using microservices actually implies.
> Designing Data Intensive Applications
I looked at the title and checked the book description: it has nothing to do with DE. "This book is for software engineers, software architects, and technical managers who love to code."
Honestly, read the book "Designing Data-Intensive Applications", which will help you understand the use cases for these and so much more. I genuinely think this book is the best foundational resource for anyone wanting to grasp the options available, the tradeoffs between them, and the appropriate use cases for each.
Do you have any resources you can recommend to people who wanna get into distributed systems?
I'm thinking of picking up DDIA at some point although some people say it's not for beginners.
I disagree completely; almost everyone vastly underestimates the software engineering challenges of keeping a globally used service like Twitter always available and performant. It is almost seen as the canonical example service for distributed system design and is frequently referenced in that book. I recommend it if you want to grasp even just a bit of how challenging this can be.
Also, in my experience, they are viewed highly in the tech world too. It's not uncommon for Big N folks to have had an internship at Twitter. Go on any given train-for-a-prestigious-tech-job site like LeetCode, and they will have sections dedicated to Twitter just like Google, Microsoft, Facebook, and Netflix.
My guess is Twitter fails at the level of the C-suite. When the high-level decisions are bad, no technical innovation can save you.
>Designing Data Intensive Applications
I guess you mean the book by Martin Kleppmann?
>For example, understand why you'd use Postgres (sql) as opposed to DynamoDB (nosql)
I worked only with SQL databases my entire career and I have no idea how many reads and writes these things can do. I try to write the queries in a sane manner and keep my fingers crossed. I feel like an imposter for not knowing these numbers.
I did read this book and it helped me understand different use cases for the four types of NoSQL databases.
There’s probably many, many ways to answer this question. But it sounds like you’re trying to get a handle on scaling out an architecture.
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_api_glt_i_YJ0JH4AT7CX0HCPWVZF5 has been super helpful in this regard. It won’t give you all of the answers, but will have enough information to get you thinking about how to scale.
Can you recommend resources to improve my knowledge about designing data intensive systems? It can be paid content, I have a training budget.
I've already found this book:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Is it any good?
My previous client had a similar problem in the financial sector. We encountered the same problems of keeping the transactions in sync.
Fully writing down the methodologies takes a lot of time. But this is an exercise to learn, I can recommend picking up a book. This book has helped me in the mentioned project and will surely help you as well.
Books are expensive. Any fiction book costs 50 reais and up.
That's 5% of the minimum wage.
Books in Brazil are very expensive. The other day I was looking at a technical book (https://www.amazon.com.br/Designing-Data-Intensive-Applications-Martin-Kleppmann/dp/1449373321/) and it costs five hundred and twenty reais.
Five hundred reais. It's an imported book, and there is no import tax on books. 50% of the minimum wage.
This one is also great, especially if you're planning on focusing on backend, but it might be more useful when you've been on the job for a while
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://smile.amazon.co.uk/dp/1449373321/ref=cm_sw_r_cp_api_glt_fabc_WCZHP2SD9XFSNCNKY7AA
Your db choice highly depends on the application. If you want to read more about when to choose what and why, you can read Designing Data Intensive Applications
> There is a difference between working hard and EARNING it. Everyone "works hard". From dorsey to the janitor. Lets say a doctor earns $100K per year. Dorsey is worth $5 billion? Did he put in the work of 50000 doctors to "earn" his wealth?
> It's impossible for anyone to "earn" that much money. You get that kind of wealthy by getting others to "earn" it and skim it off for yourself.
Then the janitor should go lead a team that's going to run a business that will change the world. We live in a world that rewards innovation.
> Right because twitter software stack is so "revolutionary". The only people who are "mesmerized" by twitter are people who aren't in the tech field.
If you think this way, then you aren't in the tech field. Or if you are, then you are doing some very trivial shit but still have the ego of an accomplished engineer.
Serving any app to billions of people at once is extremely difficult. Read this book if you actually care to acknowledge any of the challenges associated with running a massively distributed system.
> Nope. Like I said, it's simply impossible for anyone to "earn" that much wealth. No more than a slave owner "earns" all his wealth.
Go make a company that serves a billion people. Go make a company that can get anything delivered anywhere in the world in 2 days. Do anything that creates novel sources of value in this world and you get rewarded.
Seriously man you are one of the most close-minded people I've ever seen on this subreddit. Go join the anti-billionaire millennials in /r/all, this place is for rational people.
Given you're using RDS, maybe a Lambda that connects to the main and regional database, iterates through all the data and upserts it into the regional database? That's just off the top of my head, hard to know the best solution without more details on your architecture.
For the second solution, the key is that both the main and regional databases need to be consuming the same events, i.e. you can't write directly to the main database; instead you only write to Kinesis and let Kinesis update the main database. Events will also need to be idempotent, since delivery is at-least-once and the same event can arrive more than once. It's tricky to do and doesn't work for all use cases, but imo event-driven architecture is a great way to build scalable and reliable systems. Good read on this: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=sr_1_2?crid=3TZ3NFHS6SGP0&dchild=1&keywords=scalable+system+design&qid=1614385300&sprefix=scalable%2Caps%2C236&sr=8-2
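To make the idempotency point concrete, here is a rough sketch of a consumer that tolerates redelivery. The `Event` and `Store` names and the in-memory dict are hypothetical stand-ins; this does not use the Kinesis or RDS APIs. Each event carries a unique id, and the store skips ids it has already applied, so both stores converge on the same rows even when an event is delivered twice.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # unique per logical event; a redelivery reuses the same id
    key: str
    value: int


@dataclass
class Store:
    """Hypothetical stand-in for the main or a regional database."""

    rows: dict = field(default_factory=dict)
    applied_ids: set = field(default_factory=set)

    def apply(self, event: Event) -> bool:
        """Apply an event at most once; return False for a duplicate delivery."""
        if event.event_id in self.applied_ids:
            return False
        self.rows[event.key] = event.value  # upsert
        self.applied_ids.add(event.event_id)
        return True


# The same stream is consumed by both stores; "e1" is delivered twice.
stream = [
    Event("e1", "user:1", 10),
    Event("e2", "user:2", 20),
    Event("e1", "user:1", 10),
]
main, regional = Store(), Store()
main_results = [main.apply(e) for e in stream]
regional_results = [regional.apply(e) for e in stream]
```

In a real database, the dedupe record and the upsert would need to be written in the same transaction, otherwise a crash between the two could still apply an event twice or drop it.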
In general, it looks like a good case to use log-structured storage engines. Compaction process can also be further optimized using B-Trees and LSM-Trees.
> Designing Data-Intensive Applications
Another recommendation for this book. It is my first recommendation for systems architecture books due to the breadth of topics it covers.
Your comment of "you and the rest of the industry would really like to know" says otherwise.
You didn't come here to have a conversation, just to pump your own tires. Whether you believe it or not, you are speaking buzzwords and I can see right through it.
If you genuinely are struggling, read that system primer and practice with this book
You aren't the only one that is struggling in this area but that doesn't mean everyone is. Frankly there is a vendor model of providing services to lift and shift in bulk with large enterprises that experiments in the $100ks. It's one of the easiest and hottest paths I've seen of getting into FAANG or starting your own service.
Strong recommendation: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
Design is all about scaling. And the big problem with scaling is scaling the data. That book is a great look at how that happens.
I would echo the other comments. As a DE hiring manager, I don't care if you don't know the tools per se; understanding the concepts and knowing why you would choose particular approaches or designs is much more important.
A good book to read is : https://smile.amazon.co.uk/dp/1449373321/ref=cm_sw_r_cp_apa_i_jn5kFbBVKBAF9
I would also highly suggest learning FP. I am a Senior Data Engineer with Amazon and this was the most valuable thing I taught myself when I started working at Amazon. Along with all these suggestions the book I recommend to every new data engineer is https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321 as this book gives you a great starting point to going more in depth on the various architectures you will encounter. The other thing I would say is don’t just learn sql syntax but focus on how database internals work in order to truly understand applying optimization skills. You will often be handed some query/crude pipeline from either Software Engineers or Data Scientists and will need to be able to optimize them to be production ready.
Basically, read the following book:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_apa_i_2Qi6AbQCM834T
And build things that are related to ML. There are many ML projects with open source implementations that you can easily download and run. Stand up a server to serve predictions, then stand up an API server in front of it.
I'm a little late to the thread but I work at a company that operates at a large scale and I've found Designing Data Intensive Applications to be the best overview of modern techniques for scalable applications
> When is it okay to get complacent in your job and when is it not?
That's 100% up to you. Different strokes for different folks and all that.
> How important is it to constantly be working on or learning new stuff?
Extremely important. So much so that I give almost no pushback if my people wanna spend a few days per month at a conference/training. Company will even pay for most of it. Find a company that has a line-item in the budget for professional development -- dollars that are specifically intended to be spent by the end of the year on training, conferences, etc.
And that's not exclusive to software/data/compsci. Any skilled labor is changing constantly. Professional development is important.
> For the data engineers out there what skills should I perfect that will make me employable / desirable anywhere?
Become familiar with a variety of query languages and syntax. SQL, Elastic, AQL, N1QL, a time series DB -- the specific one doesn't really matter, just know more than "basic SQL joins" that you'll see in an undergrad database course.
Recommended reading: Designing Data Intensive Applications.
Designing Data-Intensive Applications seems to be the industry standard, although it's not Go specific.
Designing Data Intensive Applications is your ticket here. It takes you through a lot of the algorithms and architecture present in the distributed technologies out there.
In a data engineering role you will probably just be munging data through a pipeline making it useful for the analysts/scientists to use, so a book recommendation for that depends on the technology you will be using. Here are some of my favorite resources for the various tools I used in my experience as a Data Engineer:
Good luck in your new position!
Hey, DE here with lots of experience, and I was self taught. I can be pretty specific about the subfield and what is necessary to know and not know. In an inversion of the normal path I did a mid career M.Sc in CS so it was kind of amusing to see what was and was not relevant in traditional CS. Prestigious C.S. programs prepare you for an academic career in C.S. theory but the down and dirty of moving and processing data use only a specific subset. You can also get a lot done without the theory for a while.
If I had to transition now, I'd look into a bootcamp program like Insight Data Engineering. They put you in front of employers and force you to finish a demo project. At least look at their syllabus. In terms of CS fundamentals, https://teachyourselfcs.com/ offers a list of resources you can use over the years to fill in the blanks.
Data engineering is more fundamentally operational in nature than most software engineering. You care a lot about things happening reliably across multiple systems, and when using many systems the fragility increases a lot. A typical pipeline can cross a hundred actual computers and 3 or 4 different frameworks. [...] doesn't need a lot of it. (Also, I'm doing the inverse transition to yours... trying to understand multivariate time series right now.)
I have trained junior coders to become data engineers and I focus a lot on operating system fundamentals: network, memory, processes. Debugging systems is a different skill set than debugging code; it's often much more I/O centric. It's very useful to be quick on the command line too, as you are often shelling in to diagnose what's happening on this computer or that. Checking 'top', 'netstat', grepping through logs. Distributed systems are a pain. Data Eng in production is like 1/4 Linux sysadmin.
It's good to be a language polyglot. (python, bash commands, SQL, Java)
Those massive Java stack traces are less intimidating when you know that Java's design encourages lots of deep class hierarchies, and every library you import introduces a few layers to the stack trace. But usually the meat and potatoes method you need to look at is at the top of a given thread. Scala is only useful because of Spark, and the level of Scala you need to know for Spark is small compared to the full extent of the language. Mostly you are programmatically configuring a computation graph.
Kleppmann's book is a great way to skip to relevant things in large system design.
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
It's well worth understanding how relational databases work because all the big distributed systems are basically subsets of relational database functionality, compromised for the sake of the distributed-ness. The fundamental concepts of how the data is partitioned, written to disk, cached, indexed, query-optimized, and handled in transactions all apply. Whether the input is SQL or Spark, you are usually generating the same few fundamental operations (google Relational Algebra) and asking the system to execute them the best way it knows how. We face the same data issues now as we did in the 70s, but at a larger scale.
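To illustrate the "same few fundamental operations" point, here is a small sketch of three relational algebra operators written over plain Python lists of dicts. The function names and sample rows are made up for illustration, and the hash join is just one possible execution strategy. Note that `project` keeps duplicates the way SQL does, whereas strict relational algebra would remove them.

```python
def select(rows, predicate):
    """Selection: keep rows matching the predicate (SQL WHERE)."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns (SQL SELECT col, ...)."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, key):
    """Equi-join on a shared column, executed as a simple hash join."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


users = [
    {"user_id": 1, "name": "ana"},
    {"user_id": 2, "name": "bo"},
]
orders = [
    {"order_id": 10, "user_id": 1, "total": 30},
    {"order_id": 11, "user_id": 1, "total": 5},
    {"order_id": 12, "user_id": 2, "total": 50},
]

# Roughly: SELECT name, total FROM users JOIN orders USING (user_id) WHERE total > 20
result = project(
    select(join(users, orders, "user_id"), lambda row: row["total"] > 20),
    ["name", "total"],
)
```

A query optimizer's job is largely to reorder operators like these (for example, filtering before joining) without changing the result.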
Keeping up with the framework or storage product fashion show is a lot easier when you have these fundamentals. I used Ramakrishnan, Database Management Systems. But anything that puts you in the position of asking how database systems work from the inside is extremely relevant even for "big data" distributed systems.
https://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638
I also saw this recently and by the ToC it covers lots of stuff.
But keep in mind... the designers of these big data systems all had a thorough grounding in the issues of single-node relational database systems. It's very clarifying to see things through that lens.
For web related stuff I highly recommend the red book
I read https://hckrnews.com/ with my morning coffee and while compiling.
I read https://devurls.com/ on Sundays
I'm also a fan of tech books. Obviously there's a lag between the cutting edge and what's available in print, some I'd recommend are:
I'm currently reading Designing Data-Intensive Applications by Martin Kleppmann, and I'm totally loving it. However, I'm not sure if it is precisely systems programming; I mainly started reading it to get an introduction to distributed systems, but overall it covers the topic very well. It clearly describes how popular software and database systems nowadays, for example, are designed and how they handle large amounts of data.
On the other hand, Rust in Action by Tim McNamara is a good resource, and Jon Gjengset's upcoming book Rust for Rustaceans seems promising!
Richard Seroter has a nice overview of spring cloud infrastructure with a great example app. It's a bit dated now, but opened my eyes to what's possible with spring cloud:
Java Microservices with Spring Cloud: Developing Services
Java Microservices with Spring Cloud: Coordinating Services
Designing Data-Intensive Applications The Big Ideas Behind Reliable, Scalable, and Maintainable Systems By Martin Kleppmann (also available on O'Reilly)
There are also a couple of videos by Josh Long (browse his bootiful presentations) but I have no links at hand. Good luck!
IMO, one of them is this one: Designing-Data-Intensive-Applications
And as a reference or long term goal, these two:
- https://ce.guilan.ac.ir/images/other/soft/distribdystems.pdf
edit: added long term goals / references
I suggest you make it a separate question (but describe your situation again), so you can see multiple opinions.
For a quick start, look into books about system design, aka the design of high-load systems. Kleppmann's is the most famous one: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Practical CS topics - databases, PL concepts (OOP, FP, type system...), CPU architecture, parallel and distributed programming.
There are also a few areas of math useful for programmers - statistics, linear algebra, a bit of discrete math and analysis.
Distributed systems is one of the hardest CS courses you can take because distributed systems is freakin hard. One of the projects is implementing some of PAXOS. LOL, freakin PAXOS??? LOL!
You really want to prep for that class? Read this book: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Also, go read some DS papers like Dynamo, memcache, zookeeper, google file system, etc etc.
You're welcome. And prepare to lose your mental health.
Designing Data-Intensive Applications: https://www.amazon.com/dp/1449373321
The Data Warehouse Toolkit: https://www.amazon.com/gp/product/1118530802
Data Pipelines Pocket Reference: https://www.amazon.com/dp/1492087831
Data Engineering with AWS: https://www.amazon.com/dp/1800560419
Database Internals: https://www.amazon.com/dp/1492040347
My former CTO recommended this book to me:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
I don't regret reading it. It's one of the most useful books for any programmer who wants to gain a deeper understanding of the key elements of modern web applications.
Kubernetes. It is becoming the operating system of the cloud.
But if you really want to grow as an engineer then you should focus on theory. Knowing this theory will give you a skillset that will allow you to work at more complex jobs.
Also I'm a big believer that theory is more beneficial in the long run (think about the "teaching a man to fish" proverb) since tools and languages are all based on some theory that has been around for hundreds of years while tools and languages change every year.
Here are topics I would consider to be beneficial for the future. These are all important and the order of listing here does not signify more or less importance.
Computer Networking - https://canvas.mit.edu/courses/11164
Computer Architecture - http://csg.csail.mit.edu/6.823S21/info.html
Distributed Systems - https://martinfowler.com/articles/patterns-of-distributed-systems/ & https://pdos.csail.mit.edu/6.824/ & [Designing Data-Intensive Applications](https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)
Performance Engineering - [Performance Analysis and Tuning on Modern CPUs](https://www.amazon.co.uk/Performance-Analysis-Tuning-Modern-CPUs/dp/B08R6MTM7K) This book also has labs - https://github.com/dendibakh/perf-ninja
Computer Security - https://css.csail.mit.edu/6.858/2020/ & [The Hacker Playbook 3: Practical Guide To Penetration Testing](https://www.amazon.co.uk/Hacker-Playbook-Practical-Penetration-Testing-ebook/dp/B07CSPFYZ2)
Advanced Operating Systems - https://pdos.csail.mit.edu/6.828/2021/schedule.html
Algorithms - https://neetcode.io/ & https://elementsofprogramminginterviews.com/ (Make the job search easier for yourself)
Linux - [Linux Kernel Development](https://www.amazon.com/Linux-Kernel-Development-Robert-Love/dp/0672329468) & https://www.cs.dartmouth.edu/~sergey/netreads/path-of-packet/Network_stack.pdf & https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-sending-data/ & https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-receiving-data/ & [CCNA](https://www.youtube.com/playlist?list=PLxbwE86jKRgMpuZuLBivzlM8s2Dk5lXBQ)
Programming Languages - The courses above will expose you to C++/C, Go, and Python. On top of this, you should know functional programming concepts, so a language like Haskell is good to get experience with. You don't have to master all these languages, but you should know at least one pretty well.
Learning all this is a multi year effort. So it's best to start now while you have a lot of free time, then in the future you can take it easy.
Good suggestions so far. Core for you should be Python, Spark, and an orchestration tool (Airflow). Focus on PySpark to get the most lift but don't forget your fundamentals: DS/Algos, Design Patterns, and System Design. Here are two resources for the latter:
https://github.com/donnemartin/system-design-primer
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
If you're interested in reading about some of the answers here, this is a great read: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
I don't know much about Grokking the Coding Interview. For DS/Algo questions, Leetcode should be sufficient. I had an Amazon interview a few years back, and a piece of feedback I got from the recruiter was that they didn't like my system design solution, so I'd suggest not ignoring it or taking it lightly if you have some years of work experience.
The Designing Data Intensive Applications book is considered one of the best resources for preparing for that. You can go through various system design topics from the link below too, but the most important thing is to practice rather than merely cram the concepts.
https://github.com/Developer-Y/Scalable-Software-Architecture
I remember this feeling! Here’s a great book if you want to feel even more seen:
Free version: https://github.com/ms2ag16/Books/blob/master/Designing%20Data-Intensive%20Applications%20-%20Martin%20Kleppmann.pdf
Not free version: https://www.amazon.co.uk/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=nodl_
Leetcode has database / SQL challenges for practice, I haven’t seen them used often in interviews though.
https://leetcode.com/problemset/database/
Strongly recommend you throw this on your reading list. It will give you all the language you need to talk about data systems in interviews.
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_awdo_EY5XY2CEGH26YSNA9X22
Seems to cover a lot for the price. Found a free resource if you want to self-pace: https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
Also, for broad coverage/understanding this book is worth checking out: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
^^^ Covers more than just "Data" but it's a great resource for distributed systems, which you can benefit from.
Lastly, for even broader coverage, this repo is worth spending time in: https://github.com/donnemartin/system-design-primer
Feel free to DM me if you need anything else.
This is a great question.
It looks very simple on the surface, however, the answer depends a lot on your use case and what tech you are using.
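In the majority of cases, this can be solved using the transactions provided by most databases. With suitable isolation and row locking, a transaction guarantees that only one process is writing a given entry at a time. While an entry is being written, other processes either wait or keep seeing the last committed version, depending on the database and its isolation level.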
This might not work in the following cases:
Another method is to control this at the code/application level. You can allow only a single object/thread/process to access the database while you are writing the data.
You can read about mutexes and semaphores. Operating systems use them heavily for this kind of logic. There are a few more patterns that are variations of semaphores to control access.
For example, you can create a single object that can write and read the data from the DB. This single object can be held by only one process at a time.
The above example is not a very good design because it will become a huge performance bottleneck. Usually, additional code has to be written to make it performant.
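A minimal sketch of that single-writer idea, using Python's `threading.Lock` as the mutex. The `GuardedStore` class and its in-memory dict are hypothetical stand-ins for a real database handle, and the lock only protects threads inside one process, not separate processes or machines.

```python
import threading


class GuardedStore:
    """Hypothetical stand-in for a DB handle; one lock serializes all access."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data = {}

    def increment(self, key: str) -> None:
        # The whole read-modify-write runs while holding the mutex,
        # so two threads cannot interleave and lose an update.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + 1

    def read(self, key: str) -> int:
        with self._lock:
            return self._data.get(key, 0)


def hammer(store: GuardedStore, workers: int = 8, increments: int = 1000) -> int:
    """Run several threads that all increment the same key, then read it back."""

    def work() -> None:
        for _ in range(increments):
            store.increment("hits")

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store.read("hits")


total = hammer(GuardedStore())
```

Every reader and writer queues behind the same lock here, which is exactly the bottleneck mentioned above.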
If you work in a distributed system, this problem becomes even more complex.
If you really want to take a deep dive into this, I suggest reading Designing Data Intensive Applications (Part II).
My favorite blog on this topic:
And my favorite book:
Designing Data-Intensive Applications
As every design decision involves tradeoffs, it’s better to make deliberate, informed decisions.
The age difference can be weird at first, but you’ll get over it. You’ll find a lot of the younger people are clueless too haha. You probably have life skills that you’ve acquired at your age that they lack, take advantage of them.
There are an insane amount of resources available online to hone your craft, take advantage of them!!
Check out these books too
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_awdo_navT_a_X8RF7DCNYKNGJX9XN84E
Web Scalability for Startup Engineers https://www.amazon.com/dp/0071843655/ref=cm_sw_r_cp_api_glt_i_T6X9ADF6HSSQ4N7GJ5Z4?_encoding=UTF8&psc=1
FWIW, when I applied to junior roles, I nearly never got asked system design questions. So my recommendation is to do a pretty broad study on system design questions, and spend more of your time focusing on what you think your weak points are during interviews
For system design, this is a great book but overkill for what you will need for junior roles: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Other people have recommended: https://www.educative.io/courses/grokking-the-system-design-interview
I also have this bookmarked: https://github.com/donnemartin/system-design-primer
The job search for juniors is not easy. It wasn't for me 5 years ago, and it isn't easy for people now. It really comes down to determination and grit. You should be applying to a lot of positions every week. Maybe keep a trello board to track your progress for each application sent. Play the numbers game and network when you can (meetups, job fairs, hackathons, etc)
I am in the middle of reading through the commonly referenced Martin Kleppmann's [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321). I had just gotten through the Graph-Like Data models portion which feels like it could be a great solution to your problem. Perhaps looking up the Cypher Query Language for the Neo4j graph database could be helpful and an opportunity for you to implement a new solution adding it to your toolbelt.
I can't speak to whether this is even used anymore or worthwhile. The book was published in 2017 and I haven't ever used this type of database myself.
If you are committed to the SQL route I am guessing you will need to have a comment table which can reference other comments in the table. From there you can do a self-join to build up your data set.
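A rough sketch of the SQL route, using Python's built-in `sqlite3` module. The `comment` table, its columns, and the sample rows are assumptions for illustration. The first query is the self-join described above, which resolves one level of nesting. The recursive CTE after it is my own addition for walking a whole thread, not something from the original suggestion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comment (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES comment(id),  -- NULL for top-level comments
        body      TEXT NOT NULL
    );
    INSERT INTO comment (id, parent_id, body) VALUES
        (1, NULL, 'top-level comment'),
        (2, 1,    'reply to 1'),
        (3, 2,    'reply to 2'),
        (4, 1,    'another reply to 1');
    """
)

# Self-join: pair every reply with its direct parent (one level only).
pairs = conn.execute(
    """
    SELECT child.id, parent.id
    FROM comment AS child
    JOIN comment AS parent ON child.parent_id = parent.id
    ORDER BY child.id
    """
).fetchall()

# Recursive CTE: walk the whole tree and record each comment's depth.
thread = conn.execute(
    """
    WITH RECURSIVE thread(id, depth) AS (
        SELECT id, 0 FROM comment WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, t.depth + 1
        FROM comment AS c
        JOIN thread AS t ON c.parent_id = t.id
    )
    SELECT id, depth FROM thread ORDER BY id
    """
).fetchall()
```

With a plain self-join you need one extra join per level of nesting, which is why arbitrarily deep threads tend to push people toward recursive queries or a graph database.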
For system design interview preparation, SystemsExpert has been my favorite. Another common resource is Grokking the System Design interview on educative.io
There's also Designing Data-Intensive Applications, which has a lot more depth than the resources mentioned above, but is much less practical / time efficient studying for system design interviews; I'd only recommend it after the others
From a software engineering standpoint this is a very well respected text: https://smile.amazon.com/gp/product/1449373321
as u/a8m mentioned in the post, many use cases will not suffer substantially from having some level of inconsistency between systems, and using something like hooks (or other in-process dual writing) is so simple to write and manage that it makes this a useful solution. martin kleppmann has a very interesting discussion of this towards the end of Designing Data-Intensive Applications
agreed, for use cases that require better consistency guarantees, this won't work, as mentioned explicitly in the post.
re ent + tx outbox, a discussion started on this issue https://github.com/ent/ent/issues/1473. i think it's a great idea and with all of the "NewSQL" databases that support proper horizontal scaling, i think it will become a very widely used design pattern.
ping u/a8m or me on the ent slack or discord if you're interested in this and want to work on something together!
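for anyone unfamiliar with the tx outbox idea mentioned above, here is a rough sketch of the general pattern in Python with the built-in `sqlite3` module. this is not ent's API and not what the linked issue proposes; the tables `users` and `outbox` and the functions `open_db`, `create_user` and `relay` are hypothetical names made up for illustration.

```python
import json
import sqlite3

def open_db():
    """Create an in-memory database with a business table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )
    return conn

def create_user(conn, name):
    """Write the user row and its event in one transaction (both or neither)."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        user_id = cur.lastrowid
        event = {"type": "user_created", "id": user_id, "name": name}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return user_id

def relay(conn, publish):
    """Hand unsent events to `publish` in order, marking each one as sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

the difference from in-process dual writing is that the second system is only ever fed from the outbox table, so a crash between the two writes cannot leave one side updated without a record of the event. in this sketch an event could still be published twice if the process dies after `publish` but before the row is marked as sent, so consumers would need to tolerate duplicates.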
Item | Current | Lowest | Reviews |
---|---|---|---|
Designing Data-Intensive Applications: The Big Id… | - | - | 4.8/5.0 |
DDIA: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://smile.amazon.com/dp/1449373321/ref=cm_sw_r_cp_api_glt_fabc_8PJTSGQM37RPXW5KH0T6
Designing Data-Intensive Applications is an excellent book for learning about general system design concepts: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Just because your current systems aren't distributed, that doesn't mean they're useless experience. As a mental exercise, think about how you would scale them out. Try to implement small pieces of the concepts if applicable (obviously don't overcomplicate your team's systems).
Read AWS, Azure, and Google Cloud reference designs! They have a billion examples of viable designs for things like gaming, low-latency financial trading, big data, and more.
AWS: https://aws.amazon.com/architecture/ Azure: https://docs.microsoft.com/en-us/azure/architecture/browse/ Google cloud: https://cloud.google.com/architecture
Supplement that knowledge by reading:
The latter is for distributed systems.
If you want to see this applied, learn about the designs of:
Remember: you don’t know what you don’t know. But it’s good to learn new things!
Good luck!
Read this book first before you continue https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_awdb_imm_7GJ31XBN1T4FQ0P47PYS
There is a lot of required knowledge. I have always been interested in computing, but I only started studying it after high school and got serious about it in engineering school.
In addition to learning a programming language, you need at least a bit of algorithm theory. It helps a lot to understand how a computer and an OS work (something Linux-based such as Ubuntu, for example). Someone who dedicates themselves to it fully can reach that level in a little over a year, in my opinion. All the necessary resources are available online, but it's not easy to know where to start. A sub like r/learnprogramming (the sub's wiki) is a good starting point.
At more senior levels, knowing how to build resilient distributed systems is very important. All the necessary foundational knowledge is compiled in Designing Data-Intensive Applications (a book of about 600 pages).
It's pretty hard to cram for system design and nothing replaces real-world experience. That said, you can learn a lot really fast by standing on the shoulders of giants. If you're curious about the topic, here are some resources I found useful.
everything from objc.io, but especially their book on app architecture if you want to understand some of the most common ways iOS apps are architected and the tradeoffs between each approach. https://www.objc.io/books/app-architecture/
pointfree.co -- a lot of ideas about functional programming and they talk through many approaches that lead to better testability and modularity
Designing Data Intensive Applications -- for understanding how the backend works https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=sr_1_3?dchild=1&keywords=designing+data+intensive+applications&qid=1620402120&sr=8-3
Building Mobile Apps at Scale is a good read if you've never worked at a larger company or on a larger team but still want to understand some of the challenges faced in these environments: https://www.mobileatscale.com/
Is the full name of the book this?
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Is it the one in this link?
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
Below are some good resources:
If you’re a new grad or junior, interviews generally don’t focus on system design much. They gain increasing importance as you rise in seniority.
If system design is going to be a large part of your interviews, then you study it.
https://www.educative.io/courses/grokking-the-system-design-interview
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
Well, there are plenty of these. For example, my top picks are:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (basics of databases, system architecture and distributed systems).
- Advanced Data Structures by Peter Brass (many practical advanced data structures used in databases, IR and so on).
- The Art of Multiprocessor Programming (covers the theory of concurrent and distributed systems).
- Types and Programming Languages (type theory and the foundations of functional programming).
- Purely Functional Data Structures (the foundations of functional programming).
There are a lot of good books about generic design principles. My two favorites are Designing Data-Intensive Applications and Martin Fowler's Patterns of Enterprise Application Architecture. I'm also a huge fan of Domain-Driven Design.
Sure, Data Engineering's a little tough to get into since a lot of best practices are still actively being developed. The bad news is there isn't a golden path to follow. The good news is that at least there aren't bullshit boot camps to waste your time either.
This book was alright for some big picture stuff. Get very comfortable with SQL obviously. Know Bash well, and (whenever there are any automated processes you want to schedule, even if it's just trivial personal stuff for fun) learn to use Cron jobs. It's super easy, but getting used to the idea of scheduling and monitoring stuff is part of the bread and butter in data engineering.
Check out Glassdoor and look for relevant jobs. Job postings often don't ask for what they should, but it's still useful information to consider when deciding what to study. Knowing a cloud platform like AWS would be good, and Airflow is used fairly extensively in my own company at least.
Honestly though, the biggest thing you should do is just get your hands dirty with some personal projects that have an obnoxious amount of data to wrangle. I played around with this paper for example. A couple billion stars turned into a single image. I forget how much data there actually was, I think it was something like 250 or 500 gigs. Setting up a system to asynchronously download the next chunk, do what you need to do with it, and then delete the local copy until all chunks have been processed was an interesting project. Look for personal projects that interest you, but have very high data requirements. Even that's chump change compared to the stupid amounts of data you could end up working with in industry. Data engineering isn't all that rough, but you need project ideas that'll actually require some elbow grease to push you towards new methods.
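To give a flavor of that download-process-delete loop, here's a minimal sketch with `asyncio`. It is not the code I actually used; `process_all` and `fake_download` are made-up names, and the fake download just sleeps and returns a small list where a real one would fetch a file.

```python
import asyncio

async def fake_download(chunk_id):
    """Stand-in for a real download: pretend chunk N holds three numbers."""
    await asyncio.sleep(0.01)
    return list(range(chunk_id * 3, chunk_id * 3 + 3))

async def process_all(chunk_ids, download, process):
    """Process chunks in order while the next chunk downloads in the background."""
    results = []
    if not chunk_ids:
        return results
    loop = asyncio.get_running_loop()
    next_task = asyncio.create_task(download(chunk_ids[0]))
    for i in range(len(chunk_ids)):
        data = await next_task
        # Start fetching the next chunk before working on the current one.
        if i + 1 < len(chunk_ids):
            next_task = asyncio.create_task(download(chunk_ids[i + 1]))
        # Run the (blocking) processing step in a worker thread so the
        # event loop stays free to make progress on the download.
        results.append(await loop.run_in_executor(None, process, data))
        del data  # stand-in for deleting the local copy of the chunk
    return results

totals = asyncio.run(process_all([0, 1, 2], fake_download, sum))
```

Only one chunk is being processed and at most one more is in flight at any time, so local storage stays bounded at roughly two chunks no matter how large the full data set is.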
Start reading about what major companies are doing on the data engineering side too (Airbnb, Facebook, Netflix, etc.). Most companies aren't anywhere near so sophisticated, but getting used to the kinds of questions being asked, problems being encountered, and solutions being explored will help give a flavor for what you'll need to know.
Good luck! Unfortunately Data Engineering as it exists now is too new for there to be anything super obvious that you can just focus on learning (other than SQL of course) but that's part of the job security too. Once you're in, experience is worth a lot, so you shouldn't struggle with finding a job in the future at least, once you get your first two years or whatever in industry. Pays good too, which is always nice. I was where you are three years ago, it sucks... but it gets a lot better once you're on the other side. You got this, just gotta keep your sanity intact enough to keep pressing forward until it happens. I made it, and I didn't even have a finished BS. I got my connection through networking, impressed the right people with my abilities and here I am. There's a huge need for actual productive, competent engineers, so the right company will be equally grateful to have found you as you'll be to be found.
Not a 2020 book, but I read it in 2020:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
It opened my mind about going beyond data warehouses.
For distributed systems in general, you can’t go wrong with Designing Data-Intensive Applications
Designing Data Intensive Applications if you want something that covers a variety of data systems.
SQL Performance Explained is a great book if you want something more specific to understanding B-Tree indexes in traditional relational databases.
> Data Intensive systems book
Are you referring to this book? Seems like a good book according to Amazon.
Hello folks, I decided to switch jobs earlier this year and interviewed with multiple companies in Bangalore(India) and Toronto(Canada). The following are some of the companies I tried:
I hadn’t interviewed in the last 4 years and realized during my interviews that the interview landscape had changed much in that time. Wanted to share some of the significant differences I noticed:
LeetCode style questions/interviews: Most of the companies I interviewed with asked LeetCode-style coding questions (Find median of a stream, Find LCA in Binary search tree, etc). I understand that companies like Google, Microsoft, Amazon have always had a strong emphasis on data structures and algorithms, but I was surprised to find that most other companies are doing it these days. For example, I got a difficult question with graphs in my Intuit interview, and I know from friends who have worked there for many years that they never had to work with trees or graphs. There was less of an emphasis on CS fundamentals, programming language fundamentals, projects, etc during these interviews (than there was 4 years ago).
Phone/Video interviews suck less: A lot of the companies used Coderpad for the initial telephonic/video interviews. This was a much better experience than having to type code on a Google doc (or even worse, writing code on paper and reading it out to the interviewer on phone). The syntax highlighting and the ability to execute my code and verify it against my own tests made a whole world of difference (at least for me).
Companies in Toronto and Bangalore had/have very different interview styles/process - All the companies I interviewed with in Bangalore (with the exception of Atlassian) had a very strong emphasis on data structures and algorithms (Questions on Trees, Graphs, Dynamic Programming, LC medium/hard seemed to be common). However, the companies in Toronto seemed to focus more on general programming skills (problem solving, writing tests, familiarity with the language libraries, OO skills). Another difference I noticed was that the interviews were shorter and more organized with the companies I interviewed with in Toronto - None of my on-site interviews there were more than ~3 hours in length, and each was wrapped up within a day. In contrast, some companies I interviewed with in Bangalore held interviews on weekends, sometimes had me spend 6-7 hours on-site just to complete a couple of rounds, and required me to be on-site on 2 different days for the same interview.
When I was interviewing, I could not find interview experiences for many of these companies on Glassdoor(or anywhere else) and I did not know what to expect during my interviews. There are lots of interview experiences shared online for FAANG companies, but I couldn’t find much useful information for startups and non-FAANG companies in general. So, I documented my interview experience in my blog (https://www.soberkoder.com/interview/) and shared it with some friends who said they found it really useful. Hence sharing it across the bigger community here - Hope you find this helpful for your preparation, if you are planning to interview in the near future. (I feel the content in the blog is too big to share in this post, hence sharing a link)
My 2 cents: You can start by learning Spark. Personally, I think it's a great framework for learning how distributed data processing / streaming works.
Secondly I recommend this book, even if you don't have an interest in the field: https://www.amazon.com/Martin-Kleppmann/dp/1449373321/ref=sr_1_1?crid=1XYWI3UFVEW21&dchild=1&keywords=data+intensive+applications&qid=1603452406&sprefix=Data+inten%2Caps%2C212&sr=8-1
Thirdly don't set your goal to be "great", but rather to be "better".
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
save yourself the money, time, and grief.
Highly recommend picking this up and working through it Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_api_i_KiknFbMM0SAHS
Check out Martin Kleppmann's Designing Data Intensive Applications: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
This one has some good info:
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
And prob this:
Not OP, but I have heard good things about this book:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (2017) by Martin Kleppmann
Best big-picture book I can recommend would be:
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (2017) by Martin Kleppmann
On a side note. I am currently reading https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321. Loving it so far. Author clearly explains the difference b/w relational & document model.
Highly recommended.
Depending on what you want to do with the events, stream processing might be a better choice than event messaging. Here is a brief article on stream processing. https://wso2.com/library/articles/2018/05/what-is-stream-processing/
I got the idea of Stream processing from the book Designing Data Intensive Applications. https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/ref=sr_1_1?ie=UTF8&qid=1532356233&sr=8-1