I agree, but I will add a crucial part: learn python to do specific things. Don’t learn the language for the sake of knowing python. Learn it in an applied way, which means doing end-to-end projects.
OP, you’re in a perfect position to be able to level up quickly because you have data to work with: the data at your job.
I recommend using Python to do the stuff you mentioned already doing: pull data, clean it up, make some visualizations, build models using scikit-learn/statsmodels, and report model comparisons in a visual way.
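To make that loop concrete, here's a hedged sketch of what "pull, clean, model, compare visually" can look like with scikit-learn; the file name, column names, and model choices are placeholders, not a prescription:

```python
# Hedged sketch of an end-to-end loop on a CSV export from work:
# pull, clean, model with scikit-learn, and compare models visually.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("company_data.csv").dropna()      # placeholder file name
X, y = df.drop(columns="target"), df["target"]      # placeholder target column

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

# Compare models with cross-validated accuracy and plot the results.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
plt.bar(scores.keys(), scores.values())
plt.ylabel("Mean CV accuracy")
plt.title("Model comparison")
plt.show()
```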
Hand-in-hand with all this, I would get one of the many “machine learning with python” books and work through it using the data from your company. Not only will you learn the material faster because you are contextualizing the new concepts with data you understand, you’ll be able to impact your company with the assets you create as you learn. I found this book to be particularly nice, though you have many options.
Hope this helps!
Do you have a green card?
If you are going to use OPT, volunteering is not a good idea. You'll use up your OPT on that (unless you have yet to graduate). I'd start by applying for internships. Many FAANG companies have internship positions for graduate students. Others have one-year or short-term positions that work like postdocs. But it's experience, and I'm sure they keep people they like. I've seen that Facebook has this type of program.
Also, you should contact people you know to pass on your resume and maybe create a portfolio on GitHub.
The "Concepts" part is too basic. I'd change this and include whatever keywords the job ad uses. Check out this book or their podcast: https://www.amazon.com/Build-Career-Science-Jacqueline-Nolis/dp/1617296244
Great guide for building crisp, clear data visuals: https://www.amazon.com/Street-Journal-Guide-Information-Graphics/dp/0393347281
Story telling, try this structure: https://www.richardhare.com/2007/09/03/the-minto-pyramid-principle-scqa/
If you want to grow as a ~~scientist~~ practitioner, try Guerrilla Analytics: A Practical Approach to Working with Data
If you're looking for pleasure reading, check out The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century
The Art of R Programming is, hands down, one of the best books on R. It is simply an excellent, well-written description of how R works. That’s all you need to know.
You wouldn’t go read a book on Pandas to learn python, so don’t read a book on data science in R to learn R, as many people here have suggested you do.
http://www.scholarpedia.org/article/Policy_gradient_methods is a good overview of those equations. An introductory RL course will cover them, so if the role involves any RL this feels very fair. RL is sometimes covered in an intro ML/AI course, although that varies since there are a lot of other topics to cover.
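For reference, the equations that page walks through center on the policy gradient theorem; a minimal statement (written from memory, so double-check against the article):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]
```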
Data science is a very broad term. There are a lot of different subfields, and many of them will never touch RL. I think broad knowledge of the basics is useful, but there will likely always be subfields you know nothing about. RL is pretty common in robotics.
While Spark may seem shiny, it's overkill for small-to-medium data science projects. Using it in standalone mode on your local computer to practice thinking in map-reduce isn't a bad idea, but it may be hard to build a compelling project out of it.
Spark really is about large scale data. So I'd use it to explore large datasets on AWS. Insight has a post on how to do this - http://blog.insightdatalabs.com/spark-cluster-step-by-step/ - and I'd check out the AWS large datasets collection too - https://aws.amazon.com/public-datasets/
But if your data is less than 20-30 gigabytes, Spark really is overkill. If anything, figuring out how to write efficient Python (or R, etc.) code to analyze ~20 GB of data will force you to become a better engineer & data scientist (compared with using Spark to easily / quickly process 20 GB of data).
I do a lot of NLP professionally. We've evaluated pretty much every library on any list you can find online.
NLTK is alright, though results tend to be average at best and installation has been a pain on occasion. Gensim is far more limited, though good at what it does. Ultimately, we gravitated towards spaCy, because it's so much better in almost every way imaginable, from installation, ease of use, and model deployment to GPU support, and considering the massive amount of analysis it does by default, it maintains spectacular runtime performance. It's probably one of our top 3 most used libraries alongside TF (no PyTorch here, for production reasons).
Our team used to have more people writing R, but earlier this year, we got to a point where it became more trouble than it was worth and we dropped it entirely. Performance was generally worse and there wasn't anything it was providing us that Python didn't have. There was plenty of stuff in python we weren't about to leave behind, because there was no R equivalent. I will say R Shiny was nice to have, but after we discovered Plotly's Dash framework it was incredibly easy to let go.
That's just our team though, everyone should do whatever is most effective for them.
So don't panic. You can pick up a working knowledge of HTML and CSS in an afternoon. Especially given that you don't need a particularly deep knowledge of them to build a web crawler. The main problem (as is often the case when teaching yourself) is psyching yourself out and going off on a bunch of tangents and down a few rabbit holes you didn't need to explore.
I'd start here and do the html bit and maybe the first 5 'lessons' of the css bit. The main thing to know about css for this application is just selectors, so this should be sufficient. These lessons are also very short so this won't take long.
Then you should have a reasonable idea of what elements to look for in an HTML document when parsing it. As for making the crawler itself it will depend on what specifically you need it to do, but you can find plenty of tutorials online for building things with selenium or beautifulsoup.
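To show where the CSS-selector knowledge actually gets used, here's a hedged minimal sketch with requests and BeautifulSoup; the URL and selectors are placeholders for whatever site you end up crawling:

```python
# Minimal crawler sketch: fetch a page, pick out elements by CSS selector,
# and collect links to follow. URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

def crawl(url):
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    # CSS selectors are the main thing the CSS lessons buy you:
    for item in soup.select("div.article h2 a"):
        print(item.get_text(strip=True), item.get("href"))

    # Links to crawl next (a real crawler would filter and deduplicate these).
    return [a["href"] for a in soup.select("a[href]")]

if __name__ == "__main__":
    next_links = crawl("https://example.com")
    print(f"Found {len(next_links)} links to follow")
```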
Ah. You'll need to monkey with path and settings variables if you want to use something like the python extension.
Microsoft has really good documentation for VS Code. I'd recommend using it heavily. Here: https://code.visualstudio.com/docs/python/python-tutorial
I actually just made a post about this book today. It was a good book when it was first released, but doesn't appear to have kept up with the pace that Pandas has developed. There's a second version which was released relatively recently, and even that doesn't mention some not too new features, and does reference some things that are highly outdated.
I've heard good things about Python Cookbook
Try installing a package called pandas-profiling (link to GitHub). Pretty amazing one. It can export to HTML or you can preview it in JupyterLab.
Plus, I felt having Kite (link) with the JupyterLab extension mildly alleviated the pain of autocomplete. 👍
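A minimal sketch of the pandas-profiling workflow, assuming the package is installed (the DataFrame and file name are placeholders):

```python
# Generate an exploratory profiling report from a DataFrame and export to HTML.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("my_data.csv")                  # any DataFrame works
profile = ProfileReport(df, title="EDA Report")
profile.to_file("eda_report.html")               # open in a browser or JupyterLab
# In a notebook you can also preview it inline:
# profile.to_notebook_iframe()
```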
A Galton Board? (See it in action).
You might find it for cheaper though. That's a typical gift for someone quite into statistics.
I don't "know", it's just a guess. If you're half as good at Excel & VBA as you claim to be, that seems like a small number in NY. Your job title is non-standard so I'm not sure if glassdoor's info is accurate.
Intro Stat Learning w/ R (uses R)
Ng's Coursera Course (uses Octave, unfortunately)
Python or R are both great.
I glanced at "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurelien Geron and thought it was quite good, but I have not had a chance to read it deeply yet.
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
It's software that allows non-technical people, people who don't code, to analyze data through a graphical interface.
It generally contains data tables (prepared by technical people) and functionality to combine tables, create analytic graphs, and create dashboards (a mix of various tables, calculated results, and graphs, generally fitting on a single screen).
The Tableau video demo gives a good summary of what it looks like: https://www.tableau.com/#hero-video
Caveat: I'm no expert on pytorch but your question looked interesting so I read up on the documentation a bit.
Your code is not doing what you think it's doing. PyTorch multiprocessing is a wrapper around Python's built-in multiprocessing, which spawns multiple identical processes and sends different data to each of them. The operating system then controls how those processes are assigned to your CPU cores. Nothing in your program is currently splitting data across multiple GPUs.
To use multiple GPUs, you have to explicitly tell PyTorch to use a different GPU in each process. But the documentation recommends against doing it yourself with multiprocessing, and instead suggests the DistributedDataParallel wrapper for multi-GPU operation.
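Here's a hedged minimal sketch of what that looks like, assuming a single machine with multiple CUDA GPUs; the toy model and data are placeholders:

```python
# Minimal DistributedDataParallel sketch: one process per GPU, each pinned
# to its own device; gradients are synchronized across processes on backward().
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each process pins its model to a different GPU.
    model = torch.nn.Linear(10, 1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(32, 10).to(rank)     # each process would load different data
    targets = torch.randn(32, 1).to(rank)

    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()                            # gradient all-reduce happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```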
Consider Kibana. Free and open source, allows you to build dashboards and visualizations without writing code.
I work with people all the time who brand and repackage.
He's wrong in basically saying that you have to be a full stack developer.
What he's right about is that being able to visualize your work is valuable and something worth investing time in.
Lots of data science programs will require at least one course on visualization. Something like D3 isn't uncommon and doesn't require a full stack of web dev experience.
someone used this chart as a reference (slide 54).
Use privacy.com. It allows you to create separate credit cards (linked to your checking account) that only work with a single store. Furthermore, you can set hard limits on the card (either per month or per transaction). If you only want to spend $30 per month on AWS, set your monthly limit to $30. Once it hits the limit, it will decline future charges.
There is an API over at CryptoCompare, with all kinds of variables across different currencies and with a decent historical record.
I don't ever really touch Python, but I may be able to help you with forming a question. Feel free to PM me.
I have been taking various classes on all three platforms. They are more or less the same, and it just depends on what exactly you are looking for. If you just want to learn and gain expertise in data science, take any course. If you want an actual certification showing you completed the classes, you can pay for it. Personally, I think they are a bit overpriced, but you are getting an education from a top university for a fraction of the cost, and they need paid subscriptions to keep the sites running. Initially, I started using this Trello board, https://trello.com/b/rbpEfMld/data-science, but I'm also doing the specializations on each platform for free.
You forgot the Data Science specialization offered by Johns Hopkins: a nine-course linked series that runs from getting set up with Git, through machine learning, to producing a data product and releasing it to the world.
Airflow is great to hook those services. We are transitioning to integrate it with Kubernetes to have more flexibility in our architecture.
https://kubernetes.io/blog/2018/06/28/airflow-on-kubernetes-part-1-a-different-kind-of-operator/
They updated TOS last year to more aggressively define commercial use. Everyone on our team has a license now, even though we are small (<5).
“ We clarified our definition of commercial usage in our Terms of Service in an update on Sept. 30, 2020. The new language states that use by individual hobbyists, students, universities, non-profit organizations, or businesses with less than 200 employees is allowed, and all other usage is considered commercial and thus requires a business relationship with Anaconda.”
https://www.anaconda.com/blog/anaconda-commercial-edition-faq
Additional R and RStudio cheat sheets.
Looks like the "Advanced R Cheat Sheet" is the one produced by RStudio so that one is a duplicate.
I suggest you check out the book Helping: How to Offer, Give, and Receive Help by Edgar Schein.
Mentoring is ultimately a social process, and if you've never done it before, having some strategies at your disposal like the ones Schein outlines can be invaluable.
Hey there!
Nope, I did not interview Hastings! I was reading his book No Rules Rules about the culture at Netflix and found this part super interesting.
I posted this in this sub (as well as others) b/c generally tech folks like to discuss workplace culture and contrast + compare their own situation with everyone else. Let me know if you think there are some better subs to post in!
If you're interested in the underlying theory of probability and statistics I'd recommend not jumping straight to reference texts like Elements of Statistical Learning and All of Statistics that people usually suggest.
I'd go for a book that tries to teach from first principles without sacrificing mathematical rigor. A personal favorite is
https://www.amazon.com/Primer-Econometric-Theory-MIT-Press/dp/0262034905
there are some free sample chapters online, although they don't do justice to the sheer beauty of the typesetting in the physical copy
I'd consider starting from the problem instead of the solution. What data problems have you encountered that traditional, simpler data science tools have failed you on? Then scale up as necessary (the story of scaling up you tell will also be helpful in an interview or when documenting a portfolio project that will help you land an interview).
Also, you're better off learning how to use hadoop / spark on cloud instances with many servers. The bonus is that most cloud providers host large datasets that you can use for free https://aws.amazon.com/public-datasets/
Take some classes!
Coursera has some great free classes.
This one is awesome: https://www.coursera.org/course/statistics
You get a good intro to stats as well as some R programming.
Here's a whole set of courses for an intro to data science: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage
All for Free! Go dabble. Worst case, you learn some stuff. Best case, you start to develop some skills.
Erin Shellman (Research Scientist II at Amazon Web Services), writes on her blog: "I hiiiiighly recommend the Biostatistics bootcamps from Johns Hopkins. They are an excellent review of the first year of a graduate level statistics program. Don’t spend too much of your time watching the lectures. Instead test yourself with the quizzes and assignments and watch the videos in areas where you are weak. "
The link to the bootcamps is here => https://www.coursera.org/course/biostats
The link to her blogpost is here =>http://www.erinshellman.com/crushed-it-landing-a-data-science-job/
I have also put together a Beginner's Guide for people who have little or no experience in R programming (or any programming for that matter).
I hope this is of benefit to the research community. If anyone has questions or feedback, please feel free to PM me.
Depends on what kind of data you have and what you want to use it for. It's really about choosing a database model (e.g., relational vs. document store) and then choosing a particular implementation (e.g., postgres vs mysql or mongodb vs couchdb).
You mentioned CSVs, so I assume you are working with columnar data. You also mentioned the data is "small" (which in this case I'll take to mean "fits in memory" and "homogeneous records"). So, unless you have an exotic use case, your best bet is probably a relational database.
Popular relational databases include sqlite, mysql, and postgres. Choosing between them depends largely on your operational requirements (what your application will be doing with your DB, what your expectations for data consistency and performance are). Local or on a server? One or multiple users? Fully ACID? Fully SQL compliant? High write/read ratio? Distributed? etc....
edit:
Here is a comparison of relational dbs: https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems
As one commenter mentions, a key/value store is possible in postgres but not in the others. I've never used postgres for key/value, but theoretically that's pretty awesome.
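Since your data is small and starts as CSVs, here's a hedged sketch of the simplest route (sqlite plus pandas); the file and table names are placeholders:

```python
# Minimal sketch: load a CSV into a local SQLite database and query it with SQL.
import sqlite3
import pandas as pd

df = pd.read_csv("my_data.csv")               # columnar data that fits in memory
conn = sqlite3.connect("my_data.db")          # creates the file if it doesn't exist
df.to_sql("observations", conn, if_exists="replace", index=False)

# Query it back with plain SQL.
result = pd.read_sql("SELECT COUNT(*) AS n FROM observations", conn)
print(result)
conn.close()
```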
The curriculum looks very geared towards items you can list on your resume rather than enabling you to get to the point where you can build anything on your own. If you want to get to machine learning and build projects, Hadoop and Tableau, among others, aren't useful at all, and they don't really have any machine learning projects for you to work on. They do appear to cover some ML models in the first section, but they don't appear to focus enough on data cleaning, etc, for you to be able to do much on your own.
Honestly, I'd save your money. If you're looking for good starting points, this quora thread has a lot of potential routes. I'm also working on a site, dataquest.io, that might be useful to you.
The author is 'she' ;-).
I'm pretty sure you could solve the problem in any programming language that's out there. As for the Star Wars one, F# has a few nice things that make it easy to write (powerful type inference, REPL, etc.), but some other languages have them too. The article uses the R type provider (for typed access to ggplot), which is something that only F# can do (with this degree of integration), so that bit would have to be done differently. As for the James Bond post, that one uses type providers (the HTML type provider), which is really neat and does not have an equivalent elsewhere.
So, you can of course do everything in any other language, but I think F# has some nice features that make it a lot easier and lead to more correct & efficient code. (Good resource here is the F# testimonials page.)
I’m fascinated by this related book Law as Data, but haven’t gotten to read it yet.
There really isn't one book.
When I interview potential interns, I'm looking for
– evidence of solid programming skills; in particular one of Python or Java for building systems, plus one of Python or R for modelling, plus reasonably strong SQL. Anything else you can do is a bonus
– a clear history of quantitative work with real data (here's where most computer science undergrads fall down; "traditional" scientists, social scientists and economists tend to fare well here)
– demonstrated experience with machine learning (you don't need to get fancy here; if you really know how linear regression and k-means work you're going to be useful)
– some product intuition
– the ability to translate all of the above into language a marketer or bizdev exec will understand
So ISL is a good recommendation; I'd pick up something solid on database fundamentals, something on experiment design, something on product design (Don't Make Me Think is a good book), and something on business fundamentals (Getting to Yes is a classic – and useful in your everyday life next time you have to negotiate something).
It's a lot to ask for, but that's why the roles are hard to fill. Data science is, like product management, a cross-cutting/glue role in a lot of companies and you have to be willing and able to wear many hats.
If you want a structured course to follow, I would recommend Udacity. It is structured much more like a curriculum than DataCamp. The best part of Udacity for me was the projects. They essentially just give you a prompt with some research questions and a few guidelines, and let you work it out on your own. You then submit it to be reviewed by someone at Udacity, and I was surprised that they actually give solid feedback. In my opinion, DataCamp holds your hand too much during its projects.
Udacity says on average it takes students 4 months to complete the program, which is fairly accurate, I did it in like 3 and a half months. Price tag is a bit steep, but I did feel like I got more than my money's worth.
I am unaware of anything in Python like lme4 for fitting generalized linear mixed effects models (such as logistic regression with mixed effects). Options, AFAIK, are calling R from Python or using something like Stan or PyMC3 to fully specify a Bayesian model.
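A hedged sketch of the call-R-from-Python route via rpy2, assuming R, lme4, and rpy2 are installed; the data frame, formula, and column names below are made up for illustration:

```python
# Fit a logistic regression with a random intercept via lme4, called from Python.
import pandas as pd
from rpy2.robjects import Formula, pandas2ri, r
from rpy2.robjects.packages import importr

pandas2ri.activate()                 # automatic pandas <-> R data.frame conversion
lme4 = importr("lme4")

df = pd.DataFrame({
    "y": [0, 1, 1, 0, 1, 0, 1, 0],
    "x": [1.2, 3.4, 2.2, 0.5, 4.1, 1.9, 2.8, 0.9],
    "group": ["a", "a", "b", "b", "c", "c", "d", "d"],
})

# Logistic regression with a random intercept per group.
model = lme4.glmer(Formula("y ~ x + (1 | group)"), data=df, family="binomial")
print(r["summary"](model))
```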
Pandas has the ability to read data in chunks. I would probably do that and then try HDF5. Here are some performance differences between the various options.
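A hedged sketch of that chunked-read-then-HDF5 approach (requires PyTables; file names, chunk size, and key are placeholders):

```python
# Read a large CSV in chunks with pandas and append each chunk to an HDF5 store.
import pandas as pd

with pd.HDFStore("data.h5") as store:
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        # Clean/filter each chunk here before storing it.
        store.append("records", chunk, index=False)

# Later, read the whole table back (or query a subset) from HDF5.
df = pd.read_hdf("data.h5", "records")
print(df.shape)
```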
> Microsoft may make a point and click interface to load data into R or connect R to SQL Server or their Azure stacks. They may even try to create a point and click wrapper for R
It's already there and been in the works for over 2 years now. It's called Azure ML - http://azure.microsoft.com/en-us/services/machine-learning/
There are 3 components to what you are looking to do:
1) Scrape data and place it into a database.
Potential Book: "Mining The Social Web"
If you aren't looking to go that formal, the best API to start playing with in this regard is Twitter's, mainly because there are approximately 5,000 guides on how to do anything with it. For example, for how to pull data and put it into a database, take a look [here](http://stats.seandolinar.com/collecting-twitter-data-introduction/), and see the sketch at the bottom of this comment.
2) Using R to visualize the data with the help of D3
Accessing the database itself is something that can definitely be done within R. It's strongly dependent on the database you choose, but the mechanism is generally: 1) get the library that connects to your particular database, 2) connect to said database, 3) run a query that returns some kind of data frame.
Then, you can use shiny as a mechanism to take your data and calculations from R and make dynamic and interactive charts and graphics. Lots of documentation exists for connecting it to D3.
As an alternative, especially if all you will be doing is very basic table manipulation, you can just have R output a CSV and do all the work on the frontend.
3) How do I do this on mobile?
I'm afraid I can't help with that one.
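Picking up on part 1, here's a hedged sketch of pulling tweets and storing them in SQLite, assuming tweepy 4.x and your own Twitter API credentials; the keys, query, and table name are placeholders:

```python
# Pull tweets with tweepy and append them to a local SQLite database.
import sqlite3
import tweepy

auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"
)
api = tweepy.API(auth)

conn = sqlite3.connect("tweets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, created_at TEXT, text TEXT)"
)

# Grab the 100 most recent matches for a search query and store them.
for status in tweepy.Cursor(api.search_tweets, q="data science", lang="en").items(100):
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
        (status.id, str(status.created_at), status.text),
    )
conn.commit()
conn.close()
```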
OpenRefine (previously Google's) is extremely good at clustering text, among other things. It has the only available implementation of Metaphone3 (basically, the creator is a little particular about its intellectual property).
The con is that it's a GUI; you can run it as a server, but it cannot simply be imported as a library from other code.
> Tools
You probably have sufficient mathematics and data pipelines experience.
Refresh your linear algebra and convex optimization if you're rusty, and work through the microeconomics book by Hal Varian to get a feel for the mathematics behind utility curves (this also helps beef up your chops for explaining the math in a simple way to execs). Some time spent in Introduction to Algorithms is also well spent.
The Algorithm Design Manual by Skiena. It's what an algorithms book should be: it focuses on relatability and on helping people solve their problems smarter. The biggest trouble in big data is knowing whether the algorithm you are proposing is feasible or whether there is a better way. Skiena gives you the tools to determine that.
Do you think the JHU/Yelp/Swiftkey data science program at coursera would give enough knowledge to get a job/internship in the field?
I have a bachelors in unrelated field but would like to make the switch (need money to fund full degree somehow and I'd like to get a job in the field first so I'm kind of hoping it is). Here is the link if you aren't familiar with the program. Thanks in advance!
I just graduated with a BSc in Math. I really want to get into data science. I have 2 years of research experience in mathematical modelling (disease epidemics, not really data) using MATLAB and R. I was thinking of doing this specialization at Coursera: Machine Learning (UofWashington). Is this a good idea to get my foot in the door?
Yes. For my own personal use, I use Trello. I can organize it however I want and I keep boards for personal things and work-related items.
Professionally, we use Jira/Confluence Kanban boards. The software devs are in agile teams and each have their own boards. I haven't needed to dive into their process (yet) but will eventually be pulled in when we start embedding models into our products and software.
Edit: changed "lean" to "agile" because I don't really know but only suspect that I know.
Tableau has some decent learning materials on their website. You can use those materials to become certified as well - since it is a platform specific certification there may be a bit more value than general certificates.
For visualization jobs / internships, this is the best place to look => https://groups.google.com/forum/#!forum/data-vis-jobs
For any full time job, you can always write the person and ask if you can be an intern in their group.
Here is the link to get more details: https://www.udacity.com/course/data-engineer-nanodegree--nd027
They are currently at $1195 for 5 months, they do offer "Pay as you go" option as well which is $269 per month.
I would suggest going for the per month option.
YouTube has an Alexa rank of 2. That's bigger than every social media site in existence. The only site that beats it is google.com itself.
I don't think you can reasonably say that nobody cares about YouTube.
So, if you are starting from scratch, I would recommend going with the Elastic Stack. It will give you the ability to grab data from your sources, transform it on the fly, store the data, and report on it.
The big advantages here are:
Here is a nice tutorial that takes you through things end-to-end.
BTW, I work for Elastic.
Johns Hopkins has a certification course series on Data Science through Coursera: https://www.coursera.org/specialization/jhudatascience/1
I'm taking it right now and, while it's not going to teach me EVERYthing I should know as a Data Scientist (i.e. SQL, Javascript, linear algebra, etc.), I think it gives me enough basics in R, dataset manipulation, statistical inference, regression modeling, and machine learning to begin to have a solid understanding of what is required of a Data Scientist.
You can take the courses for free if you don't need the certification and just want to see the material and lectures.
If you are going to focus on python:
Do this: http://www.codecademy.com/en/tracks/python
Then this: http://cs109.github.io/2014/pages/schedule.html
By doing that, you will realize if you like programming in python, and then you'll get through a very solid course on Data Science using python.
Both of those resources are free.
Realistically, if you don't think it is "fun" to program/do stats/etc., I can't imagine you enjoying the practice of data science.
Go and look at any job website where people want to hire data scientists.
Here's an example: "Indeed + Data Scientist + San Francisco": http://www.indeed.com/q-Data-Scientist-l-San-Francisco,-CA-jobs.html
Look at any of the job postings. There's one for Reddit -> https://jobs.lever.co/reddit/65a58292-767e-4a0b-aaee-26d4dd527f63
From that job posting there is a description of part of the job -> "Help build graphical representations of the relationships between reddit entities (communities, users and submissions)"
BAM! There's a potential thing you should do - start a website that takes in data and builds a graphical representation of relationships between reddit entities.
Once you have the website done, you can turn around and offer it as a service to Reddit. Since you know that they need this exact thing.
Lastly, if that doesn't appeal, look at any of the other 1000s of data science jobs to see if there is anything that appeals.
(also, looking at your posting history, you are also interested in crypto and security software. Perhaps look at an intersection of domains?)
Run your code through some static analysis tools:
https://stackoverflow.com/questions/35470/are-there-any-static-analysis-tools-for-python
Once you've had to fix the kinds of things these tools flag enough times, you'll start avoiding them automatically.
We're also looking at plot.ly in order to move away from Django. The sheer cost of configuring and setting up different UI elements is so onerous with these frameworks. I just want an input form and a table output with some charts. It's mad that this doesn't exist more widely.
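For that "form plus table plus chart" use case, here's a hedged minimal Dash sketch (assuming Dash 2.x; the data and layout are placeholders):

```python
# Small Dash app: a numeric input form, a table output, and a chart.
import dash
from dash import dcc, html, dash_table, Input, Output
import pandas as pd
import plotly.express as px

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 15, 30]})

app = dash.Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="threshold", type="number", value=0),        # input form
    dash_table.DataTable(
        id="table",
        columns=[{"name": c, "id": c} for c in df.columns],   # table output
    ),
    dcc.Graph(id="chart"),                                     # chart
])

@app.callback(
    Output("table", "data"),
    Output("chart", "figure"),
    Input("threshold", "value"),
)
def update(threshold):
    # Filter rows by the threshold typed into the form.
    filtered = df[df["y"] >= (threshold or 0)]
    return filtered.to_dict("records"), px.line(filtered, x="x", y="y")

if __name__ == "__main__":
    app.run_server(debug=True)
```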
I'm not sure of your use case, but have you considered OpenWeatherMap? There are a few different API calls and you can make up to 60 calls a minute for free. I used it to make a Twitter app that tweets the weather for my state in emoji form.
It works with JavaScript or can export to an XML file, if that helps.
I think in general it's easier to get a job, work there a while, then go remote. Hard to convince a hiring manager to agree to remote with someone they don't know, unless the job is advertised as such to start.
There do appear to be some jobs that explicitly say remote is ok though (for example)
I'd skip Packt and go for Safari:
https://www.safaribooksonline.com/pricing/
They get most of the Packt content anyways, plus many other publishers. They also have mobile apps for iOS/Android with offline viewing capabilities.
> is it possible/ worth it to buy a mac and install Ubuntu on it?
Yes, but you'll never be happy with the touchpad without macOS. Aside from the various desktop environments, most of the GNU and OSS stuff is available through Homebrew. I have a Late 2013 Retina 13" with 8GB and 256GB. If I was going to replace it with your funds, I'd look for a 2015 Retina with fully-loaded specs. I'm spoiled by the touchpad and the display. I find linux to be much more tolerable in a VM on Mac. At least then I can count on the touchpad doing what I expect.
This is awesome. Have been doing some of the tutorials and read through part of the how-tos.
Does anyone here know where I can get the Tensor~~Flow~~Board visualization tool? It is mentioned in one of the howtos, but I can't find it anywhere.
EDIT: Never mind, it was included in the default installation but I simply couldn't find the script's location. I had to do > python /usr/local/lib/python2.7/dist-packages/tensorflow/tensorboard/tensorboard.py --logdir=path/to/log-directory
Automate the Boring Stuff with Python is a great book/site. If you're already experienced with programming, skip to the sections you find most interesting.
Additionally, here's a Pandas (Python's table library) tutorial.
I work for Google - and I love this post - thanks for sharing it!
Let me also suggest taking a look at this open source dashboard platform:
(talk to /u/arikfr to learn more about it)
I mostly use Redshift as the data warehouse, and currently Luigi for the ETL process (but previously I've used Azkaban).
With the orchestration stuff you can definitely just test it out yourself. The same with Redshift here: https://aws.amazon.com/redshift/free-trial/?nc1=h_ls - the SQL is very similar to PostgreSQL - but the important things to learn are the use of distribution and sort keys, column encodings and the need to vacuum and analyze the cluster.
Honestly, if you already know SQL I would just recommend testing the technologies directly and using their documentation. Set up your own small ETL project for reading some Twitter data or other APIs for example. Then you can add real-time analysis using something like Kafka, etc.
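To make the "small ETL project" idea concrete, here's a hedged Luigi sketch; the API URL, field names, and file layout are placeholders for whatever source you pick:

```python
# Tiny Luigi ETL pipeline: pull JSON from an API, then write a cleaned CSV.
import datetime
import json

import luigi
import pandas as pd
import requests

class FetchRaw(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"raw_{self.date}.json")

    def run(self):
        # Hypothetical API endpoint; swap in Twitter or whatever you're using.
        resp = requests.get("https://api.example.com/records",
                            params={"date": str(self.date)})
        with self.output().open("w") as f:
            f.write(resp.text)

class CleanRecords(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return FetchRaw(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"clean_{self.date}.csv")

    def run(self):
        with self.input().open() as f:
            records = json.load(f)
        df = pd.DataFrame(records).dropna()       # minimal "transform" step
        with self.output().open("w") as f:
            df.to_csv(f, index=False)

if __name__ == "__main__":
    luigi.build([CleanRecords(date=datetime.date.today())], local_scheduler=True)
```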
You can use Redash for visualization: https://redash.io/
It's free and open source too, I'd really like to contribute to that as I've contributed to Luigi.
It doesn't sound like you are on a deadline, so I'd recommend DataCamp as the easiest set of tutorials to get your feet wet. They are basically short "fill in the blanks" exercises that will familiarize you with the programming syntax and the most common problem-solving strategies. Data science is a very broad field and it's easy to get lost in all the things you need to learn. I would recommend starting with a course on Python and pandas and sticking with it at least until you can do everything you are able to do in Excel, but with code.
It really doesn't matter which tutorial/bootcamp you pick, but it helps to enter a structured learning program rather than wander around picking up books on dozens of subjects.
great advice.
> How to tell a story with data. This includes data visualization and how to communicate with non-technical people. How to boil a giant dataset into a one-page white paper and present it to an audience. To me, this is the #1 must-have skill.
Maybe add to this using a tool like Plotly's Dash to build an interactive presentation of the results of your number crunching.
> Basic programming skills.
So I've spent ages writing complex spaghetti code, only to find out later I should have done it in 1-2 lines using pandas plus some other Python library.
Even though Python is intuitive, many of its libraries have so many functions and so many different ways to do one thing (especially pandas!) that it's like Alice through the looking glass. So basic programming skills can be a time waster, especially when cleaning up messy data sets.
My team and I use Dash by Plotly. I don't know about a desktop application, but we use it as a web app (it's a Flask app). It's fairly new, I believe, but it's really cool and we like it for data viz/UI so far! Our customers are not data scientists, so I definitely think you can make a UI for non-data-scientist users.
The list of data analyst skills doesn't seem right to me. They don't need to know C/C++. VBA (which isn't on the list) would seem more useful to me. Here's an alternative list based on job descriptions.
SSMS... expensive? It's free for the developer edition. You can always connect to localhost and start a DB service without having to get a separate server spun up.
Microsoft has a similar developer-friendly business model as Oracle & Java. They have a very vested interest in making as many people familiar with their tools as possible, so employers can reliably get people with skills on their tech stack.
I came from academia and was looking to step immediately into a data scientist position, so it's a bit of a different situation. I assume you'll be looking for data analyst positions.
Even if you don't feel ready to start applying, take a look around at what jobs are available. Find some that sound exciting to you, and look at the required tools/skills. If you find they all want a lot of programming, try something like the cracking the coding interview questions: https://www.hackerrank.com/domains/tutorials/cracking-the-coding-interview
If you find they need something different, like Tableau or whatever, do a small learning project with them.
Don't worry about how confident you feel about interviews - some of your first few will likely go terribly and that's fine. Dealing with data sci interviews is a skill, and you'll develop it. If they go terribly, just make sure you learn something from it. If you go in expecting it to just be a learning experience I think it's a lot easier and less demoralizing when inevitably you get turned down a bunch before finding something.
This cheat sheet is amazing. There's a similar one for numpy, Matplotlib and seaborn as well.
Also, R users, there's a similar and equally amazing data wrangling cheat sheet on the official R Studio website. Here's the link for all their cheatsheets - https://www.rstudio.com/resources/cheatsheets/
Both of these cheat sheets are extremely useful while wrangling data.
Depends what you intend to do with it. I am also confused about the format of your data, but I'm assuming you'll want to merge the first and second rows into a single header row with descriptive variable names, and rearrange as needed. For example, if columns 2-3 are the same variables as columns 4-5, then cut them apart and bind columns [1,4,5] to the end of [1,2,3].
If you'll be working in R, have a look at the tidyr package (http://tidyr.tidyverse.org/) and this cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Heroku is a good solution for your needs. The Dash web app will run in what Heroku calls a dyno. Even the free hobby tier should be enough if you don't expect much traffic. The once-a-day scraping you need can be run as a cron task, which Heroku also supports. You'll need to dig in some more to confirm that scrapy can run inside a Heroku cron worker dyno. Alternatively, you could run the scraping on your laptop and upload the data to the database.
The PostgreSQL database will be the most expensive item in your setup. As you seem to have more rows than the 10k the free tier allows, you'll need the Basic paid tier at $9 per month.
I gathered all this from the Heroku pricing page: https://www.heroku.com/pricing
Don't be daunted by web dev - I think it's not harder on average than data science :)
>as I only have read access to our non-production database.
I suppose I should be rather fortunate to be given this position, everything is a sandbox at the moment. Although, that does mean I will get sand in my shorts from time to time.
My previous SQL experience was from a couple of college courses, and at my last company I used it to query user and location behavior data. Pretty simple queries, insanely simple actually.
Recently, though, my queries have become far more complex (by my standards, and they're probably not the most streamlined), but if anything, check out CTEs. Those have been super fun to play with, especially since I can knock out at least one more step by adding a total to the end of the results instead of having to total everything up separately. A CTE essentially creates a temporary table outside of the main query so you can query your query (from what I understand). Heard you like queries, bro.
This is a great tutorial. I would also recommend installing a Postgres GUI (my go-to is Postico) so you can explore the data in a more friendly way rather than just running the pre-written commands in the command line. Do various selects and joins, and play around with the data to feel comfortable with SQL.
I've found in general that interviewers don't expect personal projects for A/B testing as much as they want to see if you can explain how to conduct an A/B test for a particular hypothesis.
If you know barely anything about experimental design, I'd recommend reading through Howard Seltman's Experimental Design and Analysis book. It's aimed at social scientists but a great resource in general.
If you know basic experimental design but don't know a lot about online A/B testing I recommend Trustworthy Online Controlled Experiments by Kohavi et al. It covers a lot of concepts specific to online A/B testing like indications that the platform itself is buggy, designing experiments where the same user doesn't see both the control and test experiences, and indications that your results aren't trustworthy.
You've run into an interview problem I give out often: I hand you a data viz in shambles and ask you to fix it.
What we are generally looking for is the ability to isolate the key points of the analysis and communicate them clearly and visually.
If you're interviewing for UI/UX you should also be familiar with this book:
https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130
It's required reading on my team.
Been working through "Data Science From Scratch" by Joel Grus and it's pretty good so far. It assumes you have an introductory knowledge of python so if you know things like if-else statements, list/dict comprehensions, and a little OOP, you should be good to go.
https://www.amazon.com/gp/product/B07QPC8RZX/ref=ppx_yo_dt_b_d_asin_title_o00?ie=UTF8&psc=1
Edit to include github for book:
I recommend cross posting to r/gis to ask questions about how that community would handle this problem. Big stores like Wal-Mart and Target definitely do this using GIS and a data science workflow. Your data science solution will be different from a GIS analysis, but you may want to learn what tools that group suggests using (some have already been suggested here). And as others have also said, you might also need to pull traffic and other data to do the predictions you want to do.
Like temporal data, spatial data can be really interesting (read: challenging) due to correlation structures in your observations and variables. A GitHub repo for this book will have more info about spatial statistics that you may want to take a look at if you're very uncomfortable with the subject. That could be a good place to start.
Personally, I'd build a neural net from scratch using only Python builtins and deploy it via an API. You could build some utilities to convert it to ONNX if you're feeling hardcore. This book will help you with the neural net parts. This basically hits all the major areas except the front end.
I recommend deploying your model because it will get you comfortable thinking about production requirements and their assumptions. A lot of people can train models. Way fewer people can train them and expose them to the world. Ultimately our goal is to get people to use our work.
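To give a flavor of the "builtins only" part, here's a hedged sketch of a tiny one-hidden-layer network trained on a toy XOR dataset; the layer size, learning rate, and epoch count are arbitrary choices, not a prescription:

```python
# Neural net using only Python builtins: one hidden layer, sigmoid activations,
# trained by plain gradient descent (squared-error loss) on XOR.
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR

n_in, n_hid, lr = 2, 4, 0.5
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
w2 = [random.uniform(-1, 1) for _ in range(n_hid)]
b2 = 0.0

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, y

for epoch in range(10000):
    for x, t in data:
        h, y = forward(x)
        dy = (y - t) * y * (1 - y)                    # output-layer gradient
        for j in range(n_hid):
            dh = dy * w2[j] * h[j] * (1 - h[j])       # hidden-unit gradient
            w2[j] -= lr * dy * h[j]
            for i in range(n_in):
                w1[j][i] -= lr * dh * x[i]
            b1[j] -= lr * dh
        b2 -= lr * dy

for x, t in data:
    print(x, t, round(forward(x)[1], 3))              # predictions near 0/1
```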
So the two most applicable books in my mind are Python Data Science Handbook and Python for Data Analysis. They are both good for the quirks of handling data in pandas and numpy, and for using machine learning (although not deep learning) through sklearn. I'll link the Amazon pages, but I'm sure you can find them elsewhere.
https://www.amazon.com/Python-Data-Science-Handbook-Essential-dp-1491912057/dp/1491912057
I don't know much about training in AWS, but if you feel like you're spending too much time setting up your environment (turning on your laptop/desktop) or things like that, you can use a screen or tmux session and just SSH into your AWS instance using Termux from your phone.
I don't know if this might help you but when I train for days on my local machine I just monitor my progress from my phone. It saves time getting my desktop/laptop up and running.
Alternatively, there are also apps that can stream what's happening on your terminal. I've used Hyperdash previously, but it seemed like overkill for my purposes. Maybe it'll be useful for you.
I've been a one man team more often than not. My biggest challenges were overlooking obvious solutions and not having anyone to double check my work. In addition to the other advice I've heard, I'd add the following.
Learn best practices. If you are going to be writing code, read Code Complete and The Pragmatic Programmer. If you are going to work in SQL and build tables or whole databases, read Murach's SQL Server. Read some of Stephen Few's books on presenting information visually. If you are going to be working closely with business leaders, learn how to communicate effectively with executives, who tend to want short, direct answers.
Make systems to monitor and test your systems. Once someone rolled back a database without telling me and a subtle bug I had removed was reintroduced. As the only eyes on the problem it took months for me to notice the discrepancies in the data, and some poor decisions were made as a result. Now I build frameworks that will tell me if there is a new status code that wasn't there before, if a field I need isn't populating anymore, etc.
A spot request P3 instance will cost you about 25-cents (US) per hour. I personally use Keras to interface with TensorFlow on them, and I followed the instructions here: https://www.tensorflow.org/install/install_linux to install on my own vanilla Ubuntu 16.04 install. Worked like a charm, but it did take an hour or so to set up.
Amazon provides their own deep learning API which should save you the headaches of installation: https://aws.amazon.com/tensorflow/ but I can't speak to using that personally.
Certs can be great, but pretty much only for the skills you get out of them. If you don’t know SQL please please please prioritize it, it unlocks so many doors for you (as analyst, DS, DE, MLE, everything), here’s a free course, basic SQL is not that hard: https://www.udacity.com/course/sql-for-data-analysis--ud198
Do a Google search. This came up when I did:
As an aside, this subreddit is not for troubleshooting docker.
This Coursera course is on implementing machine learning tools in Python: https://www.coursera.org/learn/ml-foundations/home/info. I found it pretty helpful.
>my math is up to Pre-Calc
If you really want to get into higher-end data analysis, then you really should take more math. I would recommend machine learning - you can check out Andrew Ng's machine learning course, but it assumes some linear algebra knowledge. You can check out the preview of the first week, which includes some linear algebra review, and see if you can follow it. If not, then you should probably just focus on calculus and linear algebra. Then eventually you could get into machine learning or more complicated statistical modelling.
Like /u/AnExercise4TheReader said, you're asking to get the abilities of a graduate student without getting a graduate degree, but there are no real shortcuts.
Although it may sound tedious, I think these exercises are really valuable! Thanks for your interest in Python ML! Btw don't forget to take a look at the accompanying github repo with notebooks and plenty of additional info material.
Another course that may be useful to you (as an alternative to Andrew Ng's course, or a complementary resource) would be Pedro Domingos's "Machine Learning" on Coursera. It's a really long course, but he explains everything very, very well -- with a big focus on the "why" instead of the "how". Currently, the course is not "active", but you can still watch the video lectures here.
If you are going to drop any money on courses I would recommend the Data Scientist Path from Udacity.
1) Programming for Data Science
2) Data Analyst
3) Data Scientist
https://www.udacity.com/bertelsmann-data-scholarships
You might qualify for this. I don't know, but I think you will have to complete the math courses up to linear algebra before moving into a master's. I haven't seen many programs that bypass that math requirement, and a lot of the people accepted to the American programs are from math and engineering backgrounds. Good luck!!
Dr. Portilla, I signed up for this course and really like it (the level of detail is good, it is focused on what I need, and doing everything in Jupyter is great), and was wondering if you had a recommendation for a similar R course? I have tried getting into the Johns Hopkins course but find it a bit dense at times and/or irrelevant. Sorry to be off topic.