I've found that practical examples with student interaction usually work well for engaging them. Here are two examples that are pretty fun and illustrative (but they aren't "modern" statistics, so I don't know how useful you will find them):
The book by Andrew Gelman & Deborah Nolan contains these and more cool experiments.
You could easily move on to more modern and complicated statistics from there.
Yeah, I am trying to find new jobs and there's often fucking leetcode too. Statistical ML is hardly asked about, or only asked after that stage, and I personally have not made it that far because I can't solve the "maximum # of events in a given time period" type of problem, like this one: https://leetcode.com/discuss/interview-question/374846/Twitter-or-OA-2019-or-University-Career-Fair
However, not all DS jobs have leetcode. Some do take-homes or presentations, but the one that did a presentation was not in industry, it was in academia.
I was asked about logistic regression in an interview once, and the interviewer was a CS person probing my resume. He asked me what the output of logistic regression was and I said it's the class probability, and it seemed he didn't agree even when I emphasized that classification happens after the probability. Huge red flag. I blame sklearn's model.predict() for perpetuating these misconceptions, and then you look stupid for giving the right answer...
The book R for Data Science is really excellent and available free in an easy-to-navigate online form. I highly, highly recommend it.
And although I don't know Python, I think that starting with R is a good idea, especially if this will be extra-curricular. The language is made for data science specifically, and has a really great associated user interface via RStudio.
Statistical Rethinking: https://www.youtube.com/playlist?list=PLDcUM9US4XdM9_N6XUUFrhghGJ4K25bFc
Also has the book: https://www.amazon.com/Statistical-Rethinking-Bayesian-Examples-Chapman/dp/1482253445
Dumb question, but is this for hobby reasons, or do you have a specific scientific question you'd like to answer using statistics?
All the tests are "mathematically sound". Each one is based on different assumptions. I personally focus on understanding the assumptions and figuring out when real data is likely or not likely to match those assumptions reasonably well.
But if you want to geek out on the math, Casella and Berger is pretty much a classic. It's extremely thorough, and takes you through all the steps from basic probability distributions through the tests that are based on them.
Hi, I'm a PhD student in CMU Statistics. I can answer general questions about the program and curriculum, or at least point you at people who can.
Linear algebra is part of the required curriculum, as well as multivariate calculus. You'd only need linear algebra your second or third year when you take regression (36-401) and advanced data analysis (36-402), so learning it now may not be best, like /u/trijazzguy says.
Programming is definitely a good idea, though. Python is good if you take the time to learn the data packages (Numpy, Pandas, Matplotlib, etc.), but most of your courses will use R. But honestly it doesn't matter which language you learn, as long as you learn something you find interesting so you get practice thinking like a programmer. Find a little project you're interested in and write some code for it.
Also, take a look at our new majors. You can just do statistics, or you can combine it with economics, machine learning, or math. (I strongly recommend doing mathematical statistics if you're ever interested in going to graduate school or doing stats research -- the math preparation is essential.)
I'm not sure what else you can do to prepare. The CMU program is very good. Many undergraduates decide this means they need to take as many classes as possible every semester, so they spend all their waking hours doing homework and begging for extensions. Don't do that. Try to relax a bit and pick your courses strategically.
I highly recommend Lectures on Probability Theory and Mathematical Statistics by Marco Taboga. The proofs are rigorous yet concise, and the clarity of presentation is superb. IMO, this book is much better than Casella & Berger. The 2nd edition is available for free online, and the 3rd edition can be bought on Amazon.
In reading a bit about this disagreement I stumbled on this article, which to me seems like garbage.
> Some people might confuse logistic regression and a binomial GLM with a logistic link, but they aren’t the same.
Am I going crazy or are these exactly the same? Not to mention I've never heard people refer to a "logistic link" rather than a "logit link". This guy is also assuming 538 uses pretty basic models like linear regression, but I was under the impression he is doing something with hierarchical Bayesian models.
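For what it's worth, here's a quick R sketch (simulated data) showing that "logistic regression" is literally a binomial GLM with a logit link:

    # Simulated data: logistic regression *is* glm() with a binomial family
    # and logit link; predictions of type "response" are class probabilities.
    set.seed(42)
    x <- rnorm(200)
    y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
    fit <- glm(y ~ x, family = binomial(link = "logit"))
    coef(fit)                               # estimated coefficients
    head(predict(fit, type = "response"))   # predicted probabilities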
I do not know how much you are into cooking, but there is this concept of "basic sauces". Yes, there are a million sauces out there, and an eager student might roam about and learn each and every sauce there is, or he could learn that there are a few basic sauces of which all of these other sauces are variations. Sauce velouté is a white sauce based on flour and butter. From there on, it becomes any sauce you'd like.
I think a first step for you might be to read about the general linear model (not to be confused with the generalized linear model) and realize that all of these tests do the very same thing. A t-test is a special case of ANOVA, and ANOVA is a special case of linear regression. Then all of these different tests become different flavors of the same thing.
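To make that concrete, here's a small R sketch with simulated data; the equal-variance t-test, the lm() coefficient test, and the one-way ANOVA are all the same comparison:

    # Two groups of simulated data: same test, three different wrappers.
    set.seed(1)
    d <- data.frame(group = rep(c("A", "B"), each = 30),
                    y = c(rnorm(30, 0), rnorm(30, 0.5)))
    t.test(y ~ group, data = d, var.equal = TRUE)   # classic two-sample t-test
    summary(lm(y ~ group, data = d))                # same t statistic and p-value
    anova(lm(y ~ group, data = d))                  # ANOVA: F is the squared t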
For a non-technical introduction most people recommend Andy Field's "Discovering Statistics".
If you do not know what t-tests and the like even do, you might want to start with Introduction to the Practice of Statistics;
it has a lot of practice assignments
If you want to go the Bayesian route you might enjoy "Statistical Rethinking" by McElreath.
I'd suggest taking a look at the free Data Science Specialization offered on Coursera. I've only tried the first two courses, but found them challenging enough, with the added benefit of encouraging individual research to complete some of the assignments (they don't hold your hand like many MOOCs do). If I'm not mistaken, the courses are running throughout the whole year, so you can just sign up for whatever level you're comfortable with.
This is Jeff here, from Simply Statistics. Roger's course was designed to teach the mechanics of R. I know he made a pretty strong effort to help folks who didn't have much background, but obviously there is variation in backgrounds. He would definitely love feedback on the course.
If you want to learn the statistical component, my course in Data Analysis: https://www.coursera.org/course/dataanalysis is the natural continuation of Roger's course. Hope to see you in that one!
R.
Better than a lot of commercial software by many criteria, though it does involve some investment to learn.
By default, it is command-line driven, but I think it's worth learning to use it that way.
There are many, many resources available.
Humans are not intuitively good at probability and statistics, because of numerous cognitive biases. - Thinking, Fast and Slow
I took a lot of calculus and I use every bit of it. I did a masters in economics and used it pretty much everyday. I'm now in a stats PhD program and I certainly use it everyday. Calculus and linear algebra are probably the two most important math classes you can take for a PhD program.
I never took a class titled Numerical Analysis but did a quick Google search: http://www.scholarpedia.org/article/Numerical_analysis. It lists 3 main areas:
All 3 of these areas are useful and things I've done to some degree while in my PhD program. After reading through this webpage, I'd say it's a no-brainer to take the numerical analysis course.
Ironically, now is the time to suggest every seemingly poor undergraduate text on introductory statistics.
I own a Schaum's Outline of Statistics, which is essentially a Coles Notes of stats. It's cheap, it has the topics you might find in an intro to stats class, and I think it is as good as anything. You can probably find a free pdf if you look hard enough. And if you can't, DM me and I'll send you mine.
Multivariate Statistical Analysis: A Conceptual Introduction, 2nd Edition, Kachigan https://www.amazon.com/dp/0942154916/ref=cm_sw_r_cp_apa_i_QPipDb8AVMXYR
It's short, cheap (especially if used), and easy to read. Would recommend.
It doesn't really cover GLMs, however; it's more about the statistical fundamentals.
> And about the software itself, is it freeware?
Yes
> Where would be the best place to get the software?
You can download R itself from CRAN (https://cran.r-project.org/), but I'd also advise getting RStudio: https://www.rstudio.com/
I highly recommend Lectures on Probability Theory and Mathematical Statistics by Marco Taboga. The proofs are rigorous yet concise, and the clarity of presentation is superb. The interactive web format is available for free online, and the paperback format can be bought on Amazon. Another book that you can consider is the classic Statistical Inference from Casella & Berger. Personally I think Taboga is better than Casella and Berger.
The linear part of linear regression refers to the coefficients, not the variables. For example Y = aX + bX^2 is a linear model because it is a linear combination involving a and b. Y = a*b*X is not a linear model. You can fit a lot of models that are not linear using linear regression. The name is kind of misleading, I think.
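For example, a quick R sketch (simulated data) fitting a model that is curved in X but linear in the coefficients:

    # Curved in X, but still a linear model: linear in the coefficients.
    set.seed(1)
    X <- runif(100, 0, 3)
    Y <- 1 + 2 * X - 0.5 * X^2 + rnorm(100, sd = 0.3)
    fit <- lm(Y ~ X + I(X^2))   # ordinary linear regression handles the X^2 term
    coef(fit)                   # estimates of the intercept, a, and b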
Without knowing it, you're asking a gigantic question. You want to know how to fit regression models. That can take up two graduate level courses, if you're learning all the details. A good introduction is by Simon Sheather (Amazon Link). If you're a student, you can read that book for free from SpringerLink. There should be courses on regression modeling from Coursera and MIT Open Courseware, if you'd prefer that.
I'm sorry I can't answer your question directly. You really need to understand a little more about regression to build good models. For any given dataset, there's a handful of different ways, with varying degrees of validity, to model relationships among variables.
Applied Linear Statistical Models by Kutner is a far better reference for statistical modeling compared to ISLR/ESLR or any kind of "machine learning" text, but it sounds as though you did a stat masters since you're asking about stat modeling instead of the new buzzwords. The latter options are certainly more narrow.
https://www.amazon.com/Applied-Linear-Statistical-Models-Michael/dp/007310874X Considered a cornerstone, of sorts.
It's available in a Wayback Machine snapshot from 2013.
Edit: Gold! Aww, thanks. You wouldn't believe it. Almost one month ago, I received gold for the first time, which is soon expiring. I just got the notification about it. And now this. :-)
I would highly recommend this book to anyone who wants strong fundamentals in linear regression: https://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850
It presents these concepts very plainly, deliberately, and has exercises to demonstrate these fundamental differences and drive understanding home. It is a super no nonsense approach to the topic.
It was used as the main text for my grad level linear models course at UC Berkeley.
To answer the question: errors are assumed to be normally distributed as part of the data-generating process, and residuals should be roughly normally distributed as a useful diagnostic (e.g. heteroscedasticity of the residuals is an indicator that the underlying assumptions about the errors may not be true).
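A rough way to eyeball both of those diagnostics in R (just a sketch, using the built-in cars data as a stand-in):

    # Fit a simple model and inspect the residuals.
    fit <- lm(dist ~ speed, data = cars)
    r <- resid(fit)
    qqnorm(r); qqline(r)      # points near the line: roughly normal residuals
    plot(fitted(fit), r,      # funnel shapes here suggest heteroscedasticity
         xlab = "fitted values", ylab = "residuals")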
This sub tends to focus on statistical topics that are a bit more math intensive. But there's definitely stuff you can learn about descriptive statistics and visualization that doesn't require a strong math background. I just did a quick query on Amazon and found a couple of well reviewed books you may want to check out.
https://www.amazon.com/Excelling-Data-Descriptive-Statistics-Using/dp/1491029129
https://www.amazon.com/Storytelling-Data-Visualization-Business-Professionals/dp/1119002257
There is also good stuff on Khan Academy. Pausing when he introduces a problem and trying to work it out yourself is a good way to go.
What kind of work are you hoping to use some basic stats in?
It depends on what your ultimate career goals are. If you want to become a full blown statistician/data scientist at another firm, it's probably best to go back and get a masters in a relevant field (being as this is /r/statistics, i'd plug a stats masters).
If you're more concerned with further honing your skills/applying new knowledge to your current job, coursera is your best friend. My personal favorite course there is Machine Learning by Andrew Ng (fantastic course to learn about machine learning algorithms).
Another series to look into would be the Johns Hopkins data science track. P.S. you don't actually need to pay for this, you can take each class individually. I personally didn't derive a lot of value from the track, but i've heard positive things from others.
Good luck learning!
Here, take this book: "Linear Models in R"! I am your superman :-)
Not hating on your post at all, just thought it'd be fun to post a favorite quote from Tufte's The Visual Display of Quantitative Information
> A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts [...] Given their low density and failure to order numbers along a visual dimension, pie charts should never be used.
Khan Academy's Statistics Videos would be a good place for a refresher, to follow at your own pace.
Duke's "Data Analysis and Statistical Inference Course through Coursera is starting March 2nd, if you'd prefer something with a limited time-frame, and would like to learn how to use "R", a free, powerful statistical analysis platform.
Edit: Apparently the Coursera course was for last year - they have multiple statistics courses, so it may be a good idea to poke around and see if there are any upcoming ones that you might want to take part in.
This will give you a flavor of what the programming would be like in R (free language many statisticians use):
https://www.coursera.org/course/compdata
This for a very basic intro to applied data analysis using R:
https://www.coursera.org/course/statistics
For the "pre-req" math, just work through a Calc I-III sequence and a linear algebra course on Kahn academy.
If you are serious about switching you will need to actually take those math courses on the way to a bachelors degree if you haven't already. But if the Kahn academy stuff seems too overwhelming, I wouldn't spend my money on college courses.
Get the students to work with real data on a project they care about.
My collection of project ideas, and a couple of examples of past projects, are here:
Tufte's first book, The Visual Display of Quantitative Information, 2nd edition, is without a doubt his best book. I have heard people say his work is dated, but this is just simply not the case. It is foundational work and I've not found anyone do a better job with the material than Tufte. While I enjoyed his other books, they are not must-reads like his first one. That said, with a good editor I believe his 2nd through 4th books could be cut into a single volume rivaling his first book in quality. So there is a lot of good information in there, but it's more of a slog.
Hi there :),
for some introduction (and a bit more) to statistics you might have a look at Khan Academy: Statistics. http://www.khanacademy.org/math/statistics
Here you have video tutorials step by step, just take your time, watch and understand them :)
For a simple introduction to regression analysis I usually recommend "Introduction to Econometrics" http://www.amazon.com/Introduction-Econometrics-Christopher-Dougherty/dp/0199567085/ref=sr_1_2?s=books&ie=UTF8&qid=1340222497&sr=1-2&keywords=introduction+to+econometrics I just love this book :).
I'm not sure however what good books there are on how to work with spss, sorry :(.
You have two major advantages here: 1) you know the hiring manager, and 2) you know what language will be used.
Preparation will be simple: just make sure you know your SQL. I would recommend reading this tutorial on SQL.
Next step after you have the fundamentals down: practice!!! Download MySQL and work to better understand it.
Common interview SQL questions: "What are some common errors you have had to tackle when writing queries" - I always answer 'can't have aggregates in a group by'
"What is the difference between where and a having clause"
"What is a subquery, how do you use them?"
Study hard, and good luck!!!
Definitely, R with Shiny is perfect for what you need. If you know some Java and/or Python, learning R isn't so bad; as usual though, any new language has a bit of a learning curve. Good luck! http://shiny.rstudio.com/
>is there a way I can highlight a section of it to modify/delete?
No.
But if you are on Windows I believe there is a built-in text editor of sorts. Regardless, get RStudio: just install and start it and you have a full-blown editor that communicates automatically with R. One caveat: the grid-like view of your data does not support editing. (If you're a little more courageous there is the more advanced RKWard.)
>For my main question: I'm working with a time-series dataset
I don't know much about time-series but as far as I know R has special data types for time-series, do a
apropos('ts')
and see if something familiar comes up.
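For example (just a quick sketch), one of the things that search will turn up is the base ts class for regular time series:

    # A regular monthly series stored in R's built-in ts class.
    x <- ts(rnorm(24), start = c(2000, 1), frequency = 12)
    class(x)   # "ts"
    plot(x)    # plot() knows how to draw it as a time series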
Here you have your book: "Statistical Inference" by Casella & Berger
This is terrible advice. I took the Stanford class. It's a fantastic class but it is NOT an intro course by any stretch of the imagination.
The "Data Analysis and Statistical Inference" by Duke University on Coursera is a fantastic intro to stats course and it uses R:
https://www.coursera.org/course/statistics
Starts on Sept 14. The teacher is excellent, the course quality is excellent. It also comes with a free open source textbook which is also excellent. I was doing the Coursera Data Science Specialization track simultaneously and their coverage of stats was inadequate. Only the Duke course kept my head afloat.
I could not recommend it enough.
That's good to know. Thanks for sharing this information.
I've been casually looking at data science the last couple of weeks, and I was thinking about taking Coursera's Data Scientist's Toolbox over the summer.
Does this seem like it'd be worthwhile? Or would you say there are better uses of my time?
Khan Academy's statistics section is phenomenal for beginners because you get an insight into how the instructor thinks when he's solving the problems. Once you've checked out those videos you should invest in Whitlock & Schluter's "Analysis of Biological Data". The book is aimed at biologists, many of whom are in exactly your kind of predicament, and consequently it is very easy to understand.
Have you checked on Coursera? They may have more advanced classes in addition to the linked one - or it may have material you have not been exposed to yet.
You're a little late for Coursera's Computing for Data Analysis; the course finished a few days ago. As a side note the instructor, Roger Peng, is currently preparing certificates of completion for those who earned them.
I participated in the course but, owing to other demands on my time, did not complete it. The course itself moved at a brisk pace and, aside from the time required to watch the lectures, required time to complete the quizzes and exercises, as well as to read further on the subject. Personally I thought the lectures were excellent, and provided a well structured way to learn R, with some statistics thrown in.
Coursera's somewhat related Data Analysis course, which begins in January 2013, might be of interest to you, though there's no mention of a certificate.
For example, your code would look something like:
    ods pdf file = "C:\table.pdf";
    proc print data = work.table;
      by year;
    run;
    ods pdf close;
    ods listing;
the last line turns the normal output back on. ods has lots of options if you want to get into the nuts and bolts of it, but that should print you a pdf of the output.
I would totally favor a single core table in these circumstances. That conceptual tidiness you mention really pays off in the long run and will be easier for others to understand (normalized data is expected; tables-per-year is definitely not). Subsetting by year, grouping by year, and aggregation in general are what sqlite and dplyr were designed to make easy to code and quick to run. Further performance tweaks (like indexing) will probably depend on seeing all of the records at once.
Conversely, having split tables would be a pain if you ever needed to query, say, a single patient's records across all years.
The day may come when a single machine running sqlite can't handle all your data - but then you'd probably be better off looking into databases that support this kind of partitioning.
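Here's a rough sketch of the kind of year-based grouping I mean with dplyr (toy data; patient_id and year are made-up column names):

    # Year-based aggregation on a single table with dplyr (toy data).
    library(dplyr)
    records <- data.frame(patient_id = sample(1:50, 200, replace = TRUE),
                          year = sample(2010:2014, 200, replace = TRUE))
    records %>%
      group_by(year) %>%
      summarise(n_rows = n(), n_patients = n_distinct(patient_id))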
Fellow social scientist who had a similar background here. I would recommend going through Chang and Wainwright’s book “Fundamental methods of mathematical economics.” It covers basic multivariable calculus and linear algebra. It’s super readable, as well.
https://www.amazon.com/Fundamental-Mathematical-Economics-Wainwright-Professor/dp/0070109109
Try to get the international edition, it’s 20 bucks or so.
The authors of the original study politely reply:
"We agree with Ashley Croft and Joanne Palmer that the risk of mortality is an absolute that can be postponed but not eliminated. We emphasised the potential of exercise in reducing the mortality rate in a given year, not per se. Although the probability of death is 100% in the long run, we can reduce the speed of approaching death by walking briskly 15 min every day and thus extend our lives. It comes with a better quality of life, and that applies to us as well as to the prophets."¹
¹Chi Pang Wen, Min Kuang Tsai, et al. The Lancet, Volume 379, Issue 9818, 3–9 March 2012, Pages 800-801. (http://www.sciencedirect.com/science/article/pii/S0140673612603420)
OP, I would recommend you read through the OpenIntro statistics book. It's free, of very high quality, and there are labs that go along with it in R. The labs also help you learn R. There is a MOOC associated with the class that starts at the beginning of March on Coursera that you may consider taking as well.
It's been suggested that you learn linear algebra first. I disagree. If your goal is to refresh your memory of statistics and get a good introductory understanding of the subject, read the book or take the course I have suggested. If you know that statistics is what you want to pursue, take linear algebra. Linear algebra is essential for gaining a true understanding of statistics. At that point you'll also want to finish multivariable calculus and probability theory so that you can compute density functions and understand the probability behind statistical inference. It sounds like what you're looking for isn't going to involve these until later, and in the meantime I think it's most important that you get a solid basic understanding of statistics so you can determine for yourself whether or not you want to pursue further knowledge in the field.
It's 0.741469. Solution
There are 20^55 possible combinations.
There are
19^55 without the "1".
19^55 - 18^55 that contain the "1", but not the "2".
19^55 - 2* 18^55 + 17^55 that contain the "1" and the "2", but not the "3"
19^55 - 3* 18^55 + 3* 17^55 - 16^55 that contain the "1", "2" and "3", but not the "4"
and so on...
If you sum up all these combinations and divide them by the total number of possible combinations you'll get the result above.
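If you want to double-check the number, here's a quick R sketch that gets the same answer via inclusion-exclusion over "at least one of the 20 values never appears in the 55 draws":

    # P(at least one of the 20 values never shows up in 55 draws)
    n <- 55; k <- 20
    j <- 1:k
    sum((-1)^(j + 1) * choose(k, j) * ((k - j) / k)^n)   # ~0.741469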
Look at this page: http://www.dynamicgeometry.com/General_Resources/Advanced_Sketch_Gallery/Other_Explorations/Statistics_Collection/Least_Squares.html
The red squares are a measure of how well the line fits the data. Choose the regression line which minimises the area of red squares. The regression formulae do this minimisation for you.
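If you want to see that minimisation happen explicitly, here's a little R sketch (made-up data) comparing a brute-force search over intercept and slope to what lm() returns:

    # Minimising the sum of squared residuals "by hand" vs. lm().
    set.seed(1)
    x <- 1:20
    y <- 3 + 0.5 * x + rnorm(20)
    sse <- function(b) sum((y - (b[1] + b[2] * x))^2)
    optim(c(0, 0), sse)$par   # numerically minimised intercept and slope...
    coef(lm(y ~ x))           # ...match the least-squares formulae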
A really bad practice amongst economic researchers is writing really long, bloated Stata or R files. Often they are not well documented, and they often involve a lot of magic and trickery, making you scratch your head figuring out why they took certain steps.
The solution I've found is to use makefiles and break up your stata/R files into many small portable pieces.
Make (http://www.gnu.org/software/make/) is basically a way to list out all the steps to get to your result and specify all the dependencies. Some people like to work with drake which is 'make for data': http://blog.factual.com/introducing-drake-a-kind-of-make-for-data
Usually the makefiles will specify how to import the data, clean the data, and process the data.
Try to make your Stata files as modular as possible. It's better to have lots of small, clearly defined functions (hopefully self-documenting or well-documented ones) than a 1000-line function that tries to do all the steps one after another. This has the added bonus that you can add lots of unit tests and Travis to the functions, and it will be a lot easier to debug your functions one at a time than to try to write unit tests for a 1000-line behemoth analytics process.
A huge bonus of this approach in addition to reusable and very readable code is that it will be very easy for others to modify and iterate off of your process. good luck.
It's worth reading some Edward Tufte for guidance too (http://www.edwardtufte.com). He talks a lot about aiming to maximise the ratio of information to ink, so basically reducing wastefulness and minimising the extent to which we add bells & whistles to our charts. In a nutshell: avoid 3D piecharts ;)
It really depends on your intended audience & the standards commonly used in your subject area. If you're an R user, it could be worth a look here: https://plot.ly/r/
May be overkill, but take a look at RStudio for the R statistical programming language. Fully functional, professional, open-source, statistics IDE. Cannot recommend enough.
then do something like:
x = rnorm(10, mean=2, sd=0.5)
This will generate something like:
1.75884164412553, 1.96923295305206, 2.02906575504054, 2.84513526976282, 2.2049150744444, 1.73318266409414, 1.62611322640113, 1.2014866750171, 2.0473842615968, 1.92262243708622
Good Thinking is an older book from IJ Good that is basically a series of meandering rants about old-school Bayesian statistics. Very niche, but very interesting.
I just finished my M.S. in statistics. Make sure you have these undergraduate-level topics nailed:
Linear Algebra (first semester, say at the level of Lay's text - look at it on Amazon to get an idea for its topics), down cold. Assume that you will get no time to review this material in class.
Calculus - have integration and differentiation techniques down cold from Calc. I and II, including Taylor/Maclaurin series. Double integration, partial derivatives, and Lagrange multipliers from Calc. III.
Real Analysis - make sure you can do ε-δ proofs as if they are second nature. Limits, continuity, uniform continuity, pointwise convergence, uniform convergence.
Probability and mathematical statistics, at the level of Wackerly's text.
Any programming experience you have would be helpful: doesn't matter if it's C, C++, Java, Python, or R. You have a CS degree, so this should be well covered.
Ok, how about a book to curl up to in front of a fire when you're feeling alert and awake and at the same time comfortable and warm? Maybe there would be snow outside and a labrador by your slippers.
Anyway I digress - just the kind of book that is the book. E.g. people drop a few hundred pounds on The Art of Computer Programming not because they want to read it at 8.15 on a Monday morning before they start work - they read it because they value it as an important, rewarding and an aesthetically pleasing thing to do.
Introduction to Algorithms by Cormen
Convex Optimization by Boyd
Pattern Recognition and Machine Learning by Bishop
Obviously these books at first glance aren't statistics books, but there are tons of problems in statistics that these books cover, and Introduction to Algorithms is a must for anyone looking to program and get a good mathematical basis for it.
I'm nearing completion of the Data Science specialization on Coursera. I've been pretty happy with it overall. I already have a decade+ of programming experience, so the early classes were rudimentary. But the last few courses - which to me are the meat of it - were pretty good.
The linked tutorial covers many of the same topics, but the specialization goes into WAY more detail. If you're thinking of doing this professionally, I would recommend doing the specialization (or something more in-depth) over the tutorial. If you're just looking to explore the topic to see how interested you are, then the tutorial would be a better fit.
https://www.coursera.org/course/matrix https://www.edx.org/course/linear-algebra-foundations-frontiers-utaustinx-ut-5-03x
I feel like there is another one with the same concept of teaching LA through programming.
The problem I found with most stats classes that used programming is that they used R and relied on R's built-in methods, which, although they explained HOW they worked, still left you feeling that you were using a black box, and so the outcomes always felt somewhat confusing. Meanwhile, if you build the functions yourself, all mystery is removed, and you realize without a doubt that some concepts are identical and only change based on context.
Here's a good start if you are truly interested. They start with very introductory material you probably learned in middle school and build up to statistical tests you will learn about in a college course.
Right on! I'm a huge fan of the trial and error approach when it comes to learning new statistical software -- glad to see you're jumping into the deep end head first.
Anyways, I think you might be looking for the summary() function:
    > model1 <- lm(stack.loss ~ ., data=stackloss)
    > summary(model1)

    Call:
    lm(formula = stack.loss ~ ., data = stackloss)

    Residuals:
        Min      1Q  Median      3Q     Max
    -7.2377 -1.7117 -0.4551  2.3614  5.6978

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -39.9197    11.8960  -3.356  0.00375 **
    Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
    Water.Temp    1.2953     0.3680   3.520  0.00263 **
    Acid.Conc.   -0.1521     0.1563  -0.973  0.34405
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.243 on 17 degrees of freedom
    Multiple R-squared: 0.9136,  Adjusted R-squared: 0.8983
    F-statistic: 59.9 on 3 and 17 DF,  p-value: 3.016e-09
In the output, the "Estimate" column lists the coefficient for each predictor variable in your model (as well as the intercept). Hope this helps.
err...well, what kind of data do you have? If you have something fairly digitized already, you could load your data into a network data structure and query the network for the node that has the highest out-degree. If for each article you have a list of citations handy, you could munge this into a format suitable for ingestion into something like gephi, which would do the hard work for you and give you lots of pretty pictures and even allow you to do fancier analyses, like return the paper with the highest PageRank in the corpus, or with the highest betweenness centrality.
If you would be satisfied by the paper with the most citations overall (without regard to your specific corpus) you could use google scholar to count the number of times each paper has been cited.
I recently took Andrew Ng's MOOC in Machine Learning. As part of the course, we learnt to use Octave (an open-source Matlab clone) and implemented all the main algorithms ourselves - linear regression, linear classification, neural networks, SVMs, etc.
If you want to go at a slower pace, then try the Coursera Data Science track, which is R-based. All the courses are free.
The statistical mechanics course contains a lot of applications of MCMC. I did the course and it is pretty good.
I just stumbled over this course while searching for the link to the statistical mechanics one, so I don't know how good it is...
Statistical theory is useful, but to apply it, you'll need to understand the tools used in the industry. I'd recommend the data-science track at Coursera. This way you'll learn some basic programming with R (a statistical programming language) and basic statistical inference. Teaching quality varies, but if you're motivated, you'll do fine.
This is for ecologists but might work for you...
Benjamin M. Bolker, Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, Jada-Simone S. White, Generalized linear mixed models: a practical guide for ecology and evolution, Trends in Ecology & Evolution, Volume 24, Issue 3, March 2009, Pages 127-135, ISSN 0169-5347, DOI: 10.1016/j.tree.2008.10.008.
http://www.sciencedirect.com/science/article/pii/S0169534709000196
Would it be ethical to remove customer names/ids and release the data? I'm sure you could get some volunteers to help investigate something as interesting as what's being read at the library!
The rule of thumb is that if you can fit the data on a single machine, it's not "big", and Hive, Hadoop, Spark, Cloudera, etc. should all be ignored. They're more cumbersome than they're worth. I'd suggest choosing from:
Learning more about crystal reports.
Learn enough sql to create the queries you're interested in, dump them to csv files, and use excel to create graphs.
If you want to get into statistical analysis and machine learning, learn R or Python. https://www.coursera.org/course/rprog which is part of https://www.coursera.org/specialization/jhudatascience/1 may help. I'm taking them to branch out from being a general purpose programmer.
Not sure how you were able to take a time series course without a basic stats background (the stuff you list is typically taught in Stats 101). I'd suggest Khan Academy if you're set on taking this time series course right now:
http://www.khanacademy.org/?video=statistics--the-average#statistics
As someone who teaches stats to non-stats grad students, I would highly recommend taking some introductory stats courses before pursuing the time series class. Either way, good luck with the semester.
Thank you. It's all JavaScript. I created the content (and all the interactivity) in an Observable notebook and made the plots using Plotly.
I am preparing a talk on one of my favorite topics (there is only one test) and using this question as an example. I hope you don't mind.
My draft slides are here
https://docs.google.com/present/view?id=dcq7d5hs_234dwck2rf2
Comments and suggestions are welcome.
Yup. You might also be interested to hear that when Gosset (aka "Student") was doing his work on t-distributions, he implemented Monte Carlo methods without a digital computer:
> He then checked the adequacy of this distribution by drawing 750 samples of 4 from W. R. Macdonell’s data on the height and middle-finger length of 3,000 criminals and by working out the standard deviations of both variates in each sample (see Macdonell 1902). This he did by shuffling 3,000 pieces of cardboard on which the results had been written, possibly the earliest work in statistical research that led to the development of the Monte Carlo method.
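For fun, here's roughly the same kind of check done the modern way (a sketch with simulated data standing in for Macdonell's height measurements):

    # Draw 750 samples of size 4 and look at the distribution of the
    # t statistic -- the check Gosset did by shuffling cardboard.
    set.seed(1908)
    t_stats <- replicate(750, {
      x <- rnorm(4, mean = 65, sd = 2.5)    # stand-in for height data
      (mean(x) - 65) / (sd(x) / sqrt(4))
    })
    hist(t_stats, breaks = 30, freq = FALSE)
    curve(dt(x, df = 3), add = TRUE)        # theoretical t with 3 df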
Look at JASP: https://jasp-stats.org/
It's new and open source, but it has an interface like SPSS and can probably take care of all the basics you need.
Nothing is going to do all your work for you though: you need to understand what you want to do, what tests you want to run and how they work in order to actually present something meaningful
From the FAQ:
> Q. What programming language is JASP written in?
>
> A. The JASP application is written in C++, using the Qt toolkit. The analyses themselves are written in either R or C++ (python support will be added soon!). The display layer (where the tables are rendered) is written in javascript, and is built on top of jQuery UI and webkit.
Working with Gephi is rather intuitive. You can request a bunch of measures, the ones you describe are certainly in there.
If you need more sophisticated measures, it will probably be less comprehensive than what is available in packages such as igraph or sna, but with a point-and-click interface for both the measures and the visualization of them.
As you talk about changes "over time", Gephi recently also got the ability to visualize the changes in the graph over time. Again, nice interface (time-slider), but I do not know if the necessary time-variant measures are also included.
Gephi should be able to handle the format you describe. Another package I frequently use for manipulating/creating network data, changing formats, etc. is NetworkX.
To echo what PhaethonPrime says, you'll be okay for most stats as long as you don't need to do any exotic (and also non-bayesian) models.
In terms of the graphics, definitely check out learning some of the "grammar" based plotting libs. This is one area where R still crushes it, but Python's Bokeh is getting interesting these days.
Try RStudio if you want to go with R; it is much easier to use. Just find a couple of examples online and you will be good to go.
The best alternative probably is Stata, but I do not think that Stata produces nicer output (admittedly, you do have to program more in R to get the nice output). Also, Stata is not free.
Bottom line: try R, using RStudio; if you really do not like it, get something like Stata (or perhaps even SPSS). Don't bother with Matlab (similar coding requirements as R, not free, and the graphics are not that amazing out of the box), or Mathematica.
If R is working for you and you want even more freedom, go with Python / Julia (https://www.julialang.org).
Stats 141: Statistical Computing, taught by Duncan Temple Lang of R fame: http://www.r-project.org/contributors.html
There is a lot going on in the Davis stats program, and the computational stats program definitely offers the skills to get a job right out of school. You should talk to counselors.
As a statistician, SQL is a good addition to your toolbox. I do some work in R, which by default loads all data into memory. This is a problem if you're working with data sets that are a few GB or more in size. If the data is in a relational DB (i.e., a DB that can be queried by SQL), then you may be able to write a query to select a subset of the data that fits in memory and proceed from there.
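For instance (just a sketch using the DBI and RSQLite packages; the file and table names here are made up):

    # Pull only the subset you need into R's memory from an SQLite file.
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")      # hypothetical file
    subset_df <- dbGetQuery(con,
      "SELECT * FROM measurements WHERE year = 2014")          # hypothetical table
    dbDisconnect(con)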
On that note, you may eventually want to learn a little about map-reduce, a technique for operating on data sets so large they don't fit on a single hard drive. I think the most popular open source implementation of map-reduce is hadoop.
Going back to SQL, I'm not familiar with MariaDB, but a popular small relational database is sqlite. Unfortunately, you can't really do much (with sqlite or any database) until you've loaded in some data to play around with. Does anybody know of any public data sets that are easily -- as in, for a novice -- loaded into a popular database?
This. If you do go down the data mining route, check out Weka (http://www.cs.waikato.ac.nz/ml/weka/). It doesn't take very long to learn and is great for exploring relationships between variables in a large, multi-variable dataset.
I've found LyX to be a nice way to crank out tables or long equations in a hurry; it's got an easy to use interface and the code to produce what you have written up is automatically generated (like a happy union between Word and a standard TeX editor). Often, I'll have it open in the background while I'm working with another editor so that I can hop over and create a table or an equation, then just copy the code back into my main document.
Here's a link: http://www.lyx.org/
I would highly recommend starting with the following:
While not strictly for A/B or Marketing, they give you the tools and encompass the principles used in marketing analytics. (Business Analyst, Currently working also with SEO and Marketing Analytics)
What textbook did you use in your class and how much of the book did you cover? Which topics did you cover?
I recommend Introduction to Probability by Blitzstein and Hwang [Link to PDF]. Then, for a book that goes deeper into the theory (and proves theorems in full rigor), I recommend Probability (Theory and Examples) by Durrett.
If you're interested in a bit of history, both Salsberg's The Lady Tasting Tea and Hacking's The Emergence of Probability are good reads. They dig more into the ideas and people that went into the original developments of probability and statistics. I found understanding how the field began gave greater context to the methods we use today and modern arguments about them.
This is a great book to learn spatial analysis: Modern Spatial Econometrics by Anselin.
I'm sure there's a free version online.
That's only one area of nonparametrics. Nonparametric models are important for small data too, where you don't or can't assume a distribution.
But I do agree with your sentiment that having a CI for your prediction is important.
To be fair to ML, there are areas where it is very good, and that's data with low noise (images, NLP, etc.). I believe Frank Harrell's book Regression Modeling Strategies talks about this and his view on ML.
I also believe the ASA has talked about how to augment ML with statistics as part of their goals for 2020.
If I can find the ASA's post I'll update this post.
I don't know if this falls under nonparametric models, but the bootstrap gets its distribution from the data and not from some assumed distribution. It's certainly a nonparametric technique at least. There are situations where the bootstrap would be better than its parametric counterpart.
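A bare-bones example of what I mean (a percentile bootstrap interval for a median, no distribution assumed; just my own toy sketch):

    # Percentile bootstrap CI for the median of some skewed data.
    set.seed(1)
    x <- rexp(50, rate = 0.2)
    boot_medians <- replicate(5000, median(sample(x, replace = TRUE)))
    quantile(boot_medians, c(0.025, 0.975))   # rough 95% interval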
A lot of clinical trial issues are pretty fundamental statistics. There are some specific weird things that tend to come up in trials, and not in other places (e.g. compliance with treatment). This book describes a lot of those issues in clinical trials, and it's fairly short.
https://www.amazon.com/Designing-Randomised-Trials-Education-Sciences/dp/0230537359
Disclaimer: I used to work with the authors, and have published papers with them. But I bought the book, with my own money.
This is an applied statistics book that will walk you through PCA and a bit beyond.
https://www.amazon.com/Analysis-Multivariate-Statistics-Behavioral-Sciences/dp/1584889608
R is a high-level language that is fairly easy to pick up once you know basic CS coding language syntax and principles (at least I thought so). Additionally, if you are interested in writing statistical software for R, much of what you write will be in C++ and wrapped for R.
The C++ experience is good, but I would definitely recommend doing more on your own (mini projects like games, etc.) so you feel like you have a more versatile grasp on it.
I'm currently working through Cormen's Introduction to Algorithms, if you end up doing something similar this summer let me know and I can try to provide guidance.
For general biostatistics I'd recommend "Intuitive Biostatistics" by Harvey Motulsky, although it's thin on graphical representation.
For presentation of graphics Tufte's "The Visual Display of Quantitative Information" is great.
I don't claim to be very good at explaining things, so here is a pretty good intro to estimating a population proportion.
http://stattrek.com/lesson4/proportion.aspx
If you're not really into the theory, here is a handy calculator.
http://www.wolframalpha.com/input/?i=binomial+distribution+confidence+interval
Try iTunesU, Coursera, and Khan Academy.
https://www.coursera.org/course/stats1 http://www.khanacademy.org/math/probability
You may want to start with some classes on probability, it's the basis on which statistics is built.
You can also look at high school AP stats classes.
If you're just looking for an introduction to general statistics concepts I would suggest trying the Khan Academy videos for statistics.
A decent text to try is Elementary Statistics by Triola. I've taught intro to stats a few times with it, and don't have any major complaints.
For more serious graduate level, check out Casella and Berger mentioned below (looks promising to me; will pick up a copy sometime soon).
Okay buddy, here you go. First, you can get this data from the California Statewide Database project. I grabbed the 2010 election results, summed up the gov election and Prop 19 by county, and ran a bivariate regression. I got my parameter estimates and used those to create an expected value.
FIPS Code for San Diego is 73. As you can see in the data, it is 4 percentage points higher than expected by the regression.
I have attached the spreadsheet here:
It is a simple regression. The explanatory parameter and the overall regression are significant.
It's the double-edged sword of "data science". A vague name that defines a broad and loosely structured set of skills is going to beget a lot of jobs with broad, vague, and loosely defined skillsets.
I generally get a good feel for what they are looking for in the job description itself, though. I'm less of a data-sciencey person and more of a QC stats person, but that has similar pitfalls. Anyways, here are two examples. Not the best examples, just two I dug up quickly.
http://www.indeed.com/viewjob?from=appsharedroid&jk=dbe49f98a547ba8b
http://www.indeed.com/viewjob?from=appsharedroid&jk=eaaaa3e0affc7c43
That first one I would never respond to. The second one I would be more likely to (at first glance at least; I'm quite sure I'm not actually qualified due to the non-stats requirements).
I can tell the person who wrote the second one actually has a background in stats, and it's also clear to me that they sincerely need the person they hire to have a decent level of strength in this.
There are keywords I see in both. In the first it's "manipulating data", "statistical graphs", "Tableau". In the second it's "factor analysis", "test design". You'll also notice that the second one requests that you write technical reports, while the first talks about fulfilling requests from writers.
This is a pretty extreme example in terms of the differences between these, but I thought it was illustrative.
If we take your setup to be exact then the probability of being helped by any individual is given by 1/(xn). [When x=1 it is simply 1/n, etc.]
Importantly for us, this means that the probability that an individual does not help you is 1-1/(xn), or alternatively (xn - 1)/xn.
If we are working in your idealized scenario, then we can readily answer the question "what is the probability that no one helps us?". Assuming independence, then the probability that no one helps us is the product of the probabilities that each individual does not help us (think a coin flip: the probability of flipping heads is 0.5, the probability of flipping heads twice in a row is 0.5x0.5 = 0.25).
What this means is that the probability that we are not helped is given by [(xn-1)/xn]^n
From here we grab the complementary probability and say the chance that we are helped is given by 1-[(xn-1)/xn]^n.
You can play around here: https://www.desmos.com/calculator/be9zocwrtq where the x-axis represents the sample size and "z" can be set to be "x" in the above expression.
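If you'd rather poke at it in R than Desmos, a quick sketch of the same formula:

    # P(at least one of n bystanders helps), each with probability 1/(x*n).
    p_helped <- function(n, x) 1 - ((x * n - 1) / (x * n))^n
    p_helped(n = 100, x = 1)   # ~0.634 (approaches 1 - 1/e for large n)
    p_helped(n = 100, x = 2)   # lower when each individual is less likely to help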
Here's an example of a MOOC that is open for people who don't have much of a math background. It's been offered fairly regularly in the past.
I'll give it a shot.
Alpha is a threshold associated with any frequentist statistical comparison. Alpha deals with how likely you are to make a type 1 error; whenever you take a sample of data from a population to compare it to some other value (another sample or population), there's a certain likelihood that by "some twist of fate" the sample you picked was uncharacteristic of the population from which it was drawn. Alpha is set by the experimenter based on previous data and experiments in the particular field. Psychology uses .05; other disciplines use .1 or .01. The homogeneity of the participants/test subjects/microbes/groups/data in one's sample usually affects what one sets as one's alpha. This occurs largely due to design constraints (i.e. microbiologists can expect greater homogeneity across their samples than a sports psychologist).
"most significant alpha level" is a turn of phrase i've never heard before.
Finally, data cannot prove a hypothesis correct; rather, it can disprove a null hypothesis. Again, this is due to experimental design constraints. Statistical comparisons are more rigorous than pure numerical comparisons; just because an average of "5" is greater than an average of "4", does not mean that they are statistically different. The variance associated with these two means, and the statistical and experimental design considerations in place are needed to determine if the samples used to identify those means are dissimilar.
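If it helps to see alpha in action, here's a small R simulation (my own sketch, not from any particular source): when the null is actually true, about 5% of tests come out "significant" at alpha = .05 purely by that twist of fate.

    # Simulate many experiments where the null hypothesis is actually true.
    set.seed(123)
    p_values <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
    mean(p_values < 0.05)   # roughly 0.05 -- the type 1 error rate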
These might help:
I haven't actually used them, but I've heard good things about Khan Academy for beginning stats. Here's a link to the stats videos: http://www.khanacademy.org/math/statistics I have a friend who is an occupational therapist, and when I was doing her stats homework the biggest thing was understanding p-values and hypothesis testing, so that when you read about a study and they tell you they had a p-value of .01, you can understand what that means in the context of the study.
I do a lot of statistical computing, sometimes with large data sets, and this is the new build I just put together for myself. Total cost would be about $12K, but I reused the power supply, video cards, and case from my last build. Here is a link to the parts