From 3.5 billion Reddit comments

The following content includes affiliate links for which I may make a small commission at no extra cost to you should you make a purchase.


43 points

·
14th Apr 2018

I found that practical examples with student interaction usually work well in engaging them. Here are two examples that are pretty fun and illustrative (but they aren't "modern" statistics, so I don't know how useful you will find them):

- Bring an inflatable globe. The goal is to estimate the proportion of land/water on Earth by taking samples from the globe. Students throw the globe around, each taking a "sample" by randomly placing their finger on the globe and recording whether it hit land or water. After a sufficient number of samples, say, 50, they can estimate the proportion of land/water. This is a fun illustration of how we can learn something by taking samples. It can also be used to illustrate that the estimate is not perfect and is subject to randomness, etc.
- The German tank problem can serve as an example where statistics was used in a real historical problem where the stakes were very high. The problem is good for storytelling to set up the situation the Allies found themselves in. For an illustration: bring a pot with little slips of paper containing the numbers 1 to *N* (where *N* could be, say, 400 or so). The students pass the pot around, each drawing a sample without replacement of, say, 5 slips of paper and noting the numbers on them for later use. At the end, have each student calculate his estimate of the total number of slips (*N*) using the formula (the formula is so easy that the calculation can be done effortlessly on a calculator or a phone).
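The formula the comment alludes to isn't spelled out; a standard choice is the minimum-variance unbiased estimator m + m/k - 1, where m is the sample maximum and k the sample size. A minimal Python sketch of the classroom exercise, with N = 400 as in the example:

```python
import random

def german_tank_estimate(sample):
    """Minimum-variance unbiased estimator for the largest serial
    number: m + m/k - 1, with m the sample maximum, k the sample size."""
    m, k = max(sample), len(sample)
    return m + m / k - 1

N = 400                                    # true number of slips, unknown to the class
slips = random.sample(range(1, N + 1), 5)  # 5 slips drawn without replacement
print(slips, german_tank_estimate(slips))
```

Each student's estimate will differ, which nicely sets up a discussion of sampling variability.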

The book by Andrew Gelman & Deborah Nolan (*Teaching Statistics: A Bag of Tricks*) contains these and more cool experiments.

You could easily move on to more modern and complicated statistics from there.

38 points

·
2nd Jun 2021

Yeah, I am trying to find new jobs and there's often fucking leetcode too. Statistical ML is hardly asked, or is asked only after that stage, and I personally have not made it that far because I can't solve the "maximum # of events in a given time period" type of problem, like this: https://leetcode.com/discuss/interview-question/374846/Twitter-or-OA-2019-or-University-Career-Fair

However, not all DS jobs have leetcode. Some do take-homes or presentations, but the one that did a presentation was not in industry; it was in academia.

I was asked about logistic regression in an interview once, and the interviewer was a CS person probing my resume. He asked me what the output of logistic regression was, and I said it's the class probability, and it seemed he didn't agree even when I emphasized that classification happens *after* the probability. Huge red flag. I blame sklearn's model.predict() for perpetuating these misconceptions, and then you look stupid for giving the right answer...

36 points

·
29th Nov 2018

The book R for Data Science is really excellent and available free in an easy-to-navigate online form. I highly, highly recommend it.

And although I don't know Python, I think that starting with R is a good idea, especially if this will be extracurricular. The language is made specifically for data science, and has a really great associated user interface via RStudio.

32 points

·
19th Mar 2018

Statistical Rethinking: https://www.youtube.com/playlist?list=PLDcUM9US4XdM9_N6XUUFrhghGJ4K25bFc

Also has the book: https://www.amazon.com/Statistical-Rethinking-Bayesian-Examples-Chapman/dp/1482253445

28 points

·
20th Jul 2021

Dumb question, but is this for hobby reasons, or do you have a specific scientific question you'd like to answer using statistics?

All the tests are "mathematically sound". Each one is based on different assumptions. I personally focus on understanding the assumptions and figuring out when real data is likely or not likely to match those assumptions reasonably well.

But if you want to geek out on the math, Casella and Berger is pretty much a classic. It's extremely thorough, and takes you through all the steps from basic probability distributions through the tests that are based on them.

21 points

·
23rd Apr 2015

20 points

·
4th Apr 2015

Hi, I'm a PhD student in CMU Statistics. I can answer general questions about the program and curriculum, or at least point you at people who can.

Linear algebra is part of the required curriculum, as well as multivariate calculus. You'd only need linear algebra your second or third year when you take regression (36-401) and advanced data analysis (36-402), so learning it now may not be best, like /u/trijazzguy says.

Programming is definitely a good idea, though. Python is good if you take the time to learn the data packages (Numpy, Pandas, Matplotlib, etc.), but most of your courses will use R. But honestly it doesn't matter which language you learn, as long as you learn something you find interesting so you get practice thinking like a programmer. Find a little project you're interested in and write some code for it.

Also, take a look at our new majors. You can just do statistics, or you can combine it with economics, machine learning, or math. (I strongly recommend doing mathematical statistics if you're ever interested in going to graduate school or doing stats research -- the math preparation is essential.)

I'm not sure what else you can do to prepare. The CMU program is very good. Many undergraduates decide this means they need to take as many classes as possible every semester, so they spend all their waking hours doing homework and begging for extensions. Don't do that. Try to relax a bit and pick your courses strategically.

18 points

·
23rd Jan 2019

I highly recommend Lectures on Probability Theory and Mathematical Statistics by Marco Taboga. The proofs are rigorous yet concise, and the clarity of presentation is superb. IMO, this book is much better than Casella & Berger. The 2nd edition is available for free online, and the 3rd edition can be bought on Amazon.

17 points

·
6th Apr 2019

In reading a bit about this disagreement I stumbled on this article, which to me seems like garbage.

> Some people might confuse logistic regression and a binomial GLM with a logistic link, but they aren’t the same.

Am I going crazy, or are these exactly the same? Not to mention, I've never heard people refer to a "logistic link", only a "logit link". This guy is also assuming 538 uses pretty basic models like linear regression, but I was under the impression he's doing something with hierarchical Bayesian models.

17 points

·
14th Jul 2018

I do not know how much you are into cooking, but there is this concept of "basic sauces". Yes, there are a million sauces out there, and an eager student might roam about and learn each and every sauce there is, or he could learn that there are a few basic sauces of which all the others are variations. *Sauce velouté* is a white sauce based on flour and butter. From there on, it becomes any sauce you'd like.

I think a first step for you might be to read about the general linear model (not to be confused with the generalized linear model) and realize that all of these tests do the very same thing. A t-test is a special case of ANOVA, and ANOVA is a special case of linear regression. Then all of these different tests become different flavors of the same thing.
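The t-test/regression equivalence is easy to verify numerically: a pooled two-sample t-test gives exactly the same t statistic as the slope test in a regression of the outcome on a 0/1 group dummy. A small stdlib-only Python sketch with made-up data:

```python
from statistics import mean, variance
from math import sqrt

def t_two_sample(a, b):
    # Pooled two-sample t statistic.
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(sp2 * (1 / na + 1 / nb))

def t_regression(a, b):
    # Regress y on a 0/1 group dummy; return the t statistic of the slope.
    y = list(a) + list(b)
    x = [0] * len(a) + [1] * len(b)
    n, xbar, ybar = len(y), mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)
    return slope / se

a = [4.1, 3.8, 5.0, 4.4, 4.7]
b = [5.2, 5.9, 4.8, 5.6, 6.1]
print(t_two_sample(a, b), t_regression(a, b))  # the two agree exactly
```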

For a non-technical introduction, most people recommend Andy Field's *Discovering Statistics*.

If you do not know what t-tests and the like even do, you might want to start with *An Introduction to the Practice of Statistics*; it has a lot of practice assignments.

If you want to go the Bayesian route, you might enjoy *Statistical Rethinking* by McElreath.

16 points

·
9th May 2015

I'd suggest taking a look at the free Data Science Specialization offered on Coursera. I've only tried the first two courses, but found them challenging enough, with the added benefit of encouraging individual research to complete some of the assignments (they don't hold your hand like many MOOCs do). If I'm not mistaken, the courses run throughout the whole year, so you can just sign up at whatever level you're comfortable with.

16 points

·
16th Jan 2018

15 points

·
29th Oct 2012

This is Jeff here, from Simply Statistics. Roger's course was designed to teach the mechanics of R. I know he made a pretty strong effort to help folks who didn't have much background, but obviously there is variation in backgrounds. He would definitely love feedback on the course.

If you want to learn the statistical component, my course in Data Analysis: https://www.coursera.org/course/dataanalysis is the natural continuation of Roger's course. Hope to see you in that one!

14 points

·
2nd Jan 2011

R.

*Better* than a lot of commercial software by many criteria, though it does involve some investment to learn.

By default, it is command-line driven, but I think it's worth learning to use it that way.

There are many, many resources available.

14 points

·
18th Oct 2018

Humans are not intuitively good at probability and statistics, because of numerous cognitive biases. (*Thinking, Fast and Slow*)

13 points

·
18th Aug 2021

I took a lot of calculus and I use every bit of it. I did a masters in economics and used it pretty much everyday. I'm now in a stats PhD program and I certainly use it everyday. Calculus and linear algebra are probably the two most important math classes you can take for a PhD program.

I never took a class titled Numerical Analysis but did a quick Google search: http://www.scholarpedia.org/article/Numerical_analysis. It lists 3 main areas:

- Systems of Linear and Nonlinear Equations
- Approximation Theory
- Numerical Solution of Differential and Integral Equations

All three of these areas are useful, and I've used each to some degree while in my PhD program. After reading through this webpage, I'd say it's a no-brainer to take the numerical analysis course.
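As a taste of the first listed area, Newton's method for a nonlinear equation fits in a few lines. A generic sketch (not taken from the linked article):

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Find a root of f near x0 by Newton's method: x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("did not converge")

# sqrt(2) as the positive root of x^2 - 2 = 0
print(newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0))
```

The same update rule, generalized to vectors via the Jacobian, is the workhorse for systems of nonlinear equations.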

13 points

·
15th Sep 2020

Ironically, now is the time to suggest every seemingly poor undergraduate text on introductory statistics.

I own Schaum's Outline of Statistics, which is essentially a Coles Notes for stats. It's cheap, it covers the topics you might find in an intro stats class, and I think it is as good as anything. You can probably find a free PDF if you look hard enough. And if you can't, DM me and I'll send you mine.

13 points

·
27th Jul 2019

Multivariate Statistical Analysis: A Conceptual Introduction, 2nd Edition, Kachigan https://www.amazon.com/dp/0942154916/ref=cm_sw_r_cp_apa_i_QPipDb8AVMXYR

It's short, cheap (especially if used), and easy to read. Would recommend.

It doesn't really cover GLMs, however; it's more about the statistical fundamentals.

12 points

·
7th Jun 2018

> And about the software itself, is it freeware?

Yes

> Where would be the best place to get the software?

But I'd also advise getting R studio https://www.rstudio.com/

12 points

·
20th Feb 2019

I highly recommend *Lectures on Probability Theory and Mathematical Statistics* by Marco Taboga. The proofs are rigorous yet concise, and the clarity of presentation is superb. The interactive web format is available for free online, and the paperback format can be bought on Amazon. Another book that you can consider is the classic *Statistical Inference* from Casella & Berger. Personally I think Taboga is better than Casella and Berger.

11 points

·
8th Nov 2017

The *linear* part of linear regression refers to the coefficients, not the variables. For example Y = aX + bX^2 is a linear model because it is a linear combination involving a and b. Y = a*b*X is not a linear model. You can fit a lot of models that are not linear using linear regression. The name is kind of misleading, I think.
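The point that "linear" means linear in the coefficients can be made concrete: a model that is curved in x, like y = a + b·x², is still fit by ordinary linear regression after transforming the feature. A minimal stdlib-only sketch with made-up data:

```python
from statistics import mean

def simple_ols(z, y):
    # Closed-form ordinary least squares for y = a + b * z.
    zbar, ybar = mean(z), mean(y)
    b = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / \
        sum((zi - zbar) ** 2 for zi in z)
    return ybar - b * zbar, b

x = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 8.9, 19.2, 33.0]   # roughly y = 1 + 2 * x^2
z = [xi ** 2 for xi in x]          # transform the feature; model stays linear in (a, b)
a, b = simple_ols(z, y)
print(a, b)  # close to the generating values 1 and 2
```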

Without knowing it, you're asking a gigantic question. You want to know how to fit regression models. That can take up two graduate level courses, if you're learning all the details. A good introduction is by Simon Sheather (Amazon Link). If you're a student, you can read that book for free from SpringerLink. There should be courses on regression modeling from Coursera and MIT Open Courseware, if you'd prefer that.

I'm sorry I can't answer your question directly. You really need to understand a little more about regression to build good models. For any given dataset, there's a handful of different ways, with varying degrees of validity, to model relationships among variables.

11 points

·
22nd Apr 2019

Applied Linear Statistical Models by Kutner is a far better reference for statistical modeling compared to ISLR/ESLR or any kind of "machine learning" text, but it sounds as though you did a stat masters since you're asking about stat modeling instead of the new buzzwords. The latter options are certainly more narrow.

https://www.amazon.com/Applied-Linear-Statistical-Models-Michael/dp/007310874X Considered a cornerstone, of sorts.

10 points

·
27th Jan 2015

It's available in a Wayback Machine snapshot from 2013.

Edit: Gold! Aww, thanks. You wouldn't believe it. Almost one month ago, I received gold for the first time, which is soon expiring. I just got the notification about it. And now this. :-)

10 points

·
25th May 2021

I would highly recommend this book to anyone who wants strong fundamentals in linear regression: https://www.amazon.com/Statistical-Models-Practice-David-Freedman/dp/0521743850

It presents these concepts very plainly, deliberately, and has exercises to demonstrate these fundamental differences and drive understanding home. It is a super no nonsense approach to the topic.

It was used as the main text for my grad level linear models course at UC Berkeley.

To answer the question: errors are assumed to be normally distributed as part of the data-generating process, and residuals should be roughly normally distributed as a useful diagnostic (e.g., heteroscedasticity of the residuals is an indicator that the underlying assumptions about the errors may not hold).

10 points

·
14th Apr 2018

This sub tends to focus on statistical topics that are a bit more math intensive. But there's definitely stuff you can learn about descriptive statistics and visualization that doesn't require a strong math background. I just did a quick query on Amazon and found a couple of well reviewed books you may want to check out.

https://www.amazon.com/Excelling-Data-Descriptive-Statistics-Using/dp/1491029129

https://www.amazon.com/Storytelling-Data-Visualization-Business-Professionals/dp/1119002257

There is also good stuff on Khan Academy. Pausing when he introduces a problem and trying to work it out yourself is a good way to go.

What kind of work are you hoping to use some basic stats in?

9 points

·
23rd Jun 2015

It depends on what your ultimate career goals are. If you want to become a full-blown statistician/data scientist at another firm, it's probably best to go back and get a master's in a relevant field (this being /r/statistics, I'd plug a stats master's).

If you're more concerned with further honing your skills/applying new knowledge to your current job, Coursera is your best friend. My personal favorite course there is Machine Learning by Andrew Ng (fantastic course to learn about machine learning algorithms).

Another series to look into would be the Johns Hopkins data science track. P.S. you don't actually need to pay for this; you can take each class individually. I personally didn't derive a lot of value from the track, but I've heard positive things from others.

Good luck learning!

9 points

·
14th Jan 2011

Here, take this book: *Linear Models in R*! I am your superman :-)

8 points

·
5th Oct 2016

Not hating on your post at all, just thought it'd be fun to post a favorite quote from Tufte's *The Visual Display of Quantitative Information*

>A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts [...] Given their low density and failure to order numbers along a visual dimension, pie charts should never be used.

7 points

·
26th Feb 2016

Khan Academy's Statistics Videos would be a good place for a refresher, to follow at your own pace.

Duke's "Data Analysis and Statistical Inference" course through Coursera is starting March 2nd, if you'd prefer something with a limited time frame and would like to learn how to use R, a free, powerful statistical analysis platform.

Edit: Apparently the Coursera course was for last year - they have multiple statistics courses, so it may be a good idea to poke around and see if there are any upcoming ones that you might want to take part in.

7 points

·
14th May 2015

This will give you a flavor of what the programming would be like in R (free language many statisticians use):

https://www.coursera.org/course/compdata

This for a very basic intro to applied data analysis using R:

https://www.coursera.org/course/statistics

For the "pre-req" math, just work through a Calc I-III sequence and a linear algebra course on Khan Academy.

If you are serious about switching, you will need to actually take those math courses on the way to a bachelor's degree if you haven't already. But if the Khan Academy material seems too overwhelming, I wouldn't spend my money on college courses.

7 points

·
1st Dec 2011

Get the students to work with real data on a project they care about.

My collection of project ideas, and a couple of examples of past projects, are here:

7 points

·
27th Oct 2016

Tufte's first book, *The Visual Display of Quantitative Information, 2nd edition*, is without a doubt his best. I have heard people say his work is dated, but this is simply not the case. It is foundational work, and I've not found anyone do a better job with the material than Tufte. While I enjoyed his other books, they are not must-reads like his first one. That said, with a good editor I believe his 2nd through 4th books could be cut into a single volume rivaling his first book in quality. So there is a lot of good information in there, but it's more of a slog.

6 points

·
20th Jun 2012

Hi there :),

For some introduction (and a bit more) to statistics, you might have a look at Khan Academy: Statistics. http://www.khanacademy.org/math/statistics

Here you have video tutorials step by step, just take your time, watch and understand them :)

For a simple introduction to regression analysis I usually recommend "Introduction to Econometrics" http://www.amazon.com/Introduction-Econometrics-Christopher-Dougherty/dp/0199567085/ref=sr_1_2?s=books&ie=UTF8&qid=1340222497&sr=1-2&keywords=introduction+to+econometrics I just love this book :).

I'm not sure, however, what good books there are on how to work with SPSS, sorry :(.

6 points

·
31st Mar 2016

You have two major advantages here: 1) you know the hiring manager, and 2) you know what language will be used.

Preparation will be simple: just make sure you know your SQL. I would recommend reading this tutorial on SQL.

Next step after you have the fundamentals down: practice!!! Download MySQL and work to better understand it.

Common interview SQL questions: "What are some common errors you have had to tackle when writing queries?" (I always answer: "you can't have aggregates in a GROUP BY".)

"What is the difference between a WHERE and a HAVING clause?"

"What is a subquery, and how do you use them?"
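For the WHERE-versus-HAVING question, the difference is concrete: WHERE filters rows before grouping, HAVING filters groups after aggregation. A small sketch using Python's built-in sqlite3 module, with a hypothetical table and data:

```python
import sqlite3

# In-memory demo table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("ann", 10), ("ann", 25), ("bob", 5), ("bob", 3), ("cai", 40)])

# WHERE runs before grouping; HAVING runs after aggregation.
rows = con.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 4          -- row-level filter (drops bob's 3)
    GROUP BY customer
    HAVING SUM(amount) >= 20  -- group-level filter (drops bob entirely)
    ORDER BY customer
""").fetchall()
print(rows)  # [('ann', 35.0), ('cai', 40.0)]
```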

Study hard, and good luck!!!

6 points

·
30th Jul 2015

R with Shiny is definitely perfect for what you need. If you know some Java and/or Python, learning R isn't so bad; as usual, though, any new language has a bit of a learning curve. Good luck! http://shiny.rstudio.com/

6 points

·
7th Nov 2012

>is there a way I can highlight a section of it to modify/delete?

No.

But if you are on Windows, I believe there is a built-in text editor of sorts. Regardless, get RStudio: just install and start it and you have a full-blown editor that communicates automatically with R. One caveat: the grid-like view of your data does *not* support editing. (If you're a little more courageous, there is the more advanced RKWard.)

>For my main question: I'm working with a time-series dataset

I don't know much about time series, but as far as I know R has special data types for them. Do a

apropos('ts')

and see if something familiar comes up.

6 points

·
14th Feb 2011

Here you have your book: *Statistical Inference* by Casella & Berger.

5 points

·
3rd Aug 2015

This is terrible advice. I took the Stanford class. It's a fantastic class but it is NOT an intro course by any stretch of the imagination.

The "Data Analysis and Statistical Inference" course by Duke University on Coursera is a fantastic intro-to-stats course, and it uses R:

https://www.coursera.org/course/statistics

Starts on Sept 14. The teacher is excellent, the course quality is excellent. It also comes with a free open source textbook which is also excellent. I was doing the Coursera Data Science Specialization track simultaneously and their coverage of stats was inadequate. Only the Duke course kept my head afloat.

I could not recommend it enough.

5 points

·
14th Jun 2015

That's good to know. Thanks for sharing this information.

I've been casually looking at data science the last couple of weeks, and I was thinking about taking coursera's Data Scientists Toolbox over the summer.

Does this seem like it'd be worthwhile? Or would you say there are better uses of my time?

5 points

·
1st May 2012

Khan Academy's statistics section is phenomenal for beginners because you get an insight into how the instructor thinks when he's solving the problems. Once you've checked out those videos you should invest in Whitlock & Schluter's "Analysis of Biological Data". The book is aimed at biologists, many of whom are in exactly your kind of predicament, and consequently it is very easy to understand.

5 points

·
12th Jan 2014

Have you checked on Coursera? They may have more advanced classes in addition to the linked one, or it may have material you have not been exposed to yet.

5 points

·
1st Nov 2012

You're a little late for Coursera's Computing for Data Analysis; the course finished a few days ago. As a side note the instructor, Roger Peng, is currently preparing certificates of completion for those who earned them.

I participated in the course but, owing to other demands on my time, did not complete it. The course itself moved at a brisk pace and, aside from the time required to watch the lectures, required time to complete the quizzes and exercises, as well as to read further on the subject. Personally I thought the lectures were excellent, and provided a well structured way to learn R, with some statistics thrown in.

Coursera's somewhat related Data Analysis course, which begins in January 2013, might be of interest to you, though there's no mention of a certificate.

5 points

·
26th Apr 2012

For example, your code would look something like:

    ods pdf file = "C:\table.pdf";
    proc print data = work.table;
      by year;
    run;
    ods pdf close;
    ods listing;

The last line turns the normal output back on. ODS has lots of options if you want to get into the nuts and bolts of it, but that should print you a PDF of the output.

5 points

·
3rd Feb 2016

I would totally favor a single core table in these circumstances. That conceptual tidiness you mention really pays off in the long run and will be easier for others to understand (normalized data is expected; tables-per-year is definitely not). Subsetting by year, grouping by year, aggregation in general: `sqlite` and `dplyr` were designed to make that easy to code and quick to run. Further performance tweaks (like indexing) will probably depend on seeing all of the records at once.

Conversely, having split tables would be a pain if you ever needed to query, say, a single patient's records across all years.

The day may come when a single machine running sqlite can't handle all your data - but then you'd probably be better off looking into databases that support this kind of partitioning.

5 points

·
29th Mar 2019

Fellow social scientist who had a similar background here. I would recommend going through Chang and Wainwright’s book “Fundamental methods of mathematical economics.” It covers basic multivariable calculus and linear algebra. It’s super readable, as well.

https://www.amazon.com/Fundamental-Mathematical-Economics-Wainwright-Professor/dp/0070109109

Try to get the international edition, it’s 20 bucks or so.

4 points

·
13th Mar 2012

The authors of the original study politely reply:

"We agree with Ashley Croft and Joanne Palmer that the risk of mortality is an absolute that can be postponed but not eliminated. We emphasised the potential of exercise in reducing the mortality rate in a given year, not per se. Although the probability of death is 100% in the long run, we can reduce the speed of approaching death by walking briskly 15 min every day and thus extend our lives. It comes with a better quality of life, and that applies to us as well as to the prophets."¹

¹Chi Pang Wen, Min Kuang Tsai, et al. The Lancet, Volume 379, Issue 9818, 3–9 March 2012, Pages 800-801. (http://www.sciencedirect.com/science/article/pii/S0140673612603420)

4 points

·
4th Feb 2015

OP, I would recommend you read through the OpenIntro statistics book. It's free, of very high quality, and there are labs that go along with it in R. The labs also help you learn R. There is a MOOC associated with the class that starts at the beginning of March on Coursera that you may consider taking as well.

It's been suggested that you learn linear algebra first. I disagree. If your goal is to refresh your memory of statistics and get a good introductory understanding of the subject, read the book or take the course I have suggested. If you **know** that statistics is what you want to pursue, take linear algebra. Linear algebra is essential for gaining a true understanding of statistics. At that point you'll also want to finish multivariable calculus and probability theory so that you can compute density functions and understand the probability behind statistical inference. It sounds like what you're looking for isn't going to involve these until later, and in the meantime I think it's most important that you get a solid basic understanding of statistics so you can determine for yourself whether or not you want to pursue further knowledge in the field.

4 points

·
15th May 2011

It's 0.741469. Solution

There are 20^55 possible combinations. Of these, there are:

- 19^55 without the "1".
- 19^55 - 18^55 that contain the "1", but not the "2".
- 19^55 - 2·18^55 + 17^55 that contain the "1" and the "2", but not the "3".
- 19^55 - 3·18^55 + 3·17^55 - 16^55 that contain the "1", "2", and "3", but not the "4".

and so on...

If you sum up all these combinations and divide them by the total number of possible combinations, you'll get the result above.
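The same alternating sums collapse into the usual inclusion-exclusion formula, which makes the answer easy to check in code (assuming, as the counts imply, 55 draws with replacement from 20 equally likely values, and asking for the probability that at least one value never appears):

```python
from math import comb

# Inclusion-exclusion over which values are missing:
# P(at least one of the 20 values is absent in 55 draws).
n, k = 20, 55
p_missing = sum((-1) ** (j + 1) * comb(n, j) * ((n - j) / n) ** k
                for j in range(1, n + 1))
print(round(p_missing, 6))
```

The printed value matches the 0.741469 quoted above.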

4 points

·
17th Sep 2014

Look at this page: http://www.dynamicgeometry.com/General_Resources/Advanced_Sketch_Gallery/Other_Explorations/Statistics_Collection/Least_Squares.html

The red squares are a measure of how well the line fits the data. Choose the regression line which minimises the area of red squares. The regression formulae do this minimisation for you.
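The minimisation the applet illustrates can be checked directly: at the least-squares solution, any perturbed slope gives a larger total squared-residual "area". A tiny sketch with made-up points:

```python
def ssr(slope, intercept, pts):
    # Sum of squared residuals: the total "area of red squares".
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pts)

pts = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
n = len(pts)
xbar = sum(x for x, _ in pts) / n
ybar = sum(y for _, y in pts) / n
b = sum((x - xbar) * (y - ybar) for x, y in pts) / \
    sum((x - xbar) ** 2 for x, _ in pts)   # least-squares slope
a = ybar - b * xbar                        # least-squares intercept

# Moving away from the fitted slope only increases the squared error.
print(ssr(b, a, pts) < ssr(b + 0.5, a, pts) < ssr(b + 1.0, a, pts))  # True
```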

4 points

·
31st Oct 2012

4 points

·
4th Jun 2014

A really bad practice amongst economic researchers is writing really long, bloated Stata or R files. Often they are not well documented, and they often involve a lot of magic and trickery, making you scratch your head trying to figure out why they took certain steps.

The solution I've found is to use makefiles and break up your Stata/R files into many small, portable pieces.

Make (http://www.gnu.org/software/make/) is basically a way to list out all the steps to get to your result and specify all the dependencies. Some people like to work with drake which is 'make for data': http://blog.factual.com/introducing-drake-a-kind-of-make-for-data

Usually the makefiles will specify how to import the data, clean the data, and process the data.
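A minimal makefile along those lines might look like this; the file and script names are hypothetical, and the point is just that each target declares its dependencies so make re-runs only the steps whose inputs changed:

```make
data/clean.csv: data/raw.csv clean.R
	Rscript clean.R data/raw.csv data/clean.csv

results/model.rds: data/clean.csv fit_model.R
	Rscript fit_model.R data/clean.csv results/model.rds

paper/figures.pdf: results/model.rds make_figures.R
	Rscript make_figures.R results/model.rds paper/figures.pdf

all: paper/figures.pdf
```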

Try to make your Stata files as modular as possible. It's better to have lots of small, clearly defined functions (hopefully self-documenting or well-documented ones) than a 1000-line function that tries to do all the steps one after another. This has the added bonus that you can add lots of unit tests and Travis to the functions, and it will be a lot easier to debug your functions one at a time than to write unit tests for a 1000-line behemoth analytics process.

A huge bonus of this approach, in addition to reusable and very readable code, is that it will be very easy for others to modify and iterate off of your process. Good luck.

4 points

·
15th May 2015

It's worth reading some Edward Tufte for guidance too (http://www.edwardtufte.com). He talks a lot about aiming to maximise the ratio of information to ink, so basically reducing wastefulness and minimising the extent to which we add bells & whistles to our charts. In a nutshell: avoid 3D piecharts ;)

It really depends on your intended audience and the standards commonly used in your subject area. If you're an R user, it could be worth a look here: https://plot.ly/r/

4 points

·
9th Apr 2014

May be overkill, but take a look at RStudio for the R statistical programming language. Fully functional, professional, open-source, statistics IDE. Cannot recommend enough.

then do something like:

x = rnorm(10, mean=2, sd=0.5)

This will generate 10 random draws, something like:

1.75884164412553, 1.96923295305206, 2.02906575504054, 2.84513526976282, 2.2049150744444, 1.73318266409414, 1.62611322640113, 1.2014866750171, 2.0473842615968, 1.92262243708622

4 points

·
2nd May 2021

Good Thinking is an older book from IJ Good that is basically a series of meandering rants about old-school Bayesian statistics. Very niche, but very interesting.

4 points

·
30th Jun 2020

I just finished my M.S. in statistics. Make sure you have these undergraduate-level topics nailed:

Linear Algebra (first semester, say at the level of Lay's text - look at it on Amazon to get an idea for its topics), down cold. Assume that you will get no time to review this material in class.

Calculus - have integration and differentiation techniques down cold from Calc. I and II, including Taylor/Maclaurin series. Double integration, partial derivatives, and Lagrange multipliers from Calc. III.

Real Analysis - make sure you can do ε-δ proofs as if they are second nature. Limits, continuity, uniform continuity, pointwise convergence, uniform convergence.

Probability and mathematical statistics, at the level of Wackerly's text.

Any programming experience you have would be helpful: doesn't matter if it's C, C++, Java, Python, or R. You have a CS degree, so this should be well covered.

4 points

·
4th Nov 2016

OK, how about a book to curl up with in front of a fire when you're feeling alert and awake, and at the same time comfortable and warm? Maybe there would be snow outside and a labrador by your slippers.

Anyway, I digress: just the kind of book that is *the* book. E.g., people drop a few hundred pounds on The Art of Computer Programming not because they want to read it at 8.15 on a Monday morning before they start work; they read it because they value it as an important, rewarding, and aesthetically pleasing thing to do.

4 points

·
11th Apr 2012

Introduction to Algorithms by Corman

Convex Optimization by Boyd

Pattern Recognition and Machine Learning by Bishop

Obviously these books at first glance aren't statistics books, but they cover tons of problems that come up in statistics, and Introduction to Algorithms is a must for anyone looking to program and build a good mathematical basis for it.

3 points

·
29th Feb 2016

I'm nearing completion of the Data Science specialization on Coursera. I've been pretty happy with it overall. I already have a decade+ of programming experience, so the early classes were rudimentary. But the last few courses - which to me are the meat of it - were pretty good.

The linked tutorial covers many of the same topics, but the specialization goes into WAY more detail. If you're thinking of doing this professionally, I would recommend doing the specialization (or something more in-depth) over the tutorial. If you're just looking to explore the topic to see how interested you are, then the tutorial would be a better fit.

3 points

·
17th Sep 2015

https://www.coursera.org/course/matrix https://www.edx.org/course/linear-algebra-foundations-frontiers-utaustinx-ut-5-03x

I feel like there is another one with the same concept of teaching LA through programming.

The problem I found with most stats classes that used programming is that they used R and relied on R's built-in methods, which, although they explained HOW they worked, still left you feeling that you were using a black box, so the outcomes always felt somewhat confusing. Meanwhile, if you build the functions yourself, all mystery is removed, and you realize without a doubt that some concepts are identical and only change based on context.

3 points

·
23rd Jul 2012

Here's a good start if you are truly interested. They start with very introductory material you probably learned in middle school and build up to statistical tests you will learn about in a college course.

3 points

·
8th Jan 2011

Right on! I'm a huge fan of the trial and error approach when it comes to learning new statistical software -- glad to see you're jumping into the deep end head first.

Anyways, I think you might be looking for the summary() function:

> model1 <- lm(stack.loss ~ ., data=stackloss)
> summary(model1)

Call:
lm(formula = stack.loss ~ ., data = stackloss)

Residuals:
    Min      1Q  Median      3Q     Max
-7.2377 -1.7117 -0.4551  2.3614  5.6978

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) -39.9197 11.8960 -3.356 0.00375 **
Air.Flow 0.7156 0.1349 5.307 5.8e-05 ***
Water.Temp 1.2953 0.3680 3.520 0.00263 **
Acid.Conc. -0.1521 0.1563 -0.973 0.34405

---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared: 0.9136, Adjusted R-squared: 0.8983
F-statistic: 59.9 on 3 and 17 DF, p-value: 3.016e-09

In the output, the "Estimate" column lists the coefficient for each predictor variable in your model (as well as the intercept). Hope this helps.

3 points

·
20th Dec 2012

err...well, what kind of data do you have? If you have something fairly digitized already, you could load your data into a network data structure and query the network for the node that has the highest out-degree. If for each article you have a list of citations handy, you could munge this into a format suitable for ingestion into something like gephi, which would do the hard work for you, give you lots of pretty pictures, and even allow you to do fancier analyses, like return the paper with the highest PageRank in the corpus, or with the highest betweenness centrality.

If you would be satisfied by the paper with the most citations *overall* (without regard to your specific corpus) you could use google scholar to count the number of times each paper has been cited.
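If your citation lists are already machine-readable, the within-corpus version of this query doesn't even need a graph library. A minimal sketch, using made-up paper IDs and a plain `Counter`:

```python
# Hedged sketch (toy, hypothetical paper IDs): given each paper's list of
# citations, find the paper cited most often within the corpus by
# counting incoming citations with a plain Counter.
from collections import Counter

# hypothetical corpus: paper -> papers it cites
citations = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["B", "C"],
}

# count how many times each paper is cited by the others
cited_counts = Counter(ref for refs in citations.values() for ref in refs)
most_cited, n_cites = cited_counts.most_common(1)[0]
print(most_cited, n_cites)  # C 3
```

For PageRank or betweenness you'd still want gephi or a library like NetworkX, but for "most cited in my corpus" this is the whole job.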

3 points

·
23rd Apr 2015

I recently took Andrew Ng's MOOC in Machine Learning. As part of the course, we learnt to use Octave (an open-source Matlab clone) and implemented all the main algorithms ourselves - linear regression, linear classification, neural networks, SVMs, etc.

If you want to go at a slower pace, then try the Coursera Data Science track, which is R-based. All the courses are free.

3 points

·
17th Feb 2015

The statistical mechanics course contains a lot of applications of MCMC. I did the course and it is pretty good.

I just stumbled over this course while searching for the link to the statistical mechanics one, so I don't know how good it is...

3 points

·
4th Feb 2015

Statistical theory is useful, but to apply it, you'll need to understand the tools used in the industry. I'd recommend the data-science track at Coursera. This way you'll learn some basic programming with R (a statistical programming language) and basic statistical inference. Teaching quality varies, but if you're motivated, you'll do fine.

3 points

·
16th Aug 2011

This is for ecologists but might work for you...

Benjamin M. Bolker, Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, Jada-Simone S. White, Generalized linear mixed models: a practical guide for ecology and evolution, Trends in Ecology & Evolution, Volume 24, Issue 3, March 2009, Pages 127-135, ISSN 0169-5347, DOI: 10.1016/j.tree.2008.10.008.

http://www.sciencedirect.com/science/article/pii/S0169534709000196

3 points

·
19th Sep 2014

Would it be ethical to remove customer names/ids and release the data? I'm sure you could get some volunteers to help investigate something as interesting as what's being read at the library!

The rule of thumb is that **if you can fit the data on a single machine, it's not "big"** and hive, hadoop, spark, cloudera, etc should all be ignored. They're more cumbersome than they're worth.
I'd suggest choosing from:

Learning more about crystal reports.

Learn enough sql to create the queries you're interested in, dump them to csv files, and use excel to create graphs.

If you want to get into statistical analysis and machine learning, learn R or Python. https://www.coursera.org/course/rprog which is part of https://www.coursera.org/specialization/jhudatascience/1 may help. I'm taking them to branch out from being a general purpose programmer.

3 points

·
23rd Sep 2011

Not sure how you were able to take a time series course without basic stats background (the stuff you list is typically taught in Stats 101). I'd suggest Khan Academy if you're set on taking this time series course right now:

http://www.khanacademy.org/?video=statistics--the-average#statistics

As someone who teaches stats to non-stats grad students, I would highly recommend taking some introductory stats courses before pursuing the time series class. Either way, good luck with the semester.

3 points

·
15th Jun 2019

Thank you. It’s all JavaScript. I created the content (and all the interactivity) in an Observable notebook and made the plots using Plotly.

3 points

·
16th Aug 2011

I am preparing a talk on one of my favorite topics (there is only one test) and using this question as an example. I hope you don't mind.

My draft slides are here

https://docs.google.com/present/view?id=dcq7d5hs_234dwck2rf2

Comments and suggestions are welcome.

3 points

·
16th Apr 2015

Yup. You might also be interested to hear that when Gosset (aka "Student") was doing his work on t-distributions, he implemented Monte Carlo methods without a digital computer:

> He then checked the adequacy of this distribution by drawing 750 samples of 4 from W. R. Macdonell’s data on the height and middle-finger length of 3,000 criminals and by working out the standard deviations of both variates in each sample (see Macdonell 1902). This he did by shuffling 3,000 pieces of cardboard on which the results had been written, possibly the earliest work in statistical research that led to the development of the Monte Carlo method.

3 points

·
24th Sep 2015

Look at JASP: https://jasp-stats.org/

It's new and open source, but it has an interface like SPSS and can probably take care of all the basics you need.

Nothing is going to do all your work for you, though: you need to understand what you want to do, what tests you want to run, and how they work in order to actually present something meaningful.

3 points

·
4th Sep 2015

From the FAQ:

> Q. What programming language is JASP written in?
>
> A. The JASP application is written in C++, using the Qt toolkit. The analyses themselves are written in either R or C++ (python support will be added soon!). The display layer (where the tables are rendered) is written in javascript, and is built on top of jQuery UI and webkit.

3 points

·
16th Mar 2011

Working with Gephi is rather intuitive. You can request a bunch of measures, the ones you describe are certainly in there.

If you need more sophisticated measures, it will probably be less comprehensive than what is available in packages such as igraph or sna, but with a point-and-click interface for both the measures and visualization of them.

As you talk about changes "over time", Gephi recently also got the ability to visualize the changes in the graph over time. Again, nice interface (time-slider), but I do not know if the necessary time-variant measures are also included.

Gephi should be able to handle the format you describe. Another package I frequently use for manipulating/creating network data, changing formats, etc. is NetworkX.

3 points

·
18th Feb 2015

To echo what PhaethonPrime says, you'll be okay for most stats as long as you don't need to do any exotic (and also non-bayesian) models.

In terms of the graphics, definitely check out learning some of the "grammar" based plotting libs. This is one area where R still crushes it, but Python's Bokeh is getting interesting these days.

3 points

·
7th Aug 2015

Try RStudio if you want to go with R; it is much easier to use. Just find a couple of examples online and you will be good to go.

The best alternative probably is Stata, but I do not think that Stata produces nicer output (admittedly, you do have to program more in R to get the nice output). Also, Stata is not for free.

Bottom line: try R, using RStudio. If you really do not like it, get something like Stata (or perhaps even SPSS). Don't bother with Matlab (similar coding requirements as R, not free, and graphics are not that amazing out of the box), or Mathematica.

If R is working for you and you want even some more freedom, go with Python / [Julia](http://www.julialang.org).

3 points

·
4th May 2014

Stats 141: Statistical Computing taught by Duncan Temple Lang

of R fame: http://www.r-project.org/contributors.html

There is a lot going on in the Davis stats program, and the computational stats program definitely offers the skills to get a job right out of school. You should talk to counselors.

3 points

·
26th Jul 2011

As a statistician, SQL is a good addition to your toolbox. I do some work in R, which by default loads all data into memory. This is a problem if you're working with data sets that are a few GB or more in size. If the data is in a relational DB (i.e., a DB that can be queried by SQL), then you may be able to write a query to select a subset of the data that fits in memory and proceed from there.

On that note, you may eventually want to learn a little about map-reduce, a technique for operating on data sets so large they don't fit on a single hard drive. I think the most popular open source implementation of map-reduce is hadoop.

Going back to SQL, I'm not familiar with MariaDB, but a popular small relational database is sqlite. Unfortunately, you can't really do much (with sqlite or any database) until you've loaded in some data to play around with. Does anybody know of any public data sets that are easily -- as in, for a novice -- loaded into a popular database?
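As a minimal sketch of that subset-query workflow (hypothetical table and column names, using Python's built-in sqlite3 module):

```python
# Hedged sketch: pull only the subset of rows you need into memory via SQL,
# rather than loading the whole table. Table and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")  # a real dataset would live in a file-backed DB
conn.execute("CREATE TABLE measurements (site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("north", 1.2), ("south", 3.4), ("north", 2.1), ("east", 0.9)],
)

# only the 'north' rows ever reach the analysis environment
subset = conn.execute(
    "SELECT value FROM measurements WHERE site = ?", ("north",)
).fetchall()
print(subset)  # [(1.2,), (2.1,)]
conn.close()
```

The same idea carries over to R: run the WHERE-filtered query against the database and only the result set has to fit in memory.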

3 points

·
16th May 2012

This. If you do go down the data mining route, check out Weka (http://www.cs.waikato.ac.nz/ml/weka/). It doesn't take very long to learn and is great for exploring relationships between variables in a large, multi-variable dataset.

3 points

·
27th Aug 2012

I've found LyX to be a nice way to crank out tables or long equations in a hurry; it's got an easy-to-use interface, and the code to produce what you have written up is automatically generated (like a happy union between Word and a standard TeX editor). Often, I'll have it open in the background while I'm working with another editor so that I can hop over and create a table or an equation, then just copy the code back into my main document.

Here's a link: http://www.lyx.org/

3 points

·
9th Jun 2021

I would highly recommend starting with the following:

- Practical Data Science with R
- Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python

While not strictly for A/B or Marketing, they give you the tools and encompass the principles used in marketing analytics. (Business Analyst, Currently working also with SEO and Marketing Analytics)

3 points

·
3rd Jun 2021

What textbook did you use in your class and how much of the book did you cover? Which topics did you cover?

I recommend Introduction to Probability by Blitzstein and Hwang [Link to PDF]. Then, for a book that goes deeper into the theory (and proves theorems in full rigor), I recommend Probability (Theory and Examples) by Durrett.

3 points

·
2nd May 2021

If you're interested in a bit of history, both Salsberg's The Lady Tasting Tea and Hacking's The Emergence of Probability are good reads. They dig more into the ideas and people that went into the original developments of probability and statistics. I found understanding how the field began gave greater context to the methods we use today and modern arguments about them.

3 points

·
26th Oct 2020

This is a great book to learn spatial analysis: Modern Spatial Econometrics by Anselin.

I'm sure there's a free version online.

3 points

·
3rd Sep 2020

That's only one area of nonparametrics. Nonparametric models are important for small data too, where you don't or can't assume a distribution.

But I do agree with your sentiment that having a CI for your prediction is important.

To be fair to ML, there are areas where it is very good, and that's data with low noise (images, NLP, etc.). I believe Frank Harrell's book Regression Modeling Strategies talks about this and his view on ML.

I also believe the ASA has talked about how to augment ML with statistics as part of their goals for 2020.

If I can find ASA's post I'll update this post.

I don't know if this falls under nonparametric models, but the bootstrap gets the distribution from the data rather than from some assumed distribution. It's certainly a nonparametric technique, at least. There are situations where the bootstrap would be better than its parametric counterpart.

3 points

·
4th Aug 2020

A lot of clinical trial issues are pretty fundamental statistics. There are some specific weird things that tend to come up in trials, and not in other places (e.g. compliance with treatment). This book describes a lot of those issues in clinical trials, and it's fairly short.

https://www.amazon.com/Designing-Randomised-Trials-Education-Sciences/dp/0230537359

Disclaimer: I used to work with the authors, and have published papers with them. But I bought the book, with my own money.

3 points

·
23rd Jul 2020

This is an applied statistics book that will walk you through PCA and a bit beyond.

https://www.amazon.com/Analysis-Multivariate-Statistics-Behavioral-Sciences/dp/1584889608

3 points

·
5th Apr 2015

R is a high-level language that is fairly easy to pick up once you know basic CS coding language syntax and principles (at least I thought so). Additionally, if you are interested in writing statistical software for R, much of what you write will be in C++ and wrapped for R.

The C++ experience is good, but I would definitely recommend doing more on your own (mini projects like games, etc.) so you feel like you have a more versatile grasp on it.

I'm currently working through Cormen's Introduction to Algorithms, if you end up doing something similar this summer let me know and I can try to provide guidance.

3 points

·
19th Sep 2012

For general biostatistics I'd recommend "Intuitive Biostatistics" by Harvey Motulsky, although it's thin on graphical representation.

For presentation of graphics Tufte's "The Visual Display of Quantitative Information" is great.

2 points

·
14th Aug 2011

I don't claim to be very good at explaining things, so here is a pretty good intro to estimating a population proportion.

http://stattrek.com/lesson4/proportion.aspx

If you're not really into the theory, here is a handy calculator.

http://www.wolframalpha.com/input/?i=binomial+distribution+confidence+interval

2 points

·
26th Mar 2013

Try iTunesU, Coursera, and Khan Academy.

https://www.coursera.org/course/stats1 http://www.khanacademy.org/math/probability

You may want to start with some classes on probability, it's the basis on which statistics is built.

You can also look at high school AP stats classes.

2 points

·
9th Dec 2012

If you're just looking for an introduction to general statistics concepts I would suggest trying the Khan Academy videos for statistics.

A decent text to try is Elementary Statistics by Triola. I've taught intro to stats a few times with it, and don't have any major complaints.

For more serious graduate level, check out Casella and Berger mentioned below (looks promising to me; will pick up a copy sometime soon).

2 points

·
31st Dec 2011

Okay buddy, here you go. First, you can get this data from the California Statewide Database project. I grabbed the 2010 election results, summed up the governor's election and Prop 19 by county, and ran a bivariate regression. I got my parameter estimates and used those to create an expected value.

FIPS Code for San Diego is 73. As you can see in the data, it is 4 percentage points higher than expected by the regression.

I have attached the spreadsheet here:

It is a simple regression. The explanatory parameter and regression is significant.

2 points

·
20th Jun 2019

It's the double-edged sword of "data science": a vague name for a broad and loosely structured set of skills is going to beget a lot of jobs with broad, vague, and loosely defined skillsets.

I generally get a good feel for what they are looking for in the job description itself though. I'm less data sciencey and more QC stats kind of person, but that has similar pitfalls. Anyways, here are two examples. Not the best examples, just two I dug up quickly.

http://www.indeed.com/viewjob?from=appsharedroid&jk=dbe49f98a547ba8b

http://www.indeed.com/viewjob?from=appsharedroid&jk=eaaaa3e0affc7c43

That first one I would never respond to. The second one I would be more likely to (at first glance at least; I'm quite sure I'm not actually qualified due to the non-stats requirements).

I can tell the person who wrote the second one actually has a background in stats, and it's also clear to me that they sincerely need the person they hire to have a decent level of strength in this.

There are keywords I see in both. In the first, it's "manipulating data", "statistical graphs", "Tableau". In the second, it's "factor analysis", "test design". You'll also notice that the second one requests you *write* technical reports, while the first talks about fulfilling requests from writers.

This is a pretty extreme example in terms of the differences between these, but I thought it was illustrative.

2 points

·
29th Jun 2019

If we take your setup to be exact then the probability of being helped by any individual is given by 1/(xn). [When x=1 it is simply 1/n, etc.]

Importantly for us, this means that the probability that an individual does not help you is 1-1/(xn), or alternatively (xn - 1)/xn.

If we are working in your idealized scenario, then we can readily answer the question "what is the probability that no one helps us?". Assuming independence, then the probability that no one helps us is the product of the probabilities that each individual does not help us (think a coin flip: the probability of flipping heads is 0.5, the probability of flipping heads twice in a row is 0.5x0.5 = 0.25).

What this means is that the probability that we are not helped is given by [(xn-1)/xn]^n

From here we grab the complementary probability and say the chance that we are helped is given by 1-[(xn-1)/xn]^n.

You can play around here: https://www.desmos.com/calculator/be9zocwrtq where the x-axis represents the sample size and "z" can be set to be "x" in the above expression.
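The expressions above are easy to check numerically; a quick sketch (the values of x and n are chosen arbitrarily):

```python
# Numeric check of the expressions above: each person helps with
# probability 1/(x*n), so assuming independence,
#   P(no one helps)  = ((x*n - 1) / (x*n)) ** n
#   P(someone helps) = 1 - ((x*n - 1) / (x*n)) ** n

def p_helped(x, n):
    return 1 - ((x * n - 1) / (x * n)) ** n

# with x = 1, P(someone helps) tends to 1 - 1/e (about 0.632) as n grows
for n in (10, 100, 10_000):
    print(n, round(p_helped(1, n), 4))
```

This also shows the counterintuitive part of the setup: adding more potential helpers barely moves the overall chance of being helped, because each individual's probability shrinks in step with n.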

2 points

·
17th Jun 2012

I'll give it a shot.

Alpha is an assumption associated with any frequentist statistical comparison. Alpha deals with how likely you are to make a Type I error: whenever you take a sample of data from a population to compare it to some other value (another sample or population), there's a certain likelihood that by "some twist of fate", the sample you picked was uncharacteristic of the population from which it was drawn. Alpha is set by the experimenter based on previous data and experiments in the particular field. Psychology uses .05; other disciplines use .1 or .01. The homogeneity of the participants/test subjects/microbes/groups/data in one's sample usually affects what one sets as one's alpha. This occurs largely due to design constraints (i.e. microbiologists can expect greater homogeneity across their samples than a sports psychologist).

"Most significant alpha level" is a turn of phrase I've never heard before.

Finally, data cannot prove a hypothesis correct; rather, it can disprove a null hypothesis. Again, this is due to experimental design constraints. Statistical comparisons are more rigorous than pure numerical comparisons; just because an average of "5" is greater than an average of "4", does not mean that they are statistically different. The variance associated with these two means, and the statistical and experimental design considerations in place are needed to determine if the samples used to identify those means are dissimilar.

These might help:

2 points

·
5th Jun 2012

I haven't actually used them, but I've heard good things about Khan Academy for beginning stats. Here's a link to the stats videos: http://www.khanacademy.org/math/statistics. I have a friend who is an occupational therapist, and when I was doing her stats homework the biggest thing was understanding p-values and hypothesis testing, so that when you read about a study reporting a p-value of .01, you can understand what that means in the context of the study.

2 points

·
16th Aug 2011

I do a lot of statistical computing, sometimes with large data sets, and this is the new build I just put together for myself. Total cost would be about $12K, but I reused the power supply, video cards, and case from my last build. Here is a link to the parts

2 points

·
10th Apr 2015

> I want to be able to from initial usage (i.e. first day or week) to be able to place a user into a segment for the purposes of sending them customized in-app offers and balancing.

The task you are asking about boils down to building a recommender system. This is an information retrieval task closely related to the operation of search engines. Here's a relevant Coursera course.

> from initial usage

This is a particularly challenging area of recommender system space called the "cold start" problem: if you had a full history of a user's activity, providing a recommendation would be straightforward. You don't, so it's not. Google this phrase for discussions on various approaches. You will probably use some variation of an approach called "nearest neighbors."