Replication is a good idea and is not p-hacking. In fact, you can combine the p-values from the two studies; here is a description of one method. Also, avoid the all-or-nothing rejection of the null hypothesis. The beginning of this article has a good discussion of this issue.
Use R. It is free, open-source software for statistical modelling and analysis.
http://nlp.stanford.edu/manning/courses/ling289/logistic.pdf will tell you how to do a logistic regression appropriately in R.
Let me be the 94th person to recommend R with optional RStudio/tidyverse.
You might also be interested in JASP and JAMOVI, they are free / open source and really good!
The two projects forked in different directions a while back, IIRC the main difference (apart from light cosmetics) is JASP offers a hook into Bayesian/Network analyses, while JAMOVI has stronger links into underlying R code.
The Theory of the Design of Experiments by Cox and Reid is a useful book. You should definitely click that link and buy it from Amazon rather than clicking on this link of a pdf of the whole book which came up as the second result in my google search just now, but which we all certainly agree should not be clicked on or shared.
It's been ages since I read it, but I recall really liking Richard Hamming's _Methods of Mathematics Applied to Calculus, Probability, and Statistics_. It's very long, but it's really well presented, and the coverage is about what you're looking for.
This book is amazing: Discovering Statistics Using R by Andy Field.
If you are doing self-study, it is easy to lose momentum. This book is hilarious, personal, and transcends the textbook genre.
Disclaimer: I'm not a PhD student, just a late-undergraduate student (0.5 of a semester left) in computer science.
Depending on what sort of computation you are pursuing, and your background wrt computer science, you might want to include some machine learning books. Some recommendations...
General ML:
Reinforcement learning/Artificial Intelligence:
I think it's doubtful that you will use all of these tools/techniques (esp. reinforcement learning/AI resources unless that's your thing) but it might be worthwhile to know that they're out there in case you do...
Use a calculator: Assuming that you have a random sample from the 1M frogs, the solution is pretty simple. What you are doing is trying to estimate the proportion of your population that possesses a particular property. Specifically, you want the sample size needed to have a small confidence interval. Here is a calculator that does exactly this: Sample size calculator. You will have to input the margin of error that you want (for example, if you are fine with an interval of 55% to 65%, then the margin of error is 5: plus 5 and minus 5) and a confidence level (the most typical choice is 95%). For the proportion, if you want to be conservative, use 50%.
How does the calculator work? Its calculation is based on the variance of the sample proportion: p*(1-p)/n. The standard deviation is the square root of the variance, so the margin of error for a 95% confidence interval is 1.96*sqrt(p*(1-p)/n). So what you want to do is require 1.96*sqrt(p*(1-p)/n) to be less than the desired margin of error. Example for a MoE of 5: https://tinyurl.com/y7j6yxrf
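If you'd rather do the arithmetic yourself than use the calculator, here is a minimal sketch in R of the same calculation (using the conservative 50% proportion and the 5-point margin of error from the example above):

p <- 0.5                                 # conservative guess at the proportion
moe <- 0.05                              # margin of error of 5 percentage points
z <- 1.96                                # 95% confidence
n <- ceiling((z / moe)^2 * p * (1 - p))  # solve z*sqrt(p*(1-p)/n) <= moe for n
n                                        # 385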
Beware: Now I am a little bit worried about how you framed the exercise. What does it mean to have "a random sample of 50,000 frogs at a nearby zoo"? Are you sure this is an accurate random sample of the 1M frogs? Can you elaborate on where these 50,000 frogs come from?
Have you looked into Amazon Web Services? It lets you run R on their cloud servers.
Here's a tutorial for how to do this. It's less complicated than it looks: https://www.youtube.com/watch?v=NQu3ugUkYTk
Pricing is pretty reasonable: https://aws.amazon.com/ec2/pricing/
I'd say it is pretty easy to pick up and designed for what you want to do with plenty of specialised stats functions ready to go. Impossible to predict how you personally will take to programming, but it's free and it's fun to play with.
SAS is expensive and harder, especially for graphics. Stata is good, a tad easier than R, but not free.
This discussion looks useful for you
As long as you're prepared to put some time in, I'd start with R because it's free. If it does your head in, try Stata. The learning experience won't be wasted regardless, a lot of what you will need to learn to get started is very similar across platforms.
There are a few different things going on here.
For your first question you don't really need to do any statistics, unless I'm confused about what you're asking. The assignment with the biggest point total is the one that affects the final grade the most. I'm not really sure what else you would need than the proportion of points each assignment accounts for.
You could use an exploratory factor analysis to see if there are different latent factors (i.e., are the assignments all actually assessing skills in the same domain or multiple domains) but to do EFA well you will need to do a good bit of reading/learning first.
Further, I would honestly suggest a k-means cluster analysis for your second question. A cluster analysis is basically an EFA for identifying groups of people rather than latent factors that are the common cause of observed variables. So you tell it which variables to use for classification, and then it uses an algorithm to identify groups of people that are alike ("clusters") on the basis of the set of classifying variables used. You then investigate the "levels" of each of the classification variables in each cluster to see what the real differences are.
You could do both an EFA and a cluster analysis in jamovi (free! - https://jamovi.org) if you don't have the software. You will need to get the "SnowCluster" module (also free) to do the cluster analysis.
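If you later want to script the same thing instead of pointing and clicking, a minimal k-means sketch in R (the data frame df of numeric classifying variables and the choice of 3 clusters are placeholders):

df_scaled <- scale(df)                # standardize the classifying variables
km <- kmeans(df_scaled, centers = 3)  # ask for 3 clusters
km$centers                            # the "levels" of each variable per cluster
table(km$cluster)                     # cluster sizes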
That paper is about 30 years old, and causal inference has come a long way since then. A big innovation in the field has been the development of inference techniques that make minimal assumptions about how the world works. If you're curious or skeptical about the math, Imbens and Rubin have a pretty good textbook on it.
As for success stories, all pharmaceutical trials are causal inference and I would consider things like vaccines to have been a net benefit for society. If you want to stay within the social sciences, causal inference is what allows you to successfully evaluate the effects of policy. Causal inference is what tells us that we should invest in early childhood education and high quality teachers (PDF). Finally, businesses use things like A/B testing to design effective marketing tactics. Businesses continue to do this so I would say that there's good reason to think that they are valid techniques and help businesses make more money.
Swiss mathematician Jakob Bernoulli, in a proof published posthumously in 1713, determined that the probability of k such outcomes in n repetitions is equal to the kth term (where k starts with 0) in the expansion of the binomial expression (p + q)^n, where q = 1 − p. (Hence the name binomial distribution.)
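Written out, that kth term is the familiar binomial probability mass function:

$$P(X = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \dots, n.$$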
Simulation would work better for that case. Here's an example of how to do that using numpy's random geometric distribution.
If you want to get really fancy, we can use markov chains to calculate the number of steps as well.
Good question! I cannot find a difference between the two terms. To my limited knowledge, it's contextual - association is used in cases of categorical data, relationship is used with numerical data. I did find this in the OECD Glossary of Statistical Terms:
Definition of relationship: "A connection among model elements." And: "Association is a semantic relationship between two classes." [ISO/IEC 19501-1:2001, 2.5.2.3] (from http://stats.oecd.org/glossary/detail.asp?ID=7079). I couldn't find a specific definition for an association.
There is, however, a difference between an association and a CORRELATION. A correlation refers specifically to a linear relationship between two numerical variables.
There are also such things as correlation coefficients and association coefficients, which refer to similar but not exactly the same things. The difference is again that correlation is more restrictive on the types of data that can be used to calculate it (assuming you're using the default of Pearson's correlation coefficient). As for association coefficients, I found this: http://www.encyclopedia.com/doc/1O88-associationcoefficients.html
I would also take a read over this recent article from Bates and co. re: how coefficients vary with different random effect structures and good procedure for specifying your random effects
https://www.researchgate.net/publication/278734089_Parsimonious_Mixed_Models
EDIT: direct arXiv link -> http://de.arxiv.org/pdf/1506.04967
I'm pretty sure you have no idea what you're getting into, but sure, learning is fun.
R is statistical programming software with many packages available. RStudio is a more user-friendly interface for it. If you have any coding experience you shouldn't have too much trouble with it. It's open source and free.
Once you have that running you can install the "forecast" package, and try running the following code
library(forecast)
x <- c(0, 2, 2, 0, 1, 1, 1, 0, 3)  # put the rest of your data in there, I stopped early
myts <- ts(x)
fit <- arima(myts, order = c(1, 1, 1))
forecast(fit, 3)
That gives you the model's prediction for the next 3 observations, (under "Point Forecast"). It also gives confidence intervals for each of the observations.
You can change the values of c(1,1,1) to other numbers. There is a lot of theory that you're not going to be looking at, but you can try it out. (But please keep those numbers small, between 0 and 3 would be best). You can change the number in "forecast(fit, 3)" to other numbers if you want to look further ahead.
As I said before, this is probably not going to give you reliable results. These methods are not meant to be blindly used.
There are many other options that you can try out, but this can get you exploring without too much headache (I hope).
It looks like you are using Python. Do you have a git repo? If not, no worries. I can still help.
> I didn't check the assumptions because this is the first project on regression I've ever made, so I do not have experience at all.
OLS regression relies on five primary assumptions. You can find specific detail in this article. You should test each of these assumptions, at least visually, using plots similar to the ones you have above. Taking the log of Price to account for the skew and then looking at a histogram of the resulting vector might yield some interesting insights with regards to its distribution.
Right from the start it does not look like your predictors have a linear relationship with Price. This is assumption #1 in the article.
Likely the normality and homoskedasticity assumptions (#3 in the article) will not hold either, because your response variable skews upwards. You can see this by creating a histogram of Price. This will result in your model fitting the data poorly. You likely need to consider a GLM to fit this data. I think a log link function will probably be worth looking into to start.
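A minimal sketch in R of those visual checks (assuming your data frame is df and the response column is Price; the names are placeholders for your own):

fit <- lm(Price ~ ., data = df)
hist(df$Price)          # strong right skew suggests trying a transform
hist(log(df$Price))     # often much closer to symmetric
plot(fit, which = 1)    # residuals vs fitted: linearity and constant variance
plot(fit, which = 2)    # normal Q-Q plot of the residuals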
>I've tried linear regression with a k-fold cv (k=5) on both the original dataset (standardized) and the dataset after the PCA, on both of them I've achieved very poor results..
This is because k-fold CV seeks to correct for over-fitting. Your problem is under-fitting.
R is more beginner friendly and is actually open source. I know SAS too and I hated it after learning R.
If the coding is tripping you up there are GUIs out there. I would suggest not using SPSS just because it is quite expensive depending on how you get it. Instead I would suggest other open source GUIs like jamovi and JASP. These two programs specifically are GUIs for R and they can be set to produce the R syntax resulting from your point and click actions. Well worth a try!!
I have a strong-ish preference for jamovi - https://jamovi.org
All right so it seems like a one-way repeated measures ANOVA might be exactly what you're looking for.
I highly recommend this book.
There's an entire chapter dedicated to the ANOVA test using a repeated measures design and its detailed implementation in SPSS.
This book is geared towards people with no prior background in statistics and provides great intuition about the usage of the different tests. There's also a section that walks you through the interpretation of your results.
Hope that helps and good luck with your project!
Uhh... I think your reference is either incorrect or referencing a republication of the same edition. According to Amazon, the 8th edition was released in 1989, long after both authors had passed away. Also according to Amazon, the 9th edition is on pre-order for an Oct. 2016 release date: https://www.amazon.com/Snedecor-Cochrans-Statistical-Methods-Kenneth/dp/0813808642
So that people don't waste time re-doing what I've done:
Good luck.
The multiplication there only gives the chance that gladiators B, C, D, and E (or any four other specific gladiators) attack.
What you actually need is the chance that 4 gladiators do attack and that 5 gladiators don't.
So the first step is .11^4 times .89^5.
But this is just the probability that the first four gladiators attack, and the rest don't. In actuality, any combination of four gladiators would satisfy. So what you'd want to look at is how many ways can you select 4 gladiators from 9 possible gladiators, or more mathematically speaking, what is 9 choose 4?
https://www.mathway.com/popular-problems/Finite%20Math/600050
There's 126 ways, so it'll be the probability above multiplied by that.
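Putting the numbers together: 126 * 0.11^4 * 0.89^5 ≈ 0.0103, so roughly a 1% chance overall.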
If the mgf exists in a neighborhood of zero there's a unique distribution with that mgf.
Given some set of cumulants, you should in principle be able to explicitly write the cgf and hence the mgf (since the log of the mgf is the cgf) and then invert (the mgf is the Laplace transform of -X, so you can invert it the same way you invert a Laplace transform) as long as that convergence in a neighborhood of zero holds. If you look at the relationship between cumulants and moments you can see that zero higher cumulants implies particular behavior for the moments; it's not immediately clear whether or not that sequence will converge in a neighborhood of 0.
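For reference, the standard definitions being used here (textbook facts, nothing specific to this thread), where the $\kappa_n$ are the cumulants:

$$M_X(t) = \mathbb{E}\left[e^{tX}\right], \qquad K_X(t) = \log M_X(t) = \sum_{n \ge 1} \kappa_n \frac{t^n}{n!}.$$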
However, Marcinkiewicz (1939)^(1) showed that the normal distribution is the only distribution having a finite number of non-zero cumulants; a cumulant generating function that is a polynomial must have degree at most two (e.g. see here).
So it turns out the answer for that bit is ... no.
[In the more general case of converting a set of moments - or indeed cumulants - to a distribution you can have the same moment sequence belong to more than one distribution when the mgf doesn't converge in the abovementioned sense, so in a case where a distribution exists for some moment-sequence, if it doesn't correspond to an mgf, non-uniqueness may result. There's some specific examples of multiple distributions with the same set of moments.]
1. J. Marcinkiewicz (1939), "Sur une propriété de la loi de Gauss", Mathematische Zeitschrift, 44: 612-618.
Not quite. First of all, a p-value tells you that IF there were no effect (just random noise), the probability of getting a sample difference as large as or larger than the one you observed (0.02) is 5%. It doesn't directly tell you anything about the probability that there is or is not an effect. This is because when we calculate the p-value we assume that there is NO effect. That's sorta like saying "assuming the grass is green, what is the probability the grass is green?"
Secondly, the 2-tailed p-value itself tells you only that they're different because a difference of 0.02 and -0.02 would give you the same p-value in this case. But you can use your estimates (.10 and .12) to infer the direction.
If you're interested in this more, there's a Bayesian technique called the VS-MPR which can give you estimates of what you're looking for: evidence for whether a p-value of 0.05 means the null is more likely to be true. Here's a good source on that.
ETA: Also, we don't usually say "95% significance"; it would be significance at the 5% significance level. It's an easy slip to make, because with confidence intervals we talk about our intervals with the phrase "95% confidence".
Note: I am not a statistician but a biologist who uses statistics.
You have what is called a repeated measures design, since you are taking measurements from the same people across multiple time periods. I'm not sure if you know how to use the R statistical software, but this would be relatively easy to do in R, which is free. If you have never used R, RStudio is a must as well.
This would be a linear mixed effects model, assuming your response variable CCT is a continuous value. You have a fixed effect of year and a random effect of individual id. For simplicity, a linear mixed effects model is very much similar to a paired t-test, depending on how you specify it.
To do this model in R you would run the following code
install.packages("lme4") install.packages("lmerTest") library(lme4) library(lmerTest)
data <- read.csv("/Path/to/your/file.csv")
data$year <- as.factor(data$year)
model <- lmer(CCT ~ year + (1|id), data = data)
summary(model)
Replace id with whatever column name you use to keep track of the identity of your individuals, replace year with whatever column name you use to specify the year, and replace CCT with whatever column name you used to specify central corneal thickness.
This should output a summary of the model where you will see the estimate of the effect of the variable year and whether it is significant (the p-value) under the Pr(>|t|) column.
If you end up using R and run this let me know if it doesn't work and I can try and help some more.
Try checking out this page: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
It's hard for us to help you when you're not being very specific. There are many different types of graphs: scatterplot, boxplot, bar, line? Do you want it all on one plot and colour-coded, or do you actually want seven different plots? If it is 7 different plots, is it split up by year? What's your x-axis and y-axis?
library(ggplot2)
I'm surprised nobody has told you yet to use R as statistical software. It's been three hours since you wrote that.
Having said that, BH^2 does not require any calculations beyond a simple calculator.
I'd be interested in your ability to predict alcohol consumption levels (e.g. abuse) based on other social factors. The weka package provides a number of classifiers that you can use and a graphical interface if you're not interested in programming.
I agree with this. Another approach you might try in the same vein is to use machine learning.
In particular, I'd try making a table of the data with a column for things like day of the week, month (or week number through the year). These will be the "inputs" to the machine learning. It will probably also be helpful to have the data for the previous few days be included as columns for the later data points (since they can be inputs to the machine learning model).
If you produce something like a csv file with a row for each data point (with the columns like above), you should be able to load it into a tool like WEKA (http://www.cs.waikato.ac.nz/ml/weka/) and try out a number of machine learning algorithms pretty easily. Unfortunately, I suspect that using the tool is not so intuitive and there's a lot of jargon to learn. Personally, I'd start with a "nearest neighbors" based approach - basically, it will look for when a certain pattern has happened in the past and guess based on what happened last time.
When you are building models, you can use something called "cross-validation" to see how good it is. Essentially, you leave out some of the data when building your prediction model, and then test how accurate it is on that data. You can do this on different subsets of the data to get a bigger test sample.
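A minimal sketch of that idea in R (the data frame df and outcome column y are hypothetical, and any model could stand in for lm):

set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))    # randomly assign rows to folds
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ ., data = df[folds != i, ])         # fit on everything but fold i
  pred <- predict(fit, newdata = df[folds == i, ])  # predict the held-out fold
  mean((df$y[folds == i] - pred)^2)                 # held-out squared error
})
mean(cv_mse)                                        # average error across folds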
It isn't a horrible place to start. It doesn't go very deep at all on the statistics involved, but it does a great job of providing a solid framework for establishing a culture of experimentation.
That's not how causality works.
Causal modelling is a whole rabbit hole. Check out this book for an intro: Judea Pearl - The Book of Why
Fields Andy, "Discovering Statistics" might be what you're looking for.
It provides great intuition as to what test you can use given your data and thoroughly explains the working of each without diving unnecessarily deep into the math. (As far as I remember, the usage of Greek symbols is kept to a minimum)
There are several variants of this book, each directed at a particular statistical software (such as R and SPSS). I'm currently using the version for R and can highly recommend it!
This book used to be free online but it is now available at a very low price (2.99). It has an excellent short and understandable section on logistic regression.
Statistical Rethinking is generally well-regarded. I've read through a bit of it, but not all.
There are 1,000 routes you could go here, and no one path is right for everyone. From your statement, my guess is that you want to get a good grounding in the basics, at least from the start, without a lot of proofs and high-level mathematics. I recommend going through something like Moore, McCabe and Craig's IPS to get a first look. Though I have an older version, the presentation is clear and straightforward, and covers just about all of the basics people in the medical profession use regularly. It is a pretty easy book, and I would rather people start at a level they find too easy than one that is too hard. If every physician/medical researcher fully understood what is in this basic book, it would be a great leap forward.
After you are exposed to this book with its techniques and examples, then you will be on a better footing to decide where to go next.
The Visual Display of Quantitative Information by Edward Tufte is a great read, definitely a foundation text for understanding visualizations.
The Grammar of Graphics by Leland Wilkinson and ggplot2: Elegant Graphics for Data Analysis (Use R!) by Hadley Wickham (which I only just started last month, but it's great) are related to each other, and they get into the idea of layering information and weaving statistical concepts into the visualization.
Only correct when the sample size is "large enough", like all asymptotic results.
The many samples from the bootstrap distribution are a more accurate representation of the sampling distribution of whatever you are trying to get the sampling distribution of than the normal distribution produced by the usual asymptotics.
The standard textbooks are Davison and Hinkley and Efron and Tibshirani.
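For a concrete picture of the basic idea those books cover, a minimal nonparametric bootstrap in R (toy data, with the mean as the statistic of interest):

set.seed(1)
x <- rexp(50)                                                    # toy sample
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))  # resample and re-estimate
quantile(boot_means, c(0.025, 0.975))                            # percentile interval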
Response surface methodology (RSM) is an approach to optimization. Design of experiments (DOE) is the approach to deciding what data to acquire. Frequently the response surface method requires experimental data to find the optimum, but not always, because there are models of systems that are already validated and can be used. When a model needs to be developed, it can be a wise choice to use design of experiments to configure the matrix of treatments necessary to provide data for optimization.
RSM typically follows a vector toward the optimum of a multidimensional surface. At initial phases of learning, the vector can be derived from a linear model of the surface. Multiple experiments can be conducted along the vector. At some point in the design space, it's typical that the linear model is not capable of detecting interesting effects, so curvature terms are added to the design of an experiment.
Each of the experiments along the vector could be a factorial design, with or without interactions. Eventually higher-order terms are added so the model can allow for peaks and valleys that are not at the corners of an inference space.
TL;DR: RSM uses DOE, though it doesn't have to if there is already a model.
A quick simulation in R:
deck <- rep(1:13, 4)
mean(replicate(1000000, all(sample(deck) != deck)))
# [1] 0.0162382
standard error is about 0.000126
Algebraic / combinatorial solution here:
Also see this paper, if you can manage to get access to it, or have access to a library that keeps it (many university libraries will carry it):
And it gets worse. A random vector has a mean vector and a variance matrix (also called variance-covariance, dispersion, or covariance matrix, the last being bad terminology because it uses up what should be the term for the covariance matrix of two random vectors). And if you wanted to go to higher-order moments you get into tensors.
Maronna et al. is newer than other standard references (Huber, Huber and Ronchetti, Hampel et al.) but has its own particular slant on the subject. But it is good.
There is discrete statistics, also called categorical data analysis, but you are barking up the wrong tree. Trying to discretize the fundamental concepts of probability theory (probability and expectation) is not the way to do it. Those concepts apply to all random variables: discrete, continuous, or neither.
For example, when you learn about the Poisson distribution, you learn that its usual parameter is the mean and the mean can be any positive real number. If you said the mean could take only integer values, then you wouldn't have the whole family of Poisson distributions.
To say the Geiger counter clicks at your lab bench follow a Poisson distribution with 2.7 clicks per second does not mean that there are 2.7 clicks in any particular second (what you are complaining about), but rather that the average number of clicks per second gets closer and closer to 2.7 as longer and longer time intervals are observed.
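A quick way to see this in R:

set.seed(1)
clicks <- rpois(1e6, lambda = 2.7)  # a million simulated one-second counts
mean(clicks)                        # the long-run average is close to 2.7
head(clicks)                        # but every individual count is an integer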
Kutner is definitely encyclopedic and is great as a reference. But as /u/efrique said, it might not be good for learning.
I used Weisberg, Applied Linear Regression, which includes R examples, as a text for my linear models course and thought it was quite good. I don't think I'd consider it a purely beginner-level text but it's not quite at the intermediate level either.
My cheat would have been to say "hmm, good question, I haven't used that test lately, please let me look in my reference book '100 Statistical Tests' by Kanji. Oh, here it is, Test 36:
> Test 36 To investigate the significance of the difference between two population distributions, based on two sample distributions. The Kolmogorov–Smirnov test for comparing two populations"
The concept you're explaining is called synthetic data, and bootstrapping is often used for that same purpose.
In fact you can use the method /u/4yolo8you showed you, because the bootstrap sample is going to already be a vector of synthetic data based on the inputs.
I don't think a gamma dist will work for this, because as far as I'm aware (and I'd have to double check), it only has support for x >= 0 (i.e., from 0 up).
If you want to take the time to learn something a bit more complex, but much more flexible, I would recommend using Generalized Lambda/Beta Distributions.
I'm still a complete novice with it myself, so I can only guide you so far, but it should be relatively easy to use the 'gld' package. You can use fit.fkml to fit the data, then pass the vector of lambdas back to the GLD to get a distribution.
Then you can use these: https://rdrr.io/cran/gld/man/GeneralisedLambdaDistribution.html to find the probability of certain results like you would with pnorm/rnorm/etc.
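A rough sketch of that workflow (hedged: I haven't verified where fit.fkml stores the fitted lambdas, so check str(fit) against the gld documentation before relying on this):

library(gld)
fit <- fit.fkml(x)               # fit the FKML-parameterised GLD to your data x
# assumption: the four fitted lambdas are in fit$lambda; verify with str(fit)
pgl(10, lambda1 = fit$lambda)    # P(X <= 10) under the fitted distribution
qgl(0.95, lambda1 = fit$lambda)  # 95th percentile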
I plan on doing a deep dive into the subject using this book: Handbook of Fitting Statistical Distributions with R.
Currently I'm doing a 30-day R coding challenge that's taking up all of my study time, but I'll be picking up the book as my main study text next month.
What you say is right in that bootstrapping does not give more accuracy (shorter confidence intervals or more powerful hypothesis tests), although it may give more higher-order accuracy (coverage of confidence intervals closer to nominal, significance level of hypothesis tests closer to nominal).
But I do not like your description of what the bootstrap does. Yes, the bootstrap is relevant simulation (relevant to the experiment actually done) rather than irrelevant simulation (relevant to toy problems having zero relevance to real-world problems, like those cluttering so many statistics papers). But the bootstrap has only asymptotic justification. It does not sample from the true unknown distribution of the data. This is obvious when that distribution is continuous. (The empirical distribution is discrete, concentrated at the observed data values.)
Also, the bootstrap can give very bad answers when done naively. It is not magic. One has to have some real understanding to use it correctly. You don't have to understand the difficult mathematics of the proofs, but you do need to understand most of what is said in Efron and Tibshirani or Davison and Hinkley. Naive notions like what you said can lead users to mistakes from overconfidence.
Coursera has some free statistics courses as well. They’re fast paced and skip over some stuff but it can help guide them with side research. I have the Humongous Book of Statistics: https://www.amazon.com/Humongous-Book-Statistics-Problems-Books/dp/1592578659/ref=mp_s_a_1_16?crid=2CW6P82RP6I7O&keywords=statistics+practice+workbook&qid=1652071796&sprefix=statistics+practice%2Caps%2C140&sr=8-16 Having a paper copy is helpful.
I also got a CourseHero subscription, which can border on cheating and some info can be wrong, but I found it helpful just looking over different students' methods and conclusions.
Also recommend Khan Academy. And any free textbook on Bookdown.
Whenever I try to learn something I find that it helps to first have a comprehensive exploration of the whole field so that it gives me a context to place all the stuff I learn into it. In that sense, starting on a sequence of courses from square one would slow me down.
Here's a great start, I wish I had this book when I started:
https://www.amazon.com/Art-Statistics-Learning-Pelican-Books/dp/0241398630
And for machine learning / data-science, this is a great introduction:
https://www.amazon.com/Data-Science-Press-Essential-Knowledge/dp/0262535432
They're tiny books, more like glossaries, but they provide a lot of grip for your journey.
Try Andy Field's book _Discovering Statistics Using IBM SPSS Statistics_. You don't have to necessarily be an SPSS user to follow along, because he does a really good job explaining the concepts. Also, _Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences_ was basically the stats bible in my PhD program (I/O Psychology).
Some Master's programs have an entry exam and even preliminary trajectories to get you in, so you don't end up in the wrong place, wasting time and money.
You should, at the bare minimum be familiar with what's being presented in this book. If not, it's a great start after which there are plenty of other resources to get you up to speed.
The Art of Statistics: Learning from Data https://www.amazon.com/dp/0241398630
My professor wrote a great book on statistical theory with lots of Montana-oriented examples. It is a bit pricey though
Mathematical Statistics: An Introduction to Likelihood Based Inference https://smile.amazon.com/dp/1118771044/ref=cm_sw_r_apan_i_FMS0YCZ94EVRH5772Z09
This is by far the best book to start:
https://www.amazon.com/Art-Statistics-How-Learn-Data/dp/1541618513
What also helped me was understanding that there's both Frequentism and Bayesianism. Frequentism is by far the most used, but what's frustrating about it is that universities often present it as if it's the only form of statistics. So knowing what Bayesian statistics is gives more context to Frequentism and makes Frequentism easier to learn:
https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/046509760X
If you’re having trouble with technical books, this was the easiest to read text I have found.
Time Series Analysis for the Social Sciences (Analytical Methods for Social Research) https://www.amazon.com/dp/B00SLGS1ME/ref=cm_sw_r_awdo_2J8W1T0MVK285C5EA87J
One of the clearest, easiest to understand books that covers almost all of these topics is Moore and McCabe's "Introduction to the Practice of Statistics"
I link to one version of the book-- they have newer editions, but the older ones are cheaper, and wonderful. They do not have mathematical proofs, but just clear "what", "why", and "how" with clear examples.
Probability and Statistics for Engineering and the Sciences: https://www.amazon.com/dp/1305251806/ref=cm_sw_r_cp_api_glt_i_A6J5WC1TG3SACZ2HC32R. This is the one we are using for Stats 2. I am a senior in college.
If you are maybe looking for clear verbal treatments of the topic rather than some of the more rigorous mathematical ones, I can recommend something like Pickup [2014](https://www.amazon.ca/Introduction-Time-Series-Analysis-Pickup/dp/1452282013). The writing is pretty clear and could help to build up intuition, while you look at the math more rigorously (with proofs and so on, if needed) in another textbook or during class.
Judea Pearl wrote a book on this:
https://www.amazon.com/Book-Why-Science-Cause-Effect/dp/1541698967
Grossly oversimplified: In order to determine causality, you need a counterfactual. A counterfactual is a comparable 'what if' scenario in which the intervention is absent.
Most statisticians don't actually bother with causality. Instead, they focus on how often a study would be true if repeated; this is called Frequentism. It's a robust standard for comparing studies. But it's not very satisfying. One could argue the entire point of establishing whether or not something correlates with something else is that there is some hunch that the two might be causally related.
Not even a longitudinal study . . . I am currently working with a cool dataset where I can easily "prove" that a future outcome causes a past result if I set things up right :) What you actually need is an experimental manipulation that is randomly assigned to half the participants.
Plotting depends on the software you're using. What have you got? In your case I think I might start with a scatterplot of depression vs. discrimination, with two colors of dots and two fitted lines (one for married, one for not married). Here are a couple of random examples from google:
Using Swirl now and it's great. There's a Coursera course in conjunction with it here: https://www.coursera.org/course/rprog which is also handy. It takes you from the basics so you won't miss anything.
There is a marvelous book (and articles) from kassambara (or something like that) on PCA. Here it is: https://www.amazon.es/Practical-Guide-Principal-Component-Methods/dp/1975721136. I bet you can find kassambara's page.
Well, being honest you’ll have a ways to go if you plan to stack your academic resume with things that will put you in the running for data science.
You might want to start with something like symbolic logic for programming fundamentals, particularly when it comes to statistical programming and wrangling data sets. A cursory search yielded http://www.openculture.com/symbolic-logic-a-free-online-course
Openculture is pretty good for courses and texts. For statistics, there are a few online textbook/course options on Carnegie Mellon's website. Search "cmu oli" and navigate your way there. But if you really want to work in the field, it's probably best if you at least minor in stats or something. This means you'll need to work your way through calculus (uni- and multivariate) as well as linear algebra.
Good luck!
Logarithmic functions, where as x increases, y eventually levels off at a certain point (well, not really: y still increases, but at such a slow rate that it cannot reach a particular value).
For reference, I got this, but they're looking for something more closed-form/simpler.
I used 3 pokemon instead of 4 to make it easier to do here but it expands to any number.
Battle 1 uses binomial probabilities as expected. Battle 2 has to calculate the probabilities of each outcome given each prior of Battle 1 having 0 items, 1 item, 2 items, etc., then weight all those outcomes by the probability of the prior and add them together to get the new weights for the next battle, and so on until the last battle's probabilities are the answer.
No. It is best to use something that actually agrees with statistical theory. See Little and Rubin, or Raghunathan.
What you are talking about is model averaging (either Bayesian or frequentist); the book by Burnham and Anderson discusses both. Hoeting et al. has much more on Bayesian model averaging. There are many terms for various kinds of frequentist model averaging and many papers about each type.
Anything with a one-sided specification limit is generally non-normal, and frequently well modeled with a log-normal: flatness, roundness, runout, leakage, standby current.
Cycle times are not normally distributed, generally they are exponential.
Defect rates are not typically normally distributed. Go all the way back to your first statistics class to remember that number of defective light bulbs has a binomial distribution.
Defects per unit are distributed like Poisson, this represents number of defects per area of opportunity. This is appropriate for defects on the fins of radiators, or bumps, bulges and nicks along the length of tubing or wire (though these length-related or time-related defects are not unrelated to exponential distribution).
See Andrew Sleeper's book "Six Sigma Distribution Modeling" for many more examples.
You are welcome! If you have access to a university library, I wrote a chapter of a book illustrating applications of lots of different distance measures. It is a Springer book, so you can print it out free from most good libraries. Here is a link to a Google books preview, see especially p. 448-449 in the preview, if it will let you.
Foundations of Location Analysis By H. A. Eiselt, Vladimir Marianov, Chapter 19 on "Voronoi Diagrams" by Burkey, Bhadury, and Eiselt.
Depends on what you've already learned in statistics. A good book (and free in PDF format) is The Elements of Statistical Learning. I would also suggest looking into some machine learning (though I do not know enough books to give good advice here), but there are lots of good online resources.
You also need to pick up some programming skills if you didn't already. I suggest looking into R and Python and possibly some SQL to start with before you move on into some more complex technologies.
A good starting point would be the Data Science specialisation by Johns Hopkins University on Coursera (https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage); you can take it for free or pay to get a certificate. It teaches you R along with some statistics, so if you already know the techniques you at least learn to do them in R.
I am sure there are many people who will be able to give more information, you can also try posting into /r/datascience if you didn't already.
Hi. I am in need of some expert inputs.
I have run a regression on a dataset for a multivariable polynomial equation.
How can we do it in Excel? My limited understanding is that in Excel you can only fit a univariable polynomial equation or a multivariable linear equation at most.
Can you help please?
Attaching my analysis. Thanks in Advance.
Thanks, that helps. I think you're there: Assuming that the range of ages is between 20 and 90, the maximal standard deviation (the worst-case) is 35. All the other points (except the population size) are basically at your discretion. Can you take it from here?
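(For reference, the standard sample-size formula for estimating a mean is n = (z * sigma / E)^2, so with sigma = 35, z = 1.96 for 95% confidence, and your chosen margin of error E, it becomes a one-line calculation.)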
I'm building a predictive model which will determine which items are likely to be stolen based on trends of the aftermarket price. (e.g. the price of an item on eBay)
I already performed a logistic regression using StatTools and the p-values say that my findings are significant.
Maybe this will help? https://onedrive.live.com/edit.aspx?cid=6ADFAE0D13423A55&resid=6ADFAE0D13423A55!1084&qt=mru&app=Excel
If you don't have StatTools (add on) installed you can't view a lot of my statistical data. :(
I had exactly this question and ultimately decided on Anderson's book. https://www.amazon.com/Introduction-Multivariate-Statistical-Analysis/dp/0471360910
I picked it because it seemed to me to be the most thorough and rigorous. This is not a virtue for everyone. If you like a more applied, and intuitive exposition, then you may prefer some of the books that you've listed.
I haven't yet read it because I've been busy with other projects, so this is probably my project for a year from now.
What do you need it for?
There's an answer on Cross Validated with four sets here (toward the end of the answer), but it uses R code to generate the data, chosen to approximately replicate diagrams in a published paper (referenced in that answer).
It doesn't give the data itself, just R code that creates it, but that's probably because they're each sample size 100 ... it's a lot briefer than giving the data.
If you don't have R installed, you could paste the code into www.r-fiddle.org and get the data that way
(after running the code, you could use write(x1, "") to get a space-delimited list of the first variable, or dput(x1) to get the values separated by commas (if you also remove the first two characters and the last one); then do the same for x2, x3, x4.)
Note that if you take stuff from a post there (including code, or text, or pictures etc) and use it elsewhere you would need to follow the license for StackExchange (which mostly boils down to giving credit in the required form):
> user contributions licensed under cc by-sa 3.0 with attribution required
which license is here: https://creativecommons.org/licenses/by-sa/3.0/
If I remember correctly, the license requirements are easy enough to follow. I haven't read them in a while though so I forget the exact details.
If you're just using the data generated from that code in a blog or something but not any of the images, code or text, you could probably just get away with a link to the answer.
I'm saying before you begin the experiment, you aren't sure what the increase in views is going to be from the new exposure point, so you don't know how conversion rate will play in to the math.
Basically with this tool again, here the min detectable effect is in terms of change in conversion rate, but I'd like to have a min detectable effect in terms of xthousand conversions.
Effectively, I want to decide if the new exposure point is worth it in terms of raw conversions, conversion rate aside (you probably expect two UI exposure points to a feature won't double conversions, but it may add meaningful marginal conversions at a lower combined conversion rate).
You need to look at the sample size of the test too, not just the p value.
Here's a calc https://www.optimizely.com/sample-size-calculator/?conversion=5&effect=14&significance=95.
It says you're missing about 10-20k visitors per bucket.
Not meeting minimum sample size inflates the error rate.
If possible you should continue testing.
Don't fall in the trap of looking at just the p value. Your test didn't hit the minimum sample size for opens or clicks.
You can use something like https://www.optimizely.com/sample-size-calculator/?conversion=5&effect=14&significance=95. It calcs the minimum sample size (per bucket) for two-tailed tests only. You're still missing 10-20k visitors per bucket, depending on whether you're looking at clicks or opens.
Not hitting the minimum sample size inflates the error rate.
It happened in the game Legends of Runeterra. This is what it looks like. Here I got 3 of the same card in the top 4 draws.
You have a 40 card deck in which you can have 1-3 copies of each card. Usually, around 30 of the cards are run as 3-ofs for consistency.
No matter how we cut it, it shouldn't happen to me once every 40 games, unless it's somehow rigged for some reason.
I learn best by example and by doing, so I would recommend downloading an app called Prism and learn by example. Their website is also very good at explaining concepts to medical professionals in your position. https://www.graphpad.com/scientific-software/prism/
I would have included anything that would be useful for the reader in order to gain insight. So if you do have high values for skewness/kurtosis, you could mention them. And, while this is a set task (I'm guessing Andy Field), normally you would also consider what you are doing with the data next, e.g., whether you will perform tests that require certain assumptions.
SPSS can be bothersome; I would recommend checking out JASP (easier to navigate, to see what you have done, to change things up, etc.). Even if you have to use SPSS, JASP can be a great learning tool. I can also recommend their short manual/stats guide.
I think I was not able to put my ideas into the right words. I just wanted to understand the reason behind using a Gaussian, because it is from the Gaussian that we get mean squared error.
By saying that it's used as a metric for every dataset, I just wanted to say that it's quite popular and finds application in regression problems, etc.: https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html
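For what it's worth, the standard connection (a textbook fact, not specific to PyTorch): if you model the target as Gaussian around the prediction with fixed variance, the negative log-likelihood is squared error up to constants, so maximizing the likelihood is the same as minimizing MSE:

$$-\log p(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right).$$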
It seems like you're just looking for the percentile ranges for the POC and then checking each value from the new device to see what percentile it falls into. This website goes over how to calculate percentiles in R; you'd be looking for 0.12, 0.15, 0.85, and 0.88. Then you'd use something like dplyr's filter function to find each of the values over or under those percentiles and count them with something like the length function.
Here’s my favorite cheat sheet for dplyr
As someone else mentioned, you'll want to set all this up as a function if you have multiple data sets to analyze or you want to use it frequently. Hopefully that will give you the necessary search terms and starting point. Good luck!
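A minimal sketch of those steps in R (poc, new_data, and the value column are placeholders standing in for your POC readings and new-device readings):

library(dplyr)
cuts <- quantile(poc, probs = c(0.12, 0.15, 0.85, 0.88))  # reference percentiles
outside <- new_data %>%
  filter(value < cuts["12%"] | value > cuts["88%"])       # readings outside the band
nrow(outside)                                             # how many there are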
So what I have done is: I create a model based on the auto data, then I use the function predict: predict(model, mtcars). There are only ~33 outputs, so I look at them and compare with the actual data in mtcars, like, manually.
I have attached the dataset as dput(): https://codeshare.io/am1ekW
I tried bst = xgboost(data=as.matrix(auto), mpg~. ,label=row.names(auto), nround = 391)
but it returns a list of train-rmse:NaN
Here's my data: https://codeshare.io/am1ekW
I'm a linguistics student teaching himself statistics, and I'm reading Statistics in Plain English by Timothy Urdan and finding it very helpful. It's more on the theoretical side, but with lots of concrete examples. My plan is to follow this one up with Learning Statistics with R by Danielle Navarro in order to learn the practical side of things. It uses the R statistical programming language, which might not be something you'd feel comfortable using, in which case I think Learning Statistics with Jamovi, adapted by the same author from the R version of the book, could be useful to you. Jamovi is a graphical interface for R that seems fairly easy to use and comprehensive.
You can definitely learn Applied Statistics without calculus, and in my opinion, this is the correct order to learn it (Learn the basics of data, distributions, probability, and inference, then learn the mathematical underpinnings/proofs). Of course, opinions differ on this.
A wonderful, gentle, and interesting introduction is Moore/McCabe/Craig Introduction to the Practice of Statistics.
There are more recent editions that are more expensive, but there is no reason at all to pay more for a newer edition. If you go through this book, working through the problems and taking notes, you will have a very good overview of applied stats. If you then want to learn more, then you can worry about learning the mathematical derivations of it all.
I hate your schedule. The semester drags out the easiest parts and crams the most difficult parts into the last two weeks.
This is the best place to get started:
https://www.amazon.com/Art-Statistics-Learning-Pelican-Books/dp/0241398630
Make sure you read up on hypothesis testing right now. It's a very counter-intuitive concept and it will take time to fully understand it.
Data Science seems to have the more robust career path. I recommend supplementing your coursework with as much of both as you can. Strongly recommend the following: Statistical applications, applied math, programming in R and/or Python, PowerBI, and this book.
I went the Biostatistics route for grad school and am now working as a data scientist/biostatistician. Very happy with where I am professionally.
How about this book? Although I prefer video courses, like this intro to statistics in R.
You can learn a lot of applied statistics using only algebra. Calculus is useful in learning the theory behind stats, but in high school you aren't really trying to learn what we call "Math Stats".
I recommend starting with a book like Moore/McCabe/Craig. It gives you a good sense of what statistics is all about, using only math you are familiar with. After you understand the applications and get motivated, then the mathematical/theoretical foundations can come later.
The following book on introduction to statistics might help
Introduction to Statistical Methods
It uses R software, and if you want to learn statistics using software then it might help you.
What you or they consider "simple" or "core" is too subjective to guess. I would use Probit/Logit regression. See e.g. Maddala page 22.
Not really a technical book but I'd also recommend The Lady Tasting Tea. It contains a lot of stories about various statisticians/mathematicians whose works are heavily used in statistics. For me, it added some personality to the procedures and tests, making them just a bit more adorable.
Thank you very much for the book recommendation (I've already bought it :) ) and the advice. I have the following questions, if you don't mind:
Thank you very much for your answer again, I was blind and now I have a path to follow.
This one, right? What barriers will I encounter as a person with a sparse stats background? I'm an elementary special education teacher so that's my background.
What's your math background?
My general recommendation would be to start with an intro-level stats book. I think Devore's Introduction to Statistics and Probability for Scientists and Engineers (Amazon link) is a good one. Older editions would be fine for what you'd need.
There's also an online book called ModernDive which goes through statistical methods and R programming together. I'm not sure if that goes as deep on the math though, Devore's book has some sections with a bit of the calculus.
After that, a math-stat book would be a decent choice. The book you mentioned is described as a graduate level book. Usually as a first intro to the topic, people recommend Wackerly
you might be interested in Regression and other Stories.
>Most textbooks on regression focus on theory and the simplest of examples. Real statistical problems, however, are complex and subtle. This is not a book about the theory of regression. It is about using regression to solve real problems of comparison, estimation, prediction, and causal inference. Unlike other books, it focuses on practical issues such as sample size and missing data and a wide range of goals and techniques. It jumps right in to methods and computer code you can use immediately. Real examples, real stories from the authors' experience
Had the same book in grad school and also found it challenging on its own (also my mathematical stats prof was garbage). Found this book paired really well with it: Hogg, Mckean, and Craig - Introduction to Mathematical Statistics
HMC covers most of the same material as C&B, but I consider HMC much more accessible. I still use C&B as a reference more often than HMC, but for pedagogical purposes I think HMC was the better book to learn the material from.
What's the purpose you need a Statistics course for?
Casella and Berger is the most commonly recommended book in its class: a graduate-level probability and statistical theory text. If you need a statistics theory textbook, Wackerly, Mendenhall, and Scheaffer is a solid one; it was used in my undergraduate program for the senior-level math-stat course. Though depending on what's giving you trouble, I can't promise it will be any better for you than Casella and Berger.
I know this is a difficult question, but if you can explain a bit more why you're finding (some of) the exercises difficult, that may help people here suggest other books. Is it the calculus? Is it connecting the math to the statistical concepts?
It's interesting that you're interested in DOEs but seem to not be collaborating directly with scientists, rather working with publicly available datasets. DOEs are not necessarily meant to collect "homogeneous" datasets; they're meant to make sure you collect the appropriate dataset to answer a specific hypothesis/achieve a specific goal you have a priori, given certain resource constraints. It requires that you make educated guesses about the underlying model of the phenomenon you're studying before you even collect any data. It's somewhat different from good data-cleaning practices performed a posteriori, which let the data alone tell you what model fits best. Nevertheless, I think there are definite advantages to bridging those 2 worlds.
The books I’m about to suggest are probably more tailored towards the scientists actually conducting the experiments/collecting the data, but I think they’re great intros to DOEs.
This is a great overview of classical design of experiments.
Optimal Design of Experiments - A Case Study Approach, by Goos and Jones
This is a very practical approach to get an intuitive sense of optimal designs. Not much math background included, but it makes for an easy read and will definitely make you understand the challenges faced by experimentalists/those collecting these currently imperfect datasets you mentioned!