Start by reading the great books on the topic: The Visual Display of Quantitative Information, Envisioning Information, and Visual Explanations by Edward Tufte, plus something that covers Bertin (his 1968 book Semiology of Graphics is dated but the ideas are sound; I like Information Visualisation by Colin Ware, which takes Bertin's ideas further). Then Now You See It by Stephen Few and Visualize This by Nathan Yau. You don't have to agree with those people's opinions, but those books will give you the tools to begin developing your own.
For folks new to Shiny on my team, I recommend the first section of Mastering Shiny (4 or 5 chapters) and then watching both parts of Joe Cheng's Shiny Dev Con talk on reactivity.
https://www.rstudio.com/resources/shiny-dev-con/reactivity-pt-1-joe-cheng/
Save as PNGs. But not just regular PNGs: use the Cairo PNG device (ggsave(..., type = "cairo-png"); you'll need the cairoDevice package installed). It's amazing the difference this simple step can make in image quality and (especially) font readability.
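If it helps, here's a minimal sketch of that (the plot and file name are made up; the cairo-png type assumes your R setup has Cairo support):

library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggsave("slide_plot.png", plot = p, width = 8, height = 5, dpi = 300,
       type = "cairo-png")   # the type argument is passed through to the png() device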
Use the font size argument to any ggplot theme function. The defaults are good for print; for a presentation you'll need to bump it up to, e.g., theme_classic(base_size = 20).
Use relative font sizes (rel()) for theme customization so that your base_size argument propagates through.
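For example (a rough sketch; the sizes and theme elements are just illustrative):

library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_classic(base_size = 20) +                       # presentation-sized text
  theme(axis.title = element_text(size = rel(1.2)),     # 1.2 x base_size
        plot.title = element_text(size = rel(1.5)))     # rescales if base_size changes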
Use the R graphics device window (or the RStudio pane or whatever) only for rough drafts. By the time you get close to a production-quality graphic, save it as a PNG and do all final adjustments based on viewing the file. To help with that, get a lightweight image viewer that will auto-update when the file changes; I've recently started using ImageGlass and it works fine, and there are probably several other very good options too. Make the viewer window about the size of the graph in your PPT window and make sure things look sized right before putting it in PPT.
Don't use default ggplot colors. For presentations you'll want colors with relatively high saturation; some of the RColorBrewer palettes work, or the colors from the wesanderson package.
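A rough sketch of what that looks like (palette names are just examples; check what each package offers in your version):

library(ggplot2)
ggplot(mpg, aes(class, fill = class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")   # a higher-saturation RColorBrewer palette
# or swap in wesanderson colors instead of the brewer scale, e.g.:
# scale_fill_manual(values = wesanderson::wes_palette("Zissou1", 7, type = "continuous"))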
As others have said, use ggthemes to find a theme that works for you, and customize it to work for your presentations. For presentation figures generally, keep things clean: don't overdo gridlines or axis-break labels, and move annotations from the legend onto the graph where possible (use annotate()!).
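A quick, made-up example of labelling on the plot rather than leaning on a legend:

library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  annotate("text", x = 4.5, y = 30, label = "Linear fit") +   # the label lives on the plot itself
  theme(legend.position = "none")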
There are good 'cheat sheets' available out there on some R packages.
These, I've found, are a good way to eliminate some of the noise of Stack Overflow, and help get you where you need to be.
Though, part of the beauty of R is that there are so many ways to arrive at the same end point. Use what you are most comfortable using and understand.
It also wouldn't hurt to over-annotate your code with comments, especially when starting out. One month from now you will really appreciate the comments you put in your code now when you need to figure out how the hell you got the code to work before!
Munging and cleanup of CSVs really needs to be done in a script. If not, your whole process is vulnerable to human error, and how would you audit your work? Also, once you've figured it out, you can throw data into data frames much faster since it's automated.
1) R is great for cleaning up data. (Packages: reshape2, dplyr, gdata, stringr, and lubridate; the tm package if working with text.)
2) If you're struggling to learn them, I would suggest OpenRefine. It will help get the job done; it won't be scriptable, but it will help you to think in a programmable way about cleaning up datasets.
Don't use Excel. I find it fun to clean up data actually. It is a little challenge.
ggplot2 isn't everything. I use it, BUT you should learn the other tools too. Learn base plotting first. Find a good book and read it and you'll be happy you did. I suggest R for Everyone.
there is a cheatsheet:
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
If you're after a GUI tool, go use GraphPad Prism. The whole point of ggplot2 is to program. Also, you can create a list object with aesthetic arguments that you can import and apply to every graph if you want them to look consistent.
No need for a loop here. dplyr makes things like this easy once you know how to use it.
library(dplyr)
df %>%
  filter(set != 'NA') %>%   # get rid of NA sets
  group_by(set) %>%         # dplyr will group the sets for you
  summarise(meanER = mean(PowerER, na.rm = T),
            meanEL = mean(PowerEL, na.rm = T))
I didn't test this, but it's at least pretty close.
I have R and RStudio set up on a private server on DigitalOcean. I'm sure you can do it with Amazon Web Services too, but if you are not wed to that, here is a tutorial. It costs as little as $10 a month.
Sounds like you want a Shiny app. Check out Shiny Server Pro for handling many concurrent users: https://www.rstudio.com/products/shiny-server-pro/. I don't think you'd need a paid version of R itself, though.
First: cool, enjoy R. Then you should try dplyr; it's very straightforward for what you want.
library(dplyr)
merge.data <- inner_join(ds1, ds2, by = "codeid")
Nice cheatsheet https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Avoid using absolute paths. User-specified input files should always be passed as arguments to functions, never hardcoded; files provided by your package should be located in the inst/extdata/ subdirectory and accessed via system.file("extdata", "filename.ext", package = "myPackage") (note that the contents of inst/ go one level up when the package is installed).
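For instance, inside your package you might wrap that lookup in a small helper (just a sketch; "myPackage" and "mydata.csv" are placeholders):

read_example_data <- function() {
  path <- system.file("extdata", "mydata.csv", package = "myPackage")
  if (path == "") stop("mydata.csv not found in myPackage")   # system.file() returns "" if the file is missing
  read.csv(path, stringsAsFactors = FALSE)
}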
Specify all dependencies in the DESCRIPTION file. This will allow users to install them with devtools::install_deps().
Consider creating a git repository for your project. You'll be able to keep track of all changes.
Documenting your functions with Roxygen will take you a few minutes, but may save your users and future developers hours.
This is d3.js. Quite possibly done by the creator of d3, as he at least used to work for NYT. It is also possible that the data manipulation was done in R. There are a few ways of presenting R work in d3.js. Most commonly through Plotly. Although that will never compare to working with d3 natively.
/u/zdk and /u/BeerSharkBot covered how to solve your particular problem, but for general data cleaning and manipulation work involving character strings I've found the stringr package extremely useful. The "Working With Strings" cheat sheet from RStudio is a handy two-page pdf to use as a quick resource. On the first page it lists a quick summary of the useful functions in the package, and on the other it gives a nice breakdown of regular expressions with examples. The package page also has two vignettes (linked on the page): "Introduction to stringr" and "Regular expressions". Those will go into more detail than the cheat sheet does if you want to learn more.
You can insert LaTeX right in the R Markdown document; knitr will ignore it. Pandoc will pass LaTeX commands straight through as it converts the markdown document to a tex document. So you can do something like this:
---
the yaml header
---

# Header 1

\emph{This text is emphasized} And this text isn't
And that will produce emphasized (usually italic) text in the pdf document output.
To get two columns, you can just add twocolumn to the document class options. However, if you want that stuff at the top and bottom that spans the columns, it gets trickier. I think your best bet there is the multicol package; here is an explanation: https://www.sharelatex.com/learn/Multiple_columns
EDIT: In case it isn't clear, the \emph{} command is a LaTeX macro (command).
A few comments:
1) Your function is trying to do too much. I count four things: constructing the tree, finding the prune location, pruning, and saving the output.
I'd suggest expanding that into 3-4 clearer functions. You can imagine something like this:
treemaker <- function(..., saved = TRUE) {
  tree <- construct_tree(...)
  prune_location <- find_prune_location(...)
  pruned_tree <- prune(tree, prune_location)
  if (saved) {
    save_output(pruned_tree)
  }
  return(pruned_tree)
}
The intent and purpose of the function should be obvious at a glance. It also helps with debugging.
2) You do some unnecessary transforms (e.g. as.numeric, as.data.frame). This isn't the end of the world (unless you want to run it millions of times, then it'll add considerable overhead) but it's a code smell.
3) A number of things are hard-coded in the function, e.g. the output directory name. You can make these optional arguments with a default that a user can override.
4) You don't need to wrap cat in print.
5) Note that sprintf can be convenient for constructing filenames.
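For example (the directory and variable names here are made up):

out_dir   <- "output"
tree_name <- "tree_A"
iteration <- 3
filename  <- sprintf("%s/%s_iter%02d.rds", out_dir, tree_name, iteration)
filename   # "output/tree_A_iter03.rds"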
You might be interested in reading The Pragmatic Programmer, Clean Code, and Code Complete 2 (in that order).
I use both so I don't think I'm (too) biased. They are both good but for different things. R markdown is useful when there's a lot of code involved, but for something which is purely symbolic with no R code involved, using RStudio is just clunky.
> And besides that, I like R. I aim to use practising R as a reason to crash course calc, otherwise I’ll likely have no motivation to do it.
The thing is, R is simply not useful for your calc, as others have pointed out, beyond stuff like graphing (which you can do much more easily by, say, using Desmos). If you want to do numerical differentiation and integration, which is where it would make more sense to involve R, you should learn numerical analysis after doing calculus and linear algebra.
don't worry, you're not at fault; the basic plotting function of R is very bad. the problem is a common one, and happens because the canvas for plotting is too small for that many points. you'd think they'd have fixed it, but they didn't, because most people who plot that many points use an R package called ggplot2.
here's how you'd do it:
test <- read.csv(file="Jan_22_2015_17.40.03_0",head=FALSE,sep=",")
install.packages("ggplot2")
ggplot(test, aes(x = V1, y = V2)) + geom_line()   # with head=FALSE the columns are named V1, V2, ...
quick primer: ggplot works in layers, like photoshop. the first term above, ggplot(test, aes(x = V1, y = V2)), defines what your dataframe and your x and y variables are. then you add the line graph layer with geom_line(), and if you wanted you could add a scatter/point plot too by doing this: ggplot(df, aes(x = x, y = y)) + geom_line() + geom_point(). reference: http://docs.ggplot2.org/current/geom_line.html good luck. btw use stackoverflow if you have other questions, it's very good, e.g. https://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points
? means 0 or 1. Said differently: once or none at all. But *? means something slightly different in regex.
Use a regex builder like https://regexr.com/ (also checkout the cheat sheet) to see the difference visualised.
Here is a Stack Overflow answer on *?:
https://stackoverflow.com/questions/3075130/what-is-the-difference-between-and-regular-expressions
*? is what they call a lazy quantifier. It will match as few characters as possible. E.g., searching in 101000000000100, 1.*1 will match 1010000000001 while 1.*?1 will match 101.
Also, the use and meaning of ? varies. E.g.:
(?:abc) non-capturing group
(?=abc) positive lookahead
(?!abc) negative lookahead
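In R specifically, you can see the greedy vs. lazy difference with something like this (perl = TRUE switches on the PCRE engine, which also handles the lookarounds above):

x <- "101000000000100"
regmatches(x, regexpr("1.*1",  x, perl = TRUE))   # greedy: "1010000000001"
regmatches(x, regexpr("1.*?1", x, perl = TRUE))   # lazy:   "101"
grepl("foo(?!bar)", "foobaz", perl = TRUE)        # TRUE: "foo" not followed by "bar"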
If you want to do this on Ubuntu, you "just" need to open the terminal app. Open the terminal, then paste the commands:
sudo apt update
sudo apt install r-base
you will be asked to provide your password.
Those commands ask the Ubuntu repositories (the Ubuntu app store, roughly) to refresh the package lists and then install base R.
You could also just open the ubuntu app store via the graphical interface and search for r-base
But this is only the first step. I assume that you would like a graphical interface to run your scripts. The most common one is RStudio, which you can also install on Windows. In your situation you can do the following; still in the terminal app, you paste:
sudo apt install wget
wget https://download1.rstudio.org/desktop/bionic/amd64/rstudio-1.4.1106-amd64.deb
sudo apt install ./rstudio-1.4.1106-amd64.deb
rm rstudio-1.4.1106-amd64.deb
The reason you have to do this is that RStudio is not available in the Ubuntu store. What happens is: wget downloads the RStudio installer (a .deb file), apt installs it, and the rm at the end removes the installer file you no longer need.
With the RStudio 1.2 preview release, you can finally add new color schemes without having to edit the horribly-named CSS files.
You can use an online theme editor like this one, modify until you're happy, then add the theme file through the RStudio global options. If you're using RStudio server, you'll need to upload it to the host, first. I also had to add an extra blank line at the bottom of the .tmtheme file to make it happy.
I could be mistaken, but this might be what you're trying to get at
library("dplyr")
df.join <- left_join(df1, df2, by = "Code")
A loop might be overthinking it.
Also, to take dplyr a step further, you might try using pipes like another commentator suggested.
library("dplyr")
df.join <- df1 %>% left_join(df2, by = "Code")
With pipes, you can chain a bunch of commands together in a series of easy to follow steps that will lead to a result. It helps to eliminate intermediary steps that clog up computer resources and are hard to mentally keep track of.
PS: Far be it from me to comment on your code, but using a mix of caps and lower case and also spaces could cause you to tear your hair out. Speaking from personal experience.
PPS: I heart this: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
> While shiny is great, we don't want to rely on its hosting service for a number of reasons and want to keep everything coded in a webpage.
You don't have to rely on shiny hosting, you can download the Open Source Shiny Server and host it yourself. Would probably help to have the assistance of someone familiar with GNU/Linux systems and setting such things up if you're not familiar with them.
It might not be as common with R, but often people write boilerplate code, which is just a kickoff to structuring a codebase. Often boilerplates will include some of the setup for interconnectivity between frameworks as well, such as these for Node, Express, and beyond.
The readme could better convey what the code is, but what it is appears to be boilerplate. The best boilerplate I can think of for R is golem's: creating a package for a Shiny app using golem generates a bunch of files and directories. I think drake also generates boilerplate.
a) literally google "tutorial on regular expressions". (http://en.wikipedia.org/wiki/Regular_expression (they are actually called 'regular expressions'... Just like R, but much easier to search for))
b) in R, you can get help on it, but it is more refresher than tutorial. ?gsub, ?regex
c) practice on a site like: http://regexpal.com/ which lets you see what you are matching live.
Regular expressions will make working with text much much easier.
In this case, it would be newstring = gsub(" ", "", oldstring) to replace all spaces in oldstring with nothing. Replacing with %20 (the URL way of encoding a space) might be more helpful to read.csv, but as cruyf8 notes, RCurl is the package for reading things off the internet.
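As a small sketch (the URL is made up), either a manual gsub or base R's URLencode() will handle the spaces:

url <- "http://example.com/my data file.csv"
gsub(" ", "%20", url)   # "http://example.com/my%20data%20file.csv"
URLencode(url)          # base R helper that does the same percent-encoding
# dat <- read.csv(URLencode(url))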
well, i can't really show you mine, because it's confidential work stuff, but i am more than happy to walk you through the process.
the tricky part is knowing that a packaged* tableau workbook is just a zip file with .zip changed to .twbx. if you rename it to .zip and open it up, you'll see that it's a workbook and a folder containing a datasource, in my case, a csv i have r make.
if you know how to use shiny, then it's not that difficult to make a form that takes in parameters (i'm not that good with shiny, but everything you need to know is on the shiny tutorial page), and you use those parameters to run an sql query and put the result into a dataframe.
from there it's just file manipulation. if you have that packaged workbook (run the query once with whatever parameters you want, so you can design a presentation; then tableau knows the columns/names of the data, but the data will change depending on the parameters you passed it), then you have the workbook, so just put it in a location that r has access to, and then have r recreate that zip file (i.e. replace the existing data source in the zip with your newly created data source and zip it back up). you can make another page for a downloader that will export that workbook and make it available as a download, or you can have r email it with a user supplied address (another shiny option). or, if you have a presentation on a tableau server, and you can set the location of the data source that tableau uses, just have r update that file with the new data file.
does that help, or should i go into more depth?
edit: correction
You're right to store data in the long format for plotting in ggplot2, and you're on the right track with your ggplot code. In regards to A, you can use the color and shape aesthetics in the geom_line call to change the color and shape of the lines, respectively.
In regards to B, try data$week <- factor(data$week, levels = paste0("W", 1:12), ordered = TRUE) to change your week variable to an ordered factor so that it will plot correctly. Alternatively, if you don't want the "W" in your x-axis labels and would prefer just to have the numbers, data$week <- as.numeric(gsub("W", "", data$week)) will remove the W and convert week to numeric, which will plot in order as well.
There are lots of places online to find help on ggplot2, but you're on the right track - this cheat sheet might be good for quick reference, and you can usually find questions similar to your own on stack overflow.
Check out the “combine datasets” section of this data wrangling cheat sheet. This should provide you with a way to join the data depending on how you’d like the end state to look.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Ps. I always keep printed versions of the cheat sheets on my desk and it’s saved me a LOT of time
I'm a huge fan of Hadley Wickham's tidyverse family of packages. It provides a really nice and consistent way to work with dataframes that is (I think) much more straightforward and easily scalable than the base R methodologies.
He's got a great book that is available online for free which covers everything really well. It can be found at http://r4ds.had.co.nz
In this case the function would be mutate, which will create a new column from other ones.
The cheat sheets at rstudio are also an amazing resource, like this one for this kind of thing. https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
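As a quick, generic sketch of mutate (the column names are made up; substitute your own):

library(dplyr)
df <- data.frame(height_cm = c(170, 182, 165), weight_kg = c(65, 80, 58))
df %>% mutate(bmi = weight_kg / (height_cm / 100)^2)   # new column built from existing ones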
Apt-get installs an old version of R, and I can't get packages like dplyr to install as a result.
Going to try this tutorial to get latest version of R installed: https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2
Do you think this class would be much better/worse than the Johns Hopkins Data Science Specialization you can do on Coursera? That one is free as well.
Side note, you might want to link to the class.
Hi, there! Just as a note, here and on StackOverflow, it's much easier to get/give help if the person doing the asking adds a reproducible example that gets at their Q (aka, a "minimal, complete, and verifiable example").
See: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
Is there a key that is connecting all the observations somehow? To do something like this, I generally use joins from dplyr and subset with brackets if nec. Do you use dplyr at all? There's a great cheatsheet online: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
edit: Actually, I might be misreading you. Are you trying to add more observations from the other data sets to the master only? So the variables of interest exist in all data sets?
This is a very thorough write-up; but why not just run on something like GNURoot Debian? Installing R (and any other deb software) is already well-documented, and that app doesn't actually require root access, as the name implies.
Edit: I'd also add that compiling the tidyverse needs at least 1GB of spare RAM, so less-equipped phones may or may not be able to accomplish this by any means.
FYI, for a reproducible example please include all code that will make your code run. diamonds is not a dataset in base R, so start your code with library(ggplot2). Second, what do you want your output to be? It will help you get the help that you need. Read this for guidance: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example.
Ever hear of Reproducible Research? You should run your script one line at a time and see where it is actually breaking. Also, do you have your original copy? Perhaps you are working with an old version of your source. Did you use GitHub to save your changes?
You can change the font and make a dyslexia-friendly RStudio theme. I didn't see anything on Google yet, so you'd have to make one yourself unfortunately. Maybe try messaging this guy.
VS code is an awesome alternative and somebody has already made a dyslexia theme
I have taken many Coursera courses, and loved them all. This R programming course is part of a course series in Data Science from Johns Hopkins, if you are interested: https://www.coursera.org/specialization/jhudatascience/1/overview.
Answering this sort of question is best served by first searching on StackOverflow and then posting a question there if you can't find one.
StackOverflow appears to see far more traffic for R than this subreddit* (which recommends StackOverflow for posting questions, but doesn't prohibit asking them here).
As an aside heres an interesting paper comparing R-help and StackOverflow for R help.
I'm also learning R and thought it would be fun to try and solve your problem. I came up with the following function.
I'm sure seasoned R users could manage it more elegantly though.
Use geom_smooth() instead.
This should help:
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
It seems like your question about how apply() is working was already answered in the comments, but I thought this code might be 'more straightforward' for what you're trying to accomplish.
R1Data<-myDF[myDF[3]!=0,]
Also, if you're wanting to learn dplyr/plyr/tidyr syntax, R studio has a list of common 'cheat sheets'. There's also this one which doesn't appear to be on their list anymore, but covers common functions for managing/transforming data!
tidyr is actually pretty similar to reshape2, but it's been designed with use of pipes in mind (i.e. like the plyr to dplyr conversion).
If you know reshape2, you'll be able to switch to using tidyr just fine; the biggest difference is going to be specifying "key" and "value" instead of "variable.name" and "value.name" inside the function. It took me about 15 minutes to work out the differences once I'd figured out how to use dplyr and pipes. Now I don't think I'd ever go back.
Check out this blog entry and this cheat sheet for a pretty easy introduction. At least for me, it was harder to grasp the concept of "wide form" and "long form" (when learning reshape2) and much easier to make the transition from reshape2 to tidyr, as the concept stays the same and the only real change is syntax.
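To illustrate the key/value naming difference with a toy data frame (all the names here are made up):

library(reshape2)
library(tidyr)
wide <- data.frame(id = 1:3, jan = c(10, 12, 9), feb = c(11, 14, 8))
melt(wide, id.vars = "id", variable.name = "month", value.name = "sales")   # reshape2
gather(wide, key = "month", value = "sales", -id)                           # tidyr equivalent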
It might not be 100% corresponding to your situation, but this document on R and FDA compliance is where you want to look, and most likely reference in your business proposal. It's the official position of the R Core team on how to handle the open-source aspect of R in a regulated environment.
the right place to post your question is on the shiny google group. the rstudio guys stay on top of it.
https://groups.google.com/forum/?utm_source=digest&utm_medium=email/#!forum/shiny-discuss
How's this? Note that I didn't do the subtotals piece, as all you have to do there is repeat the code in another dataframe and then bind together. I find subtotals confusing in an observational data.frame.
iris_summary <- iris
iris_summary <- group_by(iris_summary, Species, Sepal.Length, Sepal.Width) %>%
  summarise(Length = sum(Petal.Length), Obs = length(Petal.Width))
EDIT: Looks like the subtotals issue is somewhat common (I wondered if there was a particularly elegant way to do it, since I've never tried). This stackoverflow string would probably help: https://stackoverflow.com/questions/31164350/dplyr-summarize-with-subtotals
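If it helps, here's a rough sketch of the "repeat and bind" idea (summarise again at the Species level for the subtotal rows, then stack the two results); I haven't checked it against your exact output shape:

library(dplyr)
detail <- iris %>%
  group_by(Species, Sepal.Length, Sepal.Width) %>%
  summarise(Length = sum(Petal.Length), Obs = n())
subtotals <- iris %>%
  group_by(Species) %>%
  summarise(Length = sum(Petal.Length), Obs = n())
bind_rows(detail, subtotals) %>% arrange(Species)   # subtotal rows have NA in the Sepal columns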
> you're mentioning probability density distribution and probability density function. Are those different things?
I was talking about the same thing... Just googled "density distribution" and got this answer:
"The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one. The terms "probability distribution function" and "probability function" have also sometimes been used to denote the probability density function."
Realized I mixed two different terms for the same thing..
> The problem I'm having at the moment is this: certain areas of my heatmap imply a "high likelihood" of a new value being in said areas. However, I currently cannot tell how high.
Yeah that's right. There is no probability for point estimates - they are all 0. So probability to get x=1 and y=2.3 is 0. In order to tell the probability I think you would need to do a double integral on x and y. Something like "integrate from x1 to x2 and y1 to y2". Not sure how to do it in R, but found this answer on SO:
https://stackoverflow.com/questions/8913603/calculating-double-integrals-in-r-quickly
Not really sure what your goal is, but maybe this will be helpful.
Rainbow parentheses might be something you want to activate: https://www.rstudio.com/blog/rstudio-1-4-preview-rainbow-parentheses/
Seems like a silly thing, but it helps a lot to ensure everything is properly enclosed. For me it works better over a dark theme; selecting a good theme, I'm sure, will help you a lot too.
RStudio also has autocomplete functionality, which will surely help avoid typos.
I have a few general suggestions that could help streamline the code.
Firstly, the objects S1-S5 are essentially the same, creating a lot of needless code duplication and complexity. Combine these objects into a single object with an additional factor column identifying the weather station (i.e., the weather station ID). Sorting the rows by time as suggested by /u/mergs would also help.
Second, many of your mixed data structures are using data frames instead of data tables. Data tables from the data.table package behave just like data frames but offer better performance, particularly in subsetting operations.
Third, for-loops in R are fine for small datasets where performance and efficiency are unimportant, but the limitations of loops become evident as the datasets become larger. If you want to apply a function over every element of some object (hint: your consolidated weather station data), use the apply() family of functions. There are many variants of apply() and beginners can often have a hard time distinguishing which function to use, but these resources are helpful (I often refer to them myself):
http://www.r-bloggers.com/using-apply-sapply-lapply-in-r/
https://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
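As a tiny sketch of the idea against a consolidated station table (the station_id and temperature column names are assumptions about your data):

stations <- data.frame(station_id  = rep(c("S1", "S2", "S3"), each = 4),
                       temperature = rnorm(12, mean = 15))
sapply(split(stations$temperature, stations$station_id), mean)     # one mean per station, simplified to a vector
lapply(split(stations$temperature, stations$station_id), summary)  # same idea, but returns a list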
If you just need to plot points in 3D space, might I suggest plotly as an alternative? The interactive part makes it nice to navigate 3D data.
sample code: (it is typical to have the response z as a function of x and y)
library(plotly)
plot_ly(data = mtcars, x = ~hp, y = ~drat, z = ~am, type = 'scatter3d', mode = 'markers')
Now, to plot the result of a 2-variable regression, that's more of a plane/surface. The really simple way is just to plot a grid of points according to your equation. Example:
x = 0:10
y = 20:30
plane = expand.grid(x, y)   # generate all combinations of x and y
plane$z = 3.2 + 0.8*plane$Var1 + 0.6*plane$Var2
plot_ly(data=plane, x=~Var1, y=~Var2, z=~z, type='scatter3d', mode='markers')
There are plenty of tweaks and options available if you had a more specific result in mind, but I can't give a much better answer unless you explain more about what you need.
You might find what you're looking for with "Open Street Maps". It has an API that might work.
https://www.openstreetmap.org/#map=5/51.500/-0.100
Edit: sorry... I see you already have the map data.
Hi, I created editR. I'm glad it is useful for other people.
To answer your question, yes, pandoc is a requirement. I guess I should have mentioned it in the installation instructions (will fix this as soon as I can). I use the 'render' function from the rmarkdown package to render the .Rmd documents and it relies on pandoc. The preview is rendered with the 'knit2html' function from the knitr package. It's faster than calling pandoc each time, but it has more basic functionality and less versatility than pandoc when it comes to rendering the formatted document.
Pandoc is fairly easy to install (http://johnmacfarlane.net/pandoc/installing.html). If you have a Mac I'd recommend installing it via Homebrew or MacPorts. Also installing pandoc-citeproc comes in handy if you want to include bibliographical references.
Try using dplyr to sort quickly:
library(readr)
library(dplyr)
data <- read_table("~/tmpdata.txt")
data$removal_N <- as.numeric(gsub("\n\\d$", "", data$removal_N))   # note the escaped \\d so the regex parses
data %>% arrange(r_NH4, r_NO3)
This still won't sort by both dimensions simultaneously, since you don't have nice round values.
Your best bet for plotting a surface is to interpolate over a grid of the x and y values (which enables you to sort) and then plot the interpolated surface.
According to this, you can use the akima package to interpolate (though other interpolation options are available in the fields package, among others).
I used the code from the StackOverflow post to interpolate (for simplicity) since I only have your sample data...
library(akima)
s <- with(data, interp(r_NH4, r_NO3, removal_N))
Once that's in there, it's fairly easy to plot with persp3D():
library(plot3D)
persp3D(z = s$z, x = s$x, y = s$y)
I have no idea if this is what it's supposed to look like, not knowing your data... but here's the output :)
According to this:
http://spark.apache.org/docs/latest/hardware-provisioning.html
Spark can run in 8GB. But yes, some of the lighter weight alternatives mentioned by others are probably a better bet.
Thanks! They render each time. How long does one of your graphs take to render when you run it manually? Would it be difficult to make a list that has a large part of the processing already done? I've helped a friend do choropleth maps of US county data in Shiny and it only took about 10 seconds to render a new map. You can see some examples on showmeshiny that are doing some pretty powerful stuff.
When displaying that many graphs I would worry more about usability. How many people are going to find 300 graphs useful? If you explore around there may be better ways to display the data.
If you want to go that route rstudio has an article on rendering images in shiny. It's certainly doable.
Yes. I asked in the RStudio support forum and they suggested to look into the Authentication feature that ShinyApps have. Very cool. I think I will propose this to my clients to see how it goes.
I have not done it with R, but your problem with messy data reminds me of Google Refine, now called OpenRefine. It's made to deal with messy data.
EDIT: similar problem in https://www.r-bloggers.com/merging-data-sets-based-on-partially-matched-data-elements/
Build an interactive visualization of some kind with shiny and plotly. Fairly low level of effort, very impressive, and your professor may not even be aware of either technology.
I'll let u/cDidsM answer about their code, but here's a good resource for this stuff: Data Wrangling Cheatsheet.
Funny, my first code chunk was incorrect per your original question, but then turned out to be what you actually wanted!
I learned ggplot2 with a Coursera course. They went in depth on many of the rendering techniques. I don't remember most of what I learned; I just refer to the ggplot2 cheat sheet and/or I google a description of the plot I need, and it hasn't ever failed me.
Well, the RStudio Server site is pretty easy to follow if you have Debian or Ubuntu as your OS. If not, you could run a VM in Windows or Mac and set up Linux there.
R Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/
R for Data Science: http://r4ds.had.co.nz/
Not sure what applications you will be using, but the above URLs should be able to provide you with a basic understanding of R and, with the end-of-chapter questions, some practice as well.
Once you're in the job ask your team members what packages they use regularly or better yet call ahead and ask and try to familiarize yourself with those in addition to the material above.
Good Luck!
Dude, you got this. If you're dedicated, it should take about a week or so.
The best way to learn is by doing. I'd (1) find datasets (many are pre-loaded into R already), (2) manipulate/clean the data by merging, sub-setting, and collapsing, maybe with dplyr or tidyr, (3) do summary stats and tables, (4) run hypothesis tests like t-tests and chi-square, (5) do modeling (OLS, logistic, simple stuff), (6) use the broom package to do more with the model output, and (7) visualize that data; try out ggplot2.
Check out "cheatsheets" https://www.rstudio.com/resources/cheatsheets/ to get to the point, and look into UCLA's stats website (it's kind of old..) for code snippets, output, and interpretation: https://stats.idre.ucla.edu/other/dae/
eta: I'm assuming you have a stats background already. If not, then I'd focus on #1, #2, and #7 above.
I haven't been to any, but the videos I watched were really impressive. If you can afford the trip, it's worth it. See the videos here: <https://www.rstudio.com/resources/webinars/> There is an upcoming conference next year: <https://www.rstudio.com/conference/>
Look at tidyr::spread and gather. You can spread by value1 and set a default value.
I'm a fan of the cheat sheet:
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
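A tiny sketch of spread with a fill default (toy column names; yours will differ):

library(tidyr)
long <- data.frame(id      = c(1, 1, 2),
                   measure = c("value1", "value2", "value1"),
                   val     = c(10, 20, 30))
spread(long, key = measure, value = val, fill = 0)   # missing id/measure combinations get 0 instead of NA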
is R the right solution for this? If you don't need to use R for anything R-specific, maybe just stick with existing tech (vis-a-vis google charts etc).
Also, you don't need a hosting service for Shiny if you run a Shiny Server: https://www.rstudio.com/products/shiny/shiny-server/ Unless you meant service in the IT sense, not in the "hey I'm offering a service" sense, of course.
Other than this, I guess you could do a bunch of hackish things - eg java applets that hook into rJava ala http://stackoverflow.com/questions/2349820/using-r-in-processing-through-rjava-jri Not sure about this but you might be able to hook it into JavaFX etc (a bunch of pretty charts there)
Meh I guess ...
> Is it possible [without requiring] a full install of R?
No. This is fundamentally impossible: each R script needs R to be executed. You could bundle an executable that comes with its own R, but I'm unaware of a ready-made tool that would bundle such an executable for you. At any rate I'd generally advise against it, since it would lead to unnecessary bloat in the executable size.
That said, an example application that does exactly that is SeqPlots. Unfortunately the public code base doesn’t seem to contain information about how the bundle (containing the code and the R executable) was created. But, in a nutshell, it uses Shiny and Electron.
I second the use of version control. It is extremely helpful with all larger projects. If you prefer a user interface for things, I would suggest GitKraken. It is super helpful and allows for lots of expansion when you become more comfortable with it.
And with a GitHub page, you can continue to update a README document so that onboarding is more straightforward and streamlined for new folks that you bring on. Therefore everything is contained on a single site!
I keep all of my source files on my own server using GitLab, but a lot of people use GitHub. This takes care of syncing across devices and keeping old versions of code around in case I need it. Knowing how to use git is an essential skill if you ever want a job coding in industry. In my experience it's not common for R users to use this kind of source control but I think it's a good idea.
When I publish a paper I usually just release the R source files on GitHub. There's one file that runs the entire analysis and then extra files for various functions that the main file will use. I only write it up into a package if the paper proposes a new methodology.
I always use Roxygen to document my functions regardless of whether I'm making a package or not.
I don't think the issue is necessarily just the proprietary nature of Excel (after all, LibreOffice's Calc is just one of a number of free workalikes that should read and write Excel spreadsheets without difficulty).
I think it's a combination of a number of factors. For example, one issue relates to using Excel for data entry when it might interpret an entry as a date - something that's easy to miss, and even if you notice, sometimes hard to correct. There can sometimes be some issues with data going from Excel to csv format and vice-versa - some Excel versions have issues with loss of accuracy for example; (now try reporting an Office bug to Microsoft...)
I'm new to R, so I don't know much about data manipulation there. Stata has the wide and long commands to help you with that. I managed to do this in Excel though; this is the file. If you need more tables, tell me; it's a very simple method.
See this question on SO: https://stackoverflow.com/questions/37713351/formatting-ggplot2-axis-labels-with-commas-and-k-mm-if-i-already-have-a-y-sc
You have to dig into the comments to get to the part about having it scale (calling unit_format, I think).
Maybe this one? That also includes recommendations to set comment.char = "" and define colClasses; I can't find the original blog where I read that, but it has been useful in my case. Here are a few more.
OK, I fiddled with this to no success. Here is my code script with what I attempted (it uses the vtsummer2 data set above; the first line of the code will not read it correctly, and the filepath needs to be changed to fit the variable designation): http://www.filedropper.com/multiplecorrelations
Some of my fails seemed closer than others, and I am guessing it is a syntax thing. Am I at least working toward the right idea?
Personally, I use Github once I've decided that a program is more than a fleeting exercise in programming. I also use Dropbox and Google Drive, but only with code that is less important to me. In particular, because of how the sync app works with Dropbox, they've had lost user files. While it's a rare occurrence, my code is the one place where I don't dare risk that sort of loss. Additionally, I do inline backups with traditional backup software (stored locally) for additional security.
With GitHub, the version control that is provided not only protects from such losses, but it also allows me to undo mistakes in my programs if I've accidentally saved good code after bad. Add in the code sharing features and its design specifically for managing code, and it's hard to go wrong with it for your programming adventures.
In certain cases, a Markov probabilistic model can be solved exactly and quickly. In other cases, solving a Markov probabilistic model exactly requires an exponential number of steps, and Belief Propagation is needed to solve it approximately.
Look at the slides on Discrete Markov Random Field.
https://www.slideshare.net/SingKuangTan/brief-np-vspexplain-249524831
What are you looking for, theory or applications? A good applied book is this one:
An oft-cited theory book is this one:
As far as tidyverse versus base R goes, I strongly recommend learning the tidyverse family of libraries. ggplot2 is far better than base R for aesthetic, publication-quality data visualization. dplyr, readr, and tidyr together form one of the best libraries available for manipulating data (agnostic of language). I built an entire R package around a few core functions from those libraries. To get the most from those libraries, you'll want to learn how to use the pipe operator to write code inspired by a functional programming paradigm.
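For a flavor of what that pipe-based style looks like (just a generic sketch on a built-in dataset):

library(dplyr)
mtcars %>%
  filter(cyl != 8) %>%                      # keep the 4- and 6-cylinder cars
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))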
That's perplexing to me, but you should crosspost this to the ggplot2 Google group if you get no answer here. Hadley Wickham hangs around there and occasionally answers questions.
Well, judging from this overview: https://haveibeenpwned.com/PwnedWebsites there's not a Liverpool-specific leak. Also, by sheer numbers I'd go out on a limb and say that a regional distortion by location is unlikely. However, an overview of "targeted countries" within the leaks would be awesome, yet my quick googling didn't bring one up.
Thanks for the recommendation, it was the formatting. Couldn't figure out exactly how to work it with `httr` (due to my lack of experience with it, most likely), but ended up using [Postman](https://www.postman.com/) to generate the code and running it inside the `system()` function.
I'm surprised you can't open it using a text editor. What happens when you try? If R can ingest it, then a text editor and SAS should too. Plain old Windows Notepad may choke if it's a big file. I expect Notepad++ could handle it. Or maybe MS Word could open it.
You can tell SAS support that RRF for UMLS data is not a special format. It's literally pipe-delimited text files. You could rename MRCONSO.RRF to concept_names.txt and R or SAS should ingest it just the same. If you tried something like the following and it failed, then I don't know; it's been a while since I've used SAS.
proc import datafile="C:\directory\BLAHBLAH.RRF"
    out=rrfdata dbms=dlm replace;
  delimiter='|';
  getnames=yes;
run;
I think first things come first. The first thing you need to know is the different kinds of objects, how to assign values to them, how they interact, and how to complete simple computational tasks with them. It took me a bit to even get a grasp on vectors.
You could start off with simple code consisting of how to use R as a calculator. Then you could go into assigning single variables with values. Then you can go into vectors, and have the students create two vectors and subsequently join them into a dataframe. Then learn how to use the $ character to work with the columns inside the data frame, make a scatterplot of the two vectors, etc. Then move into subsetting, etc.
When we master something, we tend to forget which basic, elemental things our instructees are going to not understand or have trouble with. In R, the first few steps are just understanding the environment and how to create objects and use them. I wouldn't necessarily rush this, and I think the subject of packages can naturally arise at the moment where you say, "OK, we've learned how to do all of these tasks, but what happens when we want more options for making graphs or manipulating data frames? Here's how you increase R's functionality with additional packages..."
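If it's useful, a first lesson along those lines might look something like this (purely illustrative):

x <- c(1, 2, 3, 4, 5)            # assign a vector
y <- c(2, 4, 6, 8, 10)           # a second vector
df <- data.frame(x = x, y = y)   # join them into a data frame
df$y                             # use $ to work with a column
plot(df$x, df$y)                 # scatterplot of the two vectors
df[df$x > 2, ]                   # simple subsetting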
I took the Coursera courses on R and found them to go at a pretty good pace. You might want to browse them, for example R Programming, especially since it comes at it from a programming perspective.
I agree that RStudio is great for R-related stuff (R, Rmd, Rnw, Rcpp, etc.), and it should be great to keep users. If your workflow needs other stuff that VS Code does better, it's worth trying. VS Code has great out-of-the-box functionality and extensibility. You may be able to find or write VS Code extensions to imitate RStudio functionality, but the chance that you will miss some of RStudio's functionality is very high. If you are interested in other editors/IDEs, check out other alternatives such as Vim and Emacs. If you're concerned about editor popularity, a talk from EmacsConf would be helpful.
Check out the Tidyverse cheat sheets for dplyr and tidyr, they have been a life saver to just have on hand when I am working through data like this :)
https://www.rstudio.com/resources/cheatsheets/
Updating my post because I got very lucky I think and found something that worked for getting RODBC installed. I'll update so people can search.
I went to the terminal and tried:
sudo apt-get install libiodbc-dev
which did not work! So I tried
sudo apt-get install unixodbc-dev
which worked. Then I went to R and tried again
install.packages('RODBC')
as you'd expect and no problems installing.
Now the trick is to get it working.
Source: http://superuser.com/questions/283272/problem-with-rodbc-installation-in-ubuntu
As general advice, avoid using loops in R where performance is important. The apply() family of functions offers far superior performance by avoiding in-memory object duplication and by using vectorised operations.
Secondly, consider using rbindlist() instead of rbind().
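For instance, a common pattern is to build the pieces in a list and combine them once at the end, rather than growing an object with rbind() inside a loop (sketch only, with made-up data):

library(data.table)
pieces <- lapply(1:100, function(i) data.frame(id = i, value = rnorm(10)))
combined <- rbindlist(pieces)   # one fast bind instead of 100 incremental rbind() copies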
I'm not all that familiar with plotly (I mainly use it for visualizing maps or with the ggplotly function), but from what I can tell the syntax is pretty similar to ggplot2 in that you use add_trace to add "layers" to a plot.
If you want to define the error bars directly, you'll need your data is this sort of format:
df <- data.frame(x = 1:10,
y = 1:10,
ymin = (1:10) - runif(10),
ymax = (1:10) + runif(10),
xmin = (1:10) - runif(10),
xmax = (1:10) + runif(10))
From: https://plot.ly/ggplot2/geom_errorbar/
Then rather than x and y you'd have CGA, timepoint, and mean.value. Then do something like their example:
ggplot(data = df, aes(x = CGA,y = mean.value, color=timepoint)) + geom_point() + geom_errorbar(aes(ymin = ymin,ymax = ymax))
I mean I know read.csv and write.csv or saveRDS type stuff but this is like taking input from the hacker-rank.
It won't even necessarily be a nice CSV table that can be parsed. I'm concerned I won't be able to answer the question because I can't figure out how to get it into a usual data frame.
Example:
https://www.hackerrank.com/challenges/s10-multiple-linear-regression/problem
You can see how that input is not formatted in the usual way with differing number of lines.
The docs seem to assume knowledge of what stdin is and I don’t know. I know stats and ML in R pretty well but not this stuff
The code also has to generalize to whatever arbitrary test cases with the specified file format. It seems like this is testing CS rather than stats and ML almost.
Codecademy is a quick and easy (and free) way to get started with Python.
I agree with other posts here that Python is more general-purpose than R but that it's probably easier to get the standard statistical procedures up and running in R if you don't have any programming experience.
Why not do something like:
library(tidyverse)
library(rvest)
url <- "https://www.trustpilot.com/review/www.etsy.com?page="
get_reviews <- function(page, base_url){
  scrapingurl <- paste0(base_url, page)
  scrapingurl %>%
    read_html() %>%
    html_nodes(".review-content__text") %>%
    html_text()
}
map(1:10, get_reviews, base_url = url) %>% unlist()
If your code can't be parallelized (no matter the reason), then you want a machine with good single-threaded efficiency and just enough RAM that solves your problem. Look at https://aws.amazon.com/ec2/instance-types/z1d/
Right, so you are looking at two problems.
Web scraping, which is concerned with downloading and parsing HTML files (or XML, I believe), and web crawling.
Web crawling involves writing rules to follow certain links and download web pages. So looking at your problem, you can either find a way to programmatically create all of the links you require (all players), which would mean something like downloading all of their names, perhaps from another source, and then using their last name first with the first two letters of their first name to construct MOST of the pages you require. This method isn't pretty because, after poking around, I am unsure how exactly the page names are structured. It seems there are rules depending on how long the names are, hyphens, etc.
Your other option is to not only scrape the pages, but crawl the site looking for what you need. If you are crawling I recommend scrapy. http://scrapy.org/ its awesome. With that being said I don't know what your objectives are, strengths, weaknesses, timelines, objectives. etc.
if you are just after data I would attempt to build all of the links using names. You might be able to cover a lot of ground quickly with that. If you are interested in learning and expanding as well as data. I would recommend python and scrapy, because any and all web data will be relatively within your grasp after learning them, which is a liberating feeling :)
good luck
Yes, mzalewski is right.
An alternative if you're on a Mac: set the method parameter of download.file() to 'curl':
download.file(url, 'yourFileName.pdf', method = 'curl')
That parsed the url's spaces automatically and downloaded the file no problem for me. If you're on a Windows machine, you can download the curl binaries here: http://curl.haxx.se/, and I understand you can then use the curl method once you get that set up.
Because apparently there's not enough call for it for somebody to maintain an installation like that. Somebody is maintaining a portable version that uses the Portable Apps framework, but you would still need to install it.
After a year and a half learning and working in R, this course from the MIT OpenCourseWare helped me pick up a rudimentary knowledge of Python in a very short amount of time.
It's introductory, but tackling the included assignments gave me a good starting point.
The nice thing about learning on your own is that you're free to modify the assignments to match the skills you want to learn. For example, problem set 1 asks you to create a program for calculating house hunting values. Since I was interested in learning GUI programming, I completed this program with the tkinter module which allowed for a functioning interface.
It can be difficult to adjust to the syntax too, and the resource that helped me most with that was the Python 3 tutorial from SoloLearn. The nice part is that the SoloLearn iOS/Android apps let you speed through the simpler stuff in your spare time.
Now, full disclosure: I have absolutely not stuck with Python since. I used it to solve the one problem that I needed to, and fiddled around with for-fun projects. I hope these resources can help you, though!
> I'd like an RStudio dedicated client app to talk to RStudio Server instances (not just Safari access)
Why? What's wrong with connecting through a browser? R Studio is a browser-based app anyway, both the editor and console panes are Ace editors.
You're correct on the other points though. iOS is simply too locked down to do much other than play Candy Crush.
If you can write a function, you can write a Shiny app. If you use RStudio, this is the standard example under "New File". Running it should open a browser window with the app:
# This is a Shiny web application. You can run the application by clicking
# the 'Run App' button above.
#
# Find out more about building applications with Shiny here:
#
#    http://shiny.rstudio.com/

library(shiny)

# Define UI for application that draws a histogram
ui <- fluidPage(

    # Application title
    titlePanel("Old Faithful Geyser Data"),

    # Sidebar with a slider input for number of bins
    sidebarLayout(
        sidebarPanel(
            sliderInput("bins",
                        "Number of bins:",
                        min = 1,
                        max = 50,
                        value = 30)
        ),

        # Show a plot of the generated distribution
        mainPanel(
            plotOutput("distPlot")
        )
    )
)

# Define server logic required to draw a histogram
server <- function(input, output) {

    output$distPlot <- renderPlot({
        # generate bins based on input$bins from ui.R
        x    <- faithful[, 2]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)

        # draw the histogram with the specified number of bins
        hist(x, breaks = bins, col = 'darkgray', border = 'white')
    })
}

# Run the application
shinyApp(ui = ui, server = server)
If you are on Linux, or if you feel like installing KDE on Windows, you can use RKWard, which is the most advanced GUI for R, menus and all. It also supports plugins and custom menus.