I found the WEKA toolkit to be a nice centralised resource for learning about the multitude of techniques and parameters out there. There's a book too, which is a very informative read, if a little dry in places.
This was used in my Language Identification project for speech signals and it worked quite nicely.
For those who don't want to sit through Ng's course material, these notes are (seemingly) pretty good: http://www.holehouse.org/mlclass/
I'd spend time going over some finite math and algebra MOOCs or tutorials or videos instead TBH.
Also, this book is a great intro that assumes very little and introduces the required maths and stats as necessary. Once you've crammed through it and understand what's going on in a black-box manner, you can revisit the concepts in more depth afterwards: http://www.cs.waikato.ac.nz/ml/weka/book.html
You could load your dataset into Weka: http://www.cs.waikato.ac.nz/ml/weka/
it's a suite of machine learning algorithms written in Java. You can import your data fairly easily, set the types of your variables and whether they're known or unknown, and then classify your unknowns using various algorithmic methods (e.g. decision trees / neural nets / naive Bayes). It won't give you a properly sophisticated model of your problem, but it's fun to play around with common techniques.
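If you'd rather drive it from code than the GUI, the Java API only takes a few lines. A minimal sketch (the file name and the choice of J48, Weka's C4.5-style decision tree, are just examples, not the only options):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class QuickStart {
    public static void main(String[] args) throws Exception {
        // Load a dataset (DataSource handles ARFF, CSV, and more)
        Instances data = DataSource.read("mydata.arff"); // hypothetical file
        // Tell Weka which attribute is the class to predict (here: the last one)
        data.setClassIndex(data.numAttributes() - 1);

        // Train a C4.5-style decision tree
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Printing the classifier shows the learned tree
        System.out.println(tree);
    }
}
```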
Came here to mention Weka. It's not clear that the author of the article actually wants neural networks to solve his problem, and a toolkit like Weka lets you experiment with a bunch of different machine learning / data mining algorithms. Weka at least has a decent book to help with the understanding too.
Typically, deep neural nets outperform naive Bayesian classifiers if they have been well trained on representative data. However, the trade-off is complexity, both in terms of difficulty to understand and build, and in terms of the sheer processing power required.
Perhaps the best course of action would be to use a Java machine learning framework like Weka, with which you can very easily build a naive Bayes or decision tree classifier. This will help you understand the basic principles of feature extraction and supervised machine learning: supplying training pairs, etc. There is also a companion Weka book you could look at. Then I would suggest you use a cross-fold validation / blind test methodology to see if the naive Bayes classifier you've built is good enough, and if, after you've gone in and messed about with the features a bit, it still isn't good enough, consider neural nets.
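To make that concrete, here's a minimal sketch of that workflow in Weka's API (the ARFF file name is hypothetical, and 10-fold cross-validation is just a sensible default):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesCv {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);      // class = last attribute

        // 10-fold cross-validation of a naive Bayes classifier
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Accuracy, kappa, error rates, etc.
        System.out.println(eval.toSummaryString());
    }
}
```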
If you are going to go down that path, I'd recommend following a tutorial like this one, which is for Python and TensorFlow. It uses a convolutional topology (as opposed to the LSTM I recommended), but the programming principles are the same, and if you complete it I expect you'd have learnt enough to be comfortable replacing the convolutional layer with an LSTM later.
If you run into problems then post on this subreddit or the cross validated stackexchange and I expect you'll get some help!
That description helps a lot. Especially given the sheer number of predictors and that d > N (i.e., that your predictors outnumber your instances), Naive Bayes performed quite admirably. I assume that you're stripping out a lot of uninformative terms (function words, for instance).
It might also be possible (though with extra work) to identify features that really zero in on your categories. So, for example, consider making a document-term matrix for all categories' item descriptions. This gives a picture of the overall distribution of terms. But it is quite probably the case that individual categories' term distributions differ on only a handful of keywords. The union of all categories' "maximally identifying" keywords would then create a much-reduced feature set with more predictive utility. Now, it's easy for me to say that, but actually doing it would be a bit of a computational chore.
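Weka can actually get you most of the way there: its supervised attribute selection filter will rank terms by how informative they are about the class and keep only the top ones. A sketch, assuming your document-term matrix is already in ARFF form (the file name and the cutoff of 500 terms are made up):

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class KeywordSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("doc-term.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);      // class = category

        // Rank terms by information gain with respect to the category,
        // keeping only the 500 most informative ones
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(500);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println("Kept " + reduced.numAttributes() + " attributes");
    }
}
```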
I do think you can get some mileage out of random forests, and you almost certainly should be able to use the features as you've already coded them. If they'll work for NB, they should work for RF. As for Java ... you shouldn't have a problem. But I'd opt to use an existing package. Weka is Java based, and you can do a lot more at the command line than you can using Weka's GUI, so you may be right at home there.
That would definitely be solved using ready-made software. Latent Dirichlet Allocation (LDA) is a popular and effective topic model and is available off-the-shelf in many packages. Mallet (http://mallet.cs.umass.edu/) is particularly easy to use.
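For a flavour of the Mallet API, here's a sketch loosely following its topic-modelling developer guide (the input file, one document per "id label text" line, and the choice of 20 topics are placeholders):

```java
import java.io.FileReader;
import java.util.ArrayList;
import java.util.regex.Pattern;
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class LdaSketch {
    public static void main(String[] args) throws Exception {
        // Lowercase and tokenise each document
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequenceLowercase());
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList docs = new InstanceList(new SerialPipes(pipes));
        docs.addThruPipe(new CsvIterator(new FileReader("docs.txt"), // hypothetical file
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1));

        // Train a 20-topic LDA model
        ParallelTopicModel lda = new ParallelTopicModel(20, 1.0, 0.01);
        lda.addInstances(docs);
        lda.setNumIterations(1000);
        lda.estimate();

        // Print the top 10 words for each topic
        for (Object[] topic : lda.getTopWords(10)) {
            for (Object word : topic) System.out.print(word + " ");
            System.out.println();
        }
    }
}
```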
Regression and classification software is very common. If you only want to sort people into, say, age groups 1-5, you could probably use Mallet for this too. In fact, naive Bayes may work very well for predicting demographics from topics.
If you choose to use regression (and say, precisely predict someone's age), then WEKA (http://www.cs.waikato.ac.nz/ml/weka/) is a nice, easy package to start with.
I don't want to understate the amount of work involved with this, but my proposed solution basically involves just gluing some packages together with some Java code. Someone with experience with these packages could probably hack all this together in a day or so (depending on the amount of data). It is definitely not anywhere close to a dissertation.
Game AI and real AI are quite different things.
Game AI is basically a set of rules written that your bot will follow. Like how a blackjack dealer always hits if they are below 17. If you like to write rules, try https://codecombat.com/
Real AI is more about Machine Learning. You give the program a goal, and it comes up with the rules of its behavior. It's rarely used in games because it makes for poor storytelling. If you'd like to learn about how machines learn, try http://www.cs.waikato.ac.nz/ml/weka/
If you want to start with a hands-on approach, download Weka and check out some tutorials or Youtube videos.
Weka is a tool that makes a huge amount of complicated math and computer science available to anyone. There is a lot of sample material to e.g. predict the weather or if somebody needs contact lenses, based on previously known data points (the essence of machine learning). It has a GUI and you don't need any programming experience to get started.
A ROC curve plots the true positive rate against the false positive rate. If your classifier just assigns the majority ('90%') class to all of the imbalanced data, it scores 90% accuracy, but in ROC terms it sits at the useless corner of the curve: a 100% true positive rate bought with a 100% false positive rate. With any ML algorithm, you should be able to improve on that and move to a better point on the ROC curve.
You may want to look at different metrics to evaluate the performance of your classifier, ones that take the true negatives and false negatives into account. Look up 'F-measure' and 'three point average recall' to get you started.
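If you're using Weka, its Evaluation class already reports most of these. A minimal sketch (the file name is hypothetical, NaiveBayes is a stand-in for whatever classifier you're using, and class index 0 is just the first class label):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ImbalancedMetrics {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("imbalanced.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Per-class precision/recall/F-measure and ROC area, which are far
        // more informative than raw accuracy on imbalanced data
        System.out.println(eval.toClassDetailsString());
        System.out.println("F-measure (class 0): " + eval.fMeasure(0));
        System.out.println("ROC area  (class 0): " + eval.areaUnderROC(0));
    }
}
```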
The WEKA book has a whole section devoted to explaining this topic: http://www.cs.waikato.ac.nz/ml/weka/book.html (plus you can support the authors). I'd recommend it, unless you're looking for a blog post or something (wherein you probably won't find the specialized information that you're looking for).
This. If you do go down the data mining route, check out Weka (http://www.cs.waikato.ac.nz/ml/weka/). It doesn't take very long to learn and is great for exploring relationships between variables in a large, multi-variable dataset.
As others have said, this is a statistical problem. You might also find it useful to look at sources on machine learning, as there are a number of ML algorithms designed for this type of problem (classification, supervised learning).
The question of "what's the best type of model" doesn't really have an easy answer. At the very least, you need to first decide what the model will be used for; i.e. if you're only concerned with classification accuracy, you might prefer ML algorithms with good accuracy on a validation set, if you're interested in using the model to understand a physical process, it may be necessary to use a statistical model with more easily interpreted parameters, and if you're using the model to make decisions, you might prefer to use a Bayesian model and maximize expected utility. Even then, there are most likely many reasonable models that you could use, and the topic of how to make such a choice (or whether to make a single choice at all rather than using something like model averaging) is a topic which people spend years studying.
You might also take a look at Weka, which you could use to try out different algorithms as a starting place. http://www.cs.waikato.ac.nz/ml/weka/
From what I remember of AI, it doesn't matter too much which language you learn, provided you know all the underlying theory and 'get' AI. When it comes to implementing solutions, the language you do it in isn't massively important. I've seen machine learning software that uses Java, Python, C++, C# and MATLAB, so you don't need to worry too much about languages, provided you get the concepts right.
If you're really interested in machine learning, download WEKA, which uses Java, and play about with some large datasets you can download from Kaggle. If you look at the datasets and the discussion on them, it should give you some insight into what other people are doing with them.
On the note of C++ being more beneficial for getting a career in the industry: you're going to have to be a lot more specific about which industry you want to join. It also depends on where you are in your country. For example, Nottingham in the UK is a proverbial goldmine for C# developers; you can't move for the number of C# jobs lying around.
I really like Weka for a basic introduction to machine learning and AI concepts. You can compare and experiment with a huge suite of built-in techniques to understand which tools work better on different datasets.
Essentially, going through raw data and finding trends within.
I took an intro class to data mining in college, and for our main project we were given a set of raw data concerning computer programs in the 1980s. Using a tool called Weka, we could focus on one dependent variable (number of bugs, in this case) and several independent variables (length of the code, number of comments, action statements, etc.), creating either linear regression equations or decision trees, either of which could be used to predict future cases.
Interesting.
Obviously correlation doesn't imply causality.
I think you'll want to look at Judea Pearl's work on Causal Inference, this is a good starting point (although it gets pretty detailed, you don't need to understand the whole algorithm to use it): http://www.michaelnielsen.org/ddi/if-correlation-doesnt-imply-causation-then-what-does/
I believe http://www.cs.waikato.ac.nz/ml/weka/ has an implementation of causal inference; I'm sure there are others.
I'd be interested in your ability to predict alcohol consumption levels (e.g. abuse) based on other social factors. The weka package provides a number of classifiers that you can use and a graphical interface if you're not interested in programming.
You should probably check out Weka. It lets you put all the observed attributes and the desired outcome into an (ARFF) file, and then you can try out various algorithms on your data and find out what works best. It has an insane collection of methods, so you don't need to decide now whether neural networks or decision trees or whatever work best.
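For reference, the ARFF format is basically just CSV with a typed header. A made-up example (attribute names and values are purely illustrative):

```
@relation purchases

@attribute age        numeric
@attribute income     numeric
@attribute region     {north, south, east, west}
@attribute purchased  {yes, no}

@data
34,52000,north,yes
27,31000,south,no
45,76000,east,yes
```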
I used it during my undergrad AI/ML course and it seemed convoluted and confusing. I also felt that there was a lot of overhead to quickly try new things.
I think the best demonstration is to compare WEKA's documentation landing page and scikit-learn's documentation landing page.
I agree with this. Another approach you might try in the same vein is to use machine learning.
In particular, I'd try making a table of the data with a column for things like day of the week and month (or week number through the year). These will be the "inputs" to the machine learning. It will probably also be helpful to include the data from the previous few days as extra columns for each data point, since they can be inputs to the machine learning model too.
If you produce something like a csv file with a row for each data point (with the columns like above), you should be able to load it into a tool like WEKA (http://www.cs.waikato.ac.nz/ml/weka/) and try out a number of machine learning algorithms pretty easily. Unfortunately, I suspect that using the tool is not so intuitive and there's a lot of jargon to learn. Personally, I'd start with a "nearest neighbors" based approach - basically, it will look for when a certain pattern has happened in the past and guess based on what happened last time.
When you are building models, you can use something called "cross-validation" to see how good it is. Essentially, you leave out some of the data when building your prediction model, and then test how accurate it is on that data. You can do this on different subsets of the data to get a bigger test sample.
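As a sketch of what all that looks like in WEKA's Java API (the file and column layout are hypothetical; IBk is Weka's nearest-neighbours classifier, and the cross-validation at the end is exactly the hold-out-and-test idea above):

```java
import java.io.File;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class NearestNeighbourForecast {
    public static void main(String[] args) throws Exception {
        // Load the CSV of feature columns (day of week, month, lagged values, ...)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("history.csv")); // hypothetical file
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // target = last column

        // k-nearest-neighbours: predict from the most similar past patterns
        IBk knn = new IBk();
        knn.setKNN(5);

        // Cross-validation: repeatedly hold out part of the data and test on it
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```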
I'm not sure whether you're looking for a general self-directed reading course type of thing, or a thesis-style project.
If the former, definitely AI. It's incredibly useful, and if you've done anything computational at all before, you've probably run into a few of the concepts already. It'll also be very relevant to you; in the AI class I'm taking right now, which is geared toward CS majors, maybe a third of the motivating example applications are from linguistics.
If the latter, there are so many awesome things you can do!
Whatever you end up doing, good luck!
A course I'm taking is largely based upon this book, and I think it may help. It's bundled with a piece of free software called Weka, which lets you experiment with a whole range of data mining techniques.
Admittedly this isn't using R, but you could do your initial data analysis in Weka and use whatever you've learnt there in R.
You're given an arff file, I assume your instructor is telling you to use WEKA? I've used the tool before and it uses the arff format. Though I can't help you more than that since it's been a long while since I used it.
You might want to search for tutorials on using the WEKA tool. The documentation page is a good place to start.
Edit: I might add that you need to use Java if you use Weka.
If you want to predict some outcome, the first step is to ask yourself what collection of variables would contain enough information to do this. It sounds like you maybe don't really know, so you just want to record a bunch of stuff in the hope that it will fit the bill. This is a risky investment if you have no research/experts telling you which things to measure. Another thing to consider is how noisy your labels will be: if someone fails/passes a urine test, does that mean they definitely did (not) relapse?
In any case, if you go through with this you will end up with some data. It's impossible to say right now which methods will work best for your particular problem. It sounds like a relatively straightforward prediction problem, so most supervised learning / data mining algorithms should work in principle. One fairly common strategy seems to be to just try a lot of different things and pick the best (be sure to do cross-validation!). Tools like Weka make this easy. You can also check out Kaggle to see how people often tackle this. Gradient boosting and deep learning seem the most popular nowadays, but it depends on your exact problem and requirements (deep learning can be difficult for beginners though, especially if you don't have a lot of data).
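Here's roughly what "try a lot of things and cross-validate" looks like with the Weka API (the file name is hypothetical and the three classifiers are arbitrary picks, not recommendations):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelShootout {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("relapse.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = {
            new NaiveBayes(), new Logistic(), new RandomForest()
        };

        // 10-fold cross-validation for each candidate model
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-20s %.1f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```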
It sounds like you can just store your data as comma separated values (CSV) (or some serialized format), and you don't really need a database. This has the advantage of not having to mess with a database, and will make it easy to use different tools like MATLAB and R, because everybody can read in CSV files.
I have very little experience with R, but you might be right that it's gaining ground. Also Python (with NumPy/SciPy). I think MATLAB is also still good. Another option is using Weka, which contains many algorithms that you can easily try and visualize. So go with whatever you like best. These are all fine choices.
http://www.cs.waikato.ac.nz/ml/weka/book.html
http://guidetodatamining.com/ <- the programming side is in Python as I recall, so it should seem pretty easy to you.
I recommend looking for courses on Coursera, edX, etc. I don't think you'll find anything decent in person in Monterrey without having to take some other kinds of classes as well. Maybe http://www.itesm.mx/wps/wcm/connect/itesm/tecnologico+de+monterrey/maestrias+y+doctorados/escuelas/escuela+de+ingenieria+y+ciencias/mit
As mentioned below, Andrew Ng's Coursera course is quite good.
If you'd like to jump in and just play around with some basic techniques to get a feel for them, I recommend the Weka GUI. You can find a list of nice, simple datasets to work with at KEEL.
If you want to get a bit lower level with ML techniques I would recommend SciKit-Learn, which is a python library. Its support for neural nets is a bit lacking, though.
Incredible resource. Thank you. I would add Weka, an open source collection of data mining and machine learning tools written in Java. I'm unsure how popular it is but I've used it for a few grad courses.
This question would be tackled in the first few classes of a machine learning (ML) course. While it's true there are an infinite number of equations that fit a finite data set, you can make a good guess at what it might be by making some assumptions.
You may be familiar with "linear regression", a technique where you take a bunch of data points and say "I'm specifically looking for a linear equation that best fits through these data points." There are also higher-order regression methods for when you want to fit a quadratic equation, a polynomial equation, a logarithmic equation, etc.
But if you don't know what form the equation will take, there's still hope. Enter "supervised learning". SL is a branch of ML dealing specifically with techniques for learning a function given a bunch of example input/output data. There are tons of different algorithms for this, many of which are listed on the SL wiki page. And on top of this there are many techniques for how to train and validate the algorithms.
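To make the linear-regression case concrete, here's what a fit looks like in a toolkit such as Weka (mentioned below); the data file is a placeholder, and printing the classifier shows the fitted equation:

```java
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FitLine {
    public static void main(String[] args) throws Exception {
        // Numeric inputs, with the value to predict as the last attribute
        Instances data = DataSource.read("points.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);

        // Prints the learned equation, e.g. "y = 2.01 * x + 0.97"
        System.out.println(lr);
    }
}
```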
If this is something you are interested in learning about from a low-level mathematical perspective, I recommend the Coursera ML course. You could probably find videos from past years on youtube.
If you are just looking for the algorithms themselves, one free tool that has an extensive collection of ML methods is Weka. But ML has gained a lot of popularity in the last decade, so you should take a look around.
Specific algorithms you might look at:
Another book I'd recommend is the Weka book http://www.cs.waikato.ac.nz/ml/weka/book.html - I wouldn't call it "must have" but I certainly found it a useful reference, even if you don't use Weka.
It's been a while since I read it but I don't think its coverage of neural networks goes beyond the multilayer perceptron, but even so it's a good introduction to a lot of standard machine learning approaches.
You'd have to choose an algorithm and framework for doing the classification (or build your own). Weka is very popular and I'd recommend you start there; so are scikit-learn, libsvm, torch, etc. R has some excellent stuff too. Apache Spark is good if you have existing infrastructure like HBase/Hadoop/Cassandra.
If you want to roll your own, a free resource is Clever Algorithms, although Artificial Intelligence: A Modern Approach is a classic for walking you through AI as a whole and has good sections on learning/classification.
Cheers!
It was just some Python to scrape data from mtggoldfish, reformat it into the ARFF format and feed it into Weka for classification, where all the hard work was already done by generations of CS grad students.
It worked pretty well but it would be a lot of work to keep up with all the formats and new sets as you need to have training data (decks properly labeled with an Archetype) before you can classify an unknown deck.
I should clean the whole thing up and put it on GitHub; maybe I'll get around to it one day.
For experimentation, I would recommend starting with a system that works right out of the box. Namely Weka, which you can find here: http://www.cs.waikato.ac.nz/ml/weka/.
Weka is nice because...
It has a user-friendly GUI.
It offers useful data visualization features.
It is very well documented.
It can handle a variety of algorithms/tasks.
It comes with a bunch of toy datasets (survey data, weather data, etc.).
You can easily integrate Weka into your own Java programs.
It's one of my favorites to use because there's very little labor involved in setting up your experiments, and you can plug in your models straight into your Java code.
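For instance, a trained model can be saved and reloaded inside your own program like this (the file paths and the choice of J48 are placeholders):

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class EmbedModel {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // hypothetical file
        train.setClassIndex(train.numAttributes() - 1);

        // Train once and save the model to disk...
        J48 tree = new J48();
        tree.buildClassifier(train);
        SerializationHelper.write("tree.model", tree);

        // ...then load it later inside any Java application
        Classifier loaded = (Classifier) SerializationHelper.read("tree.model");
        double predicted = loaded.classifyInstance(train.instance(0));
        System.out.println("Predicted: " +
                train.classAttribute().value((int) predicted));
    }
}
```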
If that interests you, you can PM me on reddit if you have any questions or issues with setting it up (not that I think you will, but just in case).
Weka is a data mining tool I used for my final-year Data Mining module. It's quite powerful and there are quite a few guides/tutorials. It's also free and open source.
The machine will. I'm planning on using a couple of AI techniques to analyze the data. There is a great tool called Weka that has several classification algorithms already implemented.
How much do you want to learn math & theory vs learning practical application? The vast majority of books on neural nets are heavy on math, proofs, and trivial examples that aren't useful.
I'd recommend you download a software library for Neural Nets. Some of them have great documentation and examples on how to apply them to your own problems.
Weka might be a good place to start: http://www.cs.waikato.ac.nz/ml/weka/
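Weka's built-in net is a plain multilayer perceptron, which is plenty for experimenting before diving into the maths. A minimal sketch (the file name and the hyperparameter settings are placeholders, not recommendations):

```java
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TinyNet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron net = new MultilayerPerceptron();
        net.setHiddenLayers("10");  // one hidden layer of 10 units
        net.setLearningRate(0.3);
        net.setTrainingTime(500);   // training epochs

        net.buildClassifier(data);
        System.out.println(net);    // prints the learned weights
    }
}
```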
If MATLAB isn't a must, an easy-to-use implementation of the C4.5 algorithm is Weka's J48, in Java (you could of course bridge this into the MATLAB workspace).
When I was taking a machine learning course at uni we did a lot of experiments with Weka. Free/Open source software that lets you run lots of different techniques on data. Could be useful as you probably won't want to implement them all yourself.
If you're dead-set on using a trained classifier for this - and from your description it doesn't sound like that's the best approach - something simple like C4.5 should do. In fact, if you can describe your data in ARFF format (very simple), you can use the Weka toolkit to experiment with many, many learning functions: http://www.cs.waikato.ac.nz/ml/weka/. If you're on Ubuntu, there's already a package for it in Synaptic.