It's possible, and is being done, but it is very difficult. Human languages are grammatically very complex, and words can have very different meanings depending on context. I did some work on computer knowledge representation for a class and was pretty overwhelmed by how complex it is to get a computer to understand even simple sentences. For instance, take a look at WordNet and think about how much work it took to create such a database.
There are Scrabble foundations that release these for free use. They want you to mention where the word list comes from, though, and if your game can be used to cheat at Scrabble, they want you to mention that they are against cheating in Scrabble (those were the terms from the Norwegian Scrabble association at least; I assume others are similar).
I don't even remember where I got the English wordlist I'm using at the moment, but http://wordnet.princeton.edu/ has an extensive one (I don't think I used that one, since it was a hassle to parse into my own system).
I will check what list I'm using, or make a new one from a known source, before I publish.
PS: if you find a Scrabble-accepted word list that includes word classes, a heads up would be nice. (I only have that in the Norwegian version of my game right now.)
I concur. It's been a few years, but Wordnet is awesome and I would recommend it.
I'm no expert, but here are a few resources I've used; I'm sure others can add to the list. I would first check out WordNet. LingPipe is a Java tool that can do named entity recognition and part-of-speech tagging. And I really like to play with the Google Books Ngram Viewer. If you can, I would read the associated article, "Quantitative Analysis of Culture Using Millions of Digitized Books." If you're near an academic library, check out and read Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2009) by Jurafsky and Martin. It gives a really good intro to many NLP topics.
The first thing that comes to mind is Ogden's basic English words. You can get this easily for Python in the Nodebox en module (the very bottom of that page shows usage).
If you're looking for something more comprehensive, check out WordNet.
I've been using dict-mode [1] for a number of years, and I believe that package uses Wordnet [2] and a number of other open sources.
[1] http://www.emacswiki.org/emacs/DictMode [2] http://wordnet.princeton.edu/
Check out Wordnet. This has all that and more! The data are in text files with a fairly straightforward indexed structure (from memory when I looked at it a few years ago).
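The index files really are plain whitespace-separated text. Here is a minimal sketch of parsing one line of index.noun, following the wndb(5) field layout; the synset offsets in the sample line are made up for illustration:

```python
# Sketch of parsing a data line from WordNet's index.noun file.
# Field layout (per the wndb(5) format): lemma, pos, synset_cnt,
# p_cnt, then p_cnt pointer symbols, sense_cnt, tagsense_cnt,
# then one 8-digit synset offset per sense.

def parse_index_line(line: str) -> dict:
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    ptr_symbols = fields[4:4 + p_cnt]
    rest = fields[4 + p_cnt:]          # sense_cnt, tagsense_cnt, offsets
    sense_cnt, tagsense_cnt = int(rest[0]), int(rest[1])
    offsets = rest[2:2 + synset_cnt]
    return {
        "lemma": lemma,
        "pos": pos,
        "ptr_symbols": ptr_symbols,
        "sense_cnt": sense_cnt,
        "tagsense_cnt": tagsense_cnt,
        "synset_offsets": offsets,
    }

# Illustrative line only -- the offsets are not real WordNet offsets.
sample = "dog n 3 2 @ ~ 3 3 02084071 10133978 10042764"
entry = parse_index_line(sample)
```

The offsets index directly into the corresponding data.noun file, which is what makes lookups cheap without any database layer.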
>It's the only argument I've seen where God Did It is the MOST stable conclusion. And it's still not great, in my opinion, but compared to the secular arguments, it's vastly superior.
Really, that's your ignorance of the field, not a lack of better answers.
Philosophy and metaphysics when applied to ontology, semantics, and epistemology are actually modeled quite well with multi-class neural networks.
The state of the art in semantic classification is about 94% accuracy across 117,000 synsets (actually the subset that are discrete nouns; work proceeds on the other synsets).
So what that means technically is our best image classifiers can distinguish between a haystack and a surfboard and a telescope and a bluebird purely from an image file about 95% of the time. Most of the advances have come in the last 3 years, when we started using something called deep convolutional neural networks. We now regularly use between 15 and 21 semantic layers in neural network classifiers. The first input layer is merely pixel brightness across RGB space, the second layer is usually doing edge/border detection, the third line/curve identification, all the way up to what you call particulars like table/cave/ice cream cone.
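The layer progression described above can be illustrated at its lowest rung: convolving pixel brightness with a small kernel to detect edges. Real networks learn their kernels; this toy hard-codes a Sobel-like vertical-edge detector in pure Python:

```python
# Toy illustration of what an early conv layer computes: a 2D
# convolution of pixel intensities with a small kernel.

def conv2d(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            s = sum(img[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# 4x4 image: dark on the left (0), one bright column on the right (1).
image = [[0, 0, 0, 1]] * 4

# Sobel-like kernel that responds to left-to-right brightness changes.
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

response = conv2d(image, kernel)
# The response is zero over flat regions and large at the edge.
```

Stacking many such layers, with learned rather than hand-coded kernels, is what carries the network from edges up to particulars like table/cave/ice cream cone.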
Most metaphysical arguments are inferior to semantic classifiers because they cannot handle probability, randomness, or measured certainty over the logic of language propositions. Quite a few don't handle temporality or computation either.
I'm not sure why you think the strongest semantic orientation (SO) is indicative of the best, or even an adequate, summary sentence of the article. To me it seems like a frustrating metric, because news articles in particular actively attempt to use a neutral lexicon. However, I think the approach of finding a suitable sentence within the article to use as a summary is a good one, just with a richer metric.
Have you ever heard of WordNet? I could imagine finding the "semantic network" of each sentence in an article, comparing it to the semantic network of the rest of the article, and seeing which sentence has the furthest 'reach' within the article's network. I'm not sure what the best measure of that would be, but the concept at least seems to approach the idea of 'summary' better than orientation. Food for thought.
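A toy sketch of that 'reach' idea, using a tiny hand-built semantic network in place of WordNet (all words and edges here are invented for illustration): score each sentence by how many of the article's content words it can reach within two hops.

```python
# Breadth-first "reach" over a hand-built semantic network.
# The network is a stand-in for WordNet relations.
from collections import deque

network = {
    "war":      {"soldier", "tank", "conflict"},
    "soldier":  {"war", "gun"},
    "tank":     {"war", "armor"},
    "gun":      {"soldier", "bullet"},
    "bullet":   {"gun"},
    "armor":    {"tank"},
    "conflict": {"war"},
}

def reach(seed_words, max_hops=2):
    """All words reachable from the seeds within max_hops edges."""
    seen = set(seed_words)
    frontier = deque((w, 0) for w in seed_words if w in network)
    while frontier:
        word, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nbr in network.get(word, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, hops + 1))
    return seen

article_words = {"war", "soldier", "tank", "gun", "bullet", "armor"}

def score(sentence_words):
    """How many article words this sentence's words can reach."""
    return len(reach(sentence_words) & article_words)
```

A sentence containing "war" scores higher than one containing only "bullet", matching the intuition that the better summary sentence sits closer to the article's semantic center.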
Well, no, not without a lookup table.
The problem you're describing is actually a lot harder than you may realize. Take, for example, the word "ground." It can be the participle form of the verb "grind", or the base form of the verb "ground", or the singular form of the noun "ground."
You can make some progress in deciding which word "ground" is a form of by using part of speech tagging. (For which there are fairly accurate taggers out there.) If it's a noun, you know it's probably a form of the word "ground/N". But, if it's a verb, you still don't know whether it's a form of "ground/V" or "grind/V".
So, the problem becomes that of sense disambiguation. Which is very difficult, because, unlike POS tagging, it's a pain to get tagged data to train with. So, you'd be lucky to get one instance of "grind/V" in your training data, which means your tagger really doesn't know anything about the contexts that "grind/V" is likely to show up in.
So it probably wouldn't get the chance to learn that you grind spices and ground arguments, which means that's a distinction you would have to code in manually, multiplied by every word in the language, which is terrible.
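To make the ambiguity concrete, here is a toy lookup from (surface form, POS tag) to candidate lemmas. The table is hand-built purely for illustration, but it shows why a POS tag settles the noun reading of "ground" while the verb reading stays ambiguous:

```python
# Toy (word, POS) -> candidate-lemma table. Hand-built; a real
# lemmatizer would derive this from a morphological lexicon.
CANDIDATES = {
    ("ground", "NOUN"):   ["ground"],
    ("ground", "VERB"):   ["ground", "grind"],  # still ambiguous
    ("grinding", "VERB"): ["grind"],
}

def lemma_candidates(word, pos):
    # Fall back to the surface form itself for unknown words.
    return CANDIDATES.get((word.lower(), pos), [word.lower()])

# POS tagging alone resolves the noun case...
assert lemma_candidates("ground", "NOUN") == ["ground"]
# ...but the verb case needs sense disambiguation on top.
assert lemma_candidates("ground", "VERB") == ["ground", "grind"]
```

Everything after the POS tag (picking between "ground/V" and "grind/V") is exactly the sense-disambiguation step where tagged training data runs out.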
Resources:
Wordnet. Using wordnet you can get a list of the senses of a word. These would be your candidate "clusters" for that form.
UKB. A sense-disambiguation approach that uses WordNet. There are some papers there and some software you can try to get running. I wasn't able to get it running on OS X, but it was extremely straightforward in an Ubuntu VM, so try that out.
I use Onelook Reverse Dictionary every day. Since you can constrain the output with wildcarded templates, it's even good for building backronyms.
http://wordnet.princeton.edu and related do this, and put homonyms into an ontological hierarchy.
While this is a fantastic project, I have found it to be too incomplete for anything non-trivial. E.g., I tried to recover a common topic from terms like (gun, soldier, bullet, tank), hoping to come out at "war", but clearly that would be too easy ;)
There are some other projects that combine with WordNet or build on top of it, but I'm having a hard time finding them on my phone right now ;)
WordNet would be an obvious first place to look, but WordNet really only knows about taxonomic relations: a 'dog' is an 'animal', etc. It will tell you that dogs and cats are similar concepts (by virtue of both being mammals, or whatever), but it probably doesn't know that dogs and bones are related (other than that they're both 'things'). OpenCyc or ConceptNet are probably better for that.
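A toy illustration of the point: walking parent links in a hand-built hypernym tree (a stand-in for WordNet's hierarchy) to the lowest common ancestor shows 'dog' and 'cat' meeting at 'mammal', while 'dog' and 'bone' only meet at the root, despite being strongly associated in practice.

```python
# Hand-built hypernym tree: child -> parent. Illustrative only.
PARENT = {
    "dog": "mammal", "cat": "mammal",
    "mammal": "animal", "animal": "entity",
    "bone": "object", "object": "entity",
}

def ancestors(word):
    """The word itself plus its chain of hypernyms up to the root."""
    chain = [word]
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain

def lowest_common_ancestor(a, b):
    """First ancestor of b that is also an ancestor of a."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None
```

Any similarity measure built on this tree will rate dog/cat as close and dog/bone as maximally distant, which is why associative resources like ConceptNet complement it.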
>WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
Well, for my program, I was going to try a few things.
Since not everything works, I was planning on having the individual features pluggable per project so you can decide what works and doesn't work.
Doing the above would give some hints, though untagged dialog may still be difficult to identify.
Also, having the dialog identified by character would let me get a statistical analysis of each character's speech, which is useful for identifying said catch phrases, common patterns, etc.
As I mentioned before, I wanted to be able to have a grammar checker work inside and outside of dialogs but with different settings. So, I could have one character speaking like Yoda and have the appropriate rules. And then have character dialog ignore the interrupts:
> "What do you think," asked Gary, "about the cheese?"
... would look at:
> "What do you think about the cheese?"
... but treat it as Gary's speech.
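A minimal sketch of that rejoining step, assuming simple double-quoted dialog with a single speaker-tag interrupt (real prose needs far more careful handling, e.g. nested quotes and multi-paragraph speech):

```python
# Join quoted spans around a speaker-tag interrupt so a grammar
# checker can see the dialog as one continuous sentence.
import re

def join_dialog(line: str) -> str:
    # Pull out every double-quoted span, dropping the narration.
    spans = re.findall(r'"([^"]*)"', line)
    text = " ".join(s.strip() for s in spans)
    # Heuristic: collapse a comma left dangling where the speech
    # was interrupted mid-clause (comma followed by a lowercase word).
    return re.sub(r',\s+(?=[a-z])', ' ', text)

sentence = '"What do you think," asked Gary, "about the cheese?"'
print(join_dialog(sentence))  # What do you think about the cheese?
```

The comma-collapsing rule is a rough heuristic; an interrupt that falls at a genuine clause boundary would need the comma kept, which is the kind of case the per-project pluggable features would have to handle.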
Why the C/C++ requirement? As for APIs, I think WordNet has some tools you could use. Without knowing the larger system, maybe the first couple of hits for WordNet APIs would be good enough. I've worked with the raw data files, and they are easy enough to translate other resources into and out of.
I've come across Princeton University's WordNet many times. I don't know if you can get the raw data, but see what you can find.
They may not be a race (people who are believed to belong to the same genetic stock), but surely they are a race ((biology) a taxonomic group that is a division of a species; usually arises as a consequence of geographical isolation within a species), and they are definitely racist (discriminatory especially on the basis of race or religion).
If you had said "incorrect data analysis" or "hyperbole that helps neither men nor women" I would have found myself in agreement with you or more willing to listen and talk about the data.
Using the word "feminist," which essentially means "a supporter of feminism, which is a doctrine that advocates equal rights for women" (http://wordnet.princeton.edu/ definition) to describe a problem with data gathering or analysis is something I see as confusing or unhelpful.
Of course we should want feminist data analysis - we should want a way of looking at information that respects both women and men as people and as equals, not placing the rights of one above the other.
You also said that this incorrect data or interpretation thereof was the cause of the OP's problem:
> you wife has an enormous fear which has been fueled mostly by feminist political propaganda
You don't know this without knowing a person, and making such a statement is inflammatory and does not help a reasonable debate. That is what I meant when I said it was unhelpful.