Yes. My background is traditional manual translation, but I got a little bored and became intrigued by machine translation.
I learned Python and became kind of obsessed with statistical machine translation. Since I have no computer science background, I have been teaching myself everything I can absorb.
Now I'm at the stage where I think I'm able to implement a Moses system. Moses is the leading (I think) open source statistical machine translation framework, created by some of the researchers who were involved in Google's switch from rule-based translation (Systran) to statistical translation (Philipp Koehn et al.).
I am installing a Linux instance on AWS in order to build a Moses system there. I'm very excited. I am studying all the steps needed to implement it. A lot of the performance of such a system, AFAIK, depends on the data. I have been searching for sources of bilingual and multilingual data, and I found OPUS, an open collection of parallel corpora. I want to see if I can train a machine translation system for subtitles! This would be a way to leverage the huge corpora of crowd-submitted subtitle translations that are openly available.
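Since so much depends on the data, one of the first preparation steps I'm studying is cleaning the parallel corpus before training (Moses ships a script for this, clean-corpus-n.perl). Here is a rough sketch of the same idea in plain Python; the file names are just placeholders:

```python
# Drop sentence pairs that are empty, too long, or wildly mismatched in length.
# This mirrors what Moses' clean-corpus-n.perl does; it is only a sketch.

def clean_parallel(src_path, tgt_path, max_len=80, max_ratio=3.0):
    kept = []
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            s, t = src.split(), tgt.split()
            if not s or not t:
                continue                      # empty line on either side
            if len(s) > max_len or len(t) > max_len:
                continue                      # overly long sentences hurt alignment
            if len(s) / len(t) > max_ratio or len(t) / len(s) > max_ratio:
                continue                      # suspiciously different lengths
            kept.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return kept

# Hypothetical file names for an English-Spanish subtitle corpus:
pairs = clean_parallel("subtitles.en", "subtitles.es")
print(f"kept {len(pairs)} sentence pairs")
```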
This is my first project beyond regular translation work. I want to dare to build it, even though I am not a formally trained engineer. I am (was?) a translator, but now I want to grow, try to understand machine translation technology, and use it creatively.
Moses is a program that can build systems for statistical machine translation if you feed it a lot of translated text. If you don't have enough data for statistical machine translation (which is likely: it takes quantities on the order of millions of words at least), you could try Apertium, a rule-based machine translation system that's designed not to require specialized linguistic or computer knowledge to create language pairs for it.
Strong AI is not a necessary prerequisite for human-level machine translation. Current AI can handle image recognition, Go playing, etc. at better-than-human levels (for the image recognition claim there are some caveats, but the point is that it's pretty good, despite us not being close to strong AI). At the rate that machine learning has been progressing, we will most likely have human-level machine translation in the next 5-10 years at the very latest.
Back during the Cold War, the US government invested loads of money into rule-based automatic machine translators, but they never worked; the funding dried up, and the collapse contributed to what is called the "AI winter." Google Translate has never, ever used anything like that. Up until two years ago, the state of the art in machine translation was "statistical phrase-based machine translation," exemplified by systems like this: http://www.statmt.org/moses/
The state of the art is now largely dominated by "neural machine translation," which is based on deep neural networks. Humans do not hand-code anything; the system learns what things "mean" (it creates a real-valued vector encoding a sentence in high-dimensional space) and then translates based on that. This approach has real potential to create human-level automatic translations.
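To make the "sentence as a vector" idea concrete, here is a toy sketch with made-up embeddings and plain averaging; a real NMT encoder learns this mapping (e.g. with recurrent or attention-based networks) rather than averaging:

```python
# Toy illustration only: map a sentence to one real-valued vector.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
# Hypothetical 8-dimensional word embeddings; a real system learns these.
embeddings = {w: rng.normal(size=8) for w in vocab}

def encode(sentence):
    """Represent the whole sentence as a single point in R^8."""
    return np.mean([embeddings[w] for w in sentence.split()], axis=0)

vec = encode("the cat sat on the mat")
print(vec.shape)  # (8,) -- one vector for the whole sentence
```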
> Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices.
...but don't expect good results.
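If it helps to see what "finds the highest probability translation" means, here is a toy, hand-scored illustration of the underlying noisy-channel idea (tiny made-up probabilities and a deliberately silly word-for-word model; this is not what Moses does internally, just the argmax it searches for):

```python
# Pick the candidate English sentence e maximizing log P(e) + log P(f|e).
import math

# Hypothetical lexical translation probabilities P(f_word | e_word)
tm = {("casa", "house"): 0.8, ("casa", "home"): 0.2,
      ("blanca", "white"): 0.9, ("blanca", "blank"): 0.1}
# Hypothetical language model probabilities P(e)
lm = {"white house": 0.01, "white home": 0.002,
      "blank house": 0.0001, "blank home": 0.00005}

def score(spanish, english):
    s = math.log(lm[english])
    # Toy word-for-word model with a fixed reordering (adjective/noun swap).
    for f, e in zip(spanish.split(), reversed(english.split())):
        s += math.log(tm[(f, e)])
    return s

best = max(lm, key=lambda e: score("casa blanca", e))
print(best)  # -> "white house"
```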
Also, you might look into this online course on machine translation: http://www.mt-mooc.upc.edu/
http://www.statmt.org/moses/?n=Moses.Releases
Moses releases some pretrained models. It looks like it's trained on Europarl, which is a data set of all the European Parliament proceedings translated into multiple languages. It's a pretty standard data set in academic MT.
Unfortunately it is somewhat small and might not cover the vocabulary you need. Generally though, you can expect Moses to perform near the state of the art among non-proprietary systems. (Google, Bing, etc. greatly benefit from having many orders of magnitude more training data in many more languages.)
But it will give you a full MT system out of the box.
It seems like this is basically just machine translation (which I think you note), and machine translation is a massive research field.
You could just try applying an off-the-shelf machine translation toolkit to the problem, if you think your problem is similar enough to language translation.
Moses is the most popular open source MT toolkit. Cdec is also popular.
Unfortunately MT is not just one algorithm, but is a fairly complex pipeline of algorithms. This is my field so I can help more if this is the path you want to go down.
People do apply existing MT software to other tasks (restoring case to lower-cased text, transliterating names) with good results.
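For instance, restoring case can be framed as "translating" from lowercased text into cased text: build a parallel corpus whose source side is the lowercased version of some monolingual text and whose target side is the original, then train any MT system on it. A rough sketch (file names are placeholders):

```python
# Turn a monolingual file into a "parallel corpus" for a recasing model.

def make_recasing_corpus(mono_path, src_out, tgt_out):
    with open(mono_path, encoding="utf-8") as fin, \
         open(src_out, "w", encoding="utf-8") as fsrc, \
         open(tgt_out, "w", encoding="utf-8") as ftgt:
        for line in fin:
            line = line.rstrip("\n")
            fsrc.write(line.lower() + "\n")  # "source language": lowercased text
            ftgt.write(line + "\n")          # "target language": original casing

make_recasing_corpus("news.en", "recase.src", "recase.tgt")
```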
Usually you don't test. Sounds weird, I know. But think of the alignment as the actual result of Giza's computations, not the models.
When people need to align new data, the usual way is to append that data to the training corpus and just run Giza again. If you want to do that often, this might interest you:
Moses documentation on incremental training
Nevertheless, if you still want to evaluate your model as-is, this is quite easy for the simpler ones such as the HMM. Here is a sketch of the kind of code that evaluates an HMM model in pure Python (it assumes you have already parsed Giza's tables into dictionaries, so treat it as an outline rather than a drop-in script):
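```python
# Score a sentence pair under an HMM alignment model with the forward algorithm.
# Assumed inputs (parsed from Giza's output beforehand):
#   t[(f_word, e_word)] -> lexical translation probability t(f|e)
#   jump[delta]         -> probability of jumping delta positions between
#                          consecutive alignment points
import math

def hmm_log_likelihood(f_sent, e_sent, t, jump):
    """log P(f_sent | e_sent) under a simple HMM alignment model."""
    I, J = len(e_sent), len(f_sent)
    # forward[i] = prob. of emitting f_1..f_j with the j-th word aligned to e_i
    forward = [t.get((f_sent[0], e_sent[i]), 1e-12) / I for i in range(I)]
    for j in range(1, J):
        new = []
        for i in range(I):
            trans = sum(forward[k] * jump.get(i - k, 1e-12) for k in range(I))
            new.append(trans * t.get((f_sent[j], e_sent[i]), 1e-12))
        forward = new
    return math.log(sum(forward))

# Toy usage with made-up probabilities:
t = {("maison", "house"): 0.7, ("la", "the"): 0.6}
jump = {-2: 0.05, -1: 0.2, 0: 0.2, 1: 0.5, 2: 0.05}
print(hmm_log_likelihood(["la", "maison"], ["the", "house"], t, jump))
```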
Lastly, if you look closely at the Giza/incgiza code, there seems to be some XML-RPC functionality, but I was not able to get it to work. If anyone did, please shout.
Machine translation is all about large quantities of data. The proper dataset to train a good system for translating a philosophy text from German to English is a parallel corpus (text in both languages, aligned at the sentence level) covering literally every book and article ever written that is somewhat related to philosophy and exists in both German and English; and even that is plausibly too little data for good results, so it would be augmented with general-domain data as well. Anything you can do with a limited amount of training data (e.g. dictionaries but no large parallel corpus) would likely be worse than a state-of-the-art general-purpose model (e.g. Google Translate), so it may be an interesting learning exercise but of no practical benefit. IMHO, if you just want to translate a particular book, you should run it through the best currently existing model you can find and spend the effort on correcting the translation instead of trying to build a better model.
If you do want to play with MT, the existing open source MT engines may be helpful. Some years ago Moses (http://www.statmt.org/moses/) was the go-to solution, but nowadays neural approaches are the state of the art, so perhaps https://opennmt.net/ or something like that would be a better starting point.
There are some approaches that integrate terminology dictionaries to improve accuracy, but word-level translations are generally not the main problem in machine translation these days; the harder parts are larger phrase and sentence structures, disambiguation, and fluency.
This task is generally called "language modeling", with the most common technical definition being estimating the probability distribution of the next word given the previous context.
The simplest methods use n-gram counts. You might want to read chapter 3 of the Jurafsky-Martin book on them: https://web.stanford.edu/~jurafsky/slp3/ . Some tools for doing this (building and querying language models) are mentioned here: http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel
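As a tiny illustration of the n-gram idea (no smoothing or backoff, nothing like the quality of the linked toolkits), here is a bigram model sketch in Python:

```python
# Count bigrams, then estimate P(next word | previous word) by relative frequency.
from collections import Counter

def train_bigram(sentences):
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigram.update(tokens[:-1])
        bigram.update(zip(tokens[:-1], tokens[1:]))
    return unigram, bigram

def next_word_distribution(unigram, bigram, prev):
    """Maximum-likelihood P(w | prev); real LMs add smoothing and longer contexts."""
    return {w2: c / unigram[prev] for (w1, w2), c in bigram.items() if w1 == prev}

uni, bi = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(next_word_distribution(uni, bi, "the"))  # roughly {'cat': 0.67, 'dog': 0.33}
```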
It's worth noting that 1500 sentences is a tiny toy corpus that by itself is not sufficient to get a meaningful next word prediction. You can get something interesting by taking all the works of an author or all the speeches of a prolific politician, but for 1500 sentences you're not going to get anything nice unless you somehow incorporate a large corpus of general language.
OK, anything to/from English isn't really Apertium's domain; SMT/NMT typically have much better effort/result ratios there. Unfortunately, there isn't much pre-packaged, even though the parts all exist (e.g. you can download Moses word alignments and phrase tables for a 70-million-token nb-en corpus from http://opus.nlpl.eu/download.php?f=OpenSubtitles2018%2Fen-no.txt.zip and plug them into Moses, but that's a bit of effort. Maybe Someone™ should make a script that grabs data off OPUS and creates a Moses translator …)
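Something like this rough sketch would at least cover the "grab data off OPUS" step, using the URL above (the archive's internal layout is an assumption here, so inspect it before wiring it into a pipeline):

```python
# Download one parallel package from OPUS and see what it contains.
import urllib.request
import zipfile

url = "http://opus.nlpl.eu/download.php?f=OpenSubtitles2018%2Fen-no.txt.zip"
urllib.request.urlretrieve(url, "en-no.txt.zip")

with zipfile.ZipFile("en-no.txt.zip") as zf:
    print(zf.namelist())          # check the actual file names inside
    zf.extractall("opus-en-no")   # extracted text then feeds the Moses pipeline
```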
With regard to machine translation, check out the Moses SMT system. It is open source and produces state-of-the-art results. There are some sample projects on their website of things they want done.
http://www.statmt.org/moses/?n=Moses.GetInvolved
Some of that material, especially the front-end stuff, should be doable with an undergrad background.
Another place to get some inspiration would be the University of Edinburgh computational linguistics department webpage. I don't have a link right now, but a lot of the professors post PhD thesis topics they are interested in.
Do you want to incorporate machine translation in an app or do you actually want to develop a machine translation system? If you're looking to develop a system, do you expect it to be competitive, or do you just want to understand how it works?
If you only need to include MT in your app, then use Google Translate API or something similar.
If you actually want to develop a system, then things get more complicated. Expecting it to be competitive with state-of-the-art commercial systems is not a realistic scenario. Most companies have considerably more data and resources than most research groups.
However, if you're looking at it as an exercise, you could have a look at Moses. This is a statistical MT toolkit. Also have a look at the book for more insight into what these models do. Look for freely available parallel data for your language pair and train a few models. It'll never be competitive with commercial systems, but you'll get a feel for how these systems are trained.