This slide covers the CROSPELL Engine, an engine that combines multiple approaches to Natural Language Processing. It covers a wide variety of topics in text and image processing, from spell checking to topic prediction. The project was built in late 2012 and delivered in early 2013 at the F.I.T.E of Damascus, Syria, as the final project for the NLP course (with Ola Al Naameh and Mhd Hasan Sarhan).
System Specification (Implementation details can be found in the doc.)
The auto-correction algorithm makes sure that a misspelled word is matched with a proper, correct word. Many approaches can be implemented for this. The one I opted for is the distance between keys on the keyboard map.
But one should make sure to pick the right distance measure. Keys on the keyboard map are not laid out linearly.
The distances between keys are not linear either. A Gaussian curve is a better fit for measuring the right distance.
The CyperSpell Algorithm maps the (possibly) misspelled words to their correctly spelled counterparts (using a dictionary).
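To make the keyboard-distance idea concrete, here is a minimal sketch in Python. It is not the engine's actual code: the QWERTY coordinates, the row offsets, the `sigma` parameter, and all function names are assumptions made for illustration. Each key gets a 2D position, a Gaussian turns the distance between two keys into a similarity, and that similarity lowers the substitution cost inside an ordinary edit distance.

```python
import math

# QWERTY rows; the horizontal offsets are an assumed approximation of the
# physical stagger, not values taken from the project.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {ch: (col + offset, row)
           for row, (keys, offset) in enumerate(ROWS)
           for col, ch in enumerate(keys)}

def substitution_cost(a, b, sigma=1.0):
    """Cost in [0, 1): low for neighbouring keys, close to 1 for distant keys."""
    if a not in KEY_POS or b not in KEY_POS:
        return 1.0  # unknown characters get the full cost
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    squared = (x1 - x2) ** 2 + (y1 - y2) ** 2
    similarity = math.exp(-squared / (2 * sigma ** 2))  # Gaussian fall-off
    return 1.0 - similarity

def weighted_edit_distance(typed, candidate, sigma=1.0):
    """Edit distance whose substitution cost depends on key proximity."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(typed, 1):
        cur = [float(i)]
        for j, b in enumerate(candidate, 1):
            sub = prev[j - 1] + (0.0 if a == b else substitution_cost(a, b, sigma))
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, sub))
        prev = cur
    return prev[-1]

def suggest(typed, dictionary, top=3):
    """Return the dictionary words closest to what was typed."""
    return sorted(dictionary, key=lambda w: weighted_edit_distance(typed, w))[:top]
```

With this sketch, `suggest("hwllo", ["hallo", "hollo", "hello"])` puts "hello" first, because "w" sits right next to "e" on the keyboard while "o" is far away.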
The user can write in real time, and the system will auto-correct (or suggest) the correct words whenever the user misspells one. The system also remembers which words the user has misspelled before and ranks the corrections they chose higher in the list of suggestions.
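The history-based ranking could look roughly like the sketch below. The class name `SuggestionHistory` and its methods are hypothetical, and the engine may well store and weight the history differently; the sketch only shows the idea of promoting corrections the user picked before.

```python
from collections import Counter, defaultdict

class SuggestionHistory:
    """Remembers which correction the user picked for each misspelling."""

    def __init__(self):
        self._choices = defaultdict(Counter)

    def record(self, misspelled, chosen):
        """Store one user decision."""
        self._choices[misspelled][chosen] += 1

    def rerank(self, misspelled, suggestions):
        """Move previously chosen corrections to the front.

        sorted() is stable, so suggestions that were never chosen keep
        their original relative order.
        """
        picked = self._choices.get(misspelled, Counter())
        return sorted(suggestions, key=lambda word: -picked[word])
```

For example, after `history.record("teh", "ten")`, the call `history.rerank("teh", ["the", "tea", "ten"])` returns "ten" first and leaves the rest in order.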
2. Language Identification
The user can input text in any language and the system will figure out what that language is (as long as the corresponding corpora are provided).
If there is more than one language in the text, the system will list them, ranked by how frequently each one occurs in the text.
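A simple way to picture the ranking is to count how many tokens of the input appear in each language's word list. The sketch below assumes exactly that; the `corpora` dictionary of word sets and the function name are made up for illustration, and the real engine may use a different scoring scheme.

```python
from collections import Counter

def rank_languages(text, corpora):
    """Rank languages by how many tokens of `text` appear in each word list.

    `corpora` maps a language name to a set of known words (assumed format).
    Languages with no hits are left out of the result.
    """
    tokens = text.lower().split()
    counts = Counter()
    for language, vocabulary in corpora.items():
        hits = sum(1 for token in tokens if token in vocabulary)
        if hits:
            counts[language] = hits
    return counts.most_common()  # most frequent language first
```

Given a toy English and French word list, a mixed sentence comes back as a list of `(language, hits)` pairs with the dominant language on top.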
3. Word Prediction
Using bi-grams and tri-grams, the system suggests auto-completions while the user is writing.
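As a rough illustration of the bi-gram half of this, the sketch below counts which word follows which in some training sentences and then proposes the most frequent followers that match what the user has typed so far. The function names and the training format are assumptions; tri-grams would work the same way with a two-word context.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            model[previous][following] += 1
    return model

def predict_next(model, previous_word, prefix="", top=3):
    """Suggest completions for `prefix`, given the word typed just before it."""
    followers = model.get(previous_word.lower(), Counter())
    matches = [(w, c) for w, c in followers.items() if w.startswith(prefix)]
    # Most frequent first; alphabetical order breaks ties deterministically.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in matches[:top]]
```

Trained on a few sentences such as "i like tea" and "i like coffee", `predict_next(model, "like", "c")` would offer "coffee".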
4. Topic Prediction
Using bi-grams and tri-grams, the system suggests the topic that best matches the paragraph. In fact, it lists all the candidate topics and ranks them by how well they match.
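One plausible reading of this is an n-gram overlap score per topic. The sketch below builds a bi-gram profile for each topic from sample texts and ranks topics by how often the paragraph's bi-grams occur in each profile. The sample format, the scoring, and the function names are assumptions for illustration, not the engine's actual method.

```python
from collections import Counter

def bigrams(text):
    """Split text into lowercase word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def build_profiles(samples):
    """Turn {topic: [texts]} into {topic: Counter of bi-grams}."""
    return {topic: Counter(bg for text in texts for bg in bigrams(text))
            for topic, texts in samples.items()}

def rank_topics(paragraph, profiles):
    """Score every topic by bi-gram overlap and list the best match first."""
    grams = bigrams(paragraph)
    scores = {topic: sum(profile[bg] for bg in grams)
              for topic, profile in profiles.items()}
    return sorted(scores.items(), key=lambda item: -item[1])
```

Every topic appears in the output, so the caller sees the full ranking rather than just the winner.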
The system also provides an Arabic-English dictionary.
5. Image Processing using NLP Approaches
Using Minimum Edit Distance (MED), we can match images with others that have similar properties (colors, in our case). This approach is shallow, though, since it fails completely when images are resized or rotated. Anyway, it’s just for fun!
The system works best when comparing images of similar size that have not been transformed.
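To show how MED can be applied to colors, here is a minimal sketch that treats an image as a flat sequence of RGB pixels, quantizes each pixel into a coarse color bucket, and runs a standard edit distance over the two sequences. The pixel representation, the `levels` parameter, and the function names are assumptions; the doc describes what the engine actually does.

```python
def edit_distance(a, b):
    """Classic minimum edit distance with unit costs, over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def quantize(pixel, levels=4):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse color bucket."""
    step = 256 // levels
    return tuple(channel // step for channel in pixel)

def image_distance(pixels_a, pixels_b, levels=4):
    """MED between two images given as flat lists of (r, g, b) tuples."""
    return edit_distance([quantize(p, levels) for p in pixels_a],
                         [quantize(p, levels) for p in pixels_b])
```

Because the comparison runs over pixel order, resizing or rotating one image shuffles the sequence and the distance stops being meaningful, which is the weakness mentioned above.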
6. ISRI and Porter Stemming Algorithms
Both the ISRI and Porter stemming algorithms are implemented in the engine.
7. Genome Matching using Minimum Edit Distance
The engine also implements genome matching using MED, which is an interesting use of the algorithm. The initial interface is:
The user can input two genomes and the system will find the match between the two.
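A match between two sequences can be sketched as an edit-distance table followed by a traceback that recovers the alignment. The code below is an illustrative assumption (unit costs, `-` as the gap symbol, a hypothetical `align` function), not the engine's implementation.

```python
def align(a, b):
    """Return (distance, aligned_a, aligned_b) using unit-cost edit distance."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))

    # Walk back from the bottom-right corner to rebuild one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-")  # gap in a
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For instance, `align("ACGT", "AGT")` reports a distance of 1 and lines the sequences up as "ACGT" over "A-GT".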
8. Sentiment Analysis
The system implements a light sentiment analyzer. Just write a sentence or a paragraph and the system will report the corresponding emotion.
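The write-up does not say how the analyzer works, so the sketch below is only one lightweight possibility: a lexicon lookup that sums word polarities. The tiny `LEXICON`, the three output labels, and the function name are all hypothetical.

```python
import re

# Tiny illustrative lexicon; a hypothetical stand-in for a real resource.
LEXICON = {"happy": 1, "great": 1, "love": 1, "good": 1,
           "sad": -1, "bad": -1, "hate": -1, "terrible": -1}

def sentiment(text):
    """Label text as positive, negative, or neutral by summing word polarities."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A lookup like this ignores negation and context ("not good" still counts as positive), so it should be read as a baseline rather than a description of the engine.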
You can download the full project documentation [in Arabic – بالعربية] here. I would be happy to upload the engine source code along with its interface for anyone to use, but the language corpora are quite big (the project is 400 MB!), so if anyone is interested, don’t hesitate to contact me by mail and I’ll figure something out!