CROSPELL ENGINE – Natural Language Processing Engine

This slide covers the CROSPELL Engine, an engine that combines multiple approaches to Natural Language Processing. It covers a wide variety of topics in text and image processing, from spell checking to topic prediction. The project was built in late 2012 and delivered in early 2013 at the F.I.T.E of Damascus, Syria, as the final project for the NLP course (with Ola Al Naameh and Mhd Hasan Sarhan).

System Specification (implementation details can be found in the documentation)

1. Auto-correction

The auto-correction algorithm makes sure that a misspelled word is matched with a proper, correctly spelled word. Many approaches can be used for this. The one I opted for is based on the distance between keys on the keyboard map.

cr2

But one should make sure to pick the right algorithm: keys on the keyboard map are not laid out linearly.

cr3

The distance between keys is also not linear. The best tool for measuring the right distance is a Gaussian curve.

cr4
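
As a rough illustration of the idea (not the engine's actual code), the sketch below places the letter keys on a staggered grid and turns the distance between two keys into a Gaussian weight. The layout offsets, the `sigma` value, and all function names are assumptions made for this example, and it only handles lowercase letters and same-length words.

```python
import math

# Assumed QWERTY layout: each row is shifted right by a fraction of a key
# to imitate the physical stagger, so keys do not sit on a straight grid.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

def key_distance(a, b):
    """Euclidean distance between two keys on the assumed layout."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return math.hypot(x2 - x1, y2 - y1)

def substitution_likelihood(a, b, sigma=1.0):
    """Gaussian weight: 1.0 for the same key, falling off with distance."""
    d = key_distance(a, b)
    return math.exp(-(d * d) / (2 * sigma * sigma))

def word_similarity(typed, candidate, sigma=1.0):
    """Multiply per-letter weights; words of different length score 0."""
    if len(typed) != len(candidate):
        return 0.0
    score = 1.0
    for a, b in zip(typed, candidate):
        score *= substitution_likelihood(a, b, sigma)
    return score

def best_correction(typed, dictionary):
    """Pick the dictionary word whose letters are closest on the keyboard."""
    return max(dictionary, key=lambda word: word_similarity(typed, word))
```

With this weighting, a slip onto a neighbouring key (such as `w` for `e`) barely lowers the score, while a key on the far side of the keyboard makes the candidate very unlikely.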

The CyperSpell Algorithm maps (possibly) misspelled words to their correctly spelled counterparts (using a dictionary).

cr5

The user can write in real time, and the system will auto-correct (or suggest) the correct words when the user misspells one. The system also remembers which words the user has misspelled before and ranks the corrections they chose higher in the list of suggestions.

cr19
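
The history-based ranking could be sketched as follows. This is a minimal illustration under my own assumptions: the `CorrectionHistory` class and its method names are hypothetical, and the engine may store and weigh past choices differently.

```python
from collections import Counter, defaultdict

class CorrectionHistory:
    """Remembers which correction the user picked for each misspelling."""

    def __init__(self):
        # misspelled word -> how often each correction was chosen
        self._choices = defaultdict(Counter)

    def record(self, misspelled, chosen):
        self._choices[misspelled][chosen] += 1

    def rank(self, misspelled, suggestions):
        """Move previously chosen corrections to the front of the list."""
        counts = self._choices.get(misspelled, Counter())
        # sorted() is stable, so unseen suggestions keep their original order
        return sorted(suggestions, key=lambda word: -counts[word])
```

For example, after `history.record("teh", "the")`, calling `history.rank("teh", ["ten", "the", "tea"])` puts `"the"` first and leaves the other suggestions in their original order.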

2. Language Identification

The user can input text in any language and the system can figure out what that language is (as long as the corresponding corpora are provided).

cr6

If there is more than one language in the text, the system will list them, ranked according to their occurrences (frequencies in the text).

cr7
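
One simple way to get this kind of ranking is to count how many words of the input appear in each language's corpus. The sketch below assumes exactly that; the tiny word lists stand in for real corpora, and `rank_languages` is a made-up name, not the engine's API.

```python
from collections import Counter

# Tiny hypothetical word lists standing in for real per-language corpora.
CORPORA = {
    "english": {"the", "and", "is", "of", "cat", "house"},
    "french": {"le", "la", "et", "est", "chat", "maison"},
    "german": {"der", "die", "und", "ist", "katze", "haus"},
}

def rank_languages(text, corpora=CORPORA):
    """Return (language, hit count) pairs, most frequent language first."""
    hits = Counter()
    for token in text.lower().split():
        for language, vocabulary in corpora.items():
            if token in vocabulary:
                hits[language] += 1
    # Languages with no matching word are left out of the ranking
    return hits.most_common()
```

A mixed sentence such as `"the cat is in the house et le chat"` would then list English first and French second.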

3. Word Prediction

Using bi-grams and tri-grams, the system can suggest auto-completions while the user is writing.

cr8
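
To make the bi-gram part concrete, here is a small sketch that counts which words follow each word in a training text and uses those counts to complete the word being typed. It only covers bi-grams, and `train_bigrams` and `complete` are illustrative names I picked, not functions from the engine.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        model[previous][following] += 1
    return model

def complete(model, previous_word, prefix="", limit=3):
    """Suggest the most frequent followers of previous_word matching prefix."""
    followers = model.get(previous_word.lower(), Counter())
    matches = [word for word, _ in followers.most_common()
               if word.startswith(prefix)]
    return matches[:limit]
```

A tri-gram version would key the model on the previous two words instead of one, at the cost of needing a larger corpus to see each context often enough.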

3. Topic Prediction

Using bi-grams and tri-grams, the system can suggest the topic that best matches the paragraph. In fact, the system lists all the possible topic predictions and ranks them according to the best match.

cr9
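
A minimal way to picture this is to compare the bi-grams of the paragraph with the bi-grams of a sample text per topic and rank topics by overlap. The scoring rule and the `rank_topics` helper below are assumptions for illustration; the engine's actual scoring may differ.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent word pairs in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def rank_topics(paragraph, topic_samples):
    """Rank topics by how many bi-grams they share with the paragraph."""
    target = bigrams(paragraph)
    scores = {}
    for topic, sample in topic_samples.items():
        profile = bigrams(sample)
        # Counter & Counter keeps the minimum count of each shared bi-gram
        scores[topic] = sum((target & profile).values())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Every topic is returned with its score, so the caller can show the full ranked list rather than only the winner.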

4. Dictionary

The system also provides an Arabic-English dictionary.

cr10

5. Image Processing using NLP Approaches

Using Minimum Edit Distance (MED), we can match images with others that have similar properties (colors in our case). This approach is shallow, though, since it fails completely when images are resized or rotated. Anyway, it’s just for fun!

cr11

The system works best when comparing images that are similar in size and have not been transformed.

cr12

cr20
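
The sketch below shows one way MED can be applied to images: treat each image as a flat sequence of coarse color buckets and compute the edit distance between the two sequences. The bucketing step and the function names are my own assumptions, and the images are represented as plain lists of `(r, g, b)` tuples rather than loaded from files.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute x with y
        previous = current
    return previous[-1]

def quantize(pixel, levels=4):
    """Reduce an (r, g, b) pixel to a coarse color bucket."""
    step = 256 // levels
    return tuple(channel // step for channel in pixel)

def image_distance(pixels_a, pixels_b):
    """Edit distance between two images given as flat lists of pixels."""
    return edit_distance([quantize(p) for p in pixels_a],
                         [quantize(p) for p in pixels_b])
```

Because the comparison follows pixel order, resizing or rotating an image shifts the whole sequence, which is why this kind of matching breaks down on transformed images.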

6. ISRI and Porter Stemming Algorithms

Both the ISRI and Porter stemming algorithms are implemented in the engine.

cr13

cr14

7. Genome Matching using Minimum Edit Distance

Interestingly, the engine implements genome matching using MED. The initial interface is:

cr15

The user can input two genomes and the system will find the match between the two.

cr16

cr17
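
For genome strings, MED can also report where the two sequences line up. The sketch below fills the full distance table and then walks it backwards to recover one possible alignment, using `-` for gaps. Unit costs for insertion, deletion, and substitution are an assumption here, and `align` is a hypothetical name.

```python
def align(a, b):
    """Return (distance, aligned_a, aligned_b) using unit edit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete every symbol of a
    for j in range(cols):
        dist[0][j] = j          # insert every symbol of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner to rebuild one alignment
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return dist[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

Printing the two aligned strings one above the other makes the matches, mismatches, and gaps between the genomes easy to see.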

8. Sentiment Analysis

The system implements a light sentiment analyzer. Just write a sentence or a paragraph and the system will provide the corresponding emotion for it.

cr18
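
A light analyzer of this kind can be approximated with a word-to-emotion lexicon and a majority vote. The lexicon below is a tiny invented sample and `detect_emotion` is a hypothetical helper, so treat this as a sketch of the general technique rather than the engine's method.

```python
from collections import Counter

# Invented mini-lexicon mapping words to emotions, for illustration only.
LEXICON = {
    "happy": "joy", "love": "joy", "great": "joy",
    "sad": "sadness", "cry": "sadness",
    "angry": "anger", "hate": "anger",
}

def detect_emotion(text, lexicon=LEXICON):
    """Return the emotion with the most matching words, or 'neutral'."""
    words = (word.strip(".,!?;:").lower() for word in text.split())
    votes = Counter(lexicon[word] for word in words if word in lexicon)
    return votes.most_common(1)[0][0] if votes else "neutral"
```

A sentence with no word from the lexicon falls back to `"neutral"`, which keeps the analyzer from guessing when it has no evidence.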

You can download the full project documentation [in Arabic – بالعربية] here. I would be happy to upload the engine source code along with its interface for anyone to use, but the language corpora are quite big (the project is 400 MB!). If anyone is interested, don’t hesitate to contact me by mail and I’ll figure something out!
