Introduction to Arabic NLP

In natural language processing, Arabic differs from English (and many other languages), so the approach may need to differ as well.
 
For other languages, disambiguation may be one step of the analysis process; in Arabic, it is the main process.
 
Disambiguation runs in parallel with most of the other analysis steps. Let’s look at a simple example:
 
He wrote
كتب
 
No Diacritics
 
In English, this sentence seems easy to analyze: “he” is immediately annotated as a personal pronoun and “wrote” as a verb in the simple past. In the equivalent Arabic sentence, the third-person pronoun is implicit (assuming the “wrote” reading), and because the diacritics are absent, it is not certain that كتب is a verb at all. Moreover, the word has several possible meanings and forms, depending on the assumed vowels and the context.
 
Unfortunately, most Arabic text is written without diacritics, so we have to accept the following:
 
After tokenization, each token admits several possible vowelizations, and therefore several forms and meanings. Each token is thus ambiguous.
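
To make this ambiguity concrete, here is a minimal Python sketch. It assumes a tiny hand-made lexicon of vocalized forms (`TOY_LEXICON` and the helper names are hypothetical, for illustration only; a real analyzer would use a full morphological lexicon) and shows how one undiacritized token maps to several candidate analyses.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Remove combining marks (Arabic short vowels, shadda, sukun, tanwin)."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))

# Hypothetical toy lexicon: (vocalized form, part of speech, English gloss).
TOY_LEXICON = [
    ("كَتَبَ", "verb, past active", "he wrote"),
    ("كُتِبَ", "verb, past passive", "it was written"),
    ("كُتُب", "noun, plural", "books"),
    ("كَتَّبَ", "verb, past causative", "he made (someone) write"),
]

def build_index(lexicon):
    """Group vocalized entries under their undiacritized spelling."""
    index = defaultdict(list)
    for vocalized, pos, gloss in lexicon:
        index[strip_diacritics(vocalized)].append((vocalized, pos, gloss))
    return index

INDEX = build_index(TOY_LEXICON)

def candidates(token: str):
    """Return every lexicon analysis compatible with an undiacritized token."""
    return INDEX.get(strip_diacritics(token), [])

if __name__ == "__main__":
    for vocalized, pos, gloss in candidates("كتب"):
        print(vocalized, "|", pos, "|", gloss)
```

With this toy lexicon, the single token كتب already yields four candidates; choosing among them is the disambiguation task described above.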
 
A token is composed
 
The second characteristic of Arabic text is agglutination: a token is usually composed of prefixes, a stem, and suffixes. For example:
 
فسيكفيكهم
 
ف : So/then
س : Will
يكفي : (He) Save/Spare/Suffice
ك : You
هم : Thy/Them
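
The decomposition above can be sketched in Python as follows. This is only an illustration under strong assumptions: the affix tables and the one-entry stem list (`PREFIXES`, `SUFFIXES`, `STEMS`) are hypothetical and far too small for real text, where a token typically admits many competing segmentations.

```python
# Hypothetical, deliberately tiny affix and stem tables (illustration only).
PREFIXES = {"ف": "so/then", "س": "will", "و": "and"}
SUFFIXES = {"ك": "you", "هم": "them", "ه": "him"}
STEMS = {"يكفي": "(he) suffices/spares"}

def prefix_splits(token):
    """Yield (prefixes, rest) for every way of peeling known prefixes."""
    yield [], token
    for p in PREFIXES:
        if token.startswith(p) and len(token) > len(p):
            for more, rest in prefix_splits(token[len(p):]):
                yield [p] + more, rest

def suffix_splits(token):
    """Yield (rest, suffixes) for every way of peeling known suffixes."""
    yield token, []
    for s in SUFFIXES:
        if token.endswith(s) and len(token) > len(s):
            for rest, more in suffix_splits(token[:-len(s)]):
                yield rest, more + [s]

def segment(token):
    """Return all (prefixes, stem, suffixes) splits whose stem is known."""
    results = []
    for prefixes, rest in prefix_splits(token):
        for stem, suffixes in suffix_splits(rest):
            if stem in STEMS:
                results.append((prefixes, stem, suffixes))
    return results

if __name__ == "__main__":
    for prefixes, stem, suffixes in segment("فسيكفيكهم"):
        for p in prefixes:
            print("prefix", p, "=", PREFIXES[p])
        print("stem", stem, "=", STEMS[stem])
        for s in suffixes:
            print("suffix", s, "=", SUFFIXES[s])
```

The segmenter enumerates every candidate split rather than committing to the first match, because with a realistic lexicon several splits usually survive and must be disambiguated later.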
