Natural Language Processing (NLP) is a field of research and application that studies how machines can comprehend and manipulate natural language. NLP comprises many computational techniques for the automated analysis and representation of human language. The atomic terms of a language play an important role in NLP; examples include bad, somewhat, old, fantastic, extremely, and so on. A combination of these atomic terms is called a composite term; examples include very good movie, young man, not extremely surprised, and so on. In simple terms, atomic terms are words and composite terms are phrases. Words are the constitutional building blocks of language: human language, whether spoken or written, is composed of words. Word-level NLP approaches are among the initial steps towards comprehending a language.
The performance of NLP systems, including machine translation, automatic question answering, and information retrieval, depends on recovering the correct meaning of the text. The biggest challenge is ambiguity, i.e., a meaning that is unclear or open, depending on the context of usage. Two common types are:
- Lexical Ambiguity
- Syntactic Ambiguity
The lexical ambiguity of a word or phrase means the word has more than one meaning in the language. "Meaning" here refers to the definition captured by a good dictionary. For example, in the Hindi language "Aam" means both common and mango. Another example, in English, is the word silver, which can be a noun, an adjective, or a verb: She bagged two silver medals (noun); She made a silver speech (adjective); His worries had silvered his hair (verb). Syntactic ambiguity is a situation where a sentence may be interpreted in more than one way due to ambiguous sentence structure. For example: John saw the man on the hill with a telescope. Many questions arise: Who is on the hill? John, the man, or both? Who has the telescope? John, the man, or the hill?
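The silver example can be made concrete with a small sketch: the same surface word maps to different senses depending on its part of speech, so resolving the meaning requires knowing the POS. The sense table below is a toy assumption for illustration, not a real dictionary.

```python
# Toy illustration of lexical ambiguity: one word, several senses,
# distinguished by part of speech. The sense strings are illustrative.
senses = {
    ("silver", "NOUN"): "a precious metal, or a medal made of it",
    ("silver", "ADJ"): "eloquent or persuasive (as in 'silver speech')",
    ("silver", "VERB"): "to turn grey or white (of hair)",
}

def sense_of(word, pos):
    """Look up the sense of `word` given its part-of-speech tag."""
    return senses.get((word, pos), "unknown sense")

print(sense_of("silver", "NOUN"))  # the metal/medal reading
print(sense_of("silver", "VERB"))  # the 'silvered his hair' reading
```

Without the POS tag, the lookup alone cannot decide between the three readings, which is exactly why the tagging techniques below matter.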
Ambiguity in Natural Language Processing can be resolved using:
- Word Sense Disambiguation
- Part of Speech Tagger
- HMM (Hidden Markov Model) Tagger
- Hybrid combination of taggers with machine learning techniques.
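To make the HMM tagger from the list above concrete, here is a minimal Viterbi sketch over a two-tag toy model. The tag set, vocabulary, and probabilities are hand-set illustrative assumptions, not trained values; a real tagger would estimate them from a tagged corpus such as the Brown Corpus.

```python
# Minimal HMM (Viterbi) POS-tagger sketch with hand-set toy probabilities.
TAGS = ["NOUN", "VERB"]

# P(tag | previous tag); "<s>" marks the start of the sentence.
trans = {
    ("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
    ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
    ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4,
}

# P(word | tag) for a tiny vocabulary; unseen pairs get a small floor.
emit = {
    ("NOUN", "dogs"): 0.4, ("NOUN", "runs"): 0.1,
    ("VERB", "dogs"): 0.05, ("VERB", "runs"): 0.5,
}

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (trans[("<s>", t)] * emit.get((t, words[0]), 1e-6), [t])
            for t in TAGS}
    for w in words[1:]:
        new = {}
        for t in TAGS:
            # Extend the best previous path into tag t.
            p, path = max(
                ((best[prev][0] * trans[(prev, t)] * emit.get((t, w), 1e-6),
                  best[prev][1] + [t]) for prev in TAGS),
                key=lambda x: x[0])
            new[t] = (p, path)
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["dogs", "runs"]))  # -> ['NOUN', 'VERB']
```

The tagger picks NOUN then VERB because the NOUN-to-VERB transition and the emission of "runs" as a verb dominate the alternatives, which is the "most likely part of speech among the alternatives" idea described below.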
Word sense disambiguation (WSD) aims to identify the intended meanings of words (word senses) in a given context. For a given word and its possible meanings, WSD categorizes an occurrence of the word in context into one or more of its sense classes. The features of the context (such as neighbouring words) provide the evidence for classification. Statistical methods for ambiguity resolution include Part of Speech (POS) tagging and the use of probabilistic grammars in parsing. Statistical methods for diminishing ambiguity in sentences have been formulated by deploying large corpora and lexical resources (e.g., the Brown Corpus, WordNet, SentiWordNet) that capture word usage. The words and sentences in these resources are pre-tagged with POS, grammatical structure, and frequencies of usage, taken from a huge sample of written language. POS tagging is the mechanism of selecting the most likely part of speech from among the alternatives for each word in a sentence. Probabilistic grammars work on the concept of grammar rules associated with part-of-speech tags. A probabilistic grammar has a probability attached to each rule, based on its frequency of use in the corpus. This approach assists in choosing the best alternative when the text is syntactically ambiguous (that is, when there is more than one parse tree for the text).
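A classic way to use "features of the context" for WSD is a Lesk-style gloss-overlap heuristic: pick the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below uses a toy two-sense inventory for "bank" as an illustrative assumption; a real system would draw glosses from a resource such as WordNet.

```python
# Simplified Lesk-style WSD: choose the sense whose gloss overlaps
# most with the context words. The sense inventory is a toy example.
SENSES = {
    "bank": {
        "finance": "an institution that accepts deposits and makes loans",
        "river": "the sloping land beside a body of water such as a river",
    }
}

def lesk(word, context):
    """Return the sense label whose gloss shares the most words with `context`."""
    ctx = set(context.lower().split())
    def overlap(sense):
        return len(ctx & set(SENSES[word][sense].split()))
    return max(SENSES[word], key=overlap)

print(lesk("bank", "the bank accepts deposits and makes loans"))          # -> finance
print(lesk("bank", "we walked along the sloping river bank near the water"))  # -> river
```

The heuristic is crude (raw word overlap, no stemming or stop-word removal), but it shows how neighbouring words supply the evidence that classifies an occurrence into one of its sense classes.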
Author:
Srishti Vashishtha
Assistant Professor
Computer Science Department, NCU