Hierarchical Tag-set for Rule-based Processing of Tamil Language
DOI: https://doi.org/10.31357/ijms.v1i2.2230
Abstract
Corpora are fundamental tools for Natural Language Processing. Part-of-speech (POS) tagging provides more meaning to corpora by annotating words. A tag-set used to annotate a corpus should be selected so that it represents the grammatical structure of the respective language. Such tag-sets can be flat or hierarchical in structure. Several efforts have been made to identify a tag-set for the Tamil language. However, existing tag-sets have many shortcomings, including the inability to tag all words, the inability to capture required syntactic information such as divisibility, an excessive number of tags, a flat tag structure, and a lack of extensibility. The scholarly works Tolkāppiyam and Naṉṉūl clearly show the grammatical classification of words. This paper proposes a new hierarchical tag-set with 10 labels for the Tamil language, with a view to developing a morphological analyser, by considering the existing limitations and drawing on Tamil grammar. The morphological analyser can be used to extend the proposed tag-set easily with more grammatical information.
KEYWORDS: POS tagging, Tag-set, Morphological analyser, Tamil grammar