Hierarchical Tag-set for Rule-based Processing of Tamil Language

Authors

  • Kengatharaiyer Sarveswaran, Department of Computer Science, University of Jaffna, Sri Lanka.
  • Sinnathamby Mahesan, Department of Computer Science, University of Jaffna, Sri Lanka.

DOI:

https://doi.org/10.31357/ijms.v1i2.2230

Abstract

Corpora are fundamental tools for Natural Language Processing. Part-of-speech (POS) tagging adds meaning to a corpus by annotating its words. A tag-set used to annotate a corpus should be selected in such a way that it represents the grammatical structure of the respective language. These tag-sets can be flat or hierarchical in structure. Several efforts have been made to identify a tag-set for the Tamil language. However, existing tag-sets have many shortcomings, including the inability to tag all words, the inability to capture required syntactic information such as divisibility, too many tags in a set, a flat tag structure, and a lack of extensibility. The scholarly works Tolkāppiyam and Naṉṉūl clearly show the grammatical classification of words. This paper proposes a new hierarchical tag-set with 10 labels for the Tamil language, with a view to developing a morphological analyser, by considering the existing limitations and drawing on Tamil grammar. The morphological analyser can be used to extend the proposed tag-set easily with more grammatical information.

KEYWORDS: POS tagging, Tag-set, Morphological analyser, Tamil grammar

Published

2015-06-23