TY - GEN
T1 - A Classifier to Distinguish between Cypriot Greek and Standard Modern Greek
AU - Sababa, Hanna
AU - Stassopoulou, Athena
N1 - Publisher Copyright: © 2018 IEEE. Copyright 2019 Elsevier B.V., All rights reserved.
PY - 2018/11/30
Y1 - 2018/11/30
N2 - The problem of discriminating between similar languages and dialects is one of the current challenges of natural language processing. In this paper, we describe the collection of a bidialectal corpus of Greek and the construction of a classifier to distinguish between Cypriot Greek (CG) and Standard Modern Greek (SMG). The corpus of CG and SMG was compiled from social media websites such as Facebook, Twitter and online forums. N-gram features were extracted and three classification algorithms were applied and tested on labeled sentences: multinomial naive Bayes (NB), linear support vector classifier (SVC) and logistic regression. All algorithms classified the test data with an accuracy of over 90%, with the multinomial NB classifier performing best, yielding a mean accuracy of 95%. This study adds to the existing body of work on the problem of discriminating between similar languages and is the first to examine CG and SMG. The results demonstrate the feasibility of an accurate Greek dialect classifier for academic or applied purposes.
KW - feature extraction
KW - machine learning
KW - natural language processing
KW - natural languages
KW - statistical learning
UR - http://www.scopus.com/inward/record.url?scp=85060039196&partnerID=8YFLogxK
U2 - 10.1109/SNAMS.2018.8554709
DO - 10.1109/SNAMS.2018.8554709
M3 - Conference contribution
AN - SCOPUS:85060039196
T3 - 2018 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
SP - 251
EP - 255
BT - 2018 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th International Conference on Social Networks Analysis, Management and Security, SNAMS 2018
Y2 - 15 October 2018 through 18 October 2018
ER -