Authors:
Gustavo M. C. Coelho (1); Alimed Celecia (1); Jefferson de Sousa (1); Melissa Cavaliere (1); Maria Julia Lima (1); Ana Mangeth (2); Isabella Frajhof (2); Cesar Cury (3) and Marco Casanova (1)
Affiliations:
(1) Tecgraf, PUC-Rio, Rio de Janeiro, Brazil
(2) LES, PUC-Rio, Rio de Janeiro, Brazil
(3) Escola da Magistratura do Estado do Rio de Janeiro, Rio de Janeiro, Brazil
Keyword(s):
Document Embedding, Text Classification, Natural Language Processing.
Abstract:
Text classification is a popular Natural Language Processing task that aims to predict the categorical values associated with textual instances. One relevant application field for this task is the legal domain, which involves a high volume of unstructured textual documents. This paper proposes a new model for classifying legal opinions related to consumer complaints according to the moral damage value. The proposed model, named MuDEC (Multi-step Document Embedding-Based Classifier), combines Doc2vec for feature extraction with an SVM for classification. To optimize classification performance, the model uses a combination of methods, such as oversampling for imbalanced datasets, clustering for the identification of textual patterns, and dimensionality reduction for complexity control. For performance evaluation, a 6-class dataset of 193 legal opinions related to consumer complaints was created, in which each instance was manually labeled according to its moral damage value. A 10-fold stratified cross-validation resampling procedure was used to evaluate different models. The results demonstrated that, under this experimental setup, MuDEC outperforms the baseline models by a significant margin, achieving 78.7% accuracy, compared with 61.1% for a SIF classifier and 65.2% for a C-LSTM classifier.
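The stratified 10-fold resampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_k_fold`, the random seed, and especially the per-class counts are hypothetical. The abstract states only that the dataset has 193 instances in 6 classes, so the class sizes below are invented so that they sum to 193.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep class proportions."""
    rng = random.Random(seed)

    # Group instance indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal indices round-robin, continuing where the previous class
        # stopped, so that overall fold sizes differ by at most one.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical class sizes (assumption): only the total of 193 instances
# and the number of classes (6) come from the abstract.
class_sizes = [60, 45, 35, 25, 18, 10]
labels = [c for c, n in enumerate(class_sizes) for _ in range(n)]

splits = list(stratified_k_fold(labels, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each instance appears in exactly one test fold, and every class is spread as evenly as possible across the folds, so small classes are not left out of any evaluation round. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead of a hand-written splitter. If oversampling is combined with this procedure, it is usually applied to the training indices of each split only, so that duplicated instances do not leak into the test fold; whether MuDEC does this is not stated in the abstract.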