Ca 4 NLP Report - 1
Submitted By
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
(Artificial Intelligence and Data Analytics)
Sri Ramachandra Faculty of Engineering and Technology
Sri Ramachandra Institute of Higher Education and Research, Porur,
Chennai-600116
JULY, 2024
ACKNOWLEDGEMENTS
We express our sincere gratitude to our dean Mr. Ragunathan for their support
and for providing the required facilities for carrying out this study.
Without her continuous guidance and persistent help, this project would not have
Faculty of Engineering and Technology, my dear parents, and friends who have
crucial.
TABLE OF CONTENTS
7 RESULT
8 MACHINE CONFIGURATIONS
9 CONCLUSION
10 FUTURE ENHANCEMENTS
11 REFERENCES
12 CODE
INTRODUCTION
In the digital age, the proliferation of online content has transformed how
individuals express their opinions and experiences. Sentiment analysis, a
branch of natural language processing (NLP), focuses on the computational
treatment of opinions, sentiments, and emotions expressed in text. It serves as
a vital tool for businesses, researchers, and policymakers to gauge public
sentiment, understand consumer behavior, and derive actionable insights from
vast amounts of unstructured data.
The primary objective of this report is to present a comprehensive overview of
a sentiment analysis model developed to classify sentiments expressed in user
reviews across multiple languages and categories, specifically books, DVDs,
and music. By categorizing sentiments into three distinct classes—positive,
negative, and neutral—the model aims to provide a nuanced understanding of
user feedback, which can significantly inform decision-making processes in
various sectors.
This report will detail the methodology employed in developing the sentiment
analysis model, including the data preprocessing steps, model architecture, and
training process. Additionally, it will provide an overview of the dataset used,
along with the results obtained from evaluating the model's performance across
different languages and categories. Finally, the report will conclude with
recommendations for future improvements and potential applications of the
model in real-world scenarios.
By addressing these components, this report aims to contribute to the ongoing
discourse in the field of sentiment analysis and highlight the importance of
developing robust models capable of understanding and interpreting human
emotions in text.
SIGNIFICANCE OF THE PROBLEM
Sentiment analysis serves as a bridge between raw textual data and actionable
insights. By categorizing sentiments into positive, negative, or neutral classes,
organizations can gauge customer satisfaction, identify areas for improvement,
and tailor their products and services to better meet consumer needs. For instance,
businesses can analyze customer reviews to understand what features are
appreciated or criticized, allowing for data-driven decision-making.
LITERATURE REVIEW
Sentiment analysis has garnered significant attention in the field of natural language
processing (NLP) due to its ability to extract meaningful insights from textual data.
This literature review examines key studies and methodologies relevant to sentiment
analysis, particularly those employing deep learning techniques and addressing
challenges in multilingual sentiment classification.
One of the foundational works in this area is the study by Kim (2014), which
introduced the use of Convolutional Neural Networks (CNNs) for sentence
classification. Kim demonstrated that CNNs could effectively capture local features
in text, leading to improved accuracy in sentiment classification tasks. This approach
marked a shift from traditional machine learning methods, which often relied on
handcrafted features, to data-driven models capable of learning complex patterns
directly from raw text.
Building on this, the work of Zhang et al. (2018) explored the application of recurrent
neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks,
for sentiment analysis. The authors highlighted the advantages of LSTMs in handling
sequential data and capturing long-range dependencies, making them well-suited for
sentiment classification tasks where context is crucial. Their experiments
demonstrated that LSTMs outperformed traditional classifiers, particularly in
datasets with longer textual inputs.
OUR PROPOSED METHOD
1. Embedding Layer
The embedding layer converts words into dense vector representations, allowing
the model to learn semantic relationships between words. This layer takes the
input sequences of word indices and maps each index to a dense, fixed-size
vector, so that similar words end up with similar vectors.
2. LSTM Layers
The LSTM layers are responsible for processing the embedded sequences and
learning the temporal dependencies within the text. LSTM cells are designed to
selectively remember and forget information, enabling the model to capture
context and understand the sentiment expressed in the reviews.
In the proposed architecture, two LSTM layers are stacked, with the first layer
returning the full sequence of hidden states and the second layer only returning
the final hidden state. This allows the model to learn both local and global features
from the input text.
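The gating and stacking described above can be illustrated with a minimal sketch. It assumes a toy LSTM with a single input feature and a single hidden unit, and the weight values are hypothetical placeholders (a trained model would learn them). It is meant only to show how the first layer passes on every hidden state while the second keeps just the last one, not to reproduce the project's Keras layers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_layer(inputs, w, return_sequences):
    """Run a toy LSTM (one input feature, one hidden unit) over a non-empty sequence."""
    h, c = 0.0, 0.0  # hidden state and cell state start at zero
    hidden_states = []
    for x in inputs:
        f = sigmoid(w["wf"] * x + w["uf"] * h + w["bf"])    # forget gate
        i = sigmoid(w["wi"] * x + w["ui"] * h + w["bi"])    # input gate
        o = sigmoid(w["wo"] * x + w["uo"] * h + w["bo"])    # output gate
        g = math.tanh(w["wg"] * x + w["ug"] * h + w["bg"])  # candidate cell value
        c = f * c + i * g     # keep part of the old memory, add part of the new
        h = o * math.tanh(c)  # expose a gated view of the memory
        hidden_states.append(h)
    return hidden_states if return_sequences else hidden_states[-1]


# Hypothetical fixed weights, chosen only for illustration.
NAMES = ["wf", "uf", "bf", "wi", "ui", "bi", "wo", "uo", "bo", "wg", "ug", "bg"]
weights_1 = {name: 0.5 for name in NAMES}
weights_2 = {name: -0.3 for name in NAMES}

embedded = [0.2, -0.1, 0.7, 0.4]  # stand-in for one embedded review

# First layer: full sequence of hidden states. Second layer: final state only.
seq_out = lstm_layer(embedded, weights_1, return_sequences=True)
final_state = lstm_layer(seq_out, weights_2, return_sequences=False)
print(len(seq_out), final_state)
```

The first call returns one hidden state per input position, which is what the second layer needs as its own input sequence.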
3. Dropout Layer
A dropout layer is included to mitigate overfitting. It randomly sets a fraction
of the input units to 0 during training, forcing the model to learn more robust
features and reducing the risk of co-adaptation among neurons.
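As a rough illustration of the mechanism, the sketch below implements "inverted" dropout on a plain Python list. The function name, the rate of 0.5, and the seed are assumptions made for the example. The rescaling of surviving units by 1 / (1 - rate) mirrors how Keras' Dropout layer behaves during training.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each unit with probability `rate` during training (requires rate < 1)."""
    if not training or rate == 0.0:
        return list(values)  # at inference time the layer is a no-op
    keep = 1.0 - rate
    # Surviving units are scaled up so the expected activation stays the same.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0] * 1000

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(sum(1 for v in train_out if v == 0.0), "of", len(train_out), "units dropped")
```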
4. Dense Output Layer
The final layer of the model is a dense layer with a softmax activation function.
This layer takes the output from the LSTM layers and produces a probability
distribution over the sentiment classes (positive, negative, and neutral). The class
with the highest probability is selected as the predicted sentiment.
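The softmax step and the final class selection can be sketched in a few lines. The class order and the example scores are assumptions made for illustration; the report does not state how the classes are indexed.

```python
import math

# Assumed class order, for illustration only.
CLASSES = ["negative", "neutral", "positive"]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the most probable class together with all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs


label, probs = predict([2.0, 0.5, -1.0])
print(label, [round(p, 3) for p in probs])
```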
The model is trained using the Adam optimizer and sparse categorical cross-
entropy loss function. The training process involves feeding the padded input
sequences and their corresponding sentiment labels to the model, allowing it to
learn the mapping between the text and sentiment classes.
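For reference, sparse categorical cross-entropy takes an integer class id per example and averages the negative log-probability that the model assigned to the correct class. The sketch below is a plain-Python version under that definition. The function name matches the Keras loss only for readability, and the `eps` clipping value and example probabilities are assumptions.

```python
import math


def sparse_categorical_crossentropy(probs_batch, labels, eps=1e-7):
    """Mean of -log(p[true_class]) over a batch; labels are integer class ids."""
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs_batch, labels)]
    return sum(losses) / len(losses)


# Two hypothetical predictions over three classes, with true class ids 0 and 2.
probs_batch = [
    [0.7, 0.2, 0.1],  # confident and correct: small loss
    [0.6, 0.3, 0.1],  # true class got only 0.1: large loss
]
labels = [0, 2]

loss = sparse_categorical_crossentropy(probs_batch, labels)
print(round(loss, 4))
```

A model that always outputs a uniform distribution over three classes has a loss of ln 3, which is a useful baseline when reading training logs.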
DESCRIPTION OF DATASET
1. Summary
A brief overview of the review, providing a concise summary of the user's
opinion.
2. Text
The full text of the review, containing the user's detailed feedback and
sentiments.
3. Category
The category of the item being reviewed, such as books, DVDs, or music.
4. Rating
The rating given to the item by the user. A simple labeling scheme is
employed based on the user-provided rating. Reviews with a rating greater
than 3 are considered positive, those with a rating less than 2 are considered
negative, and those with a rating of 2 or 3 are considered neutral.
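The thresholds stated above translate directly into a small labeling function. The function name and the integer ids are hypothetical. The ids are included only because a sparse categorical loss expects integer labels.

```python
def rating_to_sentiment(rating):
    """Map a user rating to a sentiment class using the thresholds above."""
    if rating > 3:
        return "positive"
    if rating < 2:
        return "negative"
    return "neutral"  # ratings of 2 or 3


# Hypothetical integer ids for use with a sparse categorical loss.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}

ratings = [1, 2, 3, 4, 5]  # assumes a 1-5 scale, which the report does not state
labels = [rating_to_sentiment(r) for r in ratings]
label_ids = [LABEL_TO_ID[label] for label in labels]
print(list(zip(ratings, labels)))
```

If ratings are whole numbers on a 1-5 scale (an assumption), only a rating of 1 maps to the negative class under this rule, which is worth keeping in mind when looking at class balance.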
The dataset is divided into training and test sets for each language and
category. The training set is used to fit the sentiment analysis model, while the
test set is used for evaluating the model's performance and generalization
capabilities.
It is important to note that the dataset may contain noisy or inconsistent data,
such as reviews with missing or irrelevant information, which can potentially
impact the model's performance. Careful preprocessing and data cleaning
steps may be necessary to ensure the quality and reliability of the dataset.
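A minimal sketch of such a cleaning step, using only Python's standard library, is shown below. The element names (`item`, `summary`, `text`, `category`, `rating`) and the sample content are assumptions based on the fields listed above, not the verified schema of the dataset files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed layout: one <item> per review.
SAMPLE_XML = """
<items>
  <item><summary>Great read</summary><text>I loved this book.</text>
        <category>books</category><rating>5.0</rating></item>
  <item><summary>Empty</summary><text>   </text>
        <category>dvd</category><rating>1.0</rating></item>
  <item><summary>No rating</summary><text>Fine album.</text>
        <category>music</category></item>
</items>
"""


def parse_reviews(xml_string):
    """Extract review fields, skipping entries with empty text or no usable rating."""
    root = ET.fromstring(xml_string.strip())
    reviews = []
    for item in root.iter("item"):
        text = (item.findtext("text") or "").strip()
        rating_raw = item.findtext("rating")
        if not text or rating_raw is None:
            continue  # missing or empty field: drop the review
        try:
            rating = float(rating_raw)
        except ValueError:
            continue  # rating is present but not numeric
        reviews.append({
            "summary": (item.findtext("summary") or "").strip(),
            "text": text,
            "category": (item.findtext("category") or "").strip(),
            "rating": rating,
        })
    return reviews


reviews = parse_reviews(SAMPLE_XML)
print(reviews)
```

The resulting list of dictionaries can be handed to `pandas.DataFrame(reviews)` if a tabular structure is preferred.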
METHODOLOGY & APPROACH
1. Data Preprocessing
The dataset consists of user reviews collected from various sources, formatted
in XML files. Each review entry includes fields such as summary, text,
category, and rating. The preprocessing steps include:
• Parsing XML Files: The XML files are parsed to extract relevant fields,
and the data is organized into a structured format (DataFrame) for
further processing.
• Handling Missing Data: Reviews with missing or empty text fields are
filtered out to ensure the quality of the input data. This step is crucial to
prevent errors during tokenization and model training.
• Sentiment Labeling: The sentiment of each review is determined based
on the user-provided rating. Reviews with ratings greater than 3 are
labeled as positive, those with ratings less than 2 as negative, and ratings
of 2 or 3 as neutral.
• Tokenization: The text data is tokenized using Keras' Tokenizer, which
converts words into sequences of integers. This process allows the
model to work with numerical representations of the text.
• Padding Sequences: To ensure uniform input size for the model, the
sequences are padded to a fixed length (e.g., 100 tokens) using Keras'
pad_sequences function. This step is essential for handling variable-
length input data.
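To make the tokenization and padding steps concrete, the sketch below is a simplified pure-Python stand-in for what Keras' Tokenizer and pad_sequences do. It is not the Keras implementation: it splits on whitespace only, and the helper names are hypothetical. It follows the Keras defaults in two respects: index 0 is reserved for padding, and padding and truncation are applied at the start of the sequence.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign integer ids by descending frequency; id 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its id; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]


def pad_to_length(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


texts = ["good good book", "bad book"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_to_length(sequences, maxlen=4)
print(word_index, sequences, padded)
```

Whitespace splitting is a poor fit for Japanese text, which has no spaces between words, so a language-aware tokenizer would be needed there.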
2. Model Architecture
• Embedding Layer: This layer converts the integer sequences into dense
vector representations, allowing the model to learn semantic
relationships between words.
• LSTM Layers: Two stacked LSTM layers process the embedded
sequences, capturing temporal dependencies and contextual information
within the text. The first LSTM layer returns the full sequence of hidden
states, while the second layer returns only the final hidden state.
• Dropout Layer: A dropout layer is included to mitigate overfitting by
randomly setting a fraction of the input units to zero during training.
• Dense Output Layer: The final layer is a dense layer with a softmax
activation function, producing a probability distribution over the
sentiment classes (positive, negative, neutral).
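One way to get a feel for the size of this stack is to count its trainable parameters. The layer sizes below (vocabulary of 10,000 words, 128-dimensional embeddings, 64 LSTM units) are hypothetical, since the report does not state them. The formulas are the standard ones for embedding, LSTM, and dense layers.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim  # one vector per vocabulary entry


def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate cell update), each with
    # input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, n_classes):
    return input_dim * n_classes + n_classes  # weights plus biases


# Hypothetical sizes, chosen only for illustration.
VOCAB_SIZE, EMBED_DIM, LSTM_UNITS, N_CLASSES = 10_000, 128, 64, 3

counts = {
    "embedding": embedding_params(VOCAB_SIZE, EMBED_DIM),
    "lstm_1": lstm_params(EMBED_DIM, LSTM_UNITS),
    "lstm_2": lstm_params(LSTM_UNITS, LSTM_UNITS),
    "dropout": 0,  # dropout has no trainable parameters
    "dense": dense_params(LSTM_UNITS, N_CLASSES),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n:>9,}")
print(f"{'total':>10}: {total:>9,}")
```

Under these assumed sizes the embedding table accounts for most of the parameters, so vocabulary size is one of the first settings to revisit when the model overfits.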
3. Model Training
The model is trained using the Adam optimizer and sparse categorical cross-
entropy loss function. The training process consists of the following steps:
• Training and Validation Split: The dataset is divided into training and
validation sets to monitor the model's performance during training. A
typical split might involve using 80% of the data for training and 20%
for validation.
• Epochs and Batch Size: The model is trained for a specified number of
epochs (e.g., 5) with a defined batch size (e.g., 64). During each epoch,
the model learns from the training data and updates its weights based on
the computed loss.
• Monitoring Performance: The validation loss and accuracy are
monitored at the end of each epoch to assess the model's ability to
generalize to unseen data. Early stopping can be employed to prevent
overfitting if the validation loss starts to increase.
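The split and the early-stopping rule described above can be sketched without any framework. The function names, the patience value, and the list of validation losses are hypothetical. In Keras the same behaviour is normally obtained through a validation split and an early-stopping callback rather than hand-written code.

```python
import random


def train_val_split(samples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def early_stopping_epoch(val_losses, patience=1):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses), best_epoch


train, val = train_val_split(range(100))

# Hypothetical validation losses over five epochs.
stopped_at, best_at = early_stopping_epoch([0.90, 0.80, 0.85, 0.90, 0.95])
print(len(train), len(val), stopped_at, best_at)
```

With a patience of 1, training stops at the first epoch whose validation loss fails to improve, and the weights from the best epoch are the ones worth keeping.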
4. Model Evaluation
After training, the model is evaluated on a separate test set for each language
and category. The evaluation reports overall accuracy together with per-class
precision, recall, and F1-score.
RESULT
The sentiment analysis model was trained across four languages (German, French,
Japanese, and English) over five epochs, yielding varying results:
• German: a final training accuracy of 92.02%, a validation accuracy of 63.17%,
and a test accuracy of 28% across categories (books, DVDs, music), indicating
significant misclassification, particularly of positive and neutral sentiments.
• French: a maximum training accuracy of 91.21%, with a test accuracy of only
22% for DVDs and 23% for books, highlighting the model's inability to
generalize effectively.
• Japanese: the model encountered errors during evaluation, resulting in an
accuracy of 50% for DVDs but undefined metrics for books due to data issues.
• English: a training accuracy of 89.38% and a validation accuracy of 62.75%,
with similarly low test accuracy across categories, reflecting a pattern of high
recall for negative sentiments but poor precision for the positive and neutral
classes.
Overall, the model fit the training data well, but the gap between training and
validation accuracy points to overfitting. The model also struggled with class
imbalance, leading to low accuracy on the test sets and undefined metrics for
certain classes. Both data handling and model architecture need further
improvement.
Comparison with Previous Work
The results obtained from this sentiment analysis model can be compared to
previous studies in the field. Kim (2014) demonstrated that Convolutional Neural
Networks (CNNs) can achieve state-of-the-art performance on sentence
classification tasks, with accuracy scores ranging from 0.81 to 0.89 on various
datasets.
However, it is important to note that the datasets and evaluation metrics used in
different studies may vary, making direct comparisons challenging. Additionally,
the complexity of the task and the diversity of languages and domains can
significantly impact the model's performance.
The sentiment analysis model was evaluated on separate test sets for each language
and category, revealing significant limitations in its ability to accurately classify
sentiments. For German reviews, the model achieved a test accuracy of only 28%
across all categories (books, DVDs, music), with undefined precision and F1-scores
for positive and neutral sentiments, indicating a strong bias towards predicting
negative sentiments. The French dataset showed even lower performance, with
accuracies ranging from 22% to 25%, and the Japanese dataset encountered errors
during evaluation, particularly for the books category, suggesting data quality
issues.
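The "undefined" precision and F1-scores mentioned above arise when a class is never predicted, because precision then has a zero denominator. The sketch below reproduces the effect on toy labels. The data is invented for illustration and is not the project's test output.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Per-class precision, recall, and F1; None marks an undefined value."""
    report = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        actual = sum(1 for t in y_true if t == c)
        precision = tp / predicted if predicted else None  # class never predicted
        recall = tp / actual if actual else None           # class never occurs
        if precision is None or recall is None:
            f1 = None
        elif precision + recall == 0:
            f1 = 0.0
        else:
            f1 = 2 * precision * recall / (precision + recall)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy example: a classifier that predicts "negative" for every review.
y_true = ["negative"] * 2 + ["neutral"] * 3 + ["positive"] * 3
y_pred = ["negative"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_metrics(y_true, y_pred, ["negative", "neutral", "positive"])
print(accuracy, report)
```

In this toy case the negative class has perfect recall but low precision, while the other two classes have zero recall and no defined precision, which matches the pattern a model biased towards one class would produce.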
MACHINE CONFIGURATIONS
CONCLUSION
The sentiment analysis model developed in this project aimed to classify
sentiments expressed in user reviews across multiple languages and categories.
While the model achieved high training accuracy, the evaluation results revealed
limitations in accurately classifying positive and neutral sentiments, particularly in
multilingual contexts.
To improve the model's performance, future work should focus on addressing data
imbalance, experimenting with different model architectures, and incorporating
advanced techniques such as transfer learning and data augmentation.
Additionally, further research is needed to understand the challenges posed by
language complexity and cultural nuances in sentiment analysis.
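One common way to address data imbalance is to weight each class inversely to its frequency during training. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count). The label counts are invented for illustration, and the technique is offered as one option rather than as something this project applied.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution with few negative reviews.
toy_labels = ["positive"] * 6 + ["neutral"] * 3 + ["negative"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary of this kind, keyed by integer class id instead of label name, can be passed as the `class_weight` argument of Keras' `fit` so that errors on rare classes count for more in the loss.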
Key points:
FUTURE ENHANCEMENTS
Implementing the BERT model for sequence classification: The model architecture
was updated to use a pre-trained BERT model fine-tuned for sentiment analysis.
This allowed the model to leverage learned representations from vast amounts of
text data, potentially improving its ability to capture contextual information and
handle language complexities.
Optimizing hyperparameters: The learning rate and batch size were adjusted during
training to find the optimal values that balance convergence speed and accuracy.
This fine-tuning aimed to help the model learn more effectively and generalize
better to unseen data.
REFERENCES
1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding.
https://arxiv.org/abs/1810.04805
CODE
Aadithya Prabha R – E0321008
https://github.com/aadithyaprabha/crosslingualsentimentanalysis
https://github.com/deevnared/Cross-Lingual-Sentiment-Analysis