Online Note Categorization
Tribhuvan University
“Online Note Categorization System”
A PROJECT REPORT
Submitted To:
In partial fulfillment of the requirements for the Bachelor’s Degree in Computer Science
and Information Technology
Submitted By:
Bijay Basnet
Ritesh Shrestha
Ritesh Shrestha. This project fulfills the necessary requirements for the degree of B.Sc. in
Computer Science and Information Technology and should now proceed for evaluation.
…………………………………………
Tekendra Nath Yogi
Lecturer
College of Applied Business
Tribhuvan University
We, the undersigned, solemnly declare that we are the sole authors of this work and that no
sources other than those listed here have been utilized in its creation.
We further confirm that this work has not been submitted for any other academic evaluation
or degree program.
…………………………………………
Bijay Basnet (105)
…………………………………………
Kripesh Prasad Bhattarai (118)
…………………………………………
Ritesh Shrestha (126)
This is to certify that this project prepared by BIJAY BASNET, KRIPESH PRASAD
BHATTARAI AND RITESH SHRESTHA entitled “ONLINE NOTE
CATEGORIZATION SYSTEM” in partial fulfillment of the requirements for the degree of
B.Sc. in Computer Science and Information Technology, has been thoroughly reviewed. In
our assessment, it demonstrates a satisfactory level of scope and quality, fulfilling the criteria
for the required degree.
……………………………… ……………………………….
Head Coordinator Supervisor
College of Applied Business Tekendra Nath Yogi
College of Applied Business
…………………………………. …………………………………..
Internal Examiner External Examiner
ACKNOWLEDGEMENT
With Respect,
ABSTRACT
As the volume of textual information continues to grow across various domains, the task of
text categorization has gained significant importance. Our project focuses on addressing this
challenge by automatically categorizing personal notes and generating relevant tags based on
their content.
To achieve our objective, we employed the Multinomial Naive Bayes algorithm for
classifying personal notes into predefined categories, including Technology, Politics,
Religion, Science, Sports, Space, Medicine, and Miscellaneous. Additionally, we
implemented 20 specific categories from the dataset, such as alt.atheism, comp.graphics,
comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,
comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball,
rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian,
talk.politics.guns, talk.politics.mideast, talk.politics.misc, and talk.religion.misc. This
approach allowed us to generate automatic tags for each personal note.
Our system effectively processed and classified the textual personal notes using the
Multinomial Naive Bayes algorithm. The categorized notes were then displayed, enabling
users to filter them based on specific categories. This report highlights the successful
application of the Multinomial Naive Bayes algorithm for text classification, offering a
streamlined note-taking process and simplified note management for users.
Table of Contents
SUPERVISOR’S RECOMMENDATION
STUDENT’S DECLARATION
LETTER OF APPROVAL
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 Introduction
1.3 Objectives
CHAPTER 2
CHAPTER 3
SYSTEM ANALYSIS
i. Functional Requirements
i. Technical Feasibility
iv. Schedule Feasibility
3.1.3 Analysis
CHAPTER 4
SYSTEM DESIGN
4.1 Design
CHAPTER 5
5.1 Implementation
5.2 Testing
CHAPTER 6
6.1 Conclusion
REFERENCES
APPENDIX
LIST OF FIGURES
LIST OF TABLES
Table 5.5 Test Case for Managing category
Table 5.6 Test Case for Managing Notes
Table 5.7 Test Case for Text Categorization
LIST OF ABBREVIATIONS
ML Machine Learning
NLP Natural Language Processing
MNB Multinomial Naïve Bayes
ER Entity-Relationship
DFD Data Flow Diagram
PK Primary Key
FK Foreign Key
TF Term Frequency
IDF Inverse Document Frequency
TFIDF Term Frequency-Inverse Document Frequency
NLTK Natural Language Toolkit
HTTP Hypertext Transfer Protocol
HTML Hypertext Markup Language
CHAPTER 1
INTRODUCTION
1.1 Introduction
In today's digital age, the volume of textual data being generated on a daily basis is staggering. It
has become a challenge to manage and extract insights from such an enormous amount of
unstructured data. However, with the advent of natural language processing (NLP) and machine
learning techniques, text classification has emerged as a powerful tool to organize, structure, and
categorize this data, leading to better decision-making. Text classification has numerous
applications, such as sentiment analysis, spam detection, intent detection, and topic labeling. It is
estimated that 80% of all information is unstructured, with text being the most common type of
unstructured data. This makes it a challenging task to analyze, understand, and sort through
textual data, which is why companies often fail to use it to its full potential.
Text classification involves categorizing text documents into predefined categories or classes.
This technique can be used to automatically sort through large volumes of textual data, enabling
the extraction of useful insights and patterns that might otherwise be difficult to identify. It is an
essential task in many fields, including information retrieval, web searching, and text mining.
Text classification is typically approached as a supervised learning problem, where a machine
learning algorithm is trained on a labeled dataset of text documents, with the goal of developing
a model that can accurately classify new, unseen documents.
In this document, we focus on "Online Note Categorization and Automatic Tag Generation"
using the Multinomial Naive Bayes algorithm. The aim of this project is to classify personal
notes into predefined categories and generate automatic tags for them, making it easier to search
and filter the notes based on their content. Our approach involves pre-processing the textual
personal notes and then training the Multinomial Naive Bayes algorithm to classify them into
categories. We then use these categories to automatically generate tags for each note, enabling
users to quickly filter and find the information they need.
Overall, the project aims to showcase the effectiveness of the Multinomial Naive Bayes
algorithm in text classification by accurately classifying textual notes into predefined categories
and generating automatic tags for them.
1.2 Problem Statement
With the ever-increasing amount of textual data generated on social media, news sites, and blogs,
there is a need for efficient organization and categorization of such data. This becomes
particularly important in countries like Nepal where natural language processing techniques are
not commonly used. Our project addresses the problem of categorizing personal notes into
predefined categories using Multinomial Naive Bayes algorithm and generating automatic tags
for those notes. The system can be useful for companies and governmental sectors to gather
insights from customer feedback, comments, and opinions.
For example:
“Jesus Christ was a famous Christian figure in the 1st century.” → Religion (#soc #religion #christian)
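The tags in the example above look like the dot-separated parts of a newsgroup-style label such as soc.religion.christian. A minimal sketch of that idea follows, assuming tags are derived by splitting the predicted label on dots. The `BROAD_CATEGORY` mapping and both function names are hypothetical and only illustrate how a fine-grained label could be shown alongside a broad category.

```python
# Hypothetical mapping from fine-grained labels to broad categories
# (illustrative only; the real mapping is not given in this report).
BROAD_CATEGORY = {
    "soc.religion.christian": "Religion",
    "sci.space": "Space",
    "rec.sport.hockey": "Sports",
}


def tags_from_label(label):
    """Split a dot-separated label into hashtag-style tags."""
    return ["#" + part for part in label.split(".") if part]


def describe(label):
    """Format a predicted label as 'Category (#tag1 #tag2 ...)'."""
    category = BROAD_CATEGORY.get(label, "Miscellaneous")
    return f"{category} ({' '.join(tags_from_label(label))})"


print(describe("soc.religion.christian"))
```

Labels missing from the mapping fall back to Miscellaneous, which mirrors the catch-all category listed in the abstract.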
1.3 Objectives
The main objectives of this study are:
1. Develop a Multinomial Naive Bayes Classification model to identify, extract, and study
various personal notes, enabling effective text analysis.
2. Create a web application that automates the categorization of unstructured user notes into
different categories and generates relevant tags to describe the notes accurately.
1.4 Scope and Limitation
The scope of the project includes:
1. Analyze opinions and notes from different people on different topics.
2. Classify notes, comments, and subjective expressions into different categories and
generate automatic tags through a classification algorithm.
3. Efficient organization of notes to help users quickly organize and retrieve them easily.
4. Time-saving by eliminating the need for manual categorization and tagging of notes.
5. Consistency in categorization and tagging of notes, eliminating errors or inconsistencies
that may arise from manual categorization.
6. User-friendly and easy-to-use system, allowing users to focus on the content of their
notes instead of spending time organizing them.
1.5 Development Methodology
The software development methodology used to develop the Online Note Categorization system
is the Waterfall Methodology due to the simplicity of the project requirements and the absence of
stakeholders or users who may require frequent changes to the system. The Waterfall model
includes various phases such as requirements analysis, design, implementation, testing
(validation), integration, and maintenance. Each phase must be completed before the next one
can begin, and a review takes place at the end of each phase to determine the project's progress.
The testing starts only after the development is complete, and the outcome of one phase acts as
the input for the next phase sequentially, with no overlap between phases.
1.6 Report Organization
The report is organized as follows:
Preliminary Section: This section includes the title page, certificate page, acknowledgement,
abstract, table of contents, list of figures, list of tables, and list of abbreviations.
Chapter 1: Introduction - This chapter presents an overview of the project, including its
background, objectives, scope, limitations, development methodology, and report organization.
Chapter 2: Background Study and Literature Review - This chapter provides a detailed
description, summary, and critical evaluation of relevant literature and research related to the
project.
Chapter 3: System Analysis - This chapter focuses on the requirement analysis and feasibility
analysis of the project, including economic, technical, operational, and schedule feasibility.
Chapter 4: System Design - This chapter describes the system design process, including the
interpretation of findings and the definition of system components and data to satisfy
requirements.
Chapter 5: Implementation and Testing - This chapter discusses the results of implementing
the project, including checking the outputs and testing using various test cases.
Chapter 6: Conclusion and Recommendation - This chapter presents the conclusions and
recommendations based on the results of the project, including ideas for future development and
improvements to the system.
CHAPTER 2
BACKGROUND STUDY AND LITERATURE REVIEW
2.1 Background Study
Online note-taking applications have become increasingly popular in recent years due to the
convenience they offer in storing, organizing, and sharing notes. However, with the increasing
volume of notes and comments generated by users, it becomes difficult to efficiently categorize
and tag them for easy retrieval. This is where automatic note categorization systems come into
play. These systems use machine learning algorithms to categorize and tag notes based on their
content. Multinomial Naive Bayes Classification is one such algorithm that has been widely used
for text classification tasks, achieving high accuracy in various applications such as spam
filtering, sentiment analysis, and document categorization.
The Online Note Categorization system is designed to classify and tag notes, comments, and
subjective expressions into different categories using a classification algorithm. In order to
understand the fundamental theories and general concepts related to the project, it is important to
explore various areas such as Natural Language Processing (NLP), Machine Learning (ML),
Text Classification, Data Mining, and User Interface Design. NLP involves the analysis of
natural language text in order to categorize it into different classes. ML applies learning
algorithms to classify notes into different categories based on patterns and relationships within
the data. Text classification is the process of categorizing text into predefined categories based
on its content. Data mining is the process of discovering patterns in large datasets, which the
Online Note Categorization system uses to identify patterns and relationships within the data to
improve classification accuracy. Finally, user interface design will be utilized to create an
interface that is easy for users to understand and use.
Therefore, the aim of this project is to develop an automatic note categorization system using
Multinomial Naive Bayes Classification, which will allow users to efficiently organize their
notes and retrieve them easily. Understanding these fundamental theories, general concepts, and
terminologies related to the project is essential for the successful development of the Online Note
Categorization system.
2.2 Literature Review
Online note-taking applications have become increasingly popular in recent years due to the
convenience they offer in storing, organizing, and sharing notes. However, with the increasing
volume of notes and comments generated by users, it becomes difficult to efficiently categorize
and tag them for easy retrieval. This is where automatic note categorization systems come into
play. These systems use machine learning algorithms to categorize and tag notes based on their
content.
Multinomial Naive Bayes Classification is one such algorithm that has been widely used for text
classification tasks. It works by calculating the probability of a document belonging to a
particular category, based on the frequency of words in the document. This algorithm has been
shown to achieve high accuracy in text classification tasks and has been used in various
applications such as spam filtering, sentiment analysis, and document categorization.
Several projects have been developed in the area of automatic note categorization.
Research [3] conducted by Li et al. (2017) on automatic text classification using Naive Bayes
showed that the algorithm achieved an average accuracy of 85.4% in classifying text documents
into different categories. Similarly, research [4] conducted by Zhang and Yang (2014) on
automatic categorization of emails using Naive Bayes showed that the algorithm achieved an
accuracy of 92.8%.
Another relevant study [5] was conducted by Chong et al. (2018) on automatic categorization of
scientific articles using Naive Bayes. The study showed that the algorithm achieved an accuracy
of 90.8% in classifying articles into different categories. They also compared Naive Bayes with
other classification algorithms, such as Support Vector Machines and Random Forest, and found
that Naive Bayes outperformed them in terms of accuracy and speed.
Automatic note categorization systems have become increasingly important due to the large
volume of notes and comments generated by users. Multinomial Naive Bayes Classification is a
widely used algorithm for text classification tasks and has been shown to achieve high accuracy
in various applications. Several projects have been developed in the area of automatic note
categorization using different techniques, such as keyword extraction and supervised machine
learning algorithms. Studies conducted by researchers have shown that Naive Bayes achieves
high accuracy in classifying text documents into different categories. Recent studies have also
proposed novel approaches using deep learning techniques, such as convolutional neural
networks, to improve the accuracy of note categorization systems.
CHAPTER 3
SYSTEM ANALYSIS
3.1 System Analysis
3.1.1 Requirement Analysis
Effective requirement analysis is critical for project success. Thorough research, clear
requirements definition, and careful tool selection are essential. It is crucial to choose
requirements wisely to align with project objectives and ensure effective output. The functional
and non-functional requirements are described below:
i. Functional Requirements
A functional requirement is something a system must do. In this project, the functional requirements
are:
1. User authentication: The system should allow users to create and manage their accounts
and provide login functionality.
2. Manage categories: The system should allow users to create, add, save, edit and delete
categories.
3. Manage notes: The system should allow users to create, add, save, edit, view and delete
notes.
4. Automatic tag generation: The system should generate relevant tags for notes based on
their content using natural language processing (NLP) techniques.
5. Note categorization: The system should provide functionality to categorize notes.
6. Important notes: The system should provide the option to mark important notes, so they
are visible in the important note list in the dashboard.
7. Tag-based search: The system should allow users to search notes based on tags and
categories.
8. Security: The system should ensure the privacy and security of user data and protect
against unauthorized access.
Figure 3.1 Use Case Diagram for Online Note Categorization System
ii. Non-Functional Requirements
The non-functional requirements of this web application can be summarized as follows:
1. Performance: The website must load fast and respond quickly to user requests. The
website must be fault-tolerant.
2. Usability: The website must have a minimalistic and non-distracting UI, with easy-to-use
functionality that is easy to understand. The website must have a login system for
authentication, and the user must be helped appropriately to fill in the mandatory fields in
case of invalid input.
3. Security: The website must have secure access to confidential data. Only authenticated
users must be allowed to view their respective databases. User credentials must be stored
securely in a database as a hash string.
4. Maintainability: The website must be easy to extend and reliable, providing correct
results. The analysis procedure must be carried out to increase the precision of the
system.
Overall, the web application must be responsive, secure, user-friendly, maintainable, and
scalable, providing a positive user experience and optimal website performance.
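The security requirement above calls for credentials to be stored as a hash string. One possible way to do this is sketched below using only the Python standard library, assuming a salted PBKDF2-HMAC-SHA256 scheme. The function names, the `salt$digest` storage format, and the iteration count are illustrative assumptions, not the project's actual implementation.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, not taken from the report


def hash_password(password, salt=None):
    """Return 'salt_hex$digest_hex' for storage in the user table."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password, stored):
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), ITERATIONS
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)


stored = hash_password("s3cret")
print(verify_password("s3cret", stored), verify_password("wrong", stored))
```

Only the salt and the derived digest are stored, so the plain-text password never needs to be kept in the database.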
ii. Operational Feasibility
This system will help decision makers and business owners make informed decisions by providing
efficient reports and classification. Statistics and reports generated from the system are easier to
read and understand, thus making the system operationally feasible.
iii. Economic Feasibility
The main economic cost would be the investment in hardware, software, and manpower required
to develop the system. Most of the tools and software required for development are freely
available. Therefore, the initial investment in hardware and software is expected to be low.
Furthermore, the system can be developed by a team of students, which would eliminate the need
for additional manpower cost. The only cost associated with manpower would be for the
development of skills required for the project, which can be achieved through self-learning or
with the assistance of a supervisor.
As the project does not involve any commercial use or product, there is no direct revenue
expected to be generated from the system. However, the system has potential benefits for
academic and research purposes, which can contribute to the advancement of knowledge and
technology.
iv. Schedule Feasibility
Requirements Analysis
Design X
Coding X
Testing X
In conclusion, the Online Note Categorization system project is technically feasible with some
assistance, operationally feasible, economically feasible, and schedule feasible.
3.1.3 Analysis
Structured Approach
3.1.3.1 Data modelling using ER Diagram
ER diagrams consist of entities, attributes, and relationships, which are represented using
symbols such as rectangles, ovals, and diamonds, respectively. An entity represents a real-world
object or concept, such as a user, a note, or a category. An attribute is a characteristic of an
entity, such as a user's name, a note's content, or a category's description. A relationship
represents a connection between two or more entities, such as a user creating a note, a note
belonging to a category, or a note having multiple tags.
By using an ER diagram to model the data for the Online Note Categorization system, we can
ensure that the relationships between the data are clearly defined and can be easily understood by
the people involved in this project. This can help to improve the accuracy and efficiency of the
development process and ensure that the application meets the requirements.
3.1.3.2 Process Modelling using DFD
i. Level 0 DFD
Level 0 DFD is a basic overview of the whole system or process being modeled. It provides a
bird's eye view of the system and is an excellent starting point for understanding the system's
overall architecture and functionality.
ii. Level 1 DFD
Level 1 Data Flow Diagram (DFD) is a detailed graphical representation of a system's processes,
inputs, outputs, and interactions with external entities at a more granular level than the Level 0
DFD. It provides a more detailed view of the system than the Level 0 DFD and is useful for
identifying the sub-processes and data flows that make up the system. This understanding is
necessary for the development of the system's detailed design and implementation.
CHAPTER 4
SYSTEM DESIGN
System Design is a vital aspect of software development that involves defining the architecture,
modules, components, and interfaces of a system, as well as the data that goes through it. It
enables businesses and organizations to satisfy specific needs and requirements by engineering a
well-structured and efficient system. By providing a means to manage and control the software
development process, System Design helps to ensure that the resulting system is scalable,
maintainable, and upgradable over time. Ultimately, the goal of System Design is to create a
coherent and well-running system that meets the needs of the users.
4.1 Design
Database Design: Transformation of ER to relations and normalizations
The ER diagram for the system includes entities such as Signup, Category, Notes, and
Noteshistory, each with their own set of attributes.
To transform the ER diagram into a set of relations, each entity in the diagram is mapped to a
corresponding table in the database. For example, the Signup entity becomes a table named
Signup, with columns such as user_id, ContactNo, About, Role, and RegDate. Similarly, the
Category entity becomes a table named Category, with columns such as signup_id and
categoryName.
Next, relationships between the entities are mapped to foreign keys in the corresponding tables.
For example, the relationship between Signup and Category is represented by a foreign key
named signup in the Category table. This foreign key links the Category table to the Signup table
and ensures that each category is associated with a specific signup.
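The mapping from entities to tables and from relationships to foreign keys can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the `id` primary-key columns, the column types, the `add_category` helper, and the use of SQLite are hypothetical, while the table names and the other column names follow the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE Signup (
        id        INTEGER PRIMARY KEY,  -- hypothetical surrogate key
        user_id   INTEGER NOT NULL,
        ContactNo TEXT,
        About     TEXT,
        Role      TEXT,
        RegDate   TEXT
    );
    CREATE TABLE Category (
        id           INTEGER PRIMARY KEY,  -- hypothetical surrogate key
        signup_id    INTEGER NOT NULL REFERENCES Signup(id),
        categoryName TEXT NOT NULL
    );
    """
)


def add_category(connection, signup_id, name):
    """Insert a category that must belong to an existing signup row."""
    connection.execute(
        "INSERT INTO Category (signup_id, categoryName) VALUES (?, ?)",
        (signup_id, name),
    )


conn.execute("INSERT INTO Signup (id, user_id, Role) VALUES (1, 105, 'user')")
add_category(conn, 1, "Science")
print(conn.execute("SELECT signup_id, categoryName FROM Category").fetchall())
```

With the foreign key enabled, inserting a category for a signup that does not exist raises `sqlite3.IntegrityError`, which is the guarantee the relationship mapping is meant to provide.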
Once the tables are created, normalization rules are applied to ensure that the data is consistent
and free of redundancy. The first normal form (1NF) ensures that each table has a primary key
and that each column in the table is atomic. The second normal form (2NF) ensures that each
non-key column in the table depends on the entire primary key, and the third normal form (3NF)
ensures that each non-key column depends only on the primary key and not on other non-key
columns.
In the case of the Online Note Categorization System, normalizing the tables ensures that the
data is consistent and can be easily queried and manipulated. For example, by ensuring that each
table has a primary key, queries can be performed more efficiently, and data can be easily
updated or deleted without affecting other data in the table.
Overall, transforming the ER diagram into a set of relations and applying normalization rules are
critical steps in database design for the Online Note Categorization System.
Figure 4.1: Relational Data Model of Online Note Categorization System
Forms Design
Forms design ensures an intuitive and efficient user experience when interacting with the system's
forms. Forms are the primary means for users to input and manipulate data in the system. When
documenting forms design, the following aspects were addressed:
1. User-Friendly Interface: Forms follow a user-centric approach, ensuring a clean and intuitive
interface. Appropriate labels, field types, and validation techniques guide users and prevent
errors.
2. Error Handling and Validation: Validation techniques ensure data integrity and minimize
errors. Clear error messages are shown when users submit invalid data or miss required fields.
Interface Design
Interface design elements ensure an intuitive and engaging user experience, allowing users to
interact with the system effectively. The interface design considerations for the project are:
1. Visual Consistency: The interface maintains visual consistency to provide a cohesive and
unified experience, with consistent color schemes, typography, and iconography across different
screens and components.
2. Clear Information Hierarchy: Interface elements are organized in a logical and hierarchical
manner so that important information and features are easily discoverable and accessible to users.
Appropriate visual cues, such as headings, labels, and grouping, convey the information
hierarchy.
3. Intuitive Navigation: Intuitive navigation menus and controls, based on familiar navigation
patterns, help users move through different sections and features of the system.
were each kept inside with their respective category folder name.
Data Preprocessing
1. Text Cleaning and Tokenization
Punctuation is removed and the text is converted to lowercase in the required field. Any extra
spaces are removed. The input sentence is then broken down into an array of words known as tokens.
2. Stop Word Removal
After cleaning and tokenizing the input text, stop words are removed, since stop words do not
carry categorical meaning. Examples of stop words are a, an, the, and, are, and as.
3. Porter Stemming
Porter stemming is the process of removing common morphological endings from English words. It is
mainly used to collapse redundant word forms that carry the same meaning. For example, catch,
catches, and catching are different forms of the word catch. Instead of keeping all of those
words, we represent them with a single word, catch, by performing Porter
stemming.
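The three preprocessing steps can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the project's code: the stop word set is a small hand-picked sample, and `crude_stem` is a hypothetical suffix-stripping stand-in that handles only a few endings. A full Porter stemmer, such as the `PorterStemmer` class in NLTK, would replace it in practice.

```python
import re

# Small illustrative stop word list (a real list would be much longer).
STOP_WORDS = {"a", "an", "the", "and", "are", "as", "is", "of", "in", "to"}


def clean_and_tokenize(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()  # split() also collapses extra spaces


def crude_stem(word):
    """Strip a few common endings; a stand-in for real Porter stemming."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("es") and word[:-2].endswith(("ch", "sh", "x", "s", "z")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(text):
    """Clean, tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in clean_and_tokenize(text) if t not in STOP_WORDS]


print(preprocess("The player catches,  catching and catch!"))
```

All three forms of catch collapse to the same token, which is the effect stemming is meant to achieve before feature extraction.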
Feature Extraction
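In feature extraction, we use Term Frequency-Inverse Document Frequency (TFIDF). TFIDF
gives the relative weight of the individual words in the given input.
Calculation of TF,
\[ \mathrm{TF} = tf_{ik} \]
Calculation of IDF,
\[ \mathrm{IDF} = idf_k = \log\left(\frac{N}{n_k}\right) \]
Calculation of TFIDF,
\[ \mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF} \]
Or,
\[ \mathrm{TFIDF} = tf_{ik} \times \log\left(\frac{N}{n_k}\right) \]
Where,
$\mathrm{TF}$ = Term frequency,
$\mathrm{IDF}$ = Inverse document frequency.
$T_k$ = Term k in document $D_i$.
$tf_{ik}$ = Frequency of term $T_k$ in document $D_i$.
$idf_k$ = Inverse document frequency of $T_k$ in document collection C.
$N$ = Total number of documents in the collection C.
$n_k$ = The number of documents in C that contain $T_k$.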
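To make the calculation concrete, the following sketch computes TFIDF weights by hand. It assumes the common textbook weighting in which the weight of a term is its raw frequency in the document multiplied by log(N / n_k), with the natural logarithm. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)  # N: total number of documents
    doc_freq = Counter()     # n_k: number of documents containing each term
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)  # tf_ik: raw count of each term in this document
        weights.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return weights


docs = [
    ["rocket", "orbit", "orbit"],
    ["team", "match"],
    ["satellite", "orbit"],
]
for row in tf_idf(docs):
    print(row)
```

A term that appears in every document gets a weight of zero under this formula, since log(N / N) = 0, which is why very common words contribute little to classification.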
TfidfVectorizer
It is a built-in class provided by the scikit-learn library in Python. It combines both the TF and
IDF calculations into a single step. By using TfidfVectorizer, you can calculate TF-IDF scores,
and transform the data into a TF-IDF matrix in a single line of code. It is highly optimized and
efficient, making it suitable for processing large amounts of text data.
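A minimal usage sketch of TfidfVectorizer follows, assuming scikit-learn is installed. The sample notes are made up for illustration. With its default settings, TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, so its values will differ from those of the plain tf × log(N / n_k) formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample notes for illustration only.
notes = [
    "the rocket reached orbit",
    "the team won the match",
    "a new satellite entered orbit",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(notes)  # sparse matrix: notes x terms

print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of the resulting matrix is the feature vector for one note, and this matrix is the kind of input a classifier such as Multinomial Naive Bayes is trained on.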
decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited
high accuracy and speed when applied to large databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is considered
“naive”.
Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who did early
work in probability and decision theory during the 18th century. Let X be a data tuple. In
Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on
a set of n attributes. Let H be some hypothesis such as that the data tuple X belongs to a specified
class C. For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are
looking for the probability that tuple X belongs to class C, given that we know the attribute
description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned
on X.
In contrast, P(H) is the prior probability, or a priori probability, of H.
Similarly, P(X|H) is the posterior probability of X conditioned on H
P(X) is the prior probability of X.
Bayes’ theorem is
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
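To make the formula concrete, the short sketch below evaluates it for one set of made-up probabilities; the numbers and the function name `posterior` are illustrative assumptions only.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood * prior / evidence


# Made-up numbers: 20% of notes are about graphics, P(H) = 0.20; the word
# "photoshop" appears in 30% of graphics notes, P(X|H) = 0.30, and in 8%
# of all notes, P(X) = 0.08.
p_graphics_given_word = posterior(likelihood=0.30, prior=0.20, evidence=0.08)
print(round(p_graphics_given_word, 2))
```

Under these assumed values, seeing the word raises the probability of the graphics class from the prior 0.20 to a posterior of 0.75.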
Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements
made on the tuple from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X. That is, the
naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P (Ci |X) > P (Cj |X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P (Ci |X). The class Ci for which P (Ci |X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,
$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}.$$
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that
is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we
maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,
D|/|D|, where |Ci, D| is the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). To reduce computation in evaluating P(X|Ci), the naive assumption of class-
conditional independence is made. This presumes that the attributes’ values are conditionally
independent of one another, given the class label of the tuple (i.e., that there are no dependence
relationships among the attributes). Thus,
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$
We need to compute µCi and σCi, which are the mean (i.e., average) and standard deviation,
respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two
quantities together with xk, to estimate P (xk |Ci).
5. To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier
predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
(In other words, the predicted class label is the class Ci for which P(X|Ci) P(Ci) is the maximum.)
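The steps above can be sketched for word-count data as follows. This is a minimal illustration, not the project's implementation: it assumes word counts as attributes, uses Laplace (add-one) smoothing, and sums logarithms instead of multiplying probabilities to avoid numeric underflow. The function names and training samples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (tokens, label). Returns priors, word counts, vocabulary."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    # P(Ci) = |Ci,D| / |D|
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    return priors, word_counts, vocabulary


def predict(tokens, priors, word_counts, vocabulary):
    """Return the class Ci that maximizes log P(Ci) + sum_k log P(xk|Ci)."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            p = (word_counts[label][token] + 1) / (total + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training tuples.
training = [
    (["photoshop", "image", "render"], "graphics"),
    (["image", "pixel", "color"], "graphics"),
    (["hockey", "team", "game"], "sport"),
    (["team", "score", "win"], "sport"),
]
model = train(training)
print(predict(["image", "photoshop"], *model))
```

Because P(X) is the same for every class, the sketch never computes it; comparing the summed log scores is enough to pick the class with the highest posterior.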
“How effective are Bayesian classifiers?” Various empirical studies of this classifier in
comparison to decision tree and neural network classifiers have found it to be comparable in
some domains. In theory, Bayesian classifiers have the minimum error rate in comparison to all
other classifiers. However, in practice this is not always the case, owing to inaccuracies in the
assumptions made for its use, such as class-conditional independence, and the lack of available
probability data. Bayesian classifiers are also useful in that they provide a theoretical justification
for other classifiers that do not explicitly use Bayes’ theorem. For example, under certain
assumptions, it can be shown that many neural network and curve-fitting algorithms output the
maximum posteriori hypothesis, as does the naive Bayesian classifier.
Example: ['start', 'use', 'adob', 'photoshop', 'amaz']
f) Join the words back into a single string
Example: 'start use adob photoshop amaz'
3) Transform using a TF-IDF vectorizer.
(0, 110891) 0.17194467768154792
(0, 103045) 0.269221459723549
(0, 89179) 0.6323672913057574
(0, 21102) 0.4597762305091781
(0, 19675) 0.5354178370083506
4) Classify the text.
[1]
5) Map the predicted category index to the corresponding category label:
comp.graphics
In this way, every note description is assigned to one of the predefined categories, and the tags
are generated by splitting the description into words.
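Steps 4 and 5 and the tag generation can be sketched as below. The category list is hypothetical: its first entries follow the 20 Newsgroups label order, which is an assumption about the training data, and the helper name `label_and_tags` is illustrative.

```python
# Hypothetical category list; the order is an assumption about the training data.
categories = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]


def label_and_tags(predicted, cleaned_text):
    """Map the classifier's index output to a label and split the text into tags."""
    label = categories[predicted[0]]  # e.g. [1] maps to "comp.graphics"
    tags = cleaned_text.split()       # one tag per remaining word
    return label, tags


label, tags = label_and_tags([1], "start use adob photoshop amaz")
print(label)
print(tags)
```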
CHAPTER 5
IMPLEMENTATION AND TESTING
5.1 Implementation
The implementation of the Online Note Categorization System with Automatic Tag Generation
was carried out in three distinct phases. This chapter provides an overview of the implementation
process, including the utilization of algorithms, the development of backend APIs and databases,
and the creation of a frontend website for note management.
Phase 1: Algorithm Implementation
In the first phase, the focus was on implementing the Multinomial Naive Bayes algorithm along
with the necessary preprocessors for handling textual data. This algorithm was chosen for its
effectiveness in categorizing notes based on their content. Preprocessing steps such as
tokenization, stemming, and stop-word removal were applied to improve the accuracy of the
categorization process.
Phase 2: Backend API and Database Implementation
The second phase involved the development of the backend API, which serves as the backbone
of the system. The API handles user requests, manages user profiles, categories, and notes, and
integrates the categorization and tag generation algorithms. The database, implemented using
SQLite, stores user information, notes, categories, and other relevant data.
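As a rough picture of the kind of data stored, the sketch below uses Python's standard `sqlite3` module directly rather than Django's ORM. The table and column names are hypothetical and are not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE note (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        description TEXT NOT NULL,
        tags TEXT,
        category_id INTEGER REFERENCES category(id)
    );
""")

conn.execute("INSERT INTO category (name) VALUES (?)", ("comp.graphics",))
conn.execute(
    "INSERT INTO note (title, description, tags, category_id) VALUES (?, ?, ?, ?)",
    ("Photoshop", "just started using adobe photoshop",
     "start,use,adob,photoshop", 1),
)

# Join each note to its category.
rows = conn.execute(
    "SELECT note.title, category.name FROM note "
    "JOIN category ON category.id = note.category_id"
).fetchall()
print(rows)
```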
Phase 3: Frontend Website Development
The final phase focused on creating a user-friendly frontend website for note management.
Technologies such as HTML, CSS, Bootstrap, and Django were employed to design and
implement the user interface. The website provides separate dashboards for regular users and
for admin users with superuser privileges. While the landing page allows visitors to browse the
site, features such as profile management, category management, and note management require
users to either log in or register in the system.
5.1.1 Tools Used
Tools Used in the Implementation of Online Note Categorization and Automatic Tag Generation
System:
Web Technologies:
1. HTML (Hypertext Markup Language): The standard markup language used to create the
structure and content of web pages.
2. CSS (Cascading Style Sheets): A style sheet language used for defining the presentation
and layout of HTML documents.
3. JavaScript: A programming language that enables interactive and dynamic behavior on web
pages.
Database Platform:
1. SQLite: A lightweight and serverless relational database management system used for local
development and testing.
CSS Framework:
1. Bootstrap: A popular CSS framework that provides pre-built styles and components for
creating responsive and visually appealing web pages.
Additional Tools:
1. Jupyter Notebook: An interactive computational environment used for running Python
code and creating Jupyter notebook documents.
2. Draw.io: A browser-based diagramming tool used for creating system design diagrams,
flowcharts, and block diagrams.
These tools were essential for the implementation of the Online Note Categorization and
Automatic Tag Generation System. They provided the necessary frameworks, languages, and
utilities to build the frontend interface, develop the backend logic, manage databases, and create
visual system design diagrams. The combination of HTML, CSS, and JavaScript enabled the
creation of a user-friendly and interactive web interface. Django, along with Python, facilitated
the development of the backend API and database integration. SQLite served as the database
platform for local development and testing. Bootstrap provided a convenient set of styles and
components for responsive web design. Jupyter Notebook allowed for the execution of Python
code and the integration of natural language processing functionalities. Draw.io was utilized for
creating system design diagrams and visual representations of the project.
6. The `TfidfVectorizer` from scikit-learn is used to convert the preprocessed text data into
feature vectors.
7. A pipeline is created using `make_pipeline`, which combines the vectorizer and a
`MultinomialNB` classifier.
8. The model is trained using the training features and target labels.
9. The model and categories are saved using the `pickle` module.
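Steps 6 to 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the training texts, labels, and category names are made up, and `pickle.dumps` is used to serialize in memory where the project presumably writes a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the real project data is not shown here.
categories = ["graphics", "sport"]
train_texts = [
    "photoshop imag render pixel",
    "graphic card render imag",
    "hockey team win game",
    "team score goal game",
]
train_labels = [0, 0, 1, 1]  # indices into `categories`

# The pipeline vectorizes the text and feeds the features to the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Serialize the fitted pipeline together with the category names.
blob = pickle.dumps({"model": model, "categories": categories})

# Later, for example inside a view, restore both and classify a new note.
restored = pickle.loads(blob)
predicted = restored["model"].predict(["start use adob photoshop amaz"])
print(restored["categories"][predicted[0]])
```

Storing the vectorizer and classifier as one pipeline means the same vocabulary is applied at prediction time, so the views only need to pass in the preprocessed text.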
Views Implementation:
The views implementation consists of several functions that handle different HTTP requests and
render the corresponding HTML templates.
1. The `index` and `about` functions simply render the corresponding HTML templates.
2. The `register` function handles the registration process. It retrieves the user's input from
the request, creates a new user object, and saves it to the database.
3. The `user_login` function handles the user login process. It authenticates the user using
the provided credentials and redirects to the appropriate page based on the user's role
(admin or regular user).
4. The `dashboard` function retrieves user-related information and renders the dashboard
template.
5. The `profile` function handles updating the user's profile information.
6. The `manageCategory` function handles managing user categories. It retrieves existing
categories and allows the user to add new categories.
7. The `editCategory` function handles editing a category's details.
8. The `deleteCategory` function handles deleting a category.
9. The `manageNotes` function handles managing user notes. It retrieves the user's
categories and allows the user to add new notes.
10. The `editNotes` function handles editing a note's details.
11. The `viewNotes` function handles viewing a specific note and allows the user to add
comments on the note.
12. The `deleteNotesHistory` function handles deleting a note's comment history.
13. The `deleteNotes` function handles deleting a note.
14. The `generalCategory` and `specificCategory` functions render the corresponding HTML
templates for category selection.
15. The `resultSpecific` and `resultGeneral` functions handle the prediction process based on
the notes description.
16. The `searchNotes` function handles searching for notes based on the user's query.
17. The `changePassword` function handles the password change process for the logged-in
user.
18. The `Logout` function handles logging out the user.
5.2 Testing
The testing phase identifies possible flaws and potential inefficiencies in the system. After the
project was built, the following tests were applied, with the results shown below:
| Action | Inputs | Expected Output | Actual Output |
|---|---|---|---|
| User Registration Fail | First Name = Ram; Last Name = Pudasaini; Email = ram@; Password = ram; Contact no. = 9812345678; About = I am a dancer. | An error message related to the incomplete email address should be displayed. | An error message “Please enter a part following @. ram@ is incomplete.” is displayed. |
| Action | Inputs | Expected Output | Actual Output |
|---|---|---|---|
| Successful User Login | Email = kripesh@gmail.com; Password = kripesh | The login page redirects to the personal dashboard with the message “Logged In successfully.” | The login page redirects to the personal dashboard with the message “Logged In successfully.” |
| S.N. | Action | Inputs | Expected Output | Actual Output | Test Result |
Successfully Successfully
| Test no. | Unit | Test | Expected Result | Test Outcome | Evidence |
Test C: Tokenization
Input: just started using adobe photoshop and its amazing what you can do with it
Output: ['just', 'started', 'using', 'adobe', 'photoshop', 'and', 'its', 'amazing', 'what', 'you', 'can', 'do',
'with', 'it']
Test D: Stop-Words removal
Precondition: A list of stop words is available
Assumption: Given words are preprocessed
Input: ['just', 'started', 'using', 'adobe', 'photoshop', 'and', 'its', 'amazing', 'what', 'you', 'can', 'do',
'with', 'it']
Output: ['started', 'using', 'adobe', 'photoshop', 'amazing']
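Tests C and D can be reproduced with a short sketch like the one below. The stop-word set is a small illustrative list chosen to cover this input (a real list, such as NLTK's, is much longer), and the function names are hypothetical.

```python
import re

# Small illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"just", "and", "its", "what", "you", "can", "do", "with", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = tokenize(
    "just started using adobe photoshop and its amazing what you can do with it"
)
print(tokens)
print(remove_stop_words(tokens))
```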
| S.N. | Test case | Input | Expected Result | Test Result |
|---|---|---|---|---|
| 2 | Add Category | Fill form to add category | A success message should be displayed | “New Category has been Added.” message is displayed |
| 3 | Display Category | Go to the manage category page | Recently added category should be displayed | Recently added category is displayed |

| S.N. | Test case | Input | Expected Result | Test Result |
|---|---|---|---|---|
| 2 | Add Note | Fill form to add note | A success message should be displayed | “New Note has been Added.” message is displayed |
| 3 | Display Notes | Go to the manage note page | Recently added note should be displayed | Recently added note is displayed |
Table 5.7: Test Case for Text Categorization

| Test no. | Unit | Test | Expected Result | Test Outcome | Evidence |
|---|---|---|---|---|---|
| 1 | General Text Classification | Classify input text into predefined general category | Input text classified into predefined general category | Successful | Test F |
| 2 | Specific Text Classification | Classify input text into predefined specific category | Input text classified into predefined specific category | Successful | Test G |
The system passed every test case executed during the testing phase. The admin carried out all
the required CRUD operations without any issues, indicating that the system operates in line
with the specified requirements.
Upon analyzing the results of all the testing conditions, it can be confidently stated that the
system has successfully passed all the intended tests. It exhibited the desired functionality,
reliability, and accuracy, thus validating its effectiveness in note categorization and automatic tag
generation.
The comprehensive result analysis not only confirmed the system's ability to perform as expected
but also provided valuable insights for further improvements and enhancements.
CHAPTER 6
CONCLUSION AND FUTURE RECOMMENDATIONS
6.1 Conclusion
In conclusion, the development and implementation of the Online Note Categorization System
have been successfully accomplished. The system offers an efficient and user-friendly platform
for organizing and managing notes by automatically generating relevant tags based on their
content.
Throughout the project, careful attention was given to ensuring the system's accuracy, reliability,
and performance. The implementation phase involved thorough testing and result analysis, which
validated the system's ability to handle various scenarios and deliver the expected outcomes.
The system's key features, such as user registration, note creation, categorization, and tag
generation, were implemented and proved effective in organizing notes and enhancing
productivity. The intuitive user interface and seamless user experience contribute to the
system's usability and overall user satisfaction.
By automating the tag generation process, the system eliminates the need for manual tagging,
saving time and effort for users. The categorization of notes based on their content further
enhances accessibility and facilitates efficient retrieval of information.
2. Integration with External Platforms: To expand the system's reach and usability, integrating
with external platforms and services could be beneficial. Integration with popular productivity
tools, cloud storage services, or note-taking applications would allow seamless synchronization
and access to notes across multiple platforms.
3. Mobile Application: Developing a mobile application for the system would provide users
with the flexibility to access and manage their notes on the go. A mobile app would enhance
convenience and cater to users who prefer using their smartphones or tablets for note-taking and
organization.
4. User Feedback and Analytics: Implementing a feedback mechanism and analytics system
would enable users to provide suggestions and report issues. Analyzing user feedback and
system usage data can provide valuable insights for further system improvements and feature
enhancements.
By considering these recommendations, the Online Note Categorization and Automatic Tag
Generation System can evolve into a more comprehensive and powerful tool for efficient note
management and organization. Continuous development, user feedback, and staying updated
with technological advancements would contribute to the system's long-term success and user
satisfaction.
REFERENCES
[1] X. Dong, J. Zhang and X. Li, "Automatic text classification system based on Naive Bayes
and SVM.," Journal of Information Science and Engineering, vol. 31, no. 1, pp. 301-312,
2015.
40
[2] T. Ananthakrishnan and R. Krishnamurthy, "NoteCatcher: A Note Categorization and
Classification System for Personal Notebooks.," in IEEE International Conference on Data
Science and Data Intensive Systems, 2015.
[3] X. Li, X. Dong and J. Zhang, "Research on Automatic Text Classification Based on Naive
Bayes.," Journal of Physics: Conference Series, 2017.
[4] J. Zhang and W. Yang, "Automatic categorization of emails based on naïve Bayes
algorithm.," Journal of Computational Information Systems, vol. 10, no. 2, pp. 781-788,
2014.
[5] F. K. Chong, C. S. G. Khoo and B. H. Kang, "Automatic categorization of scientific articles:
An empirical study.," Journal of Information Science, vol. 44, no. 3, pp. 373-390, 2018.
APPENDIX
Figure: Login Page
Figure: Dashboard
Figure: Manage Profile Page
Figure: Manage Notes Page
Figure: Categorization Page
Figure: Django Admin Panel