Report 83
Transformers
BACHELOR OF ENGINEERING
IN
INFORMATION TECHNOLOGY
Submitted by
November, 2024
CERTIFICATE
This is to certify that the project titled Daily Mail Summarization using
L.V.Vishva Sree(160121737083)
Abstract
Table of Contents
3.1 Flowchart
5.1 The template folder contains HTML pages. The app.py contains
the Flask code where the user gives input text/mail or can even
upload files for summarization
5.2 Import libraries
5.3 Read the dataset
5.4 Data Analysis
5.5 Text Processing
5.6 Splitting and Tokenization
5.7 Home page
5.8 Upload Page
5.9 Result Now Page
Abbreviations
Abbreviation Description
IT Information Technology
AI Artificial Intelligence
UI User Interface
CHAPTER 1
INTRODUCTION
The growing demand for time-efficient communication solutions has fueled
the need for email summarization tools. These tools can help reduce the reading
load by automatically extracting the most relevant content from emails and
presenting it in a concise, readable format.
• Job Seekers: For those applying for jobs, having a resume that
highlights the most relevant skills and experiences is crucial.
However, creating the perfect summary or tailoring resumes for
different roles can be an exhausting and time-consuming task. An
AI-powered summarizer can assist by automatically identifying key
skills, qualifications, and experiences, ensuring that the most
important information is presented in a clear, concise format.
These challenges highlight the need for a solution that combines the power
of AI with ease of use, accuracy, and flexibility.
EXISTING SYSTEM
– a.TextRank Algorithm
∗ Advantages:
· 1.Simple to implement and computationally efficient.
∗ Limitations:
– b.Sumy
∗ Advantages:
∗ Limitations:
– a.OpenAI’s GPT-3
∗ Advantages:
∗ Limitations:
∗ Advantages:
∗ Limitations:
∗ Advantages:
∗ Limitations:
· Advantages:
· Limitations:
PROPOSED SYSTEM
The T5 Transformer model is fine-tuned on the Daily Mail
Summarization Dataset to generate high-quality summaries.
The backend processes the input text, invokes the trained
model, and generates a summarized output.
· 3. Web Server:
· a.Dataset:
· b.Text Preprocessing:
· 2. Model Development:
· 3.Model Deployment:
· b.Front-End Design:
The web interface will be designed with minimal complexity,
offering a simple input box for users to paste text or a
file upload option. The output will be displayed clearly,
showing the summarized content.
· a.Testing:
After deploying the system, it will be tested for various edge
cases, such as extremely long emails, emails with mixed
content (e.g., tables, images, or non-text elements), and
emails with complex language. The system’s performance
will be monitored for consistency and reliability.
· b.Optimization:
Based on test results, the system will undergo optimization,
including fine-tuning the model further, improving the UI/UX,
and ensuring that the backend can handle large datasets
without significant delays.
· 1.User Input:
The user visits the web application, either pasting email
content into a text box or uploading an email file.
· 2.Preprocessing:
The input is passed through preprocessing steps (such as
tokenization and cleaning) to prepare it for the model.
· 3.Summarization:
The preprocessed input is fed into the T5 model, which
generates a summary based on the context and content of
the email.
· 5.Further Interaction:
The user may choose to submit another email or perform
additional actions.
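The request flow above (user input, preprocessing, summarization) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the whitespace-only cleaning step, and the default checkpoint name "t5-small" are illustrative, and the fine-tuned model would be loaded from wherever it was saved. The "summarize: " prefix is the task prefix that the public T5 checkpoints were trained with.

```python
import re

T5_PREFIX = "summarize: "  # task prefix used by the public T5 checkpoints


def clean_text(text: str) -> str:
    """Collapse runs of whitespace and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def summarize(text: str, model_dir: str = "t5-small", max_new_tokens: int = 128) -> str:
    """Summarize `text` with a T5 checkpoint.

    `model_dir` is an assumption: point it at the fine-tuned model directory.
    """
    # Imported lazily so clean_text() can be used without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

    # Truncate to 512 tokens, the input length T5 was pre-trained with.
    inputs = tokenizer(
        T5_PREFIX + clean_text(text),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a web application the tokenizer and model would normally be loaded once at startup rather than on every request; they are loaded inside the function here only to keep the sketch short.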
· 1.Automated Summarization:
The system will automatically generate summaries of emails,
reducing the time and effort required to read lengthy emails.
· 2.NLP-based Techniques:
The system uses state-of-the-art transformer models, ensuring
high-quality and contextually relevant summaries.
· 4.Real-time Results:
· 5.User-Friendly Interface:
The interface will be intuitive and accessible, requiring minimal
user interaction to achieve optimal results.
· 6.Customization Options:
Users will be able to specify the length of the summary
(e.g., concise or detailed) or the level of abstraction desired.
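One possible way to expose the summary-length option is to map a user-facing choice onto generation limits. The sketch below is illustrative only: the preset names follow the "concise or detailed" wording above, but the token values and the helper name are hypothetical and would need tuning. The returned keys, `min_length` and `max_length`, are standard arguments of the Hugging Face `generate` method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LengthPreset:
    min_length: int  # lower bound on generated tokens
    max_length: int  # upper bound on generated tokens


# Hypothetical presets; the token counts are not taken from the report.
PRESETS = {
    "concise": LengthPreset(min_length=15, max_length=60),
    "detailed": LengthPreset(min_length=60, max_length=200),
}


def generation_kwargs(choice: str) -> dict:
    """Translate a user-selected length into keyword arguments for generate()."""
    # Unknown or empty choices fall back to the concise preset.
    preset = PRESETS.get(choice.strip().lower(), PRESETS["concise"])
    return {"min_length": preset.min_length, "max_length": preset.max_length}
```

The resulting dictionary could then be passed through as `model.generate(**inputs, **generation_kwargs(choice))`.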
· 3.Python:
The primary programming language for implementing the
system, used with libraries like Flask, PyTorch, and Transformers.
· 4.Flask:
A lightweight web framework for Python that will host the
system and handle user requests.
· 5.HTML/CSS:
For creating the front-end user interface.
· 6.JavaScript:
For enhancing the interactivity of the web pages.
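As a rough illustration of how Flask could handle both pasted text and uploaded files, consider the sketch below. It makes several assumptions that are not from the report: the route name `/summarize`, the form field names `text` and `file`, the allowed file extensions, and a JSON response (the project itself renders HTML templates). The summarizer is passed in as a function so the web layer stays independent of the model.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".txt", ".eml"}  # assumed upload types


def allowed_file(filename: str) -> bool:
    """Accept only uploads whose extension is in the allow-list."""
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


def create_app(summarize_fn):
    """Build a Flask app around any callable that maps text to a summary."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/summarize", methods=["POST"])
    def summarize_route():
        uploaded = request.files.get("file")
        if uploaded and uploaded.filename:
            if not allowed_file(uploaded.filename):
                return jsonify(error="unsupported file type"), 400
            text = uploaded.read().decode("utf-8", errors="replace")
        else:
            # Fall back to text pasted into the form.
            text = request.form.get("text", "")
        if not text.strip():
            return jsonify(error="no input text"), 400
        return jsonify(summary=summarize_fn(text))

    return app
```

For a quick local check, `create_app(lambda t: t[:100]).run(debug=True)` starts the server with a placeholder summarizer that simply truncates the input.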
SYSTEM REQUIREMENTS
· Python 3.x:
are essential:
· Torch (PyTorch):
· 1.PyTorch is a deep learning framework required to work
with the T5 model, as it allows for GPU acceleration and
provides tools to train and fine-tune neural networks.
· 2.Command for installation: pip install torch
· Flask:
· 1.Flask is a micro web framework for Python that will
be used to create the web application. It handles routing,
rendering HTML templates, and managing user requests.
· 2.Command for installation: pip install flask
· Pandas:
· 1.Pandas is essential for reading, manipulating, and preprocessing
data. It is used for handling the Daily Mail dataset,
cleaning, and splitting it into training and testing sets.
· 2.Command for installation: pip install pandas
· NumPy:
· 1.NumPy is a package for scientific computing in Python
and will be used for array manipulation and mathemati-
· Scikit-learn:
· 1.Scikit-learn is a machine learning library that will be useful
for splitting datasets for training and testing purposes.
ROUGE scores for model evaluation are not part of Scikit-learn
and require a separate package such as rouge-score.
· 2.Command for installation: pip install scikit-learn
· HTML/CSS/JavaScript:
HTML, CSS, and JavaScript will be used to create and
style the user interface for the web application. JavaScript
can also be used for enhancing the interactivity of the web
pages.
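Since ROUGE is named above as the evaluation metric, a small worked example may help show what it measures. The function below is a simplified, dependency-free sketch of ROUGE-1 (unigram overlap) using lowercase whitespace tokenization. It is an illustration, not the evaluation code used in the project; an established package would normally be used for reported scores, as it also handles stemming and other ROUGE variants.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For the reference "the cat sat on the mat" and the candidate "the cat sat", all three candidate words appear in the reference, so precision is 1.0, while recall is 0.5 because only three of the six reference words are covered.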
· Flask:
· 1.As mentioned, Flask will be used to handle user input,
display results, and serve the web application.
· 2.Flask Extensions: For deployment, extensions like Flask-
SocketIO (for real-time interaction) and Flask-WTF (for
form handling) may be used.
· Web Server:
· 1.Apache or Nginx can be used for deploying the application
in a production environment.
· Processor:
· RAM:
· Storage:
· Processor:
· RAM:
· GPU:
· Storage:
· Processor:
A cloud server with at least 2-4 CPU cores (for medium-sized
deployments). For high availability, multi-core instances
should be considered.
· RAM:
8GB or more of RAM for the server hosting the web
application.
· Storage:
SSD storage for fast access to the model and data, around
10-20GB depending on the deployment scale.
· Network:
A stable internet connection with sufficient bandwidth for
real-time interactions between users and the server.
Figure 5.1: The template folder contains HTML pages. The app.py contains
the Flask code where the user gives input text/mail or can even upload files
for summarization
There are many popular open sources for collecting data, e.g.,
kaggle.com and the UCI repository.
In this project, we used data in .csv format, downloaded from
kaggle.com. The dataset is available at the link given below.
Link: https://www.kaggle.com/datasets/evilspirit05/daily-mail-summarization-dataset
Once the dataset is downloaded, we read and explore it with the
help of some visualisation and analysis techniques. Note: There
are several techniques for understanding the data; only some of
them are used here, and additional techniques can be applied as
well.
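A first pass at reading and inspecting the data might look like the sketch below. It is a hedged illustration rather than the project's notebook code: the helper names are hypothetical, and the file name and column names are not stated in this report, so they must be checked against the downloaded CSV before use.

```python
from statistics import mean


def word_counts(texts):
    """Number of whitespace-separated words in each text."""
    return [len(str(t).split()) for t in texts]


def describe_lengths(texts):
    """Minimum, mean, and maximum word count over a collection of texts."""
    counts = word_counts(texts)
    return {"min": min(counts), "mean": mean(counts), "max": max(counts)}


def load_dataset(csv_path):
    """Read the CSV with pandas, dropping empty and duplicate rows."""
    import pandas as pd  # imported lazily; only needed when a file is read

    df = pd.read_csv(csv_path)
    return df.dropna().drop_duplicates().reset_index(drop=True)
```

Typical usage would be `df = load_dataset(path)`, then `print(df.columns.tolist())` to confirm the column names, followed by `describe_lengths(df[column])` for the article and summary columns. Comparing the two length distributions helps in choosing input and output token limits for the model.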
CONCLUSION
CHAPTER 7
FUTURE SCOPE
emails.
· 2.Handling Multilingual Text: Although the T5 model
is powerful, it is computationally intensive. Future versions
of the system can explore ways to reduce the model’s
size, such as through distillation or pruning, to ensure the
model is more lightweight while retaining its summarization
capabilities.