Internship Report
Subject
Web scraping & data collection
Carried Out By
Hamza Ghanmi
Host organization
Geeks Data
Academic year
2019/2020
Supervisor signature
ACKNOWLEDGEMENTS
• First of all, I would like to thank ALLAH the Almighty and Merciful, who has given us the strength and patience to do this work.
Contents
General Introduction
3.1.2 Advantages of Three-Tier Architecture
3.2 Logical Architecture
Conclusion
4 Implementation
Introduction
4.1 Work environment
4.1.1 Hardware configuration
4.1.2 Software configuration
4.2 Achieved work
4.2.1 Functions implemented
4.2.2 Project structure
4.2.3 Data collection
Conclusion
General Conclusion
Netography
General Introduction
As data grows rapidly, many companies pay close attention to it because of its importance.
The businesses that prosper through this transformation of data will be those that can recognize and take advantage of the essential subset of data that has a strong positive effect on customer experience, solves complex challenges and builds new economies of scale.
In this perspective, our project consists of scraping data from different websites such as Facebook, Twitter and Gamespot by building crawlers.
Our work is presented in this report according to the following plan: the first chapter is dedicated to the general context of the project and a brief preliminary study. The second chapter explains the requirement analysis and specification of the project, followed by the third chapter, which describes the architecture of the project. In the last chapter, we present the tools and technologies used in this project and detail the achieved work.
Chapter 1
Introduction
In this chapter, we present the general context of the project, followed by a brief study of web scraping, and we finish by describing our implemented solution.
This section includes an overview of the context of our internship, followed by a brief description of the project as well as the problem setting.
1.1.1 Overview
This work was carried out as a summer internship project with Geeks Data Consulting over two months, from July 1st to August 30th.
GEEKS DATA is a Tunisian company that provides services in data science and text mining. It supports several organizations of different sizes in their data projects and devotes part of its activity to research and innovation in predictive modelling and automatic text generation [N1].
Figure 1.1: Geeks Data logo
Web scraping, also referred to as web extraction or harvesting, is the process of extracting data from the World Wide Web and saving it to a file system or database for later retrieval or analysis. Web data is usually scraped using the Hypertext Transfer Protocol (HTTP) or through a web browser, either manually by a user or automatically by a bot or crawler.
Moreover, web scraping is commonly recognized as an accurate and efficient Big Data collection technique, which is why it is mentioned as one of the sources of Big Data collection.
The figure below shows the difference between the most common techniques for data extraction and captures the activities of web crawling and web scraping.
Figure 1.2: Web Scraping vs. Web Crawling
[N2]
• Enterprise technologies:
In larger companies, incompatible business systems are common, and there is a need for a cohesive presentation of data coming from many systems. In such cases, web scraping can be used to consolidate this data.
• Opinion Poll:
Film makers gather knowledge about their latest blockbusters. Such data takes the form of user reviews collected and summarized from film portals.
• Human Resources Agencies:
In large firms, human resources (HR) departments handle many job openings for their companies and try to match each position with prospective employees. Relying only on incoming applications from candidates is not enough, so HR departments also collaborate with third-party firms that can provide them with specialized directories of their own. For such organizations, mining this communication is a significant practice.
• Government Services:
Tracking criminal activity on social websites and specific channels is a valuable source of information for government officials and law enforcement agencies. Understandably, there are no public references for this kind of use.
• Corporate spying:
In the corporate context, web scraping helps a business analyze both its own presence and its competitors' presence in news headlines. A business may also gather information about rivals and even about its own employees.
• Manual Scraping:
This is the traditional way of gathering data, but it has many drawbacks, especially when the amount of data is large; moreover, it is very time-consuming.
• HTML Parsing:
Web pages do not always supply their data in easy-to-use file formats such as .csv or .json. Examining the HTML structure reveals repeated elements in a given web page, and each page with a similar pattern can then be used as a data source by a script written in a programming language or by a web scraping tool; a minimal sketch is given after this list.
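To make the idea concrete, here is a minimal sketch of HTML parsing in Python with requests and BeautifulSoup; the target URL and the CSS selector are illustrative assumptions and do not come from the project itself.

import requests
from bs4 import BeautifulSoup

# Illustrative target and selector: both are assumptions for this sketch.
URL = "https://www.gamespot.com/games/"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # stop here if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Repeated elements that follow the same pattern can be collected in one pass.
titles = [tag.get_text(strip=True) for tag in soup.select("h4.card-title")]
for title in titles:
    print(title)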
In our project, we implemented three crawlers, one for each target website (Facebook, Twitter and Gamespot).
We should also mention that we used Selenium, which allowed us to scrape data from dynamic websites such as Facebook and Twitter, as sketched below.
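As a rough illustration of why Selenium is useful for dynamic pages, the sketch below scrolls a page several times so that content loaded by JavaScript becomes available; the profile URL and the number of scrolls are arbitrary assumptions.

import time
from selenium import webdriver

# Hypothetical profile URL; any dynamically loaded page behaves similarly.
URL = "https://twitter.com/some_user"

driver = webdriver.Chrome()  # assumes a matching ChromeDriver is available
driver.get(URL)

# Scroll a few times so that JavaScript loads more publications.
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load the new content

html = driver.page_source  # the fully rendered HTML can now be parsed
driver.quit()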
Conclusion
This chapter presented the hosting company and the context of the project, followed by a state-of-the-art study and the solution chosen for our project.
The next chapter is devoted to specifying the requirements of our project.
Chapter 2
Introduction
In this chapter, we describe the functional and non-functional requirements of our project. Then, we present the use case diagram in order to understand the interaction between the user and the entire system.
In this part we explain the functional and non-functional requirements of our project.
• The system should enable the user to log in before use because of confidentiality
• The system should allow the user to enter a URL of a specific user
7
• The system should be fast: the scraping of data should be done in a short time so that the system feels responsive
• Performance: the system should be optimized and efficient at the same time
The use case diagram shows the interaction between users and the application.
In our application, we should mention that we have just one actor, which is the user.
The following figure shows the use case diagram of our system.
Conclusion
Chapter 3
Introduction
In this chapter, we show and explain the architecture of our project in order to understand the design of each component of the system.
In this section we describe the adopted architecture for the application and its different components.
The Three-Tier Architecture is a logical model of the application architecture composed of three layers: the client application layer, the application layer and the data layer.
The following figure represents the Three-Tier Architecture [N3].
Figure 3.1: Three-Tier Architecture
The reason we have used this architecture is that it is well structured and offers good data management and manipulation in our case.
The client side refers to the personal computer and can be called a thick client because it handles the data processing, while the website and the web server hosting the required data constitute the server side. The connection between the client and the web server relies on the HTTP protocol.
Figure 3.2: Logical Architecture
[N4]
• Request: also known as an HTTP request, a packet of information sent from one computer to another in order to communicate. At its heart, an HTTP request is a binary data packet that the client sends to the server.
• Response: also known as an HTTP response, a packet of information that indicates whether a particular HTTP request was successfully completed.
• Source: the primary script, run on the client side, which extracts the required data content from the returned packet if the response is successful.
• Save to data: after scraping, the data can be stored in the database in JSON format; a minimal end-to-end sketch of this flow follows this list.
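The sketch below illustrates this request/response/extract/save flow for a static page; the URL, the selector and the output file name are illustrative assumptions.

import json
import requests
from bs4 import BeautifulSoup

# Request: the client sends an HTTP request to the server (illustrative URL).
response = requests.get("https://www.gamespot.com/games/", timeout=10)

# Response: continue only if the request was successfully completed.
if response.ok:
    # Source: extract the required content from the returned packet.
    soup = BeautifulSoup(response.text, "html.parser")
    records = [{"title": tag.get_text(strip=True)} for tag in soup.select("h4.card-title")]

    # Save to data: store the scraped data in JSON format.
    with open("data.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)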
Conclusion
In this chapter, we introduced the general design of our project by explaining the logical and physical architecture of the project.
In the following chapter we present the tools used for the implementation and the achieved work.
Chapter 4
Implementation
Introduction
In this chapter, we briefly present the technical tools we have chosen to create our application.
• RAM: 4.00 GB.
The software environment used in this project consists of the following tools:
• Selenium: a tool that automates web browsers and performs tasks as a human being would, such as clicking buttons, entering information in forms and searching for particular information on web pages.
• Chrome Driver and Firefox Driver: drivers that provide a platform to perform these tasks in a specified browser (see the sketch below).
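A minimal sketch of how one of these drivers might be initialized and used to fill in a form is shown below; the page and the form field name are assumptions made for illustration only.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Either browser can be driven; Chrome is shown here.
driver = webdriver.Chrome()            # or: driver = webdriver.Firefox()
driver.get("https://duckduckgo.com/")  # illustrative page

# Enter information in a form and submit it, as a human user would.
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("web scraping")
search_box.submit()

print(driver.title)
driver.quit()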
For this project we implemented a set of functions to extract data from each website; they are detailed as follows:
Figure 4.2: Example of get_publications
• get_comment_by_key(key): function that gets all comments from the given key (name).
• get_comment_by_keys([key1, key2]): function that gets all comments from all the given keys.
• get_images(user_id, n): function that gets the first n images posted by the user.
Figure 4.4: Example of get_images
• get_friends(user_id): function that gets the followers/friends list of the user.
• get_following(user_id): function that gets the list of accounts the user follows (specific to Twitter).
• get_react(publication_id): function that gets all reactions to the post and its comments.
• get_likes(user_id): function that gets the details of all pages that the user follows.
• get_events(user_id): function that gets the details of all events the user participates in.
• save_data(): function that saves the data in a JSON file and downloads all pictures from their URLs.
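As an illustration, the snippet below sketches how these functions might be combined in a driver script; the class name FacebookCrawler, the module name and the example URL are hypothetical, and only the function names come from the list above.

# Hypothetical driver script: class, module and URL are assumptions for illustration.
from facebook_crawler import FacebookCrawler

crawler = FacebookCrawler("https://www.facebook.com/some.user")

comments = crawler.get_comment_by_key("john")   # comments matching a name
images = crawler.get_images("some.user", 30)    # first 30 images posted by the user
friends = crawler.get_friends("some.user")      # followers/friends list
events = crawler.get_events("some.user")        # events the user participates in

crawler.save_data()   # writes data.json and downloads the pictures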
Each crawler of our project has the structure of a tree of depth 2: the root node is the URL of a user X that we want to scrape. The crawler scrapes the data of user X and of all the people who commented on his first n publications; then, for each of these users, it scrapes all the people who commented on their own first n publications. This traversal is sketched below.
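The following sketch illustrates this depth-2 traversal; the helper functions scrape_user and get_commenters are hypothetical names standing in for the functions described above.

# Depth-2 crawl sketch: scrape_user and get_commenters are hypothetical placeholders.
def crawl(root_url, n=30, max_depth=2):
    to_visit = [(root_url, 0)]
    seen = set()

    while to_visit:
        url, depth = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)

        scrape_user(url, n)  # scrape the user's data and first n publications

        if depth < max_depth:
            # People who commented on those publications become the next level.
            for commenter_url in get_commenters(url, n):
                to_visit.append((commenter_url, depth + 1))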
To install each crawler (Facebook-Crawler, Twitter-Crawler and GameSpot-Crawler), run the following command line in the directory containing the setup file:
pip install .
The following picture shows how to run the project in order to launch the automatic scraping from a command prompt (note that this command should be run in the directory where the main file exists). An illustrative invocation is given below.
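Since the referenced screenshot is not reproduced here, the following is a hypothetical example of such an invocation; the script name main.py and the profile URL argument are assumptions, not the project's exact command:
python main.py "https://www.facebook.com/some.user"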
4.2.3 Data collection:
The data collection campaign lasted more than two weeks, during which we extracted data from 1000 Facebook users and 250 Twitter users.
For each user we scraped the first 30 publications with images and the first 30 shared images.
1- Facebook data: here is an example of part of the data collected from a Facebook user.
Figure 4.10: All data collected from a Facebook user
2- Twitter data: here is an example of part of the data collected from a Twitter user.
Figure 4.12: All data collected from a Twitter user
For each user, we stored his data (images + data.json, which contains the data in JSON format) in a folder named after the full name of the user, as illustrated below.
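As an illustration, the resulting folder might look like the following, where the user name and file names are hypothetical examples:

John Doe/
    data.json      (scraped data in JSON format)
    image_1.jpg    (downloaded pictures)
    image_2.jpg
    ...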
Conclusion
In this chapter, we presented our work environment and detailed the tools used in this project. We then explained the achieved work by describing all the functions developed to extract data, as well as our methodology for collecting it, and we showed some examples of these data.
General Conclusion
There has always been a great deal of interest in collecting data because of its importance and growth; from this perspective, different web scraping techniques can be used to collect this amount of data.
This project consisted of implementing three crawlers that extract data from different websites such as Facebook, Twitter and Gamespot, and it was carried out as part of an experience at Geeks Data Consulting.
The project was not a simple task at all; in fact, we faced some problems, such as handling scrolling in dynamic websites like Facebook and Twitter, and each time there is an update or change in one of the websites we have to adapt the project accordingly. Looking ahead, the scraped data can be analysed and machine learning algorithms can be built to classify users based on their publications and images.
Netography
[N1] https://www.geeksdata.fr/ (consulted on 08/2020)
[N2] http://prowebscraping.com/web-scraping-vs-web-crawling/ (consulted on 08/2020)
[N3] https://www.softwaretestingclass.com/what-is-difference-between-two-tier-and-three-tier-architecture/ (consulted on 09/2020)
[N4] https://www.pinterest.com/pin/401031541793486742/ (consulted on 08/2020)