
Ref: 2020/Summer internship

MINISTRY OF HIGHER EDUCATION AND SCIENTIFIC RESEARCH


UNIVERSITY OF LA MANOUBA
NATIONAL SCHOOL OF COMPUTER SCIENCES

SUMMER INTERNSHIP REPORT

Subject
Web scraping & data collection

Carried Out By
Hamza Ghanmi

Organism

Organism: Geeks DATA Consulting
CEO: M. Aymen Khelifi
Supervised by: M. Sofiane Amokrane
Address: Pôle Technologique El Gazela, Ariana
Phone: +216 70 834 069

Academic year
2019/2020
Supervisor signature

ACKNOWLEDGEMENTS

• First of all, I would like to thank ALLAH the Almighty and Merciful, who has given me
the strength and patience to do this work.

• My sincere thanks also go to M. Sofiane Amokrane who, as my supervisor, was always
open to listening and very helpful throughout the entire internship; I also thank him for the
inspiration, support, and time he generously gave me.

Contents

General Introduction

1 General Context of the Internship
Introduction
1.1 Context of the project
1.1.1 Overview
1.1.2 Presentation of the hosting company
1.2 Preliminary Study
1.2.1 General facts about Web Scraping
1.2.2 Goal of Web Scraping
1.2.3 Web Scraping methods
1.2.4 Our solution
Conclusion

2 Requirement Analysis and Specification
Introduction
2.1 Requirements analysis and specification
2.1.1 Functional requirements
2.1.2 Non-functional requirements
2.1.3 Use case diagram
Conclusion

3 Design of the System
Introduction
3.1 General Design
3.1.1 The Three-Tier Architecture
3.1.2 Advantages of the Three-Tier Architecture
3.2 Logical Architecture
Conclusion

4 Implementation
Introduction
4.1 Work environment
4.1.1 Hardware configuration
4.1.2 Software configuration
4.2 Achieved work
4.2.1 Functions implemented
4.2.2 Project structure
4.2.3 Data collection
Conclusion

General Conclusion

Netography

List of Figures

1.1 Geeks Data logo
1.2 Web Scraping vs. Web Crawling
1.3 Structure of HTML pages
2.1 Use case Diagram
3.1 Three-Tier Architecture
3.2 Logical Architecture
4.1 example of user info
4.2 example of get publications
4.3 example with the key (love)
4.4 example of get images
4.5 example of get followers
4.6 example of get likes
4.7 example of get events
4.8 Command prompt
4.9 example from JSON file (Facebook)
4.10 all data collected (Facebook)
4.11 example from JSON file (Twitter)
4.12 all data collected (Twitter)

General Introduction

The World Wide Web (WWW) consists of an interlinked knowledge network that is provided
to users via websites. The way we share, gather, and publish data has fundamentally changed
the World Wide Web, and the amount of data presented is continuously increasing.

As this data grows rapidly, many companies turn their attention to it because of its importance.
The businesses that prosper through this transformation of data will be those that can recognize
and take advantage of the essential subset of data that has a strong positive effect on customer
experience, solves complex challenges, and builds new economies of scale.

In this perspective, our project consists of scraping data from different websites such as Face-
book, Twitter, and GameSpot by building crawlers.

Our work is presented in this report according to the following plan. The first chapter is
dedicated to the general context of the project and a brief preliminary study. The second chapter
explains the requirement analysis and specification of the project, followed by the third chapter,
which explains the architecture of the project. In the last chapter, we present the tools and
technologies used in this project and detail the achieved work.

Chapter 1

General Context of the Internship

Introduction

In this chapter, we present the general context of the project, followed by a brief study
of web scraping, and we finish by describing our implemented solution.

1.1 Context of the project

This section includes an overview of the context of our internship, followed by a brief
description of the project as well as the problem setting.

1.1.1 Overview

This work was carried out as a two-month summer internship project at Geeks Data
Consulting, from July 1st to August 30th.

1.1.2 Presentation of the hosting company

GEEKS DATA is a Tunisian company that provides services in data science and text min-
ing. It supports several organisations of different sizes in their data projects, and devotes part
of its activity to research and innovation in predictive modeling and automatic text generation
[N1].

Figure 1.1: Geeks Data logo

1.2 Preliminary Study

1.2.1 General facts about Web Scraping

Web scraping, also referred to as web extraction or harvesting, is a technique for extracting
data from the World Wide Web and saving it to a file system or database for later retrieval or
analysis. Web data is usually scraped using the Hypertext Transfer Protocol (HTTP) or through
a web browser, either manually by a user or automatically by a bot or web crawler.
Moreover, web scraping is commonly recognized as an accurate and efficient big data collection
technique, which is why it is cited as one of the sources of big data collection.
The figure below shows the difference between the two most common techniques for data
extraction and captures the activities of web crawling and web scraping.

Figure 1.2: Web Scraping vs. Web Crawling
[N2]

1.2.2 Goal of Web Scraping

We mention below some purposes of web scraping:

• Market analysis and research:

Consumers express their insights, frustrations, and inspirations in the online environ-
ment. Businesses wanting to learn more about their customers can draw on these online
sources of knowledge, and web scraping is one of the methods of gathering such data.

• Enterprise technologies:
In larger ventures, incompatible business systems are common, so there is a need for a
cohesive presentation of data drawn from many systems. In certain cases, the web scrap-
ing approach can be used to consolidate such data.

• Opinion Poll:
Film makers gather knowledge about their latest blockbusters. Such data includes user
reviews posted in summaries on film portals.

• Human Resources Agencies:
In large firms, human resources (HR) departments process several job openings for their
firms and try to match positions with prospective employees. Relying only on incom-
ing applications from candidates is not enough, so HR departments also collaborate
with third-party firms, which can provide them with specialized directories of their
own. For such organizations, communication mining is a significant practice.

• Social Network mining:

Over the past decade, social media (such as blogs, online social networks, and microblogs)
has become one of the main sources of data for quantitative communication analysis.
Researchers can retrieve specific messages from social media sites for different research
purposes using simple programming tools.

• Government Services:
Tracking criminal activity on social websites and specific channels is a valuable source of
information for government officials and law enforcement agencies. Understandably,
there are few public references for this kind of use.

• Corporate spying:
In the corporate context, web scraping helps a business analyze both its own presence
and its competitors' presence in news headlines. A business may also gather information
about rivals and even about its own employees.

• Social Mining and Sentiment Analysis:

Social media is a modern data source that differs greatly from traditional ones. Social
media data sets are mostly created by users and are broad, interlinked, and somewhat
heterogeneous. Social media data can be accessed from publicly accessible sources by
various means, such as scraping, using site-provided APIs, and crawling.

1.2.3 Web Scraping methods

In web scraping, there are two main techniques:

• Manual Scraping:
This is the traditional way of gathering data, but it has many drawbacks, especially when
the amount of data is large; moreover, it takes a lot of time.

• HTML Parsing:
Web pages do not always supply their data in convenient file formats such as .csv or .json.
Examination of the HTML structure will reveal repeated elements in a given web page,
and each page with a similar pattern can then be used as a data source using a
programming-language script or a web scraping tool.

Figure 1.3: Structure of HTML pages
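To make the idea concrete, here is a minimal sketch of HTML parsing using only Python's
standard library; the target URL and the choice of <a> tags are illustrative only, and the
crawlers described later use Selenium instead:

# A minimal HTML-parsing sketch using only the standard library.
# The target URL is a placeholder; the real crawlers use Selenium.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found on the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = urlopen("https://example.com").read().decode("utf-8", errors="replace")
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # repeated elements recovered from the page structure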

1.2.4 Our solution

In our project we implemented three crawlers each one will be for a specific website
(Facebook,Twitter and Gamespot).
Also we should mention that we used Selenium which allow us to scrape data from dynamic
websites like Facebook and Twitter.
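The following hedged sketch shows how Selenium can scroll a dynamic page so that lazily
loaded content appears before scraping; the profile URL and the number of scroll steps are
assumptions, and ChromeDriver must be available on the PATH:

# A sketch of scrolling a dynamic page with Selenium.
# The URL and the number of scroll steps are illustrative only.
import time
from selenium import webdriver

driver = webdriver.Chrome()  # requires ChromeDriver on the PATH
driver.get("https://twitter.com/SomeUser")  # placeholder profile URL

for _ in range(5):  # scroll a few times to trigger lazy loading
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load new content

page_source = driver.page_source  # HTML after the dynamic content has loaded
driver.quit()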

Conclusion

This chapter included a presentation of the hosting company and the context of the
project and followed by study of the art with the solution for our project.
In the next chapter,will be devoted for specifying the Requirements of our project

Chapter 2

Requirement Analysis and Specification

Introduction

In this chapter, we will describe the functional and non-functional requirements of our
project. Then, we will show the use case diagram by in order to understand the interaction
between the user and the entire system.

2.1 Requirements analysis and specification

In this part, we explain the functional and non-functional requirements of our project.

2.1.1 Functional requirements

The functional requirements of the project are:

• The system should enable the user to log in before use, for confidentiality reasons

• The system should allow the user to enter the URL of a specific user

• The system should scrape data from the website

• The system should store the scraped data in the database

2.1.2 Non-functional requirements

The non-functional requirements of the project are:

• The system should be easy to use

• The system should be fast: the scraping of data should be completed in a short time to
provide a responsive system

• Performance: the system should be optimized and efficient at the same time

2.1.3 Use case diagram

The use case diagram shows the interaction between users and the application.
In our application, we have just one actor: the user.
The following figure shows the use case diagram of our system.

Figure 2.1: Use case Diagram

Conclusion

In this chapter, we detailed the specification of our project.


In the following chapter, we will go deeper into the architecture of our project in order to
understand it better.

Chapter 3

Design of the System

Introduction

In this chapter, we present and explain the architecture of our project in order to under-
stand the design of each component of the system.

3.1 General Design

In this section, we describe the adopted architecture of the application and its different
components.

3.1.1 The Three-Tier Architecture

The Three-Tier Architecture is a logical model of application architecture composed of
three layers: the client application layer, the application layer, and the data layer.
The following figure represents the Three-Tier Architecture [N3].

Figure 3.1: Three-Tier Architecture

3.1.2 Advantages of the Three-Tier Architecture

We chose this architecture because it is well structured and offers good data management
and manipulation in our case.
The client side refers to the personal computer and can be called a thick client because it
handles data processing, while the website and the web server hosting the necessary data
make up the server side. The HTTP protocol handles the connection between the client and
the web server.

3.2 Logical Architecture

The following figure represents the logical architecture of our project.

Figure 3.2: Logical Architecture
[N4]

• Request: also known as an HTTP request, a packet of information sent from one com-
puter to another in order to communicate. At its heart, an HTTP request is a binary data
packet that the client sends to the server.

• Response: also known as an HTTP response, a packet of information that indicates
whether a particular HTTP request was successfully completed.

• Source: the primary script which runs on the client side and extracts the required data
content from the returned packet if the response is successful.

• Save to data: after scraping, the data can be stored in the database in JSON format.
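A minimal sketch of this request, response, extract, and save flow, using only the Python
standard library (the URL and the JSON record are illustrative):

# A minimal sketch of the Request -> Response -> Source -> Save flow above.
import json
from urllib.request import urlopen

response = urlopen("https://example.com/page")  # Request sent, Response received
if response.status == 200:  # the request completed successfully
    html = response.read().decode("utf-8", errors="replace")  # Source extracts content
    record = {"url": response.url, "length": len(html)}  # illustrative record
    with open("data.json", "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)  # Save to data in JSON format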

Conclusion

In this chapter, we introduced the general design of our project by explaining its three-tier
and logical architectures.
In the following chapter, we present the tools used for the implementation and the achieved
work.

Chapter 4

Implementation

Introduction

In this chapter, we briefly present the technical tools we chose to build our application,
before detailing the achieved work.

4.1 Work environment

4.1.1 Hardware configuration

I developed this work on a PC whose configuration is as follows:

• Operating system: Windows 10.

• Processor: Intel(R) Pentium(TM) CPU @ 2.16 GHz.

• RAM: 4.00 GB.

4.1.2 Software configuration

The software environment used in this project comprises the following tools:

• Python: the programming language used to implement this project.

• Selenium: a web browser automation tool. It is mainly used for automating web appli-
cations for testing purposes, but it is by no means limited to that: it makes it possible to
open a browser of your choice and perform tasks as a human would, such as clicking
buttons, entering information in forms, and searching for particular information on web
pages.

• Chrome Driver and Firefox Driver: drivers that provide a platform for Selenium to per-
form tasks in a specified browser.

• Click: a Python package for creating command-line interfaces.

• urllib: a Python package for working with URLs.

4.2 Achieved work

4.2.1 Functions implemented:

For this project, we implemented a set of functions to extract data from each website; they
are detailed below, and a hedged sketch of two of them is given after the list:

• get_browser(user_id, password): Function that opens the browser and redirects it to the
website (Facebook, Twitter, or GameSpot).

• get_user_info(user_id): Function that gets the user's profile info (name, age, gender,
number of followers/friends, number of posts, number of images/videos, etc.).

Figure 4.1: example of user info

• get_publications(user_id, n, comments(T/F)): Function that gets the first n publications
posted by the user; if comments=True, it also gets all the comments on each publication.

Figure 4.2: example of get publications

• get_comment_by_publication(publication_id, n): Function that gets the first n comments
of a publication, using publication_id.

• get_comment_by_key(key): Function that gets all comments containing the given keyword.

Figure 4.3: example with the key (love)

• get_comment_by_keys([key1, key2]): Function that gets all comments for each of the given keywords.

• get_images(user_id, n): Function that gets the first n images posted by the user.

Figure 4.4: example of get images

• get_friends(user_id): Function that gets the full followers/friends list of the user.

Figure 4.5: example of get followers

• get_following(user_id): Function that gets the list of accounts the user follows (specific
to Twitter).

• get_react(publication_id): Function that gets all reactions to a post and its comments.

• get_likes(user_id): Function that gets the details of all pages that the user follows.

Figure 4.6: example of get likes

• get_events(user_id): Function that gets the details of all events in which the user participates.

Figure 4.7: example of get events

• save_data(): Function that saves the data in a JSON file and downloads all pictures from
their URLs.
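The sketch below gives a hedged outline of what get_browser and save_data might look like;
the login URL, the element locators, and the file layout are assumptions for illustration, not
the exact production code:

# A hedged outline of get_browser and save_data; the login URL, element
# locators, and file layout are assumptions, not the exact production code.
import json
import os
from urllib.request import urlretrieve
from selenium import webdriver

def get_browser(user_id, password):
    """Open a browser and log in to the target website."""
    driver = webdriver.Chrome()
    driver.get("https://www.facebook.com/login")  # placeholder login page
    driver.find_element_by_id("email").send_keys(user_id)  # assumed field id
    driver.find_element_by_id("pass").send_keys(password)  # assumed field id
    driver.find_element_by_name("login").click()           # assumed button name
    return driver

def save_data(data, image_urls, folder):
    """Save scraped data as data.json and download every image."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "data.json"), "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    for i, url in enumerate(image_urls):
        urlretrieve(url, os.path.join(folder, "image_%d.jpg" % i))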

4.2.2 Project structure:

Each crawler in our project follows a tree structure of depth 2: the root node is the URL
of a user X to scrape. The crawler scrapes the data of user X and of all the people who
commented on their first n publications; then, for each of those users, it scrapes all the people
who commented on their first n publications. A sketch of this traversal is given below.
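A hedged sketch of this depth-2 traversal follows; scrape_user and get_commenters are
hypothetical stand-ins for the crawler's real functions:

# A sketch of the depth-2 crawl; scrape_user and get_commenters are
# hypothetical stand-ins for the crawler's real functions.
def crawl(root_url, n, max_depth=2):
    seen = set()           # avoid scraping the same profile twice
    frontier = [root_url]  # level 0: the user X given on the command line
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            data = scrape_user(url, n)  # hypothetical: scrape one profile
            if depth < max_depth:       # commenters feed the next level
                next_frontier.extend(get_commenters(data))
        frontier = next_frontier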
To install each crawler (Facebook-Crawler, Twitter-Crawler, and GameSpot-Crawler), run the
following command line in the directory containing the setup file:
pip install .
The following picture shows how to run the project in order to launch the automatic scraping
from a command prompt (note that this command must be run in the directory containing the
main file), where:

Figure 4.8: Command prompt

• url: URL of a (Facebook, Twitter, or GameSpot) user.

• n: number of publications to extract.

• s: indicates whether or not to download the data.

• usr and pwd: login and password of the user.
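A minimal Click-based sketch of such a command-line interface is given below; the option
names mirror the parameters above but are assumptions, and the real project's names may
differ:

# A minimal Click sketch of the crawler's command-line interface;
# the option names are assumptions mirroring the parameters above.
import click

@click.command()
@click.option("--url", required=True, help="URL of the user profile to scrape.")
@click.option("--n", default=30, help="Number of publications to extract.")
@click.option("--s", is_flag=True, help="Download the scraped data.")
@click.option("--usr", required=True, help="Login of the scraping account.")
@click.option("--pwd", required=True, help="Password of the scraping account.")
def main(url, n, s, usr, pwd):
    """Launch the automatic scraping for one user profile."""
    click.echo("Scraping %d publications from %s (download=%s)" % (n, url, s))
    # ... the crawler would be invoked here ...

if __name__ == "__main__":
    main()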

4.2.3 Data collection:

The data collection campaign lasted more than two weeks, during which we extracted
data from 1000 Facebook users and 250 Twitter users.
For each user, we scraped the first 30 publications with images and the first 30 shared images.
1- Facebook data: here is an example of part of the data collected from a Facebook user.

Figure 4.9: example from JSON file (Facebook)

Figure 4.10: all data collected (Facebook)

2- Twitter data: here is an example of part of the data collected from a Twitter user.

Figure 4.11: example from JSON file (Twitter)

Figure 4.12: all data collected (Twitter)

For each user, we stored the data (images + data.json, which contains the data in JSON format)
in a folder named after the full name of the user.

Conclusion

In this chapter, we presented our work environment and detailed the tools used in this
project. We then explained the achieved work by describing all the functions developed to
extract data, as well as our methodology for collecting it, and we showed some examples of
the collected data.

General Conclusion

There has always been a great deal of interest in collecting data because of its impor-
tance and growth; in this perspective, different web scraping techniques can be used to collect
this amount of data.

This project consisted of implementing three crawlers that extract data from different web-
sites, namely Facebook, Twitter, and GameSpot; it was carried out as part of an internship at
Geeks Data Consulting.

This project was not a simple task at all. We faced problems such as scrolling dynamic
websites like Facebook and Twitter, and each time a website is updated or changed, the project
has to be adapted accordingly. As future work, the scraped data could be analyzed and
machine learning algorithms built to classify users based on their publications and images.

Netography

[N1] https://www.geeksdata.fr/ (consulted on 08/2020)
[N2] http://prowebscraping.com/web-scraping-vs-web-crawling/ (consulted on 08/2020)
[N3] https://www.softwaretestingclass.com/what-is-difference-between-two-tier-and-three-tier-architecture/ (consulted on 09/2020)
[N4] https://www.pinterest.com/pin/401031541793486742/ (consulted on 08/2020)
