forked from trendsci/linkrun

LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship


sitedata/linkrun

 
 

LinkRun

A pipeline to analyze popularity of domains across the web.

LinkRun web application
Demo: YouTube video of the LinkRun web application
Presentation: Google Slides

Idea

LinkRun is a data engineering project that ranks the popularity of millions of websites. I built LinkRun in 3 weeks during the Insight Data Engineering fellowship.

LinkRun processes data from over 2.6 billion web pages (>17 terabytes of compressed data) and analyzes all the links present on those pages. LinkRun then counts how many times each domain was linked, filters out links based on specific criteria, and stores the results in a database (>47 million rows). The resulting database can be queried to gain insight into the popularity of millions of websites across the internet. A custom web application lets users view and query the data by entering their favorite websites into a search box.

The LinkRun database contains data for over 47 million unique subdomain.domain entries from over 25 million unique websites.
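The counting step described above can be illustrated with a minimal, standard-library sketch. It is not the LinkRun implementation, which runs on Spark; the function names `link_host` and `count_linking_pages` are hypothetical, and the host handling is an assumed simplification (it only strips a leading "www." and does not apply LinkRun's own filtering criteria).

```python
from collections import Counter
from urllib.parse import urlsplit


def link_host(url):
    """Return the lowercased host of an absolute http(s) URL, or None."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # malformed URL
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # relative links, mailto:, javascript:, etc.
    host = parts.hostname.lower()
    # Assumed simplification: only strip a leading "www."; finding the
    # registered domain properly would need the Public Suffix List.
    return host[4:] if host.startswith("www.") else host


def count_linking_pages(pages):
    """pages: iterable of (page_url, [link_url, ...]) pairs.

    Each host is counted at most once per page, so the result is the
    number of linking pages rather than the raw number of links.
    """
    counts = Counter()
    for _page_url, links in pages:
        hosts = {link_host(link) for link in links}
        hosts.discard(None)
        counts.update(hosts)
    return counts


# Hypothetical sample input, for illustration only.
sample_pages = [
    ("https://a.example/1", ["https://www.facebook.com/x", "https://facebook.com/y"]),
    ("https://b.example/2", ["https://youtube.com/watch", "mailto:me@b.example"]),
    ("https://c.example/3", ["https://facebook.com/", "/relative/path"]),
]
counts = count_linking_pages(sample_pages)
```

Counting each host once per page is one possible reading of "number of linking pages"; counting every link occurrence instead would only require updating the counter with a list rather than a set.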

Example output (click to see a video demo):

| Domain | Number of linking pages |
| --- | --- |
| facebook.com | 1,123,535,234 |
| youtube.com | 478,735,963 |
| ... | ... |
| yoursite.com | 208,666 |
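Looking up a few domains in the results could be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the Postgres instance, and the table name `link_counts`, its column names, and the `lookup` helper are assumptions, not LinkRun's actual schema. The rows are the example values shown above.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link_counts (domain TEXT PRIMARY KEY, linking_pages INTEGER)"
)
conn.executemany(
    "INSERT INTO link_counts VALUES (?, ?)",
    [
        ("facebook.com", 1123535234),
        ("youtube.com", 478735963),
        ("yoursite.com", 208666),
    ],
)


def lookup(conn, domains):
    """Return (domain, linking_pages) rows for the given domains, most linked first."""
    domains = list(domains)
    if not domains:
        return []
    # One placeholder per domain keeps the query parameterized.
    placeholders = ", ".join("?" for _ in domains)
    query = (
        "SELECT domain, linking_pages FROM link_counts "
        f"WHERE domain IN ({placeholders}) ORDER BY linking_pages DESC"
    )
    return conn.execute(query, domains).fetchall()


rows = lookup(conn, ["yoursite.com", "facebook.com"])
```

Against Postgres the same query shape would apply, though the placeholder syntax depends on the database driver in use.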

The Pipeline

[Figure: LinkRun pipeline diagram]

The Data

LinkRun uses data from Common Crawl, which publishes new web crawl data each month. LinkRun has analyzed the July 2019 Common Crawl data set, which contains >2.5 billion web pages and >17 terabytes of compressed data.
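To give a sense of what the link data might look like, here is a small sketch that pulls outgoing link URLs from one Common Crawl WAT metadata record. This README does not say which Common Crawl file type LinkRun reads, so the use of WAT records is an assumption, as is the helper name `extract_links`. The nested JSON keys follow the WAT layout as I understand it and should be checked against real data before use.

```python
import json


def extract_links(wat_record_json):
    """Return the link URLs listed in one WAT record (JSON text)."""
    record = json.loads(wat_record_json)
    # Assumed WAT layout: links sit under the HTML metadata of the response.
    html_meta = (
        record.get("Envelope", {})
        .get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    return [link["url"] for link in html_meta.get("Links", []) if "url" in link]


# Hand-built record for illustration; real records carry many more fields.
sample_record = json.dumps(
    {
        "Envelope": {
            "Payload-Metadata": {
                "HTTP-Response-Metadata": {
                    "HTML-Metadata": {
                        "Links": [
                            {"path": "A@/href", "url": "https://youtube.com/watch"},
                            {"path": "IMG@/src", "url": "/static/logo.png"},
                            {"path": "A@/href", "text": "no url here"},
                        ]
                    }
                }
            }
        }
    }
)
links = extract_links(sample_record)
```

Records without HTML metadata, such as request records, simply yield an empty list, so the function can be mapped over every record in a file.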

How to run LinkRun on your own

LinkRun can run on any resource that supports the applications used in the pipeline. For best results, run LinkRun on the following AWS-provisioned resources:

  • AWS Elastic MapReduce (EMR) release 5.26.0, running Spark 2.4.3
    • Bootstrap the cluster using the files linkrun_emr_bootstrap.sh and sample_secrets/sample_secret_bootstrap.sh (update this file with your configurations).
  • AWS RDS running Postgres 10.6
  • (Optional) EC2 instance with the Dash web UI
    • To configure the connection between the web UI and the database, run sample_secrets/sample_webapp_secrets.sh, then run source ~/.bashrc

To run the LinkRun pipeline, use run.sh. To change which data LinkRun processes, edit the src/automation/config.json file.

Thank you for visiting LinkRun!

