Global News 60K

Abstract

Text classification systems have become increasingly important in recent years due to the explosion of online documents and the need to sort them for specific services. One of the most critical issues in text classification is the limited availability and diversity of datasets, which can lead to overfitting and poor generalization. In this context, we present a new dataset named Global News 60K (GN60K), which consists of 60,000 news articles drawn from sources in different parts of the world and covering 10 topics. The dataset offers a rich vocabulary, which is intended to reduce overfitting and help models generalize better.

The topics included in the dataset are Politics, Sports, Entertainment, Science and Technology, Business, Health, Environment, Education, Arts and Culture, and Crime. We selected these topics because they cover a wide range of interests and are commonly used in text classification applications. To further increase the dataset's diversity, we considered articles from different parts of the world, including North America, Europe, Asia, Africa, and South America.

The articles were selected based on their publication dates, which range from 2022 to 2023.

We believe that our dataset will be valuable for researchers and practitioners working on text and topic classification tasks. The GN60K dataset provides a diverse and well-labelled set of documents that can be used for training and testing various machine learning models. Additionally, the dataset can be used to develop new algorithms for topic classification and related tasks. We hope that our dataset will contribute to the advancement of the text classification field and foster new research ideas.

Instructions:

Data Format

The dataset is provided in CSV format, with one row per news article. Each row contains the following fields:

· TITLE: Title of the news article.

· TEXT: Content of the news article.

· TOPIC: Topic of the news article.
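As a minimal sketch of reading this format, the rows can be streamed with Python's standard csv module. It assumes the file is UTF-8 encoded and starts with a header row carrying the three column names exactly as listed above; neither detail is stated here. The function name iter_articles is hypothetical.

```python
import csv
from typing import Iterable, Iterator

# Column names taken from the field list above; the header row is an assumption.
EXPECTED_FIELDS = ("TITLE", "TEXT", "TOPIC")


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per article from an open CSV file or any iterable of lines."""
    reader = csv.DictReader(lines)
    # Fail early if the header does not match the documented fields.
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    for row in reader:
        # Keep only the documented fields, in a fixed order.
        yield {field: row[field] for field in EXPECTED_FIELDS}
```

A typical call would open GN60K.csv with newline="" and encoding="utf-8" and pass the file object to iter_articles. Streaming row by row avoids holding the whole 180 MB file in memory at once.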

List of Topics

This dataset contains a collection of news articles labelled with one of 10 topics. The topics, listed in alphabetical order, are: Arts & Culture, Business & Economy, Crime & Security, Entertainment & Celebrity, Health & Education, Politics, Science, Sports, Tech, and Weird News.
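For training a classifier, the topic strings usually need to be mapped to integer ids. The sketch below builds that mapping from the ten labels listed above and checks incoming labels against it. It assumes the TOPIC column uses exactly these spellings, which should be verified against the file; the names TOPICS, encode_topics, and topic_counts are hypothetical.

```python
from collections import Counter

# The ten labels listed above, in the same alphabetical order.
# Assumption: the TOPIC column uses exactly these strings.
TOPICS = [
    "Arts & Culture", "Business & Economy", "Crime & Security",
    "Entertainment & Celebrity", "Health & Education", "Politics",
    "Science", "Sports", "Tech", "Weird News",
]
TOPIC_TO_ID = {topic: i for i, topic in enumerate(TOPICS)}


def encode_topics(labels):
    """Map topic strings to integer ids, raising on any label not in TOPICS."""
    unknown = sorted(set(labels) - set(TOPIC_TO_ID))
    if unknown:
        raise ValueError(f"unexpected topic labels: {unknown}")
    return [TOPIC_TO_ID[label] for label in labels]


def topic_counts(labels):
    """Count articles per topic, including topics with zero articles."""
    counts = Counter(labels)
    return {topic: counts.get(topic, 0) for topic in TOPICS}
```

Running topic_counts over the TOPIC column is a quick way to see how the articles are distributed across the ten classes before choosing a train/test split.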

Sources

The dataset was constructed from several news sources. Each source is listed below with its country (where given) and the topics drawn from it.

· Breitbart.com (USA): Politics, Sports, Business & Economy, Tech, Entertainment & Celebrity

· Bristolpost.co.uk: Entertainment & Celebrity, Health & Education, Crime & Security

· Cnet.com (USA, UK, AUS): Politics, Tech

· Csmonitor.com (USA): Science, Arts & Culture

· Dailycaller.com (USA): Business & Economy, Entertainment & Celebrity, Health & Education, Sports, Politics

· Mirror.co.uk: Crime & Security, Weird News

Funding Agency:

This work has been partially funded by the Ministero dell’Istruzione, dell’Università e della Ricerca (MIUR) with the PON “Ricerca e Innovazione” 2014-2020 (PON RI) “Azione IV.5 Dottorati su tematiche green”, assigned with D.M. 1062 on 10.08.2021.

Dataset Files

GN60K.csv (180.07 MB)
