Best Paper Award at the 1st Workshop on Preparing Good Data for Generative AI: Challenges and Approaches (Good-Data @ AAAI 2025)
SciEvo is a large-scale dataset that spans over 30 years of academic literature from arXiv, designed to support scientometric research and the study of scientific knowledge evolution. By providing a comprehensive collection of over two million publications, including detailed metadata and citation graphs, SciEvo enables researchers to analyze long-term trends in academic disciplines, citation practices, and interdisciplinary knowledge exchange.
GitHub | HuggingFace | Kaggle | Paper
- Installation
- Quick Start
- Dataset Overview
- Usage Examples
- API Reference
- Research Applications
- Example Findings
- Visualizations
- Contributing
- Citation
- License
- Python 3.8+
- pip
# Clone the repository
git clone https://github.com/Ahren09/SciEvo.git
cd SciEvo
# Install dependencies
pip install -r requirements.txt
pip install scievo
from datasets import load_dataset
# Load the SciEvo dataset
dataset = load_dataset("Ahren09/SciEvo")
# Access different configurations
arxiv_data = dataset["arxiv"]
semantic_scholar_data = dataset["semantic_scholar"]
references_data = dataset["references"]
# Explore the data
print(f"Number of arXiv papers: {len(arxiv_data)}")
print(f"Number of Semantic Scholar papers: {len(semantic_scholar_data)}")
print(f"Number of reference entries: {len(references_data)}")
# Citation analysis
python analysis/analyze_citation.py
# Keyword extraction
python analysis/keyword_extraction_llm.py --feature_name title
# Visualization
python visualization/plot_aoc_by_subjects.py
- Longitudinal Coverage: Includes academic publications from arXiv since 1991
- Rich Metadata: Titles, abstracts, full texts, keywords, subject categories, and citation relationships
- Comprehensive Citation Graphs: Captures citation networks to analyze influence and knowledge diffusion
- Interdisciplinary Focus: Supports cross-disciplinary studies on research evolution and knowledge exchange
- Analytical and Visualization Tools: Provides tools for analyzing terminology shifts, citation dynamics, and paradigm shifts
- Ease of Use: SciEvo is ready to use. You can download the dataset directly from HuggingFace instead of collecting it through the arXiv API or S2ORC, which can be costly and require API keys
Semantic Scholar:

Field | Description |
---|---|
paperId | The Semantic Scholar ID for the paper |
externalIds | Dictionary containing other external identifiers (DOI, PubMed ID, etc.) |
title | The title of the paper |
abstract | A summary of the paper's content |
year | The year the paper was published |
citationCount | Number of times this paper has been cited |
influentialCitationCount | Number of citations classified as "influential" by Semantic Scholar's algorithms |
fieldsOfStudy | List of general research fields |
s2FieldsOfStudy | More granular classification of research fields |
arXivId | Identifier for the paper in the arXiv repository |
arXiv:

Field | Description |
---|---|
id | The paper's arXiv ID |
title | The title of the paper |
summary | The abstract or summary of the paper |
published | Date when the paper was first published on arXiv |
authors | List of authors who contributed to the paper |
tags | Set of subject categories (e.g., cs.AI) |
title_keywords | Keywords extracted from the title |
title_and_abstract_keywords | Keywords extracted from both title and abstract |
References:

Field | Description |
---|---|
arXivId | The paper's arXiv ID |
references | List of references cited by the paper |
arXivPublicationDate | Date when the paper was first published on arXiv |
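The arXiv and references tables both carry an arXiv identifier, so records can be linked across them. The following is a minimal sketch, not the project's own code. It assumes each record is a plain dict with the field names listed in the tables, and that `id` and `arXivId` use the same identifier format (not verified here). The helper names and sample records are hypothetical.

```python
from typing import Dict, List


def index_by_arxiv_id(arxiv_records: List[dict]) -> Dict[str, dict]:
    """Map each arXiv ID to its metadata record."""
    return {rec["id"]: rec for rec in arxiv_records}


def attach_tags(references_records: List[dict],
                arxiv_index: Dict[str, dict]) -> List[dict]:
    """Return reference records enriched with the citing paper's subject tags."""
    linked = []
    for rec in references_records:
        meta = arxiv_index.get(rec["arXivId"])
        if meta is None:
            continue  # no arXiv metadata for this paper, so skip it
        linked.append({**rec, "tags": meta["tags"]})
    return linked


# Hypothetical records, for illustration only
arxiv_records = [
    {"id": "0000.00001", "title": "Paper A", "tags": ["cs.AI"]},
    {"id": "0000.00002", "title": "Paper B", "tags": ["math.CO"]},
]
references_records = [
    {"arXivId": "0000.00001", "references": [], "arXivPublicationDate": "2020-01-01"},
    {"arXivId": "0000.99999", "references": [], "arXivPublicationDate": "2021-01-01"},
]

linked = attach_tags(references_records, index_by_arxiv_id(arxiv_records))
print(linked)
```

Building the index once keeps the lookup per reference record constant-time, which matters at the scale of millions of papers.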
# Analyze citation patterns and age of citations
python analysis/analyze_citation.py --feature_name title
# Extract keywords using LLM
python analysis/keyword_extraction_llm.py --feature_name title
# Extract keywords for specific years
python analysis/keyword_extraction_llm.py --year 2020
# Extract keywords from title and abstract
python analysis/keyword_extraction_llm.py --feature_name title_and_abstract
# Plot Age of Citation (AoC) by subjects
python visualization/plot_aoc_by_subjects.py
# Plot citation diversity
python visualization/plot_citation_diversity.py --feature_name title
# Plot keyword trajectories
python visualization/plot_keyword_traj.py
- analysis/analyze_citation.py: Analyze citation patterns and diversity
- analysis/rank_keywords_by_number_of_occurrences.py: Rank keywords by frequency
- analysis/keyword_extraction.py: Traditional keyword extraction
- analysis/keyword_extraction_llm.py: LLM-based keyword extraction
- analysis/keyword_extraction_ngram.py: N-gram based extraction
- dataset/construct_citation_graph.py: Build citation networks
- dataset/construct_keyword_hypergraph.py: Create keyword hypergraphs
- dataset/download_arxiv_paper.py: Download arXiv papers
- model/gcn.py: Graph Convolutional Network models
- model/gconvgru.py: GConvGRU models
- model/word2vec.py: Word embedding models
- model/procrustes.py: Alignment analysis
- visualization/plot_aoc_by_subjects.py: Age of Citation plots
- visualization/plot_citation_diversity.py: Citation diversity analysis
- visualization/plot_keyword_traj.py: Keyword trajectory visualization
Most scripts support the following arguments:
- --feature_name title: Focus analysis on paper titles only
- --feature_name title_and_abstract: Include abstract text in analysis
- --year: Specify year for temporal analysis
- --gcn: Use Graph Convolutional Network models
- --model: Specify model type (e.g., "7b")
- --networkx: Use NetworkX for graph operations
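For readers writing their own scripts against the same conventions, the arguments above could be declared with `argparse` roughly as follows. This is a sketch only: the types, defaults, and help strings are assumptions, and the repository's scripts may define these flags differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments listed above."""
    parser = argparse.ArgumentParser(description="SciEvo-style script options (sketch)")
    parser.add_argument("--feature_name", choices=["title", "title_and_abstract"],
                        default="title",  # assumed default
                        help="Text field(s) used for the analysis")
    parser.add_argument("--year", type=int, default=None,
                        help="Restrict the analysis to one year")
    parser.add_argument("--gcn", action="store_true",
                        help="Use Graph Convolutional Network models")
    parser.add_argument("--model", type=str, default=None,
                        help='Model type, e.g. "7b"')
    parser.add_argument("--networkx", action="store_true",
                        help="Use NetworkX for graph operations")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible
args = build_parser().parse_args(["--feature_name", "title_and_abstract", "--year", "2020"])
print(args.feature_name, args.year, args.gcn)
```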
SciEvo enables researchers to explore scientific knowledge evolution and citation patterns with a broad range of applications:
- Terminology Evolution: Tracking the rise and decline of key terms over time
- Citation Dynamics: Understanding citation lifespan and field-specific differences
- Interdisciplinary Research Patterns: Analyzing how different disciplines interact
- Scientific Paradigm Shifts: Identifying major shifts in research focus
- Comparative Field Analysis: Exploring differences in knowledge production and citation behavior across disciplines
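As an illustration of the terminology-evolution use case, the sketch below counts how many papers mention each keyword per year. It assumes `published` is an ISO-style date string starting with the four-digit year and that `title_keywords` is a list of strings. Both are assumptions about the record layout, and the sample papers are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def keyword_counts_by_year(papers: Iterable[dict]) -> Dict[int, Counter]:
    """Count, for each year, the number of papers that mention each keyword."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for paper in papers:
        year = int(paper["published"][:4])  # assumes "YYYY-..." date strings
        # Lowercase and deduplicate so each paper counts a keyword at most once
        keywords = {kw.lower() for kw in paper["title_keywords"]}
        counts[year].update(keywords)
    return counts


# Hypothetical papers, for illustration only
papers = [
    {"published": "2019-05-01", "title_keywords": ["Graph Theory"]},
    {"published": "2020-03-10", "title_keywords": ["LLM", "llm", "Reasoning"]},
    {"published": "2020-11-22", "title_keywords": ["LLM"]},
]

counts = keyword_counts_by_year(papers)
print(counts[2020].most_common(2))
```

Comparing a keyword's count or rank across consecutive years gives a simple view of its rise or decline.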
Using SciEvo, we uncover key insights into the evolution of scientific research:
- Paradigm Shifts: Scientific progress occurs in leaps rather than through gradual accumulation. Applied fields, such as LLM research, show rapid shifts
- Keyword Trends: Machine learning terms surged post-2015, reflecting the growing dominance of AI-related research
- Citation Lifespan: Applied fields exhibit shorter citation cycles (e.g., LLM research: 2.48 years, Oral History: 9.71 years), indicating recency bias in some disciplines
- Disciplinary Homophily: Over 91% of citations occur within the same discipline, showing strong field-specific citation preferences
- Epistemic Cultures: Applied research relies on recent works, whereas theoretical fields prioritize foundational literature
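The disciplinary homophily figure is a share of same-discipline citations among all citations. A minimal sketch of that ratio is shown below, assuming each citation is already reduced to a `(citing_field, cited_field)` pair. How a paper with several subject tags is assigned to one discipline is not specified here, and the edges are toy data rather than dataset results.

```python
from typing import Iterable, Tuple


def same_field_share(edges: Iterable[Tuple[str, str]]) -> float:
    """Fraction of citations whose citing and cited papers share a field."""
    edges = list(edges)
    if not edges:
        return 0.0
    same = sum(1 for citing, cited in edges if citing == cited)
    return same / len(edges)


# Toy (citing_field, cited_field) pairs, for illustration only
edges = [
    ("cs", "cs"),
    ("cs", "math"),
    ("math", "math"),
    ("physics", "physics"),
]

share = same_field_share(edges)
print(f"{share:.2%} of citations stay within the same field")
```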
Citation graphs of papers related to LLMs and COVID.
Keyword trajectories of the terms Artificial Intelligence and COVID-19 show how these keywords co-occur with other keywords in papers.
Evolution in the ranks of math and machine-learning terms among all keywords over time. Math keywords remain consistently popular but show a decline in the past decade, while ML keywords surged in prominence over the last ten years.
Distribution of citations in the SciEvo dataset, which contains more intra-disciplinary citations (within identical subject areas) than cross-disciplinary ones.
Age of Citation (AoC) across the 8 arXiv subjects shows distinct trends. The distributions for eess and cs are concentrated at low ages, indicating a preference for recent citations. In contrast, disciplines such as physics, math, and econ show broader AoC distributions, reflecting their reliance on older foundational research.
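For intuition about the AoC statistic, the sketch below computes a mean age per subject. It assumes AoC is the difference between the publication years of the citing and the cited paper, and that each citation is given as a `(subject, citing_year, cited_year)` tuple. Both are assumptions made for illustration, and the numbers are toy values.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_aoc_by_subject(citations: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Average age of citation (in years) for each citing subject."""
    ages = defaultdict(list)
    for subject, citing_year, cited_year in citations:
        age = citing_year - cited_year
        if age >= 0:  # drop records where the cited paper appears to be newer
            ages[subject].append(age)
    return {subject: mean(values) for subject, values in ages.items()}


# Toy citations, for illustration only
citations = [
    ("cs", 2023, 2021),
    ("cs", 2023, 2022),
    ("math", 2023, 2003),
    ("math", 2020, 2010),
]

aoc = mean_aoc_by_subject(citations)
print(aoc)
```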
We welcome contributions! Please see our Contributing Guidelines for details.
# Clone the repository
git clone https://github.com/Ahren09/SciEvo.git
cd SciEvo
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests/
If you use SciEvo in your research, please cite our work:
@article{jin2024scito2m,
title={SciEvo: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis},
author={Jin, Yiqiao and Xiao, Yijia and Wang, Yiyang and Wang, Jindong},
journal={arXiv:2410.09510},
year={2024}
}
SciEvo/
├── analysis/ # Analytical tools and experiments
├── dataset/ # Dataset storage and processing
├── model/ # Models for citation analysis and topic modeling
├── visualization/ # Code and tools for visualization
├── utility/ # Utility scripts for data processing
├── notebooks/ # Jupyter notebooks for analysis and experiments
├── outputs/ # Results from analysis
│   ├── citation_analysis/ # Citation trend insights
│   ├── stats/ # Statistical summaries
│   └── visual/ # Visualization outputs
├── checkpoints/ # Model checkpoints
├── representations/ # Embedding and vector representations
└── embed/ # Embedding-related scripts
arXiv has a comprehensive taxonomy that categorizes research papers into eight broad fields: Computer Science (cs), Economics (econ), Electrical Engineering and Systems Science (eess), Mathematics (math), Physics, Quantitative Biology (q-bio), Quantitative Finance (q-fin), and Statistics (stat).
Each field is divided into subfields dedicated to specific areas of research. Computer Science, for example, includes Artificial Intelligence (cs.AI), Machine Learning (cs.LG), and Computer Vision (cs.CV).
More information can be found in arXiv Submission Taxonomy.
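When grouping papers by field, a subject tag such as cs.AI has to be reduced to its top-level field. The sketch below is one way to do this. It assumes tags follow the `archive.SUBJECT` pattern and that the listed archives are the ones grouped under Physics in the arXiv taxonomy. Whether SciEvo groups them exactly this way is not confirmed here, and the function name is hypothetical.

```python
# Archives grouped under Physics (assumed from the arXiv taxonomy)
PHYSICS_ARCHIVES = {
    "astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph", "hep-th",
    "math-ph", "nlin", "nucl-ex", "nucl-th", "physics", "quant-ph",
}


def top_level_field(tag: str) -> str:
    """Reduce an arXiv subject tag to its top-level field."""
    archive = tag.split(".", 1)[0]  # "cs.AI" -> "cs", "hep-th" -> "hep-th"
    return "physics" if archive in PHYSICS_ARCHIVES else archive


for tag in ["cs.AI", "stat.ML", "hep-th", "cond-mat.str-el"]:
    print(tag, "->", top_level_field(tag))
```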
SciEvo is released under the Apache 2.0 License.
This repository contains the bibliographic data for all the arXiv papers released until April 21st.