INTELCOMP PROJECT
A COMPETITIVE INTELLIGENCE CLOUD/HPC PLATFORM FOR AI-BASED STI
POLICY MAKING
(GRANT AGREEMENT NUMBER 101004870)
REPORT ON THE SELECTED MEASUREMENT AND DATA
COLLECTION. DELIVERABLE D1.2
Deliverable information
Deliverable number and name: D1.2 Report on the selected measurement and data collection
Due date: 31 December
Delivery date: 12/01/2022
Work Package: WP1
Lead Partner for deliverable: Technopolis group
Authors: Paresa Markianidou (Technopolis group), Hannah Bernard (Technopolis group), Apolline Terrier (Technopolis group), Lena Tsipouri (OPIX), Jeronimo Arenas Garcia (UC3M), Doaa Samy (UC3M), Ioanna Grypari (ARC), Dimitris Pappas (ARC)
Reviewers: Dietmar Lampert (ZSI), Dominique Guellec (Hcéres)
Approved by: Cecilia Cabello, Project Coordinator (FECYT)
Dissemination level: Public
Version: 1.2
Table 1. Document revision history

Issue Date   Version   Comments
11/01/2022   1.0       Document with contributions from internal peer reviewers
11/01/2022   1.1       Version for approval by the Project Coordinator and the Technical Manager
12/01/2022   1.2       Document with contributions from the Project Coordinator
DISCLAIMER
This document contains a description of the IntelComp project findings, work and products. Certain parts of it might be under partner Intellectual Property Right (IPR) rules, so, prior to using its content, please contact the consortium coordinator for approval.
In case you believe that this document harms in any way IPR held by you as a person or as a representative of an entity, please notify us immediately.
The authors of this document have taken all available measures to ensure that its content is accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document accept any responsibility for consequences that might arise from the use of its content.
The content of this publication is the sole responsibility of the IntelComp consortium and can in no way be taken to reflect the views of the European Union.
The European Union is established in accordance with the Treaty on European Union (Maastricht). There are currently 27 Member States of the Union. It is based on the European Communities and the Member States' cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice and the Court of Auditors. (http://europa.eu.int/)
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No. 101004870.
CONTENTS
Disclaimer
Acronyms
1. Summary note
2. Domain-agnostic Measurements
2.1. Measurements
2.2. Measurements for Agenda setting
2.3. Measurements for Evaluation
3. Measurements specific to the domain of Cancer
4. Data sources
5. Tools for STI policy actors
6. Services
6.1. Service for domain-related subcorpus generation
6.2. Classification Service
6.3. Advanced Topic Modelling Service
6.4. Topic-based time analysis service
6.5. Graph-based impact analysis
7. Gap analysis
References
Appendix I – Long list of sources considered
Appendix II – Selection criteria for policy questions

FIGURES
Figure 1: Basic structure of the processing pipeline to identify relevant documents
Figure 2: Basic structure of the classification service
Figure 3: Topic modelling pipeline
Figure 4: Dynamic topic modelling pipeline
Figure 5: Structure of the processing pipelines for graph-based impact analysis
Figure 6: Criteria selection for policy questions
TABLES
Table 1: Entrepreneurial Activity
Table 2: Knowledge creation
Table 3: Knowledge Linkages and Diffusion
Table 4: Guidance - Contribution to societal challenges
Table 5: Market formation
Table 6: Human and financial resources mobilisation
Table 7: Creation of legitimacy/address public concerns
Table 8: Knowledge
Table 9: Diffusion
Table 10: Innovation/Invention
Table 11: Investments
Table 12: Jobs
Table 13: Gender
Table 14: Objectives
Table 15: Other
Table 16: Objectives
Table 17: Inputs
Table 18: Outputs (first level needs)
Table 19: Scientific, medical, and social outcomes (second level needs)
Table 20: Scientific, medical, and social impacts
Table 21: Science & Innovation
Table 22: Company websites and financials
Table 23: Public and private investment
Table 24: Legal and policy documents
Table 25: Public procurement
Table 26: Social Media
Table 27: Skills demand and supply
Table 28: Typologies of policy questions not addressable in IntelComp
ACRONYMS
AI — Artificial Intelligence
EC — European Commission
FP — Framework Programmes
H2020 — Horizon 2020
HEUROPE — Horizon Europe
IPC — International Patent Classification
LL — Living Lab
NACE — Statistical Classification of Economic Activities in the European Community
NLP — Natural Language Processing
PU — Positive-Unlabeled
R&I — Research and Innovation
SDGs — Sustainable Development Goals
STI — Science, Technology and Innovation
TED — Tenders Electronic Daily
TRL — Technology Readiness Levels
1. SUMMARY NOTE
The objective of IntelComp's D1.2 deliverable is to translate its conceptual framework into concrete measurements which serve as the basis for the co-creation process in the three science, technology and innovation (STI) domains: AI, Climate Change - Blue Growth, and Health - Cancer. In the current version of deliverable D1.2 we report on the progress in defining the measurements and data sources, and briefly explain the tools we foresee for end users and the services required for the calculation of the identified measurements. The final listing of measurements and data sources will only be concluded upon finalisation of the living labs' needs.
In section 2 we describe the first set of measurements, which correspond to the domain-agnostic (i.e. non domain-specific) policy framework. The domain-agnostic set of measurements serves as a catalogue of measurements which require Natural Language Processing (NLP) and Artificial Intelligence (AI) techniques to discover relevant information connected to the target measurements. The list covers measurements: 1) which require AI; 2) for which sufficient data could be sourced (provisional assessment); 3) which are technically feasible (provisional assessment); and 4) which are within the scope of the IntelComp project. The prioritised list of domain-agnostic measurements and their corresponding sources to be integrated in IntelComp will be finalised by Month 16.
In section 3 we describe the second set of measurements, which correspond to the domain-specific policy framework as expressed by the needs of the living labs. In this report a provisional set of measurements and data sources is provided for the cancer living lab, the first living lab to provide a needs assessment. The inputs of section 3 are subject to prioritisation in the context of the co-creation process with the living labs starting in January 2022. During 2022 the final listing of measurements and data sources for all living labs will be completed.
The data sources are listed in section 4 and Appendix I, corresponding to a short and a long list of sources, respectively. The data sources list is the result of internal consultations on each of the individual measurements of the policy framework. The prioritised list of sources to be integrated in IntelComp will be finalised by Month 16.
Sections 5 and 6 briefly describe the tools and services provided by IntelComp to calculate those measurements which require processing unstructured text, enriching and extending the evidence base for STI policy makers and Public Administrations.
Finally, section 7 provides a gap analysis by comparing the domain-agnostic policy framework to the provisional implementation plan in IntelComp. It synthesises the policy questions which cannot be addressed by IntelComp, organised by a typology of the main reasons for exclusion.
2. DOMAIN-AGNOSTIC MEASUREMENTS
2.1. Measurements
In IntelComp, a distinction is made between statistical indicators and quantitative
measurements for policy making.
The OECD glossary of statistical terms defines statistical indicators as ‘data elements that
represent statistical data for a specified time, place, and other characteristics’.1 The European
Statistical System Committee (ESSC) defines indicators for policy making as ‘a particular subset
of statistical information, directly related to a special purpose such as monitoring specific policy
objectives’ (Eurostat, 2017).
Statistical indicators supporting evidence-based policies need to meet the stringent quality standards set in the European Statistics Code of Practice, which contains 15 principles.2 To enrich the evidence base with indicators derived from big data which are trusted by policy makers, the data must meet quality standards as described by the quality dimensions in the UNECE framework for the quality of big data, described in the European Statistical System handbook for quality and metadata reports (Eurostat, 2020). This requires, for instance, sound methodologies applying appropriate statistical procedures to address sample bias.
We are aware that some of the data sources to be exploited by the IntelComp platform do not
provide a representative coverage of innovation at either the industry or national level because
the data are based on self-selection (e.g. firms that apply for a patent) and the information they
provide is often incomplete, covering only one facet of innovation (e.g. company R&D
investments). In addition, some data sources are inconsistent in their coverage of innovation
activities (e.g. company websites). As a consequence, the measures derived from these data
sources may not be considered statistical indicators because of the lack of quality and
representativeness of those data sources.
In IntelComp, some measurements are designed even if they do not yet comply fully with quality standards, either because they are geographically restricted to one or a limited number of Member States or because a representativeness analysis is not performed on all possible dimensions of the data (e.g. country, gender, level of education, industrial sectors, etc.). IntelComp data and measurements represent experimental statistics and should be distinguished from traditional statistical indicators compiled by Eurostat and National Statistical Offices. IntelComp data and measurements will complement standard indicators by generating tailor-made measurements aimed at policy making.
Despite their limitations, these measurements serve the purpose of providing relevant
information for specific tasks in the policy cycle of a specific strategy, program or call. They are
not all designed to inform policy discussions at higher levels, for example to monitor progress
towards a related policy target. Nor are they all designed for international comparability or benchmarking.

1 Available here: https://stats.oecd.org/glossary/detail.asp?ID=2547
2 The 15 principles set in the European Statistics Code of Practice include: professional independence, mandate for data collection, adequacy of resources, commitment to quality, statistical confidentiality, impartiality and objectivity, sound methodology, appropriate statistical procedures, non-excessive burden on respondents, cost effectiveness, relevance, accuracy and reliability, timeliness and punctuality, coherence and comparability, accessibility and clarity.
Indeed, during the development of the framework for policy making, an analysis was carried out for each measurement identified to determine in which cases IntelComp will offer the evidence sought. The use of unstructured text associated with the documents in the datalake and the exploitation of AI pipelines constitute the core of the methodological contribution of IntelComp. As a result of this analysis, four tools are proposed (described in section 5) and a series of minimum services are identified (described in section 6).
Using these tools/services, some of the prioritised measurements (described in section 3) can be directly calculated, while in other cases correlated information will be obtained instead. To this aim, IntelComp will enrich the data sets in the datalake with information obtained from the Internet or other available datasets (using the crawling and homogenisation services included in the data lake), as well as with the outputs of the AI services.
As an example, the calculation of high-impact publications will involve enriching the desired
publication data set with journal quartile information, while topic modelling will allow the
analysis to be broken down by areas with the desired level of granularity. This information will
be made available to the user in the STI Viewer through a Business Intelligence (BI) panel
enriched with the newly calculated data, so that the activation of the corresponding filters will
allow the user to obtain the desired information.
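As an illustration of this enrichment-and-filter pattern, the following minimal sketch (using pandas) joins a publication set with journal quartile information and then applies a topic filter. All column names and values are illustrative assumptions, not the actual datalake schema.

```python
# Minimal sketch of the enrichment described above: each publication record
# carries a journal identifier and a topic label produced upstream (e.g. by
# the topic modelling service). All names here are hypothetical.
import pandas as pd

publications = pd.DataFrame({
    "pub_id": [1, 2, 3, 4],
    "journal_id": ["J1", "J2", "J1", "J3"],
    "topic": ["cancer genomics", "e-health", "cancer genomics", "biotherapies"],
    "year": [2019, 2020, 2020, 2021],
})
journal_quartiles = pd.DataFrame({
    "journal_id": ["J1", "J2", "J3"],
    "quartile": ["Q1", "Q2", "Q1"],
})

# Enrich the publication set with journal quartile information ...
enriched = publications.merge(journal_quartiles, on="journal_id", how="left")

# ... so that a BI-style filter (here: Q1 journals, one topic) yields the
# count of high-impact publications at the desired level of granularity.
high_impact = enriched[(enriched["quartile"] == "Q1")
                       & (enriched["topic"] == "cancer genomics")]
print(high_impact.groupby("year")["pub_id"].count())
```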
2.2. Measurements for Agenda setting
At the start of policy making, the problem(s) to be addressed need to be defined. Policy makers need information to understand the array of sectoral/technological/institutional potential for a specific future period, determined by internal and external factors. While policy makers may have solid knowledge of past performance in their area of competence, emerging changes constitute important information to guide them through the next policy cycle (usually 5-7 years). Policy needs refer to the decision on priorities and budget allocations.
The information needed concerns current and emerging global societal challenges, the way these challenges are translated into the policy makers' own context, the way their peers adopt their agendas and the potential of civil society to co-create the agendas, but also opportunities to improve the country's economic benefits in the years to come by identifying sectors, products and technologies with increasing global demand. The outcome can lead to strategic priorities forming Smart Specialisation Strategies, as well as lower-priority areas to be supported.
Short-listed policy questions for agenda setting are described in terms of measurements, data sources and the most relevant taxonomies. The list includes measurements: 1) which require AI; 2) for which sufficient data could be sourced (provisional assessment); 3) which are technically feasible (provisional assessment); and 4) which are in scope as per the proposal. The final list of measurements to be included in IntelComp will be provided by Month 16 and is subject to the final list of sources. Equally, the final unit of observation is subject to the technical assessment and relevance of each measurement. In agenda setting measurements, if, for instance, the specific policy is a national strategy, the unit of observation of each measurement would be the country/ies, the scientific discipline (e.g. cancer research) or the technology (e.g. AI) related to that strategy.
Table 1: Entrepreneurial Activity

Policy question: Are national/regional companies adapting to technological transformation trends in their respective sectors? How do they compare with major (foreign or non-regional) competitors?
Measurement: Number of companies developing/adopting transformative technologies per 10,000 companies in the country ['transformative technologies' are defined by living labs]
Data Sources: Company database compiled from various sources (see data sources for more detail); representative database for large R&D investors, large companies, and technology start-ups and scale-ups
Taxonomy: Technological transformation trends/innovations/corresponding products [LL specific]; NACE

Policy question: What is the composition of emerging technology portfolios of entrepreneurial companies?
Measurement: Technology topics supported by venture capital; technology topics and associated applications by largest R&D investors; technology topics from companies with highest company valuations
Data Sources: Crunchbase/Bloomberg/Thomson Reuters (news sections); national VC (LL specific); largest R&D investors' websites; largest R&D investors' company annual reports; company valuations from Crunchbase and Dealroom, and news reporting from Pitchbook (market news are published every day or every second day)
Taxonomy: Company type (largest R&D investors, large companies, technology start-ups); Technologies; NACE

Policy question: Which companies are pioneers in transformative technologies in the country?
Measurement: Company types receiving venture capital or other forms of financing for transformative technologies; companies with highest number of contracts systematically cooperating with top research institutes
Data Sources: Crunchbase; national VC (LL specific); OpenAIRE

Policy question: Who are the companies with persistent innovative activity in the country?
Measurement: Share of companies with continuous innovative activity in two consecutive periods (periods defined as every 3 years). Innovative activity is defined according to: 1) patenting activity (at least one); OR 2) trademark applications (at least one); OR 3) design applications (at least one); OR 4) standards (at least one); OR 5) software development (at least one)
Data Sources: NACE; company characteristics [LL specific]; Patstat; EUIPO (for trademarks and design); standards (ETSI and ISO micro data); Github; company websites (for future temporal analysis)

Policy question: In which technology fields do the persistent innovators invest?
Measurement: Distribution of technology fields of persistent innovators per sector
Data Sources: NACE; Patstat; EUIPO (for trademarks and design); standards (ETSI and ISO micro data); Github; OpenAIRE

Policy question: In which technology fields is the highest share of all company R&D investments? [EU & globally]
Measurement: Listing of technology fields of top R&D investors captured by different 1) STI outputs: publications, patenting, software; 2) investments: VC; and 3) products/services: company websites. R&D and innovation topics are derived from the data and not predefined.
Data Sources: Patstat; Github; Crunchbase; websites
Taxonomy: Technologies; Scientific Disciplines
Policy question: In which technology fields is the country improving its Revealed Comparative Advantage (RCA)?
Measurement: RCA on patents and publications: the share of a particular technology field in an economy's patents/publications relative to that field's share in total patents/publications, tracked over time
Data Sources: OpenAIRE; Patstat; company websites
Taxonomy: Technologies; Scientific Disciplines
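For reference, the standard Balassa-type formulation that this measurement usually denotes can be written as follows (a conventional definition, not one fixed in the deliverable), with \(P_{c,f,t}\) the count of patents (or publications) of economy \(c\) in field \(f\) in period \(t\):

\[
\mathrm{RCA}_{c,f,t} \;=\; \frac{P_{c,f,t} \,/\, \sum_{f'} P_{c,f',t}}{\sum_{c'} P_{c',f,t} \,/\, \sum_{c',f'} P_{c',f',t}}
\]

A value above 1 indicates that the economy is specialised in the field; an improving RCA corresponds to growth of this ratio over time.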
Table 2: Knowledge creation

Policy question: Which scientific fields demonstrate the highest growth in terms of publications/citations/patents globally?
Measurement: Annual growth in counts of publications/patents by scientific field; annual growth in average citations per publication/citations per patent by scientific field/technology
Data Sources: OpenAIRE; Patstat
Taxonomy: Basic/applied; Interdisciplinarity/Multidisciplinarity; Technologies (IPC), possibly checking the RISIS classification of patents

Policy question: Which are the emerging interdisciplinary fields globally (i.e. integrating knowledge from different disciplines)?
Measurement: Annual growth in counts of publications/patents of different topics of interdisciplinary publications; annual growth in average citations per publication/patent of different topics of interdisciplinary publications/intertechnological patents (more commonly known as converging technologies, i.e. closely integrated technologies)
Data Sources: OpenAIRE; Patstat
Taxonomy: Interdisciplinarity topics

Policy question: Which are the research teams in the country undertaking research in interdisciplinary fields?
Measurement: Ranking of organisations according to counts and citations of interdisciplinary publications (per year, for a period, and average annual growth); networks of organisations undertaking research in interdisciplinary fields
Data Sources: OpenAIRE (interdisciplinary journals); Patstat
Taxonomy: Interdisciplinarity

Policy question: Are there pockets of excellence for these research areas in the country?
Measurement: Organisations with strong growth and strong system linkages (composite): 1) high growth in cited publications (+10%), high growth in patents filed, high growth in participation in RDI projects (+10%); 2) participation and involvement in DIHs, cluster organisations, technology centres; 3) high share of public-private co-publications/co-patenting (+50%), etc.
Data Sources: OpenAIRE; Patstat; CORDIS; national programmes; CMISA project (pending assessment)
Taxonomy: Scientific Disciplines
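Several of the Table 2 measurements are year-on-year growth rates of counts by field. A minimal pandas sketch of that computation, with illustrative field names and numbers:

```python
# Year-on-year growth of publication counts per scientific field.
import pandas as pd

counts = pd.DataFrame({
    "field": ["oncology", "oncology", "e-health", "e-health"],
    "year": [2019, 2020, 2019, 2020],
    "publications": [120, 138, 40, 52],
})

counts = counts.sort_values(["field", "year"])
# pct_change within each field: (n_t - n_{t-1}) / n_{t-1}
counts["annual_growth"] = counts.groupby("field")["publications"].pct_change()
print(counts)  # oncology: +15%, e-health: +30% in 2020
```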
Table 3: Knowledge Linkages and Diffusion

Policy question: Which knowledge diffusion channels work best in good practices per discipline at international level?
Measurement: International co-publication: ratio of the share of cited international co-publications and the share of cited publications among national publications; participation in conferences: ratio of citations per publication in conference proceedings and citations per publication only in peer-reviewed journals (excl. those previously in conference proceedings), by scientific area; Open Access publications: ratio of citations per Open Access publication and per non-Open Access publication, by scientific discipline; participation in EU programmes: ratio of citations per publication from H2020/HEurope and citations per publication of non-H2020/HEurope funded research
Data Sources: OpenAIRE; Cordis
Taxonomy: Scientific discipline

Policy question: What are the themes in common between the actors of the ecosystem? What are the observed specialisation patterns? What is the evolution of topics among the different actors?
Measurement: Topic distribution between industry, science and citizens on scientific disciplines and SDGs, and its evolution in time; concentration measured with the Location Quotient, which measures the degree to which a topic is over-represented in a particular country relative to the topic's overall distribution in Europe
Data Sources: Industry: websites of actors [LL specific]; Science: OpenAIRE, H2020/HEurope; Citizens: European Media Monitoring
Taxonomy: Scientific discipline; SDGs

Policy question: Are actors of the ecosystem collaborating? What are the forms of collaboration?
Measurement: Share of public-private collaboration (co-patenting) in total patents; share of public-private collaboration (co-publications) in total publications; share of public-private collaboration (H2020/HEurope projects) in total participations
Data Sources: Patstat; OpenAIRE; H2020/HEurope
Taxonomy: Scientific discipline; SDGs; Technologies

Policy question: What are the cross-sectoral or cross-technological collaborations occurring, and among which actors?
Measurement: Network analysis of topics based on cross-technological publications/projects/patents; network analysis of topics based on cross-sectoral publications/projects/patents; network analysis of actors based on cross-technological publications/projects/patents; network analysis of actors based on cross-sectoral publications/projects/patents
Data Sources: OpenAIRE; Patstat; H2020/HEurope
Taxonomy: Scientific discipline; SDGs; Technologies
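The Location Quotient used in Table 3 is conventionally computed as follows (a standard formulation consistent with the description above), with \(x_{c,k}\) the volume of documents on topic \(k\) in country \(c\) and the sums taken over all European countries and topics:

\[
\mathrm{LQ}_{c,k} \;=\; \frac{x_{c,k} \,/\, \sum_{k'} x_{c,k'}}{\sum_{c'} x_{c',k} \,/\, \sum_{c',k'} x_{c',k'}}
\]

LQ > 1 means the topic is over-represented in the country relative to its overall distribution in Europe.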
Table 4: Guidance - Contribution to societal challenges

Policy question: To which global societal challenges are research groups contributing?
Measurement: Number of publications and patents by SDG; distribution of SDG publications by scientific area; distribution of SDG patents by technology field
Data Sources: OpenAIRE; Patstat; H2020/HEurope
Taxonomy: SDGs for publications; SDGs for patents; scientific disciplines; Technology

Policy question: To which EU societal challenges are research groups contributing?
Measurement: Number of publications and patents by EU Mission (this indicator would be living lab specific, considering the missions, e.g. adaptation to climate change including societal transformation; cancer; climate-neutral and smart cities; healthy oceans, seas, coastal and inland waters); share of publications in Missions in total publications; share of patents in Missions in total patents
Data Sources: OpenAIRE; Patstat; H2020/HEurope
Taxonomy: EU Missions classifier for publications; EU Missions classifier for patents; EU Missions classifier for Horizon projects

Policy question: Are there specific national societal challenges?
Measurement: Topics from national work programmes and corresponding calls
Data Sources: National programmes & calls
Taxonomy: Societal challenges (LL specific)
Table 5: Market formation

Policy question: What is the role of public procurement for transformative technologies (theoretically/practically)? [living lab specific example required]
Measurement: Rising topics associated to transformative technologies in TED
Data Sources: TED
Taxonomy: Taxonomies in transformative technologies; topics in focus [living lab specific]

Policy question: What is the content of policy papers/standards guiding markets?
Measurement: Topics on technologies in foresight publications and standards (for instance, the banning of plastics leading to research on biodegradable plastics)
Data Sources: Set of pre-identified foresight studies; standards (ETSI and ISO micro data); living lab specific [EU public policy documents and national policy documents]
Taxonomy: Transformative technologies [living lab specific]; Topics
Table 6: Human and financial resources mobilisation

Policy question: What are opportunities for EU financing?
Measurement: List of topics financed through national funds which can leverage EU funding; list of research teams (organisation level) financed through national funds which can leverage EU funding (using publications of national research teams with acknowledgements to national funding, matched to EU funding opportunities in TED)
Data Sources: OpenAIRE; TED
Taxonomy: Topics

Policy question: Is there sufficient S&T talent supply?
Measurement: Number of skilled professionals per technology in total STEM professionals
Data Sources: LinkedIn (subject to data access rights)
Taxonomy: ESCO; NACE; Technologies

Policy question: Is there sufficient S&T talent demand?
Measurement: Number of skilled professionals demanded per technology in total enterprises; number of skilled professionals demanded per technology in total enterprises per sector
Data Sources: Cedefop (subject to the potential for text mining of Cedefop snippets)
Taxonomy: ESCO; NACE; Technologies

Policy question: Is there a gap between supply and demand?
Measurement: Derived from the analysis of S&T supply and demand
Data Sources: LinkedIn; Cedefop
Taxonomy: ESCO; NACE; Technologies
Table 7: Creation of legitimacy/address public concerns

Policy question: What is the public opinion on related topics (old and new ones)?
Measurement: Sentiment analysis: share of positive and negative sentiment in total mentions
Data Sources: European Media Monitoring; parliamentary minutes
Taxonomy: Policy objectives; topics [LL specific]

Policy question: What is the role of the press in topics addressed in policy objectives? Is resistance expected?
Measurement: Trend analysis: temporal evolution of topics in social media associated to policy objectives
Data Sources: European Media Monitoring
Taxonomy: Topics [LL specific]
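The sentiment measurement in Table 7 reduces to a share of labelled mentions. A minimal sketch, assuming the sentiment classifier itself is available elsewhere and simply produces one label per mention (all data here is illustrative):

```python
# Share of positive and negative sentiment in total mentions.
from collections import Counter

mentions = [
    {"text": "Great progress on the national AI strategy", "label": "positive"},
    {"text": "Concerns raised over data privacy",          "label": "negative"},
    {"text": "Parliament debates research budget",         "label": "neutral"},
    {"text": "Funding boost welcomed by researchers",      "label": "positive"},
]

totals = Counter(m["label"] for m in mentions)
n = len(mentions)
for label in ("positive", "negative"):
    print(f"{label} share: {totals[label] / n:.2f}")
```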
2.3. Measurements for Evaluation
Based on the data generated during implementation, systematic evaluations of the efficiency, effectiveness and impact of the policy mix implemented are conducted to help update strategies in the next policy cycle. Policy questions become more complex: Were the targets met? How can we increase efficiency? How did we perform compared to peers? Which results are attributed to which interventions? Evaluations require significant data to check the intervention logic and run counterfactual evaluations. Combining inputs to respond to these questions has always been a challenge because of the lack of data and attribution problems. It is mainly in this area, where traditional indicators are insufficient, that machine learning can add value.
Short-listed policy questions for evaluation are described in terms of measurements, data sources and the most relevant taxonomies. The list includes measurements: 1) which require AI; 2) for which sufficient data could be sourced (provisional assessment); 3) which are technically feasible (provisional assessment); and 4) which are in scope as per the proposal. The final list of measurements to be included in IntelComp will be provided by Month 16 and is subject to the final list of sources. In terms of the unit of observation, the specific policy may be a programme or a call for funding, and the unit of observation would be the outputs, outcomes or impacts related to that programme or call.
Table 8: Knowledge

Objective: Science
Policy question: How many scientific publications were published?
Measurement: Number of scientific publications published
Level: 1.output
Data Sources: Project publications; OpenAIRE
Taxonomy: Scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How many scientific publications are applied research?
Measurement: Share of applied research publications in total publications
Level: 1.output
Data Sources: Project publications; OpenAIRE
Taxonomy: Applied research; scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How many scientific publications are basic research?
Measurement: Share of basic research publications in total publications
Level: 1.output
Data Sources: Project publications; WoS, Scopus; OpenAIRE
Taxonomy: Basic research; scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How many scientific publications are interdisciplinary?
Measurement: Share of interdisciplinary scientific publications in total publications
Level: 1.output
Data Sources: Project publications; WoS, Scopus; OpenAIRE
Taxonomy: Interdisciplinarity; scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How many presentations were made in top scientific conferences?
Measurement: Share of conference papers published in the top 1% or top 10% of scientific conferences in total conference papers
Level: 1.output
Data Sources: Project conference papers; conference papers classification
Taxonomy: Scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How many scientific publications were published in the top 1% or top 10% of scientific journals?
Measurement: Share of project scientific publications published in the top 1% or top 10% of scientific journals
Level: 2.outcome
Data Sources: Project publications; OpenAIRE; journal classification
Taxonomy: Scientific disciplines; Technologies; SDGs

Objective: Science
Policy question: How were citations in publications associated to projects compared to the scientific discipline average?
Measurement: Field-Weighted Citation Index of project peer-reviewed publications
Level: 2.outcome
Data Sources: OpenAIRE
Taxonomy: Scientific disciplines; Technologies; SDGs
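Table 8 relies on the Field-Weighted Citation Index. As commonly defined in bibliometrics (the deliverable does not fix a specific variant), for a publication \(p\) with \(c_p\) citations:

\[
\mathrm{FWCI}_p \;=\; \frac{c_p}{\bar{c}_{\,\mathrm{field}(p),\,\mathrm{year}(p),\,\mathrm{type}(p)}}
\]

where the denominator is the average number of citations received by publications of the same field, publication year and document type; FWCI = 1 means the publication is cited exactly at the average for comparable publications.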
Table 9: Diffusion

Objective: Science
Policy question: In which ways has the diffusion of knowledge taken place?
Measurement: Towards innovation: number of (OS) publications (directly linked to each project result) referenced in non-patent citations of patents
Level: 2.outcome
Data Sources: Project outputs; OS publications - OpenAIRE; Patstat
Taxonomy: NACE; scientific areas; Technologies; SDGs

Objective: Science
Policy question: In which ways has the diffusion of knowledge taken place?
Measurement: Shared knowledge: share of research outputs (software, datasets, publications) shared through open knowledge infrastructures in total research outputs
Level: 2.outcome
Data Sources: Project outputs; OpenAIRE; GitHub/GitLab
Taxonomy: NACE; Technologies; SDGs

Objective: Science
Policy question: In which ways has the diffusion of knowledge taken place?
Measurement: Co-creation: number and share of projects where EU citizens and end-users contribute to the co-creation of R&I content in total projects [entities are defined by the domain in focus]
Level: 2.outcome
Data Sources: Project periodical/final reports
Taxonomy: SDGs

Objective: Science
Policy question: In which ways has the diffusion of knowledge taken place at programme level?
Measurement: Open Science: share of open access programme research outputs (publications) actively used/cited after the programme in total outputs (publications); or average citations of Open Science research outputs (i.e. publications in peer-reviewed journals and conferences)
Level: 2.outcome
Data Sources: Project outputs; OpenAIRE; Open Science Observatory
Taxonomy: Scientific areas

Objective: Social
Policy question: What were the dissemination methods used towards the public?
Measurement: Events: number and share of projects with event participations by type of event in total projects
Level: 1.output
Data Sources: Events in OpenAIRE
Taxonomy: Events typology; SDGs; policy objectives

Objective: Social
Policy question: What were the dissemination methods used towards the public?
Measurement: Outreach activities: number and share of projects with digital outreach of scientific results in total projects
Level: 1.output
Data Sources: Newspapers; social media; Wikipedia; project descriptions of activities
Taxonomy: SDGs; policy objectives

Objective: Social
Policy question: In which ways has the diffusion of knowledge taken place?
Measurement: Engagement: number and share of projects with citizen and end-user engagement mechanisms after the project in total projects
Level: 2.outcome
Data Sources: Open source publications; social media; beneficiaries' websites
Taxonomy: SDGs; policy objectives

Objective: Social
Policy question: What were the dissemination methods used towards the public?
Measurement: General public reach: number of people reached through dissemination activities (on topics associated to the project's expected impacts)
Level: 2.outcome
Data Sources: European Digital Media Observatory; Twitter
Taxonomy: SDGs; policy objectives
Table 10: Innovation/Invention

Objective: Economy
Policy question: Has the programme enabled the research activities to reach high technological readiness levels?
Measurement: Technology Readiness Level: share of outputs with a TRL of 6 or above compared to all projects
Level: 1.output
Data Sources: Project outputs/deliverables
Taxonomy: TRLs

Objective: Economy
Policy question: How many patents were produced (applications/grants)?
Measurement: Patents: number of EPO patent applications and grants; percentage share of patent grants and patent applications [Note: a patent does not signal an innovation, but an invention, i.e. an idea that is demonstrated as operational but has not necessarily been commercialised]
Level: 1.output
Data Sources: FP/national programme; Patstat
Taxonomy: Scientific disciplines; technologies; policy objectives; SDGs

Objective: Economy
Policy question: What innovations were developed?
Measurement: Innovations: number of innovative products, prototypes, industrial production processes, research datasets, methods, algorithms/software, business models
Level: 1.output
Data Sources: Company websites of beneficiaries; project deliverables; publications of participants; OpenAIRE; Github; open access repositories
Taxonomy: Classifier of types of innovations; company websites

Objective: Economy
Policy question: What were the private returns on investment?
Measurement: From innovation to market: R&D and innovation products and services brought to market associated to the results of the programme
Level: 2.outcome
Data Sources: Project deliverables; project publications; company websites
Taxonomy: NACE; innovations (LL specific); Technologies

Objective: Economy
Policy question: What is the uptake of project innovations in the market?
Measurement: Company uptake score: a measure linking the innovations developed in the projects with those taken up by the company beneficiaries after the end of the project lifecycle
Level: 3.impact
Data Sources: Project deliverables; project publications; company websites
Taxonomy: NACE; innovations (LL specific); Technologies

Objective: Economy
Policy question: Has the programme stimulated the development of transformative innovations necessary for the twin transition of industry?
Measurement: Transformative innovations: number of projects in transformative technologies; share of projects in transformative technologies in all projects
Level: 3.impact
Data Sources: Project outputs
Taxonomy: Transformative technologies (LL specific); TRL

Objective: Economy
Policy question: Has public procurement of innovation produced product/process innovations launched in the market?
Measurement: Innovations: types of innovations introduced by company beneficiaries of public procurement (topics)
Level: 2.outcome
Data Sources: National data on public procurement; EU-level TED; companies' websites; companies' social media
Taxonomy: NACE; Technologies; SDGs

Objective: Social
Policy question: What were the social returns on investments?
Measurement: Carbon footprint: types and number of innovations on reducing carbon footprint compared to all programme innovations
Level: 2.outcome
Data Sources: Project deliverables
Taxonomy: Carbon footprint innovations (LL specific); TRL
Table 11: Investments

Objective: Economy
Policy question: What were the private returns on investment?
Measurement: Private funding: private investments raised to exploit or scale up results of the programme (level of organisation), in million euro
Level: 3.impact
Data Sources: Crunchbase; companies' social media
Taxonomy: NACE; Technologies; SDGs

Objective: Economy
Policy question: What is the total public funding mobilised?
Measurement: Public funding: amount of public investment mobilised, in million euros, from EU and national funding
Level: 2.outcome
Data Sources: Framework programme data; national public funding
Taxonomy: NACE; Technologies; SDGs
Table 12: Jobs

Objective: Economy
Policy question: How many new jobs were created after the project (research and beyond) within the country?
Measurement: Temporal evolution: growth of job offers in the areas of impact
Level: 1.output
Data Sources: Cedefop online job advertisement snippets (subject to the content and volume of text within the snippets published by Cedefop)
Taxonomy: ISCED
Table 13: Gender

Objective: Social
Policy question: What were the social returns on investments?
Measurement: Project participation: female/male ratio
Level: 1.output
Data Sources: Project participants
Taxonomy: Gender

Objective: Social
Policy question: What were the social returns on investments?
Measurement: Research outputs: share of research outputs (incl. publications, datasets, software) produced by females in total research outputs
Level: 1.output
Data Sources: Project participants and outputs
Taxonomy: Female as first author; Gender
Table 14: Objectives

Policy question: Which societal challenges have been addressed?
Measurement: Share of projects by SDG; share of project outputs by SDG
Level: 1.output
Data Sources: Project outputs; project descriptions
Taxonomy: SDGs classifier of outputs (publications); SDGs classifier of projects

Policy question: Which policy objectives have been addressed?
Measurement: Share of project topics associated to policy documents; share of project topics associated to parliament discussion minutes
Level: 2.outcome
Data Sources: Overton; parliament discussion minutes
Taxonomy: SDGs; policy objectives [LL specific]
Table 15: Other

Objective: Leverage
Policy question: What has been the leverage of national support measures for EU competitive funding?
Measurement: Share of EU funding beneficiaries who received national support prior to receiving EU funding (at organisation level); share of EU-funded project outputs referencing project outputs of nationally funded projects (at project level)
Level: 2.outcome
Data Sources: OpenAIRE; FP-funded projects; nationally funded projects
Taxonomy: Technologies; scientific disciplines; SDGs

Objective: Multiplication
Policy question: What are the multiplication effects of each programme?
Measurement: Degree of collaboration with other projects within the same programme after the programme, measured by the share of co-publications between different project teams in total publications
Level: 2.outcome
Data Sources: Programme/project outputs, i.e. publications
Taxonomy: Scientific areas; sectors; SDGs

Objective: Exclusivity
Policy question: Are we investing in topics that several other funders are interested in, or supporting a field by ourselves?
Measurement: Number of funders on a specific topic (crowded vs exclusive)
Level: Not applicable
Data Sources: OpenAIRE; programme/project outputs, i.e. publications
Taxonomy: Technologies; scientific disciplines; SDGs
3. MEASUREMENTS SPECIFIC TO THE DOMAIN OF CANCER
Provisional domain-specific measurements are provided for the domain of cancer, focusing on the analysis of the impact of funded research projects and the characterisation of 'impact pathways'. The latter focus represents the main area of interest of the cancer living lab. The climate change and AI living labs will equally provide their main areas of interest within 2022. Three levels of needs have been identified:
1. To characterise the scientific production of funded projects (outputs)
2. To identify and characterise the outcomes of funded projects (outcomes)
3. To identify and characterise the social impact of funded projects (impacts)
In the tables below we describe a set of measurements per level of need identified. These
measurements are provisional and will be updated in the course of 2022 in cooperation with the
cancer living lab. The final measurements to be implemented in IntelComp will depend on the
formulation of narratives on impact pathways defined by the Cancer living lab.
To facilitate understanding of the tables below, we briefly describe the distinction between outputs, outcomes and impacts.
Outputs are the tangibles or intangibles that an organisation or project produces. These could be completed services, products, interventions or other 'deliverables'. They should act to 'spark change' or act as the catalyst for the identified outcomes. They are normally fairly easy to measure and can often be quantified, e.g. how many we do, or the number of outputs created. Outputs in the cancer domain would be, e.g., research results and clinical trials.
Outcomes are the intended short- to medium-term effects or the 'step changes' which need to occur in order to achieve the long-term or ultimate goal. For change within an individual, this can be thought of as the journey the beneficiary needs to go on to reach the identified change. Outcomes are often more difficult to measure than outputs, as they frequently relate to an individual's perceptions, emotions or other internal state. Thus drugs, clinical guidelines and new technologies and treatments are outcomes, because we do not know whether they will really reduce mortality rates and improve health.
Impact is the long-term goal or ultimate objective. An organisation's impact will likely be closely linked to its mission statement or vision statement. Whether for an organisation or a project, the impact(s) will be what it is ultimately trying to achieve; for work with individuals, it is the end state the beneficiary should be in. Impact should be achieved as a result of the outcomes: if the outcomes are the journey the beneficiary will go on, the impact is the end destination. Impact will often be the most difficult to measure, and since it frequently occurs over a long period of time with other influencing factors, it can be challenging to identify whether any observed changes are a result of the efforts made or of something else (attributing causality). Impacts in the cancer domain are, e.g., improved quality of life of individuals with cancer and reduced mortality rates from cancer.
Table 16: Objectives

Objectives: Framework Programmes (H2020 and HEurope)
Source: Cordis
Measurement: Topics on expected impacts
Taxonomies: Topics (e.g. tobacco, alcohol, food, pollution); technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); stages of patient care: 1) prevention, 2) early detection, 3) diagnosis and treatment, and 4) quality of life for cancer patients and survivors

Objectives: National programmes
Source: LL specific
Table 17: Inputs

Inputs: Framework Programmes (H2020 and HEurope)
Source: Cordis
Measurement: Funding in million euro
Taxonomies: Topics (e.g. tobacco, alcohol, food, pollution); technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); stages of patient care: 1) prevention, 2) early detection, 3) diagnosis and treatment, and 4) quality of life for cancer patients and survivors; beneficiaries (types); funders; applicants (types)

Inputs: National programmes
Source: LL specific
Table 18: Outputs (first level needs)

Project outputs: Scientific publications
Source: OpenAIRE
Measurement: Number of scientific publications published during the project; share of scientific publications by taxonomy
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; basic/clinical; national/international; scientific discipline; topics: tobacco, alcohol, food, pollution; technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); stages of patient care: 1) prevention, 2) early detection, 3) diagnosis and treatment, and 4) quality of life for cancer patients and survivors; public-private co-publications

Project outputs: Patents
Source: Patstat; programme data
Measurement: Number of patents filed during the project; share of patents filed by taxonomy
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); public-private co-patenting
Table 19: Scientific, medical, and social outcomes (second level needs)

Project outcomes: Science / Publications (patent citations could feature here)
Sources: OpenAIRE
Measurement: Number of scientific publications published after the project associated to the publications funded during the project; share of scientific publications published after the project by taxonomy; Field-Weighted Citation Index of project peer-reviewed publications
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; basic/clinical; national/international; scientific discipline; topics (e.g. tobacco, alcohol, food, pollution); technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); stages of patient care: 1) prevention, 2) early detection, 3) diagnosis and treatment, and 4) quality of life for cancer patients and survivors; public-private co-publications

Project outcomes: Medical / Health data
Sources: OpenAIRE
Measurement: Number of data objects produced; number of data objects consumed
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; topics (e.g. tobacco, alcohol, food, pollution)

Project outcomes: Medical / Clinical trials
Sources: Clinicaltrials.gov
Measurement: Number of clinical trials linked to projects; type of trial; phase in which the trial ended; age group targeted; number of clinical trial references; citations in the same disease or another (cross-over); number of hops before a successful clinical trial
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification

Project outcomes: Medical / Drugs
Sources: Drugbank
Measurement: Phase 5+ (new or repurposed drug, or clinical guidelines); number of new drugs linked to projects (through clinical trials); number of drug repurposings linked to projects (through clinical trials)
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); TRL

Project outcomes: Medical / Clinical guidelines
Sources: OpenAIRE; PubMed
Measurement: Number of clinical guidelines linked to projects (through clinical trials); clinical guidelines linked to projects
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; topics (e.g. tobacco, alcohol, food, pollution)

Project outcomes: Medical / New technologies and treatments
Sources: Project deliverables; patents; beneficiary websites; PubMed; OpenAIRE
Measurement: New technologies and treatments from project deliverables, patents and beneficiary websites linked to the projects
Taxonomies: Technologies and treatments (e.g. genetics, biotherapies, predictive medicine, e-health); stages of patient care: 1) prevention, 2) early detection, 3) diagnosis and treatment, and 4) quality of life for cancer patients and survivors

Project outcomes: Social / Social media buzz
Sources: Twitter; European Media Monitor
Measurement: Reach in tweets of funded participants related to the outputs/outcomes of the funded project

Project outcomes: Social / Position papers
Sources: Open public consultations
Measurement: Share of positive/negative topics (sentiment analysis of position papers)
Taxonomies: Topics: tobacco, alcohol, food, pollution
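The 'number of hops before a successful clinical trial' outcome above can be read as a shortest-path length in a citation/link graph connecting project outputs to trial registrations. A minimal sketch with networkx, where node names and edges are purely hypothetical:

```python
# Count the citation hops from a funded project output to the nearest
# linked clinical trial (illustrative graph, hypothetical node names).
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("project_pub_A", "pub_B"),   # pub_B builds on project_pub_A
    ("pub_B", "trial_NCT001"),    # the trial references pub_B
    ("project_pub_A", "pub_C"),
])
trials = {"trial_NCT001"}

# Shortest path length from the project output to every reachable node,
# then the minimum distance to any trial node.
lengths = nx.single_source_shortest_path_length(g, "project_pub_A")
hops = min((d for node, d in lengths.items() if node in trials), default=None)
print(hops)  # -> 2
```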
Table 20: Scientific, medical, and social impacts

Project impacts: Science
Sources: OpenAIRE
Measurement: World-class science: number and share of peer-reviewed publications from projects that are a core contribution to scientific fields in total peer-reviewed publications (core contribution: citing top 1% publications in the corresponding subject area)
Taxonomies: Scientific discipline; topics (e.g. tobacco, alcohol, food, pollution); International Classification of Diseases 11th Revision (ICD11); Orphanet classification

Project impacts: Medical
Sources: PubMed; OpenAIRE
Measurement: Uptake from practitioners: clinical guidelines
Taxonomies: International Classification of Diseases 11th Revision (ICD11); Orphanet classification; topics (e.g. tobacco, alcohol, food, pollution)

Project impacts: Social / Public health
Measurement: Contribution to policy making/legislation impacting public health: share of project topics associated to policy documents and/or share of project topics associated to parliament discussion minutes
4. DATA SOURCES
To address the diverse policy aspects comprised in the scope of IntelComp, we consider a broad variety of potential sources to be ingested and stored in the IntelComp Data Space. The assessment of the sources' feasibility and relevance is made based on six criteria:
1. Text mining potential: The source provides or contains text documents or text sections that can be used for text mining processes. Text mining potential is a qualifying criterion, i.e. if it is not fulfilled, the source cannot be integrated into IntelComp.
2. Potential for temporal data and time series analyses: Sources can be analysed at past and future moments in time, allowing time series analyses. Two different issues are important to distinguish: 1) are data sources periodically updated (necessary for the future sustainability of the source in IntelComp), and 2) do we have time information for the items in the data source (necessary for time analysis)?
3. Taxonomy: There are different classifications for the data provided by each source identified. Additionally, we identified classifiers that we intend to use to sort the data (in addition to those already available in the dataset). Both types of classifiers are listed in the tables below under taxonomy.
4. Representativeness: The data is derived from the whole population of interest or a representative sample of it. At this stage, this criterion is assessed at a high level; it will be further investigated, as will methods to address biases.
5. Open access: The data can be accessed and extracted free of charge. Exceptions apply and are being considered in the framework of the domain-specific needs assessment.
6. Availability of data for main competitors: The main competitors of the EU are defined as the USA, Japan, South Korea and China. This criterion assesses whether the source also provides data for the cited countries, to allow international comparisons, or whether homogeneous data from these countries could be gathered from alternative sources.
The full list of potential sources under consideration is available in Appendix I – Long list of sources considered, while the current section presents the most promising ones, i.e. the sources that best match the established criteria and that are the most versatile in terms of addressing multiple policy questions. The sources identified belong to various typologies and are sorted accordingly in the tables below.
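For illustration, the six criteria can be recorded per source as a simple structured record; the following sketch uses illustrative field names and values that are not part of the deliverable:

```python
# Hypothetical structured record for the six source-assessment criteria.
from dataclasses import dataclass

@dataclass
class SourceAssessment:
    name: str
    text_mining_potential: bool      # qualifier: must be True to integrate
    temporal_analysis: bool          # periodic updates + time-stamped items
    taxonomy: list[str]              # existing and intended classifiers
    representativeness: str          # high-level assessment at this stage
    open_access: bool
    competitor_coverage: bool        # data for USA, Japan, South Korea, China

openaire = SourceAssessment(
    name="OpenAIRE",
    text_mining_potential=True,
    temporal_analysis=True,
    taxonomy=["Scientific disciplines", "SDGs", "Technologies"],
    representativeness="high",
    open_access=True,
    competitor_coverage=True,
)
assert openaire.text_mining_potential  # the qualifying criterion
```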
Table 21: Science & Innovation

Source: OpenAire/Semantic Scholar
Description: Open access publications platform, with 129M deduplicated publications available
Suitability: High
Taxonomy: Scientific disciplines; SDGs; Technologies
Relevant policy questions:
* In which ways has the diffusion of knowledge taken place?
* In which ways has the diffusion of knowledge taken place at programme level?
* What was the contribution of the publications to the scientific field?
* How many scientific publications were published in the top 1% or top 10% of scientific journals per discipline?
* How were citations in publications associated to projects compared to the scientific discipline average?
* How many scientific publications are applied/basic research?
* How many scientific publications are interdisciplinary?
* How many scientific publications were published?

Source: Cordis
Description: Research activities and outputs in the EU framework programmes (public investment). The data available from Horizon 2020 and FP7 is already ingested in IntelComp, through Corpus Viewer. Information on countries outside the EU is only available regarding their involvement in H2020 partnerships.
Suitability: High
Taxonomy: Scientific disciplines; SDGs; Technologies; taxonomy of innovations; TRL
Relevant policy questions:
* What has been the leverage of national support measures for EU competitive funding?
* How many people were trained as researchers?
* Has the programme stimulated the development of transformative innovation?
* What was the uptake of scientific results in patents?
* What were the social returns on investments?
* Has the programme enabled the research activities to reach high technological readiness levels?
* How many patents were produced (applications/grants)?
* What innovations were developed?
* What is the total public funding mobilised?
* In which ways has the diffusion of knowledge taken place at programme level?
* How many people were trained as technicians? As researchers?
* How many presentations were made in top scientific conferences?
* How many scientific publications are applied/basic research?
* How many scientific publications are interdisciplinary?
* How many scientific publications were published?
* What are the multiplication effects of each programme?
* What were dissemination methods used towards the public?
* Which societal challenges have been addressed?

Source: Patstat
Description: Online inventory of patents with complete coverage (more than 100 million patent documents)
Suitability: High
Taxonomy: IPC; Technologies; SDGs; TRL; NACE; policy objectives
Relevant policy questions:
* What is the generation of patentable (appropriable) knowledge?
* In which ways has the diffusion of knowledge taken place?
* What was the uptake of scientific results in patents?
* How many patents were produced (applications/grants)?

Source: Github
Description: Code repositories used by 4+ million companies
Suitability: High
Taxonomy: Technologies
Relevant policy questions:
* What innovations were developed?
* In which ways has the diffusion of knowledge taken place at programme level?
* What were dissemination methods used towards the public?
Table 22: Company websites and financials

Source: Innovative companies' websites
Description: Own compilation of innovative companies from different sources, organised by different types of companies. The listing of company websites is expected to rely on several company repositories: 1) Crunchbase, for large companies and tech start-ups; 2) the JRC Scoreboard of the largest R&D innovators (top 2500 worldwide and top 1000 in the EU); 3) Bloomberg, Dealroom and/or Pitchbook; 4) Patstat, i.e. websites of companies with a large number of patents; 5) websites of Unicorns; 6) the Framework Programmes, for beneficiary companies active in FP7, H2020 and Horizon Europe projects; and 7) the Living Labs, which will provide insights on the main local innovators.
Suitability: Medium
Taxonomy: NACE; Technologies; Taxonomy of innovations; Policy objectives; SDG
Relevant policy questions:
- What was the contribution of innovations to turnover, profits, market shares?
- What innovations were developed by companies?
- What innovations were developed in the project?
- What is the total funding mobilised?
Table 23: Public and private investment

Source: Crunchbase / Pitchbook
Description: Inventory of worldwide companies with comprehensive information on their funding rounds (private investment) and news items
Suitability: Medium
Taxonomy: Company size; Company establishment and funding; Industries; Technologies; NACE
Relevant policy questions:
- What were the private returns on investment?
Table 24: Legal and policy documents

Source: Eurlex
Description: Online database of European Union treaties, legal acts, consolidated texts, international agreements, etc.
Suitability: High
Relevant policy questions:
- Which policy objectives have been addressed?

Source: Overton
Description: Index of policy literature with comprehensive publication information
Suitability: Medium

Source: Own policy documents database
Description: A compilation of various sources: 1) the European Parliament (different committees) and the publications from all EU entities and agencies; 2) the SIPER and Fteval initiatives; 3) online repositories of EU/OECD countries' R&I policy and technology evaluations; and 4) foresight studies from the European Commission, the Competence centre on foresight and the OECD strategic foresight work.
Suitability: Medium

Source: Foresight studies
Description: Compilation of studies shaping future R&D orientations from different institutions
Suitability: High
Relevant policy questions:
- Are currently available strategies/policies coherent?

Taxonomy (shared across the sources in this table): SDG; Policy objectives; Strategic pillars; Sectors; Technologies; Scientific areas
Table 25: Public procurement

Source: TED
Description: Online database of active and past public procurement offers from local, national and European authorities for services, works and supplies. TED has 4,390,327 tenders registered, providing a comprehensive, if not exhaustive, overview of procurement by public authorities in Europe. An expected obstacle is the difficulty of linking procurement offers with the technology taxonomy.
Suitability: Medium
Taxonomy: Contract characteristics; Technologies; Sectors
Relevant policy questions:
- What are opportunities for EU financing?
- What is the role of public procurement for transformative technologies (theoretically/practically)?
Table 26: Social Media

Source: European Media Monitor
Description: The EU Competence Centre on Text Mining and Analysis extracts information from online data, including traditional or social media, or from large public or proprietary document sets
Suitability: Low
Relevant policy questions:
- In which ways has the diffusion of knowledge taken place at programme level?
- What were the dissemination methods used towards the public?

Source: Twitter
Description: Twitter activity (tweets) of pre-identified actors: innovative companies, FP projects, beneficiaries. Tweets and their associated reach are considered as dissemination activities and citizen engagement mechanisms.
Suitability: Medium
Taxonomy: Technologies; Sectors
Relevant policy questions:
- Has public procurement of innovation produced product/process innovations launched in the market (lead markets)?
- In which ways has the diffusion of knowledge taken place at programme level?
- What were the dissemination methods used towards the public?
Table 27: Skills demand and supply

Source: LinkedIn (3)
Description: Public profiles of professionals associated to specific skills or to FP programmes' positions
Suitability: Pending
Taxonomy: Industries; Skills (ESCO); Scientific disciplines (FOS2); Jobs typology
Relevant policy questions:
- How many new jobs were created after the project (research and beyond) within the country?
- What was the total employment created?

Source: Euraxess
Description: European Commission's job offers and funding opportunities platform for researchers
Suitability: Medium
Taxonomy: Detailed taxonomies developed with the Living Labs
Relevant policy questions:
- How many new jobs were created for researchers during the project?
- What was the career development of participating researchers?

(3) LinkedIn is considered the most promising source for skills demand and supply. Access to LinkedIn public profiles is, however, not yet confirmed.
5. TOOLS FOR STI POLICY ACTORS
IntelComp integrates different underlying technologies capable of providing evidence to answer
policy questions relevant for all phases of the policy cycle, addressing the needs of STI policy
actors. IntelComp builds upon the components and services from Corpus Viewer and Data4Impact, adding newly developed components and exploiting both the structured metadata available for the datasets and the output of AI pipelines that build on unstructured text. All these
components and services, together with the necessary visualisations, are grouped into four main
IntelComp tools:
1. STI Viewer: This tool targets mainly the Policy makers and Public Administrations. It offers
basic and advanced visualisations based on the back-end components for the analysis of
both structured and unstructured data. In addition to STI Viewer, advanced users from this
target group, such as policy analysts, will also have the possibility to analyse their own
datasets using IntelComp integrated components such as NLP pipelines, machine
translation, etc. Also, they can carry out inter-corpus comparisons against publicly available
datasets in the IntelComp Data Lake. This tool answers a wider range of policy questions
described in greater detail in Sections 2 and 3.
2. Interactive Model Trainer: This tool is provided for technical and advanced users from STI Policy Makers and Public Administrations. The Interactive Model Trainer allows these users to use the back-office IntelComp architecture and components to train their own models, either topic models or classification models. It also allows them to play an active role in the creation and validation of these models, to ensure "human-in-the-loop" principles and unbiased data selection. In this way, they can customise their analyses, comparisons and visualisations according to the newly trained models. From a technical point of view, this tool could answer questions such as: "How can we make use of IntelComp components to train our own models?" or "How can we validate and interact in the process of creation and training?".
3. Evaluation Workbench: This tool targets Public Administrations and Funding Entities to
assist in the evaluation process of STI proposals. The Evaluation Workbench will assist in different tasks, such as: identifying possible evaluators whose expertise and profile match the thematic area under evaluation; contextualising the proposal within the STI information space by comparing it to existing patents, publications and funded projects; classifying proposals automatically according to available taxonomies; and, finally, checking whether similar proposals have already been funded. The Evaluation Workbench will assist in answering questions such as: "Who are the experts in a specific area that can act as evaluators?", "How could proposals be classified according to available taxonomies?" or "Has this proposal or a similar one been evaluated / funded before?".
4. STI Participation Portal: The tool targets stakeholders from academia, industry as well as
citizens. It allows stakeholders to visualise the general STI panorama and its evolution across
the different domains at the national, regional or institutional level. It also links this
information with trending topics and provides some insights on the lag between the STI
outcomes and the social media impact. Moreover, stakeholders will also be able to interact
and share their views and feedback through the Participation Mailbox to guarantee channels
for an ongoing co-creation process. In this sense, the Participation Portal will assist in
providing answers to the following questions, among others: “What are the thematic
domains that have been funded? or In which areas were the STI public funds spent?” or
“Which are the emerging areas?” or “Which entities are the most active in the STI
panorama?” or “Where do we stand nationally or regionally with respect to other countries,
regions, etc.?”.
The tools will provide different visualisations and services according to the users’ profiles.
Examples include:
● "Enriched" business intelligence panels with topic data and graph exploration
● Graphs for recursive information navigation (for large corpora or multi-corpora logical datasets)
● Topic model exploration tools, for static and dynamic models
● Inter-corpus comparison tools
● Bipartite graphs
● Other services supported by back-office elements, using unstructured text as input (e.g., classification services, topic inference, machine translation, etc.)
6. SERVICES
The identified measurements described in Sections 2 and 3 showcase the needs of STI policy makers and public administrators for both structured and unstructured data. IntelComp's technological proposal consists of the development of a platform that brings together a series of data analysis tools to provide the evidence demanded by the proposed policy-making framework. In broad and admittedly simplified terms, the procedure will involve three phases linked to corresponding IntelComp work packages: 1) data acquisition and homogenisation, 2) enrichment of the datasets applying state-of-the-art AI and NLP techniques, and 3) visualisation of results. Users will be involved in all the steps of this procedure through
the co-creation activities that will be carried out in the living labs. From an information
enrichment point of view, and with the aim of providing data-based evidence that goes beyond
traditional metadata-based analysis, IntelComp focuses on applying NLP techniques to unveil
relevant information connected to the target measurements. When necessary, we also consider
enriching the available information by extracting additional information from other data
sources, or using the Internet as a Data Source (e.g., for extracting the quartile of publications,
etc), but the main focus of the project and what we will describe in this section is the application
of AI pipelines.
In addition to other NLP auxiliary services, the five main services that IntelComp relies on for
data enrichment are the following:
1. Service for domain-related subcorpus generation
2. Classification service
3. Advanced topic modelling service
4. Topic-based time analysis service
5. Graph-based impact analysis
Below we briefly describe the listed services, as well as their connection with the information demands and objective measures identified within the framework developed for evidence-based policy making.
6.1. Service for domain-related subcorpus generation
In IntelComp we build domain-agnostic tools, but we are aware that on most occasions the platform will be applied to analyse data from a specific domain. This is indeed the case for the three living labs considered in the project.
Since many of the datasets in the data lake are very wide in scope, we first need to identify the documents that are relevant for a specific domain. This question can in many cases not be answered in a completely objective manner: do we care just about core AI papers, or do we also wish to include application-related works?
For this reason, we envision a human-in-the-loop-based service for identifying documents
relevant for a particular domain using a relevance feedback mechanism. The basic structure of
the processing pipeline is shown in the figure below.
Figure 1: Basic structure of the processing pipeline to identify relevant documents
The main components in the process are the following:
● Data source: a corpus of STI documents, with some metadata. For some components of the process, it will be assumed that the corpus has been processed with NLP tools and by topic modelling algorithms (see Subsection 6.3).
● Initial document selection: a set of tools that facilitate the selection of a subset of documents from the domain specified by the user. In particular, the user will be allowed to select documents from the subcorpus in three ways:
  - By keywords: the user provides a list of keywords and a set of filters is applied to select a subset of documents highly scored with respect to the given keywords.
  - By topics: the user selects one or several topics from those inferred by the topic modelling service, possibly specifying a weight or importance value for each topic. A set of filters is applied to select a subset of documents highly scored with respect to the selected topics.
  - By definition: the user provides a label identifying a specific domain. Then, a zero-shot classifier is applied to select documents aligned with the label name. To do so, the classifier might use documents defining the category specified by the label (e.g., using related articles from Wikipedia).
● Machine learning (classification algorithm): after the document selection, a subset of documents from the target domain is available and used as the training set for a learning algorithm. Since the training set contains documents from the positive class only, standard supervised learning algorithms are not feasible, and PU (Positive-Unlabeled) learning models will be applied (a minimal sketch follows this list).
● Active learning: the active learning module provides a relevance-feedback mechanism to include a human in the loop. The user will be provided with tools to label specific documents from the positive and negative classes. This will be useful to refine the learning algorithm with a training set containing both positive and negative samples.
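To make the PU learning step concrete, the sketch below shows one well-known approach, the Elkan-Noto correction, applied on top of a simple TF-IDF / logistic-regression pipeline. It is a minimal illustration only: the feature extraction, model choice and function names are assumptions made for this example, not a description of IntelComp's actual implementation.

# Minimal PU-learning sketch (Elkan & Noto, 2008): train on positives vs.
# unlabeled documents, then rescale probabilities by the estimated label
# frequency c. All names and parameter values here are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_pu_classifier(positive_docs, unlabeled_docs):
    vec = TfidfVectorizer(max_features=50_000)
    X = vec.fit_transform(positive_docs + unlabeled_docs)
    # s = 1 for labelled (positive) documents, 0 for unlabeled ones.
    s = np.array([1] * len(positive_docs) + [0] * len(unlabeled_docs))
    X_tr, X_hold, s_tr, s_hold = train_test_split(
        X, s, test_size=0.2, stratify=s, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, s_tr)
    # Estimate c = P(s=1 | y=1) on held-out positives.
    c = clf.predict_proba(X_hold[s_hold == 1])[:, 1].mean()
    return vec, clf, c

def domain_relevance(vec, clf, c, docs):
    # P(y=1 | x) is approximated by P(s=1 | x) / c, clipped to [0, 1].
    return np.clip(clf.predict_proba(vec.transform(docs))[:, 1] / c, 0.0, 1.0)

An active-learning loop could then repeatedly surface the documents whose corrected scores are closest to 0.5 for expert labelling, so that the training set gradually acquires confirmed negative examples as well.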
6.2. Classification Service
Some of the requested measurements, as well as comparative analyses, need the joint analysis of several datasets. Experts find it convenient to work with some of the best-known taxonomies, which are connected to their intuition, but the issue is that different datasets include heterogeneous taxonomies, which makes joint analysis difficult. A second issue is that in many cases labelling is carried out by the author or the evaluators of the document (paper, project proposal, etc.), which introduces biases.
The classification service aims at producing labels associated with existing taxonomies, so that
the comparison can be carried out along these dimensions. It will allow labelling a dataset
according to a taxonomy which is not available for that dataset, or even relabelling documents
that have not been correctly labelled. The output of the classifiers will allow the end user to
objectively compare the similarity between documents from different datasets.
In order to do so, we will train supervised classifiers whenever possible. This requires labelled data with several examples of documents belonging to each target class, so that the classifier can learn to predict them accurately. In the worst-case scenario, where no training data is available, the service will resort to a zero-shot text classification approach, even though its performance is known to be far from the state of the art.
The basic structure of the classification service is shown in the figure below.
Figure 2: Basic structure of the classification service
The main components are the following. The input to the service will be the data to be classified together with the desired taxonomy. At this point there are three possible scenarios:
● Taxonomy for which a classifier is already available: an already trained classifier will be used. Depending on the taxonomy, the classifier will be a single model or a cascade of models arranged in a hierarchical manner (as shown in the right part of the figure).
● New taxonomy with training data: the classification service will train a new classifier using the provided supervised data. This pipeline will also allow the user to arrange a set of models in a hierarchical structure. Note that training a new classifier from scratch requires a reasonably large amount of data for each label; the more the better.
● New taxonomy without training data: this situation should be avoided whenever possible, since classifying documents into a never-seen taxonomy without any training example is a hard problem, especially in large-scale classification scenarios. When it cannot be avoided, the system will resort to a zero-shot classifier (a minimal sketch follows this list). The idea of this classifier is to use an entailment method to compare the embeddings of the documents to the embeddings of the labels (or a definition of the labels extracted from a database like Wikipedia). These models can only be expected to perform reasonably well with the shallow labels of the taxonomies.
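As an illustration of the zero-shot scenario, the following minimal sketch uses an off-the-shelf natural language inference model through the Hugging Face transformers pipeline; the model checkpoint, candidate labels and example document are illustrative assumptions, not IntelComp's chosen configuration.

# Minimal zero-shot classification sketch via textual entailment;
# the model name and the labels below are illustrative only.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["Natural sciences", "Engineering and technology", "Medical and health sciences"]
doc = "We propose a deep learning method for early tumour detection in MRI scans."

# multi_label=True scores each candidate label independently.
result = classifier(doc, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")

In line with the caveat above, such scores are typically reliable only for shallow, well-separated labels near the top of a taxonomy.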
6.3. Advanced Topic Modelling Service
Topic modelling will be used to provide an additional dimension for analysis and comparison
with respect to existing taxonomies. This makes it feasible to analyse data with different levels
of granularity and detect niches that require specific consideration.
A pipeline of the processes involved in the topic modelling service is shown in the figure. The topic modelling algorithms are fed with a corpus of STI documents, possibly after some pre-processing using auxiliary NLP pipelines (lemmatisation, stopword removal, N-gram identification, etc.).
Figure 3: Topic modelling pipeline
The service includes standard topic modelling capabilities based on efficient and scalable implementations of the Latent Dirichlet Allocation (LDA) algorithm as well as algorithms based on neural networks. The service is expected to provide models based on corpora with tens of millions of documents. A topic model will produce two kinds of outputs:
● The topic model itself, which identifies and provides a characterisation of the most relevant themes for a particular dataset
● The assignment of documents to the topics in the model
In this way, we can automatically detect the main topics for a given dataset and include this
information for the analysis by the experts. Furthermore, since the number of topics can be
varied according to experts’ preferences, topic modelling offers a way to analyse data with
different granularity levels.
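As a minimal illustration of the core modelling step, the sketch below uses gensim's LDA implementation on a toy corpus; the documents and parameter values are placeholders, and the production service relies on efficient, scalable implementations suited to much larger corpora.

# Minimal LDA sketch with gensim; the toy corpus and parameters are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Pre-processed documents: lemmatised tokens with stopwords already removed.
docs = [
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["neural", "network", "topic", "embedding"],
    ["policy", "evaluation", "research", "funding"],
]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]

# num_topics sets the granularity of the analysis and can be tuned by experts.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Output 1: the model itself, i.e. a characterisation of each topic.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])

# Output 2: the (soft) assignment of each document to the topics.
for bow in bow_corpus:
    print(lda.get_document_topics(bow))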
Compared with existing fully automatic topic modelling implementations, IntelComp's advanced topic modelling will bring modifications to satisfy the requirements of policy analysts: more stable topics, better alignment with other available metadata, automatic labelling of topics, and a set of editing capabilities. These will be provided to the experts inside a tool for model training, to facilitate the construction of high-quality models that are aligned with expert intuition.
The service will also implement hierarchical models that provide a topic description at different levels of resolution. Higher-level topics provide a broad description of the corpus, while lower levels provide information about the internal structure of topics as a collection of subtopics.
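One simple way to realise such a hierarchy, sketched below, is to train a second-level model on the documents dominated by each top-level topic. The recursive scheme, thresholds and function names are assumptions made for illustration, not IntelComp's final design.

# Two-level topic hierarchy built by recursively applying LDA; illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def two_level_lda(token_docs, n_top=5, n_sub=3, threshold=0.5):
    dictionary = Dictionary(token_docs)
    bow = [dictionary.doc2bow(d) for d in token_docs]
    top = LdaModel(bow, num_topics=n_top, id2word=dictionary, passes=10, random_state=0)
    submodels = {}
    for t in range(n_top):
        # Documents whose weight for topic t exceeds the threshold.
        members = [d for d, b in zip(token_docs, bow)
                   if dict(top.get_document_topics(b)).get(t, 0.0) > threshold]
        if len(members) > n_sub:  # only model subtopics where there is material
            sub_dict = Dictionary(members)
            sub_bow = [sub_dict.doc2bow(d) for d in members]
            submodels[t] = LdaModel(sub_bow, num_topics=n_sub,
                                    id2word=sub_dict, passes=10, random_state=0)
    return top, submodels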
In IntelComp, the information obtained from the topic models may be exploited jointly with that obtained through the classification modules, or with taxonomic information available directly for some of the data sources used. That is, the user will be able to simultaneously view the available taxonomies, those inferred through the classification modules, and the calculated topics, or a subset of these, as well as study the relationships between them and other available metadata (e.g., temporal or geographic information). In this sense, the information on topics adds value compared to the available taxonomies since, for example, it:
● allows analysing the data with different levels of granularity, e.g., by analysing specific topics included within the same category of the taxonomy
● being a completely automatic approach, allows identifying novel topics not included in a specific taxonomy
● allows a soft assignment of documents to different topics
6.4. Topic-based time analysis service
The Dynamic Topic Modelling service assigns one or more topics to a publication using the title, abstract, and venue of the publication. A pipeline of the processes involved is shown in the figure below. Given DOI-venue-abstract triplets collected from scientific articles, input pre-processing transforms the textual data into a useful input for the next parts of the pipeline. At the same time, a disambiguation rule is applied to the name of the venue in which the publications were published.
Hierarchical Classification is applied to detect the fields of science (FOS) of the publication. An extended version of the Frascati manual developed by the Organisation for Economic Co-operation and Development (OECD) is used to detect fields of science at different granularities.
The simultaneous hierarchical classification allows a dynamic assignment of topics across the
scientific domains rather than assigning a general topic from a universally trained topic model.
Graph Analysis is applied to detect sets of venues that form the topics of a specific field of
science. The graph is developed using the publication venues and their connections through
publication citations from millions of publications.
Further detailed classification is provided using keyword extraction per field of study; grouping the keywords allows the detection of more fine-grained, dynamic topics formed within a specific field of science. Keyword extraction can detect more subtle topics addressed in each publication separately.
Figure 4: Dynamic topic modelling pipeline
Overall, the pipeline consists of pre-trained modules (disambiguation, graphs, keyword extraction) and can be applied to collections of publications to detect dynamic topics. The Dynamic Topic Analysis extends previously introduced topic modelling methods by correlating topics to a scientific classification schema with different granularities of detail.
Further, given the dynamic topics and using the year of publication of the input data, we are
able to create per-year and per-topic collections and conduct different types of time analysis,
such as, but not limited to: detection of emerging topics and lead-lag analysis.
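The sketch below shows, with pandas, what such a per-year, per-topic aggregation and a naive emerging-topic flag could look like; the input schema, toy data and growth threshold are illustrative assumptions rather than the service's actual design.

# Per-year, per-topic aggregation with a simple growth-based emerging-topic flag.
import pandas as pd

# One row per publication, as produced by the upstream pipeline (assumed schema).
pubs = pd.DataFrame({
    "year":  [2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2020],
    "topic": ["graph ML", "topic models", "graph ML", "graph ML", "topic models",
              "graph ML", "graph ML", "graph ML", "graph ML", "topic models"],
})

# Publication counts per year and topic.
counts = pubs.groupby(["year", "topic"]).size().unstack(fill_value=0)

# Year-over-year growth; topics growing faster than the threshold are flagged.
growth = counts.pct_change()
emerging = growth.iloc[-1][growth.iloc[-1] > 0.5].index.tolist()
print(counts)
print("Emerging topics:", emerging)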
6.5. Graph-based impact analysis
A set of graph analysis tools will be incorporated in IntelComp to facilitate the analysis of the
impact of research agents (authors, inventors, institutions, publications) in their respective
fields.
To do so, different types of graphs will be generated from the text corpora and the metadata
contained in the STI data sources (patents, publications, funding applications, etc.). The general
structure of the processing pipelines is illustrated in the Figure below.
Figure 5: Structure of the processing pipelines for graph-based impact analysis
Graphs may be used to encode and represent different types of relations between documents (similarities, citations, co-citations), authors (cooperation, semantic similarity, citation), or institutions (cooperation, etc.). Bipartite graphs will be used to connect documents to their authors and their funding institutions, or to their clusters or communities.
Graph inference methods and community detection algorithms can be applied to identify the
cluster structure of documents and agents. This, in combination with graph metrics to analyse
the impact or the relevance of nodes in graphs, can be used to extract information about the
particular role of each member of the network in its community, or the impact of a specific
publication or author in the advancement of a research field.
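As a minimal illustration under assumed toy data, the sketch below builds a small citation graph with networkx, detects communities, and uses PageRank as one possible impact proxy; the graph, algorithm choices and example data are illustrative, not a prescription of the final toolset.

# Community detection and impact metrics on a toy citation graph; illustrative.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Directed citation graph: an edge u -> v means document u cites document v.
G = nx.DiGraph([("p1", "p2"), ("p3", "p2"), ("p3", "p4"),
                ("p5", "p4"), ("p5", "p2"), ("p6", "p5")])

# Community detection on the undirected view reveals the cluster structure.
communities = greedy_modularity_communities(G.to_undirected())

# PageRank as an impact proxy: highly cited, well-connected nodes score higher.
impact = nx.pagerank(G)

for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")
print({node: round(score, 3) for node, score in impact.items()})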
7. GAP ANALYSIS
A gap analysis is performed to compare the domain-agnostic conceptual framework, i.e. the identified needs of STI policy stakeholders, to the provisional implementation plan in IntelComp. In other words, the gap analysis identifies the policy questions which cannot be addressed with the tools of IntelComp.
Table 28: Typologies of policy questions not addressable in IntelComp
(Columns: policy phase | policy rationale | policy question)

I. Policy questions which require traditional data
- Evaluation | Skills | How many people were trained as technicians? How many people were trained as researchers?
- Evaluation | Taxes | How much tax income was generated?

II. Policy questions which are best analysed through qualitative methods
- Agenda setting | Market formation | What is the regulation globally for these technologies?
- Agenda setting | Legitimacy | What are the reasons justifying the political choices made?

III. Policy questions for which no data source is available
- Agenda setting | Knowledge diffusion | Which networks, e.g. clusters, hubs, intermediaries, operate nationally per discipline? Which financial resources were most effectively used in the previous cycle (evidence from the evaluation part of the cycle)? [exclude as question]
- Agenda setting | Resources mobilisation | What is the size of resources needed to become competitive in each emerging technology? What type of resources can be mobilised outside the national public funding (EU, foundations)?
- Evaluation | Innovation | How many patents were licensed? How many patents were used in-house? How much royalty income did patents produce?
- Evaluation | Markets | Has public procurement of innovation created lead markets?
- Evaluation | Markets | Has the regulation adopted facilitated the creation of / access to new markets?

IV. Policy questions which require statistical analysis or other methods
- Evaluation | Innovation | What was the contribution of innovations to turnover, profits, market shares?
- Evaluation | Jobs | What was the total employment created?
- Evaluation | Cost effectiveness | What was the cost per publication? At scientific discipline level?
- Evaluation | Cost effectiveness | What was the cost per patent? At scientific discipline level?
- Evaluation | Cost effectiveness | What is the cost-benefit ratio of each programme?

V. Policy questions which can only marginally inform the policy question
- Agenda setting | Entrepreneurial activity | Are scale-ups leaving the country?
- Evaluation | Jobs | How many new jobs were created for researchers during the project?
REFERENCES
● Eurostat. 2020. European Statistical System handbook for quality and metadata reports. Available at: https://ec.europa.eu/eurostat/documents/3859598/10501168/KS-GQ-19-006-EN-N.pdf/bf98fd32-f17c-31e2-8c7f-ad41eca91783?t=1583397712000
● European Commission. 2020. Experimental big data statistics. Available at: https://ec.europa.eu/eurostat/cros/content/Experimental_big_data_statistics_en
● Deloitte. 2016. Big data analytics for policy making. Available at: https://joinup.ec.europa.eu/sites/default/files/document/2016-07/dg_digit_study_big_data_analytics_for_policy_making.pdf
● Eurostat. 2017. Towards a harmonised methodology for statistical indicators. Available at: https://ec.europa.eu/eurostat/documents/3859598/8071770/KS-GQ-17-007-EN-N.pdf/7d34c904-2d07-4e71-bd6f-8fe9ee373b60
APPENDIX I – LONG LIST OF SOURCES CONSIDERED

Typology: Company financials/websites
- Open corporates: Open database of companies (200 million companies)
- European e-justice Business Registers: Compilation of business registers in the EU, Iceland, Liechtenstein and Norway
- RISIS FirmReg: A reference register on private actors, combining the firms from three firm datasets (CIB, VICO and Cheetah) with their linkages, enabling actor-level harmonisation at European level. Currently at the prototype stage.

Typology: Skills demand
- Euraxess: European Commission's job offers and funding opportunities platform for researchers
- Cedefop: Toolkit of sources of labour market intelligence, with complete economy coverage

Typology: Skills supply
- LinkedIn: Public profiles of professionals associated to specific skills or to FP programmes' positions

Typology: Innovation
- Patstat: Online inventory of patents with complete coverage (more than 100 million patent documents)
- ETSI – standards: Online IPR database (14,826 standards from 352 companies) for the telecommunication sector, hence no coverage of the whole economy
- ISO micro data – standards: Complete database for European standards, but not informative on other standards
- Github: Code repositories used by 4+ million companies; more than 200 million codes available. Country of repositories to be retrieved from the contributors.
- EUIPO trademarks and design: Inventory of trademarks and designs covering 40 million trademarks and 9 million designs, used by 200 countries

Typology: Science
- OpenAire: Open access publications platform (with 128M deduplicated publications)
- Cordis: Research activities and outputs in the frame of the H2020 programmes

Typology: Investments (private)
- Crunchbase: Inventory of worldwide companies with comprehensive information on their funding rounds
- National VC: Own compilation of venture capital websites
- National Investment Laws: Living Lab specific, as heterogeneous across countries; e.g. in Greece all investments supported by the State are public

Typology: Legislation
- EURLEX: Online database of European Union treaties, legal acts, consolidated texts, international agreements, etc.

Typology: Policy documents
- Overton: Index of policy literature with comprehensive publication information
- Parliament discussion minutes: Minutes of the European Parliament (different committees)
- Government sources: National governments' policy documents, based on the compilation of national governments' sources; Living Lab specific, as heterogeneous across countries
- EU publications: Repository of publications by all EU entities and agencies

Typology: Policy documents (evaluations and IAs)
- SIPER: Repository of research and innovation policy evaluations, EU and OECD countries
- Fteval: Repository of the Austrian Platform for Research and Technology Evaluation. Includes mainly European countries' evaluations.

Typology: Foresight studies
- EC; Competence centre on foresight; OECD strategic foresight: Compilation of studies shaping R&D future orientations from different institutions

Typology: Procurement
- TED: Online database of active and past public procurement offers from local, national and European authorities for services, works and supplies. 4,390,327 tenders registered.
- National data on public procurement: Own compilation of national procurement websites; Living Lab specific, as heterogeneous across countries

Typology: Social media
- European Media Monitoring: The EU Competence Centre on Text Mining and Analysis extracts information from online data, including traditional or social media, or from large public or proprietary document sets
- Twitter: Twitter activity (tweets) of pre-identified actors: innovative companies, FP projects, beneficiaries. Tweets and their associated reach are considered as dissemination activities and citizen engagement mechanisms.

Typology: Online media
- Online news: Press announcements for radical technologies
APPENDIX II – SELECTION CRITERIA FOR POLICY QUESTIONS
Figure 6: Criteria selection for policy questions