LoRaLay: A Multilingual and Multimodal Dataset for

Long Range and Layout-Aware Summarization


Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano

To cite this version:

Laura Nguyen, Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano. LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization. The 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), May 2023, Dubrovnik, Croatia. ⟨hal-03992015⟩

HAL Id: hal-03992015


https://hal.science/hal-03992015
Submitted on 16 Feb 2023

Laura Nguyen¹﹐³   Thomas Scialom²﹐*   Benjamin Piwowarski³   Jacopo Staiano⁴﹐*

¹ reciTAL, Paris, France   ² Meta AI, Paris, France
³ Sorbonne Université, CNRS, ISIR, F-75005 Paris, France
⁴ University of Trento, Italy

laura@recital.ai   tscialom@fb.com
benjamin.piwowarski@cnrs.fr   jacopo.staiano@unitn.it
Abstract

Text Summarization is a popular task and an active area of research for the Natural Language Processing community. It requires accounting for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually-rich layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with visual/layout information and propose four novel datasets – consistently built from scholarly resources – covering the French, Spanish, Portuguese, and Korean languages. Further, we propose new baselines merging layout-aware and long-range models – two orthogonal approaches – and obtain state-of-the-art results, showing the importance of combining both lines of research.

* Work partially done while at reciTAL.

1 Introduction

Deep learning techniques have enabled remarkable progress in Natural Language Processing (NLP) in recent years (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). However, the majority of models, benchmarks, and tasks have been designed for unimodal approaches, i.e. focusing exclusively on a single source of information, namely plain text. While it can be argued that for specific NLP tasks, such as textual entailment or machine translation, plain text is all that is needed, there exist several tasks for which disregarding the visual appearance of text is clearly sub-optimal: in a real-world context (business documentation, scientific articles, etc.), text does not naturally come as a sequence of characters, but is rather displayed in a bi-dimensional space containing rich visual information. The layout of e.g. this very paper provides valuable semantics to the reader: in which section are we right now? At the blink of an eye, this information is readily accessible via the salient section title (formatted differently and placed to highlight its role) preceding these words. Just to emphasize this point, imagine having to scroll through this content in plain text to access such information.

In the last couple of years, the research community has shown a growing interest in addressing these limitations. Several approaches have been proposed to deal with visually-rich documents and integrate layout information into language models, with direct applications to Document Understanding tasks. Joint multi-modal pretraining (Xu et al., 2021; Powalski et al., 2021; Appalaraju et al., 2021) has been key to reaching state-of-the-art performance on several benchmarks (Jaume et al., 2019; Graliński et al., 2020; Mathew et al., 2021). Nonetheless, a remaining limitation is that these (transformer-based) approaches are not suitable for processing long documents, the quadratic complexity of self-attention constraining their use to short sequences. Such models are hence unable to encode global context (e.g. long-range dependencies among text blocks).

Focusing on compressing the most relevant information from long texts into short summaries, the Text Summarization task naturally lends itself to benefiting from such global context. Notice that, in practice, the limitations linked to sequence length are also amplified by the lack of visual/layout information in the existing datasets. Therefore, in this work, we aim at spurring further research on how to incorporate multimodal information to better capture long-range dependencies.

Our contributions can be summarized as follows:
• We extend two popular datasets for long-range summarization, arXiv and PubMed (Cohan et al., 2018), by including visual and layout information – thus allowing direct comparison with previous works;

• We release 4 additional layout-aware summarization datasets (128K documents), covering the French, Spanish, Portuguese, and Korean languages;

• We provide baselines including adapted architectures for multi-modal long-range summarization, and report results showing that (1) performance is far from being optimal; and (2) layout provides valuable information.

All the datasets are available on HuggingFace.¹

2 Related Work

2.1 Layout/Visually-rich Datasets

Document Understanding covers problems that involve reading and interpreting visually-rich documents (in contrast to plain texts), requiring comprehension of the conveyed multimodal information. Hence, several tasks with a central layout aspect have been proposed by the document understanding community. Key Information Extraction tasks consist in extracting the values of a given set of keys, e.g., the total amount in a receipt or the date in a form. In such tasks, documents have a layout structure that is crucial for their interpretation. Notable datasets include FUNSD (Jaume et al., 2019) for form understanding in scanned documents, and SROIE (Huang et al., 2019), as well as CORD (Park et al., 2019), for information extraction from receipts. Graliński et al. (2020) elicit progress on deeper and more complex Key Information Extraction by introducing the Kleister datasets, a collection of business documents with varying lengths, released as PDF files. However, the documents in Kleister often contain single-column layouts, which are simpler than the various multi-column layouts considered in LoRaLay. Document VQA is another popular document understanding task that requires processing the multimodal information (e.g., text, layout, font style, images) conveyed by a document in order to answer questions about a visually-rich document (e.g., What is the date given at the top left of the form?, Whose picture is given in this figure?). The DocVQA dataset (Mathew et al., 2021) and InfographicsVQA (Mathew et al., 2022) are commonly-used VQA datasets that respectively provide industry documents and infographic images, encouraging research on understanding documents with a complex interplay of text, layout and graphical elements. Finally, to foster research on visually-rich document understanding, Borchmann et al. (2021) introduce the Document Understanding Evaluation (DUE) benchmark, a unified benchmark for end-to-end document understanding, created by combining several datasets. DUE includes several available and transformed datasets for VQA, Key Information Extraction and Machine Reading Comprehension tasks.

2.2 Existing Summarization Datasets

Several large-scale summarization datasets have been proposed to boost research on text summarization systems. Hermann et al. (2015) proposed the CNN/DailyMail dataset, a collection of English articles extracted from the CNN and The Daily Mail portals. Each news article is associated with multi-sentence highlights which serve as reference summaries. Scialom et al. (2020) bridge the gap between English and non-English resources for text summarization by introducing MLSum, a large-scale multilingual summarization corpus providing news articles written in French, German, Spanish, Turkish and Russian. Going toward more challenging scenarios involving significantly longer documents, the arXiv and PubMed datasets (Cohan et al., 2018) consist of scientific articles collected from academic repositories, wherein the paper abstracts are used as summaries. To encourage a shift towards building more abstractive summarization models with global content understanding, Sharma et al. (2019) introduce BIGPATENT, a large-scale dataset made of U.S. patent filings. Here, invention descriptions serve as reference summaries.

The vast majority of summarization datasets only deal with plain text documents. As opposed to other Document Understanding tasks (e.g., form understanding, visual QA), in which the placement of text on the page and/or visual components are the main source of information needed to find the desired data (Borchmann et al., 2021), text plays a predominant role in document summarization. However, guidelines for summarizing texts – especially long ones – often recommend roughly previewing them to break them down into their major sections (Toprak and Almacioğlu, 2009; Luo et al., 2019). This suggests that NLP systems might leverage multimodal information in documents. Miculicich and Han (2022) propose a two-stage method which detects text segments and incorporates this information in an extractive summarization model. Cao and Wang (2022) collect a new dataset for long and structure-aware document summarization, consisting of 21k documents written in English and extracted from WikiProject Biography.

Although not all documents are explicitly organized into clearly defined sections, the great majority contain layout and visual clues (e.g., a physical organization into paragraphs, bigger headings/subheadings) which help structure their textual contents and facilitate reading. Thus, we argue that layout is crucial to summarizing long documents. We propose a corpus of more than 345K long documents with layout information. Furthermore, to address the need for multilingual training data (Chi et al., 2020), we include not only English documents, but also French, Spanish, Portuguese and Korean ones.

3 Datasets Construction

Inspired by the way the arXiv and PubMed datasets were built (Cohan et al., 2018), we construct our corpus from research papers, with abstracts as ground-truth summaries. As the PDF format allows simultaneous access to textual, visual and layout information, we collect PDF files to construct our datasets, and provide their URLs.²

For each language, we select a repository that contains a high number of academic articles (on the order of hundreds of thousands) and provides easy access to abstracts. More precisely, we chose the following repositories:

• Archives Ouvertes HAL (French),³ an open archive of scholarly documents from all academic fields. As HAL is primarily directed towards French academics, a great proportion of articles are written in French;

• SciELO (Spanish and Portuguese),⁴ an open access database of academic articles published in journal collections from Latin America, the Iberian Peninsula and South Africa, and covering a broad range of topics (e.g. agricultural sciences, engineering, health sciences, letters and arts). Languages include English, Spanish, and Portuguese;

• KoreaScience (Korean),⁵ an open archive of Korean scholarly publications in the fields of natural sciences, life sciences, engineering, and humanities and social sciences. Articles are written in English or Korean.

Further, we provide enhanced versions of the arXiv and PubMed datasets, respectively denoted as arXiv-Lay and PubMed-Lay, for which layout information is provided.

3.1 Collecting the Data

Extended Datasets   The arXiv and PubMed datasets (Cohan et al., 2018) contain long scientific research papers extracted from the arXiv and PubMed repositories. We augment them by providing their PDFs, allowing access to layout and visual information. As the abstracts contained in the original datasets are all lowercased, we do not reuse them, but rather extract the raw abstracts using the corresponding APIs.

Note that we were unable to retrieve all the original documents. For the most part, we failed to retrieve the corresponding abstracts, as they did not necessarily match the ones contained in the PDF files (due to e.g. PDF-parsing errors). We also found that some PDF files were unavailable, while others were corrupted or scanned documents.⁶ In total, about 39% (35%) of the original documents in arXiv (PubMed) were lost.

arXiv-Lay   The original arXiv dataset (Cohan et al., 2018) was constructed by converting the LaTeX files to plain text. To be consistent with the other datasets – for which LaTeX files are not available – we instead use the PDF files to extract both text and layout elements. For each document contained in the original dataset, we fetch (when possible) the corresponding PDF file using Google Cloud Storage buckets. As opposed to the original procedure, we do not remove tables nor discard sections that follow the conclusion. We retrieve the corresponding abstracts from a metadata file provided by Kaggle.⁷

¹ https://hf.co/datasets/nglaura/arxivlay-summarization, https://hf.co/datasets/nglaura/pubmedlay-summarization, https://hf.co/datasets/nglaura/hal-summarization, https://hf.co/datasets/nglaura/scielo-summarization, https://hf.co/datasets/nglaura/koreascience-summarization
² We make the corpus-construction code publicly available at https://github.com/recitalAI/loralay-datasets.
³ https://hal.archives-ouvertes.fr/
⁴ https://www.scielo.org/
⁵ http://www.koreascience.or.kr
⁶ For more details on this, see Section A.1 in the Appendix.
⁷ https://www.kaggle.com/Cornell-University/arxiv
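As a rough sketch of the HAL collection step described above, a paginated search query restricted to journal (ART) and conference (COMM) papers might be built as below. The field and filter names (`q`, `fl`, `rows`, `start`, `wt`, `docType_s`) reflect the public HAL search API, not the paper's released code, and the exact query used by the authors is an assumption.

```python
from urllib.parse import urlencode

HAL_SEARCH = "https://api.archives-ouvertes.fr/search/"

def hal_query_url(start: int = 0, rows: int = 100) -> str:
    """Build a paginated HAL search URL restricted to journal (ART) and
    conference (COMM) papers in French, returning the HAL id, abstract,
    and main PDF link for each hit. Illustrative only."""
    params = {
        "q": "docType_s:(ART OR COMM) AND language_s:fr",
        "fl": "halId_s,abstract_s,fileMain_s",  # fields to return
        "rows": rows,    # page size
        "start": start,  # pagination offset
        "wt": "json",    # response format
    }
    return HAL_SEARCH + "?" + urlencode(params)
```

Iterating `start` in steps of `rows` would then walk the whole result set, downloading each `fileMain_s` PDF.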
PubMed-Lay   For PubMed, we use the PMC OAI Service⁸ to retrieve abstracts and PDF files.

HAL   We use the HAL API⁹ to download research papers written in French. To avoid excessively long (e.g. theses) or short (e.g. posters) documents, extraction is restricted to journal and conference papers.

SciELO   Using Scrapy,¹⁰ we crawl the following SciELO collections: Ecuador, Colombia, Paraguay, Uruguay, Bolivia, Peru, Portugal, Spain and Brazil. We download documents written either in Spanish or Portuguese, according to the metadata, obtaining two distinct datasets: SciELO-ES (Spanish) and SciELO-PT (Portuguese).

KoreaScience   Similarly, we scrape the KoreaScience website to extract research papers. We limit search results to documents whose publishers' names contain the word Korean. This rule was designed after sampling documents in the repository, and is the simplest way to get a good proportion of papers written in Korean.¹¹ Further, search is restricted to papers published between 2012 and 2021, as recent publications are more likely to have digital-born, searchable PDFs. Finally, we download the PDF files of documents that contain an abstract written in Korean.

3.2 Data Pre-processing

For each corpus, we use the 95th percentile of the page distribution as an upper bound to filter out documents with too many pages, while the 5th (1st for HAL and SciELO) percentile of the summary length distribution is used as a minimum threshold to remove documents whose abstracts are too short. As our baselines do not consider visual information, we only extract text and layout from the PDF files. Layout is incorporated by providing the spatial position of each word in a document page image, represented by its bounding box (x0, y0, x1, y1), where (x0, y0) and (x1, y1) respectively denote the coordinates of the top-left and bottom-right corners. Using the PDF rendering library Poppler,¹² text and word bounding boxes are extracted from each PDF, and the sequence order is recovered based on heuristics around the document layout (e.g., tables, columns). Abstracts are then removed by searching for exact matches; when no exact match is found, we use fuzzysearch¹³ and regex¹⁴ to find near matches.¹⁵ For the non-English datasets, documents might contain several abstracts, written in different languages. To avoid information leakage, we retrieve the abstract of each document in every language available – according to the API for HAL, or the websites for SciELO and KoreaScience – and remove them using the same strategy as for the main language. In case an abstract cannot be found, we discard the document to prevent any unforeseen leakage. The dataset construction process is illustrated in Section A in the Appendix.

3.3 Datasets Statistics

The statistics of our proposed datasets, along with those computed on existing summarization datasets of long documents (Cohan et al., 2018; Sharma et al., 2019), are reported in Table 1. We see that document lengths are comparable to or greater than those in the arXiv, PubMed and BigPatent datasets.

For arXiv-Lay and PubMed-Lay, we retain the original train/validation/test splits and try to reconstruct them as faithfully to the originals as possible. For the new datasets, we order documents based on their publication dates and provide splits following a chronological ordering. For HAL and KoreaScience, we retain 3% of the articles as validation data, 3% as test, and the remaining as training data. To match the number of validation/test documents in HAL and KoreaScience, we split the data into 90% for training, 5% for validation and 5% for test, for both SciELO datasets.

4 Experiments

4.1 Models

For reproducibility purposes, we make the models' implementation, along with the fine-tuning and evaluation scripts, publicly available.¹⁶

We do not explore the use of visual information in long document summarization, as the focus is on evaluating baseline performance using state-of-the-art summarization models augmented with layout information. While visual features might provide a better understanding of structures such as tables and figures, we do not expect substantial gains with

⁸ https://www.ncbi.nlm.nih.gov/pmc/tools/oai/
⁹ https://api.archives-ouvertes.fr/docs/search
¹⁰ https://scrapy.org/
¹¹ For further details, see Section A.2 in the Appendix.
¹² https://poppler.freedesktop.org/
¹³ https://pypi.org/project/fuzzysearch/
¹⁴ https://pypi.org/project/regex/
¹⁵ We use a maximum Levenshtein distance of 20 with fuzzysearch, and a maximum number of errors of 3 with regex.
¹⁶ https://github.com/recitalAI/loralay-modeling
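The abstract-removal step of Section 3.2 (an exact match first, then approximate matching) can be approximated with the standard library alone. In the sketch below, difflib stands in for the fuzzysearch and regex packages used in the actual pipeline, so the 0.9 similarity ratio is an illustrative assumption rather than the paper's Levenshtein-distance settings.

```python
import difflib

def remove_abstract(body: str, abstract: str, min_ratio: float = 0.9) -> str:
    """Delete the abstract from the extracted body text.

    Tries an exact match first, then falls back to an approximate match.
    difflib is a stdlib stand-in for the fuzzysearch/regex libraries
    used in the paper's pipeline; min_ratio is an illustrative threshold.
    """
    idx = body.find(abstract)
    if idx != -1:  # exact match found
        return body[:idx] + body[idx + len(abstract):]
    # Approximate match: locate the longest shared block and require it
    # to cover most of the abstract before deleting it.
    matcher = difflib.SequenceMatcher(None, body, abstract, autojunk=False)
    match = matcher.find_longest_match(0, len(body), 0, len(abstract))
    if match.size / max(len(abstract), 1) >= min_ratio:
        return body[:match.a] + body[match.a + match.size:]
    return body  # no confident match: caller should discard the document
```

Returning the body unchanged when no confident match is found mirrors the paper's choice to discard such documents rather than risk leakage.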
respect to layout-aware models. Indeed, the information provided in figures (i.e., information that cannot be captured by layout or text) is commonly described in the caption or related paragraphs.

Dataset                          # Docs     Mean Article Length   Mean Summary Length
arXiv (Cohan et al., 2018)       215,913    3,016                 203
PubMed (Cohan et al., 2018)      133,215    4,938                 220
BigPatent (Sharma et al., 2019)  1,341,362  3,572                 117
arXiv-Lay                        130,919    7,084                 125
PubMed-Lay                       86,668     4,038                 144
HAL                              46,148     4,543                 134
SciELO-ES                        23,170     4,977                 172
SciELO-PT                        21,563     6,853                 162
KoreaScience                     37,498     3,192                 95

Table 1: Datasets statistics. Article and summary lengths are computed in words. For KoreaScience, words are obtained via white-space tokenization. The difference between arXiv and arXiv-Lay is due to the fact that we retain the whole document, while Cohan et al. (2018) truncate it after the conclusion.

                                 Instances                 Input Length        Output Length
Dataset                      Train     Dev    Test     Median   90%-ile     Median   90%-ile
arXiv (Cohan et al., 2018)   203,037   6,436  6,440    6,151    14,405      171      352
PubMed (Cohan et al., 2018)  119,924   6,633  6,658    2,715    6,101       212      318
arXiv-Lay                    122,189   4,374  4,356    6,225    12,541      150      249
PubMed-Lay                   78,234    4,084  4,350    3,761    7,109       182      296
HAL                          43,379    1,384  1,385    4,074    8,761       179      351
SciELO-ES                    20,853    1,158  1,159    4,859    8,519       226      382
SciELO-PT                    19,407    1,078  1,078    6,090    9,655       239      374
KoreaScience                 35,248    1,125  1,125    2,916    5,094       219      340

Table 2: Datasets splits and statistics. Input and output lengths are computed in tokens, obtained using Pegasus and MBART-50's tokenizers for the English and non-English datasets, respectively.

Text-only models with standard input size   We use Pegasus (Zhang et al., 2020) as a text-only baseline for arXiv-Lay and PubMed-Lay. Pegasus is an encoder-decoder model pre-trained using gap-sentences generation, making it a state-of-the-art model for abstractive summarization. For the non-English datasets, we rely on a fine-tuned MBART as our baseline. MBART (Liu et al., 2020) is a multilingual sequence-to-sequence model pretrained on large-scale monolingual corpora in many languages using the BART objective (Lewis et al., 2019). We use its extension, MBART-50 (Tang et al., 2020),¹⁷ which is created from the original MBART by extending its embedding layers and pre-training it on a total of 50 languages. Both Pegasus and MBART are limited to a maximum sequence length of 1,024 tokens, which is well below the median length of each dataset.

Layout-aware models with standard input size   We introduce layout-aware extensions of Pegasus and MBART, respectively denoted as Pegasus+Layout and MBART+Layout. Following LayoutLM (Xu et al., 2020), which is state-of-the-art on several document understanding tasks (Jaume et al., 2019; Huang et al., 2019; Harley et al., 2015), each token's bounding box coordinates (x0, y0, x1, y1) are normalized into integers in the range [0, 1000]. Spatial positions are encoded using four embedding tables, namely two for the coordinate axes (x and y), and the other two for the bounding box size (width and height). The layout representation of a token is formed by summing the resulting embedding representations. The final representation of a token is then obtained through point-wise summation of its textual, 1D-positional and layout embeddings.

Long-range, text-only models   To process longer sequences, we leverage BigBird (Zaheer et al., 2020), a sparse-attention based Transformer which reduces the quadratic dependency to a linear one. For arXiv-Lay and PubMed-Lay, we initialize BigBird from Pegasus (Zaheer et al., 2020), and for the non-English datasets, we use the weights of MBART. The resulting models are referred to as BigBird-Pegasus and BigBird-MBART. For both models, BigBird sparse attention is used only in the encoder. Both models can handle up to 4,096 input tokens, which is greater than the median length in PubMed-Lay, HAL and KoreaScience.

Long-range, layout-aware models   We also include layout information in long-range text-only models. Similarly to layout-aware models with standard input size, we integrate layout information into our long-range models by encoding each token's spatial position in the page. The resulting models are denoted as BigBird-Pegasus+Layout and BigBird-MBART+Layout.

Additional State-of-the-Art Baselines   We further consider additional state-of-the-art baselines for summarization: i) the text-only T5 (Raffel et al., 2019) with standard input size, ii) the long-range Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020), and iii) the layout-aware, long-range LED+Layout, which we implement similarly to the previous layout-aware models.

4.2 Implementation Details

We initialize our Pegasus-based and MBART-based models with, respectively, the google/pegasus-large and facebook/mbart-large-50 checkpoints shared through the Hugging Face Model Hub. As for T5 and LED, we use the weights from t5-base and allenai/led-base-16384, respectively.¹⁸

Following Zhang et al. (2020) and Zaheer et al. (2020), we fine-tune our models for up to 74k (100k) steps on arXiv-Lay (PubMed-Lay). On HAL, the total number of steps is set to 100k, while it is decreased to 50k for the other non-English datasets.¹⁹

For each model, we select the checkpoint with the best validation loss. For Pegasus and MBART models, inputs are truncated at 1,024 tokens. For BigBird-Pegasus models, we follow Zaheer et al. (2020) and set the maximum input length at 3,072 tokens. As the median input length is much greater in almost every non-English dataset, we increase the maximum input length to 4,096 tokens for BigBird-MBART models. Output length is restricted to 256 tokens for all models, which is enough to fully capture at least 50% of the summaries in each dataset.

For evaluation, we use beam search and report a single run for each model and dataset. Following Zhang et al. (2020) and Zaheer et al. (2020), we set the number of beams to 8 for Pegasus-based models, and 5 for BigBird-Pegasus-based models. For the non-English datasets, we set it to 5 for all models, for fair comparison. For all experiments, we use a length penalty of 0.8. For more implementation details, see Section B.1 in the Appendix.

5 Results and Discussion

5.1 General Results

In Table 3, we report the ROUGE-L scores obtained on the arXiv and PubMed datasets (as reported by Zaheer et al. (2020)), as well as on the corresponding layout-augmented counterparts we release.²⁰ On arXiv-Lay and PubMed-Lay, we observe that, while the addition of layout to Pegasus does not improve the ROUGE-L scores, there are gains in integrating layout information into BigBird-Pegasus.

To assess whether these gains are significant, we perform significance analysis at the 0.05 level using bootstrap, and estimate a ROUGE-L threshold that predicts when improvements are significant. ROUGE-L improvements between each pair of models are reported in Table 11 in the appendix. On arXiv-Lay, we compute a threshold of 1.48 ROUGE-L, showing that BigBird-Pegasus+Layout significantly outperforms all Pegasus-based models. In particular, we find a 1.56 ROUGE-L improvement between BigBird-Pegasus and its layout-augmented counterpart, demonstrating that the addition of layout to long-range modeling significantly improves summarization. On PubMed-Lay, we compute a threshold of 1.77. Hence, the 0.96 ROUGE-L improvement from BigBird-Pegasus to its layout-augmented counterpart is not significant. However, the variance in font sizes in PubMed-Lay is much smaller compared to arXiv-Lay (see Table 12 in the appendix), reflecting an overall more simplistic layout. Therefore, we argue that layout integration has a lesser impact on PubMed-Lay, which can explain the non-significance of the results. In addition, we find that BigBird-Pegasus significantly outperforms Pegasus and Pegasus+Layout only when augmented with layout, with an improvement of, respectively, 2.3 and 2.2 points. This demonstrates the importance of combining layout and long-range modeling.

While T5 and LED obtain competitive results, we find that the gain in adding layout to LED is minor. However, the models we consider have all been pre-trained only on plain text. As a result, the layout representations are learnt from scratch during fine-tuning. Similarly to us, Borchmann et al. (2021) show that their layout-augmented T5 does not necessarily improve the scores, and that performance is significantly enhanced only when the model has been pre-trained on layout-rich data.

Further, we observe, for both Pegasus and BigBird-Pegasus, a drop in performance w.r.t. the scores obtained on the original datasets. This can be explained by two factors. First, our extended

¹⁷ For the sake of clarity, we refer to MBART-50 as MBART.
¹⁸ The large versions of T5 and LED did not fit into GPU memory due to their size.
¹⁹ We tested different values for the number of steps (10k, 25k, 50k, 100k) and chose the one that gave the best validation scores for MBART.
²⁰ For detailed results, please refer to Section C.1 in the Appendix.
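The LayoutLM-style spatial encoding described in Section 4.1 can be sketched as follows. The hidden size, the randomly initialized embedding tables, and the function names are illustrative stand-ins (in the actual models these tables are trained jointly with the summarizer), but the normalization to [0, 1000] and the point-wise summation follow the description above.

```python
import numpy as np

def normalize_bbox(bbox, page_width, page_height):
    """Scale a word's (x0, y0, x1, y1) box into integers in [0, 1000]."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

rng = np.random.default_rng(0)
DIM = 16  # hidden size (illustrative; real models use the Transformer dim)
# Four embedding tables: x-axis, y-axis, width, height (1001 buckets each).
emb_x = rng.normal(size=(1001, DIM))
emb_y = rng.normal(size=(1001, DIM))
emb_w = rng.normal(size=(1001, DIM))
emb_h = rng.normal(size=(1001, DIM))

def layout_embedding(norm_bbox):
    """Sum the coordinate and size embeddings of a normalized box,
    mirroring the point-wise summation described in Section 4.1."""
    x0, y0, x1, y1 = norm_bbox
    return (emb_x[x0] + emb_x[x1]
            + emb_y[y0] + emb_y[y1]
            + emb_w[x1 - x0] + emb_h[y1 - y0])
```

The resulting vector would then be added to a token's textual and 1D-positional embeddings before being fed to the encoder.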
arXiv/ PubMed/
Model # Params arXiv-Lay PubMed-Lay
Pegasus (Zhang et al., 2020) 568M 38.83 41.34
BigBird-Pegasus (Zaheer et al., 2020) 576M 41.77 42.33
T5 (Raffel et al., 2019) 223M 37.90 39.23
LED (Beltagy et al., 2020) 161M 40.74 41.54
LED+Layout 165M 40.96 41.83
Pegasus 568M 39.07 39.75
Pegasus+Layout 572M 39.25 39.85
BigBird-Pegasus 576M 39.59 41.09
BigBird-Pegasus+Layout 581M 41.15 42.05

Table 3: ROUGE-L scores on arXiv-Lay and PubMed-Lay. Reported results obtained by Pegasus and BigBird-
Pegasus on the original arXiv and PubMed are reported with a gray background. The best results obtained on
arXiv-Lay and PubMed-Lay are denoted in bold.

HAL SciELO-ES SciELO-PT KoreaScience


Model # Params (fr) (es) (pt) (ko)
MBART 610M 42.00 36.55 36.42 16.94
MBART+Layout 615M 41.67 37.47 34.37 14.98
BigBird-MBART 617M 45.04 37.76 39.63 18.55
BigBird-MBART+Layout 621M 45.20 40.71 40.51 19.95

Table 4: ROUGE-L scores on the non-English datasets. The best results for each dataset are reported in bold.

Dataset Train Validation Test ther, we find that the plain-text BigBird models do
HAL (fr) 90.72 90.54 85.84 not improve over the layout-aware Pegasus and
SciELO-ES (es) 84.86 84.28 84.90 MBART on arXiv-Lay and SciELO-ES, demon-
SciELO-PT (pt) 90.95 90.58 91.96
KoreaScience (ko) 73.53 70.26 68.78
strating that simply capturing more context does
not always suffice. Regarding performance on Ko-
Table 5: Percent confidence obtained for the main lan- reaScience, we can see a significant drop in perfor-
guage, for each dataset split. First, our datasets contain less training data due to the inability to process all original documents. Secondly, the settings are different: while the original arXiv and PubMed datasets contain clear discourse information (e.g., each section is delimited by markers) obtained from LaTeX files, documents in our extended versions are built by parsing raw PDF files. Therefore, the task is more challenging for text-only baselines, as they have no access to the discourse structure of documents, which further underlines the importance of taking the structural information, brought by visual cues, into account.

Table 4 presents the ROUGE-L scores reported on the non-English datasets. On HAL, we note that BigBird-MBART does not benefit from layout. After investigation, we hypothesize that this is due to the larger presence of single-column and simple layouts, which makes layout integration less needed. On both SciELO datasets, we notice that combining layout with long-range modeling brings substantial improvements over MBART. Furthermore, on KoreaScience, we observe a drop in performance for every model w.r.t. the other non-English datasets. At first glance, we notice a high amount of English segments (e.g., tables, figure captions, scientific concepts) in documents in KoreaScience. To investigate this, we use the cld2 library21 to detect the language in each non-English document. We consider the percent confidence of the top-1 matching language as an indicator of the presence of the main language (i.e., French, Spanish, Portuguese or Korean) in a document, and average the results to obtain a score for the whole dataset. Table 5 reports the average percent confidence obtained on each split, for each dataset. We find that the percentage of text written in the main language in KoreaScience (i.e., Korean) is smaller than in other datasets. As the MBART-based models expect only one language in a document (the information is encoded using a special token), we claim the strong presence of non-Korean segments in KoreaScience causes them to suffer from interference problems. Therefore, we highlight that KoreaScience is a more challenging dataset, and

21 https://github.com/GregBowyer/cld2-cffi
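To make the language-confidence aggregation concrete, the following is a minimal sketch. The paper uses the cld2 bindings (cld2-cffi); since that is an external dependency, a naive word-list detector stands in for it here, and all names (naive_detect, main_language_score, lang_word_sets) are ours, not the paper's.

```python
from statistics import mean

def naive_detect(text, lang_word_sets):
    """Stand-in for a real detector such as cld2: returns the top-1
    language and a percent confidence, here estimated as the share of
    tokens found in a per-language word list."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in lang_word_sets.items()}
    top = max(scores, key=scores.get)
    confidence = 100.0 * scores[top] / max(len(tokens), 1)
    return top, confidence

def main_language_score(documents, main_lang, lang_word_sets,
                        detect=naive_detect):
    """Average, over a dataset split, the top-1 percent confidence when
    the detected language is the split's main language (else 0)."""
    confidences = []
    for doc in documents:
        lang, conf = detect(doc, lang_word_sets)
        confidences.append(conf if lang == main_lang else 0.0)
    return mean(confidences)
```

In this sketch, documents dominated by a language other than the split's main one pull the average down, which is the behavior used to flag KoreaScience.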
[Figure 1: three bar charts sharing the y-axis "Difference in ROUGE-L", one panel per feature: (a) Article length, (b) Summary length, (c) σ of bounding box height; each panel is bucketed by quartile range (n < Q1, Q1 ≤ n < Q2, Q2 ≤ n < Q3, n ≥ Q3).]

Figure 1: Benefit of using layout on arXiv-Lay (blue) and PubMed-Lay (red), defined as the difference in ROUGE-L scores between BigBird-Pegasus+Layout and BigBird-Pegasus. For each dataset, quartiles are calculated from the distributions of article lengths (a), summary lengths (b) and variance in the height of the bounding boxes (c). ROUGE-L scores are then computed per quartile range, and averaged over each range.
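The σ used in panel (c) can be computed per document from the token bounding boxes. Whether the paper uses the population or the sample deviation is not stated, so this sketch (with names of our choosing) picks the population form.

```python
from statistics import pstdev

def bbox_height_std(bounding_boxes):
    """Per-document standard deviation of token bounding-box heights,
    a proxy for font-size variety; boxes are (x0, y0, x1, y1) tuples.
    Population std is an assumption, the paper only writes sigma."""
    heights = [y1 - y0 for _, y0, _, y1 in bounding_boxes]
    return pstdev(heights)
```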

we hope our work will boost research on better long-range, multimodal and multilingual models.

Overall, results show a clear benefit of integrating layout information for long document summarization.

5.2 Human Evaluation

To gain more insight into the effect of document layout for summarizing long textual content, we conduct a human evaluation of summaries generated by BigBird-Pegasus/BigBird-MBART and their layout-aware counterparts. We choose the BigBird-based models over the LED ones, as the gain in augmenting BigBird with layout is much more apparent. We evenly sample 50 documents from arXiv-Lay and HAL test sets, filtering documents by their topics (computer science) to match the judgment capabilities of the three human annotators. We design an evaluation interface (see Section C.2 in the appendix). For each sentence si in the generated summary, we ask the annotators to highlight the relevant tokens in si, along with the equivalent parts in the ground-truth abstract (denoted hi). Further, we ask them to rate the summary in terms of coherence and fluency, on a scale of 0 to 5, following the DUC quality guidelines (Dang, 2005). Finally, annotators are asked to penalize summaries with hallucinated facts. The highlighting process allows us to compute precision and recall as the percentage of highlighted information in the generated summary and the ground-truth abstract, respectively. Moreover, we can compute an overlap ratio as the percentage of highlighted information that appears several times in the generated summary. Lastly, we calculate a flow percentage that evaluates how well the order of the ground-truth information is preserved, by computing the percentage of times where the highlighted text hi in the gold summary for one generated sentence si follows the highlighted text hi−1 for the previous sentence si−1 (i.e. where any token from hi occurs after a token in hi−1). Table 6 reports the scores for each metric and model, averaged over all 50 documents, along with inter-rater agreements, computed using Krippendorff's alpha coefficient.

Metric       BigBird       BigBird+Layout
Precision %  35.15 (0.81)  37.51 (0.70)
Recall %     28.07 (0.73)  33.59 (0.86)
Coherence    3.80 (0.38)   3.75 (0.62)
Fluency      4.48 (0.03)   4.34 (0.16)
Overlap %    8.77 (0.24)   7.49 (0.36)
Flow %       30.75 (0.68)  33.02 (0.71)

Table 6: Average human judgement scores obtained by comparing ground-truth abstracts and summaries generated by BigBird and BigBird+Layout from 50 documents sampled from arXiv-Lay and HAL. Inter-rater agreement is computed using Krippendorff's alpha coefficient, and enclosed between parentheses.

We find that adding layout to the models significantly improves precision and recall, results in less overlap (repetition), and is more in line with the ground truth order. Further, annotators did not encounter any hallucinated fact in the 50 generated summaries. To conclude, reported results show that human annotators strongly agree that adding layout generates better summaries, further validating our claim that layout provides vital information for summarization tasks.

5.3 Case Studies

To have a better understanding of the previous results, we focus on uncovering the cases in which layout is most helpful. To this end, we identify fea-
tures that relate to the necessity of having layout: 1) article length, as longer texts are intuitively easier to understand with layout, 2) summary length, as longer summaries are likely to cover more salient information, and 3) variance in font sizes (using the height of the bounding boxes), and, as such, the complexity of the layout. The benefit of using layout is measured as the difference in ROUGE-L scores between BigBird-Pegasus+Layout and its purely textual counterpart, on arXiv-Lay and PubMed-Lay. We compute quartiles from the distributions of article lengths, ground-truth summary lengths, and variance in the height of bounding boxes.22 Based on the aforementioned factors, the scores obtained by each model are then grouped by quartile range, and averaged over each range, see Figure 1. On arXiv-Lay, we find that layout brings most improvement when dealing with the 25% longest documents and summaries, while, for both datasets, layout is least beneficial for the shortest documents and summaries. These results corroborate our claim that layout can bring important information about long-range context. Concerning the third factor, we see, on PubMed-Lay, that layout is most helpful for documents that have the widest ranges of font sizes, showcasing the advantage of using layout to capture salient information.

22 The quartiles are provided in Appendix C.3.

6 Conclusion

We have presented LoRaLay, a set of large-scale datasets for long-range and layout-aware text summarization. LoRaLay provides the research community with 4 novel multimodal corpora covering French, Spanish, Portuguese, and Korean languages, built from scientific articles. Furthermore, it includes additional layout and visual information for existing long-range summarization datasets (arXiv and PubMed). We provide adapted architectures merging layout-aware and long-range models, and show the importance of layout information in capturing long-range dependencies.

7 Limitations

The proposed corpus is limited to a single domain, that of scientific literature. Such limitation arguably extends to the layout diversity of documents. In terms of risks, we acknowledge the presence of Personally Identifiable Information such as author names and affiliations; nonetheless, such information is already voluntarily made public by the authors themselves.

8 Acknowledgements

We thank the reviewers for their insightful comments. This work is supported by the Association Nationale de la Recherche et de la Technologie (ANRT) under CIFRE grant N2020/0916. It was partially performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011011841).

References

Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R Manmatha. 2021. Docformer: End-to-end transformer for document understanding. arXiv preprint arXiv:2106.11539.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Łukasz Borchmann, Michał Pietruszka, Tomasz Stanislawek, Dawid Jurkiewicz, Michał Turski, Karolina Szyndler, and Filip Graliński. 2021. Due: End-to-end document understanding benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

Shuyang Cao and Lu Wang. 2022. Hibrids: Attention with hierarchical biases for structure-aware long document summarization. arXiv preprint arXiv:2203.10741.

Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. 2020. Cross-lingual natural language generation via pre-training. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7570–7577.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. arXiv preprint arXiv:1804.05685.

Hoa Trang Dang. 2005. Overview of duc 2005. In Proceedings of the document understanding conference, volume 2005, pages 1–12.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, and Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. arXiv preprint arXiv:2003.02356.

Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 991–995. IEEE.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in neural information processing systems, 28.

Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. 2019. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1516–1520. IEEE.

Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. Funsd: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, pages 1–6. IEEE.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Ling Luo, Xiang Ao, Yan Song, Feiyang Pan, Min Yang, and Qing He. 2019. Reading like her: Human reading inspired extractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3033–3043.

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. 2022. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706.

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209.

Lesly Miculicich and Benjamin Han. 2022. Document summarization with text segmentation.

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. Cord: a consolidated receipt dataset for post-ocr parsing. In Workshop on Document Intelligence at NeurIPS 2019.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch.

Rafał Powalski, Łukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Michał Pietruszka, and Gabriela Pałka. 2021. Going full-tilt boogie on document understanding with text-image-layout transformer. arXiv preprint arXiv:2102.09550.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2020. Mlsum: The multilingual summarization corpus. arXiv preprint arXiv:2004.14900.

Eva Sharma, Chen Li, and Lu Wang. 2019. Bigpatent: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2020. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.

Elif Toprak and Gamze Almacioğlu. 2009. Three reading phases and their applications in the teaching of english as a foreign language in reading classes with young learners. Journal of language and Linguistic Studies, 5(1).

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu
Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha
Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou.
2021. LayoutLMv2: Multi-modal pre-training for
visually-rich document understanding. In Proceed-
ings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pages 2579–2591, Online.
Association for Computational Linguistics.

Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava
Dubey, Joshua Ainslie, Chris Alberti, Santiago On-
tanon, Philip Pham, Anirudh Ravula, Qifan Wang,
Li Yang, et al. 2020. Big bird: Transformers for
longer sequences. Advances in Neural Information
Processing Systems, 33:17283–17297.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Pe-
ter Liu. 2020. Pegasus: Pre-training with extracted
gap-sentences for abstractive summarization. In In-
ternational Conference on Machine Learning, pages
11328–11339. PMLR.
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization – Appendix
A Datasets Construction

[Figure 2: a diagram of the construction pipeline: documents are pulled from a data repository, then go through (1) PDF Extraction, (2) Filtering, (3) Text Extraction, producing word/bounding-box pairs (w1 bbox1, w2 bbox2, ...), and (4) Abstract Removal.]

Figure 2: Dataset Construction Process.

A.1 Extended Datasets – Lost Documents

Figure 3 provides details on the amount of original documents lost in the process of augmenting arXiv and PubMed with layout/visual information. We observe four types of failures, and provide numbers for each type:

• The link to the document's PDF file is not provided (Unavailable PDF);

• The PDF file is corrupted (i.e., cannot be opened) (Corrupted PDF);

• The document is not digital-born, making it impossible to parse it with PDF parsing tools (Scanned PDF);

• The document's abstract cannot be found in the PDF (Irretrievable Abstract).

Figure 3: Distribution of failure types in arXiv-Lay (top) and PubMed-Lay (bottom).

A.2 KoreaScience – Extraction Rule

Korean documents in KoreaScience are extracted by restricting search results to documents containing the word "Korean" in the publisher's name. We show that this rule does not bias the sample towards a specific research area. We compute the distribution of topics covered by all publishers, and compare it to the distribution of topics covered by publishers whose name contains the word Korean. Figure 4 shows that the distribution obtained using our rule remains roughly the same as the original.

[Figure 4: a bar chart comparing the topic distribution (in %) of publishers whose name contains "Korean" against that of all publishers.]

Figure 4: Distribution of topics covered by all publishers (red) vs distribution of topics covered by publishers whose name contains the word Korean (blue).

A.3 Samples

We provide samples of documents from each dataset in Figure 5.
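The Section A.2 extraction rule and its sanity check can be sketched as follows. The record format (publisher, topic) pairs and the function names are illustrative assumptions; the actual KoreaScience metadata schema is not specified here.

```python
from collections import Counter

def korean_publisher_subset(records):
    """The KoreaScience extraction rule: keep documents whose publisher
    name contains the word 'Korean'. `records` holds hypothetical
    (publisher, topic) pairs."""
    return [(p, t) for p, t in records if "Korean" in p]

def topic_distribution(records):
    """Topic shares in percent, used to compare the filtered subset
    against the full publisher population (Figure 4)."""
    counts = Counter(topic for _, topic in records)
    total = sum(counts.values())
    return {t: 100.0 * c / total for t, c in counts.items()}
```

Comparing topic_distribution(korean_publisher_subset(records)) with topic_distribution(records) is the check that the rule does not skew the sample towards a research area.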
A.4 Datasets Statistics

The distribution of research areas in arXiv-Lay and HAL are provided in Figures 6 and 7, respectively. Such distributions are not available for the other datasets, as we did not have access to topic information during extraction.

Figure 6: Distribution of research areas in arXiv-Lay.

Figure 7: Distribution of research areas in HAL.

B Experiments

B.1 Implementation Details

Models were implemented in Python using PyTorch (Paszke et al., 2017) and Hugging Face (Wolf et al., 2019) libraries. In all experiments, we use Adafactor (Shazeer and Stern, 2018), a stochastic optimization method based on Adam (Kingma and Ba, 2014) that reduces memory usage while retaining the empirical benefits of adaptivity. We set a learning rate warmup over the first 10% steps (except on arXiv-Lay, where it is set to 10k steps, consistently with Zaheer et al. (2020)), and use a square root decay of the learning rate. All our experiments have been run on four Nvidia V100 GPUs with 32GB each.

C Results

C.1 Detailed Results

Model                    R-1    R-2    R-L
MBART                    47.05  22.23  42.00
MBART+Layout             46.65  21.96  41.67
BigBird-MBART            49.85  25.71  45.04
BigBird-MBART+Layout     49.99  25.20  45.20

Table 8: ROUGE scores on HAL. Best results are reported in bold.

Model                    R-1    R-2    R-L
MBART                    17.33  7.70   16.94
MBART+Layout             15.43  6.69   14.98
BigBird-MBART            18.96  8.01   18.55
BigBird-MBART+Layout     20.36  9.49   19.95

Table 10: ROUGE scores on KoreaScience. The best results are reported in bold.

C.2 Human Evaluation

Using the Streamlit23 framework, we design and develop an interface to aid human evaluation of summarization models.24

23 https://streamlit.io/
24 The code is publicly available at https://anonymous.4open.science/r/loralay-eval-interface-C20D.
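The highlights collected through this interface feed the Section 5.2 metrics. A simplified token-level reading of those metrics can be sketched as below; the exact span representation the interface stores is not specified, so the data layout (per-sentence token lists) and the function name are our assumptions.

```python
from collections import Counter

def highlight_scores(summary_tokens, abstract_tokens,
                     sum_highlights, abs_highlights):
    """sum_highlights[i]: tokens highlighted in generated sentence i;
    abs_highlights[i]: the matching tokens highlighted in the gold
    abstract. Returns (precision, recall, overlap, flow) percentages,
    a simplified sketch of the Section 5.2 definitions."""
    flat_sum = [t for h in sum_highlights for t in h]
    flat_abs = [t for h in abs_highlights for t in h]
    # precision/recall: share of highlighted tokens in each text
    precision = 100.0 * len(flat_sum) / len(summary_tokens)
    recall = 100.0 * len(flat_abs) / len(abstract_tokens)
    # overlap: share of highlighted summary tokens that recur
    counts = Counter(flat_sum)
    overlap = 100.0 * sum(c for c in counts.values() if c > 1) \
        / max(len(flat_sum), 1)
    # flow: h_i respects the gold order if any of its tokens appears
    # in the abstract after some token of h_{i-1}
    def first_pos(tok):
        return abstract_tokens.index(tok) if tok in abstract_tokens else -1
    ordered = 0
    for prev, cur in zip(abs_highlights, abs_highlights[1:]):
        prev_pos = [first_pos(t) for t in prev]
        cur_pos = [first_pos(t) for t in cur]
        if prev_pos and cur_pos and max(cur_pos) > min(prev_pos):
            ordered += 1
    flow = 100.0 * ordered / max(len(abs_highlights) - 1, 1)
    return precision, recall, overlap, flow
```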
                                         arXiv / arXiv-Lay      PubMed / PubMed-Lay
Model                                    R-1    R-2    R-L      R-1    R-2    R-L
Pegasus (Zhang et al., 2020)             44.21  16.95  38.83    45.97  20.15  41.34
BigBird-Pegasus (Zaheer et al., 2020)    46.63  19.02  41.77    46.32  20.65  42.33
T5 (Raffel et al., 2019)                 42.79  15.98  37.90    42.88  17.58  39.23
LED (Beltagy et al., 2020)               45.41  18.14  40.74    45.28  19.86  41.54
LED+Layout                               45.51  18.55  40.96    45.41  19.74  41.83
MBART                                    37.64  13.29  33.49    41.19  16.04  37.47
Pegasus                                  43.81  17.27  39.07    43.52  17.96  39.75
Pegasus+Layout                           44.10  17.01  39.25    43.59  18.24  39.85
BigBird-Pegasus                          44.43  17.74  39.59    44.80  19.32  41.09
BigBird-Pegasus+Layout                   46.02  18.95  41.15    45.69  20.38  42.05

Table 7: ROUGE scores on arXiv-Lay and PubMed-Lay. Reported results obtained by Pegasus and BigBird-Pegasus on the original arXiv and PubMed are reported with a gray background. The best results obtained on arXiv-Lay and PubMed-Lay are denoted in bold.
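For reference, the ROUGE-L reported throughout Tables 7-10 is the F-measure derived from the longest common subsequence of candidate and reference tokens. A minimal sketch, without the tokenization and stemming details of the official scoring package:

```python
def rouge_l(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence (LCS) of two
    token lists; a minimal sketch of the reported metric."""
    m, n = len(candidate), len(reference)
    # dp[i][j]: LCS length of candidate[:i] and reference[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return 2 * p * r / (p + r)
```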

                         SciELO-ES              SciELO-PT
Model                    R-1    R-2    R-L      R-1    R-2    R-L
MBART                    41.04  15.65  36.55    41.18  15.53  36.42
MBART+Layout             42.27  15.73  37.47    39.45  14.17  34.37
BigBird-MBART            42.64  16.60  37.76    44.85  18.70  39.63
BigBird-MBART+Layout     45.64  19.33  40.71    45.47  20.40  40.51

Table 9: ROUGE scores on the SciELO datasets. The best results are reported in bold.
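As a complement to Section B.1, the learning-rate schedule (linear warmup over the first 10% of steps, then square-root decay) can be written framework-agnostically. The peak learning rate and the warmup fraction default are illustrative constants, not the paper's exact values.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, warmup_frac=0.10):
    """Linear warmup over the first `warmup_frac` of training, then
    inverse-square-root decay, as described in Section B.1. The
    constants are illustrative, not the paper's reported values."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # ramp linearly from peak_lr/warmup_steps up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # decay proportionally to 1/sqrt(step) after warmup
    return peak_lr * math.sqrt(warmup_steps / (step + 1))
```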

C.3 Analysis of the Impact of Layout

Table 12 lists the quartiles computed from the distributions of article lengths, summary lengths, and variation in the height of bounding boxes, for arXiv-Lay and PubMed-Lay.

Figure 8: LoRaLay evaluation interface.

                 LED        LED+Layout  Pegasus    Pegasus+Layout  BigBird-Pegasus  BigBird-Pegasus+Layout
T5               2.84/2.31  3.06/2.60   1.17/0.52  1.35/0.62       1.69/1.86        3.25/2.82
LED              –          0.22/0.29   1.67/1.79  1.49/1.69       1.15/0.45        0.41/0.51
LED+Layout       –          –           1.89/2.08  1.71/1.98       1.38/0.74        0.19/0.22
Pegasus          –          –           –          0.34/0.10       0.52/1.34        2.08/2.30
Pegasus+Layout   –          –           –          –               0.34/1.24        1.90/2.20
BigBird-Pegasus  –          –           –          –               –                1.56/0.96

Table 11: Absolute ROUGE-L score differences between each pair of models, on arXiv-Lay/PubMed-Lay.

Distribution               Q1                       Q2                       Q3
                           arXiv-Lay  PubMed-Lay    arXiv-Lay  PubMed-Lay    arXiv-Lay  PubMed-Lay
Article Length             6,226      3,513         9,142      5,557         13,190     8,036
Summary Length             119        130           159        182           202        247
σ of bounding box height   3.37       1.34          3.98       1.73          4.70       2.28

Table 12: Quartiles calculated from the distributions of article lengths, summary lengths, and variation in the height of bounding boxes, for arXiv-Lay and PubMed-Lay.
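The per-quartile analysis behind Figure 1 and these quartiles can be sketched as below. A close variant is shown: per-document ROUGE-L gains are bucketed by the quartile range of a feature and averaged per bucket; the bucket labels and the quantile method (statistics.quantiles, exclusive) are our choices, not necessarily the authors'.

```python
from statistics import quantiles, mean

def gain_by_quartile(feature_values, rougeL_layout, rougeL_text):
    """Group per-document ROUGE-L gains (layout minus text-only) by the
    quartile range of a document feature (article length, summary
    length, or std of bounding-box heights), per-document lists aligned
    by index."""
    q1, q2, q3 = quantiles(feature_values, n=4)
    gains = [l - t for l, t in zip(rougeL_layout, rougeL_text)]
    buckets = {"<Q1": [], "Q1-Q2": [], "Q2-Q3": [], ">=Q3": []}
    for v, g in zip(feature_values, gains):
        if v < q1:
            buckets["<Q1"].append(g)
        elif v < q2:
            buckets["Q1-Q2"].append(g)
        elif v < q3:
            buckets["Q2-Q3"].append(g)
        else:
            buckets[">=Q3"].append(g)
    # average the gains within each quartile range
    return {k: mean(v) if v else 0.0 for k, v in buckets.items()}
```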
[Figure 5: first pages of sample documents from each dataset, shown as images in the original paper: (a) arXiv-Lay, (b) PubMed-Lay, (c) HAL, (d) SciELO-ES, followed by SciELO-PT and KoreaScience samples.]
86020-060 – Londrina – Paraná – Brasil Universidade de São Paulo 국문요약 본 연구는 2000년부터 2018년까지를 분석의 시간적 범위로 설정하여 제조업 창업 활동이 공간적으로
guizelin.gs@gmail.com São Paulo – São Paulo – Brasil 어떠한 변화를 보여왔는지를 탐색적으로 분석하고, 향후 창업 활동의 분포 패턴 변화를 예측하는 것을 목적으로 한
다. 분석을 위해 2000년부터 2018년까지의 「전국사업체조사」 마이크로데이터 제조업 사업체 자료를 활용하였다.
한국산업연구원의 ISTANS 분류체계에서 제시하는 40대 제조업 기준에 따라 제조업을 4개의 세부 산업군으로 구
Resumo 분한 후, 수도권 행정구역 읍면동 수준에서 공간자기상관 분석 및 공간 마르코프 체인 분석을 수행하였다. 분석 결
Após a assinatura da Convenção de 1826 com a Grã-Bretanha, pela qual o go-
verno de D. Pedro I concordou, em troca do reconhecimento britânico, coibir o 과에 따르면, 고위기술산업군 및 중고위기술산업군의 창업 활동은 시간이 흐름에 따라 경기도 남부를 중심으로 집
tráfico transatlântico de africanos para o Império a partir de 1830, foram criadas 중되고 있는 것으로 나타났으며, 중저위기술산업군 및 저위기술산업군 창업 활동의 집중은 수도권 외곽으로 분산
representações consulares brasileiras na África Portuguesa com a explícita fina- 되고 있는 것으로 나타났다. 2000년부터 2018년까지의 추세를 연장하여 2036년까지의 분포 변화를 예측하였을
lidade de proteger a atuação de negreiros brasileiros nos últimos anos de legali-
dade do comércio de escravos sob a bandeira imperial. Neste sentido, o presente 때, 창업 활동이 활발히 발생하는 지역 및 그와 인접하고 있는 지역의 경우 향후 분위 상승의 가능성이 높은 것으로
artigo investiga a atuação de João Luiz Airoza, cônsul do Brasil em Moçambique, 나타나 긍정적인 공간 효과가 존재하는 것으로 확인되었다. 본 연구는 일자리 창출의 주요 원천이 되는 제조업 창
entre 1827 e 1828, na defesa do circuito negro entre o Brasil e a África Oriental.
업 활동의 분포 패턴 변화를 동태적으로 분석함으로써 창업 육성 및 일자리 창출과 관련한 지역 정책에의 시사점을
Para tanto, o texto aqui apresentado priorizou como fonte de estudo a documen-
tação consular produzida por Airoza e dirigida à antiga Secretaria de Estado dos 제공하고자 하였다.
Negócios Estrangeiros.
Palavras-chave 주제어 제조업, 창업, 탐색적공간자료 분석, 공간마르코프 체인
Relações internacionais – Relações Brasil-Moçambique – Missão consular –
Tráfico de escravos – África Oriental.
Abstract: This study aims to explore how manufacturing start-up activities from 2000 to 2018 have changed spatially
and to predict changes in distribution patterns of future start-up activities. For the analysis, the Census on Establishments
microdata from 2000 to 2018 were used, and the manufacturing industry was classified into four detailed industrial
* Todas as obras e todos os documentos utilizados na pesquisa e na elaboração do artigo são
citados nas notas e na bibliografia. * 이 논문은 2019년 대한민국 교육부와 한국연구재단의 인문사회분야 중견연구자지원사업의 지원을 받아 수행된 연구임(NRF-2019
** Doutor em História pela Faculdade de Ciências Humanas e Sociais de Franca, da Universidade S1A5A2A01045590). 본 연구는 2020년 한국지역학회 후기학술대회에서 우수논문상을 수상한 연구임.
Estadual Paulista Júlio de Mesquita Filho (Unesp). Pós-doutorando em História pela Faculdade de ** 연세대학교 대학원 도시공학과 석박사통합과정(주저자, E-mail: changhyunsong@yonsei.ac.kr)
Filosofia, Letras e Ciências Humanas, da Universidade de São Paulo (FFLCH/USP). Bolsista pós-
*** 연세대학교 대학원 도시공학과 석사과정(공동저자, E-mail: soonbeomahn@yonsei.ac.kr)
doc processo nº 2018/07798-1, Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP).
**** 연세대학교 도시공학과 교수(교신저자, E-mail: uplim@yonsei.ac.kr)

(e) SciELO-PT (f) KoreaScience

Figure 5: Samples from each dataset.