Computer Speech and Language 75 (2022) 101381

Computer Speech & Language


journal homepage: www.elsevier.com/locate/csl

Cross-Lingual Text Reuse Detection at sentence level for English–Urdu language pair

Iqra Muneer a,b,∗, Rao Muhammad Adeel Nawab a
a COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
b University of Engineering & Technology Lahore, Narowal Campus, Pakistan

ARTICLE INFO

Keywords:
Cross-Lingual Text Reuse
Cross-Lingual Text Reuse Detection
English–Urdu language pair
Cross-lingual Sentence Transformer
Translation plus Mono-Lingual Analysis

ABSTRACT

In recent years, the problem of Cross-Lingual Text Reuse Detection (X-TRD) has gained the interest of researchers due to the availability of large digital repositories and automatic translation systems. These systems are readily available and openly accessible, which makes it easier to reuse text across languages and harder to detect. In previous studies, different corpora and techniques have been developed for X-TRD at the sentence/passage and document levels for the English–Urdu language pair. However, there is a lack of large benchmark corpora and standard techniques for X-TRD for the English–Urdu language pair at the sentence level. To overcome this limitation, this study presents a large benchmark sentential cross-lingual (English–Urdu) corpus of 21,669 sentence pairs with simulated cases of X-TR, which are manually annotated at three levels of rewrite (Wholly Derived (WD) = 7,655, Partially Derived (PD) = 6,461, and Non Derived (ND) = 7,553). As a second major contribution, we have applied various state-of-the-art Cross-Lingual Sentence Transformers (CLST) and Translation plus Mono-lingual Analysis (T+MA) techniques, including N-gram Overlap (lexical), WordNet-based techniques (semantic), mono-lingual word-embedding-based techniques, and Kullback–Leibler Distance (KLD) (probabilistic), on our proposed sentential corpus for X-TRD. For the binary classification task, the best results (𝐹1 = 0.94) are obtained using a combination of all CLST and T+MA techniques and a combination of all T+MA techniques, whereas for the ternary classification task, the best results (𝐹1 = 0.84) are obtained using a combination of all CLST and T+MA techniques. The corpus will be made publicly available to foster and promote research on X-TRD in an under-resourced language such as Urdu.

1. Introduction

Cross-Lingual Text Reuse (X-TR) is the process of borrowing text from an already existing text while changing the language of the source text. Freely accessible digital repositories (e.g., Wikipedia1 ) and efficient machine translation systems (e.g., Google Translator2 and Bing Translator3 ) have contributed towards making X-TR a very common practice across languages. X-TRD has numerous potential applications, for example, cross-lingual information retrieval, Cross-Lingual Plagiarism Detection (CLPD), and cross-lingual question answering (Ferrero et al., 2017b).

∗ Corresponding author at: COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan.
E-mail addresses: fa18-pcs-002@cuilahore.edu.pk (I. Muneer), adeelnawab@cuilahore.edu.pk (R.M.A. Nawab).
1 https://www.wikipedia.org: Last Visited: 20-March-2021.
2 https://translate.google.com: Last Visited: 20-March-2021.
3 https://www.bing.com/translator: Last Visited: 20-March-2021.

https://doi.org/10.1016/j.csl.2022.101381
Received 30 May 2021; Received in revised form 12 November 2021; Accepted 20 March 2022
Available online 28 March 2022
0885-2308/© 2022 Elsevier Ltd. All rights reserved.
I. Muneer and R.M.A. Nawab Computer Speech & Language 75 (2022) 101381

The task of Cross-Lingual Text Reuse Detection can be broadly divided into two categories: (1) Cross-Lingual Local Text Reuse Detection (X-LTRD), and (2) Cross-Lingual Global Text Reuse Detection (X-GTRD) (Sameen et al., 2017). In the first case, words, phrases, sentences, or passages are derived from the source language (L1) to create a new text in the destination language (L2), whereas in the second case, the source document(s) is used to create a new document in the destination language (L2).
The act of text borrowing in X-TR can be performed either manually or automatically using machine translation tools. Based on the technique used, the task of X-TR can be further categorized as (1) artificial cases of X-TR, when new text in a different language is created using text rewriting tools and automatic translation tools; (2) simulated cases of X-TR, when new text in the language (L2) is manually created by asking humans to rewrite source text(s) from the language (L1); and (3) real cases of X-TR, which are manually created by journalists who create newspaper stories in the language (L2) by reusing news agency text(s) from the language (L1).
The main focus of this study is to generate a large X-LTRD benchmark corpus for the English–Urdu language pair based on
simulated cases of X-TR.
X-TRD and CLPD have been of interest to researchers for a variety of language pairs. The majority of these studies include the English language paired with another language, e.g., English–Russian (Bakhteev et al., 2019), English–Indonesian (Alfikri and Purwarianti, 2012), English–German (Franco-Salvador et al., 2016), English–Persian (Asghari et al., 2015; Hadgu, 2018), English–Hindi (Kothwal and Varma, 2013), English–Spanish (Potthast et al., 2011b), multi-lingual (English, German, Catalan, Slovene, Spanish, English–Turkish) (Štajner and Mladenić, 2019), English–Arabic (Aljohani and Mohd, 2014), and English–Spanish and English–Turkish (Li et al., 2018). The English–Urdu language pair for X-TRD has also been previously explored at the sentence/passage level (Muneer et al., 2019) and the document level (Sharjeel, 2020). These existing efforts (Muneer et al., 2019; Sharjeel, 2020) are based on real cases of X-TRD, and the developed corpora are small in size. This research differs from the previous efforts and focuses on developing data and techniques for simulated cases of X-TRD at the sentence level for the English–Urdu language pair, which has not been previously explored. The two main contributions of this research are: (1) creating a large X-TRD corpus at the sentence level based on simulated cases, and (2) applying, evaluating, and comparing various cross-lingual text reuse detection techniques, with and without T+MA, on our proposed corpus.
The major contribution of this study is a large novel gold standard X-TRD corpus containing 21,669 (WD = 7655, PD = 6461, and
ND = 7553) X-TR pairs. As a secondary contribution, we have applied the T+MA based techniques and the Cross-Lingual Sentence
Transformers based techniques to detect X-TR in our proposed corpus. The reason for selecting T+MA based techniques is that
in the previous studies, they have presented promising results for the text reuse and plagiarism detection tasks (Barrón-Cedeno
et al., 2008, 2013; Muneer et al., 2019; Sharjeel, 2020; Potthast et al., 2011b,a). The reason for selecting CLST based techniques
is that in the previous studies, they have presented promising results on the tasks similar to text reuse detection, e.g., paraphrase
detection (Reimers and Gurevych, 2019, 2020). As far as we are aware, both the T+MA and the Transformer-based techniques used in this study have not been previously reported for this task. Our extensive experimentation showed that a combination of T+MA and CLST based techniques is the most effective for X-TRD on our proposed corpus.
This study holds both theoretical and practical significance. We believe that our proposed corpus will be helpful in (1) fostering research in an under-resourced language, i.e., Urdu, (2) identifying and understanding what strategies and edit operations people use when they reuse text across languages, (3) developing a bilingual dictionary for the English–Urdu language pair, (4) making a direct comparison of existing techniques for X-TRD for the English–Urdu language pair, and (5) developing, evaluating, and comparing new techniques for X-TRD for the English–Urdu language pair.
The rest of this paper is organized as follows: Section 2 discusses existing corpora and techniques for X-TRD. Section 3 presents the
corpus generation process used to create the cross-lingual corpus. Section 4 describes the proposed techniques for X-TRD. Section 5
describes the experimental setup. Section 6 presents results and their analysis. Finally, Section 7 concludes the paper with future
research directions.

2. Related work

In this section, we will present existing corpora and techniques for cross-lingual text reuse detection.

2.1. Cross-lingual corpora

In the literature, many efforts have been made to develop techniques and corpora for CLPD and X-TRD. One notable effort in this regard is the series of three PAN International Competitions on CLPD for the English–German and English–Spanish language pairs (Potthast et al., 2011b).4 The result of these International Competitions is a set of three large benchmark corpora (the PAN-PC-10 corpus Potthast et al., 2011b, the PAN-PC-11 corpus Potthast et al., 2011b, and the PAN-PC-12 corpus Potthast et al., 2011a) for CLPD in the English–German and English–Spanish language pairs. The PAN-PC corpora were generated using simulated and artificial cross-lingual cases of text reuse. In all three PAN-PC corpora, the source text is in English and the derived text is in either Spanish or German.5

4 https://pan.webis.de Last visited: 20-March-2021.


5 PAN-PC corpora are publicly available to download https://www.uni-weimar.de/en/media/structure/ Last visited: 20-March-2021.


Bakhteev et al. developed a document-level CLPD system for the English–Russian language pair (Bakhteev et al., 2019). The training dataset for the proposed system consists of 30 million parallel pairs from Russian and English Wikipedia articles, based on artificial cases of CLP. The best performance (𝐹1 = 0.87) was obtained with the Translation plus Mono-lingual Analysis technique.
Sharjeel proposed a document-level benchmark (called the TREU corpus) for the English–Urdu language pair for measuring X-TRD (Sharjeel, 2020). It contains a total of 2257 X-TR document pairs based on real cases. The benchmark is manually annotated into three categories (Wholly Derived = 672, Partially Derived = 888, Non Derived = 697), with the source in English and the derived text in Urdu. The author compared T+MA-based techniques including Greedy String Tiling, N-gram Overlap, Longest Common Sub-sequence, mono-lingual word embedding techniques, and mono-lingual sentence embedding. The best performance was achieved using N-gram Overlap (unigram) (𝐹1 = 0.78) and the combination of all techniques (𝐹1 = 0.66) for the binary and ternary classification tasks respectively.
Muneer et al. proposed a sentence/passage-level benchmark (called the CLEU corpus) for the English–Urdu language pair for measuring X-TRD (Muneer et al., 2019). It contains a total of 3235 X-TR text pairs based on real cases. The benchmark is manually annotated into three categories (Near Copy = 751, Paraphrased Copy = 1751, Independently Written = 733), with the source in English and the derived text in Urdu. To develop and evaluate X-TRD systems for the English–Urdu language pair, three sets of techniques (N-gram Overlap, Greedy String Tiling, and Longest Common Sub-sequence) using T+MA were applied to the proposed CLEU sentence/passage corpus. The best performance was obtained using N-gram Overlap (unigram) (𝐹1 = 0.732) and Greedy String Tiling (GST-mml1) (𝐹1 = 0.552) for the binary and ternary classification tasks respectively.
Recently, Haneef et al. proposed a document-level benchmark for the English–Urdu language pair for measuring CLPD (Haneef et al., 2019). It contains a total of 2395 CLPD document pairs, with the source in English and the derived text in Urdu, based on simulated cases of CLPD. The benchmark comprises 540 automatic translations, 539 artificially paraphrased, 508 manually paraphrased, and 808 non-plagiarized documents. The authors compared techniques including n-gram overlap and longest common sub-sequence for the development of the CLPD system. The best results were obtained using n-gram overlap (unigrams), with mean similarity scores of 1.00, 0.68, 0.52, and 0.22 for the automatic translation, artificially paraphrased, manually paraphrased, and non-plagiarized documents respectively.
Barrón-Cedeno et al. proposed a document-level benchmark for the English–Hindi language pair for measuring X-TRD (called the Cross-Language Indian Text Reuse (CLITR) corpus) (Barrón-Cedeno et al., 2013). It contains a total of 5032 source documents (in English) and 388 derived documents (in Hindi), based on artificial and simulated cases of X-TR. The corpus is annotated at four levels of rewrite (exact copy = 79, light revision = 99, heavy revision = 98, original = 112). For the development and evaluation of X-TRD systems for the English–Hindi language pair, the CLITR corpus was presented in an International Competition.6 The best system in the competition was based on key-phrase extraction (Kothwal and Varma, 2013) and obtained 𝐹1 = 0.79. The corpus is publicly available for download.7
The JRC-EU Corpus and the Fairy Tale Corpus are the most well-known cross-lingual corpora for the CLPD task (Kent and Salim, 2010). In another effort, Ceska et al. developed the JRC-EU and Fairy Tale corpora for the CLPD task (Ceska et al., 2008). The JRC-EU corpus contains 400 legislative reports of the European Union, with 200 English source documents and 200 Czech documents (Potthast et al., 2011a). The Fairy Tale corpus consists of 54 documents (source documents in English = 27, and suspicious documents in Czech = 27). Ceska et al. applied the MLPlag technique based on the EuroWordNet thesaurus and obtained 𝐹1 = 72.53% and 𝐹1 = 100% on the JRC-EU and Fairy Tale corpora respectively. Neither corpus is available for download.

2.2. Translation plus mono-lingual analysis based techniques

The most common and intuitive technique for detecting X-TR and CLPD involves using web-based translation services to normalize texts written in different languages into a common language. This technique is known as translation plus mono-lingual analysis (T+MA) and has performed exceptionally well in past studies when compared with other state-of-the-art X-TR techniques (Barrón-Cedeno et al., 2010). In the literature, the most popular and widely used technique for X-TRD is T+MA.8 Additionally, in past studies of X-TRD, the T+MA technique has been shown to be more effective and to provide more reliable performance than other techniques. For example, in all three PAN International Competitions on X-TRD (English–Spanish and English–German language pairs), the best systems were based on the T+MA technique (Potthast et al., 2011b,a). Other studies have also shown that T+MA is the most suitable technique for X-TRD (Barrón-Cedeno et al., 2008, 2013; Muneer et al., 2019; Sharjeel, 2020).
In the literature, T+MA has been explored using n-grams for X-TRD tasks (Muneer et al., 2019; Sharjeel, 2020; Sameen et al., 2017). The word n-gram overlap technique estimates the number of common n-grams between the source and derived texts. It works by breaking the text into fixed-length n-grams, counting the n-grams shared by the text pair, and dividing

6 FIRE 2013 competition: https://dl.acm.org/doi/proceedings/10.1145/2701336 Last visited: 20-March-2021.


7 https://www.uni-weimar.de/medien/webis/events/panfire-11/panfire11-web/#corpus Last visited: 20-March-2021.
8 Given a source-derived text pair, the T+MA technique first translates source text into the language of the derived text or vice versa. After translation, both

the source and derived texts are in the same language, and mono-lingual text reuse detection techniques are used to detect reuse between source-derived text
pair.


the value by the number of n-grams in one or both texts. This technique has proven to be very effective for detecting plagiarism (Barrón-Cedeno et al., 2013, 2010), text reuse (Chiu et al., 2010; Sameen et al., 2017), cross-lingual text reuse (Muneer et al., 2019), and near-duplicate detection (Stein et al., 2007).
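As an illustration, the word n-gram overlap score can be sketched as follows. This is a minimal sketch of the general technique, not the authors' exact implementation: it uses a containment-style normalization by the derived text (one of several possible normalizations mentioned above) and assumes both texts are already in the same language, i.e., after the translation step of T+MA.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams of length n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(source, derived, n=1):
    """Shared n-grams divided by the number of n-grams in the derived text."""
    s = ngrams(source.lower().split(), n)
    d = ngrams(derived.lower().split(), n)
    return len(s & d) / len(d) if d else 0.0

# A near-copy pair scores high; an unrelated pair scores near zero.
print(ngram_overlap("the cat sat on the mat", "the cat sat on a mat"))  # 5 of 6 unigrams shared
```

Higher-order n-grams (bigrams, trigrams) reward preserved word order, so the same pair typically scores lower as n grows.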
WordNet is a large lexical database, similar to a thesaurus. In WordNet, words with similar meanings are assigned to the same group (University, 2010). A word can have one or multiple synsets in WordNet (Miller, 1995). A synset is a group of synonymous words. If a word belongs to multiple synsets, it has multiple meanings. There are also different relationships between synsets, including synonymy, hypernymy, antonymy, hyponymy, meronymy, troponymy, and entailment. WordNet is used in a variety of natural language processing applications, including semantic similarity recognition (Kocoń and Maziarz, 2021), synset expansion on translation graphs for automatic WordNet construction (Ercan and Haziyev, 2019), information filtering (Mock and Vemuri, 1997), and CLPD (Gang et al., 2018).
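To make the synset idea concrete, the sketch below flags two words as semantically related when they share a synset. The tiny SYNSETS dictionary is a hand-made stand-in for WordNet with illustrative (not actual) synset identifiers; in practice one would query the real database, for example through NLTK's WordNet interface.

```python
# Hand-made stand-in for WordNet: each word maps to the synsets it belongs to.
# The identifiers below are illustrative, not real WordNet entries.
SYNSETS = {
    "car":   {"car.n.01"},
    "auto":  {"car.n.01"},                 # synonym of "car": shares a synset
    "bank":  {"bank.n.01", "bank.n.02"},   # polysemous: riverbank vs. financial
    "shore": {"bank.n.01"},
}

def share_synset(word1, word2):
    """Two words are treated as semantically related if they share any synset."""
    return bool(SYNSETS.get(word1, set()) & SYNSETS.get(word2, set()))
```

A WordNet-based similarity technique then scores a text pair by how many of its word pairs are related in this sense, rather than by exact string match.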
Word embedding represents words based on their context and the words around them (Ferrero et al., 2017b). Words are represented in a continuous space, and words with the same context should be close to each other in this multi-dimensional space. Word embedding models can be used to measure semantic textual similarity using the distributed representation of words. Common and efficient (and effective) word embedding architectures include the word2vec CBOW and skip-gram models (Mikolov et al., 2013), GloVe (Pennington et al., 2014; Ghannay et al., 2016), and Canonical Correlation Analysis (CCA) (Upadhyay et al., 2016). These models map words to vectors of real numbers, following the logic that words with similar meanings will be represented by similar vectors in a common vector space. Word embedding was initially proposed for mono-lingual similarity analysis and has lately been extended to cross-lingual word similarity analysis by using a common representation space for more than one language (Upadhyay et al., 2016). The mono-lingual word embedding technique has been used in a range of applications including word sense disambiguation (Pelevina et al., 2016), recommendation services (Ozsoy, 2016), short text similarity (Kenter and De Rijke, 2015), plagiarism detection (Ferrero et al., 2017b; Khorsi et al., 2018), named entity recognition (Nozza et al., 2021), semantic textual similarity (Tien et al., 2019), query performance prediction (Roy et al., 2019), and analyzing survey responses or verbatim comments (Healy, 2019).
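The paragraph above can be sketched as a small similarity function: sentence vectors are obtained by averaging word vectors, and similarity is the cosine between them. The three-dimensional vectors below are toy values chosen for illustration; a real system would load pre-trained word2vec or GloVe vectors instead.

```python
import numpy as np

def sentence_vector(tokens, emb):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [emb[t] for t in tokens if t in emb]
    dim = len(next(iter(emb.values())))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors: semantically close words get close vectors.
emb = {
    "king":  np.array([0.90, 0.10, 0.00]),
    "queen": np.array([0.85, 0.20, 0.00]),
    "apple": np.array([0.00, 0.10, 0.95]),
}
```

With such vectors, cosine("king", "queen") is close to 1 while cosine("king", "apple") is close to 0, which is the property the similarity-based reuse detectors exploit.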
Kullback–Leibler distance is a probabilistic technique. In previous studies, Kullback–Leibler distance has been used to reduce the search space for document clustering (Barrón-Cedeno et al., 2008) and for plagiarism detection, a task similar to text reuse detection.
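A minimal sketch of the Kullback–Leibler distance over word distributions is shown below. It compares smoothed unigram distributions of two texts over their joint vocabulary; a score near zero means very similar distributions. This is an illustration of the general measure, not the exact formulation used in the cited studies.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, eps=1e-6):
    """Smoothed unigram probability distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_distance(source_tokens, derived_tokens):
    """KL(P_source || P_derived): 0 for identical distributions, larger otherwise."""
    vocab = set(source_tokens) | set(derived_tokens)
    p = unigram_dist(source_tokens, vocab)
    q = unigram_dist(derived_tokens, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
```

Note that KL divergence is asymmetric; the smoothing constant eps avoids division by zero for words that appear in only one of the two texts.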

2.3. Cross-lingual techniques

Štajner and Mladenić explored the task of cross-lingual similarity estimation for the English, German, Catalan, Slovene, Spanish, and Croatian versions of Wikipedia (Štajner and Mladenić, 2019). The authors compared cross-lingual latent semantic indexing, low-rank canonical correlation analysis, and a nonlinear bilingual translation model using mono-lingual word embedding and kernel approximation. The best results (precision = 0.89) were obtained using word embedding and kernel approximation for cross-lingual similarity detection.
Ferrero et al. explored different similarity techniques for cross-lingual semantic similarity detection (Ferrero et al., 2017a). They applied dictionary-based, parallel-corpora-based, and machine-translation-based techniques on a proposed multilingual corpus (English, Spanish, and French) (Ferrero et al., 2016). The best results (𝐹1 = 0.57) were obtained using the CL-CNG technique.
In another study, Ferrero et al. applied different state-of-the-art techniques for CLPD, including Cross-Language Alignment-based Similarity Analysis (CL-ASA), Cross-Language Explicit Semantic Analysis (CL-ESA), T+MA, cross-lingual conceptual-thesaurus-based similarity using word embedding, and cross-language word-embedding-based syntax similarity (Ferrero et al., 2017b). All these techniques were applied and evaluated on the X-TR corpora for the English, Spanish, and French languages (Ferrero et al., 2016). The best results (𝐹1 = 0.89) were obtained using a combination of techniques with the cosine similarity measure.
Franco-Salvador et al. explored different techniques including Cross-Lingual Knowledge Graph Analysis (CL-KGA) and the external-data composition neural network (XCNN) for CLPD for the Spanish–English (ES–EN) and German–English (DE–EN) language pairs (Franco-Salvador et al., 2016). The best results were obtained using the XCNN, with plagdet = 0.644, precision = 0.556, granularity = 1.00, and recall = 0.95.

2.3.1. Cross-lingual sentence transformers based techniques


In recent years, various language representation models have been proposed, including BERT (Devlin et al., 2019; Peters et al., 2018), Universal Sentence Encoder (Cer et al., 2018), InferSent (Conneau et al., 2017), and LASER (Language-Agnostic Sentence Representations) (Feng et al., 2020; Liu et al., 2019). Recently, Reimers and Gurevych (2019) proposed Sentence-BERT (SBERT), a modification of the BERT (Bidirectional Encoder Representations from Transformers) neural network that uses Siamese and triplet network architectures to derive semantically meaningful sentence embeddings. SBERT has proven to be very effective for a range of NLP applications including bitext mining (Capstick et al., 2000), text summarization (Ermakova et al., 2019; Nasar et al., 2019), sentiment analysis (Behera et al., 2021), and question answering (Moens and Saint-Dizier, 2011).
Reimers et al.9 have developed single and multi-lingual sentence transformers for sentence embedding (Reimers and Gurevych, 2020) for a variety of Natural Language Processing tasks. Initially, they presented Sentence-BERT (SBERT), the modified version of

9 https://www.sbert.net/ Last visited: 20-March-2021.


the BERT network using Siamese and triplet networks, which can derive semantically meaningful sentence embeddings. SBERT was fine-tuned on NLI data,10 and evaluated on various common benchmarks, where the quality of its embeddings improved upon other state-of-the-art sentence embedding techniques such as InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). These frameworks have been extended to more than 100 languages, including English, Urdu, Dutch, Chinese, Portuguese, Italian, Spanish, Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Icelandic, Indonesian, Irish, Japanese, Javanese, Korean, etc. The frameworks have been used in a variety of applications including document dating (Massidda, 2020), objective-based hierarchical clustering (Naumov et al., 2020), generating a missing part for story completion (Mori et al., 2020), identifying similar patent documents (Navrozidis and Jansson, 2020), and semantic textual similarity (Guo et al., 2020).
The developed frameworks are categorized into general sentence transformers and special transformers. Special transformers include paraphrase identification, semantic textual similarity, duplicate question detection, information retrieval, and bitext mining (Reimers and Gurevych, 2020). Multiple models can give the best performance on a specific task; however, there does not exist a single model that performs best on all tasks.
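In an X-TRD setting, a cross-lingual sentence transformer maps the English and Urdu sentences of a pair into a shared embedding space, and their cosine similarity serves as the reuse score. The sketch below stubs the encoder with fixed toy vectors so it runs standalone; in practice the vectors would come from a multilingual model loaded through the sentence-transformers library (the model name in the comment is one real example, but the exact model and scoring pipeline are implementation choices, not details taken from this paper).

```python
import numpy as np

# In a real system the encoder would be a multilingual sentence transformer, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
#   vector = model.encode(sentence)
# Here fixed toy vectors stand in for the encoder so the example is self-contained.
STUB_EMBEDDINGS = {
    "How do I learn Python?":           np.array([0.80, 0.60, 0.00]),
    "مجھے پائتھون کیسے سیکھنی چاہیے؟":   np.array([0.78, 0.62, 0.05]),  # Urdu paraphrase
    "What is the capital of France?":   np.array([0.00, 0.20, 0.98]),
}

def reuse_score(sent1, sent2, encode=STUB_EMBEDDINGS.get):
    """Cosine similarity of the two sentence embeddings, language-independent."""
    u, v = encode(sent1), encode(sent2)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because the English sentence and its Urdu paraphrase are mapped to nearby vectors, the cross-lingual pair scores high without any translation step, which is the key difference from T+MA.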
In summary, the tasks of CLPD and X-TRD have been explored for various language pairs involving English, including English–Russian (Bakhteev et al., 2019), English–Spanish (Potthast et al., 2011a; Li et al., 2018), English–Hindi (Kothwal and Varma, 2013), English–Czech (Ceska et al., 2008), English–German (Franco-Salvador et al., 2016), and English–Urdu (Muneer et al., 2019; Sharjeel, 2020). The existing corpora contain artificial, simulated, and real cases of X-TR and CLP at the sentence, passage, and document levels. The problem of X-TRD has been explored at the sentence/passage and document levels for the English–Urdu language pair. However, the problem of X-TR has not been explored with large benchmarks based on simulated and artificial cases for the English–Urdu language pair.
The major limitations of the three existing corpora developed for CLPD (Haneef et al., 2019) and X-TRD (Muneer et al., 2019; Sharjeel, 2020) are as follows. First, the size of these corpora is small. Second, the Haneef et al. (2019) and Sharjeel (2020) corpora have been mainly developed to measure X-TRD at the document level. The Muneer et al. (2019) corpus contains 3235 instances at the sentence/passage level (out of 3235 instances, 1000 instances are at the passage level and 2235 instances are at the sentence level). Therefore, it is not possible to use these three existing corpora to accurately train and test X-TRD systems at the sentence level. Third, the Haneef et al. (2019) corpus contains artificial and simulated cases of CLP, whereas Muneer et al. (2019) and Sharjeel (2020) contain real cases of X-TR from the domain of journalism. However, all these existing corpora lack simulated cases of X-TR at the sentence level, which reflect how a common person reuses text across languages. To overcome the limitations of the existing corpora, this study presents a large benchmark sentential cross-lingual (English–Urdu) corpus containing 21,669 sentence pairs with simulated cases of X-TR, manually annotated at three levels of rewrite (Wholly Derived (WD) = 7655, Partially Derived (PD) = 6461, and Non Derived (ND) = 7553).
In addition, we have applied various Cross-Lingual Sentence Transformer (CLST) based techniques and T+MA based techniques on our proposed corpus for X-TRD, including N-gram overlap (lexical similarity), WordNet-based techniques (semantic similarity), mono-lingual word-embedding-based techniques, and Kullback–Leibler Distance (KLD) (probabilistic). To our knowledge, the proposed large benchmark corpus based on simulated cases for X-TRD, as well as the feature fusion technique ‘Comb-All’ (a combination of all techniques) for the English–Urdu language pair, have not been previously reported.
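The ‘Comb-All’ feature fusion idea can be sketched as follows: each technique contributes one similarity score per sentence pair, the scores are concatenated into a feature vector, and a supervised classifier is trained on the labeled pairs. The classifier below is a deliberately minimal nearest-centroid stand-in, and the scores are made-up values; the paper's actual learner and feature set are not specified in this excerpt.

```python
import numpy as np

def fuse(scores):
    """Concatenate per-technique similarity scores into one feature vector."""
    return np.array([scores[k] for k in sorted(scores)])

class NearestCentroid:
    """Minimal stand-in classifier for illustration only."""
    def fit(self, X, y):
        self.centroids = {label: np.mean([x for x, t in zip(X, y) if t == label], axis=0)
                          for label in set(y)}
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda c: np.linalg.norm(x - self.centroids[c]))

# Illustrative training pairs: high scores for Wholly Derived, low for Non Derived.
X_train = [fuse({"ngram": 0.92, "clst": 0.95}), fuse({"ngram": 0.08, "clst": 0.15})]
y_train = ["WD", "ND"]
clf = NearestCentroid().fit(X_train, y_train)
```

The same scheme extends to the ternary task by adding Partially Derived training pairs; fusing complementary lexical, semantic, and embedding-based signals is what lets the combined system outperform any single technique.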

3. Corpus generation process

In this section, we discuss the corpus generation process which includes the data collection, annotation process (annotation
guidelines, annotations, and Inter-annotator Agreement), corpus characteristics, and examples from the proposed cross-lingual
sentence corpus. Below we describe the proposed corpus generation process in detail.

3.1. Data collection

To create our proposed Cross-Lingual English–Urdu Sentence corpus (CLEU-Sen), we used a subset of data from the Quora corpus11 (Chen et al., 2018). In January 2017, Quora released a public dataset which comprises 404,351 question pairs (Imtiaz
et al., 2020). The question pairs are collected from different domains including technology, philosophy, politics, entertainment, and
culture. In previous studies, this corpus has been used for various tasks including paraphrase identification (Chandra and Stefanus,
2020; Alzubi et al., 2020; Tomar et al., 2017), duplicate pairs recognition (Abishek et al., 2019; Viswanathan et al., 2019), duplicate
question pairs identification (Godbole et al., 2018), semantic textual similarity (Shajalal and Aono, 2020) dialog systems (Haponchyk
et al., 2018), sentence embedding (Reimers and Gurevych, 2019), Pointwise Paraphrase Appraisal (Chen et al., 2020), and paraphrase
generation (Qian et al., 2019; Kazemnejad et al., 2020; Hegde and Patil, 2020).
For the development of the Cross-Lingual English–Urdu Sentence (CLEU-Sen) corpus, we first extracted 7992 sentences from the Quora corpus12 . The subset of text pairs extracted from the Quora Question Pairs corpus is in the English language, i.e., mono-lingual. The cross-lingual sentence corpus was generated by a linguistic expert (proficient in both the English and Urdu languages) by

10 https://nlp.stanford.edu/projects/snli/ Last visited: 20-March-2021.


11 Quora Question Pairs https://www.kaggle.com/c/quora-question-pairs/data Last Visited: 20-March-2021.
12 The extracted sentences in the English language can be downloaded from the following link: https://drive.google.com/drive/folders/1dWri3-DtwjjpeRyj9E57J9rm4pXcEw-W?usp=sharing


translating English texts into Urdu using Google Translator.13 After automatic translation, the expert corrected the translated text manually. In the next step, in addition to the manual correction of the Urdu text, the linguistic expert created different possible paraphrases of the Urdu texts. After that, we paired the English and Urdu texts to make English–Urdu sentence pairs for our cross-lingual English–Urdu corpus at the sentence level. After pairing the English–Urdu texts, we obtained 21,669 cross-lingual English–Urdu sentence pairs.

3.2. Annotation process

The annotation process is divided into three main steps: (1) preparation of annotation guidelines, (2) annotations, and (3)
calculation of Inter-Annotator Agreement.

3.2.1. Annotation guidelines


Our main goal is to create a cross-lingual English–Urdu sentence corpus with three levels of rewrite: (1) Wholly Derived, when both texts are a near or exact copy of each other; (2) Partially Derived, when both texts are a paraphrase of one another; and (3) Non Derived, when both texts are entirely different or unrelated. To achieve this objective, each cross-lingual text pair was manually classified into one of the three categories: (a) Wholly Derived (WD), (b) Partially Derived (PD), or (c) Non Derived (ND), depending upon the relationship between them. To classify a cross-lingual sentence pair into one of the three categories, the following guidelines were prepared:

Wholly Derived If the derived text (Urdu) is an exact translation of the source text (English), then that cross-lingual sentence pair will be annotated as Wholly Derived (WD). In an exact translation, the most common meanings of the words are used, and the order of information is preserved in both the source and derived texts.

Partially Derived If the derived text (Urdu) is a paraphrase of the source text (English), then that cross-lingual sentence pair will be annotated as Partially Derived (PD). A cross-lingual text pair was tagged as PD if the contents of both texts were semantically the same, i.e., describing the same event (information), but the derived text was not a mere translation of the source text. Rather, the source text was paraphrased using different text editing operations including (but not limited to) word reordering, merging or splitting of phrases, insertions or deletions of text, replacing words or phrases with appropriate synonyms, and expansion or compression of text.

Non Derived If the content of the source (English) and derived (Urdu) texts is unrelated, then that cross-lingual sentence pair will be annotated as Non Derived (ND).

3.2.2. Annotations
Annotation guidelines prepared in the previous step were used to manually annotate the cross-lingual sentence corpus.
Annotations were carried out by three annotators A, B, and C. Annotator A is a native Urdu speaker, a post-graduate Natural
Language Processing student, and a Ph.D. scholar in the field of X-TR. The other annotators were undergraduate Natural Language
Processing students and native Urdu speakers with a high degree of proficiency in both English and Urdu. Furthermore, they
were trained in the X-TR process and in X-TR edit and rewriting operations, with the help of tutorials and a state-of-the-art
corpus, by a domain expert. The fundamental goal of this training was to familiarize them with the various degrees of X-TR and
with cross-lingual sentence annotation.
The cross-lingual (English–Urdu) sentence pairs were manually annotated by the annotators A, B, and C.14 The proposed corpus
was annotated in the following steps. In the first step, the first two annotators annotated a subset of 500 cross-lingual sentence
pairs. The agreed and conflicting pairs among these first 500 cross-lingual sentence pairs were discussed by the annotators, and
the annotation guidelines were revised (where needed). The revised annotation guidelines were then used to annotate the full
corpus, and the inter-annotator agreement was computed for the entire corpus. The conflicting pairs of CLEU-Sen were annotated
by Annotator C.

3.2.3. Inter-Annotator Agreement


Table 1 shows the detailed statistics of the Inter-Annotator Agreement (IAA). The Inter-Annotator Agreement on the corpus was
84.0%, and the Weighted Kappa Co-efficient (Cohen, 1968) was 81.4%. As can be noted, both the Inter-Annotator Agreement and
the Kappa Co-efficient are high. This highlights the fact that the annotation guidelines were well defined, which helped annotators
distinguish between the various levels of X-TR in the proposed corpus. In addition, it shows that the annotators were well trained
and have expertise in the field of X-TR.
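The Weighted Kappa Co-efficient reported above can be sketched as follows. This is a minimal illustration with linear disagreement weights; the function and variable names are ours, and the toy labels are hypothetical, not drawn from the corpus.

```python
from collections import Counter

def weighted_kappa(ann_a, ann_b, categories):
    """Cohen's weighted kappa with linear disagreement weights."""
    n = len(ann_a)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter(zip(ann_a, ann_b))    # joint label counts
    pa, pb = Counter(ann_a), Counter(ann_b)  # marginal label counts
    obs_dis = 0.0  # weighted observed disagreement
    exp_dis = 0.0  # weighted expected (chance) disagreement
    for ci in categories:
        for cj in categories:
            w = abs(index[ci] - index[cj])  # linear weight: 0 on the diagonal
            obs_dis += w * observed[(ci, cj)] / n
            exp_dis += w * (pa[ci] / n) * (pb[cj] / n)
    return 1.0 - obs_dis / exp_dis

# Toy example: identical annotations give perfect agreement.
a = ["WD", "PD", "ND", "PD"]
print(weighted_kappa(a, a, ["WD", "PD", "ND"]))  # 1.0
```

With linear weights, a WD/ND disagreement is penalized twice as heavily as a WD/PD one, which matches the intuition that adjacent rewrite levels are easier to confuse.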
In the proposed corpus, there are a total of 21,669 instances (18,981 agreed and 2688 disagreed between the first two
annotators). As can be noted from Table 1, a large share of the conflicts (1156) is between the Wholly Derived (WD) and Partially
Derived (PD) classes. This confusion likely arose because, with short text fragments and a cross-lingual setting, it was
difficult for the annotators to distinguish the WD and PD classes. A further 1476 conflicts occurred between the PD and

13 https://translate.google.com, Last visited: 10-Nov-2021.


14 Annotator A is the first author of this paper.

I. Muneer and R.M.A. Nawab Computer Speech & Language 75 (2022) 101381

Table 1
Annotation statistics.
Statistics CLEU-Sen
Total pairs 21,669
Agreed pairs 18,981
Conflicted pairs 2688
Inter-annotator agreement 0.840
Conflicts between WD and PD 1156
Conflicts between PD and ND 1476
Conflicts between WD and ND 56

Table 2
Corpus characteristics.
Characteristics CLEU-Sen
Total pairs 21,669
Wholly derived 7655
Partially derived 6461
Non Derived 7553
Source Derived
Total tokens 235,721 261,924
Total tokens (without stop-words) 143,381 130,454
Total types 11,920 8404
Total types (without stop-words) 11,776 7075
Min tokens per example 3 3
Max tokens per example 64 75
Mean of tokens per example 11 13
Median of tokens per example 9 10

ND classes, which is also understandable, as it is hard to distinguish between the PD and ND classes. The number of conflicts
between WD and ND is, by contrast, quite small (56), which shows that it is relatively simple to discriminate between these two
classes. In conclusion, the largest numbers of disagreements involved the PD class (against either WD or ND), as both annotators
found it hard to separate PD from its neighboring classes in many cases.

3.3. Corpus characteristics

As can be noted from Table 1, out of 21,669 cross-lingual text pairs, the gold-standard CLEU-Sen contains 7655 (35.3%) WD, 6461
(29.9%) PD, and 7553 (34.8%) ND cross-lingual text pairs, indicating that the CLEU-Sen is well balanced. Table 2 shows the
detailed statistics of the proposed corpus. The CLEU-Sen contains in total 235,721 English and 261,924 Urdu tokens, which shows
that the Urdu texts are longer than the English texts. More detailed statistics can be found in Table 2. The corpus is standardized
in CSV format and will be publicly accessible for download for research purposes under a Creative Commons CC-BY-NC-SA license.15
For reviewers, the corpus can be accessed from the provided link.16

3.4. Examples from proposed corpus

Figs. 1, 2, and 3 show WD, PD, and ND cases of cross-lingual text reuse from the CLEU-Sen. As Fig. 1 shows, the translation
of the derived text is exactly the same as the source text. This indicates that the derived text (Urdu) is an exact translation of the
source text (English), and this cross-lingual text pair is therefore annotated as WD. As can be noted from Fig. 2, the translation
of the derived text is not exactly the same as the source text. The derived text has been edited using two edit operations, i.e.,
word reordering and insertion/deletion of words. This indicates that the derived text (Urdu) is not an exact translation of the
source text (English), and this cross-lingual text pair is therefore annotated as PD. To conclude, WD cross-lingual text pairs are
almost exact translations of each other, whereas different edit operations have been applied to the PD cross-lingual text pairs to
create cross-lingual paraphrased pairs for the proposed CLEU-Sen corpus. As Fig. 3 shows, the source and derived texts are
unrelated for the ND cross-lingual text pairs.

4. Techniques for X-TRD

To demonstrate how the proposed corpus can be used for the development, comparison, analysis, and evaluation of X-TRD
systems for the English–Urdu language pair, we applied various X-TRD techniques, both with and without T+MA, to the
proposed corpus. We applied five different cross-lingual sentence transformer (CLST) based techniques on our

15 https://creativecommons.org/licenses/by-nc-sa/3.0/ Last Visited: 2-March-2021.


16 https://forms.gle/V1frREMiaYBkRDJh8 Password: fa18-pcs-002.


Fig. 1. Example of WD.

Fig. 2. Example of PD.

Fig. 3. Example of ND.

proposed corpus. We also applied T+MA based techniques, including N-gram overlap, WordNet based techniques, mono-lingual
word embedding based techniques, and the Kullback–Leibler Distance. Finally, we made a detailed comparison between the T+MA
based techniques and the CLST based techniques. As far as we are aware, these techniques have not previously been used for
X-TRD for the English–Urdu language pair on simulated cases. We now discuss these techniques in detail.

4.1. Cross-lingual sentence transformer based techniques

We used two general multi-lingual sentence transformers, ‘bert-base-nli’ and ‘bert-large-nli’. The general sentence transformers are
trained on the combination of the Stanford Natural Language Inference (SNLI) corpus,17 Bowman et al. (2015), and the Multi-Genre
NLI (MultiNLI) corpus,18 Williams et al. (2018). ‘bert-large-nli’ extracts a 1024-dimensional vector by taking the mean of every
token’s vector and has more layers, whereas ‘bert-base-nli’ extracts a 768-dimensional vector. To compute the similarity
between the cross-lingual source-derived text pairs, the cross-lingual sentence embedding technique was used as follows. In the
first step, English text was converted to sentence embedding vectors using the ‘bert-base-nli’ model. In the second step, Urdu text

17 https://nlp.stanford.edu/projects/snli Last visited: 10-5-2021.


18 https://cims.nyu.edu/~sbowman/multinli/ Last visited: 10-5-2021.


was converted to sentence embedding using the ‘bert-base-nli’ model. Finally, in the third step, the similarity between sentence
embedding vectors of English and Urdu texts was computed using the Cosine similarity (see Eq. (1)).

$Sim(S, D) = \frac{\vec{S} \cdot \vec{D}}{|\vec{S}| \times |\vec{D}|}$  (1)

where $|\vec{S}|$ and $|\vec{D}|$ represent the magnitudes of the source and derived embedding vectors, respectively. The Cosine similarity measure allows partial
matching, which enables a better estimation of similarity. For Exp 2, in the first step, English text was converted to
sentence embedding vectors using the ‘bert-large-nli’ model. In the second step, Urdu text was converted to sentence embedding
vectors using the ‘bert-large-nli’ model. Finally, in the third step, the similarity between the sentence embedding vectors of the
English and Urdu texts was computed using Cosine similarity (see Eq. (1)).
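The third step of each transformer-based experiment reduces to the Eq. (1) computation. A minimal sketch of that step follows, using NumPy, with toy low-dimensional vectors standing in for the actual 768- or 1024-dimensional sentence embeddings:

```python
import numpy as np

def cosine_similarity(s_vec, d_vec):
    """Eq. (1): cosine of the angle between source and derived embeddings."""
    return float(np.dot(s_vec, d_vec) / (np.linalg.norm(s_vec) * np.linalg.norm(d_vec)))

# Toy 4-dimensional "embeddings" (real CLST vectors have 768 or 1024 dimensions).
s = np.array([1.0, 0.0, 1.0, 0.0])
d = np.array([1.0, 0.0, 1.0, 0.0])
print(cosine_similarity(s, d))  # 1.0 for identical vectors
```

In the actual experiments, `s` and `d` would be the outputs of the respective pre-trained sentence transformer for the English and Urdu texts.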
We need a multi-lingual sentence transformer to compute the similarity between the cross-lingual sentence pairs of the proposed
corpus. For this study, we used the pre-trained model Language-agnostic BERT Sentence Embedding (LaBSE)19 (Feng et al., 2020) for
the source (English) text and the derived (Urdu) text. We selected this model because it supports 109 languages and
works well for translated pairs in multiple languages. To compute the similarity between cross-lingual source-derived text pairs,
the cross-lingual sentence embedding technique was used as follows. In the first step, English text was converted to the sentence
embedding vectors using LaBSE. In the second step, Urdu text was converted to the sentence embedding using the LaBSE model.
Finally, in the third step, the similarity between sentence embedding vectors of the English and Urdu texts was computed using the
Cosine similarity (see Eq. (1)).
We selected one more model to observe the performance on paraphrased pairs. For this experiment, we used the pre-trained
sentence transformer ‘paraphrase-xlm-r-multilingual-v1’20 (Reimers and Gurevych, 2020). The model is trained on 50
million multilingual paraphrase pairs, fine-tuned with parallel data for 50+ languages, and extracts 768-dimensional averaged
sentence vectors. We selected this model because it supports 50 languages and works well for paraphrase
identification pairs in multiple languages. In the first step, English text was converted to sentence embedding vectors using
‘paraphrase-xlm-r-multilingual-v1’ model. In the second step, Urdu text was converted to the sentence embedding using ‘paraphrase-
xlm-r-multilingual-v1’ model. Finally, in the third step, the similarity between sentence embedding vectors of English and Urdu texts
was computed using the Cosine similarity (see Eq. (1)).
We selected another special model to further observe the performance on paraphrased pairs. For this experiment, we used the
pre-trained CLST ‘quora-distilbert-multilingual’21 (Reimers and Gurevych, 2020) for the source (English) text and the derived (Urdu)
text. We selected this model because it supports 50 languages and works well for duplicate question detection in multiple
languages. In the first step, English text was converted to the sentence embedding vectors using the ‘quora-distilbert-multilingual’
model. In the second step, Urdu text was converted to the sentence embedding using the ‘quora-distilbert-multilingual’ model.
Finally, in the third step, the similarity between sentence embedding vectors of English and Urdu texts was computed using the
Cosine similarity (see Eq. (1)).

4.2. Translation plus mono-lingual analysis

In this study, the T+MA technique is applied in two steps. In the first step, the derived text (in Urdu) was automatically
translated using Google Translator.22 After translation, both the source and derived texts were in the same language (English). In the
second step, similarity scores were computed between the (now mono-lingual) text pairs using the N-gram overlap techniques
(lexical), WordNet based techniques (semantic), mono-lingual word embedding based techniques (word embedding), and the
Kullback–Leibler Distance (probabilistic).
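The two-step T+MA pipeline can be sketched as below. Here `translate_ur_to_en` is a placeholder for the machine translation step (an identity stub, not an actual MT call), and the unigram overlap stands in for any of the mono-lingual measures described in the following subsections; all function names are ours.

```python
def translate_ur_to_en(urdu_text):
    # Placeholder: in the actual pipeline this would call an MT system
    # (the study used Google Translator); here it simply returns its input.
    return urdu_text

def unigram_overlap(a, b):
    """A simple mono-lingual similarity: word-unigram overlap coefficient."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

def t_plus_ma(source_en, derived_ur):
    """Step 1: translate the derived text to English; step 2: mono-lingual score."""
    derived_en = translate_ur_to_en(derived_ur)
    return unigram_overlap(source_en, derived_en)

print(t_plus_ma("the cat sat", "the cat sat"))  # 1.0 with the identity placeholder
```

Any of the lexical, semantic, embedding-based, or probabilistic measures below can be swapped in for `unigram_overlap` without changing the surrounding pipeline.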

4.2.1. N-gram overlap


We used six different combinations of the N-gram overlap to measure the similarity scores. The first five scores were computed
by varying the N-gram length from $N = 1$ to $N = 5$ using an overlap similarity measure. Furthermore, all the features from $N = 1$
to $N = 5$ were combined together for the sixth set. Given two texts, a source ‘‘S’’ and a derived text ‘‘D’’, the sets of n-grams of
S and D are represented by $S(S, n)$ and $S(D, n)$, respectively.
The overlap similarity (Vijaymeena and Kavitha, 2016) is defined as:

$S_{overlap} = \frac{|S(S, n) \cap S(D, n)|}{\min(|S(S, n)|, |S(D, n)|)}$  (2)
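A sketch of Eq. (2) on word n-grams (the function names are ours, and the toy sentences are hypothetical):

```python
def ngrams(text, n):
    """Set of word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_similarity(source, derived, n):
    """Eq. (2): overlap coefficient between the n-gram sets of S and D."""
    s, d = ngrams(source, n), ngrams(derived, n)
    if not s or not d:
        return 0.0
    return len(s & d) / min(len(s), len(d))

print(overlap_similarity("the cat sat on the mat", "the cat sat", 2))  # 1.0
```

Because the denominator is the size of the smaller set, a short derived sentence fully contained in a longer source scores 1.0, which suits the detection of partial reuse.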

19 The LaBSE training corpus contains 17 billion mono-lingual sentences and 6 billion bilingual translation pairs; the model extracts 768-dimensional averaged sentence vectors.
20 Multilingual version of ‘distilroberta-base-paraphrase-v1’, https://github.com/jmrf/sentence-transformers/tree/master/sentntransformers/datasets, Last visited: 10-5-2021.
21 Multilingual version of ‘distilbert-base-nli-stsb-quora-ranking’, fine-tuned with parallel data for 50+ languages; extracts 768-dimensional averaged sentence vectors.


22 https://translate.google.com, Last visited: 20-March-2021.


4.2.2. Wordnet based techniques


For this study, we used the English WordNet to compute the semantic similarity between text pairs. Given a text pair, in the first
step, both the source and derived texts were tokenized and pre-processed (removing punctuation marks and lower-casing). In
the second step, the synset(s) of each word in the source and derived texts were extracted from WordNet. In the final step, the
set of unique synsets of the source text was compared with the set of unique synsets of the derived text using four different
similarity co-efficients: the Overlap, Jaccard, Dice, and Containment similarity co-efficients.
If 𝑆𝑠𝑦𝑛𝑠𝑒𝑡𝑠 and 𝐷𝑠𝑦𝑛𝑠𝑒𝑡𝑠 represent the sets of unique synsets for source and derived texts respectively, then similarity between
𝑆𝑠𝑦𝑛𝑠𝑒𝑡𝑠 and 𝐷𝑠𝑦𝑛𝑠𝑒𝑡𝑠 was computed using overlap (Eq. (3)), Vijaymeena and Kavitha (2016), jaccard (Eq. (4)) (Koudas et al., 2006),
dice (Eq. (5)) (Mardiana et al., 2015) and containment (Eq. (6)) (Koudas et al., 2006) similarity co-efficients using the following
formulas.
$S_{overlap} = \frac{|S_{synsets} \cap D_{synsets}|}{\min(|S_{synsets}|, |D_{synsets}|)}$  (3)

$S_{Jaccard} = \frac{|S_{synsets} \cap D_{synsets}|}{|S_{synsets} \cup D_{synsets}|}$  (4)

$S_{dice} = \frac{2 \times |S_{synsets} \cap D_{synsets}|}{|S_{synsets}| + |D_{synsets}|}$  (5)

$S_{containment} = \frac{|S_{synsets} \cap D_{synsets}|}{|S_{synsets}|}$  (6)
Note that we used two variations in extracting synsets from the WordNet against a word from source or derived text. In the first
variation, only the first synset (or sense) of a word was extracted (called the first sense technique). In the second variation, all the
synsets (or senses) of a word were extracted (called the all senses technique).
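The four co-efficients of Eqs. (3)–(6) operate on plain sets and can be sketched as follows. The toy synset identifiers are hypothetical; extracting real synsets would require WordNet (e.g., via NLTK), which is omitted here.

```python
def overlap(s, d):      # Eq. (3): overlap coefficient
    return len(s & d) / min(len(s), len(d))

def jaccard(s, d):      # Eq. (4): Jaccard coefficient
    return len(s & d) / len(s | d)

def dice(s, d):         # Eq. (5): Dice coefficient
    return 2 * len(s & d) / (len(s) + len(d))

def containment(s, d):  # Eq. (6): containment of the derived set in the source set
    return len(s & d) / len(s)

# Toy synset sets for a source and a derived sentence.
s_synsets = {"dog.n.01", "run.v.01", "park.n.01"}
d_synsets = {"dog.n.01", "run.v.01", "garden.n.01"}
print(jaccard(s_synsets, d_synsets))  # 2 shared out of 4 distinct synsets -> 0.5
```

The same functions serve both the first-sense and all-senses variants; only the construction of the synset sets differs.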

4.2.3. Mono-lingual word embedding based techniques


For this study, we used the Google Word2Vec (Ghannay et al., 2016) pre-trained word embedding model. To compute the
similarity between the source-derived text pairs, the mono-lingual word embedding technique was used as follows.
In the first step, both the source and derived texts were tokenized and pre-processed by removing punctuation marks. In the
second step, 300-dimensional word embedding vectors were extracted for all the words in the source and derived texts using the
pre-trained Google Word2Vec model23 (Ghannay et al., 2016). In the next step, the similarities were computed between the derived
and source embedding vectors in two ways: (1) the sum of word embedding vectors technique and (2) the average of word
embedding vectors technique.
For the sum of the word embedding vectors technique, the word embedding vectors of all the source words were added up to
get a single source word embedding vector. Similarly, the vectors of all the derived words were added to get a single derived word
embedding vector. After that, the similarity between the (added) source and derived word embedding vectors were computed using
the Cosine similarity measure (Eq. (1)) and the Euclidean distance measure (Eq. (7)).
For the average of the word embedding vectors technique, the word embedding vectors of all the source words were averaged to
get a single source word embedding vector. Similarly, all the derived word embedding vectors were averaged to get a single derived
word embedding vector. After that, the similarity between the (averaged) source and derived word embedding vectors was computed
using the Cosine similarity measure (Eq. (1)) (Lahitani et al., 2016) and the Euclidean distance measure (Eq. (7)) (Vijaymeena and
Kavitha, 2016).

$Euclidean(S, D) = \sqrt{(\vec{S} - \vec{D}) \cdot (\vec{S} - \vec{D})}$  (7)

where $\vec{S}$ and $\vec{D}$ represent the source and derived embedding vectors, respectively. The Euclidean distance measures the
straight-line distance between the source and derived vectors.

4.2.4. Kullback–Leibler distance


For this study, we used the Kullback–Leibler distance technique as follows. Given a text pair, the task is to calculate the distance
between the source and derived texts. In the first step, text pre-processing was performed on both the source and derived texts by
removing stop words and punctuation marks. In the second step, TF.IDF weights of word uni-grams were computed for both the
source and derived texts. In the final step, the resulting probability distributions of the two texts were used to calculate the
Kullback–Leibler distance between the source and derived texts.
Given two probability distributions for the source text S and the derived text D with mass functions $S(x)$ and $D(x)$ over an event
space X, $KL_d$ calculates how much S and D differ (Eq. (8)).
$KL_d(S \parallel D) = \sum_{x \in X} S(x) \log\left(\frac{S(x)}{D(x)}\right)$  (8)
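Eq. (8) can be sketched as below. The small smoothing constant is our addition (not specified above) and keeps the distance finite when a word occurs in only one of the two texts; plain unigram counts stand in for the TF.IDF-weighted distributions used in the actual experiments.

```python
import math
from collections import Counter

def distribution(tokens, vocab, eps=1e-9):
    """Smoothed unigram probability distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_distance(src_tokens, der_tokens):
    """Eq. (8): KL divergence of the derived distribution from the source one."""
    vocab = set(src_tokens) | set(der_tokens)
    s = distribution(src_tokens, vocab)
    d = distribution(der_tokens, vocab)
    return sum(s[x] * math.log(s[x] / d[x]) for x in vocab)

print(kl_distance("a b c".split(), "a b c".split()))  # 0.0 for identical texts
```

The distance is asymmetric ($KL_d(S \parallel D) \neq KL_d(D \parallel S)$ in general), so the direction from source to derived must be fixed consistently across all text pairs.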

23 Pre-trained Google word embedding model is trained for the English language on 100 billion words a Google News dataset.


Table 3
Applied techniques.
Experiment Techniques Experiment Techniques
Exp 1 Using Bert-base transformer Exp 7.4 Using WordNet-FS-Dice
Exp 2 Using Bert-large transformer Exp 7.5 Using WordNet-AS-Overlap
Exp 3 Using LaBSE transformer Exp 7.6 Using WordNet-AS-Jaccard
Exp 4 Using Paraphrase-xlm-r-multilingual transformer Exp 7.7 Using WordNet-AS-Containment
Exp 5 Using quora-distilbert-multilingual transformer Exp 7.8 Using WordNet-AS-Dice
Exp 6 Using Comb-All-ST Exp 7.9 Using Combined-WordNet
Exp 7.1 Using WordNet-FS-Overlap Exp 8.1 Using MWE-Avg-Cosine
Exp 7.2 Using WordNet-FS-Jaccard Exp 8.2 Using MWE-Sum-Cosine
Exp 7.3 Using WordNet-FS-Containment Exp 8.3 Using MWE-Avg-Euclidean
Exp 8.4 Using MWE-Sum-Euclidean Exp 8.5 Using Combined-MWE
Exp 9 Using KLD Exp 10.1 Using Unigram
Exp 10.2 Using Bigram Exp 10.3 Using Trigram
Exp 10.4 Using Fourgram Exp 10.5 Using Fivegram
Exp 10.6 Using Combined-Ngram Exp 11 Using Comb-All-T+MA
Exp 12 Using Comb-All

5. Experimental setup

This section describes the corpus, evaluation methodology, and evaluation measures used for the X-TRD experiments applied to
the proposed CLEU-Sen corpus.

5.1. Corpus

The proposed corpus contains a total of 21,669 cross-lingual sentence pairs, of which 7655 pairs are WD, 6461 pairs are PD, and
7553 pairs are ND.

5.2. Techniques

We have concentrated on applying two types of techniques on our proposed corpus: (1) T+MA based techniques including
the N-gram Overlap (lexical), WordNet based techniques (semantic), mono-lingual word embedding based techniques, and the
Kullback–Leibler Distance (KLD) (probabilistic), and (2) Cross-Lingual Sentence Transformers based techniques including the
general transformers (bert-base-nli and bert-large-nli) and special transformers (Language-agnostic BERT Sentence Embedding (LaBSE),
paraphrase-xlm-r-multilingual-v1, and quora-distilbert-multilingual) based techniques.24 The reason for selecting T+MA based
techniques is that in the previous studies, they have presented promising results for the text reuse and plagiarism detection
tasks (Barrón-Cedeno et al., 2008, 2013; Muneer et al., 2019; Sharjeel, 2020; Potthast et al., 2011b,a). The reason for selecting
CLST based techniques is that in the previous studies promising results were obtained using the multilingual BERT on tasks similar
to text reuse detection, e.g., paraphrase detection (Reimers and Gurevych, 2019, 2020).
The set of experiments that we carried out using the T+MA and CLST based techniques is shown in Table 3. For Exp 1–5,
we applied the different sentence transformers, and for Exp 6, we carried out one more experiment by combining the features of all
the cross-lingual sentence transformer based techniques (called Comb-All-ST). For Exp 7, we applied the WordNet based techniques:
(1) the WordNet First Sense technique (WN-FS), (2) WordNet All Senses (WN-AS), and (3) Combined-WN. For Exp 8, we applied the
Mono-lingual Word Embedding (MWE) based techniques, including the Sum-Vector and Average-Vector. For Exp 9, we applied the
Kullback–Leibler Distance (KLD), and for Exp 10, we applied the N-gram overlap techniques plus one more experiment using the
combined n-grams. For Exp 11, we combined all the T+MA based techniques mentioned in the four experiments Exp 7 to Exp 10
(called Comb-All-T+MA). For Exp 12, we combined all the features obtained using the T+MA based techniques and the CLST
techniques (called Comb-All).
Table 4 shows the embedding sizes of the CLST models and the word embedding model.

5.3. Evaluation measures

Precision (Eq. (9)), Recall (Eq. (10)), and $F_1$ (Eq. (11)) are the most commonly used evaluation measures for X-TRD tasks.
Precision (P) is defined as the proportion of predicted positive cases that are correct.

$P = TP/(TP + FP)$  (9)

Recall (R) is defined as the proportion of actual positive cases that are correctly identified.

$R = TP/(TP + FN)$  (10)

24 The Code for all the techniques can be downloaded from the following link: https://drive.google.com/drive/folders/1BClVwnSYvu-Q_
ZiLe5QFE9nwyJstMyWZ?usp=sharing.


Table 4
Embedding size.
Model Embedding size
Sentence Transformers
Bert-base 768
Bert-large 1024
LaBSE 768
Paraphrase-xlm-r-multilingual 768
quora-distilbert-multilingual 768
Mono-lingual word embedding
Google Word2Vec model 300

The $F_1$ measure is the harmonic mean of precision (P) and recall (R).

$F_1 = (2 \times P \times R)/(P + R)$  (11)
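Eqs. (9)–(11) in code, as a minimal sketch with hypothetical counts:

```python
def precision(tp, fp):
    """Eq. (9): fraction of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (10): fraction of actual positives that are identified."""
    return tp / (tp + fn)

def f1(p, r):
    """Eq. (11): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
p, r = precision(8, 2), recall(8, 2)
print(p, r, f1(p, r))  # 0.8 0.8 0.8
```

For the multi-class setting used below, the per-class $F_1$ scores are averaged, weighted by the number of instances in each class.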

5.4. Evaluation methodology

The problem of X-TRD for the English–Urdu language pair was treated as a supervised text classification task on the proposed
cross-lingual corpus. The task was evaluated in two versions: (1) a binary classification task and (2) a ternary classification
task. The binary classification task aims to recognize X-TR at two levels: (1) Derived and (2) Non Derived. For this purpose,
texts in the Wholly Derived and Partially Derived classes were merged into a single class, i.e., Derived. The ternary classification
task aims to discriminate X-TR at three levels: (1) Wholly Derived, (2) Partially Derived, and (3) Non Derived.
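The mapping from the ternary labels to the binary labels can be sketched as follows (label strings are ours, chosen for illustration):

```python
def to_binary(label):
    """Merge WD and PD into a single 'Derived' class for the binary task."""
    return "Derived" if label in {"WD", "PD"} else "ND"

ternary = ["WD", "PD", "ND", "PD"]
print([to_binary(x) for x in ternary])  # ['Derived', 'Derived', 'ND', 'Derived']
```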
For both the binary and ternary classification tasks, nine different machine learning algorithms were used, including Bernoulli
Naive Bayes (BNB), Gaussian Naive Bayes (GNB), Logistic Regression (LR), Ada Boost (AB), Decision Tree (DT), k-NN, Multi-Layer
Perceptron (MLP), Gradient Boosting Classifier (GBC), and Random Forest (RF).25 To better estimate the performance of the machine
learning algorithms, 10-fold cross-validation was used. The similarity/distance scores obtained using the various techniques (Section
4) were passed to the machine learning algorithms as input. Weighted-average $F_1$ scores are reported for all the techniques for
the binary and ternary classification tasks.

6. Results and analysis

Tables 5 and 6 show weighted average 𝐹1 scores obtained using various cross-lingual text reuse detection techniques for binary
and ternary classification tasks respectively. In both Tables, ‘Techniques’ refers to the techniques used for X-TRD.26
Overall, the best results are obtained by combining the features of all techniques for both the binary and ternary classification
tasks. The best results are obtained using the ‘Comb-All’ technique ($F_1$ = 0.94 using GBC, MLP, and KNN, and $F_1$ = 0.84 using
RF, for the binary and ternary classification tasks respectively). This shows that the combination of the proposed techniques is
fruitful for X-TRD at the sentence level. As expected, the results of the ternary classification are lower than those of the binary
classification. This demonstrates that it is easier to discriminate between two levels of X-TR (Derived and Non Derived) than
between three levels of X-TR (Wholly Derived, Partially Derived, and Non Derived); see Tables 5 and 6. Consequently, there is a
large difference in performance between the binary ($F_1$ = 0.94) and ternary ($F_1$ = 0.84) classification tasks using the ‘Comb-All’
technique.
For the combined T+MA techniques, the best results are obtained by combining the features of all T+MA techniques for both
the binary and ternary classification tasks. The best results are obtained using the ‘Comb-All-T+MA’ technique ($F_1$ = 0.94 using
GBC, and $F_1$ = 0.82 using RF, for the binary and ternary classification tasks respectively). As can be observed, the performance of
‘Comb-All-T+MA’ is the same as that of ‘Comb-All’ for the binary classification task. This shows that this combination of
techniques is also helpful for X-TRD at the sentence level using T+MA. Moreover, T+MA outperforms the cross-lingual techniques
by a larger margin for the binary classification task than for the ternary classification task.
Among the WordNet based techniques, the best results are obtained with Combined-WN: for the binary task, $F_1$ = 0.92 (using DT,
MLP, RF, and GBC) with the FS and AS variants combined with the Overlap, Jaccard, and Dice co-efficients as well as Combined-WN,
and for the ternary task, $F_1$ = 0.77 (using RF) with Combined-WN. These results indicate that Combined-WN is more effective
than WN-AS and WN-FS alone and more useful in detecting X-TR. They also show that Jaccard and Overlap are the most effective
of the similarity co-efficients.
Among the mono-lingual word embedding techniques, the best results are obtained using the Combined-MWE technique for both
the binary ($F_1$ = 0.92 using GBC) and ternary ($F_1$ = 0.76 using RF) classification tasks. As can be noted, these results are slightly
lower than those of the WordNet techniques for the ternary classification, while the performance is the same for the binary
classification, highlighting the fact that the mono-lingual word embedding techniques are also effective in detecting X-TR for the
binary classification but less so for the ternary classification.

25 Scikit-learn implementation of these machine learning algorithms was used.


26 For detailed results, see the following link: https://drive.google.com/drive/folders/1BClVwnSYvu-Q_ZiLe5QFE9nwyJstMyWZ?usp=sharing.


Table 5
Weighted 𝐹1 scores obtained by applying various techniques for binary classification task.
Techniques Scores
General sentence transformers
Exp 1: Bert-base 0.645137
Exp 2: Bert-large 0.651213
Special transformers
Exp 3: LaBSE 0.900204
Exp 4: Paraphr ase-xlm-r-multilingual 0.903046
Exp 5: quora-distilbert-multilingual 0.870214
Combine effect of all CLST
Exp 6: Comb-All-ST 0.907003
T+MA based techniques
WordNet based techniques
Exp 7.1: WordNet-FS-Overlap 0.920686
Exp 7.2: WordNet-FS-Jaccard 0.921569
Exp 7.3: WordNet-FS-Containment 0.774334
Exp 7.4: WordNet-FS-Dice 0.919555
Exp 7.5: WordNet-AS-Overlap 0.916644
Exp 7.6: WordNet-AS-Jaccard 0.915853
Exp 7.7: WordNet-AS-Containment 0.779389
Exp 7.8: WordNet-AS-Dice 0.903024
Exp 7.9: Combined-WordNet 0.923921
MWE based techniques
Exp 8.1: MWE-Avg-Cosine 0.894758
Exp 8.2: MWE-Sum-Cosine 0.894758
Exp 8.3: MWE-Avg-Euclidean 0.850698
Exp 8.4: MWE-Sum-Euclidean 0.852242
Exp 8.5: Combined-MWE 0.904686
Kullback–Leibler distance
Exp 9: KLD 0.922344
N-gram overlap
Exp 10.1: Unigram 0.889249
Exp 10.2: Bigram 0.862907
Exp 10.3: Trigram 0.772242
Exp 10.4: Fourgram 0.669418
Exp 10.5: Fivegram 0.561653
Exp 10.6: Combined-Ngram 0.890229
Combine effect of all T+MA techniques
Exp 11: Comb-All-T+MA 0.935980
Combine effect of all T+MA and CLST
Exp 12: Comb-All 0.939490

For the probabilistic technique, KLD produces good results for both the binary ($F_1$ = 0.92 using KNN, GBC, DT, and RF) and
ternary ($F_1$ = 0.81 using the Gradient Boosting Classifier) classification tasks, outperforming the mono-lingual word embedding
and several of the WordNet based techniques. This demonstrates that the KLD technique is useful in detecting X-TR for both the
binary and ternary classification tasks.
For the N-gram overlap techniques, the best results are obtained using unigrams for the binary classification task and
Combined-Ngram for the ternary classification task. As can be observed from Tables 5 and 6, increasing the size of the N-grams
decreases the performance. The best results are $F_1$ = 0.89 (using KNN, GBC, DT, and RF) for the binary task and $F_1$ = 0.72
(using GBC, RF, and MLP) for the ternary task. As can be noted, the results for the ternary classification task are considerably
lower than those of all other techniques, which shows that N-gram overlap is not very helpful for the X-TRD problem.
Among the sentence transformers, the overall best results are obtained by combining the features of all the CLST based techniques
for both the binary and ternary classification tasks. The best results are obtained using the ‘Comb-All-ST’ technique ($F_1$ = 0.91
using GBC, LR, and MLP, and $F_1$ = 0.81 using RF, for the binary and ternary classification tasks respectively). This shows that
the combination of the proposed CLST techniques is effective for X-TRD at the sentence level.
Among the individual pre-trained sentence transformer models, the ‘LaBSE’ model outperforms all the other pre-trained models.
The best results are obtained using the LaBSE model ($F_1$ = 0.90 using GNB, LR, MLP, and GBC, and $F_1$ = 0.78 using GBC, for the
binary and ternary classification tasks respectively). As can be observed, the results using LaBSE are slightly lower than those of
‘Comb-All-ST’. A possible reason is that LaBSE is trained on 6 billion multi-lingual pairs and is specially designed for translated sentence pairs in two languages,


Table 6
Weighted 𝐹1 scores obtained by applying various techniques for ternary classification task.
Techniques Score
General sentence transformers
Exp 1: Bert-base 0.515805
Exp 2: Bert-large 0.517291
Special transformers
Exp 3: LaBSE 0.775840
Exp 4: Paraphrase-xlm-r-multilingual 0.709789
Exp 5: Quora-distilbert-multilingual 0.664528
Exp 6: Comb-All-ST 0.807834
T+MA based techniques
WordNet based techniques
Exp 7.1: WordNet-FS-Overlap 0.716203
Exp 7.2: WordNet-FS-Jaccard 0.732322
Exp 7.3: WordNet-FS-Containment 0.503677
Exp 7.4: WordNet-FS-Dice 0.700453
Exp 7.5: WordNet-AS-Overlap 0.688772
Exp 7.6: WordNet-AS-Jaccard 0.718259
Exp 7.7: WordNet-AS-Containment 0.529605
Exp 7.8: WordNet-AS-Dice 0.661121
Exp 7.9: Combined-WordNet 0.771802
MWE based techniques
Exp 8.1: MWE-Avg-Cosine 0.718603
Exp 8.2: MWE-Sum-Cosine 0.719290
Exp 8.3: MWE-Avg-Euclidean 0.682587
Exp 8.4: MWE-Sum-Euclidean 0.649005
Exp 8.5: Combined-MWE 0.760346
Kullback–Leibler distance
Exp 9: KLD 0.738160
N-gram overlap
Exp 10.1: Unigram 0.713706
Exp 10.2: Bigram 0.690509
Exp 10.3: Trigram 0.583092
Exp 10.4: Fourgram 0.490699
Exp 10.5: Fivegram 0.436405
Exp 10.6: Combined-Ngram 0.724263
Combine effect of all T+MA techniques
Exp 11: Comb-All-T+MA 0.815308
Combined effect of all T+MA and CLST techniques
Exp 12: Comb-All 0.844869

therefore, it can extract high-quality feature vectors for cross-lingual pairs; moreover, since it is trained on sentence-level data, it works better for matching chunks of longer and varying lengths.
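Once each sentence of a pair has been encoded into a fixed-size vector by a cross-lingual sentence transformer such as LaBSE, the pair is scored by the cosine of the two embeddings. A self-contained sketch with toy vectors standing in for the real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# Toy 4-dimensional "embeddings"; in practice these would come from
# encoding the English and Urdu sentences with a model such as LaBSE.
emb_en = [0.2, 0.7, 0.1, 0.5]
emb_ur = [0.25, 0.65, 0.05, 0.55]
similarity = cosine(emb_en, emb_ur)  # close to 1.0 for good translations
```

These similarity values (one per pre-trained model) are the features fed to the classifiers in the CLST experiments.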
Among the machine learning algorithms, no single algorithm dominates across techniques for the binary and ternary classification tasks. KNN, MLP, and GBC mostly outperform the other machine learning algorithms for the binary classification task, whereas Random Forest shows the best performance with the combined features for the ternary classification task. As the performance varies across experiments, there is no single machine learning algorithm that performs best in all cases.
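All classifiers are compared using the weighted 𝐹1 score reported in Table 6, i.e. per-class 𝐹1 averaged with weights proportional to class support. A stdlib sketch for the three-class (WD/PD/ND) setting, equivalent to scikit-learn's f1_score(average='weighted'):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total_f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total_f1 += n_cls * f1
    return total_f1 / len(y_true)

# Toy ternary predictions over the three rewrite levels.
y_true = ["WD", "WD", "PD", "PD", "ND", "ND"]
y_pred = ["WD", "PD", "PD", "PD", "ND", "WD"]
score = weighted_f1(y_true, y_pred)
```

Because the three classes in CLEU-Sen are of similar but not identical size, the support weighting keeps the score from being dominated by any single rewrite level.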
To conclude, the main findings of our extensive experimentation are as follows. First, there is a significant difference between the performance on the binary and ternary classification tasks, which shows that discriminating three levels of X-TR is more difficult than discriminating two levels of text reuse. Second, the choice of SBERT framework clearly affects the performance of X-TRD. Third, the individual performance of the KLD technique and of the WordNet techniques with the Jaccard and Dice coefficients is equivalent to that of the sentence transformers for binary classification and slightly lower for ternary classification. Fourth, the LaBSE framework extracts higher-quality feature vectors for X-TRD for the English–Urdu language pair; however, the combined T+MA-based features still outperform the cross-lingual techniques for both the binary and ternary classification tasks. Finally, the best performance is achieved by the combination of all the proposed techniques, including the T+MA and CLST techniques, on CLEU-Sen.
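The KLD feature compares smoothed unigram distributions of the translated source sentence and the suspicious sentence: the lower the divergence, the more likely the pair is derived. A minimal sketch (the tokenisation and smoothing scheme are illustrative, not the exact variant used in the paper):

```python
import math
from collections import Counter

def kl_distance(sent_a, sent_b, epsilon=1e-6):
    """Kullback-Leibler divergence D(P_a || P_b) between smoothed unigram
    distributions of two tokenised sentences."""
    tok_a, tok_b = sent_a.lower().split(), sent_b.lower().split()
    vocab = set(tok_a) | set(tok_b)
    ca, cb = Counter(tok_a), Counter(tok_b)
    # Additive smoothing so every vocabulary word has non-zero probability.
    pa = {w: (ca[w] + epsilon) / (len(tok_a) + epsilon * len(vocab)) for w in vocab}
    pb = {w: (cb[w] + epsilon) / (len(tok_b) + epsilon * len(vocab)) for w in vocab}
    return sum(pa[w] * math.log(pa[w] / pb[w]) for w in vocab)

near = kl_distance("the cat sat on the mat", "the cat sat on the mat")
far = kl_distance("the cat sat on the mat", "stock markets fell sharply today")
```

An identical pair yields a divergence of zero, while unrelated sentences yield a large positive value, which is why the (negated or thresholded) distance is a useful classification feature.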


7. Conclusion and future work

This paper presents a large X-TR corpus at the sentence level for the English–Urdu language pair. The proposed corpus contains simulated cases of X-TR, which are manually annotated at three levels of rewrite (Wholly Derived = 7,655, Partially Derived = 6,461, and Non Derived = 7,553). To demonstrate how our proposed corpus can be used for the development, evaluation, and comparison of X-TRD techniques for the English–Urdu language pair, we applied various Cross-Lingual Sentence Transformer and Translation plus Monolingual Analysis based techniques to it. For the binary classification task, the best results (𝐹1 = 0.94) are obtained both by the combination of all Cross-Lingual Sentence Transformer and Translation plus Monolingual Analysis techniques and by the combination of all Translation plus Monolingual Analysis techniques alone. For the ternary classification task, the best results (𝐹1 = 0.84) are obtained using the combination of all Cross-Lingual Sentence Transformer and Translation plus Monolingual Analysis techniques.
In the future, we plan to apply custom-trained word embeddings and sentence transformers to our proposed corpus and compare their performance with that of the pre-trained models.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
