Annotation of plurilingual corpora
Experience from the CLAPOTY project
Pascal Vaillant
Université Paris 13
vaillant@univ-paris13.fr
International Research Days Social Media
Université Rennes 2
Rennes, 24/10/2015
CLAPOTY
This talk is about a project that aimed at
collecting and analyzing a corpus of oral
speech in situations of language contact.
I am presenting a collective work (detailed
later).
The reference person for general information
on the project is:
Isabelle Léglise <leglise@vjf.cnrs.fr>
Code-switching, code-mixing
●
A well-known phenomenon, especially in migrant
communities
ʕla:s liʔana les moustiques daba ʕrfti fajn huma
mxbʕi:n taħt le lit et là, donc pour pouvoir être sûr qu’il
n’y a pas de moustiques... c’était la poubelle xSSha
tkun vide wtanqbD ana tanxarž hadši kullu et tandir
lma alors automatiquement la kajn ši moustique hnaja
mxbaʕ elle cherche l’ombre, elle fout le camp f la
journée.
(example from Bentahila & Davies, 1983)
Pascal Vaillant
LIMICS (UMR INSERM 1142)
vaillant@univ-paris13.fr
Code-switching, code-mixing
●
Also well-known as a side-effect of some
specific socio-linguistic situations :
–
Diglossia (Ferguson 1959)
–
Cultural (incl. technical etc.) pressure from a
dominant language (Thomason & Kaufman 1988)
–
Language shift (ibid.)
–
Language attrition (ibid.)
–
Language death (ibid.)
–
Emerging Pidgins or Creoles (ibid.)
Code-switching, code-mixing
●
Long regarded as a sociolinguistic topic, with
no interest for descriptive linguistics
●
For descriptive linguistics, noise or interference
●
But: rising interest since the early 80s in the
possibility of describing a “grammar” of code-switching (Sankoff & Poplack 1981; Joshi 1982;
Bentahila & Davies 1983; Woolford 1983; Di
Sciullo, Muysken & Singh 1986; Myers-Scotton
& Azuma 1990; Myers-Scotton 1993)
Language contact: psycholinguistics
●
What is happening in the head of a speaker
switching between one language and another?
●
Which constraints do the grammars of the two
(or more) languages in contact impose on the
actual productions of a plurilingual speaker?
●
Understanding codeswitching phenomena is
part of understanding utterance planning and
production (Myers-Scotton 1993).
Language contact: sociolinguistics
●
Why do people mix different languages?
Utilitarian issues may quickly be accounted for.
All the other social and interactional functions of
language mixing (why do bilingual people mix
different languages) have to be studied.
●
What do the participants of an interaction
negotiate when they switch languages?
●
How does the surrounding society consider the
different languages being used? How does it
judge language mixing practices?
Language contact: linguistics
●
Language change is known to often be driven
or induced by language contact
(e.g. Old English with Norman French and Danish;
Yiddish as a German dialect with Romance, Slavic and
Hebrew elements; Creoles as heavily restructured
European languages in the New World)
●
Elementary steps of language evolution by
contact must involve actual speakers of two or
more languages interacting together
Language contact: linguistics
●
Yet the methodology used to trace back
language contact is the usual reconstruction
method
●
Intermediate stages are seldom documented:
it is hard to hold both ends of the chain
●
Understanding how language contact situations
affect the languages in contact is a key to
understanding linguistic change (Thomason &
Kaufman 1988; Peyraube 2002; Heine &
Kuteva 2005, 2007; Aikhenvald & Dixon 2006)
Language contact
A corpus of live language contact
●
2009-2014: CLAPOTY Project (funded by the
Agence Nationale de la Recherche as ANR-09-JCJC-0121-01): http://clapoty.vjf.cnrs.fr/
●
Collect transcriptions of contemporary speech
in language contact situations
●
Develop computer tools to annotate, classify
and search data
●
See contact-induced language change in action
A corpus of live language contact
●
2009-2014: CLAPOTY: http://clapoty.vjf.cnrs.fr/
●
Import knowledge from, and inform, the fields of:
sociolinguistics, typology, contact linguistics,
formal linguistics, corpus linguistics
●
Develop a multi-level and multifactorial
methodology for description and analysis
●
Develop computer standards to store and
annotate plurilingual corpora and metadata
●
Develop computer tools to mine plurilingual text
(Léglise & Alby, 2013; Vaillant & Léglise, 2014)
Diversity
●
2009-2014: CLAPOTY: http://clapoty.vjf.cnrs.fr/
●
A team of people with different scientific
backgrounds
Evangelia Adamou, Sophie Alby, Claudine
Chamoreau, Anne Garcia-Fernandez, Gudrun
Ledegen, Isabelle Léglise, Bettina Migge, Richard
Nock, Claire Saillard, Duna Troiani, Pascal Vaillant
●
Corpora displaying a great diversity of
languages, and of language contact situations
Diversity of languages
●
40 languages, various typological features :
–
Native American: Kalin’a (French Guiana), Nahuatl,
Purepecha (Mexico)
–
Creoles: French-based (Martinique, French Guiana,
Réunion); English-based (Suriname); Portuguese-based
(Guinea-Bissau)
–
West African: Wolof
–
West European: Romance languages (French, Portuguese,
Spanish); Germanic languages (English, Dutch)
–
Balkan: Indo-European (Greek, Romani), Turkish
–
East-Asian: Austronesian (Aboriginal Taiwan languages:
’Amis and Truku); Chinese (Taiwan)
Diversity of situations
●
Different language contact situations:
–
Stable plurilingualism (Purepecha & Spanish in
Mexico; Truku & Chinese in Taiwan)
–
Creole continua (Martinique, Guiana, Réunion)
–
New emerging varieties (Suriname, Guiana)
●
Different types of interaction:
–
Multiple participants: family (12), school (15),
friends (24), media (15), work (51), interviews (27)
–
One speaker: political speech, narratives, tales
Diversity of text plurilingualism
●
Different degrees of internal heterogeneity:
–
near-monolingual texts, where the influence of
language contact is felt through borrowings and
typological changes in slow motion (Purepecha)
–
occasional code-switching (Guinea-Bissau Creole)
–
intensive code-switching, language mixing (Kalin’a)
–
fused lects (Turco-Romani)
–
Creole-Lexifier contacts within a continuum
(Martinique, Réunion, Guiana)
(Auer 1999)
The weight of descriptive frames
●
Language contact phenomena have been
described with different terms:
–
borrowing
–
code switching
–
intra-sentential code switching, code mixing
–
bilingual speech (parler bilingue)
–
fused lects, pidginization
–
interference, creolisms, substratum influence
–
calques, pattern or matter borrowing
–
etc.
The weight of descriptive frames
●
Take a very common phenomenon, here
described (on purpose) with no scientific words:
an element of language A appears in an
utterance of language B
–
the “element” may be a “word”, an idiom, a
compound expression (possibly discontinuous), a
complete utterance; it may be a system morpheme
or a sequence of system morphemes;
–
even with no transfer of phonological matter, there
may be prosodical features, semantic values,
composition mechanisms... typical of B, used in A
The weight of descriptive frames
●
Do you want to call this phenomenon
borrowing or code-switching?
●
The question seems pointless, but choosing
either term has far-reaching implications for the
conceptualization of what is actually happening;
e.g. it implies different models:
–
different models of psycholinguistic processing
–
different models of (plurilingual) grammar
–
different models of language change mechanisms
The weight of descriptive frames
●
What defines the limit between borrowing and
code-switching? The size of the element? The
“degree of integration” into the target language?
What is it that defines a “degree of integration”?
Frequency? Diachronic depth? Phonological
integration?
●
There have been long debates between
specialists about what should be called “single-word code-switching” and what should be
called “nonce borrowings” (Winford, 2003)
A deliberately naive description
●
We do not know with certainty who is right and
who is wrong. We do not want to take sides.
●
The use of some terms implies the use of some
concepts; the use of some concepts implies
adhering to a model
●
There are some concepts that we do not wish
to adopt without further inquiry, because they
are subject to debate (e.g. matrix language)
●
Structuring empirical data with a priori concepts
the data is supposed to test would be illogical
Squaring the circle
●
We want to: note, and annotate, all the possibly
interesting language contact phenomena in a
corpus, in order to be able to analyze them
empirically
●
We do not want to: use concepts that
presuppose that these phenomena are already
defined
●
We need a new, multi-layer, annotation schema
Annotating plurilingualism
●
Why is it difficult?
●
Let’s take an example from CLAPOTY
(Léglise/Nelson (2008) : EDF corpus – Cayenne)
Assigning a language to a word
●
Several languages displayed
●
Some of them share part of their lexical stock
●
To the bilingual speakers, the question of
whether they are picking the French hier or the
Creole yèr does not arise
●
To the linguists, there might be criteria to
choose which language to assign a word to
(e.g. phonological), but none is 100% certain
Finding the border of segments
●
If a word, or sequence of words, belongs to the
shared lexical stock of languages A and B
(and hence may ambiguously be assigned to
either of them)
●
if there is a segment in language A before that
word or sequence of words, and a segment in
language B after it
●
then where should we draw the border (on the
syntagmatic axis) between A and B?
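The ambiguity described above can be made concrete with a small sketch. It assumes, hypothetically, that every word has already been given the set of languages it could belong to; the function name `candidate_borders` and the sample stretch are illustrative and are not part of the CLAPOTY tools.

```python
def candidate_borders(token_langs, lang_a, lang_b):
    """Return every position k such that tokens[:k] can all be read as
    lang_a and tokens[k:] can all be read as lang_b."""
    borders = []
    for k in range(len(token_langs) + 1):
        left_ok = all(lang_a in langs for langs in token_langs[:k])
        right_ok = all(lang_b in langs for langs in token_langs[k:])
        if left_ok and right_ok:
            borders.append(k)
    return borders


# Hypothetical three-word stretch: a French word, a word shared by French
# and Guianese Creole, then a Creole word.
stretch = [{"fra"}, {"fra", "gcr"}, {"gcr"}]
print(candidate_borders(stretch, "fra", "gcr"))  # [1, 2]
```

Two borders are equally defensible here, which is exactly why forcing a single one is an arbitrary decision.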
Floating segments
●
Deciding to force the assignment of some
words or segments to language A or B:
–
sometimes implies a near-random decision from the
annotator, which yields uncertain data (minor sin)
–
always erases the actual complexity of the
language contact situation (major sin)
⇒
Even choosing a transcription scheme is an
arbitrary choice that imposes a grid on reality!
●
Some segments simply “float” between
languages (Ledegen, 2012)
Floating segments
●
What we want to see is this:
Vaillant/Moustin (2007): Voyé kriyé doktè ban mwen
(Example displayed through the XSLT interface
developed for CLAPOTY)
Implementation
●
We want to be interoperable with other corpora
●
we want to be state-of-the-art with regard to:
–
character encoding
⇒ Unicode
–
language encoding
⇒ BCP-47 (built on ISO-639)
–
document markup
⇒ XML
–
text annotation
⇒ TEI
●
but we want our plurilingual segments
●
and we want our language contact phenomena
Beyond the TEI
●
The Text Encoding Initiative has planned a lot
of things, especially for corpora of oral
transcripts (TEI-P5 Guidelines, chap. 8)
… but it is rather basic about how to
describe linguistic heterogeneity:
“Words or phrases which are not in the main language
of the text should be tagged as such : ‘John eats a
<foreign xml:lang="fr">croissant</foreign> every
morning’.” (TEI-P5 Guidelines, p. 65)
●
That’s ½ page in 1600 pages.
What’s in a language tag?
●
BCP-47:
language tag (ISO-639)
+
optional variant tag (IANA Language subtags registry)
+
optional script tag (ISO-15924)
+
optional tag for geographic variant
(ISO-3166 country codes, or UN M49 zone codes)
●
Examples:
–
fra vs. fra-GF
–
spa vs. spa-419
–
djk-aluku vs. djk-ndyuka
What’s in a language tag?
●
ISO-639-3 also has tags for macrolanguages, if
needed (e.g. ara, que, zho)
●
There is an implicit hierarchy of specificity in
language identification (zho > cmn > cmn-TW),
to be used with caution
●
ISO-639-3 also has tags with “special
values”, including:
–
‘und’ : undetermined
–
‘zxx’ : no linguistic content
–
‘mul’ : multiple languages
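As a minimal sketch of how such tags can be decomposed and compared for specificity, the following Python splits a tag into language, script, region and variant subtags. It is a simplification written for illustration: it ignores extended language subtags, extensions and private-use subtags, and the names `LangTag`, `parse_tag`, `refines` and the one-entry `MACROLANGUAGE_OF` table are assumptions, not part of any CLAPOTY tool.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: Tuple[str, ...]


# Illustrative excerpt only: the full macrolanguage table is in ISO 639-3.
MACROLANGUAGE_OF = {"cmn": "zho"}


def parse_tag(tag: str) -> LangTag:
    """Split a tag such as 'fra-GF' or 'djk-aluku' into its subtags."""
    first, *rest = tag.split("-")
    script = region = None
    variants = []
    for sub in rest:
        if (script is None and region is None and not variants
                and re.fullmatch(r"[A-Za-z]{4}", sub)):
            script = sub.title()          # four letters: script (ISO 15924)
        elif (region is None and not variants
                and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub)):
            region = sub.upper()          # two letters or three digits: region
        else:
            variants.append(sub.lower())  # anything else: variant subtag
    return LangTag(first.lower(), script, region, tuple(variants))


def refines(specific: str, general: str) -> bool:
    """True if `specific` is at least as specific as `general`."""
    s, g = parse_tag(specific), parse_tag(general)
    same_lang = (s.language == g.language
                 or MACROLANGUAGE_OF.get(s.language) == g.language)
    return (same_lang
            and g.script in (None, s.script)
            and g.region in (None, s.region)
            and set(g.variants) <= set(s.variants))


print(parse_tag("spa-419"))       # region subtag is the UN M49 code 419
print(refines("cmn-TW", "zho"))   # True
print(refines("cmn", "cmn-TW"))   # False
```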
Beyond the TEI
●
We want to be TEI-conformant as much as
possible...
and to create our own extensions when needed
●
So we created an XML Schema Definition
(XSD) adapted to our needs: Corpus-Contacts
(XSDs are like DTDs, except that they also make it
possible to specify integrity constraints)
●
Essentially based on TEI-P5 Guidelines chap. 8
(Transcriptions of speech)
… plus some new element types
The Corpus-Contacts XSD
●
General structure:
–
The root element is a <corpus>
–
a <corpus> contains one <corpus_header>, then an
indefinite number (1..n) of texts (elements <text>)
–
a <text> contains one <text_header>, then an
indefinite number of events (elements <event>)
–
an <event> may be either a paraverbal element
(<incident>, <kinesic>, <vocal>), or a speech turn
–
a <speech_turn> consists of four tiers: transcription,
interlinear morphemic gloss, list of POS-tags, free
translation
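The nesting described above can be mirrored in a few Python data classes. This is only a sketch of the containment relations stated on the slide; the class and field names are hypothetical, headers are reduced to plain dictionaries, and the actual Corpus-Contacts schema may carry more information (speaker identifiers, for instance).

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Paraverbal:
    kind: str                 # "incident", "kinesic" or "vocal"
    description: str = ""


@dataclass
class SpeechTurn:
    transcription: str        # tier 1
    gloss: str                # tier 2: interlinear morphemic gloss
    pos_tags: str             # tier 3: list of POS-tags
    free_translation: str     # tier 4


Event = Union[Paraverbal, SpeechTurn]


@dataclass
class Text:
    header: dict
    events: List[Event] = field(default_factory=list)


@dataclass
class Corpus:
    header: dict
    texts: List[Text]

    def __post_init__(self):
        # The slide specifies 1..n texts per corpus.
        if not self.texts:
            raise ValueError("a corpus needs at least one text")


turn = SpeechTurn(
    transcription="piskè pou lenstan sé jounalis ki ni la",
    gloss="puisque pour l'instant être.COP journaliste REL;SBJ avoir là",
    pos_tags="",  # no POS tier is shown in the talk, so left empty here
    free_translation="puisque pour l'instant ce (ne) sont (que) des journalistes qui sont là,",
)
corpus = Corpus(header={}, texts=[Text(header={}, events=[Paraverbal("vocal"), turn])])
print(len(corpus.texts[0].events))
```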
The Corpus-Contacts XSD
●
Inside a transcription: TEI-P5 elements:
–
plain UTF-8 text (#PCDATA)
–
alignment tabs (to align with the IMG & POS-tags)
–
paraverbal events (incident, kinesic, vocal)
–
linguistic indications (shifts in pitch, tempo,
loudness, rhythm, tension, voice quality)
–
pauses
–
overlaps
–
incomplete forms
The Corpus-Contacts XSD
●
Specific Corpus-Contacts elements:
–
internal plurilingualism:
●
assignment to multiple languages
●
alternate transcriptions in multiple languages
–
remarkable phenomena
Multiple language assignment
●
The basic idea: when a segment is multilingual:
‒
continue using the basic xml:lang attribute
(backward compatibility)
‒
give it value “mul”
(ISO-639-3 special tag: Multiple languages)
‒
add a new element <langues> to give the list of
alternate languages it is “floating among”:
<langues>
<langue xml:lang="fra"/>
<langue xml:lang="acf"/>
</langues>
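A minimal sketch of how a program could read this convention, using Python's standard `xml.etree.ElementTree`. The function name `candidate_languages` and the sample segment (including the word it contains) are illustrative assumptions; only the `xml:lang="mul"` value and the `<langues>`/`<langue>` elements come from the slide.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SEGMENT = """<segment xml:lang="mul">
<langues>
<langue xml:lang="fra"/>
<langue xml:lang="acf"/>
</langues>
hier
</segment>"""


def candidate_languages(segment):
    """Languages a segment may belong to: the single xml:lang value, or,
    for 'mul', the list given in the <langues> child element."""
    lang = segment.get(XML_LANG)
    if lang != "mul":
        return [lang] if lang else []
    return [item.get(XML_LANG) for item in segment.findall("./langues/langue")]


print(candidate_languages(ET.fromstring(SEGMENT)))  # ['fra', 'acf']
```

A tool that only understands `xml:lang` still sees a valid value ("mul"), which is the backward compatibility mentioned above.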
Multiple language assignment (P)
●
What does “multilingual segment” mean?
●
(P) Paradigmatic interpretation: when the
segment, with similar phonetic forms in A and
B, does not give enough hints as to what
language it should be assigned to (A or B),
then tagging it “mul” means: this could be A and
this could also be B (I, linguist, don’t know);
or: this could be some linguistic item floating
between A and B in bilingual speech.
A and B are specified in the <langues> element
(Vaillant/Lengrai (2007) : Lignes de Vie)
<transcription lang="acf">
piskè <tab/>
<segment lang="mul">
<langues><langue lang="acf"/><langue lang="fra"/></langues>
<trans_alt lang="acf">pou</trans_alt> <trans_alt
lang="fra">pour</trans_alt> <tab/>
<trans_alt lang="acf">lenstan</trans_alt>
<trans_alt lang="fra">l'instant</trans_alt>
</segment> <tab/>
sé <tab/> jounalis <tab/> ki <tab/> ni <tab/> la
</transcription>
<traduction_juxtalineaire>
puisque <tab/> pour <tab/> l'instant <tab/> être.COP <tab/>
journaliste <tab/> REL;SBJ <tab/> avoir <tab/> là
</traduction_juxtalineaire>
<traduction_libre>
puisque pour l'instant ce (ne) sont (que) des journalistes
qui sont là,
</traduction_libre>
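To illustrate what the alternate transcriptions make possible, the sketch below rebuilds one full reading of the turn per candidate language: shared words go into every reading, and each `<trans_alt>` goes only into the reading of its own language. The function name `readings` is hypothetical; the element and attribute names are those of the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<transcription lang="acf">
piskè <tab/>
<segment lang="mul">
<langues><langue lang="acf"/><langue lang="fra"/></langues>
<trans_alt lang="acf">pou</trans_alt> <trans_alt lang="fra">pour</trans_alt> <tab/>
<trans_alt lang="acf">lenstan</trans_alt>
<trans_alt lang="fra">l'instant</trans_alt>
</segment> <tab/>
sé <tab/> jounalis <tab/> ki <tab/> ni <tab/> la
</transcription>"""


def readings(node, langs, out=None):
    """Collect, for each language, the word sequence of that reading."""
    if out is None:
        out = {lang: [] for lang in langs}

    def emit(text, only=None):
        for word in (text or "").split():
            for lang in langs:
                if only is None or lang == only:
                    out[lang].append(word)

    # Text directly inside a <trans_alt> belongs to one language only.
    emit(node.text, node.get("lang") if node.tag == "trans_alt" else None)
    for child in node:
        if child.tag != "langues":      # <langues> is metadata, not speech
            readings(child, langs, out)
        emit(child.tail)                # text after a child is shared
    return out


result = readings(ET.fromstring(SAMPLE), ["acf", "fra"])
print(" ".join(result["acf"]))  # piskè pou lenstan sé jounalis ki ni la
print(" ".join(result["fra"]))  # piskè pour l'instant sé jounalis ki ni la
```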
Multiple language assignment (S)
●
What does “multilingual segment” mean?
●
(S) Syntagmatic interpretation: when a
segment comprising multiple morphemes (in A
and in B, and/or in subsegments floating
between A and B) does not make it possible to
clearly identify a main language which governs
the construction of the syntagm,
then tagging it “mul” means: this segment is
internally multilingual (no “matrix language”).
A and B are specified in the <langues> element
(Léglise/Nelson (2008) : EDF corpus – Cayenne)
<transcription lang="mul">
<langues><langue lang="gcr"/><langue lang="fra"/><langue
lang="acf"/></langues>
<segment lang="fra">Ah <tab/> oui <tab/> mais <tab/> même
<tab/> si</segment> <tab/>
<segment lang="gcr">ou <tab/> ka <tab/> vin</segment> <tab/>
<segment lang="fra">tant</segment> <tab/>
<segment lang="gcr">ou <tab/> pa </segment><tab/>
<segment lang="acf">ni</segment> <tab/>
<segment lang="gcr">tout <tab/> papié <tab/> a</segment>
</transcription>
<traduction_juxtalineaire>
INTJ <tab/> oui <tab/> mais <tab/> même <tab/> si <tab/> 2SG
<tab/> IPFV <tab/> venir <tab/> tant <tab/> 2SG <tab/> NEG
<tab/> avoir <tab/> tout.QUANT <tab/> papier <tab/> DEF
</traduction_juxtalineaire>
<traduction_libre>
Ah oui, mais même si vous venez, tant que vous n'avez pas
tous les papiers …
</traduction_libre>
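Once a turn is segmented this way, the sequence of languages and the number of switches can be read off mechanically. The sketch below does so with the standard library; `language_runs` and `count_switches` are illustrative names, and the sample is a well-formed copy of the transcription tier shown above.

```python
import xml.etree.ElementTree as ET

TURN = """<transcription lang="mul">
<langues><langue lang="gcr"/><langue lang="fra"/><langue lang="acf"/></langues>
<segment lang="fra">Ah <tab/> oui <tab/> mais <tab/> même <tab/> si</segment> <tab/>
<segment lang="gcr">ou <tab/> ka <tab/> vin</segment> <tab/>
<segment lang="fra">tant</segment> <tab/>
<segment lang="gcr">ou <tab/> pa </segment><tab/>
<segment lang="acf">ni</segment> <tab/>
<segment lang="gcr">tout <tab/> papié <tab/> a</segment>
</transcription>"""


def language_runs(transcription):
    """List (language, words) for each top-level <segment>, in order."""
    return [(seg.get("lang"), " ".join(seg.itertext()).split())
            for seg in transcription.findall("segment")]


def count_switches(runs):
    """Number of points where two adjacent segments differ in language."""
    langs = [lang for lang, _ in runs]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


runs = language_runs(ET.fromstring(TURN))
print([lang for lang, _ in runs])  # ['fra', 'gcr', 'fra', 'gcr', 'acf', 'gcr']
print(count_switches(runs))        # 5
```

Nothing in this computation requires deciding which of the three languages is the "matrix" one.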
Multiple language assignment (PS)
●
Of course both interpretations are possible at
the same time (this is the case with a segment
fragmented in several subsegments which
themselves are “floating” units):
<transcription lang="mul">
<langues><langue lang="gcr"/><langue lang="fra"/></langues>
<segment lang="gcr">vini</segment>
<tab/>
<segment lang="mul"><langues><langue lang="gcr"/><langue
lang="fra"/></langues>
non</segment>
<tab/>
<segment lang="fra">bande <tab/> de <tab/> putes</segment>
</transcription>
<traduction_juxtalineaire>
venir <tab/> d'accord <tab/> bande <tab/> de.GEN <tab/>
pute.PL
</traduction_juxtalineaire>
<traduction_libre>venez ici, bande de putes</traduction_libre>
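The `<tab/>` elements are what keeps the tiers aligned, even when they sit inside nested segments. As a sketch of how a display or search tool might use them, the following splits each tier into cells at every `<tab/>` and pairs the cells up; the function name `cells` is hypothetical, and the data is the example above.

```python
import xml.etree.ElementTree as ET

TRANSCRIPTION = """<transcription lang="mul">
<langues><langue lang="gcr"/><langue lang="fra"/></langues>
<segment lang="gcr">vini</segment>
<tab/>
<segment lang="mul"><langues><langue lang="gcr"/><langue lang="fra"/></langues>
non</segment>
<tab/>
<segment lang="fra">bande <tab/> de <tab/> putes</segment>
</transcription>"""

GLOSS = """<traduction_juxtalineaire>
venir <tab/> d'accord <tab/> bande <tab/> de.GEN <tab/> pute.PL
</traduction_juxtalineaire>"""


def cells(tier):
    """Split a tier into cells at each <tab/>, at any nesting depth."""
    out = [""]

    def walk(node):
        if node.tag == "tab":
            out.append("")              # a tab opens a new cell
        elif node.tag != "langues":     # skip language metadata
            out[-1] += node.text or ""
            for child in node:
                walk(child)
                out[-1] += child.tail or ""

    walk(tier)
    return [" ".join(cell.split()) for cell in out]


aligned = list(zip(cells(ET.fromstring(TRANSCRIPTION)), cells(ET.fromstring(GLOSS))))
for word, gloss in aligned:
    print(f"{word}\t{gloss}")
```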
Remarkable phenomena
●
Another purposefully naive description: we
refuse to use predefined categories of language
contact phenomena
●
Remarkable phenomenon: “something is worth
noting here”
●
Just a generic frame to annotate everything
worth analyzing
Remarkable phenomena
●
We use the generic element
<passage_remarquable> (remarkable passage)
to signal the occurrence of a remarkable
phenomenon somewhere in a text in the corpus
●
Every “remarkable passage” has an XML ID tag
●
In the database, remarkable passages (tokens)
are linked to remarkable phenomena (types)
●
An indefinite number (1..n) of remarkable
passages may be linked to a single remarkable
phenomenon
Remarkable phenomena
●
In the database, there is a hierarchy of
remarkable phenomena
●
The predefined description levels are not linked
to a theoretical model of language contact, but
are data-oriented: they specify (I) which layer of
language processing is involved; (II) which type
of syntagm is affected
●
The last description level is meant to be created
and maintained in a bottom-up process by
the linguists who use the database
Meta-categories of R.Ph.
●
First level of the hierarchy: three main meta-categories: PREMS, PRINT and PREDISC
–
PREMS:
Phénomènes REmarquables Morpho Syntaxiques
(remarkable morphosyntactic phenomena)
–
PRINT:
Phénomènes Remarquables INTeractionnels
(remarkable interactional phenomena)
–
PREDISC:
Phénomènes REmarquables DISCursifs
(remarkable discursive phenomena)
PREMS
●
Morphosyntactically remarkable phenomena
●
Tactical subtypes: defined by the position of the
remarkable phenomenon ([]) in the chain of
alternating language segments (<A><B>)
●
Symbolic notation for the four tactical subtypes:
[<>] : the presence of a segment of B in A is remarkable
[<><>] : the sequence of two segments in languages A and B
is remarkable
<[><]> : the switch between A and B is remarkable
<[]> : something inside language A is remarkable
PREMS
●
Morphosyntactically remarkable phenomena
●
Subcategories under the major tactical
subtypes: defined by the type of syntagm
affected:
–
PREMS-GV: in the Verb Phrase
–
PREMS-GN: in the Noun Phrase
●
PREMS-GN-Det : concerning determination in the NP
●
PREMS-GN-Poss : concerning the expression of
possession in the NP
–
etc.
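The link between remarkable passages (tokens) and the hierarchy of remarkable phenomena (types) can be sketched as two relational tables. The schema below is an assumption made for illustration, not the actual CLAPOTY database: table and column names and the passage IDs are invented, and the tactical-subtype level is left out for brevity. It uses Python's built-in `sqlite3` and a recursive query to fetch every passage filed under a category or any of its subcategories.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE phenomenon (
    id        INTEGER PRIMARY KEY,
    label     TEXT UNIQUE NOT NULL,
    parent_id INTEGER REFERENCES phenomenon(id)   -- NULL for a meta-category
);
CREATE TABLE passage (
    xml_id        TEXT PRIMARY KEY,                -- XML ID of the passage
    phenomenon_id INTEGER NOT NULL REFERENCES phenomenon(id)
);
INSERT INTO phenomenon VALUES
    (1, 'PREMS', NULL),
    (2, 'PREMS-GN', 1),
    (3, 'PREMS-GN-Det', 2),
    (4, 'PREMS-GV', 1);
-- Hypothetical passage IDs: several tokens may share one type.
INSERT INTO passage VALUES ('pr001', 3), ('pr002', 3), ('pr003', 4);
""")


def passages_under(label):
    """IDs of all passages linked to `label` or to any of its descendants."""
    rows = con.execute("""
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM phenomenon WHERE label = ?
            UNION ALL
            SELECT p.id FROM phenomenon p JOIN sub ON p.parent_id = sub.id
        )
        SELECT passage.xml_id
        FROM passage JOIN sub ON passage.phenomenon_id = sub.id
        ORDER BY passage.xml_id
    """, (label,)).fetchall()
    return [row[0] for row in rows]


print(passages_under("PREMS-GN"))  # ['pr001', 'pr002']
print(passages_under("PREMS"))     # ['pr001', 'pr002', 'pr003']
```

Because the lowest level is just more rows in the same table, new subcategories can be added bottom-up without changing the schema.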
PRINT
●
Concerns the analysis of the alternation of
languages w.r.t. speakers during the interaction
(Auer, 1995)
●
A preliminary automatic annotation “à la Auer”
(Language A [Language B] – Speaker 1) is
automatically computed by the XSLT processor
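The exact output of the CLAPOTY stylesheet is not shown here, so the following is only a guess at the kind of computation involved, written in Python rather than XSLT. It assumes each speech turn is given as a speaker identifier plus the ordered list of its segment languages, and produces a compact label: a letter for the first language of the turn, bracketed letters for other languages appearing in it, and a speaker number. The function name `auer_labels` and the label format are assumptions.

```python
def auer_labels(turns):
    """Label each turn as <first language>[<other languages>]<speaker number>.

    Languages get letters (A, B, ...) and speakers get numbers (1, 2, ...)
    in order of first appearance. Each turn must have at least one segment.
    """
    lang_letter, speaker_number, labels = {}, {}, []
    for speaker, segment_langs in turns:
        number = speaker_number.setdefault(speaker, len(speaker_number) + 1)
        letters = []
        for lang in segment_langs:
            letter = lang_letter.setdefault(lang, chr(ord("A") + len(lang_letter)))
            if letter not in letters:
                letters.append(letter)
        embedded = "".join(f"[{letter}]" for letter in letters[1:])
        labels.append(f"{letters[0]}{embedded}{number}")
    return labels


# Hypothetical three-turn exchange between two speakers.
turns = [
    ("s1", ["fra", "gcr", "fra"]),
    ("s2", ["gcr"]),
    ("s1", ["fra"]),
]
print(auer_labels(turns))  # ['A[B]1', 'B2', 'A1']
```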
PREDISC
●
Concerns the impact of plurilingualism on
discourse cohesion and articulation
●
e.g. discourse connectors imported from
another language in situations of cultural
pressure
CLAPOTY Resource Set
●
The XSD Document Schema Corpus-Contacts
●
A specific config file for the open-source Java-based JAXE XML editor
●
An XSLT transform sheet allowing any standard
XSLT-1.0 conformant browser to display the
corpora as a sequence of aligned utterances
●
A relational (SQL) database to store
sociolinguistic information on corpora,
speakers, languages
●
A concordancer to search for patterns
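The talk does not describe how the CLAPOTY concordancer works, so the following is a generic keyword-in-context sketch rather than a description of that tool. It matches a regular expression against whole words and returns each hit with a few words of context on either side; the function name `kwic` and the `width` parameter are illustrative.

```python
import re


def kwic(tokens, pattern, width=3):
    """Keyword-in-context: (left context, keyword, right context) per hit."""
    rx = re.compile(pattern)
    hits = []
    for i, token in enumerate(tokens):
        if rx.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Words of the EDF corpus example quoted earlier in the talk.
tokens = "Ah oui mais même si ou ka vin tant ou pa ni tout papié a".split()
for left, keyword, right in kwic(tokens, "ou", width=2):
    print(f"{left:>12} [{keyword}] {right}")
```

A corpus-aware version would search over language-tagged segments instead of bare words, so that a pattern could also constrain the language on each side of a switch.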
Concluding Remarks
●
Relevance to Network-Mediated
Communication?
Plurilingualism in written form
●
Language mixing is not limited to oral speech
●
Oldest written testimony in transcripts from
Martin Luther:
si enim hoc verum esset, so schiss ich dem pabst auf
die kron.
(example from Stolt 1964, quoted in Auer &
Muhamedova 2005)
●
“Oralized” writing
Plurilingualism in written form
●
With instant messaging systems (IM, SMS...)
and more generally CMC, there is a wealth of
new types of communication which:
–
are written;
–
are neither oral transcripts nor oralized writing;
–
yet differ from what used to be considered written
language in many parameters.
●
Some of these new forms of communication
exhibit internal language mixing.
Language mixing on social networks
Language mixing in UGC (forums)
Language mixing in SMS
●
“Ok pour le pot ! Suis 3 les 2 3 et 4as”
(A friend of mine, p.c.)
●
Cf. Simone Ueberwasser’s talk about
sms4science.ch
Open Questions
●
In Network-Mediated Communication:
●
The issue of plurality of languages exists
●
It interacts with other issues:
–
plurality of writing systems and encodings
–
plurality of writing standards
–
plurality of genres and genre-specific varieties
–
variable levels of conformance to writing standards
(at the speech community level, age/occupation
group level, user community level, individual level)
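As a rough illustration of how plurality of writing systems shows up in practice, the sketch below splits a message into runs of characters from the same script. It relies on a crude heuristic (the first word of each character's Unicode name, via the standard `unicodedata` module), which is an assumption of this example and not a method from the project; the function names and the sample string are hypothetical.

```python
import unicodedata

def script_of(ch):
    """Rough script label for a letter: first word of its Unicode name.

    Returns None for non-letters (spaces, digits, punctuation).
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(text):
    """Split text into (script, substring) runs.

    Non-letters are attached to the run in progress, so the
    concatenation of all substrings gives back the original text.
    """
    runs = []  # each item is [label, list of characters]
    for ch in text:
        label = script_of(ch)
        if not runs:
            runs.append([label, [ch]])
        elif label is None or label == runs[-1][0]:
            runs[-1][1].append(ch)
        elif runs[-1][0] is None:
            # The run so far held only non-letters: adopt this script.
            runs[-1][0] = label
            runs[-1][1].append(ch)
        else:
            runs.append([label, [ch]])
    return [(label, "".join(chars)) for label, chars in runs]

if __name__ == "__main__":
    # Hypothetical message mixing Latin and Cyrillic script.
    for script, chunk in script_runs("ok, привет! see you завтра"):
        print(script, repr(chunk))
```

Script runs are only a proxy: two languages often share one script (French and English), and one language may be written in several (Arabic in Arabic script or in Latin transliteration), so this does not replace language annotation.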
References
●
A. Bentahila & E.E. Davies, “The syntax of Arabic-French
code-switching”. Lingua 59 (1983), p. 301-330.
●
C. Ferguson, “Diglossia”. Word 15 (1959), p. 325-340.
●
S. Thomason & T. Kaufman, Language Contact, Creolization,
and Genetic Linguistics. University of California Press (1988).
●
D. Sankoff & S. Poplack, “A formal grammar for
code-switching”. Papers in Linguistics 14 (1981), p. 3-46.
●
A. Joshi, “Processing of sentences with intra-sentential
code-switching”. COLING 1982 (Prague).
●
E. Woolford, “Code-switching and syntactic theory”. Linguistic
Inquiry 14 (3) (1983), p. 520-536.
●
A.-M. Di Sciullo, P. Muysken & R. Singh, “Government and
code-mixing”. Journal of Linguistics 22 (1986), p. 1-24.
●
C. Myers-Scotton & S. Azuma, “A frame-based process model
of codeswitching”. 26th annual regional meeting of the Chicago
Linguistic Society (1990).
●
C. Myers-Scotton, Duelling Languages: Grammatical Structure
in Codeswitching. Oxford University Press (1993).
●
A. Peyraube, “L’évolution des structures grammaticales”.
Langages 146 (2002), p. 46-58.
●
B. Heine & T. Kuteva, Language Contact and Grammatical
Change. Cambridge University Press (2005).
●
B. Heine & T. Kuteva, The Genesis of Grammar. Oxford
University Press (2007).
●
A. Aikhenvald & R. M. W. Dixon (eds.), Grammars in Contact: A
Cross-Linguistic Typology. Oxford University Press (2006).
●
I. Léglise & S. Alby, “Les corpus plurilingues, entre linguistique
de corpus et linguistique de contact”. Faits de langues 41
(2013), p. 95-122.
●
P. Vaillant & I. Léglise, “À la croisée des langues : Annotation et
fouille de corpus plurilingues”. RNTI SHS 2 (2014), p. 81-100.
●
P. Auer, “From code-switching via language mixing to fused
lects: Toward a dynamic typology of bilingual speech”.
International Journal of Bilingualism 3 (4) (1999), p. 309-332.
●
D. Winford, An Introduction to Contact Linguistics. Blackwell
(2003).
●
G. Ledegen, “Prédicats ‘flottants’ entre le créole acrolectal et le
français à la Réunion : Exploration d’une zone ambigüe”. In C.
Chamoreau & L. Goury (eds.): Changements linguistiques et
langues en contact : Approches plurielles du domaine prédicatif.
CNRS Éditions (2012), p. 251-270.
●
P. Auer, “The pragmatics of code-switching: a sequential
approach”. In L. Milroy & P. Muysken (eds.): One Speaker, Two
Languages: Cross-Disciplinary Perspectives on Code-Switching.
Cambridge University Press (1995), p. 115-135.
●
B. Stolt, Die Sprachmischung in Luthers Tischreden.
Stockholm: Almqvist & Wiksell (1964).
●
P. Auer & R. Muhamedova, “‘Embedded language’ and ‘matrix
language’ in insertional language mixing: Some problematic
cases”. Rivista di Linguistica 17.1 (2005), p. 35-54.