
Master’s internships 2022/2023

A limited number of positions for a Master's internship are available this year. These
internships should last A MINIMUM OF 5-6 MONTHS. Please send your application by
email (CV + cover letter required) to syntheticlearner@gmail.com (unless another email is specified).
Indicate in the email which topic(s) you are interested in.

Subject 1: Emergent Communication with graphs (Mathieu)

Subject 2: Supervised learning at the word level (Robin)

Subject 3: Phylogeny of communication

Subject 4: Prospects in NeuroAI, aligning Brains and Nets

Subject 5: Internship: NLP, cognitive science, or psychology Profile

Subject 6: Spoken Language Modeling with Soft Speech Units (TuAnh)

Subject 7: Multilevel U-statistics with applications to the evaluation of representation learning algorithms (Thomas)

Subject 1: Emergent Communication with graphs (Mathieu)

Emergent communication (EC) aims at simulating the emergence of a communication protocol
among a community of artificial agents. From a scientific perspective, such simulations aim at
identifying the cognitive, environmental and social pressures that shape languages during their
emergence and evolution. From an applied perspective, findings from cognitive science could
help improve current neural network models.

So far, EC experiments have mainly implemented the task of communicating about very
simple hand-designed objects (sequences of one-hot vectors) or about images. Hand-designed
objects have the advantage of being easily controllable but are unfortunately not very complex;
conversely, images are complex but hard to control.

The aim of the internship would be to study another set of objects - graphs - that could be both
controllable and increasingly complex. The role of the intern will be to:

- Identify different classes of graphs of increasing complexity and create simulated datasets. The goal would be to find a controllable continuum of objects ranging from sequences of one-hot vectors to images.
- Benchmark different ways of encoding graphs as input vectors for neural nets (e.g. GNN embeddings; see the sketch after this list)
- Design a communication game with graphs (starting from an existing framework)
- Identify the theoretical solutions of the graph communication game (i.e. the graph linearization problem)
- Compare the emergent protocols to the theoretical solutions and to real-life graph linearization systems (language, DNA, information-theoretic encodings, etc.)
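
As one possible starting point for the graph-encoding benchmark above, here is a minimal sketch of a single graph-convolution layer in the spirit of Kipf et al. 2017, written in plain PyTorch (module and variable names are illustrative, not taken from an existing codebase):

import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, adj, feats):
        # adj: (n, n) adjacency matrix; feats: (n, in_dim) node features
        n = adj.size(0)
        a_hat = adj + torch.eye(n)                  # add self-loops
        deg = a_hat.sum(dim=1)                      # node degrees
        d_inv_sqrt = torch.diag(deg.pow(-0.5))      # D^-1/2
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalization
        return torch.relu(norm_adj @ self.linear(feats))

# A whole-graph embedding can then be obtained by pooling node features, e.g.:
# graph_vec = SimpleGCNLayer(16, 32)(adj, feats).mean(dim=0)

Libraries such as PyTorch Geometric provide optimized versions of such layers; the point of the sketch is only to make the input and output shapes of a graph encoder concrete.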

The intern should have a good knowledge of RL and DL (for the simulations), basic knowledge of
information theory and optimization (for the theoretical part), and an interest in cognitive
science (for the lectures and the grounding of the project).

Junior supervision: Mathieu Rita


Senior supervision: Emmanuel Dupoux

References

Emergent communication in the lab

- Emergent Communication: Generalization and Overfitting in Lewis Games. M. Rita, C. Tallec, P. Michel, J.B. Grill, O. Pietquin, E. Dupoux, F. Strub. 36th Conference on Neural Information Processing Systems (NeurIPS) 2022 - https://arxiv.org/pdf/2209.15342
- On the role of population heterogeneity in emergent communication. M. Rita, F. Strub, J.B. Grill, O. Pietquin, E. Dupoux. 10th International Conference on Learning Representations (ICLR) 2022 - https://arxiv.org/pdf/2204.12982

- "LazImpa": Lazy and Impatient neural agents learn to communicate efficiently. M. Rita, R. Chaabouni, E. Dupoux. In Proceedings of CoNLL 2020 - https://arxiv.org/pdf/2010.01878

Graph and emergent communication


- Slowik et al. 2020 LINK

Graphs and deep learning


- Kipf et al. 2017 https://arxiv.org/pdf/1609.02907.pdf

Graph & Cognitive science


- Levelt 1981 - The speaker’s linearization problem LINK

Subject 2: Supervised learning at the word level (Robin)

The candidate must be doing an end-of-study internship (not a first-year Master's internship) in
machine learning or computer science.

Supervision in machine learning is a paradigm that requires labelled datasets, which are often the
result of substantial human time and effort. For that reason, unsupervised or self-supervised
methods are becoming increasingly common in several areas of machine learning: vision, text, and
very recently speech. For instance, contrastive predictive coding or wav2vec 2.0 have been
used to discover speech representations without supervision [1,2]. These models can embed
fixed-duration chunks of speech (usually 10 ms) into a vector but cannot represent variable-length
speech sequences. Such variable-length representations can be very useful in a variety of tasks ranging
from information retrieval to the segmentation of speech into words (i.e., can you find word boundaries
in an audio recording without labels or prior knowledge of the language?).
The aim of this internship is to search for new and more robust methods to build variable-length
speech embeddings. You will implement new loss functions and regularization schemes in a
pre-existing deep learning model to improve its performance. The resulting model will be used
as input to a speech segmentation model and will hopefully improve the current state of the art in
that domain.
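
To make "new loss functions" concrete, here is a minimal sketch of a generic InfoNCE-style contrastive objective of the kind introduced in [1]; it is an illustration only, not the lab's actual codebase, and the tensor names and shapes are assumptions:

import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic contrastive (InfoNCE-style) loss.

    anchors, positives: (batch, dim) embeddings. For each anchor, the
    positive at the same batch index is the target, and the other
    positives in the batch serve as negatives.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(anchors.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example use: pool frame-level features into candidate word-level embeddings,
# then pull two views of the same segment together:
# loss = info_nce_loss(segment_emb_view1, segment_emb_view2)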

[1] Aaron van den Oord, Yazhe Li, Oriol Vinyals (2019). Representation Learning with Contrastive
Predictive Coding https://arxiv.org/pdf/1807.03748.pdf
[2] https://arxiv.org/abs/2006.11477
[3] https://arxiv.org/abs/2007.13542

Subject 3: Phylogeny of communication
(with Emmanuel Chemla ENS LSCP and Robin Ryder Paris Dauphine)

There exist large databases of animal calls and shrieks. We have consolidated our own
database of primate calls with their sounds and meanings, and there are many such databases
for birds. With modern tools, we can mine these databases to make inferences about how the
communication systems of these species have evolved: what led to the formation of the first
communicative sounds, what did these sounds sound like and what did they mean (based on
what their modern "descendants" mean), how were they passed on or lost from one generation to
the next, etc. Standard historical linguistics can trace and reconstruct the history of words or
language sounds across periods that span a thousand years; the current project aims at doing
so across millions of years.

Machine learning tools, whether related to speech (to encode the animal sounds into manipulable
objects) or to reconstruction inference, are key to addressing these questions at a
large scale. The goal of this project will be to exhibit the most likely history of animal calls, with
their meanings and their sounds. The tasks will be (i) to improve on our encoding of the animal
sound signals into manipulable objects for inference and reconstruction (low-dimensional
vectors) and/or (ii) to perform inferences on the resulting objects. The project will lead to the
generation of likely animal sounds from millions of years ago.
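
As a minimal sketch of task (i), one simple way to turn a recording into a low-dimensional vector is a mel-spectrogram followed by dimensionality reduction; this is a generic illustration under assumed file names, and the project may well use learned speech representations instead:

import librosa
import numpy as np
from sklearn.decomposition import PCA

def encode_call(path, n_mels=64):
    """Encode one audio file as a fixed-size vector of time-averaged mel-band energies."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel.mean(axis=1)  # (n_mels,) vector, one per call

# Stack one vector per call, then reduce to a few dimensions for phylogenetic inference:
# X = np.stack([encode_call(p) for p in call_paths])
# X_low = PCA(n_components=8).fit_transform(X)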

Example of an application of speech technology to animal sounds: doi


Example of a relevant database (here entirely public): https://www.xeno-canto.org
General introduction to phylogenetic work (a great read): https://lukejharmon.github.io/pcm/

Subject 4: Prospects in NeuroAI, aligning Brains and Nets
Modern artificial neural networks are now very good at human tasks, to the point that some
researchers believe that artificial networks may become good models of how actual brains work.
As an illustration, Eickenberg et al. (2017) showed that during an object recognition task, brain
activations collected through fMRI and artificial neural network activations are well aligned.
Technically, this means that brain fMRI data are well predicted from the corresponding artificial
neural network activations. Similar results have been obtained in other domains, such as speech
and music perception (Kell et al., 2018) and semantic and syntactic processing (Caucheteux et al.,
2021; Pasquiou et al., 2022), and alignment with brain data has even been used to improve AI
models (Toneva et al., 2019).
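
To make "predicting fMRI data from network activations" concrete, here is a minimal sketch of a standard linear encoding-model analysis with ridge regression; it is a generic recipe with made-up array shapes, not the specific pipeline used in the papers cited here:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# net_acts: (n_stimuli, n_units) network activations for each stimulus
# fmri:     (n_stimuli, n_voxels) corresponding fMRI responses
net_acts = np.random.randn(200, 512)
fmri = np.random.randn(200, 1000)

X_tr, X_te, Y_tr, Y_te = train_test_split(net_acts, fmri, test_size=0.2, random_state=0)
model = Ridge(alpha=10.0).fit(X_tr, Y_tr)  # one linear map from units to each voxel
pred = model.predict(X_te)

# Alignment score: per-voxel correlation between predicted and measured responses
scores = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
print("mean voxel-wise correlation:", np.mean(scores))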

However, the architecture, dynamics and training procedures of artificial neural networks can
often be very different from those of the human brain. Given that, what are the limits of this
entire approach in its ability to shed light on brain function or to improve AI? In this internship,
we will study such theoretical limitations of the new field of neuroAI. For this, we will conduct
simulations of neural networks with varying degrees of biological plausibility and study to what
extent the representations emerging in these models can be aligned using standard methods.
Based on that, we will develop new machine-learning methods to improve brain-AI alignment.

Senior Supervision: Yair Lakretz (neuroscience), Emmanuel Chemla (cognitive science)
Junior Supervision/Peer: Nur Lan (PhD candidate, machine learning and cognitive science)

Contact all of us with a short CV at: yair.lakretz@gmail.com, emmanuel.chemla@ens.psl.eu,
nurxlan@gmail.com

This will be a one-semester internship, hosted at the LSCP lab at École Normale Supérieure.
Contingent on funding, the internship may be followed by a PhD.

Prerequisites. A good understanding of basic concepts in machine learning and deep learning
and a strong mastery of Python are essential for the project and will help you advance fast.
Familiarity with cloud computing is recommended, as is the ability to understand and use
existing code and to write new code to design, train and analyze neural networks with, e.g.,
PyTorch. A math/stats background and familiarity with brain imaging are a bonus. Interest in
neuroscience and artificial intelligence goes without saying!

References:
Caucheteux, C., Gramfort, A., & King, J. R. (2021). Disentangling syntax and semantics in the brain with deep
networks. ICML.

Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network layers map the
function of the human visual system. NeuroImage.
Kell, A. J., Yamins, D. L., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural
network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy.
Neuron.
Pasquiou, A., Lakretz, Y., Hale, J., Thirion, B., & Pallier, C. (2022). Neural Language Models are not Born Equal to Fit
Brain Data, but Training Helps. ICML 2022.
Toneva, M., & Wehbe, L. (2019). Interpreting and improving natural-language processing (in machines) with natural
language-processing (in the brain). Advances in Neural Information Processing Systems, 32.

Subject 5: Internship: NLP, cognitive science, or psychology Profile

Description of the team:


The aim of our LAAC (Language Acquisition Across Cultures) team is to shed light on the
mechanisms and processes involved in early language acquisition in a variety of cultures and
language communities. To this end, we use an interdisciplinary approach (ranging from
computational modeling to laboratory experiments and advanced data analysis) in the context of
open, collaborative and publicly engaged science.

Internship description:
Most research on early language acquisition has documented input and learning cues in only a
handful of cultures, the assumption being that the mechanisms postulated to explain acquisition
in these cultures are universal. However, there is too little research on some languages.
The team now has data from many languages, including some that are uncommonly studied
(such as Tsimane', Yélî Dnye, and many others). We now want to analyze these corpora:
- systematize and clean the data
- for corpora of texts, generate orthographic and phonological dictionaries
- for corpora of speech and texts, generate structured alignments
- where possible, generate tests at different linguistic levels, as in the ZR Speech Challenge
- create an Android keyboard for low-resource languages
- analyze data from our citizen science project on Zooniverse
- create (semi-)supervised classifiers (RNN, CNN, Transformer) to describe children's language development
- apply Whisper and other automatic speech processing tools to child data (see the sketch after this list)
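
As a minimal illustration of the last item, transcribing one recording with the open-source whisper package (a generic usage sketch; the file name and model size are placeholders, and real long-form child-centered recordings would need chunking and quality checks):

import whisper

# Load a pretrained model (sizes range from "tiny" to "large")
model = whisper.load_model("base")

# Transcribe one audio file; the language can be forced when it is known
result = model.transcribe("recording.wav", language="en")
print(result["text"])

# Segment-level timestamps are also returned, which helps with alignment
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])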

Internship’s objectives:
- Develop speech and text tools for low-resource languages
- Join an interdisciplinary team
- Learn about open and cumulative science (a response to the replication crisis)
- Experience life in the Lab
- Exposure to research methods in experimental psychology and language science

Job location: LSCP - 29, rue d’Ulm - 75005 Paris; teleworking possible

Required profile:
We are looking for full-time interns, minimum 2 months, with the following profile:
❏ Organized, autonomous, rigorous;
❏ Knowledge of Python;
❏ Bachelor's or Master's degree in computational linguistics, data science,
cognitive science, psychology, etc.
Application:
Submit a motivation letter and a CV to laac.lscp@gmail.com.

Subject 6: Spoken Language Modeling with Soft Speech Units
(TuAnh)

Pre-training language models (BERT, GPT) on large-scale text data has achieved tremendous
success and has become a standard in Natural Language Processing (NLP). Lately, language
models have also been successfully applied to other modalities such as music (Jukebox) or
images (Parti). For speech, several works have introduced the task of spoken language
modeling: learning a language from raw audio without any text labels, in an unsupervised way
(gSLM [1, 2]). These works rely on transforming the audio into a sequence of discrete speech
units and training a language model on these speech units.

Recently, [3] investigated the importance of these discrete speech units in training spoken
language models and found that discrete units are indeed important for spoken language
modeling, as they disentangle linguistic information from speaker information in the audio.
In a similar vein, [4] analyzed the use of soft speech units for the speech synthesis task and
found that soft speech units contain more information than "one-hot" speech units and therefore
provide better speech synthesis. The objective of this internship is thus to study the effect of soft
speech units on the spoken language modeling task.
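
To illustrate the difference between the two kinds of units, here is a minimal sketch using k-means over frame-level speech features; the feature array and its dimensions are placeholders, and the actual systems derive units from pretrained speech encoders rather than random features:

import numpy as np
from sklearn.cluster import KMeans
from scipy.special import softmax

# feats: (n_frames, dim) continuous frame-level speech representations
feats = np.random.randn(1000, 256)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)

# Discrete ("one-hot") units: one cluster index per frame
discrete_units = kmeans.predict(feats)  # (n_frames,)

# Soft units: a distribution over clusters per frame, keeping more information
dists = kmeans.transform(feats)       # distances to each centroid
soft_units = softmax(-dists, axis=1)  # (n_frames, n_clusters)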

The intern should have good deep learning skills, and an interest in NLP.

Supervisor(s): Tu Anh Nguyen, Emmanuel Dupoux, Benoît Sagot

Contact(s): nguyentuanh208@gmail.com (in addition to syntheticlearner@gmail.com)

References

[1] Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei
Baevski, Ewan Dunbar, Emmanuel Dupoux. The Zero Resource Speech Benchmark 2021: Metrics and
baselines for unsupervised spoken language modeling.

[2] Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu Anh
Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux. On Generative
Spoken Language Modeling from Raw Audio.

[3] Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux. Are discrete units necessary for Spoken Language
Modeling?

[4] Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman
Kamper. A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion.

Subject 7: Multilevel U-statistics with applications to the evaluation
of representation learning algorithms (Thomas)
Level: M1 or M2 internship

Advisor: Thomas Schatz (MCF, QARMA team in Marseille)

Localisation: preferably Marseille, but remote work is negotiable

Duration: 3 to 6 months

Funding: available, including for presenting the internship's results at an international conference

Topic:

The overall objective of the internship will be to develop a generalization of U-statistics to
multilevel samples with a dependency structure described by a directed graph. This is motivated
by applications to the evaluation of representation learning algorithms in the context of
applications in artificial intelligence and cognitive (neuro)science. The work will include
establishing classical optimality results for the generalized notion of U-statistics, devising
practical point and interval estimators and statistical tests in the context of the application of
interest, and implementing these in a convenient and easy-to-use software package.
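
For reference, the classical single-level object to be generalized is the following: given an i.i.d. sample X_1, ..., X_n and a symmetric kernel h of order m, the U-statistic is

U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \dots < i_m \le n} h(X_{i_1}, \dots, X_{i_m}),

which is the minimum-variance unbiased estimator of \theta = \mathbb{E}[h(X_1, \dots, X_m)]. The internship asks what the analogue of U_n, and of such optimality guarantees, looks like when the observations are organized in a multilevel sample whose dependency structure is described by a directed graph.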

Contact thomas.schatz@univ-amu.fr if you are interested.

