Brain & Language 251 (2024) 105392


Contribution of the language network to the comprehension of Python programming code

Yun-Fei Liu (劉耘非) a,*, Colin Wilson b, Marina Bedny a

a Department of Psychological and Brain Sciences, Johns Hopkins University, 232 Ames Hall, 3400 N. Charles Street, Baltimore, MD 21218, USA
b Department of Cognitive Science, Johns Hopkins University, 237 Krieger Hall, 3400 N. Charles Street, Baltimore, MD 21218, USA

A B S T R A C T

Does the perisylvian language network contribute to comprehension of programming languages, like Python? Univariate neuroimaging studies find high responses to code in fronto-parietal executive areas but not in fronto-temporal language areas, suggesting the language network does little. We used multivariate pattern analysis to test whether the language network encodes Python functions. Python programmers read functions while undergoing fMRI. A linear SVM decoded for-loops from if-conditionals based on activity in lateral temporal (LT) language cortex. In searchlight analysis, decoding accuracy was higher in LT language cortex than anywhere else. Follow-up analysis showed that decoding was not driven by the presence of different words across functions (“for” vs “if”) but by compositional program properties. Finally, univariate responses to code peaked earlier in LT language cortex than in the fronto-parietal network. We propose that the language system forms initial “surface meaning” representations of programs, which are then input to the reasoning network for processing of algorithms.

1. Introduction

The invention of computer programming and its applications (e.g., artificial intelligence) have altered human society and are fast becoming a central aspect of employment and education in the modern world. However, the cognitive and neural mechanisms that enable the human brain to support this important cultural skill are still poorly understood. A key outstanding question is the degree to which the neurocognitive system that supports natural language processing is involved in understanding and producing programming code (Fedorenko, Ivanova, Dhamala, & Bers, 2019; Ivanova et al., 2020; Liu, Kim, Wilson, & Bedny, 2020; Peitek et al., 2018; Prat, Madhyastha, Mottarella, & Kuo, 2020; Siegmund et al., 2014).

Prima facie support for the idea that the language system is “recycled” for code comprehension comes from the fact that programming languages borrow some elements of natural language. Words like “for”, “if”, “and”, “or”, and “return” are used almost universally across programming languages, and their meanings are partly preserved. Even opaque and older function names (e.g., “chmod”, “mkdir” in bash scripts) are abbreviations of English words rather than arbitrary letter combinations. The syntax of natural and programming languages also shares features, such as hierarchical structure and recursion (Fitch, Hauser, & Chomsky, 2005). Natural languages are recursive because a phrase can be embedded within another phrase of the same syntactic category (Friederici, Chomsky, Berwick, Moro, & Bolhuis, 2017; Hauser, Chomsky, & Fitch, 2002; Yang, Crain, Berwick, Chomsky, & Bolhuis, 2017). Programming languages contain data structures, such as lists and trees, that can be recursive in the same way; furthermore, functions are allowed to call other functions, including themselves, as subroutines. Reading code can be thought of as partly analogous to reading natural language, where progressively more abstract and larger structures (e.g., functions) are constructed from lawful combinations of discrete symbols at lower levels of representation (i.e., letters, words) (Fedorenko et al., 2019). These parallels predict overlap in the neural representation of natural and programming languages (Fedorenko et al., 2019; Fitch et al., 2005; Pandža, 2016; Peitek et al., 2018; Portnoff, 2018; Prat et al., 2020).

On the other hand, there are key differences between natural and programming languages. While English word forms do appear in computer code, the meanings of these symbols are not exactly the same as in natural languages. More generally, programming languages lack symbols that have the rich semantics of lexical items (e.g., “dog” or “walk”). Grammatical categories such as nouns, verbs, adjectives, and prepositions do not have clear counterparts in programming languages (e.g., an object in code can have both verb-like and adjective-like attributes). Furthermore, the rules for combining basic units are distinct: while human natural and programming languages both have function-argument structure, scope, and variable binding at the semantic level, only human natural languages have grammatical relations such as subject and direct object or information structure such as topic and

* Corresponding author.
E-mail address: yliu291@jhu.edu (Y.-F. Liu).

https://doi.org/10.1016/j.bandl.2024.105392
Received 16 May 2023; Received in revised form 8 February 2024; Accepted 14 February 2024
Available online 22 February 2024
0093-934X/© 2024 Elsevier Inc. All rights reserved.
focus. While natural languages are rife with ambiguity (i.e., a given sequence of words can have multiple lexical and syntactic parses with different meanings), the relation between code and meaning is deterministic. Programming languages may in fact have more in common with logical reasoning than with natural language. Code and formal logic both make use of conditions like “if”, quantifiers like “for all”, and logical operators such as “and”, “or”, and “not”. In both systems, these expressions are interpreted deterministically and without the pragmatic enrichment that is so characteristic of human natural language.

The available empirical evidence suggests that domain-general logical reasoning systems, rather than language networks, support programming (Dehaene, Al Roumi, Lakretz, Planton, & Sablé-Meyer, 2022; Fedorenko & Varley, 2016; Monti, Parsons, & Osherson, 2009, 2012; Monti & Osherson, 2012). Studies of individual differences find correlations between programming learning outcomes and logical, analogical, and deductive reasoning abilities (McCoy & Burton, 1988; Pea & Kurland, 1984; Prat et al., 2020; Shute, 1991). Recent functional magnetic resonance imaging (fMRI) studies have found activity in fronto-parietal reasoning areas, rather than perisylvian language circuits, when comparing programming tasks to various control conditions: code reading vs. reading algorithms written in plain English (Ivanova et al., 2020), code reading vs. syntactic bug finding (Siegmund et al., 2014), code reading vs. prose reading (Floyd, Santander, & Weimer, 2017), and searching for semantic bugs which prevent a program from implementing its intended algorithm vs. reading a bug-free program (Castelhano et al., 2018). The same fronto-parietal network is also involved in code writing as opposed to prose writing (Krueger et al., 2020), even when participants covertly crafted the program without actually typing it (Xu, Li, & Liu, 2021). In many of these studies, the control condition involved language (e.g., prose reading), leading to the concern that language-related activation was subtracted out. However, Liu et al. (2020) compared code comprehension to memorizing scrambled code, which lacked meaningful words and sentential structure. Nevertheless, the code vs. scrambled code contrast still identified fronto-parietal and not perisylvian language networks.

A possible conclusion from these data is that the language system plays little role in the processing of computer code (Ivanova et al., 2020; Liu et al., 2020). At the same time, it is not clear that the involvement of the language system in code processing can be entirely dismissed. One recent study found that while Python code and natural language do not activate the same cortical networks, they do show co-lateralization across individuals, suggesting some relationship between them (Liu et al., 2020). Moreover, examining univariate responses relative to control conditions (as reviewed above) is not the only way to test whether a neurocognitive system is involved in a particular task.

It remains possible that even though the fronto-temporal language system does not show large activity during code processing (perhaps because natural rather than programming languages are its preferred stimulus), multivariate patterns of activity in the language network still represent information relevant to programming code. Such dissociations between univariate and multivariate results have been observed in other domains of cognitive neuroscience. For example, despite low overall activity levels, early sensory areas contain information about the contents of visual working memory (Bettencourt & Xu, 2016; Emrich, Riggall, LaRocque, & Postle, 2013; Ester, Serences, & Awh, 2009; Harrison & Tong, 2009; Riggall & Postle, 2012).

Only a handful of studies have used multivariate methods to study programming, and most of these have not looked at decoding in language regions. Ikutani et al. (2021) and Liu et al. (2020) found that different types of programming algorithms can be decoded based on the spatial activation pattern in the fronto-parietal reasoning network. Furthermore, decoding accuracy in the fronto-parietal network correlated with performance on a behavioral task in which participants sorted programming scripts based on the underlying algorithms (Ikutani et al., 2021). However, neither of these studies specifically looked at decoding in functionally localized language areas.

Recent evidence for the hypothesis that language areas may in fact contain multivariate information about programming code comes from Srikant et al. (2022). They found that a linear classifier trained on activity patterns from the language network could distinguish between different types of control structures (FOR loop, IF conditional, or sequential operations lacking FOR and IF) and data types (string vs. numeric) in Python programming scripts. These results suggest that, contrary to the inferences from univariate measures, the language network does show sensitivity to the content of computer code.

In the current study, we sought to replicate and extend these results by probing in greater detail the contribution of the language network to program comprehension. We used data from a previous publication (Liu et al., 2020). In the previous work, we did not conduct any multivariate pattern analysis (MVPA) in the language network. In our current study, MVPA was used to decode FOR from IF functions based on activity patterns within classic language regions identified in individual participants. While undergoing functional magnetic resonance imaging (fMRI) scans, expert Python programmers read short functions, each of which contained a single FOR loop or a single IF conditional. The same participants performed a language localizer task where sentence comprehension was compared to solving math equations. This functional localization allowed us to conduct our analyses within the neural population most sensitive to linguistic content within each individual participant.

The current experiment went beyond previous studies by asking whether the language network is sensitive to the compositional meaning of programming functions or is restricted to retrieving the meanings of programming keywords, such as “return”, “for” and “if”. If the language network's role is restricted to retrieving word meanings, decoding might be based purely on the presence of distinct lexical items across different function types. This hypothesis is consistent with the available data because, in the only prior study to find decoding in the language network, the decoded functions contained different words, in addition to differing in line structure (Srikant et al., 2022).

A feature of the stimuli in the current study made it possible to test the hypothesis that the language network is sensitive to the compositional structure, and not just the lexical items, of code: the same participants read real Python functions and memorized similar functions presented with all the words in scrambled order, i.e., “scrambled fake functions”. Each scrambled “fake” function was generated from a real Python function by scrambling the words and symbols within each line. As a result, the lexical items within each scrambled function were identical to a real function, so words like “for” and “if” were preserved. However, the fake functions lacked the meaning and structure present in real Python functions. If decoding of IF vs. FOR functions in language regions is driven by the presence of different lexical items, we should be able to decode not only the real but also the fake functions. By contrast, if decoding in the language network is driven by the meaning and/or structure of the Python code, then we should find decoding of real but not fake functions.

Next, we compared the temporal dynamics of neural responses to Python functions across language and fronto-parietal reasoning networks to test whether these networks contribute to different aspects of code comprehension. We hypothesized that the language system is responsible for the initial meaning extraction from programming text, whereas the fronto-parietal logical reasoning system subsequently creates a mental model of the programming algorithm. This latter mental model is a more in-depth representation than the surface meaning extracted by the language system. It is more flexible, containing variables that can take on specific values and be entered into functions. Based on this hypothesis, we predicted that the language system contributes to code comprehension earlier than the fronto-parietal reasoning system and that the blood-oxygen level dependent (BOLD) signal responses to code would peak earlier in language relative to fronto-parietal systems.

2. Methods

As this study consists of further analyses of the data collected for a previous publication, the participants, experiment design, and data acquisition procedures are identical to those described previously by Liu et al. (2020). However, for the sake of completeness, we briefly describe these aspects of the study here.

2.1. Participants

Fifteen individuals participated in the study (three women, twelve men, age range 20–38, mean age = 27.4, SD = 5.0). Participants had an average of 5.7 years of Python programming experience (range: 3–9, SD = 1.8). Beyond self-report, Python expertise was evaluated with two Python exercises administered outside the MRI scanner. In the first exercise, participants answered what the output of a one-line Python statement would be (e.g., for “print('3.14'.split('1'))”, participants should type “['3.', '4']”). In the second exercise, participants saw a Python program snippet with a blank, along with a sentence describing what the snippet should do when executed. Participants were required to complete the snippet to fulfill the specification (e.g., for the snippet “a = 'abc'; print(______(enumerate(a)))” and the specification “What should be filled in the blank if we want to print out a collection of tuples enclosed in square brackets, rather than something like <enumerate object at 0x000001888D2B2678>?”, participants should type “list”). The first exercise evaluated participants' knowledge of basic Python syntax and built-in functions (M = 82.9 %, SD = 6.9 %, range: 70–96 %), whereas the second exercise evaluated participants' ability to use their programming knowledge to solve a problem (M = 64.6 %, SD = 16.6 %, range: 37.5–93.75 %). For detailed descriptions of these exercises, please refer to Liu et al. (2020).

All participants had normal or corrected-to-normal vision and none had been diagnosed with cognitive or neurological disabilities. All participants gave informed consent according to procedures approved by the Johns Hopkins Medicine Institutional Review Board (IRB protocol number: NA_00087983).

2.2. fMRI task design and stimuli

Participants took part in an fMRI Python code comprehension experiment and a second localizer experiment for language, logical reasoning, and symbolic math. Participants also performed a multi-source interference task (Bush & Shin, 2006), which is not relevant to the current paper and will not be discussed further.

The code experiment consisted of a Python code condition and a “fake” code memory control. Real code comprehension trials involved the sequential presentation of three elements: a Python function (24 s), an input to the function (6 s), and a proposed output (6 s). Participants judged whether the proposed output was correct and indicated their response via a yes/no button press. Each Python function contained exactly one control structure, which was either a FOR loop or an IF conditional. There were two variants of FOR functions and two variants of IF functions. The first variant of FOR functions implemented the FOR loop in the canonical way, where a FOR loop began with the keyword “for”, followed by actions to be taken in each iteration. The second variant of FOR functions contained a Python-specific expression called “list comprehension”, where the keyword “for” was placed in the middle of a loop definition, rather than at the beginning. In the first variant of IF functions, a conditional statement began with the keyword “if”, followed by the action to be taken if a condition was met. In the second variant, the keyword “if” was not used. Instead, we multiplied the action by the condition such that if the condition evaluated to false, the product of the multiplication was 0, indicating no action was taken. Despite the existence of the variants, all functions consisted of exactly 5 lines of code with the same patterns of indentation, such that FOR and IF functions, regardless of the variants, were visually similar. All functions took a character string as input and performed string manipulations. As discussed in detail below, analyses focused on the function comprehension portion of the trial, prior to input presentation. Each trial was followed by a 5-second inter-trial interval. Please see Fig. 1 for example stimuli. Prior to the experiment, participants were told they were going to see functions that work with character strings called “input”. They also practiced with these types of stimuli, so they were well aware that in the scope of this experiment, “input” referred to the input argument of the functions, rather than a Python built-in function which happened to have the same name.

Fake code memory trials had a similar structure to the code trials: a scrambled Python “function” (24 s), followed by a one-line “input” (6 s) and a scrambled one-line “output” (6 s). Participants were instructed to remember the text presented during the first two phases of the trial. They then judged whether the one-line scrambled “output” matched any of the lines presented during the previous phases (including both the scrambled function and the scrambled “input”). As with real code, analysis focused on the “function” portion of the trial.

Every scrambled function was generated from a real Python function by separately scrambling each line of the real function at the level of words and symbols. Therefore, the words, digits, and operators present in real functions were preserved in scrambled functions, but none of the scrambled lines comprised an executable Python statement. Like the real IF and FOR functions, the fake FOR functions contained the word “for”, and the fake IF functions did not. We contrasted reading and comprehending Python code against an explicit working memory task in an attempt to isolate the process of understanding the algorithm from the working memory processes associated with comprehension.

There were six task runs in this scan. During each run, participants saw 8 real FOR functions, 8 real IF functions, and 4 fake scrambled functions (48 FOR, 48 IF, and 24 fake across runs for each participant). Across the 6 task runs, each participant saw only either the “real” or the “fake” version of any given function, but not both.

The localizer experiment included language comprehension, logical reasoning, and symbolic math tasks, all following the same structure. On language trials, participants judged whether two sentences, one in active and one in passive voice, had the same meaning (e.g., “The child that the babysitter chased ate the apple” vs “The apple was eaten by the babysitter that the child chased”). On formal logic trials, participants judged whether two logical statements were consistent, that is, whether the two statements logically entail each other (e.g., “If either not X or not Y then Z” vs “If not Z then both X and Y”). On math trials, participants judged whether the variable X had the same value across two equations (e.g., “X minus twenty-three equals forty-two” vs “X minus fifty-one equals fourteen”). In each trial, one of the two “sentences” appeared first, with the other following 3 s later. Both statements stayed visible on the screen for 16 s. Participants indicated their true/false judgment by pressing one of two buttons. The experiment consisted of 6 runs, each containing 8 trials of each type (language/logic/math) and six rest periods lasting 5 s each. In this study, we localized the perisylvian fronto-temporal language network using the language > math contrast, and the lateral fronto-parietal logical reasoning network using the logic > language contrast (Kanjlia, Lane, Feigenson, & Bedny, 2016; Monti et al., 2009, 2012; Monti & Osherson, 2012). The logic task was adapted from Monti et al. (2009, 2012) and Monti and Osherson (2012), whereas the language task was adapted from Kanjlia et al. (2016), which was in turn derived from Monti et al. (2012) and Monti and Osherson (2012).

2.3. fMRI data acquisition and preprocessing

All functional and structural MRI data were acquired at the F.M. Kirby Research Center of Functional Brain Imaging on a 3T Philips Achieva Multix X-Series scanner. T1-weighted structural images were collected in 150 axial slices with 1 mm isotropic voxels using a

Fig. 1. Example stimuli. The top row shows a real FOR function and a real IF function. The bottom row shows a scrambled (fake) FOR function and a fake IF function. Each fake function was created by scrambling the words and symbols in each line of the corresponding real function. During the experiment, all stimuli were presented on a black background. Real functions were presented in a white font, and fake functions were presented in a yellow font as a visual reminder to prevent participants from momentarily mistaking the fake functions for real functions. For an illustration of the experiment design, please refer to Liu et al. (2020).
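The stimulus types in Fig. 1 can be made concrete with hypothetical examples in the style described in Section 2.2, together with a minimal version of the line-wise scrambling used to create fake functions. These sketches are our own illustrations, not the actual stimuli (which are available from Liu et al., 2020); in particular, the real stimuli were exactly 5 lines with matched indentation, and the exact tokenization of symbols during scrambling is an assumption here.

```python
import random

# Hypothetical "real" stimuli: each contains exactly one control structure,
# takes a character string called "input", and manipulates it.

def for_canonical(input):
    # FOR variant 1: canonical loop beginning with the keyword "for"
    output = ''
    for char in input:
        output = output + char * 2
    return output

def for_comprehension(input):
    # FOR variant 2: "for" appears mid-line, inside a list comprehension
    output = [char * 2 for char in input]
    return ''.join(output)

def if_keyword(input):
    # IF variant 1: conditional beginning with the keyword "if"
    output = input
    if len(input) > 3:
        output = input.upper()
    return output

def if_multiplication(input):
    # IF variant 2: the action is multiplied by the condition, so a False
    # condition yields 0 repetitions, i.e., no action is taken
    count = 2 * (len(input) > 3)
    output = input * count
    return output

def scramble_function(code, seed=0):
    # "Fake" stimuli: shuffle the tokens within each line separately, so
    # lexical items (words, digits, operators) are preserved line by line
    # but no line remains an executable Python statement.
    rng = random.Random(seed)
    scrambled = []
    for line in code.splitlines():
        tokens = line.split()  # assumes whitespace-separated tokens
        rng.shuffle(tokens)
        scrambled.append(' '.join(tokens))
    return '\n'.join(scrambled)
```

Note that, as in the experiment, a fake FOR function produced this way still contains the word “for”, only stripped of its compositional context.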

magnetization-prepared rapid gradient-echo (MP-RAGE) sequence. Functional T2*-weighted BOLD scans were collected using a gradient echo planar imaging (EPI) sequence with the following parameters: 36 sequential ascending axial slices, repetition time (TR) = 2 s, echo time (TE) = 0.03 s, flip angle = 70°, field of view (FOV) matrix = 76 × 70, slice thickness = 2.5 mm, inter-slice gap = 0.5 mm, slice coverage FH = 107.5, voxel size = 2.4 × 2.4 × 3 mm, PE direction = L/R, first-order shimming. Six dummy scans were collected at the beginning of each run but were not saved. We acquired the data in one code comprehension session (six runs) and one localizer session (6 runs of language/math/logic), with the acquisition parameters being identical for both sessions.

The stimuli in both sessions were presented with custom scripts written in PsychoPy3 (https://www.psychopy.org/ (Peirce et al., 2019)). The visual stimuli were presented on a rear projection screen cut to fit the scanner bore. The participant viewed the screen via a front-silvered, 45° inclined mirror attached to the top of the head coil. The stimuli were projected with an Epson PowerLite 7350 projector. The resolution of the projected image was 1600 × 1200.

Data were analyzed using Freesurfer, FSL, HCP Workbench, and custom in-house software written in Python (Dale, Fischl, & Sereno, 1999; Glasser et al., 2013; Smith et al., 2004). Functional data were motion corrected, high-pass filtered (128 s), mapped to the cortical surface using Freesurfer, spatially smoothed on the surface (6 mm FWHM Gaussian kernel), and prewhitened to remove temporal autocorrelation. Covariates of no interest were included to account for confounds related to white matter, cerebral spinal fluid, and motion spikes.

3. Analysis

3.1. Whole-cortex searchlight multivariate pattern analysis (MVPA)

In this analysis, we asked: in the whole cortex, where were the FOR and IF programming functions most distinctly represented? To answer this question, we conducted MVPA decoding to distinguish FOR and IF functions using local spatial activation patterns in the “searchlight” associated with each vertex on the cortical surface.

To prepare data for decoding, we constructed a general linear model (GLM) where each real code function (48 FOR and 48 IF) and each scrambled control function (12 FOR and 12 IF) was entered as a separate predictor with 24 s duration, modeling the function presentation phase. A support vector machine (SVM) classifier was then trained and tested on the spatial pattern of z-statistics associated with beta parameters estimated by the GLM. The SVM classifier was implemented in the Python toolbox Scikit-learn (Chang & Lin, 2011; Pedregosa et al., 2011). For each vertex in the brain, one linear SVM classifier (regularization parameter C = 5.0) was trained and tested on the spatial pattern in a “searchlight” surrounding the vertex. The searchlight associated with a vertex consisted of all the vertices within a circle of 8 mm diameter (according to geodesic distance) centered at the vertex (Glasser et al., 2013; Kriegeskorte, Goebel, & Bandettini, 2006). Searchlights containing sub-cortical vertices were excluded. The regularization parameter C of an SVM classifier indicates how much misclassification of the training data is allowed, where a larger value means less misclassification. However, increasing the C value also increases the risk of overfitting the training data, leading to reduced decoding accuracy when the model is applied to the testing data. The default value of C in Scikit-learn is 1, and we selected a larger value to impose a harder margin and better avoid misclassification. Empirically, a reasonable C value ranges from 1 to 10 (https://www.ibm.com/docs/en/spss-modeler/18.2.2?topic=node-svm-expert-options). We selected 5, which falls at the center of this range, and used this value consistently throughout the analysis.

To eliminate any difference in the overall signal strength across MRI scanning runs, data were normalized within each run (Lee & Kable, 2018; Stehr, Garcia, Pyles, & Grossman, 2023). In each run, for each vertex on the cortical surface, the mean and standard deviation were computed across trials and used for normalization such that the mean was set to 0 and the standard deviation to 1. To avoid the dependency between trials from the same run artificially inflating the decoding accuracy, we performed a 6-fold leave-one-run-out cross-validation (Etzel, Valchev, & Keysers, 2011; Mumford, Davis, & Poldrack, 2014; Valente, Castellanos, Hausfeld, De Martino, & Formisano, 2021). In each cross-validation fold, the classifier was trained on the data from 5 of the 6 task runs and tested on the left-out run. The resulting 6 accuracy values were averaged to derive the observed accuracy for one participant in one searchlight. In each searchlight, we used a one-sample t-test to test the 15-participant group mean accuracy (Fisher z-transformed) against chance of 50 % (also Fisher z-transformed).

To control the family-wise error rate (FWER), we applied a cluster-based permutation correction (Elli, Lane, & Bedny, 2019; Musz, Loiotile, Chen, & Bedny, 2022; Regev, Honey, Simony, & Hasson, 2013; Schreiber & Krekelberg, 2013; Stelzer, Chen, & Turner, 2013; Su,

Fonteneau, Marslen-Wilson, & Kriegeskorte, 2012) with a vertex-wise cluster-forming threshold of uncorrected p < 0.001 and a cluster-wise FWER threshold of p < 0.05 (Eklund, Knutsson, & Nichols, 2019; Eklund, Nichols, & Knutsson, 2016; Winkler, Ridgway, Webster, Smith, & Nichols, 2014). Specifically, we shuffled the condition labels (FOR and IF) 100 times. For each shuffle, we derived one null group-mean accuracy map. To the observed group-mean accuracy map and each of the null maps, we applied a cluster-forming threshold of p < 0.001. For each cluster, we computed its strength-over-spread, defined as the average distance between each vertex in the cluster and the cluster peak, weighted by the decoding accuracy value of the vertex. We recorded the maximum strength-over-spread value among the clusters in each null accuracy map to form a null distribution of 100 maximum strength-over-spread values. A cluster in the observed map passed the correction if its strength-over-spread value was greater than the 95th percentile of the null distribution.

3.2. ROI definition

Separate GLMs were constructed for the code experiment and localizer scans for the purpose of generating region of interest (ROI) search spaces and individual functional ROIs (fROIs). Specifically, our analysis focused on four ROIs: language-responsive lateral temporal cortex (LT), code-responsive intraparietal sulcus (IPS) and lateral prefrontal cortex (PFC), and a control region, the medial occipital primary visual cortex (OCC). Each ROI served as a search space within which we defined an fROI for each individual. The fROI approach was adopted to account for the known individual differences in the neural populations engaged by a cognitive task (Nieto-Castañón & Fedorenko, 2012). It enabled us to select the subject-specific neural populations engaged by the experiment conditions within a larger group-based ROI search space. How the group-based ROI search spaces and the fROIs within each individual were defined is described in detail below.

In the localizer GLM, one predictor was included for each of the three conditions (sentence, math, and logic), modelling the 16 s duration when the pair of statements was visible. A separate predictor was entered to model the 5 s rest period between trials. For further details see Liu et al. (2020). The group contrast of sentence > math (p < 0.05, FWER cluster-corrected) was used to define the language-responsive lateral temporal cortex (LT) ROI search space, and the individual sentence > math contrasts were used to select language-responsive fROIs within the LT ROI search space. On the other hand, the group contrast of

math contrast were excluded, and the top 5 % of vertices were selected from the subset of vertices with positive z-statistics (i.e., those preferring sentence over math). In the IPS and PFC, individual code-responsive fROIs were defined in a similar fashion, but based on the real > fake function contrast from the code comprehension experiment. The average number of selected vertices was 130 in the LT, 70 in the IPS, and 101 in the PFC. For MVPA decoding, for each participant within each search space, we selected the top 500 vertices based on the fixed-effect real > fake function contrast. For the percent signal change (PSC) analysis in IPS and PFC, we used a leave-one-run-out approach (Glezer & Riesenhuber, 2013; Kriegeskorte, Simmons, Bellgowan, & Baker, 2009), taking data from 5 of the 6 runs to select the top 5 % of code-responsive vertices within a search space, and extracting the PSC time course from the held-out run. For each participant, this process was repeated for all 6 runs and the results were averaged across folds. In the PSC analysis, the leave-one-run-out approach was adopted for the code-responsive fROIs to avoid a circular analysis in which neural responses during code reading are extracted from fROIs defined by those same responses. However, for the language-responsive fROIs, because we extracted the neural responses during code reading from fROIs defined by an orthogonal localizer contrast, the leave-one-run-out approach was neither necessary nor preferable.

3.3. ROI-based MVPA

In this analysis, we investigated the neural representations of algorithms in four ROIs: language-responsive LT; code-responsive IPS and PFC; and the control region OCC, which was not expected to encode information relevant to programming algorithms.

We used the same configuration for the SVM classifier, training data, normalization scheme, and leave-one-run-out cross-validation as in the searchlight MVPA.

To determine whether decoding of IF vs. FOR functions was driven by the presence of different “lexical” items such as the word “for”, or rather by the compositional structure of programming functions, we trained and tested the same SVM classifier on scrambled fake functions. Recall that each fake function was derived from one real function, where words such as “for” or “if” were retained. As a result, fake functions can be divided into “for” fake functions and “if” fake functions. Except for the training and testing data, the procedure for fake function decoding was the same as for real function decoding. Because participants saw four times as many real functions as fake functions, we conducted a separate
logic > sentence (p < 0.05, FWER cluster-corrected) was used to define control decoding analysis, where we trained the same SVM classifier on
the intraparietal sulcus (IPS) and the lateral prefrontal cortex (PFC) ROI only a quarter of the real function data.
search spaces.
For the coding experiment GLM, each condition (5 in total: 2 variants 3.4. Percent signal change (PSC) analysis
of FOR functions, 2 variants of IF functions, and scrambled fake func­
tion) were entered as a separate predictor to model the duration of We examined the time courses of percent signal change (PSC) of the
function presentation (24sec). The individual real > fake function con­ blood-oxygen level dependent (BOLD) signal to study the dynamics of
trasts were used to select code-responsive fROIs within the IPS and the the neural responses to programming functions in each fROI. Specif­
PFC ROI search spaces. ically, we compared the (1) peak-to-trough amplitude, (2) peak time,
To generate the IPS, PFC (based on group logic > sentence contrast), and (3) average PSC of the time courses between the language-
and the LT (based on group sentence > math contrast) search-space responsive LT and the code-responsive IPS and PFC.
masks, we combined the cortical parcels in the 400-parcel map re­ PSC was calculated as [(Signal condition − Signal baseline)/Signal
ported by Schaefer et al.(2018) which included vertices activated in the baseline], where baseline is the activity during the inter-trial interval.
contrast of interest. To define the OCC search space, we combined the We extracted PSC from the duration of function presentation (FOR, IF, or
peri-calcarine parcels from Schaefer et al.(2018) to form a search space fake), excluding activity related to the derivation of specific output or
covering the medial occipital primary visual cortex. response processes. As introduced in the section regarding ROI defini­
Within each ROI search spaces, we defined the fROI for each indi­ tion, in the language-responsive LT, for each participant and each run,
vidual. Language-responsive individual subject fROIs were defined as we extracted PSC from the top 5 % of active vertices in the fixed-effect
vertices showing the strongest fixed-effect for the sentence > math sentence > math contrast. The 6 resultant PSC curves were then aver­
contrast in the LT search space. For MVPA decoding, we selected the top aged. In the IPS and PFC, we used a leave-one-run-out approach to select
500 sentence > math vertices in LT. For percent signal change (PSC) the top 5 % of active vertices in the fixed-effect real > fake function
analysis in LT, we selected the top 5 % sentence > math vertices (Kanjlia contrast, and extract the PSC using the left-out run. This was repeated
et al., 2016; Kim, Kanjlia, Merabet, & Bedny, 2017). Specifically, within across a 6 runs, and the 6 resultant PSC curves were then averaged.
the LT search space, vertices with negative z-statistics in the sentence > Since IPS and PFC showed similar time courses, they were averaged

Y.-F. Liu et al. Brain and Language 251 (2024) 105392

together. Therefore, the extraction resulted in one PSC time course per condition (FOR, IF, and fake) per participant, in either the language-responsive LT or the code-responsive lateral fronto-parietal network (IPS&PFC). The peak-to-trough amplitude was defined as the difference between the maximum and minimum value of a time course. The peak time was the time point at which the maximum value occurred. The average PSC of a time course was computed by averaging the middle 14 s of the time course (that is, the first and last 5 s were not considered). Paired t-tests were conducted to compare these variables of interest.

4. Results

4.1. MVPA decoding of FOR and IF functions in language-responsive lateral temporal cortex

In whole-cortex searchlight analysis, we found reliable above-chance decoding of FOR and IF functions in a left-lateralized fronto-temporal network, with the highest accuracy in the lateral superior temporal sulcus (BA 22, 39) and temporo-parietal cortices (the angular and the supramarginal gyri, BA 40). Smaller clusters were also found in left prefrontal cortex, left intraparietal sulcus, and right temporo-parietal cortices (Fig. 2).

Next, we used individual-subject ROI analysis to test whether, in individual participants, language-responsive lateral temporal cortices contain neural populations that distinguish between IF vs. FOR Python functions. Lateral temporal ROIs chosen in individual participants for their language selectivity in the localizer scan (sentence > math) showed above-chance classification of FOR and IF code functions (LT accuracy = 66.9 %, Wilcoxon signed rank test against chance: z = −3.41, p < 0.001). Decoding in language-responsive lateral temporal cortex was as high as decoding in IPS (accuracy = 67.8 %, z = −3.41, p < 0.001) and PFC (accuracy = 64.4 %, z = −3.41, p < 0.001), where vertices were chosen for their high responses to real over fake code. Wilcoxon signed rank tests showed that decoding accuracy values in the LT, the IPS, and the PFC were higher than in an occipital cortex control region (OCC accuracy = 55.5 %, z = −2.54, p < 0.05; LT vs OCC: z = −3.10, p < 0.01; IPS vs OCC: z = −3.04, p < 0.01; PFC vs OCC: z = −2.56, p < 0.05) (Fig. 3). In sum, lateral temporal language-responsive areas showed high decoding accuracy for IF vs. FOR functions. Decoding in language-responsive LT was as high as or higher than in IPS and PFC areas identified for their univariate responses to code (Supplementary Fig. 1).

Table 1
Clusters revealed by the searchlight MVPA decoding. FWER corrected. Only clusters with more than 50 vertices are included. Columns: peak MNI coordinates (X, Y, Z); cluster size (vertices; mm2); peak p-value.

Left hemisphere
Superior temporal sulcus/angular gyrus/intraparietal sulcus: −46.6, −69.2, 21.9; 3169 vertices; 5260.58 mm2; 3.62E−09
Superior frontal sulcus/middle frontal gyrus/precentral gyrus & sulcus: −24.9, 16.8, 40; 979 vertices; 2022.29 mm2; 8.55E−09
Precuneus: −6.3, −59, 32; 526 vertices; 913.58 mm2; 1.23E−07
Superior frontal gyrus: −16.4, 18.8, 59; 395 vertices; 1026.33 mm2; 3.27E−08
Inferior frontal gyrus: −52.7, 14.1, 5.1; 213 vertices; 510.77 mm2; 1.97E−07
Posterior dorsal cingulate gyrus: −7.4, −40.8, 39.1; 121 vertices; 260.38 mm2; 1.31E−07
Superior frontal gyrus: −13.4, 53.1, 32.4; 89 vertices; 199.66 mm2; 2.42E−06
Middle frontal gyrus: −45.4, 30.6, 25.8; 86 vertices; 150.12 mm2; 3.44E−05
Medial occipito-temporal sulcus: −32.6, −48.9, −10.2; 74 vertices; 173.29 mm2; 4.02E−07
Orbital gyrus: −41.2, 26.7, −16.1; 71 vertices; 172.65 mm2; 7.33E−07
Lingual gyrus: −3.8, −91.1, −6.8; 51 vertices; 136.47 mm2; 7.01E−06

Right hemisphere
Intraparietal sulcus/superior temporal sulcus/angular gyrus: 35.5, −54.8, 39.8; 1036 vertices; 1954.92 mm2; 9.23E−08
Superior frontal sulcus: 27, 13.1, 45.5; 315 vertices; 551.61 mm2; 3.07E−07
Precuneus: 5.9, −68.1, 47.6; 266 vertices; 504.43 mm2; 1.52E−06
Middle frontal gyrus/inferior frontal sulcus: 47.7, 23.4, 31.5; 193 vertices; 342.6 mm2; 4.56E−07
Precuneus: 7.3, −54.9, 47.1; 87 vertices; 113.06 mm2; 6.09E−06
Superior frontal gyrus: 11.5, 54.7, 31.8; 60 vertices; 205.07 mm2; 3.13E−06
Superior frontal gyrus: 16.5, 38.8, 45; 57 vertices; 162.05 mm2; 6.77E−06
Lingual gyrus: 3.1, −79.2, 0.2; 52 vertices; 195.12 mm2; 4.26E−06
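The decoding scheme described in the Methods (a linear SVM classifying FOR vs. IF activity patterns with leave-one-run-out cross-validation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis code: the run and trial counts, the 500-vertex pattern size, the per-run z-scoring, and the use of scikit-learn's LinearSVC are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for one participant's data: 6 runs x 8 trials,
# each trial a pattern over 500 vertices (e.g., the top sentence > math
# vertices in LT). Labels: 0 = FOR function, 1 = IF function.
n_runs, n_trials, n_vertices = 6, 8, 500
labels = np.tile([0, 1], n_trials // 2)
signal = rng.normal(0, 1, n_vertices)  # a class-specific signal direction
X = np.stack([
    np.stack([rng.normal(0, 1, n_vertices) + (1 if y else -1) * 0.5 * signal
              for y in labels])
    for _ in range(n_runs)
])  # shape (runs, trials, vertices)
y = np.tile(labels, (n_runs, 1))

def zscore(a):
    # Normalize each vertex's values across the trials of one run.
    return (a - a.mean(axis=0)) / a.std(axis=0)

# Leave-one-run-out cross-validation: train on 5 runs, test on the held-out run.
accuracies = []
for test_run in range(n_runs):
    train_runs = [r for r in range(n_runs) if r != test_run]
    X_train = np.concatenate([zscore(X[r]) for r in train_runs])
    y_train = np.concatenate([y[r] for r in train_runs])
    clf = LinearSVC(C=1.0).fit(X_train, y_train)
    accuracies.append(clf.score(zscore(X[test_run]), y[test_run]))

print(f"mean decoding accuracy: {np.mean(accuracies):.2f}")  # chance = 0.5
```

Accuracy averaged across the six folds plays the role of the per-participant decoding accuracy that is compared against chance (50 %) in the analyses above.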
Fig. 2. Searchlight multivariate pattern analysis (MVPA) FOR-vs-IF Python function decoding accuracy map. Family-wise error rate (FWER) was controlled by applying cluster-based permutation correction, with a vertex-wise cluster-forming threshold of uncorrected p < 0.001, and a cluster-wise threshold of p < 0.05. The blue outlines denote the language-responsive network defined based on the sentence > math group contrast derived from the localizer scan, whereas the green outlines denote the logic-responsive network defined based on the logic > sentence group contrast. All the vertices shown in this figure have significantly above-chance accuracy. For the list of clusters which passed the FWER correction, please see Table 1.

One possibility is that language-responsive cortex is sensitive only to the “lexical” items present in code and not to the compositional structure of code functions (e.g., the presence of particular key words, such as “if” and “for”). Alternatively, language areas may be sensitive to the compositional structure of Python code functions. To distinguish between these possibilities, we conducted the same decoding using the neural responses to scrambled fake functions. Recall that in our experiment, each fake function contained all the words and symbols of the corresponding real function in a scrambled order, line by line. Therefore, like real FOR functions, all fake FOR functions contained the word “for”. Contrary to the lexical-only hypothesis, we did not find significantly above-chance


Fig. 3. FOR-vs-IF Python function MVPA decoding accuracy in language-responsive left lateral temporal cortex (LT), code-responsive left intraparietal sulcus (IPS), code-responsive left lateral prefrontal cortex (PFC), and left primary visual medial occipital cortex (OCC). “Real” is the accuracy of decoding IF vs. FOR Python functions; “fake” is the accuracy of fake/scrambled function decoding. The search spaces are delineated on the inset brain map. Chance level is 50 %. Error bars denote standard error of the decoding accuracy. *p < 0.05, **p < 0.01, ***p < 0.001.

decoding accuracy for fake code in language-responsive lateral temporal cortex, or in any of the other ROIs (LT: accuracy = 46.7 %, z = −1.15, p = 0.24; IPS: accuracy = 46.9 %, z = −1.43, p = 0.15; PFC: accuracy = 49.2 %, z = −0.45, p = 0.65; OCC: accuracy = 44.4 %, z = −1.92, p = 0.053). In the language-responsive LT and the code-responsive IPS and PFC, the decoding accuracy for real functions was significantly higher than for fake functions. Curiously, this difference was also observed in the OCC (LT: z = −3.01, p < 0.01; IPS: z = −2.78, p < 0.01; PFC: z = −2.22, p < 0.05; OCC: z = −3.04, p < 0.01) (Fig. 3). To account for the fact that there were four times as many real as fake functions in the experiment, we repeated the decoding analysis on one quarter of the real code data. For real code, decoding in the language-responsive lateral temporal cortex, the IPS, and the PFC remained significantly above chance (LT: accuracy = 65.8 %, z = −3.08, p < 0.01; IPS: accuracy = 58.6 %, z = −3.06, p < 0.01; PFC: accuracy = 60.3 %, z = −2.93, p < 0.01). In the LT and the IPS, the difference between the decoding accuracy for a quarter of the real functions and for the fake functions remained significant (LT: z = −3.01, p < 0.01; IPS: z = −2.13, p < 0.05) and was marginally significant in the PFC (z = −1.88, p = 0.06). In the OCC, neither the decoding accuracy for a quarter of the real functions nor the difference between a quarter of the real functions and the fake functions reached significance (OCC: accuracy = 51.1 %, z = −0.25, p = 0.80; difference with fake function decoding: z = −1.61, p = 0.11) (Supplementary Fig. 1). This result suggests that multivariate patterns in language-responsive lateral temporal cortex (or IPS and PFC) cannot be explained by the lexical meanings of the particular words used in the functions but are related at least in part to the compositional structure of code.

4.2. Language vertices had weaker but earlier univariate responses to programming functions

In a sensitive individual-subject ROI analysis focusing on the top 5 % of language-responsive vertices in lateral temporal cortex (language > math), we observed a small but significant response to real over fake code in language-responsive LT (real functions: 0.095 %, SD = 0.26 %; fake functions: −0.028 %, SD = 0.27 %; t(14) = 2.87, p < 0.05). Consistent with previously reported whole-cortex results (Liu et al., 2020), univariate responses to Python code in LT were much smaller than in fronto-parietal cortices IPS&PFC (real functions: 0.65 %, SD = 0.27 %; fake functions: 0.023 %, SD = 0.22 %; real vs. fake t(14) = 9.52, p < 0.001), both compared to the memory control condition and compared to rest (FOR: mean PSC difference = 0.42 % in the LT, SD = 0.14 %; mean = 1.1 % in the IPS&PFC, SD = 0.41 %; paired t-test between ROIs t(14) = −6.10, p < 0.001. IF: mean PSC difference = 0.53 % in the LT, SD = 0.20 %; mean = 1.24 % in the IPS&PFC, SD = 0.42 %; t(14) = −6.22, p < 0.001).

Based on the hypothesis that language regions are involved in the representation of the initial “gist” of programming functions, we predicted that the neural response to programming functions should peak earlier in language-responsive lateral temporal cortex than in the fronto-parietal reasoning network (IPS and PFC). Consistent with this hypothesis, we observed significantly earlier signal peaks in language-responsive LT than in the code-responsive fronto-parietal network (FOR: mean peak time = 8.87 s in the LT, SD = 6.39 s; mean = 15 s in the IPS&PFC, SD = 3.72 s; t(14) = −4.77, p < 0.001. IF: mean peak time = 11.67 s in the LT, SD = 6.14 s; mean = 15.93 s in the IPS&PFC, SD = 3.99 s; t(14) = −4.07, p < 0.005.) This result supports the hypothesis that language areas are involved in code comprehension earlier than fronto-parietal systems (Fig. 4). Separately comparing LT with IPS and with PFC led to the same results. The responses in either the IPS or the PFC were stronger and faster than in the LT (Supplementary Fig. 2).

5. Discussion

We find that language-responsive lateral temporal cortices are sensitive to the contents of programming functions. A linear classifier trained on activity patterns in the language network can distinguish between FOR loop and IF conditional functions. Indeed, decoding accuracy in language-responsive lateral temporal cortex was as high as or higher than anywhere else on the cortical surface, including fronto-parietal networks that show strong univariate responses to code. This decoding result is consistent with a recent report by Srikant et al. (2022), which found above-chance decoding accuracy of control structures (FOR, IF, or a sequential script with neither FOR nor IF) and data types in both the fronto-temporal language network and the fronto-parietal reasoning network. Interestingly, high decoding is found in the language network despite relatively low univariate activation to Python code in language areas.

The current results also suggest that language regions, as well as the fronto-parietal network, are sensitive to the compositional structure of code. Decoding in language regions is not likely to be driven solely by the presence of different lexical items across different Python functions, since the scrambled “fake” control functions contained the same lexical items as the individual real Python functions but did not show above-chance decoding in language-responsive lateral temporal cortex (or in the IPS and PFC).

One caveat to our conclusion is that differences between decoding for real and fake functions might be partly driven by task demands, and not only


Fig. 4. Percent signal change (PSC) time courses of different conditions, averaged across participants: FOR (solid line), IF (dashed line), and scrambled fake function (dotted line). Blue lines: PSC time courses extracted from the top 5 % language-responsive vertices in the left lateral temporal (LT) search space. The search spaces are delineated on the inset brain map. Yellow lines: PSC time courses extracted from the top 5 % code-responsive vertices (using the leave-one-run-out method) in the left intraparietal sulcus (IPS) and the lateral prefrontal cortex (PFC), averaged across IPS and PFC. Translucent shades denote standard error. The across-participant average peak times of the responses to FOR and IF functions (blue: LT, yellow: IPS&PFC) are denoted along the X axis.

the stimuli themselves. For the Python code, participants read for meaning, but for fake functions, they performed memorization. However, even in the memory control task, participants needed to attend to individual lexical items in the fake functions to correctly perform the task, making it unlikely that they were ignoring the lexical items. It is also possible that the unstructured nature of the fake code stimuli contributed to the disengagement of the language network. That is, when the lexical items in a programming function are scrambled to disrupt the structure, they do not engage the language network as an actual function does. Based on the current data alone, we cannot rule out the possibility that lower decoding accuracy for fake as compared to real code is related in part to different task demands across these conditions or to the greater variability of the position of the words in the fake functions.

Prior studies provide further evidence for the idea that language regions are not highly sensitive to lexical information in code (Srikant et al., 2022). Srikant et al. (2022) failed to decode between functions that contained variables named in English (e.g., the variable holding the mean of two values was named “mean”) as opposed to functions that contained the same variables named in Japanese spelled out in the English alphabet (e.g., the variable holding the mean was named “heikin”, which means “mean” in Japanese). Together, these findings suggest that the lateral temporal language areas (and fronto-parietal areas) are sensitive to the compositional content of code functions, beyond the lexical items in a Python function.

Support for the idea that the language network plays some role in code comprehension comes from previous evidence of co-lateralization. In a previous publication on the same dataset, we found that fronto-parietal responses to Python code are left-lateralized and, importantly, co-lateralized with the language network across individuals (Liu et al., 2020). That is, participants who have highly left-lateralized responses to language in fronto-temporal language regions are also more likely to have highly left-lateralized responses to Python code in fronto-parietal networks. Together these data suggested that the language system does contribute to code comprehension. Notably, since the current results come from the same dataset as our prior analysis, it will be important to replicate these findings in a new set of participants.

5.1. Possible computational contributions of language network to code comprehension

We propose that during the comprehension of computer programs, information flows from visual cortex to the perisylvian language system, which initially constructs a language-like surface-level representation of the programming script. Then, the surface-level representation is transmitted to the logical reasoning system, where the algorithm is processed. This hypothesis is consistent with the presence of multivariate information about programs in language areas and is supported by evidence for an earlier peak response to programs in language areas, relative to the fronto-parietal network.

According to this proposal, the “surface” representation of the language system describes the program in a non-computable format. By contrast, the fronto-parietal system actually simulates the program’s execution, mapping between inputs and outputs, and changing the states of the variables throughout the execution of the program. When an input is provided, the fronto-parietal systems simulate the program to generate the output (Bunge, Kahn, Wallis, Miller, & Wagner, 2003; Crittenden, Mitchell, & Duncan, 2016; Pischedda, Görgen, Haynes, & Reverberi, 2017; Woolgar, Jackson, & Duncan, 2016; Woolgar, Thompson, Bor, & Duncan, 2011; Zhang, Kriegeskorte, Carlin, & Rowe, 2013). This interpretation of the fronto-parietal system’s role is consistent with another finding reported by Srikant et al. (2022), where the fronto-parietal network encoded the “run-time behavior” of a program (e.g., the number of actual steps taken by the computer to execute the program) better than the language network.

Many questions remain regarding the precise role of the language network in code comprehension. Here, we discuss two possibilities, a


“weak” recycling and a “strong” recycling hypothesis. According to the weak recycling idea, during code comprehension, the language network represents a natural language description of the algorithm represented in programming code, something like what a programmer writes in the comments section of a programming script. Since humans are more familiar with natural than with artificially designed programming languages, it is possible that a natural language description facilitates program comprehension. We call this a “weak” recycling hypothesis because the language network continues to perform its typical linguistic operations, similar to using the language system to solve memorization-based math problems (e.g., 6 times 6 is 36) (Maruyama, Pallier, Jobert, Sigman, & Dehaene, 2012).

Consistent with the weak recycling possibility, it appears to be easy for many programmers to generate language descriptions of their mental processes during coding. “Verbal protocols” are often used to study the cognitive basis of code processing (Hungerford, Hevner, & Collins, 2004; Lethbridge, Sim, & Singer, 2005; Letovsky, 1987; Letovsky & Soloway, 1986; Littman, Pinto, Letovsky, & Soloway, 1987; Pennington, 1987; Sharpe, 1997; von Mayrhauser & Vans, 1994). In such studies, programmers are instructed to “think aloud” during a code comprehension task and the linguistic utterances are analyzed to gain insight into how a programmer’s understanding of a script evolves through time. For example, while reading an unfamiliar script, programmers may begin by asking questions prompting closer inspection of other programming elements (e.g., “What does this other function ‘SRCH’ do?”, “Why are there 7 elements in the array ‘DBASE’?”); in a later stage, programmers’ utterances may become more affirmative, indicating a better understanding of the program (e.g., “So you access records by name in the data base.”, “Because they have that marker in there to determine if it’s deleted.”) (examples from Letovsky, 1987). One hypothesis is that such verbalization occurs covertly, even when there is no motor speech output. If such verbalizations are specific enough, they would account for distinctive neural patterns in the language network for different programming functions.

As opposed to the weak recycling view, the alternative “strong recycling” hypothesis is that the language network supports the representations of hierarchical “syntactic trees” of computer programs. On the strong recycling view, the language system’s capacity for representing hierarchical structures is “repurposed” to represent the structures found in computer code (Fitch et al., 2005; Friederici, Rueschemeyer, Hahne, & Fiebach, 2003; Humphries, Binder, Medler, & Liebenthal, 2006; Pallier, Devauchelle, & Dehaene, 2011). We call this view “strong recycling” because it would require modifying representations in the fronto-temporal language network to accommodate encoding the types of “trees” found in computer code. Such repurposing may even involve sub-specialization of some subset of the language network for representing programming code, akin to the development of the visual word form area (VWFA) in the ventral stream during literacy acquisition (Dehaene-Lambertz, Monzalvo, & Dehaene, 2018; Dehaene, Cohen, Morais, & Kolinsky, 2015; Dehaene, Le Clec’H, Poline, Le Bihan, & Cohen, 2002; Dehaene et al., 2010; McCandliss, Cohen, & Dehaene, 2003; Szwed, Vinckier, Cohen, & Dehaene, 2012). Potentially consistent with the idea that the language network represents the hierarchical trees of computer code, Srikant et al. (2022) found that the language network is sensitive to the number of nodes in the abstract syntactic tree of a program. However, it is also likely that more complex programs require more complex linguistic descriptions, so these data are also consistent with the weak recycling view.

At present, we favor the “weak” recycling view for several reasons. Given the similarity of natural languages to each other and their collective difference from programming languages, it seems likely that the language system is particularly suited for representing not just any “hierarchical tree” but specifically the types of trees that are found in natural languages. Even if the language network were capable of sufficient plasticity to accommodate trees found in computer code, coding is acquired well past the critical period of language acquisition. These considerations lead us to favor the “weak” recycling hypothesis, perhaps better termed “reuse”, for the relationship of language and code.

It is also worth noting that prior evidence suggests that other hierarchical cultural symbol systems such as formal logic, mathematics, and music do not “recycle” the language network but instead rely on fronto-parietal circuits (Dehaene et al., 2022; Fedorenko & Varley, 2016; Monti, 2017; Monti et al., 2009, 2012; Monti & Osherson, 2012). However, most of these studies used univariate approaches. It would be interesting to ask whether the contents of math and formal logic expressions could be decoded in language networks, despite a low univariate response. If so, it might suggest that reuse of the language network is a widespread phenomenon across symbol systems.

5.2. Open questions and future directions

Many open questions remain regarding the role of the language system in code comprehension. One is whether the involvement of the language system generalizes to programming languages that are less language-like than Python, the programming language studied here. Specifically, does the language system represent program-relevant information only when the code is language-like? In the current study and in Srikant et al. (2022), the programming language presented to participants was Python, which is among the more language-like high-level programming languages, in comparison with other major programming languages like C++ or Java. It is not known whether the language network encodes code-relevant information when the “code” does not consist of linguistic symbols.

A study by Ivanova et al. (2020) presented participants with ScratchJr programs in addition to Python programs. ScratchJr is a child-oriented graphic-based programming “language”, where a program is “written” as a series of interlocked tiles containing icons and numbers. Whereas Ivanova et al. (2020) reported a slight preference for Python programs over lists of non-words in the language network, such an effect was not observed with ScratchJr “programs”. However, Ivanova et al. (2020) did not test for multivariate decoding of ScratchJr programs in language regions. ScratchJr provides an interesting test case to be used in future studies.

Another open question concerns the neural flow of information during program comprehension, akin to what has been proposed for the comprehension of natural languages (Price, 2010, 2012). Based on our findings, we propose that during comprehension of programs, information flows from visual regions to language circuits followed by fronto-parietal reasoning systems, with feedback from later to earlier regions throughout the process. However, it warrants further investigation whether the language system is behaviorally necessary, or merely active, during code comprehension. It is possible that the logical reasoning system alone is sufficient to parse a program and construct the representations of the actual algorithms, whereas the surface-level representation constructed by the language network only facilitates the comprehension process. The necessity of the language network could be studied by conducting code comprehension experiments while inhibiting the language system using transcranial magnetic stimulation (TMS), by overloading the language system with a verbal shadowing task, or by testing aphasic programmers.

Moving forward, as we delve deeper into understanding the role of the language system in code comprehension, it becomes evident that future investigations will not only illuminate the necessity of this system but also provide valuable insights into the mental processes involved in programming. Furthermore, these inquiries will help us unravel the intricate relationship between language and abstract symbolic reasoning, ultimately shedding light on the intriguing concept of neural recycling.

CRediT authorship contribution statement

Yun-Fei Liu: Writing – review & editing, Writing – original draft,


Visualization, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Colin Wilson: Writing – review & editing, Resources, Methodology, Conceptualization. Marina Bedny: Writing – review & editing, Supervision, Resources, Project administration, Methodology, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Appendix A. Supplementary material

Supplementary material to this article can be found online at https://doi.org/10.1016/j.bandl.2024.105392.