Natural Language Reasoning, A Survey

FEI YU, HONGBO ZHANG, The Chinese University of Hong Kong, Shenzhen, China
PRAYAG TIWARI, School of Information Technology, Halmstad University, Sweden
BENYOU WANG∗ , The Chinese University of Hong Kong, Shenzhen, China
This survey paper proposes a clearer view of natural language reasoning in the field of Natural Language Processing (NLP),
both conceptually and practically. Conceptually, we provide a distinct definition for natural language reasoning in NLP,
based on both philosophy and NLP scenarios, discuss what types of tasks require reasoning, and introduce a taxonomy of
reasoning. Practically, we conduct a comprehensive literature review on natural language reasoning in NLP, mainly covering
classical logical reasoning, natural language inference, multi-hop question answering, and commonsense reasoning. The
paper also identifies and reviews backward reasoning, a powerful paradigm for multi-step reasoning, and introduces defeasible
reasoning as one of the most important future directions in natural language reasoning research. We focus on single-modality
unstructured natural language text, excluding neuro-symbolic techniques and mathematical reasoning.1
ACM Reference Format:
Fei Yu, Hongbo Zhang, Prayag Tiwari, and Benyou Wang. 2023. Natural Language Reasoning, A Survey. 1, 1 (May 2023),
36 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

∗ Corresponding author
1 https://github.com/FreedomIntelligence/ReasoningNLP

Authors’ addresses: Fei Yu, Hongbo Zhang, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China, feiyu1@link.cuhk.edu.cn,
hongboz183@gmail.com; Prayag Tiwari, School of Information Technology, Halmstad University, Halmstad, Sweden, prayag.tiwari@ieee.org;
Benyou Wang, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China, wangbenyou@cuhk.edu.cn.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
© 2023 Association for Computing Machinery.
XXXX-XXXX/2023/5-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn


1 INTRODUCTION
Natural Language Processing (NLP) has shown significant advancements in recent years, particularly with the
introduction of transformers and pre-trained language models (PLMs). However, their abilities2 to perform
natural language reasoning (NLR) are still far from satisfactory. Reasoning, the process of making inferences
based on existing knowledge, is a fundamental aspect of human intelligence and is essential for complex tasks
such as decision-making. Building an artificial intelligence system capable of reasoning is both the ultimate
goal of the research community and the necessary way to improve the performance of complex applications.
Compared to reasoning with formal language, reasoning with natural language expressions provides a more natural human-computer interaction interface and opens the door to research on defeasible reasoning, such as abduction and induction, which formal symbolic methods cannot handle.
PLMs such as BERT [34] and GPT [115] have been essential components of NLP research since their emergence. Pre-trained on large-scale text corpora, PLMs are capable of natural language understanding. Recent progress suggests that PLMs also have the potential to solve reasoning problems [25, 141, 145, 158]. Specifically, PLMs can perform soft deductive reasoning over natural language statements [25], reason with implicit knowledge memorized in their parameters [145], and, when the model size is large enough, perform multi-step reasoning step by step with only a few demonstrations or instructions via chain-of-thought prompting [77, 158]. Recently, ChatGPT and GPT-4 have also demonstrated impressive reasoning capabilities [4, 16].
However, while reasoning has attracted increasing attention recently [25, 27, 28, 77, 107, 143, 158], there is still no distinct definition of reasoning, and the term "reasoning" is sometimes used loosely, which may hinder communication and progress on reasoning in the NLP community. For example, although it is labelled "commonsense reasoning", few people would consider recalling a shared lived experience [10], e.g. "name something that you might forget in a hotel room", to be reasoning. Another example is that "natural language inference" is sometimes introduced as a natural language understanding task [12] and at other times as a reasoning task [25]. In short, not all tasks named with "reasoning" are actually regarded as reasoning (e.g. some commonsense reasoning tasks), and not all tasks whose names lack "reasoning" are regarded as non-reasoning (e.g. natural language inference and multi-hop question answering). This raises a question: what is reasoning actually, and how can we identify reasoning tasks when their names are not indicative? Although many studies [25, 58, 167, 174] refer to a definition of reasoning from philosophy and logic, that definition does not capture reasoning in NLP well enough. For example, while reasoning is philosophically defined as "using evidence and logic to arrive at conclusions" [58], it fails to clarify whether implicit commonsense knowledge can serve as evidence and what types of conclusion count as reasoning products, e.g. what about named-entity disambiguation?
To promote the research on reasoning in NLP, we make an attempt to propose a clearer view of NLP reasoning,
both conceptually and practically. Conceptually, we propose a definition for NLP reasoning based on both
philosophy and NLP scenarios, discuss what types of tasks require reasoning, and introduce a taxonomy of
reasoning. Practically, we provide a comprehensive literature review on natural language reasoning in NLP based
on our clarified definition, mainly covering classical logical reasoning, natural language inference, multi-hop
question answering, and commonsense reasoning. Reviewing papers covering PLMs of all sizes, we capture general methodologies that can be applied across model sizes: end-to-end reasoning, forward reasoning, and backward
reasoning. Finally, we discuss some limitations and future directions of reasoning.
In addition to the definition of reasoning, there is an important point distinguishing this survey from the other
surveys [58, 110, 168]: we identify and review backward reasoning, another powerful paradigm for multi-step
reasoning in addition to forward reasoning. While forward reasoning, such as chain-of-thought prompting, has
been popular in LLMs recently, we argue that it is worth conducting more exploration of backward reasoning.

2 In this survey, we refer to transformer-based pre-trained language models.


Backward reasoning is more efficient than forward reasoning both conceptually and empirically due to its smaller search space [72], and thus has the potential to generalize to complex reasoning with longer step chains.
In this article, we focus on the single-modality unstructured natural language text (without knowledge triples,
tables and intermediate formal language) and natural language reasoning (rather than symbolic reasoning and
mathematical reasoning)3. Concretely, we conduct a review of related works that utilize transformer-based PLMs, with a deliberate exclusion of neuro-symbolic techniques. We sort the collected papers, categorise the methodologies of natural language reasoning in NLP, and identify the progress and trends of recent years in this domain. The paper is organized into five sections (as shown in Figure 1).
We collected more than two hundred papers related to reasoning or PLMs in recent years. We searched
keywords such as inference, reasoning, infer, reason, multi-step, and multi-hop in top conferences, including ACL, EMNLP, NAACL, ICML, ICLR, and NeurIPS, from 2019 to 2022. We also found additional related works through the collected papers.
In conclusion, the main contributions of this survey are:
(1) To the best of our knowledge, we are the first to provide a distinct definition for natural language reasoning in
NLP and to discuss to what degree some popular benchmarks are related to reasoning.
(2) To the best of our knowledge, we are the first to conduct a comprehensive review on PLM-based natural language
reasoning, covering diverse NLR benchmarks and providing a comprehensive taxonomy of methodologies.
We also cover backward reasoning, which is neglected but has potential.
(3) We introduce defeasible reasoning, which we believe is one of the most promising future directions, compare
the differences between deductive reasoning and defeasible reasoning, discuss how they affect NLP solutions,
and review current methods.

2 WHAT IS NATURAL LANGUAGE REASONING


There is still no distinct definition of natural language reasoning in NLP, which affects the development and communication of NLR in the NLP community. To promote understanding, analysis and communication, we aim to suggest distinct definitions of terms and concepts for natural language reasoning in NLP. To realize this goal, we look into two relevant areas that have studied reasoning for a long time, philosophy and logic, and transfer the relevant reasoning theory into NLP. First, we propose a definition for NLR in NLP that
satisfies the concerns of the NLP community (Sec 2.1). Then, we provide categories of NLR and introduce how
the differences between them can affect NLP solutions (Sec 2.2). Finally, we introduce the potentials, challenges,
and requirements to achieve NLR (Sec 2.3).

2.1 Definition
Reasoning has only recently become a focus in NLP, whereas philosophy has studied reasoning for thousands of years, and logic is seen as the art of correct reasoning, which studies the concepts of inference, systematizes its categories, and develops principles of good reasoning, including formal logic and informal logic [9, 46, 63]. In this section, we first introduce reasoning theory from philosophy and logic and transfer it to NLP reasoning. Then, we review some natural language reasoning topics in NLP. Finally, we propose a definition for reasoning in NLP, which combines the definitions from philosophy and logic with the concerns of the NLP community.
2.1.1 Definition from philosophy and logic. Here we introduce two descriptions and three definitions of reasoning
from philosophy and logic: task-based description (Description 2.1), negation-based description (Description 2.2),
logic-based definition (Definition 2.1), assertion-based definition (Definition 2.2), and action-based definition
3 Although recently it is popular to solve mathematical reasoning problems such as math word problems using NLP methods, we do not cover
them in this paper since mathematical reasoning is very different from natural language reasoning in nature, as math is precise and formal.


[Figure 1 shows the architecture of this survey: § 2 What is Reasoning (Definition; Categories; Potentials, Challenges and Requirements); § 3 Why PLMs for Reasoning (Introduction to PLMs; Empirical Development); § 4 Methodologies of NLR (End-to-End Reasoning; Forward Reasoning; Backward Reasoning; Summary); § 5 NLR Benchmarks (Classical Logical Reasoning; Natural Language Inference; Multi-Hop QA; Commonsense Reasoning; Complex Reasoning; Others); § 6 Discussion (Open Question; Limitations; Future).]

Fig. 1. Architecture of this survey.

(Definition 2.3). The former two descriptions can tell us “what reasoning can do” and “what isn’t reasoning”,
while the latter three provide different definitions of "what is reasoning". However, the definition from logic
(Definition 2.1) restricts reasoning to a subset within the coverage of formal logic. To reach a more generalized


Fig. 2. Timeline of important works.

definition, we adopt the latter two definitions from philosophy, which are two different classes named theoretical
reasoning and practical reasoning, respectively, as the basis for defining natural language reasoning in NLP.
Description 2.1 (task-based). Reasoning is an essential mental activity when conducting conscious tasks with
complex computations such as problem-solving, decision-making, persuasion, and explaining [3, 41, 47, 71].
Description 2.2 (negation-based). Reasoning is a dynamic process to get some knowledge without direct recourse
to sense perceptions or immediate experience, which is opposed to sensation, perception and feeling [3, 14, 155].
Definition 2.1 (logic-based reasoning). Reasoning is to discover valid conclusions by applying logic [3, 14, 41,
91, 155].
Definition 2.2 (assertion-based reasoning / theoretical reasoning). Reasoning is to infer conclusions from a set
of premises, consisting of one or more inference steps, where premises and conclusions are assertions that claim
something is true or false about the world [3, 9, 14, 123, 155].
Definition 2.3 (action-based reasoning / practical reasoning). Practical reasoning is to infer actions from goals
and knowledge, which is oriented to deciding whether an action is practically reasonable [9, 155].
2.1.2 The definition in NLP we suggest. From Definition 2.2, Definition 2.3 and the negation-based Description 2.2, we can understand "what is reasoning" and "what isn't reasoning" from the perspective of philosophy. There are also some descriptions addressing these two questions in NLP. We compare and combine them in Table 1. We also review typical natural language reasoning datasets in NLP to observe and capture what the NLP community is concerned about.
From our observations, natural language reasoning in NLP also combines multiple pieces of knowledge to derive conclusions. Its unique characteristics concern (1) knowledge sources and (2) conclusion types. Firstly, common knowledge sources are knowledge bases, context, and PLMs, where the former two explicitly provide encyclopedic and contextual knowledge, while the last is an implicit knowledge source. Secondly, in addition to assertions and actions, it is also popular to infer relations of events, e.g. causes and effects. We demonstrate examples of these three conclusion types in Table 2.

            | What is Reasoning                               | What isn't Reasoning
Philosophy  | infer a new assertion from a set of assertions; | sensation, perception and feeling;
            | infer an action from goals and knowledge        | direct recourse to sense perceptions or immediate experience
NLP         | more than understanding, slow thinking          | memorize, look up, match information
            | (e.g. multi-hop QA, commonsense reasoning)      | (e.g. text summarization, style transfer)
Combination | a dynamic process to integrate multiple knowledge to get new conclusions,
            | rather than direct recourse to memorized or provided first-hand information

Table 1. Comparison and combination of descriptions about reasoning from philosophy and NLP.


          | Premises                                                          | Conclusion
Assertion | Cats are animals. Animals can breathe.                            | Cats can breathe.
Event     | John was shot. There are people around. Doctors can save lives.   | John will be sent to see a doctor.
Action    | Mary is in the living room. Mary feels hot. The remote control    | Go to the bedroom, take the remote control,
          | for the air conditioner is in the bedroom.                        | come back and turn on the air conditioner.

Table 2. Three types of conclusion in reasoning, where "assertion" and "event" assume something true or likely to be true in the world.

Correspondingly, we propose the definition of NLP reasoning in Definition 2.4 and suggest "what isn't reasoning in NLP" and "what NLP reasoning can do" in Description 2.3 and Description 2.4. It should be emphasized that conclusions are new (or unknown) assertions, events, or actions, which distinguishes reasoning from other knowledge-intensive tasks that may also require multiple pieces of knowledge. To better demonstrate the definition, we explain why some knowledge-intensive datasets are not reasoning in Table 3.
Definition 2.4 (NLP reasoning). Natural language reasoning is a process of integrating multiple pieces of knowledge (e.g. encyclopedic knowledge and commonsense knowledge) to derive new conclusions about the (realistic or hypothetical) world. Knowledge can come from both explicit and implicit sources. Conclusions are assertions or events assumed to be true in the world, or practical actions.
Description 2.3 (NLP negation-based). Natural language reasoning is to derive new assertions, events, or actions without direct recourse to models' memorization, knowledge base storage, or the provided context.
Description 2.4 (NLP task-based). Reasoning is an important method to arrive at required answers or solutions. It is effective when what we need is neither provided by the context, memorized by models, nor stored in knowledge bases, but is reachable by integrating the available information.

Task                   | Task type                   | Why not reasoning
CoNLL [55]             | entity linking              | just aligns known entities, without producing new assertions, events, or actions
CommonGen [85]         | constrained text generation | generates text, but neither true assertions or events, nor actions
Natural Questions [78] | open-domain QA              | the answer can be simply matched

Table 3. Examples to explain what is not reasoning.

2.1.3 Key Concepts. We first introduce the key concepts of proposition and inference. As before, we transfer the definitions from philosophy and logic to NLP. Then, we further clarify the definition of reasoning in NLP.
Definition of key concepts. In logic, the proposition is the basic operation unit in reasoning, and inference is
a sub-process of a complete reasoning process. Concretely, while reasoning is performed with statements (as
premises and conclusions), the real operation units are the semantics behind sentences, i.e. propositions [63].


Inference is a single step in reasoning [9, 13, 51, 123, 155], and each reasoning process can consist of one or more inference steps (Definition 2.2). We transfer these two key concepts to NLP in Definition 2.5 and Definition 2.6.
Definition 2.5 (NLP proposition). A proposition is the semantic meaning or information content of a statement
rather than its superficial linguistic form.
Definition 2.6 (NLP inference). Inference is a single step that produces a single (intermediate) conclusion from
some premises.

Fig. 3. Reasoning process. The premises can be either explicit or implicit knowledge, e.g. PLMs’ memory.

Further clarifying the definition of NLP reasoning. We leverage these concepts to clarify the definition of NLP reasoning: what we mean by "integrating multiple pieces of information to derive new conclusions" is that (1) a single sentence conveying multiple semantics can provide multiple premises, and (2) inference and reasoning must yield new semantics, i.e. conclusions are semantically different from all premises. We detail two examples to demonstrate this key idea (Fig 4) and illustrate the definition of reasoning in Fig 3.

Fig. 4. Examples to show the key idea of "semantic difference", where a check mark denotes reasoning and a cross denotes non-reasoning.


2.2 Categories of Inference


While knowledge has been well categorised in NLP (e.g. explicit world knowledge and implicit commonsense knowledge), we find that there is still a lack of a reasonable taxonomy for inference. Therefore, we borrow the categories from philosophy, transfer them to NLP, and discuss how the differences between classes affect NLP solutions.
Inference can be divided (mainly) into deductive, inductive and abductive [43, 103], or into monotonic and defeasible. In fact, deduction is monotonic inference, while induction and abduction are sub-classes of defeasible inference. Since "monotonic" and "defeasible" capture the difference between deductive and non-deductive inference, we combine the two taxonomies into one: deductive inference and defeasible inference.
2.2.1 Deduction, Induction, and Abduction. According to Aristotle and Peirce, there are three major types of inference: deduction, induction, and abduction [43, 103]. This taxonomy is the most familiar one to the NLP community,
adopted and studied by several works [7, 25, 49, 100, 137, 160, 167, 174]. The definitions are shown below (Table 4
shows examples).
Definition 2.7 (Deduction). A deductive inference is to infer valid knowledge (conclusion) from the given
knowledge (premises).
Definition 2.8 (Induction). An inductive inference is to infer probable knowledge, which describes a more
general rule, extrapolated from the given knowledge.
Definition 2.9 (Abduction). An abductive inference is to infer probable knowledge, as the best explanation (i.e.
cause), for the given knowledge (i.e. phenomena).

Given: Fact1: Aristotle is a human.   Rule: All humans will die.   Fact2: Aristotle will die.

Deduction: Fact1 + Rule → Fact2 (Fact2 is inferred from Fact1 and the Rule)
Abduction: Fact1 + Rule ← Fact2 (Fact1 is inferred as the explanation, given the Rule and Fact2)
Induction: Fact1 + Fact2 → Rule (the Rule is inferred from Fact1 and Fact2)

Table 4. A simple example to show the difference between deduction, abduction and induction; the inferred item in each case is indicated in parentheses. "Fact" denotes specific knowledge while "rule" denotes a general principle.

However, among these three classes, research on abduction and induction is much less explored than deduction, while the widely studied deduction covers only a very small part of human daily reasoning.
2.2.2 Deductive Inference and Defeasible Inference. Our main goal is to promote research on non-deductive
reasoning and highlight the differences and challenges. Therefore, we turn to monotonic inference and defeasible inference, which better capture the features of deductive and non-deductive inference, respectively.
Key difference. The key difference between monotonic inference and defeasible inference from philosophy is that the former derives valid conclusions4 while the latter only produces probable conclusions. Since the conclusions of deductive inference are truth-preserving, i.e. knowledge added later will not affect their validity, the set of knowledge grows incrementally, i.e. monotonically. By contrast, the conclusions of non-deductive inference (e.g. induction and abduction) may be wrong, and newly added knowledge may retract a conclusion, i.e.
4 “Valid” means when the premises are true, the conclusion is impossible to be false.


defeasible. For example, one may inductively infer “birds can fly” with the premises “parrots can fly” and “eagles
can fly”. However, when he or she discovers the new knowledge “ostrich cannot fly”, the conclusion will be
retracted.
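To make the retraction behaviour concrete, the toy sketch below (our own illustration, not a system from the literature) implements the "birds can fly" example: the inductive conclusion holds only until a counterexample is added to the knowledge set.

# Toy illustration of defeasibility: an inductive conclusion is retracted
# once new, conflicting knowledge arrives (a deductive conclusion would not be).

knowledge = {"parrots can fly", "eagles can fly"}

def induce_birds_can_fly(kb):
    # Naive induction: generalize "birds can fly" only if no known bird is a counterexample.
    return not any("cannot fly" in fact for fact in kb)

print(induce_birds_can_fly(knowledge))   # True  -> conclude "birds can fly"
knowledge.add("ostriches cannot fly")    # new knowledge arrives
print(induce_birds_can_fly(knowledge))   # False -> the conclusion is retracted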
Different characteristics. This difference in conclusions between deductive inference and defeasible inference leads to many different characteristics, including the inference relations between premises and conclusions, the quality of inference, and the required knowledge. Concretely, in deductive inference there is only one inference relation between the premises and each conclusion, i.e. support, and the inference is either valid or invalid. Therefore, we can derive a valid conclusion with just a few supporting premises. By contrast, in defeasible inference knowledge can strengthen, weaken and even rebut (the probability of) the conclusion, and the quality of inference varies from weak to strong. Therefore, it is better to collect more comprehensive information to arrive at a more probable conclusion. We compare the characteristics of deductive inference and defeasible inference in Table 5.

                     | Deductive Inference | Defeasible Inference
Conclusion           | true                | probably true
Inference relation   | support             | strengthen, weaken, rebut
Quality of inference | valid or invalid    | weak to strong
Required knowledge   | bounded             | unbounded

Table 5. The characteristics of deductive inference and defeasible inference.

Effects on NLP. These characteristics affect relevant knowledge acquisition, reasoning path structure, and the importance of interpretability in NLP. Firstly, while collecting the knowledge supporting the valid conclusion is enough for deductive reasoning, for defeasible reasoning it is better to collect both supporting and opposing knowledge to compare the confidence of different conclusions. Secondly, there has been increasing attention on reasoning path generation in NLP [126, 143, 158]; however, due to more types of inference relation, the structure of reasoning paths for defeasible reasoning is more complex than for deductive reasoning and thus more challenging to generate. Finally, it is more important, and sometimes even crucial, for NLP models to perform interpretable defeasible reasoning. This is because people with different background knowledge can infer very different and even opposite conclusions by themselves, so it is much more difficult to justify a conclusion without explicit premises and a reasoning procedure.

2.3 Potentials, Challenges, and Requirements of NLR


Potentials. Compared to reasoning with precise formal language, natural language provides a better human-computer interaction interface. Besides, natural language opens the door to defeasible reasoning, where formal language fails.
Challenges. Firstly, natural language suffers from ambiguity and variety, since there are polysemy, synonymy and diverse structures. Therefore, while triples and formal languages are precise, the mapping between statements and propositions is many-to-many in natural language, which poses a challenge for natural language understanding. Secondly, supervised data for inference is difficult to obtain, which may preclude large-scale training. Moreover, the number of reasoning steps varies at the instance level, i.e. different questions may require different numbers of inference steps to answer, and it is important to generalize to unseen step counts.


Requirements. Based on the definition (Definition 2.4), the key components for achieving reasoning in NLP are (1) (multiple pieces of) knowledge and (2) an algorithm capable of understanding and inference. Correspondingly, there are three stages: knowledge acquisition, knowledge understanding, and inference. Firstly, the relevant knowledge required for reasoning must be collected (knowledge acquisition). Then, the algorithm needs to capture the propositions underlying the given knowledge (knowledge understanding). In addition to general semantics, it should also capture logical semantics such as negation, conjunction and disjunction. Subsequently, starting from these propositions, the algorithm needs to integrate knowledge to infer new conclusions, taking one or more steps to reach the final answer (inference). Though knowledge acquisition and understanding are also necessary for reasoning, these two topics are large enough to warrant surveys of their own, so we focus only on inference in this article.

3 WHY PLMS FOR NATURAL LANGUAGE REASONING


3.1 Introduction to PLMs
Pre-trained language models (PLMs) are based on the transformer architecture [153], which is built from many attention modules, and are pre-trained on massive amounts of text data via unsupervised objectives such as predicting masked tokens [34] or generating the next tokens [115]. Since the emergence of BERT [34], pretraining-then-finetuning has become a common paradigm, which transfers the general abilities PLMs learn during pretraining to downstream tasks with further task-specific finetuning. Since large language models have been found to be few-shot learners [15], in-context learning has become a new popular paradigm, which predicts a new sample with only a few demonstrations and without finetuning parameters. Recently, the zero-shot prompting paradigm has also become popular with LLMs [77].

Types of PLMs. According to the architecture, PLMs can be divided into encoder-only (e.g. BERT [34]), decoder-only (e.g. GPT [115]) and encoder-decoder (e.g. T5 [116]) models. According to directionality, PLMs can be divided into bidirectional (encoder-only) and causal (decoder-only and encoder-decoder) models: while bidirectional PLMs are commonly used for discriminative tasks, causal PLMs can model general tasks but are more capable of generative tasks. According to model size, there are medium-size PLMs and large language models, where LLMs are much larger than the former (e.g. 13B parameters).
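As a concrete illustration of the three architecture types, the minimal sketch below loads one representative of each with the Hugging Face transformers library; the checkpoints are common examples, not models evaluated in this survey.

from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                          AutoModelForSeq2SeqLM, AutoTokenizer)

# Encoder-only (bidirectional), typically finetuned for discriminative tasks.
encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Decoder-only (causal), typically used for generation and prompting.
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder, which frames tasks as text-to-text generation.
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

tokenizer = AutoTokenizer.from_pretrained("gpt2")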

Advantages of PLMs for NLR. We summarize four advantages of PLMs for NLR.

• Ability of natural language understanding. Transformers represent words and sentences in a context-dependent manner as continuous vectors in a high-dimensional space, dealing with ambiguity and uncertainty naturally. After large-scale pretraining, PLMs acquire a powerful understanding capability, which helps them capture and understand the knowledge mentioned in text.
• Ability to learn implicit knowledge into parameters. It has been found that PLMs can capture some implicit knowledge that is not explicitly mentioned, such as commonsense knowledge, in their parameters. This is important since it is impossible to explicitly enumerate and provide all the commonsense knowledge needed for reasoning.
• Ability of in-context learning. LLMs such as GPT-3 exhibit the impressive ability to perform tasks with only a few demonstrations and without further fine-tuning, which is valuable for alleviating data sparsity problems.
• Emergent abilities. Recently, it was found that LLMs have some emergent abilities that only appear when the model size is large enough [157], and LLMs can perform much more complex tasks as their size increases. Moreover, it has been demonstrated that performing multi-step reasoning in a few-shot or zero-shot manner is one of these emergent abilities [158].


[Figure 5 shows the taxonomy of natural language reasoning methods: End-to-End Reasoning (§ 4.1), covering specialized models, vanilla medium-size PLMs, vanilla decoder-only LLMs, and specialized pretraining; Forward Reasoning (§ 4.2); and Backward Reasoning (§ 4.3), covering backward chaining and question decomposition.]

Fig. 5. Taxonomy of natural language reasoning.

3.2 Empirical Development


Recent progress also shows the potential of leveraging PLMs for natural language reasoning, exhibiting their ability to learn and generalize reasoning skills with both explicit and implicit knowledge.
By finetuning on specific datasets, [25] first demonstrated that PLMs can perform deductive reasoning over explicitly provided natural language statements and can zero-shot transfer to different domains. Moreover, [145] showed that PLMs can combine memorized implicit taxonomic and world knowledge with explicitly provided knowledge for deduction. In addition to deduction, PLMs can also learn to perform defeasible reasoning [122, 167, 174].
While LLMs with in-context learning were once thought to be incapable of multi-step reasoning, it has been found that their reasoning capabilities can be unlocked by generating forward reasoning paths before the final answer [158], which is called Chain-of-Thought (CoT) prompting. With this prompting, performance on many multi-step reasoning tasks in BIG-Bench Hard can surpass the average human rater. Furthermore, LLMs can perform multi-step reasoning not only with few-shot exemplars: [77] also found that they can automatically produce intermediate steps with a simple "Let's think step by step" prompt in a zero-shot manner. Surprisingly, LLMs can even learn from their self-generated reasoning paths [59, 178]. Moreover, GPT-4 outperformed a majority of people on several realistic examinations, such as the Uniform Bar Exam, which also require some reasoning.
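For illustration, the sketch below contrasts the two prompting styles mentioned above. The question and demonstration are made-up examples of implicit commonsense questions, and llm stands in for any large causal PLM's text-generation call.

def llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for an actual LLM generation call

# Few-shot chain-of-thought: the demonstration contains intermediate reasoning steps.
few_shot_cot = (
    "Q: Could a penguin survive a summer in the Sahara without help?\n"
    "A: Penguins are adapted to cold climates. The Sahara is an extremely hot desert. "
    "So a penguin could not survive there without help. The answer is no.\n\n"
    "Q: Would a vegetarian order a steak at a restaurant?\n"
    "A:"
)

# Zero-shot chain-of-thought: a single trigger phrase elicits the intermediate steps.
zero_shot_cot = "Q: Would a vegetarian order a steak at a restaurant?\nA: Let's think step by step."

# prediction = llm(few_shot_cot)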
In addition to forward reasoning paths, question decomposition, a backward reasoning method, is also effective for multi-hop question answering, benefiting both medium-size PLMs [97, 102] and LLMs [102, 107]. Moreover, while neural methods are often criticized for black-box prediction, [27, 130] demonstrated that PLMs can produce faithful reasoning paths and make predictions based on them.
In conclusion, PLMs can learn to perform multi-step reasoning from supervised data or few-shot demonstrations. Their capabilities of natural language understanding, generalization, and leveraging implicit knowledge make them promising for dealing with arbitrary natural language, commonsense knowledge and defeasible reasoning.

4 METHODOLOGIES OF NLR
In this section, we introduce three types of natural language reasoning approaches: end-to-end reasoning (Sec 4.1), forward reasoning (Sec 4.2), and backward reasoning (Sec 4.3). The overall taxonomy is shown in Figure 5.
The key difference among these three categories lies in the reasoning path. Concretely, end-to-end reasoning only predicts the final answer without any intermediate text, while the latter two approaches produce reasoning paths containing one or more steps with intermediate conclusions, showing the process of (possibly multi-step) reasoning that links premises to the conclusion5.
5 There is also some research on producing natural language explanations instead of the reasoning procedure, but we focus only on reasoning paths in this survey.


Presenting the reasoning path for each prediction can improve the interpretability of a system. In particular, a strict reasoning path can also explicitly expose the supporting knowledge of each step. Moreover, producing reasoning paths has been demonstrated to be beneficial to the final performance of multi-step reasoning [77, 102, 107, 141, 158].
Two directions of reasoning. Multi-step reasoning can be performed either forward [28, 130, 142, 158] or backward [74, 83, 97, 107, 143]. Forward reasoning is a bottom-up procedure, which starts from the existing knowledge and repeatedly makes inferences to obtain new knowledge until the problem is solved. Backward reasoning, by contrast, is a top-down procedure, which starts from the problem and repeatedly breaks it down into sub-problems until all of them can be solved with the existing knowledge. While backward reasoning targets the specified problem, forward reasoning can freely uncover new knowledge implied by the existing knowledge without preassigned problems. Accordingly, the search space of forward reasoning is much larger than that of backward reasoning when solving a specific problem, facing combinatorial explosion as the number of inference steps grows. In theorem proving, a verification problem where the reasoning path is called a "proof", forward reasoning and backward reasoning are often called "forward chaining" and "backward chaining", respectively.
We compare these three methods in Table 6 and demonstrate an example in Figure 6. The following subsections further introduce and discuss the comparison.

                     | Direction | Pros                         | Cons
End-to-End Reasoning | -         | most efficient               | black box; bad generalization
Forward Reasoning    | bottom-up | interpretability; open-ended | huge search space; only effective in LLMs
Backward Reasoning   | top-down  | interpretability; efficient  | goal-specific

Table 6. Comparison of end-to-end reasoning, forward reasoning, and backward reasoning.
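The toy Python sketch below (our own illustration over hand-written "if premises then conclusion" rules) makes the directional difference concrete: forward reasoning derives everything that follows from the facts, while backward reasoning explores only what is needed to prove a given goal.

RULES = [
    ({"Aristotle is a human", "all humans are mortal"}, "Aristotle is mortal"),
    ({"Aristotle is mortal", "mortals eventually die"}, "Aristotle will die"),
]
FACTS = {"Aristotle is a human", "all humans are mortal", "mortals eventually die"}

def forward_reasoning(facts, rules):
    # Bottom-up: keep adding conclusions whose premises are all already derived.
    derived, changed = set(facts), True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

def backward_reasoning(goal, facts, rules):
    # Top-down: prove the goal by recursively proving the premises of some rule.
    # (Cycles and disproof are not handled in this sketch.)
    if goal in facts:
        return True
    return any(all(backward_reasoning(p, facts, rules) for p in premises)
               for premises, conclusion in rules if conclusion == goal)

print("Aristotle will die" in forward_reasoning(FACTS, RULES))   # True
print(backward_reasoning("Aristotle will die", FACTS, RULES))    # True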

4.1 End-to-End Reasoning


End-to-end reasoning is a complete black-box prediction that only outputs the final answers without any explanation, intermediate conclusion, or reasoning path, whether the problem requires single-step or multi-step reasoning. Three kinds of models are mainly used to perform end-to-end reasoning: specialized models built upon medium-size PLMs, vanilla medium-size PLMs, and decoder-only LLMs. Besides, there is also some research on specialized pretraining methods.
4.1.1 Training specialized models. To perform end-to-end reasoning, models need to aggregate multiple pieces of knowledge and reason over them. Correspondingly, there are specialized models improving the capability of multiple-evidence aggregation [36, 93, 183, 185] or reasoning [40, 84, 113, 170, 187]. Previous research often incorporated task-specific inductive biases via architectural designs. For example, graph neural networks are popularly used to leverage edges (e.g. entity-entity relations) to promote information aggregation and integration between nodes (e.g. entity information) [40, 113, 170]. However, these designs only specialize in specific tasks or datasets. By contrast, ReasonFormer [188] proposed a transformer variant for general reasoning, with different modules responsible for different predefined fundamental reasoning capabilities. Such models can improve performance on specific tasks or datasets. Nevertheless, all of these designs rely heavily on handcrafted architectures and introduce strong prior assumptions, which may hurt generalization to other tasks.


Fig. 6. An example demonstrating the reasoning procedure of end-to-end reasoning, forward reasoning, and backward reasoning. The question is shown in white, intermediate text in green, and the answer in orange.

4.1.2 Finetuning vanilla medium-size PLMs. Medium-size PLMs lack the ability to perform zero-shot reasoning
such as theorem proving, argument completion, commonsense reasoning, and abduction without training [6,
25, 174, 192]. Recently, it was found that transformers can be good soft deductive reasoners after in-domain
training [6, 25]. By contrast, it is more challenging to perform defeasible reasoning [122, 167, 174].

Deductive reasoning. Both bidirectional and causal PLMs have demonstrated the ability to learn deductive reasoning. [25] first found that BERT and RoBERTa (bidirectional PLMs) can perform theorem proving over synthetic natural language facts and rules after training. For causal PLMs, [6] demonstrated that GPT-2 can learn to reason over deductively valid arguments and is able to generalize from simple core schemes to some unseen composite schemes. However, there are two challenging problems in this paradigm: data sparsity and spurious correlations.
Due to data sparsity, many researchers resort to synthetic data, which is far from realistic settings [6, 25, 142]. Moreover, researchers demonstrated that RoBERTa trained on synthetic data fails to generalize to linguistic variations in theorem proving and commonsense reasoning [25, 192], which indicates that such models learn little of the general logical structure underlying the linguistic variations. While training on high-quality data can alleviate the spurious correlation problem [49], such data is difficult to annotate on a large scale. Although automatic data collection can obtain large-scale examples, it is restricted to limited reasoning types, depending on the designed heuristic methods [11].
On the other hand, PLMs have been found to learn spurious correlations in multi-hop reasoning, theorem proving and commonsense reasoning [96, 181, 192]. In other words, finetuning on specific tasks and datasets may lead models to overfit to the specific spurious correlations underlying them. Several researchers have tried to reduce artifacts in datasets, for example by adding adversarial data [68] or by carefully constructing new datasets [54, 151]. However, it is difficult to construct data without any artifact, and there may be statistical features inherent in the problem that cannot be avoided in principle [181].


Another line of work to alleviate shortcuts is reasoning path generation, which has received increasing attention and may encourage models to perform actual reasoning (Sec 4.2).
Defeasible reasoning. The capability of defeasible reasoning seems to be more challenging for vanilla medium-size PLMs to learn. Specifically, [122] demonstrated that the performance of BART-large and T5-large on a defeasible reasoning task, i.e. generating a statement that updates the strength of a probable conclusion, is far from satisfactory. There is a similar observation for inductive reasoning [167]. Besides, it is hard to generalize the ability of abduction learned from a synthetic dataset to unseen domains [174]6. While data sparsity is also a challenge for defeasible reasoning, how to better enable PLMs to perform defeasible reasoning remains an open problem.
4.1.3 Few-shot decoder-only LLMs. Few-shot prompting with decoder-only LLMs, without finetuning, can alleviate data sparsity and also prevent models from overfitting to specific tasks or datasets. However, the question remains whether models become more capable of reasoning as the model size increases.
Although performance on reasoning problems improves as the model size increases [28, 49, 122, 167], it is still unclear how much of this progress can be attributed to improved reasoning capability. [28] demonstrated that (deductive) reasoning problems are so challenging that scaling (of the Gopher family) improves performance much more slowly than on other tasks in BIG-bench, and that vanilla LLMs struggle with multi-step reasoning problems. [107] found that while LLMs memorize more factual knowledge as the model size increases, their ability to implicitly integrate knowledge for deduction does not seem to improve.
Surprisingly, more of the reasoning capabilities of LLMs can be elicited by chain-of-thought prompting, as introduced in Sec 4.2.
4.1.4 Specialized pretraining. To improve the reasoning capability of PLMs, there is some research on introducing inductive biases for reasoning during continual pretraining [33, 69, 105, 131]. There are type-specific inductive biases [69, 105] and type-agnostic inductive biases [33, 131]. For example, [33] incorporated the general inductive bias of reasoning over multiple long evidence texts, while [69] was mainly designed for relational reasoning. Inductive biases are introduced through reasoning-related data and training strategies. For example, [131] collected reasoning-related text that involves logical inference keywords and let models predict these keywords in a self-supervised manner. While such pretraining improves performance on multi-hop reasoning and logical reasoning problems, especially in the low-resource setting [33, 69, 131], all of these works focus on encoder-only PLMs, i.e. BERT and RoBERTa. Recently, [105] proposed a new line that leverages programs such as SQL to pretrain PLMs on synthesized (program, execution result) pairs. The results are inspiring: PLMs, including medium-size, large-size, encoder-only and encoder-decoder models, attain significant improvements in multi-hop reasoning and logical reasoning.
However, it is important to ask whether it is still beneficial to incorporate inductive biases into LLMs, or whether simply increasing the model size and pretraining on more data is enough to improve reasoning capability. In other words, can LLMs learn reasoning well enough from the current general pretraining alone? Perhaps LLMs have already learned powerful reasoning capabilities that just need to be elicited via smart prompting such as CoT [158].

4.2 Forward Reasoning


Forward reasoning repeatedly composes the existing knowledge to derive new knowledge until reaching the
answers. There are two kinds of benefits to producing a forward reasoning path: trustworthiness [27, 31, 130]
and performance improvement [28, 141, 158].
4.2.1 Trustworthiness. Showing how multiple pieces of knowledge interact and contribute to new conclusions can improve a system's interpretability. Furthermore, when the prediction is based on the reasoning procedure,
6 By contrast, the ability of deduction learned in the synthetic dataset can be generalized to other domains [25]


it can alleviate the widespread shortcut problem. To exhibit the structure of reasoning, involving the required
knowledge and their inference relations, reasoning paths are often represented as directed graphs or trees [31, 62, 92, 126]. Typically, each node represents one piece of knowledge and each edge represents the inference relation
between knowledge. For example, a single inference linking two premises to one conclusion can be represented
as two nodes linking to their shared parent node.
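A minimal sketch of this representation (our own illustration, in Python) is given below: each node carries one statement, its children are the premises that support it, and the leaves are given knowledge.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    statement: str                                         # the proposition at this node
    premises: List["Node"] = field(default_factory=list)   # children that support it

# Two premises linked to their shared parent node (the conclusion).
proof = Node(
    "Cats can breathe.",
    premises=[Node("A cat is an animal."), Node("Animals can breathe.")],
)

def print_path(node: Node, depth: int = 0) -> None:
    # Print the reasoning path as an indented tree, conclusion first.
    print("  " * depth + node.statement)
    for premise in node.premises:
        print_path(premise, depth + 1)

print_path(proof)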
Deductive reasoning. There is only one inference relation in deductive reasoning, i.e. support. To construct such an interpretable reasoning path, a system needs to find the relevant knowledge as premises and infer conclusions from them (inference). Since inference produces new knowledge from the given premises, it is usually implemented with vanilla generative PLMs [120, 130]. Instead of explicitly selecting or retrieving the required knowledge [126], some works put the context into the input and model both knowledge selection and inference as unified generation [142, 166]. However, this may generate hallucinations and invalid inferences. To alleviate this problem, [166] leveraged an additional verifier to score the validity. In addition to just improving the validity of the knowledge
node and inference relation edge, some researchers proposed performing faithful reasoning, which forces the
prediction to rely on reasoning paths. This is mainly realized by designing decoupled modular frameworks to
avoid shortcuts to irrelevant context [27, 56, 130]. For example, [27] iteratively performed knowledge selection
and inference alternately in a step-by-step manner, where each inference step only conditions on the currently selected knowledge to infer the conclusion, without seeing the question or the previous steps. Both supervised
modular frameworks based on medium-size PLMs [56, 130] and in-context learning modular frameworks based
on LLMs [27] have been explored to perform faithful reasoning. In addition to faithfulness, such step-decoupling
behaviors also bring other effects. On the one hand, it is easier to provide supervised training data or in-context
exemplars. The supervised framework can leverage one-step supervision [56, 130] to train the system, which alleviates the data sparsity problem in multi-step reasoning, while the in-context learning framework can demonstrate
representative one-step examples that avoid the challenge of selecting the appropriate exemplars for multi-step
reasoning [27, 28]. On the other hand, it brings error propagation. However, all of these works consider the
simplest setting, where all the required knowledge is either explicitly provided in context or retrievable from
knowledge bases.
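The following sketch gives the flavour of such a decoupled, stepwise framework; select, infer and answers are hypothetical modules (e.g. prompted PLMs), and the halting rule is deliberately simplified.

from typing import List, Tuple

def select(context: List[str], question: str) -> List[str]:
    raise NotImplementedError   # choose the premises needed for the next step

def infer(premises: List[str]) -> str:
    raise NotImplementedError   # derive one intermediate conclusion from the premises

def answers(conclusion: str, question: str) -> bool:
    raise NotImplementedError   # halting check: does the conclusion answer the question?

def stepwise_forward_reasoning(context: List[str], question: str,
                               max_steps: int = 5) -> Tuple[List[str], str]:
    path, conclusion = [], ""
    for _ in range(max_steps):
        premises = select(context, question)
        conclusion = infer(premises)            # sees only the selected premises
        path.append(f"{' & '.join(premises)} -> {conclusion}")
        context = context + [conclusion]        # conclusions become new knowledge
        if answers(conclusion, question):
            break
    return path, conclusion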
Defeasible reasoning. There are more types of inference relations in defeasible reasoning, i.e. strengthen, weaken (the probability of the conclusion) and rebut. Since it is difficult to collect all the supporting premises, research in this line mainly concerns the label of inference relations between statements. In other words, there exists implicit reasoning, i.e. some premises are not explicitly provided. Similar to deductive reasoning, reasoning paths can be generated by one-shot generation [62, 92] or by a faithful modular framework [62]. Different from deductive reasoning, generating defeasible reasoning paths is more challenging; even finetuned large models (T5-11B) find it difficult.
However, there remains a problem with the evaluation of constructed reasoning paths. Specifically, there may be multiple reasoning paths for each problem, which poses challenges for data annotation [31] and automatic evaluation [27]. Annotating all possible reasoning paths for evaluation is impractical, especially for long-step problems facing combinatorial explosion. It is also challenging to automatically evaluate the validity of reasoning paths without annotated data.
4.2.2 Performance improvement. Reasoning paths can also be used to improve answer performance on multi-step deductive reasoning, including the in-domain performance of LLMs and the generalization ability of PLMs. For this purpose, it is not necessary to involve all the required knowledge in the reasoning path or to keep every inference valid, as what we care about are the final results rather than the reasoning paths themselves.
Firstly, reasoning paths can improve in-domain performance by providing enriched context [141, 158] or supervision signals [23, 176].


Recently, [158] demonstrated that LLMs' performance on several reasoning tasks, such as commonsense reasoning (both deductive and defeasible), can be significantly improved by generating a reasoning path before the final answer, which is called chain-of-thought (CoT) prompting. Before this, while LLMs were successful on classical NLP tasks, they failed at reasoning, especially multi-step reasoning tasks. This finding sparked a series of research along this line [27, 28, 53, 77, 94, 135, 136, 141, 156, 171, 176, 184]. In particular, [77] showed that even a simple zero-shot prompt, "let's think step by step", can activate LLMs to perform commonsense reasoning and attain impressive performance. Furthermore, [156] found that the final performance on commonsense reasoning can be further improved simply by voting over the results of multiple sampled reasoning paths. Besides, in addition to performing reasoning on downstream tasks via few-shot prompting without changing parameters, supervised finetuning of LLMs on CoT annotations can further improve their reasoning capability [23, 176]. Beyond commonsense reasoning, performance on classical logical reasoning and multi-step reasoning is also improved significantly by generating CoT [28, 141]. However, classical logical reasoning is much more challenging than other typical tasks [28]. Instead of one-shot CoT generation, [28] proposed a more inspiring framework (SI) for theorem proving (a task of classical deductive reasoning) based on modules with different prompting, which outperforms 40x larger LLMs with CoT. Moreover, [59, 178] found that LLMs can self-improve their reasoning capabilities by finetuning on their self-generated reasoning paths. However, such abilities are only effective in LLMs, i.e. the model scale must be large enough, which is also seen as an emergent ability of LLMs that can be elicited by few-shot [158] and even zero-shot prompting [77]. There is also some research on transferring the CoT reasoning capability of LLMs to smaller models via knowledge distillation [53, 94, 136].
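The voting idea from [156] can be sketched in a few lines; sample_cot below is a hypothetical function that samples one (reasoning path, answer) pair from an LLM with non-zero temperature.

from collections import Counter
from typing import Callable, Tuple

def self_consistency(question: str,
                     sample_cot: Callable[[str], Tuple[str, str]],
                     n_samples: int = 20) -> str:
    # Sample several chains of thought and return the majority-voted answer.
    answers = []
    for _ in range(n_samples):
        _reasoning_path, answer = sample_cot(question)
        answers.append(answer)
    return Counter(answers).most_common(1)[0][0]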
Moreover, reasoning paths can improve the generalization ability of PLMs. It has been observed that constructing a proof graph for the goal hypothesis can improve the zero-shot generalization of medium-size PLMs to unseen reasoning depths [126, 128, 142] and to unseen domains [56, 142] on the theorem proving task, which is likely because it forces models to perform reasoning rather than exploit shortcuts. Also, turning one-shot construction into a stepwise procedure generalizes better to unseen reasoning depths and across datasets [56, 142].
However, the search space of forward reasoning suffers from combinatorial explosion as the number of reasoning steps increases. In addition to performing single-step inference, planning is also very important for multi-step reasoning, especially at deeper steps. It has been observed that while LLMs are capable of single-step inference, they still struggle to plan over deep reasoning steps [28, 135]. Yet this topic is under-explored, with little research so far [27, 166].
In addition to deductive reasoning, leveraging reasoning paths to improve performance on defeasible reasoning is still under-explored.

4.3 Backward Reasoning


Backward reasoning repeatedly breaks down problems into sub-problems and solves them until reaching the answers. Similar to forward reasoning, it can be used to produce trustworthy reasoning paths explicitly represented with knowledge and inference relations [56, 114, 143], or to improve final performance without strict structures [72, 74, 97]. It faces a smaller search space and thus is more efficient than forward reasoning. There are two popular backward reasoning methods: backward chaining and question decomposition. While the former is a proof-finding strategy, the latter is applicable to general problems. The research mentioned in this section mainly concerns deductive reasoning.
4.3.1 Backward Chaining. Backward chaining is the approach humans prefer for proof-finding. Beginning from the goal, it repeatedly performs abductive reasoning to derive potential premises as sub-goals until all the sub-goals can be proved or disproved by the existing knowledge. According to the source of the premises, or sub-goals, there are two kinds of abduction: predicting part of the premises (the others being existing knowledge) and predicting all premises.


The first kind predicts the unknown premise required for a conclusion given the existing knowledge, which can be realized by vanilla generative PLMs, either medium-size [56] or large [72]. The other kind predicts all the premises from scratch without relying on existing explicit knowledge, which can be realized by LLMs [143]. While the former kind of abduction is easier to perform, the latter can handle the scenario where the required premises do not exist in any knowledge base. Compared to forward chaining, backward chaining has a smaller search space and thus is more efficient [72]. Moreover, [72] proposed a backward chaining modular framework with LLMs as modules, which attains better performance than existing forward chaining frameworks. Another direction is to perform forward chaining (deduction) and backward chaining (abduction) simultaneously [56]. In addition to proof-finding, backward chaining can also be generalized to more general problems. For example, [143] applied it to multi-choice question answering by combining the question and each answer choice into a verifiable hypothesis.
However, research along this line is less explored than forward chaining.
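A hedged sketch of the second, LLM-driven kind of backward chaining described above is given below; propose_premises is a hypothetical abduction module (e.g. a prompted LLM), and the fixed depth bound is a simplification.

from typing import List, Set

def propose_premises(hypothesis: str) -> List[str]:
    raise NotImplementedError   # abductively suggest premises that would entail the hypothesis

def backward_chain(hypothesis: str, knowledge: Set[str], depth: int = 3) -> bool:
    if hypothesis in knowledge:             # proved by existing knowledge
        return True
    if depth == 0:                          # give up beyond the step budget
        return False
    sub_goals = propose_premises(hypothesis)
    return all(backward_chain(goal, knowledge, depth - 1) for goal in sub_goals)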

4.3.2 Question Decomposition. Question decomposition is a backward reasoning method for improving performance on multi-hop questions, which require integrating multiple pieces of knowledge and inferring over them to obtain the answers. It decomposes each question into several simpler sub-questions and answers these sub-questions to derive the final answer. In analogy to forward reasoning, solving a single-hop sub-question corresponds to querying a single piece of knowledge, and combining sub-answers to form the final answer is inference, while decomposing a question into sub-questions is an abductive step. In other words, while question decomposition introduces abduction steps, it removes the requirement of multi-step knowledge selection/retrieval.
Multi-hop questions are difficult to answer because they have a long-tail distribution and it is challenging to find the multiple relevant pieces of knowledge. In particular, it can be very challenging to find the required knowledge for implicit multi-hop questions, whose surface text and semantics can be very different from the required knowledge. By contrast, it is easier to query a single piece of knowledge and answer each decomposed single-hop sub-question. For example, [102] demonstrated that both medium-size PLMs and LLMs can significantly improve performance on multi-hop questions with human-decomposed questions. Question decomposition was also found effective for mathematical reasoning and symbolic reasoning [191]. Besides, previous research has shown that question decomposition is effective with both medium-size PLMs [190] and LLMs [107] on multi-hop questions. Research along this line has a longer history than LLM-only CoT methods.

Decomposition of explicit and implicit multi-hop questions. According to the difficulty of decomposition, multi-
hop questions can be divided into explicit multi-hop questions and implicit multi-hop questions. Explicit multi-hop
questions are those that can be decomposed simply based on their superficial text (syntactic pattern). For
example, the question “where was Obama’s wife born?” can be decomposed into “who is Obama’s wife?” and
“where was #1 born?”7 based on the superficial text of the original question. Implicit multi-hop questions, however,
are more difficult to decompose since their sub-questions are not syntactically consistent with the questions.
For example, the question “can we directly live in the space?” needs to be decomposed into “what do we need
to keep alive?” and “are there #1 in the space?”, where the key predicate of the first sub-question, “need”, is not
explicitly mentioned in the original question. While explicit multi-hop questions can be decomposed based on
their superficial text and syntactic structure via extraction and editing [97], decomposing implicit multi-hop
questions is much more difficult. A key challenge is the lack of large-scale annotated data, which is labour-
intensive to obtain, especially as the number of hops increases. StrategyQA [45] is an implicit multi-hop question
dataset annotated with sub-questions and the corresponding knowledge pieces, but its size is small (2.7k). To
alleviate the data sparsity problem, there is some research on weak supervision data [104, 190]. Recently, in-context
learning has provided a new solution [102, 107] to this problem, requiring only a small set of demonstrations.
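As a rough illustration of the in-context learning route, the sketch below shows a few-shot decomposition prompt in the spirit of [102, 107]. The exemplars, the prompt wording, and the generate callable are our own illustrative assumptions rather than a reproduction of any published prompt.

# A hedged sketch of few-shot question decomposition via in-context learning.
# The exemplars and wording are illustrative; `generate` is any text-completion
# callable backed by an LLM (an assumption, not a specific vendor API).
DECOMPOSITION_PROMPT = """\
Q: Where was Obama's wife born?
Decomposition: 1) Who is Obama's wife? 2) Where was #1 born?

Q: Can we directly live in the space?
Decomposition: 1) What do we need to keep alive? 2) Are there #1 in the space?

Q: {question}
Decomposition:"""

def decompose(question, generate):
    """Return the model-generated decomposition of a multi-hop question."""
    return generate(DECOMPOSITION_PROMPT.format(question=question)).strip()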

7 “#1” denotes the answer of the first sub-question.


Framework with respect to sequential and tree structure. There are different structures of decomposition based
on the dependencies among the parent question and sub-questions, involving a sequential structure and a tree
structure. In a sequential structure, each sub-question linearly depends on the answer (e.g. a bridge entity) of
the antecedent sub-question, and the answer of the last sub-question is the multi-hop question’s answer. For
example, the answer “Michelle” to the first sub-question “who is Obama’s wife?” fills in its subsequent
sub-question “where was #1 born?”, whose answer “Chicago” is also the final answer of the multi-hop question
“where was Obama’s wife born?”. By contrast, in a tree structure, sub-questions are independent of each other,
with their answers contributing equally to the final answer. For example, the question “who can swim better,
elephant or dolphin?” consists of “can elephant swim?” and “can dolphin swim?”, and the final answer is derived
by composing the corresponding sub-answers. There are three kinds
of decomposition-based framework: module-based decomposition [97], decompose-then-recompose [102, 104],
and generate-then-answer [74, 107]. The first framework designs different modules responsible for different
reasoning types, which separate and model the sequential and tree structures independently [97]. The decompose-
then-recompose framework first decomposes the multi-hop question into all its comprised sub-questions and then
recomposes their sub-answers to derive the final answer [102, 104]. However, it ignores the dependencies among
sub-questions (sequential structure). By contrast, the last one, generate-then-answer, is sequential in nature: it
iteratively generates and answers one single-hop sub-question at a time [74, 107]. It considers the question dependencies of the
sequential structure and is also compatible with the tree structure, but it is less efficient than decompose-then-recompose
since it cannot solve tree-structured sub-questions in parallel.
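A minimal sketch of the generate-then-answer loop described above is given below; the “[FINAL]” stopping convention and the two model-backed callables are illustrative assumptions rather than the interface of any surveyed system.

# A minimal generate-then-answer loop (sequential decomposition), sketching the idea
# behind [74, 107]. `gen_subquestion` and `answer_single_hop` are assumed model-backed
# callables; the "[FINAL]" convention is an illustrative choice.
def generate_then_answer(question, gen_subquestion, answer_single_hop, max_hops=4):
    history = []                                     # [(sub_question, sub_answer), ...]
    for _ in range(max_hops):
        sub_q = gen_subquestion(question, history)   # next sub-question, conditioned on history
        if sub_q.startswith("[FINAL]"):              # the model signals it can now answer directly
            return sub_q[len("[FINAL]"):].strip(), history
        sub_a = answer_single_hop(sub_q)             # single-hop QA, e.g. over retrieved context
        history.append((sub_q, sub_a))
    return None, history                             # hop budget exhausted without a final answer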
However, it is still challenging to solve multi-hop questions with very long hops. Due to the combinatorial
explosion, it becomes increasingly difficult to annotate decomposition supervision data and to provide representative
demonstrations for in-context learning. Also, multiple decomposition paths are likely to exist when there
are long hops, which also poses a challenge for planning. The following are potential directions we suggest.
• Hierarchical decomposition. Instead of directly decomposing the multi-hop question into the simplest
single-hop sub-questions, it might be easier for models to perform hierarchical decomposition, i.e. repeatedly
decompose multi-hop questions into simpler multi-hop questions until only single-hop questions remain.
Moreover, it is also more practical for researchers to annotate supervision data or select appropriate
exemplars for in-context learning layer by layer.
• Knowledge-aware planning. When there exist multiple decomposition paths, it is critical to plan for a
decomposition that leads to answerable sub-questions. For this purpose, it is important to be aware of what
knowledge exists.

4.4 Summary
Reasoning requires models to integrate multiple pieces of knowledge and reason over them. Early research mostly improved
reasoning performance via architectural designs and only constructed forward reasoning paths for interpretability
or faithfulness. Specialized models were designed to improve evidence aggregation, reasoning capability or
faithfulness, but they are constrained to specific tasks, datasets or reasoning types, which hurts generalization.
Since transformers have been found to be soft deductive reasoners after in-domain finetuning, vanilla PLMs have
become more popular for performing reasoning. However, data sparsity and spurious correlation problems make it
difficult for medium-size PLMs to learn the general logical structure of diverse reasoning types. There is also
some research incorporating inductive biases via specialized pretraining, but it is unclear whether this remains
worthwhile as the model size and the amount of pretraining data increase. Recently, it was found that an emergent
ability appears once PLMs are large enough: generating a reasoning path before the final answer can significantly
improve multi-step reasoning performance, which has boosted much research in this line. In addition to the
forward reasoning direction, the other reasoning direction is backward, which is more efficient than forward
reasoning due to its smaller search space. While forward reasoning can expose arbitrary new knowledge entailed
by the existing knowledge, backward reasoning targets only the specific goal or problem solution. A typical
approach to backward reasoning is question decomposition, which can improve performance on multi-hop
questions for both medium-size PLMs and LLMs. While there is much research on deductive reasoning, defeasible
reasoning is much more challenging for PLMs and is still under-explored.

Fig. 7. Natural language reasoning benchmarks in NLP. Downstream Benchmarks (Sec 5) cover Classical Logical Reasoning (Sec 5.1), Natural Language Inference (Sec 5.2), Multi-Hop Question Answering (Sec 5.3), Commonsense Reasoning (Sec 5.4), Complex Reasoning (Sec 5.5), and Others (Sec 5.6).

5 NLR BENCHMARKS
In this section, we review some typical and popular downstream benchmarks thought to require natural language
reasoning and discuss to what extent they are actually related to reasoning. Although there may be more
downstream benchmarks related to natural language reasoning, we mainly focus on four of the most
popular topics familiar to the community: classical logical reasoning, natural language inference, multi-hop question
answering, and commonsense reasoning. We list the corresponding datasets and benchmarks and briefly introduce
their development. Besides, we present some datasets collected from realistic examinations or explicitly designed
to challenge LLMs, which we name “complex reasoning”. In addition to these well-known reasoning benchmarks, we
also introduce some other tasks that require natural language reasoning. A figure of the taxonomy is
shown in Fig 7.

5.1 Classical Logical Reasoning


Some datasets explicitly target classical reasoning types from philosophy and logic, e.g. deduction, abduction and
induction, following the definitions in these two areas. Thus, we call them “classical logical reasoning tasks”. A
key characteristic of this topic is that the tasks are mostly artificial, constructed specifically to study reasoning. They cover both
deductive reasoning and defeasible reasoning.
5.1.1 Deductive reasoning. Classical deductive reasoning tasks are defined formally based on formal logic, such
as propositional logic and first-order logic. There are mainly three types of task: inference [6, 100, 160], theorem
proving [5, 25, 49, 142] and reasoning path generation [100]. The inference task is to derive the conclusion from the given
premises in a single step, while theorem proving is to predict whether a given proposition is true or false
with respect to a given knowledge base, which usually requires multiple steps. Obviously, inference is the fundamental
task that forms the basic capability for multi-step reasoning tasks such as theorem proving, while reasoning
path generation is an interpretable task that can be complementary to multi-step reasoning. However, except for
FOLIO [49], all the existing explicit deductive reasoning datasets are synthetic. We list the classical deductive
reasoning datasets in Table 7.

Dataset Size Data Source Task Remark
bAbI-15 [160] - synthetic inference basic deduction
RuleTaker† [25]/ProofWriter† [142] 500k synthetic theorem proving the first natural language theorem proving
PARARULE-Plus [5] 400k synthetic theorem proving addresses the depth imbalance issue on ParaRules
AAC [6] 710k synthetic inference based on 8 syllogistic argument schemes
LogicInference [100] 200k synthetic inference, reasoning path generation -
FOLIO [49] 1.4k expert-written theorem proving more diverse patterns
Table 7. Datasets of classical deductive reasoning, where bAbI-15 means “the 15-th task in bAbI tasks”. † denotes there are
ground reasoning paths.
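To make the theorem-proving task format concrete, below is a schematic instance in the style of RuleTaker/ProofWriter [25, 142]; the wording and field names are paraphrased for illustration and are not an exact item from either dataset.

# A schematic RuleTaker/ProofWriter-style theorem-proving instance
# (paraphrased for illustration, not an exact dataset item).
instance = {
    "facts": ["Bob is cold.", "Bob is round."],
    "rules": ["If someone is cold and round then they are white.",
              "If someone is white then they are smart."],
    "question": "Bob is smart.",
    "label": True,        # provable from the facts and rules
    "proof_depth": 2,     # two deduction steps are needed
}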

Proof-finding and faithful reasoning. Since [25] proposed a theorem proving dataset and showed that vanilla
medium-size PLMs can be soft theorem provers, a series of studies has emerged on this task to investigate natural language
reasoning, with both vanilla medium-size PLMs [126, 128, 130, 166, 181] and LLMs [27, 28, 72, 135, 142]. However,
while the performance of transformers on theorem proving is promising, [181] found that there are statistical
features inherent in the problem, which may hinder models from generalizing. In addition to just
classifying the final label [25, 181], it has been demonstrated that producing proofs can bring better generalization
to unseen proof depths and out-of-domain data [126, 142] and contribute to interpretability. There is a line of
research on proof generation or proof-finding, either forward [27, 28, 130, 142] or backward [72, 83, 114],
where backward chaining is intrinsically more efficient than forward chaining for proof-finding [72]. To alleviate
the combinatorial explosion problem in the search space of forward chaining, some researchers proposed
planning [27, 166]. Moreover, faithful reasoning is also an interesting topic in this problem, where the reasoning procedure
is strictly designed to guarantee that models actually perform reasoning to derive the answer rather than
rely on shortcuts [27, 130]. However, while the performance is promising, sometimes even approaching perfect,
all the research mentioned above is based on synthetic datasets. Moreover, the recent expert-written dataset
FOLIO [49] shows that performance degrades severely when it comes to more diverse natural language. By
contrast, the entailment tree generation dataset EntailmentBank [31] is often used to study proof generation
and faithful reasoning, as with theorem proving [27, 56, 120, 143, 166]. The target hypotheses in this dataset
are collected from realistic examinations and the proofs are annotated by humans, which makes it a better alternative for
studies on proof generation.
There are also some benchmarks to diagnose models’ capabilities in logical semantics understanding [119,
127, 129].

5.1.2 Defeasible reasoning. Two typical defeasible reasoning types are abduction [7, 174] and induction [160, 167];
there is also another type of defeasible reasoning [122]. Datasets are shown in Table 8. Compared to classical
deductive reasoning, research on defeasible reasoning is still under-explored. Experiments suggest that
there remains large room for improvement [167].

Inductive reasoning. Induction produces a more general principle from the given knowledge that expresses or
explains it. Early datasets require first inducing rules and then applying them to perform deduction, without
producing explicit rules [137, 160]. Recently, a new dataset, DEER [167], studies rule prediction, where the task is to
induce natural language rules from natural language facts.


Abductive reasoning. Abduction is to predict the best explanation for the observations. According to the mode
of the reversed reasoning, abduction can provide explanations that constitute the premises of either deductive
reasoning [174] or defeasible reasoning [7]. Based on the explained objects (i.e. the input), abduction may target a
small set of premises [7] or a knowledge corpus [174].
Others. In addition to abduction and induction, defeasibleNLI [122] focuses on whether a premise can weaken
or strengthen a probable conclusion. There is research on defeasible inference graphs to improve both human
reasoning [92] and machine reasoning performance [93].

Dataset Reasoning Size Source Task Remark


bAbI-16 [160] induction - synthetic extraction induce-then-deduce
CLUTRR [137] induction - synthetic extractive QA induce-then-deduce
DEER [167] induction 1.2k Wikipedia generation rule prediction
AbductionRules [174] abduction - synthetic generation abduce from knowledge database
ART [7] abduction 17.8k ROCStories [98] 2-choice/generation abduce from two premises
defeasibleNLI [122] others 43.8k other datasets classification/generation concern the change of strength
Table 8. Datasets of classical defeasible reasoning, where bAbI-16 means “the 16-th task in bAbI tasks”.

5.2 Natural Language Inference


Natural language inference (NLI), also known as recognizing textual entailment (RTE), is a typical task in NLP. It
is a 3-way classification task, labelling each premise–hypothesis pair as entailment, contradiction or neutral, to identify whether the given
premise entails the hypothesis. An entailment is described as a conclusion that a person would typically infer from
the premise or from the implication described by the premise [12, 29].
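As a practical illustration of the task format, the snippet below runs an off-the-shelf NLI classifier. It assumes the publicly available roberta-large-mnli checkpoint and the Hugging Face transformers and torch libraries, and is not part of any surveyed method.

# A minimal sketch of 3-way NLI classification with an off-the-shelf model.
# Assumes the roberta-large-mnli checkpoint (labels: CONTRADICTION / NEUTRAL / ENTAILMENT).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "A soccer game with multiple males playing"
hypothesis = "Some men are playing a sport"

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(dim=-1))])   # expected: ENTAILMENT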
While NLI is regarded as a natural language understanding [12] or natural language reasoning [25] problem, we
find it involves examples of both understanding and reasoning problems. Specifically, we identify mainly
three types of premise–hypothesis entailment problems: paraphrasing, compound semantics understanding, and
reasoning with implicit premises. For the first type, the hypothesis is a paraphrase of the premise. For the second
type, the premise is a compound proposition entailing the hypothesis. For the last type, some unstated
premises are needed to link the provided premise to the hypothesis. We demonstrate samples from the popular dataset
SNLI [12] for each type in Table 9.

Type Premise Hypothesis
Paraphrase Two doctors perform surgery on patient Doctors are performing surgery
CSU Two women are embracing while holding to go packages Two women are holding packages (Two women are embracing)
Reasoning A soccer game with multiple males playing (Soccer is a sport) Some men are playing a sport
Table 9. Examples from SNLI [12] of three types of entailment, where CSU indicates “Compound Semantics Understanding”.
The parenthesized sentence in the Reasoning row is the implicit premise, while the parenthesized sentence in the CSU row is the other semantics of the premise.

There are several popular generic datasets listed in Table 10, where datasets with realistic hypotheses have fewer
hypothesis-only biases than those with human-authored hypotheses. Concretely, it has been found that there are
significant biases in human-authored hypotheses [12, 161], with which models can even predict the label without
the premise [106, 152, 162].

Dataset Domain Size P Source H Source Remark


SNLI [12]/e-SNLI† [18] generic 570k realistic human-authored the first large-scale NLI dataset
MultiNLI [161] generic 433k realistic human-authored cover more styles and topics
XNLI [26] generic 7.5k - - cross-lingual, based on MultiNLI
SciTail [75] science 27k realistic realistic the first NLI dataset with entirely realistic data
SciNLI [124] science 107k realistic realistic -
Table 10. Datasets of NLI. “P” denotes “Premise” while “H” denotes “Hypothesis”. † means that e-SNLI provides explanations
for examples of SNLI.

Several NLI datasets and benchmarks are purely understanding problems, such as those presented specifically to
probe and improve model capabilities of paraphrase and compound semantics understanding [57, 127, 164, 165].
Also, datasets that are converted from other tasks into NLI-style are irrelevant to reasoning when the original tasks are not
reasoning problems [19, 173].
Interestingly, it has been shown that crowdworkers sometimes assign different labels to the same premise–
hypothesis pair [12, 21]. We think this phenomenon can be attributed to the existence of defeasible reasoning,
where people with different background knowledge can derive different conclusions.

5.3 Multi-Hop Question Answering


Multi-hop question answering (MHQA) studies answering complex questions that require reasoning over
evidence scattered in different contexts8 , and is thus also called multi-hop reading comprehension. Candidate
contexts are either explicitly provided along with some distractors [54, 151, 159, 169] (distractor setting),
or can be retrieved from external knowledge bases such as Wikipedia [45, 109, 169] and WorldTree [73] (retrieval
setting). The term “hop” here indicates the number of contexts required for reasoning rather than the number of
inference steps, describing the behaviour of moving among different contexts.
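As a rough sketch of the retrieval setting, the snippet below scores a toy corpus with BM25. The corpus, query, and use of the rank_bm25 package are illustrative assumptions; real systems retrieve from Wikipedia-scale indexes, often iteratively, one hop at a time.

# A toy sketch of the retrieval setting in multi-hop QA: BM25 retrieval over a
# tiny corpus (assumes the rank_bm25 package; real systems index full Wikipedia).
from rank_bm25 import BM25Okapi

corpus = [
    "Barack Obama's wife is Michelle Obama.",
    "Michelle Obama was born in Chicago.",
    "Chicago is a city in Illinois.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "Where was Obama's wife born?"
top_contexts = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_contexts)   # the two contexts needed to answer the 2-hop question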
Datasets & Benchmarks. We list some typical datasets in Table 11. A key challenge in dataset construction
is that it is very labour-intensive to annotate large-scale multi-hop questions, especially since there is a combinatorial
explosion as the number of hops increases. Many datasets are synthetic or semi-synthetic [54, 65, 159, 169], where
questions are mainly deductive, i.e. the answers are necessarily true given the contexts. There are two types
of rationale: supporting text sets [45, 54, 73, 151, 169] and reasoning paths, both forward [65, 67] and
backward [45, 151].
Multi-hop question construction. There are mainly two lines of work on multi-hop question construction: improving
data quality and increasing data quantity. Firstly, it has been found that there are artifacts in HotpotQA that can
be leveraged to answer questions without performing multi-hop reasoning [20, 68, 96, 150]. To deal with this
problem, one way is to leverage adversarial data [54]; another is to construct new datasets of high-quality
multi-hop questions with carefully designed data collection strategies [54, 151]. Secondly, as multi-hop questions
are difficult to annotate, there is some research on automatic data generation [39, 101, 175].
Reasoning. After retrieving the relevant contexts, models need to aggregate the multiple pieces of evidence and
reason over them. Firstly, some specialized models are designed for better evidence aggregation [61,
185]. Secondly, reasoning is usually performed via end-to-end answering [61, 66, 81, 185, 186] or backward
decomposition [45, 74, 97, 102, 104, 107]. In this topic, question decomposition (i.e. backward reasoning) is more
popular than forward reasoning.
8 There are not only natural language reasoning questions, but also other types such as numerical comparison [45, 169].


Dataset Domain Size CS QS AT Rationale
WikiHop [159] generic 51k Wikipedia synthetic option ×
MedHop [159] medicine 2.5k Medline synthetic option ×
HotpotQA [169] generic 112k Wikipedia semi-synthetic span, yes/no sentences
R4C [65] generic 4.6k Wikipedia semi-synthetic span, yes/no triples
BeerQA [109] generic 530 Wikipedia human-authored span, yes/no ×
2WikiMultiHopQA [54] generic 192k Wikipedia synthetic span sentences, triples
MuSiQue [151] generic 25k Wikipedia human-composed span paragraphs, decomposition★
QASC [73]/eQASC† [67] science 9.9k WorldTree human-authored option sentences, reasoning path [67]★
StrategyQA [45] generic 2.7k Wikipedia human-authored yes/no paragraphs, decomposition★
Table 11. Datasets of multi-hop question answering. † indicates it annotates the rationale for this dataset. “CS” denotes
“Context Source”, “QS” denotes “Question Source”, and “AT” denotes “Answer Type”. In CS, the distractor setting is coloured
blue, while the retrieval setting is coloured orange, and black means there are both. For rationale, ★ means “reasoning path”,
otherwise “supporting evidence set”. “decomposition” indicates the ground annotations of decomposed sub-questions and
the corresponding contexts.


5.4 Commonsense Reasoning


Commonsense reasoning deals with implicit commonsense knowledge, i.e. commonsense knowledge is
necessarily required to solve the problem. Such knowledge may be obvious to people but non-trivial to machines,
since it is difficult to retrieve from the web due to reporting bias, e.g. “when people are hungry, they would
like to eat something”.
However, although this topic is named “commonsense reasoning”, not all of its datasets involve reasoning as defined
in Sec 2.1; examples include querying shared living experiences [10], identifying pragmatic implications [133], and so
on [44, 85].

5.4.1 Datasets & Benchmarks. According to the conclusion type, there are mainly three types of reasoning
problems in commonsense reasoning: “what” (i.e. assertions or events), “what if / why” (e.g. causal and temporal
relations between events), and “how” (i.e. actions).

What. This type of problem is similar to multi-hop question answering, in that the problems require combining
multiple pieces of knowledge, some of which come from external knowledge sources. The key difference is that
commonsense reasoning requires some commonsense knowledge that is not explicitly provided. In other words, the
problems require integrating explicit knowledge, such as science [84, 95], with some commonsense knowledge.
We list some datasets in Table 12.


Dataset Other Knowledge Knowledge Source Size Task Rationale


OpenBookQA [95] science WorldTree 6k multi-choice QA science facts
OpenCSR [84] science WorldTree, ARC corpus 20k free-form QA ×
CREAK [99] entity Wikipedia 13k claim verification explanation
Table 12. Datasets of “what” commonsense reasoning.

What if & Why. This type of problem often reasons about causal and temporal relations between events. There
are two causal relations, causes and effects, which can be seen as backward causal reasoning and forward causal
reasoning respectively. Taking the causality of events as an example, forward causal reasoning asks “what events
are likely to happen next?”, while backward causal reasoning asks “what may have caused this event?” in a scenario
described by the context, i.e. querying the plausible subsequent or previous events respectively. Besides, some
problems require considering another scenario in addition to the context, which can be seen as
constrained causal reasoning. For example, TIMETRAVEL [7, 111] is a counterfactual story rewriting dataset,
where the original story is also given. See the relevant datasets and benchmarks in Table 13.

Dataset Size Direction Context Source Task Remark


ROCStories [98] 50k temporal human-authored 2-choice QA -
SWAG [179] 113k temporal ActivityNet, LSMDC multi-choice QA -
HellaSwag [180] 20k temporal ActivityNet, WikiHow multi-choice QA an upgraded SWAG
COPA [121] 1k both human-authored 2-choice QA -
Social-IQA [134] 38k both human-authored multi-choice QA social situations
e-CARE† [37] 21k both human-authored 2-choice QA -
WIQA [149] 40k forward ProPara [148] multi-choice QA about nature processes
TIMETRAVEL [111] 29k forward ROCStories [98] generation counterfactual reasoning
ART [7] 20k backward ROCStories [98] 2-choice/generation abductive commonsense reasoning
TellMeWhy [79] 30k backward ROCStories [98] free-form QA each annotated 3 possible answers
WikiWhy† [52] 9k backward human-edited Wikipedia free-form QA about Wikipedia entities / events
Table 13. Datasets of “what if” / “why” commonsense reasoning, where † denotes there annotates supporting facts or
reasoning paths. For direction, “both” indicates there are both forward and backward causal reasoning.

How. This type of problem is mainly about “how to do it”. It is more complex and also involves problem-solving
and decision-making. See some examples in Table 14.

Dataset Size Context Source Option Source Task Remark


WikiHow Goal-Step [182] 1489k WikiHow automatically generated multi-choice goals, steps, and temporal ordering
PIQA [8] 21k human-authored human-authored 2-choice physical causal reasoning
Table 14. Datasets of “how” commonsense reasoning.

Others. Besides, some datasets involve multiple types of reasoning. We list some typical datasets in Table 15.


Dataset Size Context Source Question Source Task Remark
CSQA [144] / CoS-E† [117] / ECQA† [1] 12k - semi-synthetic multi-choice QA ConceptNet concepts [138]; explanation [1, 117], commonsense facts [1]
CSQA2 [146] 14k - human-authored boolean QA data construction via gamification
CosmosQA [60] 35k blog [17] human-authored multi-choice QA reading comprehension on blogs
Moral Stories [38] 12k human-authored - classification/generation situated reasoning with social norms
Table 15. Datasets and benchmarks with multiple types of commonsense reasoning. † indicates it annotates the rationale for
the dataset.

5.4.2 Reasoning. Since commonsense knowledge is essential to this topic, much research has focused on common-
sense knowledge itself [42, 64, 76, 82, 88, 118, 132, 138]. As for the reasoning system, there are mainly two types of
methods: graph-based methods [40, 84, 170, 183] and vanilla PLMs [70, 76, 90, 112, 147, 176], where graph-based methods
are designed to aggregate knowledge from commonsense knowledge bases, while vanilla PLMs are used as implicit
knowledge bases themselves.

5.5 Complex Reasoning


There are some datasets collected from realistic examinations or tests, which may require domain-specific
knowledge and multiple types of reasoning skills (Table 16).

Dataset Size Domain Source Task


AR-LSAT [189] 2k law law school admission test multi-choice QA
HEAD-QA [154] 6.7k healthcare specialized healthcare examination multi-choice QA
AI2-ARC [24]/EntailmentBank† [31] 7.7k science grade-school standardized test multi-choice QA
ReClor [177]/MetaLogic† [62] 6k generic standardized graduate admission examination RC + multi-choice QA
LogiQA [89] 8k generic national civil servants examination of China RC + multi-choice QA
ConTRoL [87] 8k generic competitive selection and recruitment test passage-level NLI
Table 16. Complex reasoning datasets with the realistic data from examinations or tests, where “RC” denotes “reading
comprehension”. † indicates “it annotates reasoning paths for some examples in this dataset”.

To better diagnose the abilities of LLMs, two few-shot prompting benchmarks, MMLU [50] and Big-
Bench [139], have been proposed, where the tasks are much more challenging and even believed to be beyond the capabilities
of current language models, and some of them require reasoning. Among the tasks in Big-Bench, [141]
identified 23 challenging tasks, named Big-Bench Hard (BBH), on which LLMs failed to surpass the average human-
rater; many of them require multi-step reasoning. However, when equipped with CoT prompting,
GPT3 can outperform human performance on a majority of these hard tasks.
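As a sketch of the CoT prompting technique referenced here, the snippet below follows the two-stage zero-shot CoT recipe [77] (reasoning extraction, then answer extraction). The prompt wording and the generate callable are illustrative assumptions rather than the exact prompts used in the cited work.

# A hedged sketch of zero-shot CoT prompting: elicit a reasoning path with the
# trigger phrase, then extract the final answer. `generate` is an assumed
# text-completion callable backed by an LLM.
def answer_with_cot(question, generate):
    reasoning = generate(f"Q: {question}\nA: Let's think step by step.")
    answer = generate(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
                      "Therefore, the answer is")
    return answer.strip()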

5.6 Others
In addition to the above-mentioned datasets and benchmarks, there are some other tasks requiring natural
language reasoning scattered across the NLP domain, involving dialogue [125], reading comprehension [86] and so
on [48]. Note that reasoning is an important method to arrive at the required answers or solutions, and it is
used more frequently in complex problems. In other words, reasoning can occur in many other domains to
solve challenging problems that require multiple pieces of knowledge to derive conclusions. While there may be more
such reasoning tasks or datasets, we list some of them in Table 17.


Dataset Size Reasoning Context Source Task


ShARC [125] 32k deductive government document conversation + boolean QA
ROPES [86] 14k deductive science textbook, Wikipedia RC + extractive QA
ARC [48] 2k abductive news comment 2-choice
Table 17. Some other NLP benchmarks requiring natural language reasoning.

6 DISCUSSION
In this section, we propose some open questions, introduce some limitations, and suggest some future directions
for reasoning. Among these, we also discuss the limitations of ChatGPT and GPT4.

6.1 Open questions


We propose some open questions about the reasoning capabilities of LLMs, as there remain many mysteries in their
emergent reasoning capabilities.
• Why is CoT prompting effective? Why can simply producing reasoning paths, which may even be wrong,
before the final answer bring such significant improvements? And why is CoT prompting only effective for
LLMs? What happens in LLMs when they are prompted with CoT that fails for medium-size PLMs?
• Where do the reasoning capabilities of LLMs come from? Why do reasoning capabilities emerge in LLMs
simply as the model size increases? Where does the magic of “Let’s think step by step” come from? How do
they learn these capabilities? While the mechanism of another LLM magic, in-context learning, has been
studied [2, 30, 163], reasoning capabilities remain more mysterious [108].
• Do even larger models reason better? If LLMs can develop emergent reasoning capabilities that can be elicited
by prompts, can they learn competitive reasoning capabilities simply as the model size increases? Or
is it still beneficial to build more datasets and design reasoning algorithms?

6.2 Limitations
We introduce both limitations of the current research and limitations intrinsic to PLMs.
Firstly, there are gaps in defeasible reasoning and reasoning path evaluation.
• Research gap on defeasible reasoning. While defeasible reasoning is widely used in our daily life, this
topic is still under-explored in NLP. [4] found that it is more challenging for ChatGPT to perform abductive
and inductive reasoning than deduction, among which induction is by far the more difficult.
• Lack of effective ways to evaluate reasoning paths. It is still challenging to automatically evaluate
generated reasoning paths without ground truth. Evaluating reasoning paths might become increasingly
important for building explainable and reliable AI systems, especially as more people interact with and use
ChatGPT-like products nowadays.
Secondly, there are also limitations intrinsic to PLMs.
• Soft deduction can produce invalid conclusions. Transformers can only predict conclusions with
probability, irrespective of the fact that the conclusion of deductive reasoning is necessarily true in nature,
which might prevent them from precise reasoning. This characteristic can result in sub-optimal solutions to
deductive problems (including arithmetic reasoning and symbolic reasoning). For example, while ChatGPT
is impressive on reasoning tasks, it still fails to achieve perfect performance on the simplest one-step
deductive inference task [4].


• Biases on content. PLMs make their predictions based on context. While LLMs have made huge progress
in reasoning, [32] found that LLMs are biased by content, like humans, when performing deduction. For
example, they perform worse in abstract or counterfactual situations than in realistic ones. Such biases
hinder them from actual reasoning and lead to wrong answers, degrading downstream performance.
More severely, they might cause harmful societal effects due to social biases such as gender bias, which
also exist in GPT4 [16].

6.3 Future
We suggest some potential research directions at both the holistic and technical levels.
At the holistic level, firstly, reasoning should be generalized to more complex settings (longer steps and
defeasible reasoning) and more diverse knowledge media (languages and modalities). Secondly, more attention should
be paid to interpretability and faithfulness. We introduce these directions as follows.
• Generalization to longer steps. Multi-step performance degrades as PLMs encounter samples that
require more reasoning steps than those in the training data or few-shot exemplars. Although there is research
on decoupled one-step inference, which can alleviate the OOD problem, it still struggles
with planning. How to better generalize to longer steps is an important problem for complex reasoning
tasks, which are also challenging for ChatGPT [4].
• More research on defeasible reasoning. PLMs are currently the most promising path to defeasible
reasoning due to the advantages we introduced in Sec 3. According to philosophy, non-deductive reasoning
is much more common than deductive reasoning in our daily lives and practical scenarios. It is worth more
effort to explore PLMs for defeasible reasoning since there is a lack of effective methods to deal with it,
while deductive reasoning can be solved by well-developed symbolic engines9 , e.g. Prolog coding
for first-order logic. Moreover, it might benefit scientific research a lot if AI could induce general rules from
specific facts.
• Reasoning over non-English languages. In addition to reasoning over English statements, it is also impor-
tant to perform reasoning in other languages, which is much more challenging due to more severe data
sparsity problems.
• Reasoning with multi-modality. Other types of modalities can also contribute to reasoning, such as
tables [22] and images [35, 80, 140, 172]. Recently, GPT4 has become able to process images, which might push forward
visual reasoning.
• Interpretability and faithful reasoning. Transparent and reliable reasoning paths become increasingly
important as reasoning generalizes to longer steps and defeasible reasoning. Firstly, when there are many
steps, it takes more time and effort for people to check the quality of reasoning; unfaithful
reasoning might therefore introduce difficulty into people’s judgement and decision-making. Secondly, when it comes
to defeasible reasoning, exposing interpretable reasoning paths is much more important and sometimes
necessary for people to be convinced. In this case, different people with different background knowledge
can derive different and even opposing conclusions, so it is crucial to illustrate the evidence collected for
reasoning.
At the technical level, we suggest several directions to improve reasoning capabilities and the performance of
multi-step reasoning and defeasible reasoning, as follows.
• More prompts to elicit reasoning capabilities from LLMs. Few-shot CoT and zero-shot CoT prompting
are inspiring, and CoT annotations have been used to improve LLMs’ reasoning capabilities [23]. It is both
interesting and important to find whether there are other prompts that can activate LLMs to perform
reasoning or are beneficial to improving reasoning capabilities, especially for complex reasoning.
• Self-improvement of LLMs. Data annotations for reasoning paths, especially for long-step and defeasible
reasoning, are difficult to obtain. Interestingly, in our case studies, we found that ChatGPT can provide more
comprehensive answers than the ground annotations in some existing datasets such as EntailmentBank [31]
and WikiWhy [52]. Recent research has demonstrated that LLMs can learn from their self-generated
reasoning paths to improve their reasoning capabilities [59, 178], which has the potential to alleviate the data
challenge.
• More exploration of backward reasoning. Backward reasoning can benefit both medium-size PLMs
and LLMs, while CoT prompting only benefits LLMs. Moreover, it is more efficient than forward reasoning
due to its smaller search space, which brings more benefits as the depth of reasoning increases. To solve
more complex reasoning problems, it is worth conducting more exploration in this direction.
• More research on planning. Planning is important for performing longer-step reasoning since the search
space becomes larger as the depth increases.
• Exploration of self-correction. Since the conclusion of defeasible reasoning can be retracted by newly
added evidence, it might be important for PLMs to self-correct their conclusions as the reasoning proceeds.
9 Arithmetic reasoning and symbolic reasoning, which have recently been popular, are also deductive and can be solved by a calculator or code.

ACKNOWLEDGMENTS
We appreciate the assistance from Ridong HAN during the investigation and the suggestion from Zhihong CHEN
on the figure demonstration in this survey.

REFERENCES
[1] S. Aggarwal, D. Mandowara, V. Agrawal, D. Khandelwal, P. Singla, and D. Garg. Explanations for commonsenseqa: New dataset and
models. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers),
Virtual Event, August 1-6, 2021, pages 3050–3065. Association for Computational Linguistics, 2021.
[2] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in-context learning? investigations with linear
models. CoRR, abs/2211.15661, 2022.
[3] P. A. Angeles. Dictionary of Philosophy. Barnes & Noble Books, 1981.
[4] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, and P. Fung. A multitask,
multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023, 2023.
[5] Q. Bao, A. Y. Peng, T. Hartill, N. Tan, Z. Deng, M. Witbrock, and J. Liu. Multi-step deductive reasoning over natural language: An
empirical study on out-of-distribution generalisation. The 2nd International Joint Conference on Learning and Reasoning and 16th
International Workshop on Neural-Symbolic Learning and Reasoning (IJCLR-NeSy 2022), 2022.
[6] G. Betz, C. Voigt, and K. Richardson. Critical thinking for language models. In IWCS, pages 63–75. Association for Computational
Linguistics, 2021.
[7] C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. Yih, and Y. Choi. Abductive commonsense
reasoning. In ICLR. OpenReview.net, 2020.
[8] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. In The Thirty-Fourth
AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference,
IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12,
2020, pages 7432–7439. AAAI Press, 2020.
[9] S. Blackburn. The Oxford Dictionary of Philosophy. Oxford University Press, 2008.
[10] M. Boratko, X. Li, T. O’Gorman, R. Das, D. Le, and A. McCallum. Protoqa: A question answering dataset for prototypical common-sense
reasoning. In EMNLP (1), pages 1122–1136. Association for Computational Linguistics, 2020.
[11] K. Bostrom, X. Zhao, S. Chaudhuri, and G. Durrett. Flexible generation of natural language deductions. In M. Moens, X. Huang,
L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021,
Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6266–6278. Association for Computational Linguistics, 2021.
[12] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In L. Màrquez,
C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language


Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Computational Linguistics, 2015.
[13] T. E. o. E. Britannica. inference. Encyclopedia Britannica, 16 Jun. 2017, 2017. https://www.britannica.com/topic/inference-reason.
[14] T. E. o. E. Britannica. reason. Encyclopedia Britannica, 15 May. 2020, 2020. https://www.britannica.com/topic/reason.
[15] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin,
S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners.
In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[16] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T.
Ribeiro, and Y. Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4. CoRR, abs/2303.12712, 2023.
[17] K. Burton, A. Java, and I. Soboroff. The icwsm 2009 spinn3r dataset. In Third Annual Conference on Weblogs and Social Media (ICWSM
2009), 2009.
[18] O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom. e-snli: Natural language inference with natural language explanations.
In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information
Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal,
Canada, pages 9560–9572, 2018.
[19] T. Chakrabarty, D. Ghosh, A. Poliak, and S. Muresan. Figurative language in recognizing textual entailment. In C. Zong, F. Xia, W. Li,
and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021,
volume ACL/IJCNLP 2021 of Findings of ACL, pages 3354–3361. Association for Computational Linguistics, 2021.
[20] J. Chen and G. Durrett. Understanding dataset design choices for multi-hop reasoning. In NAACL-HLT (1), pages 4026–4032. Association
for Computational Linguistics, 2019.
[21] T. Chen, Z. Jiang, A. Poliak, K. Sakaguchi, and B. V. Durme. Uncertain natural language inference. In D. Jurafsky, J. Chai, N. Schluter,
and J. R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online,
July 5-10, 2020, pages 8772–8779. Association for Computational Linguistics, 2020.
[22] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang. Tabfact: A large-scale dataset for table-based
fact verification. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net, 2020.
[23] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai,
M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean,
J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei. Scaling instruction-finetuned language models. CoRR, abs/2210.11416, 2022.
[24] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try
arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018.
[25] P. Clark, O. Tafjord, and K. Richardson. Transformers as soft reasoners over language. In IJCAI, pages 3882–3890. ijcai.org, 2020.
[26] A. Conneau, R. Rinott, G. Lample, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov. XNLI: evaluating cross-lingual sentence
representations. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2475–2485. Association for Computational
Linguistics, 2018.
[27] A. Creswell and M. Shanahan. Faithful reasoning using large language models. CoRR, abs/2208.14271, 2022.
[28] A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning.
CoRR, abs/2205.09712, 2022.
[29] I. Dagan, D. Roth, M. Sammons, and F. M. Zanzotto. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on
Human Language Technologies. Morgan & Claypool Publishers, 2013.
[30] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei. Why can GPT learn in-context? language models secretly perform gradient descent
as meta-optimizers. CoRR, abs/2212.10559, 2022.
[31] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark. Explaining answers with entailment trees. In EMNLP
(1), pages 7358–7370. Association for Computational Linguistics, 2021.
[32] I. Dasgupta, A. K. Lampinen, S. C. Y. Chan, A. Creswell, D. Kumaran, J. L. McClelland, and F. Hill. Language models show human-like
content effects on reasoning. CoRR, abs/2207.07051, 2022.
[33] X. Deng, Y. Su, A. Lees, Y. Wu, C. Yu, and H. Sun. Reasonbert: Pre-trained to reason with distant supervision. In M. Moens, X. Huang,
L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021,
Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6112–6127. Association for Computational Linguistics, 2021.
[34] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In
J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and
Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.


[35] Q. Dong, Z. Qin, H. Xia, T. Feng, S. Tong, H. Meng, L. Xu, Z. Wei, W. Zhan, B. Chang, S. Li, T. Liu, and Z. Sui. Premise-based multimodal
reasoning: Conditional inference on joint textual and visual clues. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of
the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27,
2022, pages 932–946. Association for Computational Linguistics, 2022.
[36] L. Du, X. Ding, T. Liu, and B. Qin. Learning event graph knowledge for abductive reasoning. In ACL/IJCNLP (1), pages 5181–5190.
Association for Computational Linguistics, 2021.
[37] L. Du, X. Ding, K. Xiong, T. Liu, and B. Qin. e-care: a new dataset for exploring explainable causal reasoning. In ACL (1), pages 432–446.
Association for Computational Linguistics, 2022.
[38] D. Emelin, R. L. Bras, J. D. Hwang, M. Forbes, and Y. Choi. Moral stories: Situated reasoning about norms, intents, actions, and their
consequences. In EMNLP (1), pages 698–718. Association for Computational Linguistics, 2021.
[39] Z. Fei, Q. Zhang, T. Gui, D. Liang, S. Wang, W. Wu, and X. Huang. CQG: A simple and effective controlled generation framework for
multi-hop question generation. In ACL (1), pages 6896–6906. Association for Computational Linguistics, 2022.
[40] Y. Feng, X. Chen, B. Y. Lin, P. Wang, J. Yan, and X. Ren. Scalable multi-hop relational reasoning for knowledge-aware question
answering. In EMNLP (1), pages 1295–1309. Association for Computational Linguistics, 2020.
[41] M. A. Finocchiaro. Informal logic and the theory of reasoning. Informal Logic, 6(2), 1984.
[42] M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, and Y. Choi. Social chemistry 101: Learning to reason about social and moral norms. In
EMNLP (1), pages 653–670. Association for Computational Linguistics, 2020.
[43] A.-V. P. Francesco Bellucci. Peirce’s logic. The Internet Encyclopedia of Philosophy, ISSN 2161-0002, 2022. https://iep.utm.edu/peir-log/.
[44] S. Gabriel, S. Hallinan, M. Sap, P. Nguyen, F. Roesner, E. Choi, and Y. Choi. Misinfo reaction frames: Reasoning about readers’ reactions
to news headlines. In ACL (1), pages 3108–3127. Association for Computational Linguistics, 2022.
[45] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant. Did aristotle use a laptop? A question answering benchmark with
implicit reasoning strategies. Trans. Assoc. Comput. Linguistics, 9:346–361, 2021.
[46] A. I. Goldman. Epistemology and cognition. harvard university Press, 1986.
[47] T. Govier. Critical thinking as argument analysis. Argumentation, 3(2):115–126, 1989.
[48] I. Habernal, H. Wachsmuth, I. Gurevych, and B. Stein. The argument reasoning comprehension task: Identification and reconstruction
of implicit warrants. In NAACL-HLT, pages 1930–1940. Association for Computational Linguistics, 2018.
[49] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, L. Benson, L. Sun, E. Zubova, Y. Qiao, M. Burtell, D. Peng, J. Fan, Y. Liu, B. Wong,
M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang, S. Joty, A. R. Fabbri, W. Kryscinski, X. V. Lin, C. Xiong, and D. Radev. Folio: Natural
language reasoning with first-order logic. CoRR, abs/2209.00840, 2022.
[50] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding.
In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[51] J. J. Hintikka. logic. Encyclopedia Britannica, 9 Jun. 2022, 2022. https://www.britannica.com/topic/logic.
[52] M. Ho, A. Sharma, J. Chang, M. Saxon, S. Levy, Y. Lu, and W. Y. Wang. Wikiwhy: Answering and explaining cause-and-effect questions.
CoRR, abs/2210.12152, 2022.
[53] N. Ho, L. Schmid, and S. Yun. Large language models are reasoning teachers. CoRR, abs/2212.10071, 2022.
[54] X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning
steps. In COLING, pages 6609–6625. International Committee on Computational Linguistics, 2020.
[55] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of
named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31
July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 782–792.
ACL, 2011.
[56] R. Hong, H. Zhang, X. Yu, and C. Zhang. METGEN: A module-based entailment tree generation framework for answer explanation. In
M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, editors, Findings of the Association for Computational Linguistics: NAACL 2022, Seattle,
WA, United States, July 10-15, 2022, pages 1887–1905. Association for Computational Linguistics, 2022.
[57] M. M. Hossain, V. Kovatchev, P. Dutta, T. Kao, E. Wei, and E. Blanco. An analysis of natural language inference benchmarks through the
lens of negation. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 9106–9118. Association for Computational Linguistics, 2020.
[58] J. Huang and K. C. Chang. Towards reasoning in large language models: A survey. CoRR, abs/2212.10403, 2022.
[59] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han. Large language models can self-improve. CoRR, abs/2210.11610, 2022.
[60] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos QA: machine reading comprehension with contextual commonsense
reasoning. In EMNLP/IJCNLP (1), pages 2391–2401. Association for Computational Linguistics, 2019.
[61] Y. Huang and M. Yang. Breadth first reasoning graph for multi-hop question answering. In NAACL-HLT, pages 5810–5821. Association
for Computational Linguistics, 2021.
[62] Y. Huang, H. Zhang, R. Hong, X. Liang, C. Zhang, and D. Yu. Metalogic: Logical reasoning explanations with fine-grained structure.
CoRR, abs/2210.12487, 2022.


[63] P. J. Hurley. A concise introduction to logic. Cengage Learning, 2014.


[64] J. D. Hwang, C. Bhagavatula, R. L. Bras, J. Da, K. Sakaguchi, A. Bosselut, and Y. Choi. (comet-) atomic 2020: On symbolic and neural
commonsense knowledge graphs. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on
Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence,
EAAI 2021, Virtual Event, February 2-9, 2021, pages 6384–6392. AAAI Press, 2021.
[65] N. Inoue, P. Stenetorp, and K. Inui. R4C: A benchmark for evaluating RC systems to get the right answer for the right reason. In ACL,
pages 6740–6750. Association for Computational Linguistics, 2020.
[66] N. Inoue, H. Trivedi, S. Sinha, N. Balasubramanian, and K. Inui. Summarize-then-answer: Generating concise explanations for multi-hop
reading comprehension. In EMNLP (1), pages 6064–6080. Association for Computational Linguistics, 2021.
[67] H. Jhamtani and P. Clark. Learning to explain: Datasets and models for identifying valid reasoning chains in multihop question-
answering. In EMNLP (1), pages 137–150. Association for Computational Linguistics, 2020.
[68] Y. Jiang and M. Bansal. Avoiding reasoning shortcuts: Adversarial evaluation, training, and model development for multi-hop QA. In
ACL (1), pages 2726–2736. Association for Computational Linguistics, 2019.
[69] F. Jiao, Y. Guo, X. Song, and L. Nie. Merit: Meta-path guided contrastive learning for logical reasoning. In S. Muresan, P. Nakov, and
A. Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages
3496–3509. Association for Computational Linguistics, 2022.
[70] J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras, and Y. Choi. Maieutic prompting: Logically consistent reasoning
with recursive explanations. CoRR, abs/2205.11822, 2022.
[71] D. Kahneman. Thinking, fast and slow. Macmillan, 2011.
[72] S. M. Kazemi, N. Kim, D. Bhatia, X. Xu, and D. Ramachandran. LAMBADA: backward chaining for automated reasoning in natural
language. CoRR, abs/2212.13894, 2022.
[73] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. QASC: A dataset for question answering via sentence composition. In
AAAI, pages 8082–8090. AAAI Press, 2020.
[74] T. Khot, D. Khashabi, K. Richardson, P. Clark, and A. Sabharwal. Text modular networks: Learning to decompose tasks in the language
of existing models. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty,
and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 1264–1279. Association for Computational Linguistics,
2021.
[75] T. Khot, A. Sabharwal, and P. Clark. Scitail: A textual entailment dataset from science question answering. In S. A. McIlraith and
K. Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative
Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18),
New Orleans, Louisiana, USA, February 2-7, 2018, pages 5189–5197. AAAI Press, 2018.
[76] T. Klein and M. Nabi. Attention is (not) all you need for commonsense reasoning. In ACL (1), pages 4831–4836. Association for
Computational Linguistics, 2019.
[77] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. CoRR, abs/2205.11916, 2022.
[78] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova,
L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering
research. Trans. Assoc. Comput. Linguistics, 7:452–466, 2019.
[79] Y. K. Lal, N. Chambers, R. J. Mooney, and N. Balasubramanian. Tellmewhy: A dataset for answering why-questions in narratives. In
C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event,
August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 596–610. Association for Computational Linguistics, 2021.
[80] H. Le, C. Sankar, S. Moon, A. Beirami, A. Geramifard, and S. Kottur. DVD: A diagnostic dataset for multi-step reasoning in video
grounded dialogue. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1:
Long Papers), Virtual Event, August 1-6, 2021, pages 5651–5665. Association for Computational Linguistics, 2021.
[81] K. Lee, S. Hwang, S. Han, and D. Lee. Robustifying multi-hop QA through pseudo-evidentiality training. In ACL/IJCNLP (1), pages
6110–6119. Association for Computational Linguistics, 2021.
[82] D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Commun. ACM, 38(11):32–38, 1995.
[83] Z. Liang, S. Bethard, and M. Surdeanu. Explainable multi-hop verbal reasoning through internal monologue. In NAACL-HLT, pages
1225–1250. Association for Computational Linguistics, 2021.
[84] B. Y. Lin, H. Sun, B. Dhingra, M. Zaheer, X. Ren, and W. W. Cohen. Differentiable open-ended commonsense reasoning. In NAACL-HLT,
pages 4611–4625. Association for Computational Linguistics, 2021.
[85] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren. Commongen: A constrained text generation challenge for
generative commonsense reasoning. In T. Cohn, Y. He, and Y. Liu, editors, Findings of the Association for Computational Linguistics: EMNLP
2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1823–1840. Association for Computational
Linguistics, 2020.
[86] K. Lin, O. Tafjord, P. Clark, and M. Gardner. Reasoning over paragraph effects in situations. In A. Fisch, A. Talmor, R. Jia, M. Seo,
E. Choi, and D. Chen, editors, Proceedings of the 2nd Workshop on Machine Reading for Question Answering, MRQA@EMNLP 2019, Hong
Kong, China, November 4, 2019, pages 58–62. Association for Computational Linguistics, 2019.
[87] H. Liu, L. Cui, J. Liu, and Y. Zhang. Natural language inference in context - investigating contextual reasoning over long texts. In
Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial
Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9,
2021, pages 13388–13396. AAAI Press, 2021.
[88] H. Liu and P. Singh. Conceptnet—a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[89] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang. Logiqa: A challenge dataset for machine reading comprehension with logical
reasoning. In C. Bessiere, editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020,
pages 3622–3628. ijcai.org, 2020.
[90] J. Liu, A. Liu, X. Lu, S. Welleck, P. West, R. L. Bras, Y. Choi, and H. Hajishirzi. Generated knowledge prompting for commonsense
reasoning. In ACL (1), pages 3154–3169. Association for Computational Linguistics, 2022.
[91] J. Locke. An essay concerning human understanding. Kay & Troutman, 1847.
[92] A. Madaan, D. Rajagopal, N. Tandon, Y. Yang, and E. H. Hovy. Could you give me a hint ? generating inference graphs for defeasible
reasoning. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP
2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 5138–5147. Association for Computational
Linguistics, 2021.
[93] A. Madaan, N. Tandon, D. Rajagopal, P. Clark, Y. Yang, and E. H. Hovy. Think about it! improving defeasible reasoning by first modeling
the question scenario. In EMNLP (1), pages 6291–6310. Association for Computational Linguistics, 2021.
[94] L. C. Magister, J. Mallinson, J. Adámek, E. Malmi, and A. Severyn. Teaching small language models to reason. CoRR, abs/2212.08410,
2022.
[95] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question
answering. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2381–2391. Association for Computational
Linguistics, 2018.
[96] S. Min, E. Wallace, S. Singh, M. Gardner, H. Hajishirzi, and L. Zettlemoyer. Compositional questions do not necessitate multi-hop
reasoning. In ACL (1), pages 4249–4257. Association for Computational Linguistics, 2019.
[97] S. Min, V. Zhong, L. Zettlemoyer, and H. Hajishirzi. Multi-hop reading comprehension through question decomposition and rescoring.
In ACL (1), pages 6097–6109. Association for Computational Linguistics, 2019.
[98] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. F. Allen. A corpus and cloze evaluation
for deeper understanding of commonsense stories. In K. Knight, A. Nenkova, and O. Rambow, editors, NAACL HLT 2016, The 2016
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego
California, USA, June 12-17, 2016, pages 839–849. The Association for Computational Linguistics, 2016.
[99] Y. Onoe, M. J. Q. Zhang, E. Choi, and G. Durrett. CREAK: A dataset for commonsense reasoning over entity knowledge. In NeurIPS
Datasets and Benchmarks, 2021.
[100] S. Ontañón, J. Ainslie, V. Cvicek, and Z. Fisher. Logicinference: A new dataset for teaching logical inference to seq2seq models. CoRR,
abs/2203.15099, 2022.
[101] L. Pan, W. Chen, W. Xiong, M. Kan, and W. Y. Wang. Unsupervised multi-hop question answering by question generation. In
NAACL-HLT, pages 5866–5880. Association for Computational Linguistics, 2021.
[102] P. Patel, S. Mishra, M. Parmar, and C. Baral. Is a question decomposition unit all we need? CoRR, abs/2205.12538, 2022.
[103] C. S. Peirce. Reasoning and the logic of things: The Cambridge conferences lectures of 1898. Harvard University Press, 1992.
[104] E. Perez, P. S. H. Lewis, W. Yih, K. Cho, and D. Kiela. Unsupervised question decomposition for question answering. In B. Webber,
T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pages 8864–8880. Association for Computational Linguistics, 2020.
[105] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Y. Gao, Q. Fu, J. Lou, and W. Chen. Reasoning like program executors. CoRR, abs/2201.11473,
2022.
[106] A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. V. Durme. Hypothesis only baselines in natural language inference. In M. Nissim,
J. Berant, and A. Lenci, editors, Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT
2018, New Orleans, Louisiana, USA, June 5-6, 2018, pages 180–191. Association for Computational Linguistics, 2018.
[107] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language
models. CoRR, abs/2210.03350, 2022.
[108] B. Prystawski and N. D. Goodman. Why think step-by-step? reasoning emerges from the locality of experience. CoRR, abs/2304.03843,
2023.
[109] P. Qi, H. Lee, T. Sido, and C. D. Manning. Answering open-domain questions of varying reasoning steps from text. In EMNLP (1), pages
3599–3614. Association for Computational Linguistics, 2021.
[110] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen. Reasoning with language model prompting: A
survey. CoRR, abs/2212.09597, 2022.
[111] L. Qin, A. Bosselut, A. Holtzman, C. Bhagavatula, E. Clark, and Y. Choi. Counterfactual story reasoning and generation. In EMNLP/IJCNLP
(1), pages 5042–5052. Association for Computational Linguistics, 2019.
[112] L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. L. Bras, A. Bosselut, and Y. Choi. Back to the future: Unsupervised
backprop-based decoding for counterfactual and abductive commonsense reasoning. In EMNLP (1), pages 794–805. Association for
Computational Linguistics, 2020.
[113] L. Qiu, Y. Xiao, Y. Qu, H. Zhou, L. Li, W. Zhang, and Y. Yu. Dynamically fused graph network for multi-hop reasoning. In ACL (1),
pages 6140–6150. Association for Computational Linguistics, 2019.
[114] H. Qu, Y. Cao, J. Gao, L. Ding, and R. Xu. Interpretable proof generation via iterative backward reasoning. In NAACL-HLT, pages
2968–2981. Association for Computational Linguistics, 2022.
[115] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
[116] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning
with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
[117] N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! leveraging language models for commonsense reasoning. In ACL
(1), pages 4932–4942. Association for Computational Linguistics, 2019.
[118] H. Rashkin, M. Sap, E. Allaway, N. A. Smith, and Y. Choi. Event2mind: Commonsense inference on events, intents, and reactions. In
I. Gurevych and Y. Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018,
Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 463–473. Association for Computational Linguistics, 2018.
[119] A. Ravichander, M. Gardner, and A. Marasovic. CONDAQA: A contrastive reading comprehension dataset for reasoning about negation.
In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 8729–8755. Association for Computational
Linguistics, 2022.
[120] D. N. Ribeiro, S. Wang, X. Ma, R. Dong, X. Wei, H. Zhu, X. Chen, P. Xu, Z. Huang, A. O. Arnold, and D. Roth. Entailment tree explanations
via iterative retrieval-generation reasoner. In M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, editors, Findings of the Association for
Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 465–475. Association for Computational
Linguistics, 2022.
[121] M. Roemmele, C. A. Bejan, and A. S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In
Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford,
California, USA, March 21-23, 2011. AAAI, 2011.
[122] R. Rudinger, V. Shwartz, J. D. Hwang, C. Bhagavatula, M. Forbes, R. L. Bras, N. A. Smith, and Y. Choi. Thinking like a skeptic: Defeasible
inference in natural language. In T. Cohn, Y. He, and Y. Liu, editors, Findings of the Association for Computational Linguistics: EMNLP
2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4661–4675. Association for Computational
Linguistics, 2020.
[123] D. D. Runes. The dictionary of philosophy. Citadel Press, 2001.
[124] M. Sadat and C. Caragea. Scinli: A corpus for natural language inference on scientific text. In S. Muresan, P. Nakov, and A. Villavicencio,
editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022,
Dublin, Ireland, May 22-27, 2022, pages 7399–7409. Association for Computational Linguistics, 2022.
[125] M. Saeidi, M. Bartolo, P. S. H. Lewis, S. Singh, T. Rocktäschel, M. Sheldon, G. Bouchard, and S. Riedel. Interpretation of natural language
rules in conversational machine reading. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2087–2097. Association
for Computational Linguistics, 2018.
[126] S. Saha, S. Ghosh, S. Srivastava, and M. Bansal. Prover: Proof generation for interpretable reasoning over rules. In EMNLP (1), pages
122–136. Association for Computational Linguistics, 2020.
[127] S. Saha, Y. Nie, and M. Bansal. Conjnli: Natural language inference over conjunctive sentences. In B. Webber, T. Cohn, Y. He, and
Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
16-20, 2020, pages 8240–8252. Association for Computational Linguistics, 2020.
[128] S. Saha, P. Yadav, and M. Bansal. multiprover: Generating multiple proofs for improved interpretability in rule reasoning. In NAACL-HLT,
pages 3662–3677. Association for Computational Linguistics, 2021.
[129] S. Sanyal, Z. Liao, and X. Ren. Robustlr: Evaluating robustness to logical perturbation in deductive reasoning. CoRR, abs/2205.12598,
2022.
[130] S. Sanyal, H. Singh, and X. Ren. Fairr: Faithful and robust deductive reasoning over natural language. In ACL (1), pages 1075–1093.
Association for Computational Linguistics, 2022.
[131] S. Sanyal, Y. Xu, S. Wang, Z. Yang, R. Pryzant, W. Yu, C. Zhu, and X. Ren. APOLLO: A simple approach for adaptive pretraining of
language models for logical reasoning. CoRR, abs/2212.09282, 2022.
[132] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi. ATOMIC: an atlas of machine
commonsense for if-then reasoning. In AAAI, pages 3027–3035. AAAI Press, 2019.
[133] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi. Social bias frames: Reasoning about social and power implications of
language. In ACL, pages 5477–5490. Association for Computational Linguistics, 2020.
[134] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi. Social iqa: Commonsense reasoning about social interactions. In EMNLP/IJCNLP
(1), pages 4462–4472. Association for Computational Linguistics, 2019.
[135] A. Saparov and H. He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. CoRR, abs/2210.01240,
2022.
[136] K. Shridhar, A. Stolfo, and M. Sachan. Distilling multi-step reasoning capabilities of large language models into smaller models via
semantic decompositions. CoRR, abs/2212.00193, 2022.
[137] K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In
EMNLP/IJCNLP (1), pages 4505–4514. Association for Computational Linguistics, 2019.
[138] R. Speer, J. Chin, and C. Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In S. Singh and S. Markovitch,
editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages
4444–4451. AAAI Press, 2017.
[139] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska,
A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain,
A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller, A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang,
A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, A. Venkatesh, A. Gholamidavoodi, A. Tabassum, A. Menezes, A. Kirubarajan,
A. Mullokandov, A. Sabharwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, and et al. Beyond the imitation game: Quantifying and
extrapolating the capabilities of language models. CoRR, abs/2206.04615, 2022.
[140] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi. A corpus for reasoning about natural language grounded in photographs. In
A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics,
ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6418–6428. Association for Computational Linguistics,
2019.
[141] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging
big-bench tasks and whether chain-of-thought can solve them. CoRR, abs/2210.09261, 2022.
[142] O. Tafjord, B. Dalvi, and P. Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In
ACL/IJCNLP (Findings), volume ACL/IJCNLP 2021 of Findings of ACL, pages 3621–3634. Association for Computational Linguistics,
2021.
[143] O. Tafjord, B. D. Mishra, and P. Clark. Entailer: Answering questions with faithful and truthful chains of reasoning. CoRR, abs/2210.12217,
2022.
[144] A. Talmor, J. Herzig, N. Lourie, and J. Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge.
In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and
Short Papers), pages 4149–4158. Association for Computational Linguistics, 2019.
[145] A. Talmor, O. Tafjord, P. Clark, Y. Goldberg, and J. Berant. Leap-of-thought: Teaching pre-trained models to systematically reason over
implicit knowledge. In NeurIPS, 2020.
[146] A. Talmor, O. Yoran, R. L. Bras, C. Bhagavatula, Y. Goldberg, Y. Choi, and J. Berant. Commonsenseqa 2.0: Exposing the limits of AI
through gamification. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets
and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, 2021.
[147] A. Tamborrino, N. Pellicanò, B. Pannier, P. Voitot, and L. Naudin. Pre-training is (almost) all you need: An application to commonsense
reasoning. In ACL, pages 3878–3887. Association for Computational Linguistics, 2020.
[148] N. Tandon, B. Dalvi, J. Grus, W. Yih, A. Bosselut, and P. Clark. Reasoning about actions and state changes by injecting commonsense
knowledge. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 57–66. Association for Computational Linguistics,
2018.
[149] N. Tandon, B. Dalvi, K. Sakaguchi, P. Clark, and A. Bosselut. WIQA: A dataset for "what if..." reasoning over procedural text. In
EMNLP/IJCNLP (1), pages 6075–6084. Association for Computational Linguistics, 2019.
[150] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Is multihop QA in dire condition? measuring and reducing disconnected
reasoning. In EMNLP (1), pages 8846–8863. Association for Computational Linguistics, 2020.
[151] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition. Trans.
Assoc. Comput. Linguistics, 10:539–554, 2022.
[152] M. Tsuchiya. Performance impact caused by hidden bias of training data for recognizing textual entailment. In N. Calzolari, K. Choukri,
C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga,
editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May
7-12, 2018. European Language Resources Association (ELRA), 2018.
[153] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In
I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural
Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach,
CA, USA, pages 5998–6008, 2017.
[154] D. Vilares and C. Gómez-Rodríguez. HEAD-QA: A healthcare dataset for complex reasoning. In ACL (1), pages 960–966. Association
for Computational Linguistics, 2019.
[155] D. N. Walton. What is reasoning? What is an argument? The Journal of Philosophy, 87(8):399–419, 1990.
[156] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language
models. CoRR, abs/2203.11171, 2022.
[157] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto,
O. Vinyals, P. Liang, J. Dean, and W. Fedus. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022.
[158] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large
language models. CoRR, abs/2201.11903, 2022.
[159] J. Welbl, P. Stenetorp, and S. Riedel. Constructing datasets for multi-hop reading comprehension across documents. Trans. Assoc.
Comput. Linguistics, 6:287–302, 2018.
[160] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. In ICLR
(Poster), 2016.
[161] A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In M. A.
Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers),
pages 1112–1122. Association for Computational Linguistics, 2018.
[162] Y. Wu, M. Gardner, P. Stenetorp, and P. Dasigi. Generating data to mitigate spurious correlations in natural language inference datasets.
In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2660–2676. Association for Computational
Linguistics, 2022.
[163] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In The Tenth
International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[164] H. Yanaka, K. Mineshima, D. Bekki, K. Inui, S. Sekine, L. Abzianidze, and J. Bos. Can neural networks understand monotonicity
reasoning? In T. Linzen, G. Chrupala, Y. Belinkov, and D. Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP:
Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@ACL 2019, Florence, Italy, August 1, 2019, pages 31–40. Association
for Computational Linguistics, 2019.
[165] H. Yanaka, K. Mineshima, D. Bekki, K. Inui, S. Sekine, L. Abzianidze, and J. Bos. HELP: A dataset for identifying shortcomings of neural
models in monotonicity reasoning. In R. Mihalcea, E. Shutova, L. Ku, K. Evang, and S. Poria, editors, Proceedings of the Eighth Joint
Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT 2019, Minneapolis, MN, USA, June 6-7, 2019, pages 250–255.
Association for Computational Linguistics, 2019.
[166] K. Yang, J. Deng, and D. Chen. Generating natural language proofs with verifier-guided search. CoRR, abs/2205.12443, 2022.
[167] Z. Yang, L. Dong, X. Du, H. Cheng, E. Cambria, X. Liu, J. Gao, and F. Wei. Language models as inductive reasoners. CoRR, abs/2212.10923,
2022.
[168] Z. Yang, X. Du, R. Mao, J. Ni, and E. Cambria. Logical reasoning over natural language as knowledge representation: A survey. CoRR,
abs/2303.12023, 2023.
[169] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable
multi-hop question answering. In EMNLP, pages 2369–2380. Association for Computational Linguistics, 2018.
[170] M. Yasunaga, H. Ren, A. Bosselut, P. Liang, and J. Leskovec. QA-GNN: reasoning with language models and knowledge graphs for
question answering. In NAACL-HLT, pages 535–546. Association for Computational Linguistics, 2021.
[171] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, and R. Pasunuru. Complementary explanations for effective in-context learning.
CoRR, abs/2211.13892, 2022.
[172] D. Yin, L. H. Li, Z. Hu, N. Peng, and K. Chang. Broaden the vision: Geo-diverse visual commonsense reasoning. In M. Moens, X. Huang,
L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021,
Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 2115–2129. Association for Computational Linguistics, 2021.
[173] W. Yin, D. R. Radev, and C. Xiong. Docnli: A large-scale dataset for document-level natural language inference. In C. Zong, F. Xia,
W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021,
volume ACL/IJCNLP 2021 of Findings of ACL, pages 4913–4922. Association for Computational Linguistics, 2021.
[174] N. Young, Q. Bao, J. Bensemann, and M. Witbrock. Abductionrules: Training transformers to explain unexpected inputs. In S. Muresan,
P. Nakov, and A. Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27,
2022, pages 218–227. Association for Computational Linguistics, 2022.
[175] J. Yu, W. Liu, S. Qiu, Q. Su, K. Wang, X. Quan, and J. Yin. Low-resource generation of multi-hop reasoning questions. In ACL, pages
6729–6739. Association for Computational Linguistics, 2020.
[176] P. Yu, T. Wang, O. Golovneva, B. AlKhamissy, G. Ghosh, M. T. Diab, and A. Celikyilmaz. ALERT: adapting language models to reasoning
tasks. CoRR, abs/2212.08286, 2022.
[177] W. Yu, Z. Jiang, Y. Dong, and J. Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In ICLR. OpenReview.net,
2020.
[178] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman. Star: Bootstrapping reasoning with reasoning. CoRR, abs/2203.14465, 2022.
[179] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In E. Riloff,
D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4, 2018, pages 93–104. Association for Computational Linguistics, 2018.
[180] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In A. Korhonen, D. R.
Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence,
Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics, 2019.
[181] H. Zhang, L. H. Li, T. Meng, K. Chang, and G. Van den Broeck. On the paradox of learning to reason from data. CoRR, abs/2205.11502,
2022.
[182] L. Zhang, Q. Lyu, and C. Callison-Burch. Reasoning about goals, steps, and temporal ordering with wikihow. In EMNLP (1), pages
4630–4639. Association for Computational Linguistics, 2020.
[183] X. Zhang, A. Bosselut, M. Yasunaga, H. Ren, P. Liang, C. D. Manning, and J. Leskovec. Greaselm: Graph reasoning enhanced language
models. In ICLR. OpenReview.net, 2022.
[184] Z. Zhang, A. Zhang, M. Li, and A. Smola. Automatic chain of thought prompting in large language models. CoRR, abs/2210.03493, 2022.
[185] C. Zhao, C. Xiong, C. Rosset, X. Song, P. N. Bennett, and S. Tiwary. Transformer-xh: Multi-evidence reasoning with extra hop attention.
In ICLR. OpenReview.net, 2020.
[186] C. Zheng and P. Kordjamshidi. SRLGRN: semantic role labeling graph reasoning network. In EMNLP (1), pages 8881–8891. Association
for Computational Linguistics, 2020.
[187] V. Zhong and L. Zettlemoyer. E3: entailment-driven extracting and editing for conversational machine reading. In A. Korhonen, D. R.
Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence,
Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 2310–2320. Association for Computational Linguistics, 2019.
[188] W. Zhong, T. Ma, J. Wang, J. Yin, T. Zhao, C.-Y. Lin, and N. Duan. Disentangling reasoning capabilities from language models with
compositional reasoning transformers. CoRR, abs/2210.11265, 2022.
[189] W. Zhong, S. Wang, D. Tang, Z. Xu, D. Guo, Y. Chen, J. Wang, J. Yin, M. Zhou, and N. Duan. Analytical reasoning of text. In M. Carpuat,
M. de Marneffe, and I. V. M. Ruíz, editors, Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United
States, July 10-15, 2022, pages 2306–2319. Association for Computational Linguistics, 2022.
[190] B. Zhou, K. Richardson, X. Yu, and D. Roth. Learning to decompose: Hypothetical question decomposition based on comparable texts.
CoRR, abs/2210.16865, 2022.
[191] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi. Least-to-most prompting
enables complex reasoning in large language models. CoRR, abs/2205.10625, 2022.
[192] P. Zhou, R. Khanna, S. Lee, B. Y. Lin, D. Ho, J. Pujara, and X. Ren. RICA: evaluating robust inference capabilities based on commonsense
axioms. In M. Moens, X. Huang, L. Specia, and S. W. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7560–7579. Association
for Computational Linguistics, 2021.