Aristo
Building the next generation of systems that can systematically reason, explain, and continually improve over time
- Systematic reasoning and explanation
- Teachable reasoning systems
- Continual learning with memory-based architectures
- Knowledge and belief
- Universal mathematical reasoning
Recent Updates
Towards Teachable Reasoning Systems
April 27, 2022
This paper describes our work towards Teachable Reasoning Systems. First, EntailmentWriter searches for a chain of reasoning from facts it believes…
Memory-assisted prompt editing to improve GPT-3 after deployment
April 20, 2022
Large LMs such as GPT-3 are powerful, but can make mistakes that are obvious to humans. Memory-assisted prompt editing allows users to give…
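The mechanism is simple enough to sketch. Below is a minimal, hypothetical illustration of the general idea rather than the paper's implementation: user corrections are stored as (question, feedback) pairs, and the most similar past feedback is prepended to future prompts. The function names (`record_feedback`, `edited_prompt`) and the similarity heuristic are invented for this sketch.

```python
# Minimal sketch of memory-assisted prompt editing (not the paper's code).
from difflib import SequenceMatcher

memory: list[tuple[str, str]] = []  # (question, user feedback) pairs

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_feedback(question: str, feedback: str) -> None:
    """Store a user's correction so it can steer similar future questions."""
    memory.append((question, feedback))

def edited_prompt(question: str, threshold: float = 0.6) -> str:
    """Prepend the most similar stored feedback, if any, to the prompt."""
    if memory:
        past_q, feedback = max(memory, key=lambda m: similarity(m[0], question))
        if similarity(past_q, question) >= threshold:
            return f"Hint: {feedback}\nQuestion: {question}"
    return f"Question: {question}"

# After the model misreads "what sounds like X" as asking for a definition,
# the user's clarification is remembered and applied to similar questions.
record_feedback("What word sounds like 'sighted'?",
                "I want a word that sounds similar (a homophone), not a definition.")
print(edited_prompt("What word sounds like 'allowed'?"))
```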
DREAM: Improving Situational QA by First Elaborating the Situation
March 1, 2022
When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests…
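A rough sketch of this elaborate-then-answer pattern appears below. It assumes only a generic language-model call; `call_lm` is a stub, not a real API, and the elaboration prompt paraphrases the paper's framing rather than quoting it.

```python
# Sketch of an elaborate-then-answer pipeline in the spirit of DREAM
# (assumed shape, not the paper's code).
def call_lm(prompt: str) -> str:
    raise NotImplementedError("plug in a language model client here")

def situational_qa(situation: str, question: str) -> str:
    # Stage 1: elaborate the situation into an explicit "scene".
    scene = call_lm(
        "Describe the likely motivations, emotions, consequences, and "
        f"social norms in this situation: {situation}"
    )
    # Stage 2: answer the question conditioned on the elaboration.
    return call_lm(
        f"Situation: {situation}\nElaboration: {scene}\n"
        f"Question: {question}\nAnswer:"
    )
```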
Explaining Answers with Entailment Trees
November 1, 2021
EntailmentBank is a unique dataset of multi-step entailment trees. Each tree shows how known facts combine to entail the answer to a question. From…
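To make the tree structure concrete, here is a minimal data structure for a multi-step entailment tree. This is illustrative only: EntailmentBank's actual release is JSON with sentence identifiers, and the statements below are invented.

```python
# Illustrative entailment-tree structure (not EntailmentBank's real format).
from dataclasses import dataclass, field

@dataclass
class Node:
    statement: str                      # a known fact or an entailed conclusion
    premises: list["Node"] = field(default_factory=list)

def render(node: Node, depth: int = 0) -> str:
    """Render the tree as an indented proof: each conclusion above its premises."""
    lines = ["  " * depth + node.statement]
    for premise in node.premises:
        lines.append(render(premise, depth + 1))
    return "\n".join(lines)

# A two-step tree: two facts entail an intermediate conclusion, which
# combines with a third fact to entail the answer.
tree = Node(
    "the plants will die",
    premises=[
        Node("the plants cannot get sunlight",
             premises=[Node("ash clouds block sunlight"),
                       Node("the eruption produced an ash cloud")]),
        Node("plants need sunlight to survive"),
    ],
)
print(render(tree))
```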
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief
November 1, 2021
Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions…
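A toy sketch of the memory-plus-constraints idea follows. It is far simpler than the paper's weighted constraint solver: it merely caches the model's yes/no answers as "beliefs" and flags pairs that violate known implication constraints. All content here is invented for illustration.

```python
# Toy sketch of the BeliefBank idea (not the paper's weighted solver).
beliefs = {
    "a swallow is a bird": True,
    "a swallow has feathers": False,  # inconsistent with the first belief
}
# Each constraint (x, y) encodes "if x is true then y must be true".
constraints = [("a swallow is a bird", "a swallow has feathers")]

def violations() -> list[tuple[str, str]]:
    """Return constraints whose premise is believed but whose conclusion is denied."""
    return [
        (x, y) for x, y in constraints
        if beliefs.get(x) is True and beliefs.get(y) is False
    ]

print(violations())  # [('a swallow is a bird', 'a swallow has feathers')]
```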
Research Areas
Teachable Reasoning Systems
By interacting with and giving feedback on a system’s reasoning, a user can teach the system so it continually improves over time – without model retraining.
Neuro-Symbolic Reasoning and Explanation
Solving problems by generating consistent, faithful chains of reasoning using neural components.
Modular Models
By learning to chain together existing models, a system can solve complex problems beyond the capabilities of any individual component (a minimal sketch follows this list).
Universal Mathematical Reasoners
Creating models with built-in mathematical reasoning skills that can be rapidly fine-tuned for a wide variety of mathematical tasks.
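As referenced under Modular Models above, the chaining idea can be sketched in a few lines. The modules below are stand-in functions with invented names; in practice each would be a trained model, and the chain itself could be learned rather than fixed.

```python
# Minimal sketch of a modular pipeline: each module is just a function,
# and a controller feeds each module's output into the next.
from typing import Callable

Module = Callable[[str], str]

def decompose(question: str) -> str:
    """Stand-in for a model that splits a question into sub-questions."""
    return f"[sub-questions of: {question}]"

def retrieve(text: str) -> str:
    """Stand-in for a retrieval model that attaches supporting context."""
    return f"[context for: {text}]"

def answer(augmented: str) -> str:
    """Stand-in for a QA model that reads the augmented input."""
    return f"[answer given {augmented}]"

def run_chain(question: str, chain: list[Module]) -> str:
    state = question
    for module in chain:
        state = module(state)
    return state

print(run_chain("Which is taller, Denali or K2?", [decompose, retrieve, answer]))
```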
Macaw is a high-performance question-answering (QA) model that outperforms other popular current language models while being an order of magnitude smaller. This demo lets you explore Macaw's answers and compare them to those of the popular GPT-3 language model on a benchmark set of questions.
Try the demo

Like RuleTaker, ProofWriter determines whether statements are True or False based on rules given in natural language, but it also generates the proofs behind its answers.
Try the demo
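To make the ProofWriter task above concrete, here is a small symbolic illustration: given natural-language facts and if-then rules, derive whether a statement holds and keep the chain of rule applications as the proof. Note this sketch shows the task, not the model; ProofWriter itself generates proofs with a neural generator, and the facts and rules below are invented.

```python
# Illustration of the ProofWriter *task* via naive forward chaining.
facts = {"Erin is kind", "Erin is big"}
rules = [
    ({"Erin is kind", "Erin is big"}, "Erin is nice"),
    ({"Erin is nice"}, "Erin is green"),
]

def prove(goal: str):
    """Forward-chain over the rules, recording each derivation step."""
    known, proof = set(facts), []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                proof.append(f"{' & '.join(sorted(premises))} -> {conclusion}")
                changed = True
    return (goal in known), proof

holds, steps = prove("Erin is green")
print("True" if holds else "False")
print("\n".join(steps))
```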
Recent Papers
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, Peter Clark
NeurIPS • 2023
Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback… (A minimal sketch of the refinement loop follows this list.)

A Logic for Expressing Log-Precision Transformers
William Merrill, Ashish Sabharwal
NeurIPS • 2023
One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be equivalently…

Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations
Nirbhay Modhe, Qiaozi Gao, A. Kalyan, Dhruv Batra, G. Thattai, G. Sukhatme
arXiv.org • 2023
Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based…

DISCO: Distilling Phrasal Counterfactuals with Large Language Models
Zeming Chen, Qiyue Gao, Kyle Richardson, Antoine Bosselut, Ashish Sabharwal
ACL • 2023
Recent methods demonstrate that data augmentation using counterfactual knowledge can teach models the causal structure of a task, leading to robust and generalizable models. However, such counterfactual data often has a limited scale and diversity if…

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, Ashish Sabharwal
ACL • 2023
Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either…
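The Self-Refine loop referenced above is easy to sketch: generate a draft, ask the same model for feedback, and rewrite until the feedback says the draft is fine. The shape below is assumed, not the authors' code; `call_lm` is a stub standing in for any LLM client, and the prompts and stopping rule are invented.

```python
# Minimal sketch of a Self-Refine-style loop (assumed shape).
def call_lm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def self_refine(task: str, max_iters: int = 3) -> str:
    """Generate an initial draft, then alternate self-feedback and refinement."""
    draft = call_lm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        feedback = call_lm(
            f"Task: {task}\nDraft: {draft}\n"
            "Give concrete feedback on the draft, or reply DONE if no changes are needed."
        )
        if "DONE" in feedback:
            break
        draft = call_lm(
            f"Task: {task}\nDraft: {draft}\nFeedback: {feedback}\n"
            "Rewrite the draft, applying the feedback."
        )
    return draft
```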
Recent Datasets
ParRoT (Parts and Relations of Things)
11,720 “X relation Y?” True/False questions on parts of everyday things and relational information about these parts
This is the dataset from "Do language models have coherent mental models of everyday things?", ACL 2023.
Belief and Reasoning Dataset
BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
BaRDa is a new belief and reasoning dataset for evaluating the factual correctness ("truth") and reasoning accuracy ("rationality", or "honesty") of language models. It was created in collaboration with, and with the support of, Open Philanthropy.
Lila
A math reasoning benchmark of over 140K natural language questions annotated with Python programs
A comprehensive benchmark for mathematical reasoning with over 140K natural language questions annotated with Python programs and natural language instructions. The dataset comes with multiple splits: Lila-IID (train, dev, test), Lila-OOD (train, dev, test), and Lila-Robust. (An illustrative question/program pair follows.)
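The pair below is invented to show the spirit of such an annotation, not an actual Lila item: a natural-language question paired with a Python program whose result is the answer.

```python
# Invented example in the spirit of a Lila annotation (not a dataset item).
question = "A shelf holds 3 rows of 14 books. How many books are on the shelf?"

def solution() -> int:
    rows, books_per_row = 3, 14
    return rows * books_per_row

assert solution() == 42  # the program's output is the gold answer
```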
Entailer
Data for "Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning", EMNLP 2022
Data for "Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning", EMNLP 2022
Recent Press
Persona-driven ChatGPT yields toxic, racist output
April 19, 2023
Changing ChatGPT's Persona Might Make It Malicious
April 17, 2023
This AI Paper Shows How ChatGPT’s Toxicity Can Increase Up To Six-Fold When Assigned A Persona
April 14, 2023
New study reveals ChatGPT's inherent toxicity when assigned different personas
April 13, 2023
'They’re All So Dirty and Smelly:' Study Unlocks ChatGPT's Inner Racist
April 13, 2023
Researchers discover a way to make ChatGPT consistently toxic
April 12, 2023
ChatGPT can turn toxic just by changing its assigned persona, researchers say
April 12, 2023
Researchers From Allen Institute for AI Introduce TeachMe: A Framework To Understand And Correct AI Models
January 17, 2023
Team
Chris Callison-Burch (Research)
Peter Clark (Research)
Ben Bogin (Young Investigator)
Bhavana Dalvi (Research)
Yuling Gu (Predoctoral Young Investigator)
Shashank Gupta (Research)
Ashwin Kalyan (Research)
Tushar Khot (Research)
Bodhisattwa Prasad Majumder (Research)
Kyle Richardson (Research)
Ashish Sabharwal (Research)
Oyvind Tafjord (Research)
Niket Tandon (Research)
Sarah Wiegreffe (Young Investigator)