Papers

  • Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, Tushar Khot. ICML 2023, Challenges in Deployable Generative AI Workshop, 2023. As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging. This work proposes Chain-of-Thought Hub, an open-source evaluation suite on the multi-step reasoning capabilities of large…
  • The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

    Nikil Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang. ACL, 2023. How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given language model? In this work, we study this question by contrasting social biases with non-social biases stemming from…
  • Aligning Language Models to User Opinions

    EunJeong Hwang, Bodhisattwa Prasad Majumder, Niket Tandon. arXiv, 2023. An important aspect of developing LLMs that interact with humans is to align models' behavior to their users. It is possible to prompt an LLM into behaving as a certain persona, especially a user group or ideological persona the model captured during its…
  • Anthropomorphization of AI: Opportunities and Risks

    A. Deshpande, Tanmay Rajpurohit, Karthik Narasimhan, A. Kalyan. arXiv.org, 2023. Anthropomorphization is the tendency to attribute human-like traits to non-human entities. It is prevalent in many social contexts -- children anthropomorphize toys, adults do so with brands, and it is a literary device. It is also a versatile tool in science…
  • CSTS: Conditional Semantic Textual Similarity

    A. Deshpande, Carlos E. Jimenez, Howard Chen, Vishvak Murahari, Victoria Graf, Tanmay Rajpurohit, A. Kalyan, Danqi Chen, Karthik Narasimhan. arXiv.org, 2023. Semantic textual similarity (STS) has been a cornerstone task in NLP that measures the degree of similarity between a pair of sentences, with applications in information retrieval, question answering, and embedding methods. However, it is an inherently…
  • OpenPI2.0: An Improved Dataset for Entity Tracking in Texts

    Li Zhang, Hai Xu, Abhinav Kommula, Niket Tandon, Chris Callison-Burch. arXiv, 2023. Representing texts as information about entities has long been deemed effective in event reasoning. We propose OpenPI2.0, an improved dataset for tracking entity states in procedural texts. OpenPI2.0 features not only canonicalized entities that facilitate…
  • Improving Language Models via Plug-and-Play Retrieval Feedback

    Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal. arXiv, 2023. Large language models (LLMs) exhibit remarkable performance across various NLP tasks. However, they often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios. Human feedback has been shown to…
  • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback

    Yao Fu, Hao Peng, Tushar Khot, Mirella Lapata. arXiv.org, 2023. We study whether multiple large language models (LLMs) can autonomously improve each other in a negotiation game by playing, reflecting, and criticizing. We are interested in this question because if LLMs were able to improve each other, it would imply the…
  • Can AI language models replace human participants?

    Danica Dillion, Niket Tandon, Yuling Gu, Kurt Gray. Trends in Cognitive Sciences, 2023. Recent work suggests that language models such as GPT can make human-like judgments across a number of domains. We explore whether and when language models might replace human participants in psychological science. We review nascent research, provide a…
  • Complexity-Based Prompting for Multi-Step Reasoning

    Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot. ICLR, 2023. We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer…