Teaching/Talks/Activities

Some miscellaneous notes and musings: Number Theory Meets Computability Theory (see also blog post); other lecture notes: Notes on Language Models, Attention and Transformers, Negation as Failure, Mixing Logic and Deep Learning: The Logic as Loss Function Approach, Introduction to Probability. Formal Techniques for Neural-symbolic Modeling taught at ESSLLI 2023, Language Model Programming taught at ESSLLI 2024

Recent Talks from me and my extended group: Brief (10 minute) introduction to Natural Language Understanding (NLU) and Language Modeling (intended for a non-technical audience); Overview of my work on diagnostic testing of neural models; Pushing the Limits of Rule Reasoning in Transformers (AAAI 2022), Breakpoint Transformers (EMNLP 2022); Learning to Decompose (EMNLP 2022) Decomposed Prompting (ICLR 2023); Language Model Programming: Themes and Prospects (overview of my recent work, given at the University of Tuebingen); Declarative Characterizations of Direct Preference Alignment Algorithms, Language Modeling by Language Models (recent work on automated scientific discovery, presented at the AI for scientific discovery workshop at NAACL, preprint forthcoming)

Recent News

Released the Open-Cot leaderboard on Huggingface that aims to track model improvements due to chain-of-thought prompting.
3 papers accepted to ACL 2024 on OLMO, DOLMA (our work on open-source large language models; both received best paper awards) and TimeArena (agent modeling with time constraints).
Taught a class this summer at ESSLLI with Gijs Wijnholds on Language Model Programming (our attempt to look at current NLP through the lens of conventional programming and programming theory)
2 papers accepted at EMNLP 2024: SUPER (LLM experiment agents; received outstanding paper award), Event causality via Synthetic Control (novel causal analysis techniquee for detecting event causality)
Papers at NeurIPS 2024 Paloma (LLM perplexity benchmarking), Declarative Characterizations of direct preference alignment algorithms (draft of recent work on formal characterizations of DPO, presented at M3L workshop), AucArena (work on environment for interactive agent modeling; presented at OWA workshop), SelfGoal (work on memory architectures for interactive agents; presented at OWA workshop)
Paper at NAACL 2025 SELFGOAL: Your Language Agents Already Know How to Achieve High-level Goals (LLM agent architectures)
Two papers accepted at ICML 2025 Understanding the Logic of Direct Preference Alignment through Logic (formal characterization of direct preference alignment algorithms, brief overview), ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (hard reasoning problems for LLMs)
New talk on automated science discovery, Language Modeling by Language Models. Presented at the NAACL 2025 AISD workshop. Pre-print here
I taught another version of our Language Model Programming course at ESSLLI 2025, with some new lectures on probabilistic programming for prompting and loss function decompilation (both recent research topics of mine).
I gave a keynote at the NALOMA 2025 workshop entitlted Understanding the Logic of Generative AI through Logic
We released the astabench leaderboard, technical paper, a large initiative to comprehensively evaluate LLM agents across different science tasks.
Paper accepted at EMNLP 2025 TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents (library for LLM-driven automated research.)
Paper accepted at NeurIPS 2025 Our work on Language Modeling by Language Models (research agents for autonomous machine learning research) was accepted as a spolight paper.
I gave an invited keynote at the International Workshop on Symbolic-Neural Learning (SNL2025) in Osaka Japan. Slides are here.

Selected Publications

Note: For the most up-to-date versions of my papers, please refer to the arxiv versions (unless stated otherwise).

Junyan Cheng, Peter Clark, Kyle Richardson (2025) Language Modeling by Language Models (NeurIPS 2025) [slides] [poster, NAACL AISD workshop] [Discovery Console] Spotlight paper

Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, Jiaxuan You (2025) TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents (EMNLP 2025) [code] paper [demo]

Jonathan Bragg et al. (2025) AstaBench: Rigorous Benchmarking of AI Agents with a Holistic Scientific Research Suite [dataset]

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi (2025) ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (ICML 2025) [dataset]

Kyle Richardson, Vivek Srikumar, Ashish Sabharwal. (2025) Understanding the Logic of Direct Preference Alignment through Logic (ICML 2025) [M3L draft] [slides] [ICML poster] [video] [code] [ICML lightning slides]

Ruihan Yang, Jiangjie Chen, Yikai Zhang, Siyu Yuan, Aili Chen, Kyle Richardson, Yanghua Xiao, Deqing Yang (2025) SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals (NAACL 2025) [project page]

Haoyu Wang, Fengze Liu, Jiayao Zhang, Dan Roth, Kyle Richardson. Event Causality Identification with Synthetic Control (EMNLP 2024) [code]

Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot (2024) SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories (EMNLP 2024) Outstanding paper award

Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge. (2024) PALOMA: A Benchmark for Evaluating Language Model Fit (NeurIPS 2024) [code] [data]

Jiangjie Chen, Siyu Yuan, Rong Ye, Bodhisattwa Prasad Majumder, Kyle Richardson. (2024) Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena (appeared at NeurIPS OWA workshop) [arxiv] [project page] [code]

Yikai Zhang, Siyu Yuan, Caiyu Hu, Kyle Richardson, Yanghua Xiao, Jiangjie Chen. (2024) TimeArena: Shaping Efficient Multitasking Language Agents in a Time-Aware Simulation (ACL 2024) [project page]

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, et al… Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi (2024) OLMo: Accelerating the Science of Language Models (ACL 2024) [code] [data] [model] Best Theme Paper Award (ACL)

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, et al. (2024) Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (ACL 2024) [code] [data] Best Resource Paper Award (ACL)

Dirk Groenveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge. (2023) Catwalk: A Unified Language Model Evaluation Framework for Many Datasets (technical report) [toolkit code] eval code]

Kyle Richardson, Ian Magnusson, Oyvind Tafjord, Akshita Bhagia, Iz Beltagy, Arman Cohan, Pradeep Dasigi, Jesse Dodge, Dirk Groeneveld, Yuling Gu, Ananya Harsh Jha, Tushar Khot, Nishant Subramani. (2023) Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk. (extended abstract GEM 2023) (details forthcoming)

Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schütze and Peter Clark. (2023) Language Models with Rationality (EMNLP 2023) [arxiv] [project page]

Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson (2023) DISCO: Distilling Counterfactuals with Large Language Models. (ACL 2023) [arxiv] [code]

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, Ashish Sabharwal (2023) Decomposed Prompting: A Modular Approach for Solving Complex Tasks (ICLR 2023) [arxiv] [code] [poster] [slides]

Gregor Betz, Kyle Richardson. (2023) Probabilistic coherence, logical consistency, and Bayesian learning: Neural language models as epistemic agents (PLOS One journal) [publisher] [data/resources]

Kyle Richardson, Ronen Tamari, Oren Sultan, Dafna Shahaf, Reut Tsarfaty and Ashish Sabharwal. (2022) Breakpoint Transformers for Modeling and Tracking Intermediate Beliefs. (EMNLP 2022) [arxiv] [code] [slides]

Ben Zhou, Kyle Richardson, Xiaodong Yu and Dan Roth. (2022) Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts (EMNLP 2022) [arxiv] [data/code]

Matthew Finlayson, Kyle Richardson , Ashish Sabharwal, Peter Clark (2022) What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment (EMNLP 2022) [arxiv] [code/data]

Gregor Betz, Kyle Richardson. (2022) Judgement Aggregation, Discursive Dilemma and Reflective Equilibrium: Neural Language Models as Self- Improving Doxastic Agents. Frontiers in Artificial Intelligence. [publisher]

Aarohi Srivastava et al (+441 authors) (2022) Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models [arxiv] [resources]

Tushar Khot, Kyle Richardson , Daniel Khashabi, Ashish Sabharwal (2022) Learning to Solve Complex Tasks by Talking to Agents (Findings of ACL) [arxiv] [code/data] [slides] [poster]

Kyle Richardson , Ashish Sabharwal (2022) Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability (AAAI2022) [arxiv] [code/data][slides] [poster]

Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson , Sameer Singh, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Yejin Choi (2022) PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts (Proceedings of NAACL) [arxiv] [slides]

Ronen Tamari, Kyle Richardson , Aviad Sar-Shalom, Noam Kahlon, Nelson F. Liu, Reut Tsarfaty and Dafna Shahaf (2022) Dyna-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking (*SEM2022) [arxiv] [code/data]

Gregor Betz, Kyle Richardson. (2022) DeepA2: A Modular Framework for Deep Argument Analysis with Pretrained Neural Text2Text Language Models (*SEM2022) [arxiv] [demo] [dataset] [model]

Hai Hu, He Zhou, Zuoyu Tian, Yiwen Zhang, Yina Patterson, Yanting Li, Yixin Nie, Kyle Richardson. (2021) Investigating Transfer Learning in Multi-lingual Pre-trained Language Models through Chinese Natural Language Inference Findings of ACL [code/data] [arxiv] [acl anthology]

Gregor Betz, Christian Voigt, Kyle Richardson. (2021) Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2 work in progress [arxiv]

Ben Zhou, Kyle Richardson, Qiang Ning, Tushar Khot, Ashish Sabharwal, Dan Roth. (2021) Temporal Reasoning on Implicit Events from Distant Supervision Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) [arxiv] [code] [data] [leaderboard] [slides]

Tushar Khot, Daniel Khashabi, Kyle Richardson, Peter Clark, Ashish Sabharwal (2021) Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) [arxiv] [code/data] [demo] [slides] [poster]

Gregor Betz, Christian Voigt, Kyle Richardson. (2021) Critical Thinking for Language Models Proceedings of International Conference on Computational Semantics (IWCS 2021) [arxiv] [data] [models] [blog] [proceedings] [video]

Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark (2021) Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge technical note [arxiv] [data]

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. (2020) CLUE: A Chinese Language Understanding Evaluation Benchmark. in Proceedings of International Conference on Computational Linguistics (COLING) [arxiv] [website/leaderboard] [code/data] [proceedings]

Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson and Eduard Hovy. (2020) A Dataset for Tracking Entities in Open Domain Procedural Text in Proceedings of International Conference on Empirical Methods in Natural Language Processing (EMNLP) [proceedings] [arxiv] [dataset] [code]

Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kubler, Lawrence S. Moss. (2020) OCNLI: Original Chinese Natural Language Inference Findings of EMNLP [arxiv] [code/data] [leaderboard] [acl_anthonology]

Sumithra Bhakthavatsalam, Kyle Richardson, Niket Tandon, Peter Clark (2020) Do Dogs have Whiskers? A New Knowledge Base of hasPart Relations technical note [arxiv] [data]

Atticus Geiger, Kyle Richardson, Christopher Potts (2020) Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation in Workshop on Analzying and Interpreting Neural Networks for NLP (BlackBoxNLP) [arxiv] [proceedings] [data]

Kyle Richardson, Ashish Sabharwal (2020). What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge. in Transactions of the Association for Computational Linguistics (TACL) [arxiv] [journal] [code/data][slides (EMNLP2020)]

Peter Clark, Oyvind Tafjord,Kyle Richardson (2020). Transformers as Soft Reasoners over Language. Proceedings of International Joint Conference on Artificial Intelligence (IJCAI) [arxiv] [proceedings] [demo][data] [data generator code]

Hai Hu, Qi Chen, Kyle Richardson, Atreyee Mukherjee, Lawrence S. Moss,Sandra Kuebler (2020). MonaLog: a Lightweight System for Natural Language Inference Based on Monotonicity. Proceedings of Society for Computation in Linguistics (SCIL 2020) [arxiv] [proceedings] [data]

Kyle Richardson, Hai Hu, Lawrence S. Moss, Ashish Sabharwal (2020). Probing Natural Language Inference Models through Semantic Fragments. Proceedings of Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI) [arxiv][aaai][code/data][slides]

Peter Clark,Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld,Michal Guerquin, Michael Schmitz (2020). From ‘F’ to ‘A’ on the N.Y. Regents Science Exams: An Overview of the Aristo Project AI Magazine[arxiv][New York Times, GeekWire]

Kyle Richardson (2018) New Resources and Ideas for Semantic Parser Induction. PhD Thesis, Institute for Natural Language Processing (IMS), Faculty of Computer Science, Electrical Engineering and Information Technology. University of Stuttgart, Germany [opus][slides][code/data][handout]

Kyle Richardson (2018) A Language for Function Signature Representations. Brief technical note. [arxiv][data]

Kyle Richardson, Jonathan Berant and Jonas Kuhn (2018). Polyglot Semantic Parsing in APIs. Proceedings of 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) [arxiv][data][notes][code][slides][video]

Kyle Richardson, Sina Zarrieß and Jonas Kuhn (2017). The Code2Text Challenge: Text Generation in Source Code Libraries (2017) Proceedings of International Natural Language Generation Conference (INLG) [arxiv][paper][inlg_slides][resources].

Kyle Richardson, Jonas Kuhn (2017). Function Assistant: A Tool for NL Querying of APIs. (2017) Proceedings of Empirical Methods in Natural Language Processing (EMNLP) [arxiv][paper][demo][resources][code][poster]

Kyle Richardson, Jonas Kuhn (2017). Learning Semantic Correspondences in Technical Documentation. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) [arvix][paper][notes][data][acl_poster][stuttgart slides][code].

Kyle Richardson, Jonas Kuhn. (2016) Learning to Make Inferences in a Semantic Parsing Task. Transactions of the Association for Computational Linguistics (TACL) [paper][data][acl_slides][video] [extended version (from thesis)] [based partly on cky/kbest implemention from here].

Cleo Condoravdi, Kyle Richardson, Vishal Sikka, Asuman Suenbuel, and Richard Waldinger (2015) Natural Language Access to Data: It Takes Common Sense!. in Twelfth International Symposium on Logical Formalizations of Commonsense Reasoning (Commonsense-15). AAAI Spring Symposium. [demo][link]

Cleo Condoravdi, Kyle Richardson, Vishal Sikka, Asuman Suenbuel, and Richard Waldinger (2014) Deduction for Natural Language Access to Data. in University of Coimbra CS Technical Reports, CISUC/TR 2014-02. Presented at Joint Workshop on Natural Language and Computer Science (NLCS) and Natural Language Services for Reasoners (NLSR).

Kyle Richardson and Jonas Kuhn (2014) UnixMan Corpus: A Resource for Language Learning in the Unix Domain. in Proceedings of Language Resources and Evaluation (LREC). [link] [data]

Sina Zarriess and Kyle Richardson. (2013) An Automatic Method for Building a Data-to-Text Generator. in Proceedings of 14th European Workshop on Natural Language Generation (ENLG) [link]

Richard Waldinger, Danny Bobrow, Cleo Condoravdi, Amar Das, Kyle Richardson. (2011) Accessing Structured Health Information through English Queries and Automatic Deduction. in Proceedings of AAAI Spring Symposium on Health Communications.

Kyle Richardson

Senior Research Scientist

Allen Institute for Artificial Intelligence

Biography

Teaching/Talks/Activities

Recent Posts

Number Theory Meets Computability Theory

Why Infinity is Strange

Understanding Lisp: Part 1

What is Kolmogorov Complexity?

Selected Publications