Lila is a comprehensive benchmark for mathematical reasoning, with over 140K natural language questions annotated with Python programs and natural language instructions. The dataset comes with multiple splits: Lila-IID (train, dev, test), Lila-OOD (train, dev, test), and Lila-Robust.
Swaroop Mishra,
Matthew Finlayson,
Pan Lu,
Leonard Tang,
Sean Welleck,
Chitta Baral,
Tanmay Rajpurohit,
Oyvind Tafjord,
Ashish Sabharwal,
Peter Clark,
Ashwin Kalyan