A Large-Scale Benchmark for Few-Shot Program Induction and Synthesis

Abstract

A landmark challenge for AI is to learn flexible, powerful representations from small numbers of examples. On an important class of tasks, hypotheses in the form of programs provide extreme generalization capabilities from surprisingly few examples. However, whereas large benchmarks of real images have spurred progress in meta-learning for deep networks, there is no comparably large, real program-synthesis dataset. This is because, while images are relatively easy to label from internet metadata or by non-expert annotators, generating meaningful input-output tests for program induction has proven hard to scale. In this work, we propose a new way of leveraging a collection of programs with associated unit tests to create a much larger collection of test-program pairs. We do so by extracting subprograms of each program and using the inputs of the overall program to obtain tests for each subprogram. This allows us to create PROGRES, a large-scale few-shot program-induction benchmark of real programs, and to propose new challenges in this domain. We analyze the effect of multiple design choices on transformer-based program induction and synthesis algorithms, pointing to shortcomings of current methods and suggesting multiple avenues for future work.

Publication
In International Conference on Machine Learning
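To make the abstract's construction concrete, below is a minimal sketch of the general idea of turning one program with unit-test inputs into many (subprogram, tests) pairs: execute the overall program on its test inputs and record, at every subexpression, the argument values it receives and the value it returns. This is an illustrative assumption about how such extraction could work, not the authors' actual pipeline; the toy expression-tree representation and names such as collect_subprogram_tests are hypothetical.

```python
# Illustrative sketch only; not the PROGRES extraction pipeline.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                       # "add", "mul", or "input"
    children: list = field(default_factory=list)
    name: str = ""                # variable name when op == "input"

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def evaluate(node, env, tests):
    """Evaluate `node` on `env`, recording an (inputs -> output) test for every subtree."""
    if node.op == "input":
        return env[node.name]
    args = [evaluate(c, env, tests) for c in node.children]
    out = OPS[node.op](*args)
    # Each run of the overall program contributes one input-output test for this subprogram.
    tests.setdefault(id(node), []).append((tuple(args), out))
    return out

def collect_subprogram_tests(program, unit_test_inputs):
    """Turn one program plus its unit-test inputs into many (subprogram, tests) pairs."""
    tests = {}
    for env in unit_test_inputs:
        evaluate(program, env, tests)
    return tests

# Example: the program (x + y) * x, executed on two unit-test inputs.
prog = Node("mul", [
    Node("add", [Node("input", name="x"), Node("input", name="y")]),
    Node("input", name="x"),
])
print(collect_subprogram_tests(prog, [{"x": 2, "y": 3}, {"x": 4, "y": 1}]))
```

Under this sketch, the subexpression `x + y` receives its own tests ((2, 3) -> 5 and (4, 1) -> 5) even though the original unit tests only covered the full program, which is how one program can yield a much larger set of test-program pairs.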