A Living Review of Symbolic Regression

Symbolic regression (SR) is a rapidly growing subfield of machine learning (ML) that aims to learn the analytical form of the model underlying a dataset by searching the space of mathematical expressions. Interest in SR is growing because it is naturally interpretable: it learns a transparent relationship between the input and the output that supports reasoning, in contrast to the black-box models learned by neural networks (NNs), where the input-output relationship is opaque and intractable.

This document provides a categorized list of state-of-the-art methods, datasets, and applications of SR. It accompanies the review entitled Interpretable Scientific Discovery with Symbolic Regression: A Review, which provides a technical and unified review of several research works on SR. The goal of this document is to list all research works on SR, so the list will continue to evolve.

The purpose of this note is to build a comprehensive database of materials on SR (research papers, open-source codes, datasets, useful online resources, educational materials, etc.) for those developing and applying these approaches in different research areas.

Reviews and Benchmarks
Summary of SR Methods
| Category | Brief description | Underlying methods | Learned model |
|---|---|---|---|
| Regression-based | This category presumes a fixed model structure. The linear approach defines the model as a linear combination of non-linear functions and reduces SR to a system of linear equations, whereas the non-linear approach relies on a multi-layer perceptron (MLP). In both cases, parameters are learned. | Linear SR (sparse regression); Non-linear SR | System of linear equations; Multi-layer perceptron (MLP) |
| Expression tree-based | This category treats mathematical equations as unary-binary trees whose internal nodes are mathematical operators (algebraic operators, analytical functions) and whose terminal nodes are constants and state variables. | Genetic programming (GP); Reinforcement learning (RL); Transformer neural network (TNN) | Tree structure; Policy; Seq2seq models |
| Physics-inspired | This category takes into account the units of measurement of physical variables (so-called dimensional analysis) to constrain the search space. | Deep learning; Polynomial fit; Brute-force search | Neural network parameters; Polynomial coefficients |
| Mathematics-inspired | This method uses the Meijer functions. | General mathematical function | Parameters of the Meijer functions |
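The expression tree representation described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Node` class, the small `UNARY` and `BINARY` operator tables, and the `to_prefix` helper are hypothetical and do not come from any particular SR package. The example tree encodes \(f(x) = 2x^2 + \cos(x)\), and the prefix traversal shows one common way such a tree can be flattened into a token sequence.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical operator tables, chosen only for this illustration.
UNARY = {"cos": math.cos, "sin": math.sin, "exp": math.exp}
BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


@dataclass(frozen=True)
class Node:
    """One node of a unary-binary expression tree.

    Internal nodes carry an operator name and one or two children.
    Terminal nodes carry a variable name (str) or a constant (float).
    """
    label: Union[str, float]
    children: Tuple["Node", ...] = ()


def evaluate(node: Node, env: Dict[str, float]) -> float:
    """Recursively evaluate the tree for the variable values in env."""
    if not node.children:
        # Terminal node: look up a variable or return the constant.
        if isinstance(node.label, str):
            return env[node.label]
        return float(node.label)
    args = [evaluate(child, env) for child in node.children]
    if len(args) == 1:
        return UNARY[node.label](args[0])
    return BINARY[node.label](args[0], args[1])


def to_prefix(node: Node) -> List[str]:
    """Flatten the tree into prefix (pre-order) tokens."""
    tokens = [str(node.label)]
    for child in node.children:
        tokens.extend(to_prefix(child))
    return tokens


# f(x) = 2 * x^2 + cos(x), written as add(mul(2, mul(x, x)), cos(x)).
x = Node("x")
tree = Node("add", (
    Node("mul", (Node(2.0), Node("mul", (x, x)))),
    Node("cos", (x,)),
))

print(evaluate(tree, {"x": 0.5}))
print(to_prefix(tree))
```

Search-based methods such as GP would mutate or recombine subtrees of this kind, while sequence models would operate on the flattened token list; both of those steps are outside the scope of this sketch.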
Summary of SR Datasets

Data sets (\(\mathcal{D}\)) are categorized into two main groups:

Synthetic data for which the analytical form of the underlying model is known and used to generate data points.
Example: \(f(x) = 2x^2 + \cos(x) \rightarrow \mathcal{D}=\{(x_i,f(x_i))\}_{i=1}^{n}\) with \(x_i \in [0,1]\)

Real-world data for which underlying models are unknown.
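As a rough sketch of how a synthetic dataset of the first kind can be produced, the snippet below samples points from the example \(f(x) = 2x^2 + \cos(x)\) on \([0,1]\). The uniform sampling, the sample size `n=100`, the fixed seed, and the absence of noise are all assumptions made for illustration; the helper name `make_dataset` is hypothetical.

```python
import math
import random
from typing import List, Tuple


def f(x: float) -> float:
    """Ground-truth model from the example: f(x) = 2x^2 + cos(x)."""
    return 2 * x**2 + math.cos(x)


def make_dataset(n: int, seed: int = 0) -> List[Tuple[float, float]]:
    """Return n noise-free pairs (x_i, f(x_i)) with x_i drawn from [0, 1].

    Uniform sampling and a fixed seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, f(x)) for x in xs]


dataset = make_dataset(n=100)
print(dataset[:3])
```

Because the generating expression is known, an SR method run on `dataset` can be judged by whether it recovers \(2x^2 + \cos(x)\), which is not possible for real-world data.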


| Type | Category | Underlying model | Benchmarks | Number of equations |
|---|---|---|---|---|
| Synthetic data | Physics | Physics equations | Revised AIFeynman | 120 |
| | | Physics equations | AIFeynman | 120 |
| | | Ordinary differential equations | Strogatz | 10 |
| | Mathematics | Monomials, polynomials, trigonometric, exponential, logarithm, power law, etc. | Koza | 3 |
| | | | Keijzer | 15 |
| | | | Vladislavleva | 8 |
| | | | Nguyen | 12 |
| | | | Korns | 15 |
| | | | R | 3 |
| | | | Jin | 6 |
| | | | Livermore | 22 |
| Real-world data | Economy, climate, commerce, etc. | NA | Penn Machine Learning Benchmarks | 419 |