# A Living Review of Symbolic Regression
Symbolic regression (SR) is a rapidly growing subfield of machine learning (ML) that aims to learn the analytical form of the model underlying data by searching the space of mathematical expressions. Interest in SR is growing because it is naturally interpretable: it learns a transparent relationship between inputs and outputs that supports reasoning, in contrast to the black-box models learned by neural networks (NNs), where the input-output relationship is opaque and intractable.
This document provides a categorized list of state-of-the-art methods, datasets, and applications of SR, accompanying the review entitled Interpretable Scientific Discovery with Symbolic Regression: A Review, which gives a technical and unified treatment of SR research. The goal is to list all relevant research works, so this list will continue to evolve.
The purpose of this note is to maintain a comprehensive database of materials on SR (research papers, open-source code, datasets, useful online resources, educational materials, etc.) for those developing and applying these approaches in different research areas.
## Reviews and Benchmarks
## Summary of SR Methods
| Category | Brief description | Underlying methods | Learned model |
|---|---|---|---|
| Regression-based | Presumes a fixed model structure. The linear approach defines the model as a linear combination of non-linear basis functions and reduces SR to a system of linear equations; the non-linear approach relies on a multi-layer perceptron (MLP). In both cases, the parameters are learned. | Linear SR (sparse regression); non-linear SR | System of linear equations; multi-layer perceptron (MLP) |
| Expression tree-based | Treats mathematical equations as unary-binary trees whose internal nodes are mathematical operators (algebraic operators, analytical functions) and whose terminal nodes are constants and state variables. | Genetic programming (GP); reinforcement learning (RL); transformer neural network (TNN) | Tree structure; policy; seq2seq models |
| Physics-inspired | Takes into account the units of measurement of physical variables (dimensional analysis) to constrain the search space. | Deep learning; polynomial fit; brute-force search | Neural network parameters; polynomial coefficients |
| Mathematics-inspired | Uses the Meijer G-functions as a general parameterized function class. | General mathematical function | Parameters of the Meijer G-functions |
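To make the linear, regression-based approach concrete, here is a minimal sketch (not taken from any particular SR library): choose a library of candidate basis functions, fit their coefficients by least squares, then hard-threshold small coefficients to obtain a sparse, readable expression. The choice of basis functions and the threshold value are illustrative assumptions.

```python
import numpy as np

# Synthetic data from a known model: f(x) = 2x^2 + cos(x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 2 * x**2 + np.cos(x)

# Library of candidate basis functions (a modeling assumption)
library = {
    "1": np.ones_like(x),
    "x": x,
    "x^2": x**2,
    "cos(x)": np.cos(x),
    "exp(x)": np.exp(x),
}
Theta = np.column_stack(list(library.values()))

# Least-squares fit, then hard-threshold small coefficients (sparsity)
coef, *_ = np.linalg.lstsq(Theta, y, rcond=None)
coef[np.abs(coef) < 1e-6] = 0.0

# Assemble the surviving terms into a readable expression
model = " + ".join(f"{c:.3f}*{name}" for name, c in zip(library, coef) if c != 0)
print(model)
```

Because the true model lies exactly in the span of the library, least squares recovers the coefficients 2 and 1 for `x^2` and `cos(x)` and drives the others to (numerically) zero; with noisy data, a sparsity-promoting regularizer such as the lasso would typically replace the hard threshold.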
## Summary of SR Datasets
Datasets (\(\mathcal{D}\)) are categorized into two main groups:

- **Synthetic data**, for which the analytical form of the underlying model is known and used to generate data points.
  Example: \(f(x) = 2x^2 + \cos(x) \rightarrow \mathcal{D}=(x_i,f(x_i))_{i=1}^{n}\) for \(x \in [0,1]\)
- **Real-world data**, for which the underlying models are unknown.
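For instance, the synthetic example above can be tabulated in a few lines (the grid size n = 100 and the evenly spaced sampling are arbitrary choices for illustration):

```python
import numpy as np

# Known ground-truth model: f(x) = 2x^2 + cos(x)
def f(x):
    return 2 * x**2 + np.cos(x)

# Sample n points in [0, 1] and tabulate (x_i, f(x_i)) pairs
n = 100
x = np.linspace(0.0, 1.0, n)
dataset = np.column_stack((x, f(x)))  # shape (n, 2)
```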
| Type | Category | Underlying model | Benchmark | Number of equations |
|---|---|---|---|---|
| Synthetic data | Physics | Physics equations | Revised AIFeynman | 120 |
| | | Physics equations | AIFeynman | 120 |
| | | Ordinary differential equations | Strogatz | 10 |
| | Mathematics | Monomials, polynomials, trigonometric, exponential, logarithmic, power-law, etc. | Koza | 3 |
| | | | Keijzer | 15 |
| | | | Vladislavleva | 8 |
| | | | Nguyen | 12 |
| | | | Korns | 15 |
| | | | R | 3 |
| | | | Jin | 6 |
| | | | Livermore | 22 |
| Real-world data | Economy, climate, commerce, etc. | NA | Penn Machine Learning Benchmarks | 419 |