# A Living Review of Symbolic Regression
Symbolic regression (SR) is a rapidly growing subfield of machine learning (ML) that aims to learn the analytical form of the model underlying data by searching the space of mathematical expressions. Interest in SR is growing because it is naturally interpretable: it learns a transparent relationship between inputs and outputs that supports reasoning, in contrast to the black-box models learned by neural networks (NNs), where the input-output relationship is opaque and intractable.
This document provides a categorized list of state-of-the-art methods, datasets, and applications of SR, accompanying the review entitled Interpretable Scientific Discovery with Symbolic Regression: A Review, which gives a technical and unified treatment of SR research. The goal is to list all relevant research works, so this list will continue to evolve.
The purpose of this note is to maintain a comprehensive database of materials on SR (research papers, open-source code, datasets, useful online resources, educational materials, etc.) for those developing and applying these approaches in different research areas.
## Reviews and Benchmarks
## Summary of SR Methods
| Category | Brief description | Underlying methods | Learned model |
|---|---|---|---|
| Regression-based | Presumes a fixed model structure. The linear approach defines the model as a linear combination of non-linear basis functions and reduces SR to a system of linear equations; the non-linear approach relies on a multi-layer perceptron (MLP). In both cases, the parameters are learned. | Linear SR (sparse regression); non-linear SR | System of linear equations; multi-layer perceptron (MLP) |
| Expression tree-based | Treats mathematical equations as unary-binary trees whose internal nodes are mathematical operators (algebraic operators, analytical functions) and whose terminal nodes are constants and state variables. | Genetic programming (GP); reinforcement learning (RL); transformer neural network (TNN) | Tree structure; policy; seq2seq models |
| Physics-inspired | Takes into account the units of measurement of physical variables (dimensional analysis) to constrain the search space. | Deep learning; polynomial fit; brute-force search | Neural network parameters; polynomial coefficients |
| Mathematics-inspired | Uses the Meijer G-functions as a general parameterized function class. | General mathematical function | Parameters of the Meijer G-functions |
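To make the linear, regression-based approach concrete, here is a minimal sketch (not taken from any particular SR library): choose a library of candidate basis functions, fit their coefficients by least squares, then hard-threshold small coefficients to obtain a sparse, readable expression. The choice of basis functions and the threshold value are illustrative assumptions.

```python
import numpy as np

# Synthetic data from a known model: f(x) = 2x^2 + cos(x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 2 * x**2 + np.cos(x)

# Library of candidate basis functions (a modeling assumption)
library = {
    "1": np.ones_like(x),
    "x": x,
    "x^2": x**2,
    "cos(x)": np.cos(x),
    "exp(x)": np.exp(x),
}
Theta = np.column_stack(list(library.values()))

# Least-squares fit, then hard-threshold small coefficients (sparsity)
coef, *_ = np.linalg.lstsq(Theta, y, rcond=None)
coef[np.abs(coef) < 1e-6] = 0.0

# Assemble the surviving terms into a readable expression
model = " + ".join(f"{c:.3f}*{name}" for name, c in zip(library, coef) if c != 0)
print(model)
```

Because the true model lies exactly in the span of the library, least squares recovers the coefficients 2 and 1 for `x^2` and `cos(x)` and drives the others to (numerically) zero; with noisy data, a sparsity-promoting regularizer such as the lasso would typically replace the hard threshold.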
## Summary of SR Datasets
Datasets (\(\mathcal{D}\)) are categorized into two main groups:

- **Synthetic data**, for which the analytical form of the underlying model is known and used to generate data points.
  Example: \(f(x) = 2x^2 + \cos(x) \rightarrow \mathcal{D}=(x_i,f(x_i))_{i=1}^{n}\) for \(x \in [0,1]\)
- **Real-world data**, for which the underlying models are unknown.
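For instance, the synthetic example above can be tabulated in a few lines (the grid size n = 100 and the evenly spaced sampling are arbitrary choices for illustration):

```python
import numpy as np

# Known ground-truth model: f(x) = 2x^2 + cos(x)
def f(x):
    return 2 * x**2 + np.cos(x)

# Sample n points in [0, 1] and tabulate (x_i, f(x_i)) pairs
n = 100
x = np.linspace(0.0, 1.0, n)
dataset = np.column_stack((x, f(x)))  # shape (n, 2)
```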
| Type | Category | Underlying model | Benchmark | Number of equations |
|---|---|---|---|---|
| Synthetic data | Physics | Physics equations | Revised AIFeynman | 120 |
| | | Physics equations | AIFeynman | 120 |
| | | Ordinary differential equations | Strogatz | 10 |
| | Mathematics | Monomials, polynomials, trigonometric, exponential, logarithmic, power-law, etc. | Koza | 3 |
| | | | Keijzer | 15 |
| | | | Vladislavleva | 8 |
| | | | Nguyen | 12 |
| | | | Korns | 15 |
| | | | R | 3 |
| | | | Jin | 6 |
| | | | Livermore | 22 |
| Real-world data | Economy, climate, commerce, etc. | NA | Penn Machine Learning Benchmarks | 419 |