Can neural networks approximate any function given enough hidden neurons?

Machine Learning, Neural Network

Machine Learning Problem Overview


I understand that neural networks with any number of hidden layers can approximate nonlinear functions. However, can they approximate:

f(x) = x^2

I can't think of how they could. It seems like a very obvious limitation that could restrict what neural networks can do. For example, because of it, neural networks probably can't properly approximate many functions used in statistics, like the exponential moving average, or even the variance.

Speaking of moving averages, can recurrent neural networks properly approximate one? I understand how a feedforward neural network, or even a single linear neuron, can output a moving average using the sliding-window technique, but how would a recurrent neural network do it without X hidden layers (X being the size of the moving average)?

Also, let us assume we don't know the original function f, which happens to take the average of the last 500 inputs and output a 1 if it's higher than 3 and a 0 if it's not. For a second, pretend we don't know that; to us it's a black box.

How would a recurrent neural network approximate that? We would first need to know how many timesteps it should have, which we don't. Perhaps an LSTM network could, but even then, what if it's not a simple moving average but an exponential moving average? I don't think even an LSTM can do it.

Even worse, what if the f(x,x1) we are trying to learn is simply

f(x,x1) = x * x1

That seems very simple and straightforward. Can a neural network learn it? I don't see how.

Am I missing something huge here or are machine learning algorithms extremely limited? Are there other learning techniques besides neural networks that can actually do any of this?

Machine Learning Solutions


Solution 1 - Machine Learning

The key point to understand is compactness:

Neural networks (like any other approximation structure, e.g. polynomials, splines, or radial basis functions) can approximate any continuous function only within a compact set.

In other words the theory states that, given:

  1. A continuous function f(x),
  2. A finite range for the input x, [a,b], and
  3. A desired approximation accuracy ε>0,

then there exists a neural network that approximates f(x) with an approximation error less than ε, everywhere within [a,b].

Regarding your example of f(x) = x^2: yes, you can approximate it with a neural network within any finite range: [-1,1], [0, 1000], etc. To visualise this, imagine that you approximate f(x) within [-1,1] with a step function. Can you do it on paper? Note that if you make the steps narrow enough, you can achieve any desired accuracy. The way neural networks approximate f(x) is not much different from this.
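
To make the step-function picture concrete, here is a minimal numpy sketch (the choice of step widths and the dense evaluation grid are mine, not part of the answer): it approximates x^2 on [-1, 1] by a piecewise-constant function and shows the worst-case error shrinking as the steps get narrower.

```python
import numpy as np

def step_approximation_error(n_steps):
    """Worst-case error of approximating f(x) = x^2 on [-1, 1] with
    n_steps constant pieces, each taking the value of f at its midpoint."""
    edges = np.linspace(-1.0, 1.0, n_steps + 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    x = np.linspace(-1.0, 1.0, 10_000)
    piece = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_steps - 1)
    return np.max(np.abs(midpoints[piece] ** 2 - x ** 2))

for n in (10, 100, 1000):
    print(n, step_approximation_error(n))
# the maximum error keeps shrinking as the steps get narrower,
# so any desired accuracy can be reached on the compact interval [-1, 1]
```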

But again, there is no neural network (or any other approximation structure) with a finite number of parameters that can approximate f(x) = x^2 for all x in (-∞, +∞).

Solution 2 - Machine Learning

The question is very legitimate, and unfortunately many of the answers show how little practitioners seem to know about the theory of neural networks. The only rigorous theorem about the ability of neural networks to approximate different kinds of functions is the Universal Approximation Theorem.

The UAT states that any continuous function on a compact domain can be approximated by a neural network with only one hidden layer, provided the activation functions used are BOUNDED, continuous, and monotonically increasing. Now, the output of such a network is a finite weighted sum of bounded functions, which is bounded by definition.

A polynomial is not bounded, so the best we can do is to provide a neural-network approximation of that polynomial over a compact subset of R^n. Outside of this compact subset, the approximation will fail miserably, as the polynomial grows without bound. In other words, the neural network will work well on the training set but will not generalize outside it!
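
A quick experiment illustrating this (the architecture, the tanh activation, and the test points are arbitrary choices for the sketch, not something the theorem prescribes): a one-hidden-layer network fits x^2 well inside [-1, 1] but flattens out far away from it, because a finite weighted sum of tanh units is bounded.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(2000, 1))
y_train = x_train[:, 0] ** 2

# one hidden layer of bounded (tanh) units with a linear output layer
net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(x_train, y_train)

for x in (0.5, 1.0, 3.0, 10.0):
    print(x, x ** 2, net.predict([[x]])[0])
# inside [-1, 1] the prediction tracks x^2 closely; at x = 3 or x = 10
# it saturates, because the network output is a bounded function of x
```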

The question is neither off-topic nor does it represent the OP's opinion.

Solution 3 - Machine Learning

I am not sure why there is such a visceral reaction; I think it is a legitimate question whose answer is hard to find by googling, even though the result is widely appreciated and repeated out loud. I think in this case you are looking for actual citations showing that a neural net can approximate any function. This recent paper explains it nicely, in my opinion. The authors also cite the original paper by Barron from 1993, which proved a less general result. The conclusion: a two-layer neural network can represent any bounded-degree polynomial, under certain (seemingly non-restrictive) conditions.

Just in case the link does not work, it is called "Learning Polynomials with Neural Networks" by Andoni et al., 2014.

Solution 4 - Machine Learning

> I understand that neural networks with any number of hidden layers can approximate nonlinear functions. However, can they approximate:
>
> f(x) = x^2

The only way I can make sense of that question is that you're talking about extrapolation. So, e.g., given training samples in the range -1 < x < +1, can a neural network learn the right values for x > 100? Is that what you mean?

If you had prior knowledge that the functions you're trying to approximate are likely to be low-order polynomials (or functions from any other restricted set), then you could surely build a neural network that can represent these functions and extrapolate x^2 everywhere.

If you don't have prior knowledge, things are a bit more difficult: There are infinitely many smooth functions that fit x^2 in the range -1..+1 perfectly, and there's no good reason why we would expect x^2 to give better predictions than any other function. In other words: If we had no prior knowledge about the function we're trying to learn, why would we want to learn x -> x^2? In the realm of artificial training sets, x^2 might be a likely function, but in the real world, it probably isn't.

To give an example: Let's say the temperature on Monday (t=0) is 0°, on Tuesday it's 1°, on Wednesday it's 4°. We have no reason to believe temperatures behave like low-order polynomials, so we wouldn't want to infer from that data that the temperature next Monday will probably be around 49°.
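
To make the arithmetic explicit, here is a tiny sketch (using numpy.polyfit and a degree-2 fit purely for illustration): a quadratic fitted to those three measurements extrapolates to about 49° for the next Monday (t = 7), which is precisely the kind of prediction we have no reason to trust.

```python
import numpy as np

days = np.array([0, 1, 2])         # Monday, Tuesday, Wednesday
temps = np.array([0.0, 1.0, 4.0])  # observed temperatures in degrees

coeffs = np.polyfit(days, temps, deg=2)  # the exact fit here is t^2
print(np.polyval(coeffs, 7))             # ~49.0 for next Monday (t = 7)
```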

> Also, let us assume we don't know the original function f, which happens to take the average of the last 500 inputs and output a 1 if it's higher than 3 and a 0 if it's not. For a second, pretend we don't know that; to us it's a black box.
>
> How would a recurrent neural network approximate that?

I think that's two questions. First, can a neural network represent that function? That is, is there a set of weights that would give exactly that behavior? It obviously depends on the network architecture, but I think we can come up with architectures that can represent (or at least closely approximate) this kind of function.
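
For the "is there a set of weights" half, here is a hand-wired sketch (the window length 500 and the threshold 3 come from the question; the single linear neuron with a hard threshold is just one way to represent the function): all weights set to 1/500 and a bias of -3 reproduce the black box exactly on each window of inputs.

```python
import numpy as np

WINDOW = 500      # number of past inputs to average (from the question)
THRESHOLD = 3.0   # output 1 if the average exceeds this (from the question)

weights = np.full(WINDOW, 1.0 / WINDOW)  # one linear neuron: equal weights
bias = -THRESHOLD                        # decision boundary at an average of 3

def black_box_neuron(last_inputs):
    """Return 1 if the mean of the last WINDOW inputs exceeds THRESHOLD."""
    return 1 if np.dot(weights, last_inputs) + bias > 0 else 0

# usage: feed any 500 consecutive inputs
inputs = np.random.default_rng(0).uniform(0, 6, size=WINDOW)
print(black_box_neuron(inputs), inputs.mean() > THRESHOLD)
```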

Second, can it learn this function, given enough training samples? Well, if your learning algorithm doesn't get stuck in a local minimum, sure: with enough training samples, any set of weights that doesn't approximate your function gives a training error greater than 0, while a set of weights that fits the function you're trying to learn has a training error of 0. So if you find a global optimum, the network must fit the function.

Solution 5 - Machine Learning

A network can learn x |-> x * x if it has a neuron that calculates x * x, or, more generally, a node that calculates x**p and learns p. These aren't commonly used, but the statement that "no neural network can learn..." is too strong.
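
As a toy illustration of that idea (the "power unit" and the plain gradient-descent loop below are assumptions of the sketch, not a standard layer): a single node computing x**p can recover p = 2 from samples of x^2, at least for positive inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=1000)  # keep x > 0 so x**p is well defined
y = x ** 2                            # target function: f(x) = x^2

p = 1.0    # initial guess for the exponent
lr = 0.05  # learning rate
for _ in range(500):
    pred = x ** p
    # gradient of 0.5 * (pred - y)^2 with respect to p is (pred - y) * pred * ln(x)
    grad = np.mean((pred - y) * pred * np.log(x))
    p -= lr * grad

print(p)  # converges to roughly 2.0
```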

A network with ReLUs and a linear output layer can learn x|->2*x, even on an unbounded range of x values. The error will be unbounded, but the proportional error will be bounded. Any function learnt by such a network is piecewise linear, and in particular asymptotically linear.

However, there is a risk with ReLUs: once a ReLU is off for all training examples, it stops learning. With a large domain, it may turn on for some possible test examples and give an erroneous result. So ReLUs are only a good choice if the test cases are likely to lie within the convex hull of the training set. This is easier to guarantee if the dimensionality is low. One workaround is to prefer LeakyReLU.

One other issue: how many neurons do you need to achieve the approximation you want? Each ReLU or LeakyReLU implements a single change of gradient, so the number needed depends on the maximum absolute value of the second derivative of the objective function and on the maximum error to be tolerated.
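
A rough numerical check of that rule of thumb (using x^2 on [-1, 1] and plain linear interpolation as a stand-in for a trained ReLU network, both assumptions of the sketch): for |f''| <= M, a piecewise-linear fit with segments of width h has worst-case error about M*h^2/8, so tighter tolerances demand more gradient changes, i.e. more ReLUs.

```python
import numpy as np

def piecewise_linear_max_error(n_segments):
    """Worst-case error of interpolating f(x) = x^2 on [-1, 1] with
    n_segments linear pieces (roughly one ReLU kink per interior knot)."""
    knots = np.linspace(-1.0, 1.0, n_segments + 1)
    x = np.linspace(-1.0, 1.0, 100_000)
    return np.max(np.abs(np.interp(x, knots, knots ** 2) - x ** 2))

for n in (4, 8, 16, 32):
    h = 2.0 / n
    print(n, piecewise_linear_max_error(n), 2 * h ** 2 / 8)  # M = |f''| = 2
# the measured error matches the M*h^2/8 bound, shrinking like 1/n^2
```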

Solution 6 - Machine Learning

There are theoretical limitations of neural networks. No neural network can ever learn the function f(x) = x*x, nor can it learn an infinite number of other functions, unless you assume the impractical:

  1. an infinite number of training examples,
  2. an infinite number of units, and
  3. an infinite amount of time to converge.

NNs are good at learning low-level pattern-recognition problems (signals that, in the end, have some statistical pattern that can be represented by some "continuous" function!), but that's it! No more!

Here's a hint:
Try to build an NN that takes n+1 data inputs (x0, x1, x2, ..., xn) and returns true (or 1) if (2 * x0) appears somewhere in the rest of the sequence. And good luck. Infinite functions, especially those that are recursive, cannot be learned. They just are!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Essam Al-Mansouri (View Question on Stackoverflow)
Solution 1 - Machine Learning: Panagiotis Panagi (View Answer on Stackoverflow)
Solution 2 - Machine Learning: Tarek Nassar (View Answer on Stackoverflow)
Solution 3 - Machine Learning: Martha White (View Answer on Stackoverflow)
Solution 4 - Machine Learning: Niki (View Answer on Stackoverflow)
Solution 5 - Machine Learning: chrishmorris (View Answer on Stackoverflow)
Solution 6 - Machine Learning: Walid Saba (View Answer on Stackoverflow)