This post originated from an RSS feed registered with Java Buzz
by Tim Vernum.
Original Post: Polynomial Trendlines
Feed Title: A word used to describe something
Feed URL: http://blog.adjective.org/feed/category/Development/rss2
Feed Description: Tim's random musings on software development, java, jetty and other such frivolities.
I recently had to generate a polynomial trendline for some data I had.
Microsoft Excel can do it (as can some other applications) but I needed to do
it "by hand" (actually I did it in Java...)
Partly for my own benefit, and partly for the benefit of the
googleDuck Duck Go, I'm
going to write down how I did it.
So, let's start at the beginning.
Why a trendline
I had plotted a line graph of the data I had. What I needed was a curve that
approximated the line. There are 2 purposes for that
To demonstrate what the overall pattern of the data is - is the line going
up or down? Does it cycle or is it moving in 1 consistent direction?
To make (loose) predictions about what will happen in the future. If the
line is going down, when does it hit zero? If it's going up, at what rate is it
moving?
What I needed
What we need then, is a formula f(x) that produces a y
value. So that, for any given x (horizontal) value, we can generate a
y value. Where the x value is in the range of the data we
already have, we want the f(x) to produce a y value that is
close to the data we have (i.e. we minimises the error).
A polynomial trendline
For a polynomial trendline, we want to find a formula in the form:
a{0} + a{1}x + a{2}x^2 + ... + a{n-1}x^n-1 + a{n}x^n = y (Where
a{n} is the nth coefficient and x^n is x
raised to the power n)
So, a polynomial of order 2 would be: a + (b * x) + (c * x * x) =
y
We need to find out the "best" values for a, b, and
c, so that we plug in a value for x and determine an
approximation for y.
Finding the coefficients
For a polynomial of order n, we have n+1 coefficients to
work out.
Because we have n+1 unknowns, we need to use n+1 different
points from our data in order to find a formula.
So, to find the formula for an order 2 polynomial, we need 3 points from our
data to help us find a, b, and c. Let's do a simple
example by hand, and then we can look at how to solve the general case.
Let's assume our three points are (0,0), (1,2),
(2,1). We use those values for x and y to
produce 3 simultaneous equations:
a + b * 0 + c * 0 = 0
a + b * 1 + c * 1 = 2
a + b * 2 + c * 4 = 1
Simplifying, we get
a = 0
a + b + c = 2
a + 2b + 4c = 1
Which means
a = 0
b = 2 - c
b = (1 - 4c)/2
Thus
2 - c = (1-4c)/2
4 - 2c = 1 - 4c
2c = -3
c = -3/2
And
b = 2 - -3/2
b = 7/2
So
a=0
b=7/2
c=-3/2
And our final formula is (7x)/2 - (3x^2)/2 = y
So, we can predict that if x = 3, then y = (7*3)/2 -
(3*3*3)/2 = 21/2 - 27/2 = -3
The general case
The general case for an order n polynomial, with n+1
unknowns, and n+1 sample points, is a series of n+1
simultaneous equations with n+1 coefficients.
We can solve that using Cramer's Rule.
In Java, I used JAMA
to implement Cramer's rule, which is actually quite easy.
For a set of data points (x{0}, y{0}) , (x{1}, y{1})
, (x{2}, y{2}) , ... , (x{n}, y{n}), the coefficient
matrix is
And for any given x, we can find y by calculating
a{0}x^0 + a{1}x^1 + a{2}x^2 + ... + a{n}x^n
Which data points to use
We can pick any set of data points, and that will give us an
equation, but we want the best fit. So how do we calculate that?
In my case, I simply picked random points.
I repeated that 25000 times, and then chose the equation that had the lowest
error (using the sum of squares).
What order polynomial
In theory the higher the order the better the fit - that is, the
error will be smaller. (This is true, because if a lower order equation would
fit better, then we will simply find that a{n} is 0 when
we solve our simultaneous equations).
But we're not simply looking for a better fit - we're also looking for
something that visually demonstrates the trend of the data and makes helpful
predictions about future data. That means that a lower order equation (which
actually has a higher error) is more useful to us.
In my case, after plotting curves up to order 5, I determined that the
equation for order 2 better suited my needs - so I could probably have saved
myself some trouble with Cramer's rule and implemented the simpler case.