8000 Hw03 · jcchurch/PythonAndR Wiki · GitHub
jcchurch edited this page Jan 5, 2012 · 4 revisions
8000

Homework 3

Kernel Density Estimation of Time Series Data


Save this data set to a file named 'goals.dat'. I found this dataset on this site. It records the goals scored by England against Scotland at Hampden Park, Glasgow, from 1872 to 1987. The first column is the number of goals and the second column is the year. There are gaps in the years (when the teams did not play), and there are years for which no score was recorded, marked "NA" (these represent actual gaps in the data).

0,1872
1,1874
0,1876
2,1878
4,1880
1,1882
0,1884
1,1886
5,1888
1,1890
4,1892
2,1894
1,1896
3,1898
1,1900
1,1904
1,1906
1,1908
0,1910
1,1912
1,1914
NA,1916
NA,1918
0,1921
2,1923
0,1925
2,1927
0,1929
0,1931
1,1933
0,1935
1,1937
2,1939
NA,1941
NA,1943
NA,1945
2,1948
1,1950
2,1952
4,1954
1,1956
4,1958
1,1960
0,1962
0,1964
4,1966
1,1968
0,1970
1,1972
0,1974
1,1976
1,1978
2,1980
1,1982
1,1984
0,1985
0,1987
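
The file layout described above (goals first, then year, with "NA" where no score was recorded) could be read along these lines. This is only a minimal sketch: the function name `parse_goals` and the four-row sample are illustrative and not part of the assignment, and it assumes every non-blank line has exactly two comma-separated fields.

```python
def parse_goals(lines):
    """Return (years, goals) lists, skipping blank lines and 'NA' scores."""
    years, goals = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        score, year = line.split(",")
        if score == "NA":
            continue  # no score was recorded for this year
        years.append(int(year))
        goals.append(int(score))
    return years, goals


# Four rows copied from the data set above, including one "NA" record.
sample = ["0,1872", "1,1874", "NA,1916", "0,1921"]
years, goals = parse_goals(sample)
print(years)  # [1872, 1874, 1921]
print(goals)  # [0, 1, 0]
```

To read the real file, the same function could be called as `parse_goals(open('goals.dat'))`.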

Write a Python script to compute the kernel density estimate of this dataset using a Gaussian kernel with bandwidths of 1, 3, and 10. Which bandwidth do you feel produces the best results? Be sure to skip records in the file that have "NA" for the score in that year.
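
As a rough illustration of how the bandwidth changes the result, the sketch below evaluates a Gaussian kernel sum at a single year for each of the three bandwidths. It assumes each recorded year contributes a Gaussian bump scaled by that year's goal count, which is how the instructor's solution further down treats the data; note that this is a goal-weighted smoothing and not a normalized probability density. The name `kernel_estimate` and the choice of the first six rows as sample data are illustrative only.

```python
import math


def gaussian(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_estimate(t, years, goals, h):
    """Goal-weighted sum of Gaussian bumps with bandwidth h, evaluated at t."""
    return sum(g * gaussian((t - x) / h) / h for x, g in zip(years, goals))


# First six rows of goals.dat.
years = [1872, 1874, 1876, 1878, 1880, 1882]
goals = [0, 1, 0, 2, 4, 1]

# Evaluate at 1880 (a 4-goal year) for each bandwidth in the assignment.
for h in (1, 3, 10):
    print(h, kernel_estimate(1880, years, goals, float(h)))
```

A small bandwidth keeps the estimate tightly peaked around individual high-scoring years, while a large bandwidth spreads each year's contribution over its neighbors and flattens the curve.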

Your program should generate one plot containing two lines:

  1. A plot of the goals. The x-axis represents the year and the y-axis represents England's score.
  2. The kernel density estimation line of the data with your preferred bandwidth parameter.

Use "plt.ylabel" to label the y-axis as England's score and "plt.xlabel" to label the x-axis as the year.
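
The plotting requirements above might be met with something like the following sketch. The function name `plot_goals_with_estimate`, its parameters, and the optional `outfile` argument are hypothetical; only `plt.plot`, `plt.xlabel`, `plt.ylabel`, `plt.legend`, `plt.savefig`, and `plt.show` are actual matplotlib calls. It assumes the smoothed curve has already been computed on a grid of years.

```python
def plot_goals_with_estimate(years, goals, grid, estimate, outfile=None):
    """Draw the raw scores and a smoothed curve on one set of axes.

    years, goals   -- the recorded data points
    grid, estimate -- x positions and values of the kernel estimate
    outfile        -- hypothetical option: save to this path instead of showing
    """
    # Imported inside the function so the module loads without matplotlib.
    import matplotlib.pyplot as plt

    plt.plot(years, goals, label="Goals")
    plt.plot(grid, estimate, label="Kernel estimate")
    plt.xlabel("Year")
    plt.ylabel("England's score")
    plt.legend()
    if outfile:
        plt.savefig(outfile)
    else:
        plt.show()
```

Calling `plt.plot` twice before `plt.show` is what places both lines in the same figure; the legend is an optional extra that makes the two lines easier to tell apart.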

This assignment is due on January 5th at noon.

The Instructor's Solution

import math
import matplotlib.pyplot as plt

SQRT_2PI = math.sqrt(2.0 * math.pi)

def gaussian(x):
    return math.exp(-0.5*x*x)/SQRT_2PI

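def function_kde(x, y, h, binpoints):
    # Contribution of one observation (year x, goals y) to every grid point:
    # a Gaussian bump centered at x with bandwidth h, scaled by y.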
    bins = [0] * len(binpoints)
    for i in range(len(binpoints)):
        bins[i] = y * gaussian( (binpoints[i] - x) / float(h) ) / float(h)

    return bins

if __name__=='__main__':

    x = []
    y = []

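    # Read the data, skipping blank lines and records with an "NA" score
    with open('goals.dat') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            goals, year = line.split(",")
            if goals != "NA":
                x.append( float(year) )
                y.append( float(goals) )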

    n = len(x)
    low = min(x)
    high = max(x)
    nbins = n * 1  # number of grid points; raise the multiplier for a finer grid

    binpoints = [0] * nbins
    masterbin = [0] * nbins

    delta = (high - low) / (nbins - 1)

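    # Compute each grid point directly. Accumulating b += delta in a loop can
    # drift past `high` (index error) or stop short (last point left at 0).
    binpoints = [low + i * delta for i in range(nbins)]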

    for i in range(n):
        bins = function_kde(x[i], y[i], 3, binpoints)

        for j in range(nbins):
            masterbin[j] += bins[j]

    # Plot the kernel estimate and the raw scores on the same axes
    plt.plot(binpoints, masterbin)
    plt.plot(x, y)
    plt.xlabel("Year")
    plt.ylabel("England's score")
    plt.show()