8000 GitHub - Omairss/ds5500-hw3: Repository for Homework 3
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Omairss/ds5500-hw3

Repository files navigation

Problem 1

Following is the district-level fiscal data from 2015-16.

LEAID CENSUSID FIPST CONUM CSA CBSA NAME STNAME STABBR SCHLEV ... FL_61V FL_66V FL_W01 FL_W31 FL_W61 FL_V95 FL_V02 FL_K14 FL_CE1 FL_CE2
0 100002 N 1 01073 142 13820 Alabama Youth Services Alabama AL N ... M M M M M M M M M M
1 100005 01504840100000 1 01095 290 10700 Albertville City Alabama AL 03 ... M M R R R M M M M M
2 100006 01504800100000 1 01095 290 10700 Marshall County Alabama AL 03 ... M M R R R M M M M M
3 100007 01503740100000 1 01073 142 13820 Hoover City Alabama AL 03 ... M M R R R M M M M M
4 100008 01504530100000 1 01089 290 26620 Madison City Alabama AL 03 ... M M R R R M M M M M
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18703 5680250 51502040100000 56 56039 N 27220 Region V BOCES Wyoming WY 07 ... N N N N N N N N N N
18704 5680251 N 56 56021 N 16940 Wyoming Department of Family Services Wyoming WY N ... N N N N N N N N N N
18705 5680252 N 56 56005 N 23940 Youth Emergency Services Inc. - Administration... Wyoming WY N ... N N N N N N N N N N
18706 5680253 N 56 56025 N 16220 Wyoming Behavioral Institute - Administration ... Wyoming WY N ... N N N N N N N N N N
18707 5680254 N 56 56013 N 40180 Saint Stephen's Indian School Admin Office Wyoming WY 03 ... N N N N N N N N N N

18708 rows × 260 columns

It seems like the federal funding (revenue) follows a power log distribution, with a few schools recieving most of the federal fundings.

png

Revenue
STNAME
California 7709274795
Texas 6194316950
New York 3374793578
Florida 3147328988
Illinois 2334944853
Pennsylvania 2037314939
Ohio 1837962930
Georgia 1815241986
Michigan 1731033937
North Carolina 1587975932
Arizona 1302009922
New Jersey 1249740990
Louisiana 1115618894
Washington 1098331976
Tennessee 1096181996
Virginia 1058145818
Indiana 1015475961
Missouri 959977972
Kentucky 880295974
South Carolina 860866990
Maryland 823598998
Massachusetts 804594987
Alabama 803906916
Wisconsin 782646932
Colorado 721718867
Oklahoma 703224894
Mississippi 690723968
Minnesota 685054951
Arkansas 606945973
Oregon 582559990
New Mexico 516288985
Connecticut 484185984
Iowa 464851996
Kansas 453921959
Utah 419641976
Nevada 405789000
West Virginia 360283000
Nebraska 346825973
Alaska 307320000
Hawaii 261131000
Idaho 248545991
District of Columbia 226201985
Montana 220339869
South Dakota 196643958
Rhode Island 188203996
Maine 186522987
New Hampshire 169165780
North Dakota 155452961
Delaware 144706995
Wyoming 123011974
Vermont 111890912
<matplotlib.axes._subplots.AxesSubplot at 0x7f5915655588>

png

png

Problem 2

LEAID CENSUSID FIPST CONUM CSA CBSA NAME STNAME STABBR SCHLEV ... FL_61V FL_66V FL_W01 FL_W31 FL_W61 FL_V95 FL_V02 FL_K14 FL_CE1 FL_CE2
0 100002 N 1 01073 142 13820 Alabama Youth Services Alabama AL N ... M M M M M M M M M M
1 100005 01504840100000 1 01095 290 10700 Albertville City Alabama AL 03 ... M M R R R M M M M M
2 100006 01504800100000 1 01095 290 10700 Marshall County Alabama AL 03 ... M M R R R M M M M M
3 100007 01503740100000 1 01073 142 13820 Hoover City Alabama AL 03 ... M M R R R M M M M M
4 100008 01504530100000 1 01089 290 26620 Madison City Alabama AL 03 ... M M R R R M M M M M
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18703 5680250 51502040100000 56 56039 N 27220 Region V BOCES Wyoming WY 07 ... N N N N N N N N N N
18704 5680251 N 56 56021 N 16940 Wyoming Department of Family Services Wyoming WY N ... N N N N N N N N N N
18705 5680252 N 56 56005 N 23940 Youth Emergency Services Inc. - Administration... Wyoming WY N ... N N N N N N N N N N
18706 5680253 N 56 56025 N 16220 Wyoming Behavioral Institute - Administration ... Wyoming WY N ... N N N N N N N N N N
18707 5680254 N 56 56013 N 40180 Saint Stephen's Indian School Admin Office Wyoming WY 03 ... N N N N N N N N N N

18708 rows × 260 columns

png

There is a clean linear relationships between revenue and expenditures, as expected.

png

The following are the unique values from the table. Most non-numeric values are of the form x-y GEx or LEx.

In order to convert this to a numeric value the following steps were performed.

  1. Range values: Mean of the range was taken
  2. GEx: Mean of x and 100 was taken
  3. LEx: Mean of x and 0 was taken (Essentially x/2)
  4. Null/Missing: Replaced by a positive normal distribution with the mean and sigma of the original distribution.

Finally, KL Divergence of the Original distribution vs the imputed distribution was compared to KL Divergence of the Original distribution and random normal distribution.

array(['42', '39', '66', '72', '37', '43', '69', '21', '62', '47', '34',
       '60', '23', '54', '48', '46', '50', '11', '19', '35', '51', '13',
       '26', '32', '22', '27', '41', '49', '44', '31', '38', '67', '53',
       '33', '20', '36', '14', '63', '40', '56', '45', '52', '10', '17',
       '29', '28', '87', '59', '58', '78', '10-14', '30', '25-29',
       '30-34', '60-64', '30-39', '61', '45-49', '68', '35-39', '11-19',
       'LT50', 'PS', '65-69', '80-89', '85-89', '55-59', 'LE10', '55',
       '2', '15-19', '40-44', '50-54', 'LE20', '16', 'LE5', '20-29',
       '21-39', '70-74', '15', 'GE90', '86', '76', '40-49', '24', '75-79',
       '74', '25', '50-59', '70-79', '6-9', '85', '7', '40-59', '60-79',
       '80-84', '20-24', '71', '65', '8', '64', '82', 'GE50', '80', '70',
       '77', '88', '83', '93', '81', '57', '75', '84', '92', '73', 'GE95',
       '18', '12', '6', '9', '3', '90', 'LE1', '4', 'GE80', '5', '60-69',
       '79', '89', '98', '91', '95', '90-94', '94', 'GE99', '96', '97'],
      dtype=object)
def unblur(x):
    
    if '-' in x:
        return (int(x.split('-')[0]) + int(x.split('-')[1])) / 2
    
    if 'GE' in x:
         return ((int(x.replace('GE', '')) + 100)/2)
    
    if 'LE' in x:
         return ((int(x.replace('LE', '')) + 0)/2)
        
fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur).unique()
array([ nan, 12. , 27. , 32. , 62. , 34.5, 47. , 37. , 15. , 67. , 84.5,
       87. , 57. ,  5. , 17. , 42. , 52. , 10. ,  2.5, 24.5, 30. , 72. ,
       95. , 44.5, 77. , 54.5, 74.5,  7.5, 49.5, 69.5, 82. , 22. , 75. ,
       97.5, 64.5, 92. , 90. ,  0.5])
fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur)
fiscal_math_df['Unblurred_Scores'].plot.hist(bins = 100)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5927be9a58>

png

The distribution looks like a normal distribution that flattens out at 0. This makes sense since a lower score than 0 is not possible.

As such, we use a normal distribution to estimate the distribution of the original scores. We use KL Divergence to figure out goodness of fit

fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur)
unblurred_mean = fiscal_math_df['Unblurred_Scores'].mean()
unblurred_sigma = fiscal_math_df['Unblurred_Scores'].std()

fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['Unblurred_Scores']\
    .fillna(pd.Series([abs(np.random.normal(unblurred_mean,unblurred_sigma)) for i in range(len(fiscal_math_df))]))
fiscal_math_df['Unblurred_Scores'].plot.hist(bins = 100)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5927b8e518>

png

def kl_divergence(p, q):
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))
kl_divergence(fiscal_math_df['Unblurred_Scores'].sample(4955),
              list(fiscal_math_df[fiscal_math_df['ALL_MTH00PCTPROF_1516'].str.isnumeric()]\
                   ['ALL_MTH00PCTPROF_1516'].astype(int)))
45565.525218373106
kl_divergence(np.random.uniform(0,100,4955),
              list(fiscal_math_df[fiscal_math_df['ALL_MTH00PCTPROF_1516'].str.isnumeric()]\
                   ['ALL_MTH00PCTPROF_1516'].astype(int)))
97121.56023652945

KL divergence as we can see is about twice as good as a random distribution when compared to the original unmodified distribution and should be a decent method of imputation.

Problem 4

The amount of money to be cut is

fiscal_df['TFEDREV'].sum() * .15
8340410870.7
to_cut = fiscal_df['TFEDREV'].sum() * .15
tocut_df = fiscal_df[['LEAID','TOTALEXP','V33']]
tocut_df = tocut_df[(tocut_df['TOTALEXP'] > 0) & (tocut_df['V33'] > 0)]
tocut_df['percapita_spending'] = tocut_df['TOTALEXP']/tocut_df['V33']
for i, v in tocut_df.sort_values('percapita_spending', ascending = False).iterrows():
    
    if to_cut - v['TOTALEXP'] * .20 > 0:
        
        tocut_df.loc[i, 'to_cut'] = v['TOTALEXP'] * .20
        to_cut -= (v['TOTALEXP'] * .20)
    else:
        tocut_df.loc[i, 'to_cut'] = 0
tocut_df['to_cut'].plot.hist(bins = 100)
<matplotlib.axes._subplots.AxesSubplot at 0x7f58fb3e4240>

Problem 5

In order to effectively cut the budget in a way that's least damaging to the schools alraedy struggling, the method I used is as follows:

  1. Take the sum total of the amount to be cut.
  2. Sort the districts by the funding recieved in decreasing order.
  3. Remove 20% from the top n, till the budget cut is satisfied.

This is a methiod that saves 15,000 districts from budget cuts while only marginally increasing the cuts for the best performing schools and is in my opinion the best way to cut the budget with minimal damage. png

About

Repository for Homework 3

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published
0