GitHub - Omairss/ds5500-hw3: Repository for Homework 3

Problem 1

Following is the district-level fiscal data from 2015-16.

	LEAID	CENSUSID	FIPST	CONUM	CSA	CBSA	NAME	STNAME	STABBR	SCHLEV	...	FL_61V	FL_66V	FL_W01	FL_W31	FL_W61	FL_V95	FL_V02	FL_K14	FL_CE1	FL_CE2
0	100002	N	1	01073	142	13820	Alabama Youth Services	Alabama	AL	N	...	M	M	M	M	M	M	M	M	M	M
1	100005	01504840100000	1	01095	290	10700	Albertville City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
2	100006	01504800100000	1	01095	290	10700	Marshall County	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
3	100007	01503740100000	1	01073	142	13820	Hoover City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
4	100008	01504530100000	1	01089	290	26620	Madison City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
18703	5680250	51502040100000	56	56039	N	27220	Region V BOCES	Wyoming	WY	07	...	N	N	N	N	N	N	N	N	N	N
18704	5680251	N	56	56021	N	16940	Wyoming Department of Family Services	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18705	5680252	N	56	56005	N	23940	Youth Emergency Services Inc. - Administration...	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18706	5680253	N	56	56025	N	16220	Wyoming Behavioral Institute - Administration ...	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18707	5680254	N	56	56013	N	40180	Saint Stephen's Indian School Admin Office	Wyoming	WY	03	...	N	N	N	N	N	N	N	N	N	N

18708 rows × 260 columns

It seems like the federal funding (revenue) follows a power log distribution, with a few schools recieving most of the federal fundings.

	Revenue
STNAME
California	7709274795
Texas	6194316950
New York	3374793578
Florida	3147328988
Illinois	2334944853
Pennsylvania	2037314939
Ohio	1837962930
Georgia	1815241986
Michigan	1731033937
North Carolina	1587975932
Arizona	1302009922
New Jersey	1249740990
Louisiana	1115618894
Washington	1098331976
Tennessee	1096181996
Virginia	1058145818
Indiana	1015475961
Missouri	959977972
Kentucky	880295974
South Carolina	860866990
Maryland	823598998
Massachusetts	804594987
Alabama	803906916
Wisconsin	782646932
Colorado	721718867
Oklahoma	703224894
Mississippi	690723968
Minnesota	685054951
Arkansas	606945973
Oregon	582559990
New Mexico	516288985
Connecticut	484185984
Iowa	464851996
Kansas	453921959
Utah	419641976
Nevada	405789000
West Virginia	360283000
Nebraska	346825973
Alaska	307320000
Hawaii	261131000
Idaho	248545991
District of Columbia	226201985
Montana	220339869
South Dakota	196643958
Rhode Island	188203996
Maine	186522987
New Hampshire	169165780
North Dakota	155452961
Delaware	144706995
Wyoming	123011974
Vermont	111890912

<matplotlib.axes._subplots.AxesSubplot at 0x7f5915655588>

Problem 2

	LEAID	CENSUSID	FIPST	CONUM	CSA	CBSA	NAME	STNAME	STABBR	SCHLEV	...	FL_61V	FL_66V	FL_W01	FL_W31	FL_W61	FL_V95	FL_V02	FL_K14	FL_CE1	FL_CE2
0	100002	N	1	01073	142	13820	Alabama Youth Services	Alabama	AL	N	...	M	M	M	M	M	M	M	M	M	M
1	100005	01504840100000	1	01095	290	10700	Albertville City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
2	100006	01504800100000	1	01095	290	10700	Marshall County	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
3	100007	01503740100000	1	01073	142	13820	Hoover City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
4	100008	01504530100000	1	01089	290	26620	Madison City	Alabama	AL	03	...	M	M	R	R	R	M	M	M	M	M
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
18703	5680250	51502040100000	56	56039	N	27220	Region V BOCES	Wyoming	WY	07	...	N	N	N	N	N	N	N	N	N	N
18704	5680251	N	56	56021	N	16940	Wyoming Department of Family Services	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18705	5680252	N	56	56005	N	23940	Youth Emergency Services Inc. - Administration...	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18706	5680253	N	56	56025	N	16220	Wyoming Behavioral Institute - Administration ...	Wyoming	WY	N	...	N	N	N	N	N	N	N	N	N	N
18707	5680254	N	56	56013	N	40180	Saint Stephen's Indian School Admin Office	Wyoming	WY	03	...	N	N	N	N	N	N	N	N	N	N

18708 rows × 260 columns

There is a clean linear relationships between revenue and expenditures, as expected.

The following are the unique values from the table. Most non-numeric values are of the form x-y GEx or LEx.

In order to convert this to a numeric value the following steps were performed.

Range values: Mean of the range was taken
GEx: Mean of x and 100 was taken
LEx: Mean of x and 0 was taken (Essentially x/2)
Null/Missing: Replaced by a positive normal distribution with the mean and sigma of the original distribution.

Finally, KL Divergence of the Original distribution vs the imputed distribution was compared to KL Divergence of the Original distribution and random normal distribution.

array(['42', '39', '66', '72', '37', '43', '69', '21', '62', '47', '34',
       '60', '23', '54', '48', '46', '50', '11', '19', '35', '51', '13',
       '26', '32', '22', '27', '41', '49', '44', '31', '38', '67', '53',
       '33', '20', '36', '14', '63', '40', '56', '45', '52', '10', '17',
       '29', '28', '87', '59', '58', '78', '10-14', '30', '25-29',
       '30-34', '60-64', '30-39', '61', '45-49', '68', '35-39', '11-19',
       'LT50', 'PS', '65-69', '80-89', '85-89', '55-59', 'LE10', '55',
       '2', '15-19', '40-44', '50-54', 'LE20', '16', 'LE5', '20-29',
       '21-39', '70-74', '15', 'GE90', '86', '76', '40-49', '24', '75-79',
       '74', '25', '50-59', '70-79', '6-9', '85', '7', '40-59', '60-79',
       '80-84', '20-24', '71', '65', '8', '64', '82', 'GE50', '80', '70',
       '77', '88', '83', '93', '81', '57', '75', '84', '92', '73', 'GE95',
       '18', '12', '6', '9', '3', '90', 'LE1', '4', 'GE80', '5', '60-69',
       '79', '89', '98', '91', '95', '90-94', '94', 'GE99', '96', '97'],
      dtype=object)

def unblur(x):
    
    if '-' in x:
        return (int(x.split('-')[0]) + int(x.split('-')[1])) / 2
    
    if 'GE' in x:
         return ((int(x.replace('GE', '')) + 100)/2)
    
    if 'LE' in x:
         return ((int(x.replace('LE', '')) + 0)/2)
        
fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur).unique()

array([ nan, 12. , 27. , 32. , 62. , 34.5, 47. , 37. , 15. , 67. , 84.5,
       87. , 57. ,  5. , 17. , 42. , 52. , 10. ,  2.5, 24.5, 30. , 72. ,
       95. , 44.5, 77. , 54.5, 74.5,  7.5, 49.5, 69.5, 82. , 22. , 75. ,
       97.5, 64.5, 92. , 90. ,  0.5])

fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur)
fiscal_math_df['Unblurred_Scores'].plot.hist(bins = 100)

<matplotlib.axes._subplots.AxesSubplot at 0x7f5927be9a58>

The distribution looks like a normal distribution that flattens out at 0. This makes sense since a lower score than 0 is not possible.

As such, we use a normal distribution to estimate the distribution of the original scores. We use KL Divergence to figure out goodness of fit

fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['ALL_MTH00PCTPROF_1516'].apply(unblur)
unblurred_mean = fiscal_math_df['Unblurred_Scores'].mean()
unblurred_sigma = fiscal_math_df['Unblurred_Scores'].std()

fiscal_math_df['Unblurred_Scores'] = fiscal_math_df['Unblurred_Scores']\
    .fillna(pd.Series([abs(np.random.normal(unblurred_mean,unblurred_sigma)) for i in range(len(fiscal_math_df))]))

fiscal_math_df['Unblurred_Scores'].plot.hist(bins = 100)

<matplotlib.axes._subplots.AxesSubplot at 0x7f5927b8e518>

def kl_divergence(p, q):
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))

kl_divergence(fiscal_math_df['Unblurred_Scores'].sample(4955),
              list(fiscal_math_df[fiscal_math_df['ALL_MTH00PCTPROF_1516'].str.isnumeric()]\
                   ['ALL_MTH00PCTPROF_1516'].astype(int)))

45565.525218373106

kl_divergence(np.random.uniform(0,100,4955),
              list(fiscal_math_df[fiscal_math_df['ALL_MTH00PCTPROF_1516'].str.isnumeric()]\
                   ['ALL_MTH00PCTPROF_1516'].astype(int)))

97121.56023652945

KL divergence as we can see is about twice as good as a random distribution when compared to the original unmodified distribution and should be a decent method of imputation.

Problem 4

The amount of money to be cut is

fiscal_df['TFEDREV'].sum() * .15

8340410870.7

to_cut = fiscal_df['TFEDREV'].sum() * .15
tocut_df = fiscal_df[['LEAID','TOTALEXP','V33']]
tocut_df = tocut_df[(tocut_df['TOTALEXP'] > 0) & (tocut_df['V33'] > 0)]
tocut_df['percapita_spending'] = tocut_df['TOTALEXP']/tocut_df['V33']

for i, v in tocut_df.sort_values('percapita_spending', ascending = False).iterrows():
    
    if to_cut - v['TOTALEXP'] * .20 > 0:
        
        tocut_df.loc[i, 'to_cut'] = v['TOTALEXP'] * .20
        to_cut -= (v['TOTALEXP'] * .20)
    else:
        tocut_df.loc[i, 'to_cut'] = 0

tocut_df['to_cut'].plot.hist(bins = 100)

<matplotlib.axes._subplots.AxesSubplot at 0x7f58fb3e4240>

Problem 5

In order to effectively cut the budget in a way that's least damaging to the schools alraedy struggling, the method I used is as follows:

Take the sum total of the amount to be cut.
Sort the districts by the funding recieved in decreasing order.
Remove 20% from the top n, till the budget cut is satisfied.

This is a methiod that saves 15,000 districts from budget cuts while only marginally increasing the cuts for the best performing schools and is in my opinion the best way to cut the budget with minimal damage.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
data		data
.gitignore		.gitignore
README.ipynb		README.ipynb
README.md		README.md
output_13_1.png		output_13_1.png
output_15_1.png		output_15_1.png
output_19_1.png		output_19_1.png
output_22_1.png		output_22_1.png
output_32_1.png		output_32_1.png
output_5_1.png		output_5_1.png
output_7_1.png		output_7_1.png
output_9_1.png		output_9_1.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Problem 1

It seems like the federal funding (revenue) follows a power log distribution, with a few schools recieving most of the federal fundings.

Problem 2

There is a clean linear relationships between revenue and expenditures, as expected.

The distribution looks like a normal distribution that flattens out at 0. This makes sense since a lower score than 0 is not possible.

As such, we use a normal distribution to estimate the distribution of the original scores. We use KL Divergence to figure out goodness of fit

KL divergence as we can see is about twice as good as a random distribution when compared to the original unmodified distribution and should be a decent method of imputation.

Problem 4

The amount of money to be cut is

Problem 5

About

Uh oh!

Releases

Packages

Omairss/ds5500-hw3

Folders and files

Latest commit

History

Repository files navigation

Problem 1

It seems like the federal funding (revenue) follows a power log distribution, with a few schools recieving most of the federal fundings.

Problem 2

There is a clean linear relationships between revenue and expenditures, as expected.

The distribution looks like a normal distribution that flattens out at 0. This makes sense since a lower score than 0 is not possible.

As such, we use a normal distribution to estimate the distribution of the original scores. We use KL Divergence to figure out goodness of fit

KL divergence as we can see is about twice as good as a random distribution when compared to the original unmodified distribution and should be a decent method of imputation.

Problem 4

The amount of money to be cut is

Problem 5

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages