Thursday, May 11, 2017

Assignment 6

Part I


Part I evaluates a statement made by a news organization and uses the provided data to estimate the crime rate for an area given its percentage of free lunches.

A local news organization went on the record saying that as the number of kids who get free lunches increases, so does the crime in an area. The data the news organization used for this claim was analyzed to see how much water the statement holds. A regression analysis was run to determine the significance of the relationship between the two variables (Figure 1 and Figure 2). The results of these tests are below. The results from Figure 2 were then used to estimate the crime rate per 100,000 people for an area where 23.5% of children receive free lunches.



Figure 1: R squared regression results.




Figure 2: Regression results for percent free lunch and crime rate.


Y = a + bx

Y= Crime rate
a = Constant
b = PercentFreeLunch B Coefficient
x = Percent Free Lunch For an area

Crime Rate = 21.819 + 1.685(23.5)

           = 61.4165 per 100,000 people
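The substitution above can be reproduced with a few lines of Python. This is only a minimal sketch: the constant and coefficient are the values reported in Figure 2, while the function name predict_crime_rate is a hypothetical helper and not part of the SPSS output.

```python
# Values read from the regression output in Figure 2.
CONSTANT = 21.819              # a: the constant (intercept)
B_PERCENT_FREE_LUNCH = 1.685   # b: PercentFreeLunch B coefficient


def predict_crime_rate(percent_free_lunch: float) -> float:
    """Return the predicted crime rate per 100,000 people using Y = a + bx.

    predict_crime_rate is a hypothetical helper name used for illustration.
    """
    return CONSTANT + B_PERCENT_FREE_LUNCH * percent_free_lunch


# Predicted crime rate for an area with 23.5% free lunches.
print(round(predict_crime_rate(23.5), 4))  # 61.4165
```

With an r-squared as low as the one reported for this model, any single prediction like this should be read as a rough estimate.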

Based on the regression results, the news outlet can claim that as the number of children with free lunches increases, so does the crime rate; however, this is a very shaky claim. The significance value of the results is p = .05, which sits right at the conventional cutoff, and the beta is .416, which indicates only a modest positive linear relationship between the two variables. On top of this, the r-squared is .173, which means that percent free lunch explains only 17.3% of the variance in crime rate. A number this low indicates that the model is very poor at explaining the variance in crime rate. Taken together, these factors make it a poor choice to state that as the percent of free lunches increases, so does the crime rate.


Part II


Key words for Part II:

Regression Analysis: A statistical tool used to investigate the relationship between a dependent variable and one or more independent variables
Dependent Variable: The variable that is explained by the other variables
Independent Variables: The variables used to explain the dependent variable
Linear Relationship: A relationship in which the dependent variable changes at a constant rate as an independent variable changes; in this report it is treated as present when the regression is significant
R-Squared: Or coefficient of determination, the proportion of the variance in the dependent variable that is explained by the independent variables included in the model
Residual: The difference between the observed value and the value predicted by the model
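The key words above can be tied together in a short Python sketch of a one-variable regression. The data values and the function names (fit_simple_ols, residuals, r_squared) are hypothetical and chosen only for illustration; they are not the Portland data or the SPSS routines used in this report.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(x, y, intercept, slope):
    """Residual = observed value minus the value predicted by the line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def r_squared(y, resid):
    """Proportion of the variance in y explained by the model."""
    y_bar = mean(y)
    ss_res = sum(e ** 2 for e in resid)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only.
x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable

a, b = fit_simple_ols(x, y)
resid = residuals(x, y, a, b)
print(a, b, r_squared(y, resid))
```

Here x plays the role of the independent variable and y the dependent variable; the residuals are what the later residual maps display per census tract.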


Introduction


Part II is dedicated to analyzing the factors that influence 911 calls throughout Portland, Oregon. The City of Portland is interested in assessing its efficiency in responding to 911 calls. To accomplish this, the city wants to see which factors help explain where the most calls come from. A company is also interested in building a new hospital, and this report uses the factors listed below to suggest potential locations for building an ER.

The factors are:
911 Calls (Dependent Variable)
Jobs
Renters
Low Education (People with no high school degree)
Alcohol Sales (AlcoholX)
Unemployed
Foreign Born population
Median Income
College Grads

Methods


To determine results, the first step is to use regression analysis on a selection of variables to look for significant linear relationships between the independent variables and 911 calls. The three variables chosen were jobs, renters, and low education. They were chosen because a quick stepwise regression analysis identified them as the three significant variables. The stepwise process is described later in this report. Figures 1, 2, and 3 below show the results.

The next step is to make two maps, one being a choropleth map of the number of calls per census tract, and the other being a standardized residual map for renters per census tract in Portland. Renters was the variable chosen because it has the highest r-squared value of the three variables analyzed. The residual map is created in ArcMap using the Ordinary Least Squares (OLS) tool, which outputs a shapefile of the renter residuals.
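To make the idea of a standardized residual map more concrete, the sketch below standardizes a list of residuals and assigns each one to a standard deviation class. Several things here are assumptions: the residual values are hypothetical, standardize and residual_class are illustrative names, the class breaks are the half-standard-deviation breaks commonly seen on this kind of map, and dividing by the sample standard deviation is a simplification that may differ from what the ArcMap OLS tool computes internally.

```python
from statistics import stdev

# Assumed class breaks (in standard deviations) and their labels.
CLASS_BREAKS = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
CLASS_LABELS = [
    "< -2.5", "-2.5 to -1.5", "-1.5 to -0.5", "-0.5 to 0.5",
    "0.5 to 1.5", "1.5 to 2.5", "> 2.5",
]


def standardize(resid):
    """Divide each residual by the sample standard deviation (simplified)."""
    sd = stdev(resid)
    return [e / sd for e in resid]


def residual_class(z):
    """Return the label of the class that a standardized residual falls in."""
    for upper, label in zip(CLASS_BREAKS, CLASS_LABELS):
        if z < upper:
            return label
    return CLASS_LABELS[-1]


# Hypothetical residuals (observed minus predicted 911 calls) for five tracts.
hypothetical_residuals = [-12.0, -3.0, 0.5, 4.5, 10.0]

for z in standardize(hypothetical_residuals):
    print(round(z, 2), residual_class(z))
```

Positive classes correspond to tracts where the observed calls exceed the prediction, and negative classes to tracts where the model predicts more calls than were observed.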

The third step of the report is to run a multiple regression. This process uses all of the independent variables listed above. Collinearity diagnostics are run along with the multiple regression to determine whether multicollinearity is present among the independent variables. After this is completed, a stepwise approach is used to help determine which variables are the most important. A final map is created in ArcMap with the OLS tool using the three variables from the stepwise results.

Results


Figure 1 below displays the results of running a linear regression using jobs as the independent variable and 911 calls as the dependent variable. Jobs displays a positive linear relationship with 911 calls. This claim is based on the p value, which is less than .05, and the beta, which is .583. The r-squared is .34, which means that 34% of the variance in 911 calls can be explained by jobs per census tract.


Figure 1: Regression results from jobs against 911 calls.



Figure 2 below displays the results of running a linear regression using low education as the independent variable and 911 calls as the dependent variable. Low education displays a positive linear relationship with 911 calls. This claim is based on the p value, which is less than .05, and the beta, which is .753. The r-squared is .567, which means that 56.7% of the variance in 911 calls can be explained by low education per census tract. This is higher than the share of 911 calls explained by jobs per census tract.



Figure 2: Regression results from running low education against 911 calls.



Figure 3 below displays the results of running a linear regression using renters per census tract as the independent variable and 911 calls as the dependent variable. Renters displays a positive linear relationship with 911 calls. This claim is based on the p value, which is less than .05, and the beta, which is .785. The r-squared is .616, which means that 61.6% of the variance in 911 calls can be explained by renters per census tract. This is higher than the share of 911 calls explained by either jobs or low education, so renters has the highest r-squared of the three variables run in regression models against 911 calls. This result was the basis for the residual map of renters in Figure 5 below.



Figure 3: Regression results from running renters against 911 calls.
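In a regression with a single independent variable, the standardized beta equals the Pearson correlation, so squaring the beta should reproduce the r-squared. The sketch below applies that check to the three values reported for Figures 1 through 3. The helper name is_consistent and the rounding tolerance are assumptions made for illustration.

```python
# (beta, r-squared) pairs as reported in the text for Figures 1 to 3.
reported = {
    "jobs": (0.583, 0.340),
    "low education": (0.753, 0.567),
    "renters": (0.785, 0.616),
}


def is_consistent(beta, r2, tolerance=0.001):
    """Check that beta squared matches r-squared within rounding error.

    is_consistent is a hypothetical helper; the tolerance is an assumption
    that allows for values rounded to three decimals.
    """
    return abs(beta ** 2 - r2) <= tolerance


for name, (beta, r2) in reported.items():
    print(name, round(beta ** 2, 3), r2, is_consistent(beta, r2))
```

This identity only holds for one-variable models; it does not apply to the multiple regression later in the report.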



Figure 4 displays a choropleth map of 911 calls within Portland. The map ranks census tracts by the number of 911 calls they generate. The results show that the tracts within the north central portion of the city receive the largest number of 911 calls per tract. Using just this map, the small tract with 19-56 calls that sits between four of the tracts in the highest class could be the new location for the hospital.



Figure 4: Choropleth map results of 911 calls throughout the census tracts of Portland, OR.



Figure 5 below displays the standardized residuals of the renters regression for each census tract in the city. The darker the red, the more the tract's observed 911 calls exceed the number predicted from renters; the darker the blue, the more the observed calls fall below the prediction.


Figure 5: Standard deviation map of residuals for renters in Portland, OR against 911 calls.




Figure 6 below displays the results of running a multiple regression with all of the variables, along with a collinearity report to determine whether any independent variables are related to one another and thus distort the results. With all of the variables included, only jobs and low education display a significant linear relationship with 911 calls, and both relationships are positive. Even so, the r-squared is .783, which means that 78.3% of the variance in 911 calls is explained by the independent variables within the model.

Viewing the collinearity diagnostics, the condition index is less than 30, and therefore there is no indication of multicollinearity. Even so, the small number of independent variables that display a significant relationship with 911 calls suggests that the remaining variables pull the trend line around without contributing much to explaining 911 calls.





Figure 6: Multiple linear regression and collinearity results using all of the independents against the dependent.




Figure 7 displays the stepwise regression report using all of the independent variables. The three variables that were included in the final model are renters, low education, and jobs. All three display a positive linear relationship with 911 calls. The r-squared of the model with all three included is .771, which means that 77.1% of the variance in 911 calls can be explained by these three independent variables together. This is only 1.2 percentage points lower than the r-squared of the model with all of the variables included (.783), and every variable in the stepwise model is significant. A spatial representation of this report is below in Figure 8.


Figure 7: Stepwise regression results from all of the variables.




Figure 8 below displays the standardized residuals of the three-variable stepwise model for each census tract in the city. The darker the red, the more the observed 911 calls exceed the number predicted by the three variables; the darker the blue, the more the observed calls fall below the prediction. This result helps the hospital group pick a spot. The spot suggested earlier in the report based on Figure 4 is still a good choice based on the three strongest independent variables.


Figure 8: Standard deviation map of the residuals of the stepwise regression results.


Conclusions

The City of Portland now knows that renters, jobs, and low education within its census tracts are the three strongest variables in the models presented for explaining 911 calls. The city can use this information to determine how best to respond to 911 calls, and to help create community programs that address at least jobs and low education. Although multicollinearity was not present, the results suggest that including all of the independent variables adds little: the extra variables raise the r-squared only slightly without being significant. Assignment 6 shows that SPSS and ArcMap can produce results that give users real information to create change within cities across the country.

Tuesday, May 2, 2017

Assignment 5

Goals and Background


This assignment is designed to assist in the understanding of correlation through the use of various software. Students begin with the task of running correlations through IBM SPSS. With the output created from this process, they are then expected to interpret the correlations and determine significance of relationships. The next task is set up to have students download U.S. Census data, use the GEOIDs to join it within ArcMap, and then to run the joined data through Geoda to view Moran's I values and create LISA cluster maps.


Part 1


Part 1's goal is to use census data for Milwaukee, WI to find correlation values between various variables. Correlation is a measure of how strong a relationship exists between two variables. The results vary from -1 to 1, and values closer to either end of the spectrum indicate stronger relationships. A positive relationship means that both variables increase together, and a negative relationship means that one decreases as the other increases. The correlation results do not imply causation, however.
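As an illustration of the measure described above, the sketch below computes a Pearson correlation from scratch. It is not the SPSS procedure used for this assignment; the function name pearson_r and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    spread_x = sqrt(sum((xi - x_bar) ** 2 for xi in x))
    spread_y = sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return cov / (spread_x * spread_y)


# Hypothetical values: both variables rise together, then move in opposition.
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))  # perfect positive relationship
print(pearson_r([1, 2, 3, 4], [40, 30, 20, 10]))  # perfect negative relationship
```

A result near 0 would indicate little or no linear relationship between the two variables.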

The variable codes used in Figure 1 below are:


White  = White Pop. for the Census Tracts in Milwaukee County
Black = Black Pop
Hispanic = Hispanic Pop
MedInc = Median Household Income
Manu = Number of Manufacturing Employees
Retail = Number of Retail Employees
Finance = Number of Finance Employees


Figure 1: Bivariate correlation matrix results from the provided Excel datasheet.


One of the most apparent results of the matrix is the association between the White population and jobs in each industry, and the tendency for other races to not have as many jobs. Aside from the correlation values comparing the white data against other races, the Pearson Correlation value is very strongly positive for the white data on all of the employee data. These results point towards white populations having more jobs in manufacturing, retail, and finance compared to black populations and Hispanic populations. This claim is supported by the Pearson Correlation between each race and median income as well. The white population's correlation with median income is positive (.585), while the black and Hispanic correlations are both negative. This suggests that median income trends upward as the white population of a tract increases, and trends downward as the black or Hispanic population increases. In other words, there is a negative relationship between the black and Hispanic populations and median income.

Part 2


Introduction


The following work is dedicated to a project for the Texas Election Commission (TEC). The goal is to analyze election patterns to determine whether there is spatial autocorrelation in voting patterns and voter turnout in the State of Texas. Spatial autocorrelation is a measure of how alike things are compared to what is around them. When models are run, a value is returned that indicates how strongly similar values in the data set cluster together in space. Unlike correlation, the sign does not describe the direction of a relationship between two variables: a positive value indicates that similar values cluster together, and a negative value indicates that dissimilar values sit next to each other. The study is focused on the voting results for the Presidential Elections of 1980 and 2016. The data obtained covers the percent Democratic vote and the voter turnout for each year. Ultimately, the results of this study will be provided to the Governor to see if election patterns have changed over the course of the 36 years between the two Presidential Elections.

Methodology


The provided data table from the TEC contains several codes, as follows:

  • VTP80 = Voter Turnout 1980
  • VTP16 = Voter Turnout 2016
  • PRES80D = % Democratic Vote 1980
  • PRES16D = % Democratic Vote 2016

The 2015 Hispanic population was also needed to complete the study. It was gathered from the 2015 US Census Bureau ACS 5 Year Estimates. The Texas County shapefile was also obtained from the US Census Bureau.

Once the data was downloaded from the Census Bureau, it was cleaned up by deleting unnecessary columns and fields. After this was completed, it was brought into ArcMap along with the Texas Voting Data from the TEC and the Texas County shapefile. All three of these data sets were joined on the GEOID provided through the US Census Bureau. This joined dataset was then exported into a new shapefile so that it could be used later.
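The join described above was done in ArcMap, but the logic of joining tables on a shared GEOID can be sketched in plain Python. Everything in this example is an assumption made for illustration: the function name join_on_geoid, the PCT_HISP field name, and all of the values are hypothetical, and only records found in every table are kept.

```python
def join_on_geoid(base_rows, *other_tables):
    """Join tables of dict records on their shared "GEOID" key.

    Only rows whose GEOID appears in every table are returned.
    """
    lookups = [{row["GEOID"]: row for row in table} for table in other_tables]
    joined = []
    for row in base_rows:
        merged = dict(row)
        for lookup in lookups:
            match = lookup.get(row["GEOID"])
            if match is None:
                break  # GEOID missing from one table, so skip this row
            merged.update(match)
        else:
            joined.append(merged)
    return joined


# Hypothetical records. GEOIDs are kept as strings so that any leading
# zeros in county codes are preserved.
counties = [{"GEOID": "48001", "NAME": "Anderson"},
            {"GEOID": "48003", "NAME": "Andrews"}]
voting = [{"GEOID": "48001", "VTP80": 50.0, "VTP16": 55.0},
          {"GEOID": "48003", "VTP80": 60.0, "VTP16": 52.0}]
hispanic = [{"GEOID": "48001", "PCT_HISP": 17.0},
            {"GEOID": "48003", "PCT_HISP": 55.0}]

for record in join_on_geoid(counties, voting, hispanic):
    print(record)
```

Keeping GEOID as text in every source table is what makes the join reliable; a numeric GEOID in one table and a text GEOID in another will not match.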


The new shapefile was then brought into Geoda for further analysis. A new project was created, and within it a spatial weights file was developed to assist in determining spatial autocorrelation later. Once this was done, Moran's I plots were created for the variables using the weights. This helped to see if spatial autocorrelation was a factor in the elections. Similar to correlation coefficients, Moran's I returns a value between roughly -1 and 1, with stronger spatial autocorrelation at the extreme ends of the spectrum. Values near 1 indicate that similar values cluster together, values near 0 indicate no spatial pattern, and values near -1 indicate that dissimilar values neighbor each other. After these plots were created, LISA Cluster Maps were created for each data set for further analysis of spatial autocorrelation. These all helped in determining the strength of spatial autocorrelation between counties.
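For readers who want to see what the Moran's I statistic measures, the sketch below computes global Moran's I for a toy example. It is a simplified illustration and not Geoda's implementation: the function name morans_i is hypothetical, the four areas and their values are made up, and the weights are plain binary neighbor weights rather than the weights file built in Geoda, so values will not match Geoda's output.

```python
def morans_i(values, weights):
    """Return global Moran's I for values and a square weights matrix.

    weights[i][j] is the spatial weight between area i and area j
    (here 1 for neighbors and 0 otherwise).
    """
    n = len(values)
    x_bar = sum(values) / n
    dev = [v - x_bar for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    total = sum(d * d for d in dev)
    return (n / s0) * cross / total


# Four hypothetical areas in a row; each area neighbors the next one.
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(morans_i([1, 2, 3, 4], w))  # similar values sit together: positive
print(morans_i([1, 4, 1, 4], w))  # high and low values alternate: negative
```

The first call returns a positive value because neighboring areas have similar values, and the second returns a negative value because every area differs sharply from its neighbors.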

Results



Percent Hispanic


The first data set run through the Geoda Moran's I and LISA Cluster Map tools was the percent Hispanic residents. The Moran's I shows very strong spatial autocorrelation, with a score of .78 (Figure 2). This can be seen in the LISA Cluster Map in Figure 3 as well. The southern border of Texas exhibits High, High values almost all the way across, extending multiple counties inland. This means that these counties have high Hispanic populations and are surrounded by other counties with high Hispanic populations. The opposite end of the spectrum can be seen as well, with the northeast side of the state exhibiting Low, Low values. This means that these counties all have very low Hispanic populations, and are surrounded by other counties with very low Hispanic populations.


Figure 2: Moran's I scatterplot of Hispanic populations in Texas counties.

Figure 3: LISA Cluster Map of Hispanic populations spread across the counties in The State of Texas.



Voter Turnout for 1980 Presidential Election


The next set of data to analyze was the voter turnout for the 1980 Presidential Election. The Moran's I is a medium-strength value of .47 (Figure 4). This means that spatial autocorrelation exists in the data set, but it is not particularly strong. When looking at the LISA Cluster Map, this is visible in just a few areas around the state (Figure 5). The southern and eastern blue portions exhibit Low, Low values, which means that these counties and their surrounding counties all have low voter turnout. The northern part of the state, highlighted in red, exhibits High, High values, which means that these counties and their neighbors all have high voter turnout compared to other areas in the state.



Figure 4: Moran's I scatterplot of voter turnout for the 1980 Presidential Election in Texas counties.





Figure 5: LISA Cluster Map of voter turnout for the 1980 Presidential Election spread across the counties in The State of Texas.



Voter Turnout For the 2016 Presidential Election


The Moran's I for the 2016 Presidential Election voter turnout is lower than the Moran's I for voter turnout in 1980. With a value of .29, the spatial autocorrelation is fairly low for this data set (Figure 6). This can mean that voters and non-voters are more evenly spread out than in the 1980 election. In this election, only the southern tip and a northwestern portion of the state exhibit Low, Low values, and there is only a small area within the center of the state which has High, High values (Figure 7). This finding supports the claim that voters and non-voters are more evenly spread out, as much less of the state is covered by extremes on the LISA map.



Figure 6: Moran's I scatterplot of voter turnout for the 2016 Presidential Election in Texas counties.





Figure 7: LISA Cluster Map of voter turnout for the 2016 Presidential Election spread across the counties in The State of Texas.



Percent Democrat Vote in 1980


The Moran's I value for the Democratic vote in 1980 was relatively strong, with a score of .58 (Figure 8). This means that there is a fair amount of spatial autocorrelation between counties in Texas that voted Democrat. Comparing the LISA Cluster Map (Figure 9) to the voter turnout LISA map for 1980, one can see similarities. The southeastern High, High section of Figure 9 is the same area where there was Low, Low voter turnout in 1980. The same can be said for the Low, Low section of Figure 9, as it is almost the same area as the High, High turnout in 1980.



Figure 8: Moran's I scatterplot of percent democrat vote in 1980 within Texas counties.




Figure 9: LISA Cluster Map of percent democratic vote for the 1980 Presidential Election spread across the counties in The State of Texas.



Percent Democrat Vote in 2016


The Moran's I value of percent Democrat is higher in 2016 than in 1980, with a score of .69 (Figure 10). This indicates a high amount of spatial autocorrelation along party lines in the Presidential Election of 2016. Looking at the LISA map, one can see that there is a larger area of High, High Democrat voters along the southern edge of the state (Figure 11). Looking back at Figure 3, one can see that this area also has High, High Hispanic populations. One could assume that there was a larger Hispanic voter turnout in 2016 than there was in 1980.




Figure 10: Moran's I scatterplot of percent democrat vote in 2016 within Texas counties.




Figure 11: LISA Cluster Map of percent democratic vote for the 2016 Presidential Election spread across the counties in The State of Texas.



Conclusion


Part 1 shows that SPSS is a very robust statistics package that allows users to gather fast, accurate results. Part 2 shows that one can aggregate data from a variety of sources into a single file and find very compelling results within Geoda. There was not very strong evidence for a voter turnout difference between 1980 and 2016; however, there is evidence that there could have been a stronger Hispanic voter turnout in the southwestern portion of the state in 2016.