Multiple Linear Regression

Introduction

The aim of this project was to analyse a dataset using several different software packages. For this project I created my own dataset using Excel. The advantage of using my own data is that I knew the true underlying model, so I could compare the model produced by multiple regression against the model I actually used to generate the data.

The fictional situation was as follows: 1,013 professional basketball players were selected and put through an obstacle course. At 100 different points throughout the course each player would attempt to score by shooting the ball through the hoop, and the number of successful shots was recorded. The data collected also included measurements of a number of different attributes of the players (some related to sport, some not). The model aims to predict how a player will perform in the obstacle course given these attributes.

The ten attributes measured were as follows:

  1. Height (in Feet)
  2. Weight (in Stone)
  3. Salary (in million per annum)
  4. Number of Years playing professionally
  5. Number of professional games
  6. Number of Hours training a week
  7. Number of Cars owned
  8. Number of Children
  9. Amount of money spent on sport supplements per year (in thousands)
  10. Calories consumed per day

Data Analysis in R

The dataset was read into R. The pairs function was used to visualise the data. This gave the following output:
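As a minimal sketch of this step (the CSV file name is an assumption; the data frame name fs matches the one used in the model output further down):

# Read the fictional basketball dataset (file name assumed)
fs <- read.csv("fictional_players.csv")

# Scatterplot matrix of every pair of variables (Figure 1)
pairs(fs)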


Figure 1 – Pairs

There are a number of relationships to note. There looks to be a strong correlation between height and weight, which makes sense as taller people will generally weigh more than shorter people. There is another strong correlation between the number of games played and the number of years playing professionally; again this would be expected. The variable we are trying to predict, Score, shows a strong correlation with salary, so we would expect this factor to have a strong influence in the final model. In order to actually measure the correlations, the “cor” function was used to produce a correlation matrix.
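A sketch of how the table below could be produced, rounding to two decimal places to match the output:

# Correlation matrix rounded to two decimal places
round(cor(fs), 2)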

               Score Height Weight Salary Years Games Hours.per.week  Cars Children Supplements Calories
Score           1.00   0.02   0.01   0.77  0.20  0.20           0.07 -0.02    -0.01        0.13     0.04
Height          0.02   1.00   0.91  -0.01  0.03  0.03           0.00 -0.03     0.00        0.00    -0.02
Weight          0.01   0.91   1.00  -0.01  0.03  0.03          -0.02 -0.01     0.00        0.01    -0.01
Salary          0.77  -0.01  -0.01   1.00  0.00  0.01          -0.03  0.00    -0.02        0.05     0.03
Years           0.20   0.03   0.03   0.00  1.00  0.99          -0.03 -0.02     0.02        0.01    -0.01
Games           0.20   0.03   0.03   0.01  0.99  1.00          -0.03 -0.02     0.02        0.01    -0.01
Hours.per.week  0.07   0.00  -0.02  -0.03 -0.03 -0.03           1.00 -0.02     0.01        0.03    -0.09
Cars           -0.02  -0.03  -0.01   0.00 -0.02 -0.02          -0.02  1.00    -0.04        0.01    -0.04
Children       -0.01   0.00   0.00  -0.02  0.02  0.02           0.01 -0.04     1.00       -0.04     0.03
Supplements     0.13   0.00   0.01   0.05  0.01  0.01           0.03  0.01    -0.04        1.00    -0.05
Calories        0.04  -0.02  -0.01   0.03 -0.01 -0.01          -0.09 -0.04     0.03       -0.05    1.00

As expected, height and weight have a correlation of 0.91, while the correlation between games and years was measured as 0.99. This is known as multicollinearity and can disrupt the model, making it inaccurate. To ensure the most accurate result, both weight and years will be removed from the model.
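A minimal sketch of fitting the reduced model (the object name model is my own; the data frame is assumed to be fs, as in the output below). Weight and Years are simply left out of the formula:

# Multiple regression with Weight and Years removed
model <- lm(Score ~ Height + Salary + Games + Hours.per.week +
              Cars + Children + Supplements + Calories, data = fs)
summary(model)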

The following output was produced:

Call:
lm(formula = Score ~ Height + Salary + Games + Hours.per.week +
    Cars + Children + Supplements + Calories, data = fs)

Residuals:
     Min       1Q   Median       3Q      Max
-28.1619  -6.1410  -0.1649   6.2595  27.6160

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.601e+01  9.175e+00  -1.745   0.0813 .
Height          1.587e+00  1.375e+00   1.154   0.2486
Salary          5.718e+00  1.398e-01  40.903  < 2e-16 ***
Games           4.855e-02  4.545e-03  10.682  < 2e-16 ***
Hours.per.week  2.994e-01  5.511e-02   5.433 6.94e-08 ***
Cars           -8.140e-02  1.968e-01  -0.414   0.6792
Children       -3.091e-04  1.481e-01  -0.002   0.9983
Supplements     7.798e-02  1.568e-02   4.972 7.79e-07 ***
Calories        1.262e-03  5.916e-04   2.133   0.0332 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.067 on 1004 degrees of freedom
Multiple R-squared:  0.6501,   Adjusted R-squared:  0.6473
F-statistic: 233.2 on 8 and 1004 DF,  p-value: < 2.2e-16

 

This shows a multiple R-squared of 0.6501. In multiple linear regression the more reliable value is the adjusted R-squared, here 0.6473. This means our model explains roughly 65% of the variation in the score. The most important factors are Salary, Games and Supplements, each with a p-value significantly less than the cut-off point of 0.05, so it is extremely unlikely these results would arise if those factors had no impact. Calories also had a low p-value of 0.033, meaning results this extreme would occur only about 3.3% of the time if Calories truly had no effect. Height showed a p-value of 0.08, which is above the cut-off point and may be worth further investigation.

Testing Normality of the Residuals

To test the “goodness of fit” of the model, the residuals (the distances between the predicted and actual values) can be examined. The residuals are expected to be normally distributed about the predicted values. There are a number of ways to test or visualise this.
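The two diagnostic plots below (Figures 2 and 3) could be produced along the following lines; this is a sketch rather than the exact code used, and it assumes the fitted model object is called model as above:

# Standardised residuals against fitted values (Figure 2)
std.res <- residuals(model) / sd(residuals(model))
plot(fitted(model), std.res, xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = c(-2, 2), lty = 2)   # roughly 95% of points expected between these lines

# Normal quantile-quantile plot of the residuals (Figure 3)
qqnorm(residuals(model))
qqline(residuals(model))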


Figure 2 – First Normality Test

For Figure 2 the residuals are standardised by dividing each residual by the standard deviation of all the residuals. You would expect to see about 95% of the standardised residuals within 2 standard deviations of zero. Figure 2 looks reasonable and shows homoscedasticity.


Figure 3- QQplot

Figure 3 is known as a quantile-quantile or QQ plot. To produce this plot all residuals are sorted from lowest to highest. The black line represents the position each residual would take if the residuals were perfectly normally distributed: out of roughly 1,000 residuals, the lowest would be expected to lie around 3 standard deviations below the mean, the median close to the mean, and the highest around 3 standard deviations above it. The closer the residuals are to this line, the closer they are to a normal distribution. Figure 3 shows a very good fit and thus we are satisfied the residuals are normally distributed.

Rapid Miner

This model can also be built using RapidMiner, a software package that allows data analytics through a GUI. It offers an easy-to-use interface for people who might not be as comfortable coding.


Figure 4 – RapidMiner Design

Figure 4 shows how the data was fed into an operator that created the linear model.


Figure 5 – RapidMiner Results

Figure 5 and Figure 6 show the results found by RapidMiner. These results match those found in R, with the only differences being due to rounding.


Figure 6 – Squared Correlation

Excel

The Data Analysis add-in for Excel can also carry out a regression analysis. Excel gave the following output.


Figure 7 – Excel Output

 

Further Investigation

When creating the dataset I used a function to create random variables that were normally distributed with a selected mean and standard deviation. I also added an error term to account for other factors not listed. Weight and games played were linear functions of height and years respectively, with normally distributed errors added.

To generate a score I assigned a coefficient to each attribute. Each coefficient was multiplied by the corresponding attribute value and the results were summed, along with the error value. This raw score was then min-max scaled onto values between 1 and 100 (y = (x - min)/(max - min), rescaled to that range).
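As an illustration only, the generation process can be sketched in R. The means, standard deviations and coefficients below are placeholders except where stated in the text (height: mean 6.2, sd 0.2, coefficient 200; supplements: mean 30, sd 20); the actual values used in Excel differed, and only a few attributes are shown:

set.seed(1)
n <- 1013

# Attributes drawn from normal distributions
Height      <- rnorm(n, mean = 6.2, sd = 0.2)   # parameters from the text
Supplements <- rnorm(n, mean = 30,  sd = 20)    # parameters from the text
Salary      <- rnorm(n, mean = 5,   sd = 2)     # placeholder parameters
Years       <- rnorm(n, mean = 8,   sd = 4)     # placeholder parameters

# Weight and Games are linear functions of Height and Years plus noise
Weight <- 14 * Height - 70 + rnorm(n, sd = 1)   # placeholder relationship
Games  <- 60 * Years + rnorm(n, sd = 10)        # placeholder relationship

# Raw score: weighted sum of attributes plus an error term
# (only the Height coefficient, 200, is taken from the text)
error <- rnorm(n, sd = 500)
raw   <- 200 * Height + 500 * Salary + 2 * Games + 10 * Supplements + error

# Min-max scale the raw score onto the 1-100 range
Score <- 1 + 99 * (raw - min(raw)) / (max(raw) - min(raw))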

 

I made an estimate of how much variation could be attributed to each attribute in my model. Multiplying each coefficient by the corresponding mean of that attribute gave an “importance” value. For example, the mean height (6.2 ft) is multiplied by the selected coefficient for height (200), so on average Height contributes 6.2 × 200 = 1,240 to the raw score; the importance value for height is therefore 1,240. The importance value of an attribute can then be expressed as a percentage of the sum of all importance values. The following table gives the importance percentage I found for each attribute (a sketch of this calculation in R is shown after the table):

 

Attribute     Importance %
Height        22%
Salary        44%
Games         14%
Hours         13%
Cars           0%
Children       0%
Supplements    4%
Calories       2%
Weight         0%
Years          0%
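A sketch of the importance calculation, using hypothetical coefficient and mean values except where stated in the text:

coef.vals  <- c(Height = 200, Salary = 500)   # generating coefficients (only Height's 200 is from the text)
attr.means <- c(Height = 6.2, Salary = 5)     # attribute means (only Height's 6.2 is from the text)

importance     <- coef.vals * attr.means
importance.pct <- 100 * importance / sum(importance)
round(importance.pct)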

 

I expected height to be given more significance in the model and supplements and calories to be given less. Height was given a mean of 6.2 and a standard deviation of 0.2, while supplements had a mean of 30 and a standard deviation of 20. This meant supplements had a much larger range and variation, so the change in score attributable to money spent on supplements is easier to detect. Height varied a lot less, and the changes attributable to it are lost among the other factors, including the random error that was added.

To investigate further I increased my sample size from 1,013 to 102,823. This helps overcome the problem of noise or random error. This sample set gave the following output:

Call:
lm(formula = Score ~ Height + Salary + Games + Hours.per.week +
    Cars + Children + Supplements + Calories, data = fs2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.753  -4.773  -0.033   4.766  32.060

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -5.777e+00  7.212e-01  -8.010 1.16e-15 ***
Height          1.800e+00  1.100e-01  16.363  < 2e-16 ***
Salary          4.712e+00  1.110e-02 424.671  < 2e-16 ***
Games           3.776e-02  3.631e-04 103.990  < 2e-16 ***
Hours.per.week  2.737e-01  4.417e-03  61.965  < 2e-16 ***
Cars            2.339e-02  1.470e-02   1.591    0.112
Children       -1.672e-02  1.132e-02  -1.477    0.140
Supplements     6.499e-02  1.171e-03  55.522  < 2e-16 ***
Calories        2.606e-04  4.417e-05   5.900 3.65e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.081 on 102813 degrees of freedom
Multiple R-squared:  0.6587,   Adjusted R-squared:  0.6586
F-statistic: 2.48e+04 on 8 and 102813 DF,  p-value: < 2.2e-16

 

It can be seen that the p-values of all relevant attributes have dropped to well below the cut-off point of 0.05. The t-value for height was 1.15 in the 1,000-sample set but has increased to 16.363 with 100,000 samples. In the 1,000-sample set the t-value for height was lower than that for Calories, but it has now overtaken it in the 100,000-sample set. I would expect it to eventually overtake Supplements with a large enough sample.

This illustrates the impact of good data when trying to build the most accurate model. The less your data varies, the more important it is to have a larger dataset.

Movie Performance


The aim of this project was to investigate how a movie performs over the first three weeks after release. How a movie performs depends on many factors, including which actors are starring, how popular the franchise is, competition at the time of release, and so on. The two main factors investigated in this project are critical reception and movie budget. Critical reception is measured using the score the movie received on RottenTomatoes.com. This website looks at the ratings a film receives from a number of sources and gives the percentage of positive scores, so a film with a score of 73% means that 73% of the critics scored the film positively. The other factor investigated was the budget of the film. Ideally the marketing budget would have been investigated but this information was not available, so for this analysis the assumption is that the production budget is proportional to the marketing budget.

The sample investigated was the highest grossing movies of 2015 (http://www.the-numbers.com/market/2015/top-grossing-movies). It should be noted that the majority of the data came from the-numbers.com; where the-numbers.com did not have the required data, boxofficemojo.com was used. The data from these two sites were generally consistent; however, there were some differences, especially in the production budget.

Analysis

Budget and Performance

This analysis was carried out in RStudio after the data was collected and stored in an Excel file. A linear model was created for each relationship and the summary of the linear model was examined. The first relationship investigated was production budget against total gross.
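Each relationship was modelled in the same way. As a sketch, the first model could be fitted as follows; the CSV file name and object names are assumptions, while the column names match the output below:

# Read the 2015 movie data (file name assumed)
movies <- read.csv("movies2015.csv")

# Simple linear regression of total gross on production budget (in millions)
gross.lm <- lm(Total.Gross.. ~ PB..m, data = movies)
summary(gross.lm)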

Call:
lm(formula = Total.Gross.. ~ PB..m, data = movies)

Residuals:
       Min         1Q     Median         3Q        Max
-210670972  -33115380   -5740931   15731709  653073398

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 24737681   15233454   1.624    0.108
PB..m        1286692     164301   7.831 6.12e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103700000 on 97 degrees of freedom
Multiple R-squared:  0.3874,   Adjusted R-squared:  0.381
F-statistic: 61.33 on 1 and 97 DF,  p-value: 6.115e-12

 

This shows a reasonably strong relationship between the budget and the success of a movie. The low p-value allows the null hypothesis to be rejected, and an R-squared value of 0.39 shows a correlation between success and budget.

The opening weekend is generally the most profitable for a big budget movie and a strong indicator of how the movie will finish. It can be used as a measure of how successful the marketing campaign for a movie was. A linear model of this relationship was created and the following summary report was produced:

Residuals:

Min        1Q    Median        3Q       Max

-92762124 -17709634  -4254184  10566517 263492439

 

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept)  5966236    6649620   0.897    0.372

PB..m         606987      71720   8.463 2.75e-13 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 45260000 on 97 degrees of freedom

Multiple R-squared:  0.4248,   Adjusted R-squared:  0.4188

F-statistic: 71.63 on 1 and 97 DF,  p-value: 2.754e-13

 

The R-squared value is 0.42; this is a slight increase from 0.39 but may not be a large enough increase to be called significant. This indicates that a movie with a strong marketing campaign will get customers excited enough to want to see it as soon as it is released. However, this model does have a number of flaws. It assumes that the production budget is proportional to the marketing budget, which may not be the case. Secondly, a large budget may not always be used as effectively as a smaller one.


Figure 1 – Opening week (USD) vs Budget (USD in millions)

 

Value for Money

A big budget movie should expect large ticket sales, but this does not necessarily mean the best return on investment. For the following analysis, the total gross used in “Budget and Performance” was divided by the production budget to give a Total Gross Factor. This indicates the best value for money when judging the performance of a movie.
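A sketch of how this factor could be derived, assuming the same column names as in the earlier output and that total gross is in dollars while the production budget column PB..m is in millions:

# Total Gross Factor: total gross divided by the production budget (converted to dollars)
movies$Gross.Factor <- movies$Total.Gross.. / (movies$PB..m * 1e6)

# Regress the factor on the production budget
factor.lm <- lm(Gross.Factor ~ PB..m, data = movies)
summary(factor.lm)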

When comparing the total gross factor to the production budget the following summary was observed.

Residuals:
    Min      1Q  Median      3Q     Max
-1.9247 -1.0234 -0.5483  0.0784 16.8228

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.098272   0.379341   5.531 2.69e-07 ***
PB..m       -0.011042   0.004091  -2.699  0.00821 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.582 on 97 degrees of freedom
Multiple R-squared:  0.06984,  Adjusted R-squared:  0.06025
F-statistic: 7.283 on 1 and 97 DF,  p-value: 0.008211

 

This showed a higher p-value, so the null hypothesis could still be rejected, but not as definitively as before. The R-squared value is 0.07, which is not significant. This suggests that spending more money on a movie will not increase the chances of getting a better return on the money invested.

Critical reception

This portion of the analysis looks at how critical reception influences the performance of a movie. It uses the same approach as before, but the independent variable is now the score found on Rotten Tomatoes.

The relationship between week 1 and the score on Rotten Tomatoes was as follows:

Residuals:
      Min        1Q    Median        3Q       Max
-66987362 -28636601 -12231937  10615737 326215017

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 17979800   13907457   1.293   0.1991
RT..          507187     220459   2.301   0.0236 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58110000 on 97 degrees of freedom
Multiple R-squared:  0.05174,  Adjusted R-squared:  0.04197
F-statistic: 5.293 on 1 and 97 DF,  p-value: 0.02356

 

Again this shows a higher p-value than before, but it is still below the cut-off point of 0.05, so the null hypothesis could be rejected, though not with much confidence. The R-squared value of 0.05 shows there is no significant correlation. The same analysis was carried out for weeks 2 and 3, which showed similar figures, suggesting that critical reception made no difference to box office performance in the first three weeks.

Drop off

Another important measure when projecting the success of a movie is the drop-off rate. Big budget movies expect large numbers during the opening weekend; these are the customers who were sold by the marketing, but if reviews are poor a movie may suffer a large drop-off by week 2. While it was previously concluded that critical reception does not affect performance, it may still contribute to the decline in weeks 2 and 3. For this analysis, the gross for the week in question was expressed as a percentage of the gross for week 1. This percentage was then compared to the score on Rotten Tomatoes.
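As a sketch of this calculation, with hypothetical weekly gross column names Week.1 and Week.2 (the Rotten Tomatoes score column RT.. appears in the output below):

# Week 2 gross as a proportion of week 1 gross (Week.1 and Week.2 are hypothetical column names)
movies$Drop.Wk2 <- movies$Week.2 / movies$Week.1

# Compare the drop-off against the Rotten Tomatoes score
drop.lm <- lm(Drop.Wk2 ~ RT.., data = movies)
summary(drop.lm)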

When comparing critical reception to the drop off for week 2 the following summary was found:

Residuals:
    Min      1Q  Median      3Q     Max
-0.6549 -0.3367 -0.1862  0.0597  3.8212

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.242155   0.177543   1.364  0.17575
RT..        0.008711   0.002814   3.095  0.00257 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7418 on 97 degrees of freedom
Multiple R-squared:  0.0899,   Adjusted R-squared:  0.08051
F-statistic: 9.581 on 1 and 97 DF,  p-value: 0.00257

 

This shows a similar result to the relationship between critical reception and performance: the p-value is close to the cut-off and the R-squared is too low to be significant. When comparing week 2 to week 3 and week 1 to week 3, the results were similar. Another analysis carried out was to determine whether critical reception has any effect on how the film finishes compared to how it performed in week 1. Again no significant correlation was found.

Budget and Reception

Finally, the last relationship examined was whether the budget has any relationship with how the movie was received. The summary was as follows:

Residuals:

Min      1Q  Median      3Q     Max

-51.568 -22.971   4.497  18.736  44.763

 

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 52.97059    3.88668   13.63   <2e-16 ***

PB.m         0.06331    0.04192    1.51    0.134

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 

Residual standard error: 26.45 on 97 degrees of freedom

Multiple R-squared:  0.02298,  Adjusted R-squared:  0.0129

F-statistic: 2.281 on 1 and 97 DF,  p-value: 0.1342

 

This was the weakest correlation of all the tests and the only one with a p-value above the cut-off of 0.05.

Predicting Opening Week Performance

As stated earlier, the success of a movie depends on many factors, including the popularity of the franchise, the actors, the director and competition at the time of release. For this example I have chosen to predict the domestic (U.S.) opening week gross of a film being released in late May 2016, “X-Men: Apocalypse”. It is difficult to find a reliable production budget for a movie that has not yet been released, but one site stated $234 million (https://ihumanmedia.com/2015/12/13/234-million-budget-xmen-apocalypse-review/ – accessed 16th April 2016).

The linear model gives the equation: opening week gross = 606,986.9 × (production budget in millions) + 5,966,236.1, which predicts an opening week of $148,001,170. A 95% confidence interval puts the opening week between $122,657,183 and $173,345,156.
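A sketch of how this prediction and interval could be obtained from the opening-week model, using predict with a 95% confidence interval (the column name Week.1 and the object names are assumptions):

# Opening-week model (Week.1 is a hypothetical column name for opening-week gross)
week1.lm <- lm(Week.1 ~ PB..m, data = movies)

# Predict opening-week gross for a $234 million production budget
newfilm <- data.frame(PB..m = 234)
predict(week1.lm, newdata = newfilm, interval = "confidence", level = 0.95)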

The other factors (franchise popularity, actors, etc.) contributing to the opening numbers of a movie are difficult to quantify, but it may be possible to capture them by looking at previous movies in the franchise. By looking at how those movies performed, these other factors may be accounted for.

I have put together a table that looks at the previous movies in the franchise and compares how they actually performed to what the model predicts. When calculating this I also took inflation into account.


Table 1 – X-Men Franchise – Performance and Model

From this table it can be seen that the average error of the model is +8.7%, meaning that on average the model under-projects by 8.7%. However, the first two movies of the franchise were among the films that kicked off the current comic book movie boom. Before this point studios may not have been as confident and budgeted less for these movies to minimise the risk. Since then, comic book movies have proven to be good investments, so studios feel more confident investing in them.

A reasonably big underperformer according to the model is “X-Men: First Class”. One reason it underperformed may be the competition at the time of release. When comparing the competition faced by “X-Men: Days of Future Past”, it appears that First Class faced significantly tougher competition. Looking at day 7 for each film, the three next highest grossing movies facing First Class (including Pirates of the Caribbean, one of the most popular franchises of all time) went on to gross over $500 million domestically, while the competition for Days of Future Past grossed only $162 million. For this reason, I treated the first two movies and “First Class” as outliers and found the average without them. Excluding these outliers, the model under-projects by 2.5% for the X-Men franchise. Taking this into consideration, the model now predicts a gross of $150,991,210 for week 1. This assumes average competition and that the budget is actually $234 million. However, another movie being released on the same day is “Alice Through the Looking Glass”, the sequel to a very successful movie, which will likely hurt performance. The release also comes three weeks after “Captain America: Civil War”, a successful franchise in the same genre with a strong fan following, and one week after “Angry Birds”, something of a wild card at this stage that could take the family audience.

Taking this into consideration, the model could easily over-project. My best guess would be $146,461,474 for week 1, which is 7% lower than the previous number given (the model with previous franchise performance taken into consideration). Although I have given my prediction, I do not feel confident in it, as it is difficult to assess how the competition will perform, and I believe that will be the deciding factor.

Conclusion

The results indicate that the budget of a film correlates with how it will perform, especially in the first week. This does not mean that putting more money into a movie will increase the chances of success; it may instead be a measure of how confident the studio is in the film. Critical reception, however, seems to have no impact on the performance of a movie.

However, it should be noted that the sample analysed consisted of the top 100 performers of 2015 (excluding a few outliers such as re-releases). This is a relatively small sample taken from only one year, and it may be a biased selection: there were nearly 800 movies in the full list, so this sample represents roughly the top 12%. The data for all movies from 2015 could not be retrieved in its entirety because of the time needed to cleanse and sort the data. If all movies were taken into account, a stronger correlation between performance and critical reception might be found. The top performers will generally be large budget movies; smaller budget movies would be expected further down the list and may behave differently. The correlation for lower budget films may depend more on critical reception, as many small independent movies become popular through word of mouth. As for the higher performing movies, critical reception does not seem to affect the overall performance of a movie, nor does a big budget influence the critical response.

If looking to improve the model, accuracy might be increased by measuring the popularity of a movie franchise (based on previous performance) and measuring the competition (looking at the budget and franchise popularity of competing releases). These factors could be included in a multiple linear regression model.

Try R


RStudio

In a previous post I used a fusion table to visualise the population distribution across Ireland. I also went on to compare this to the distribution of wealth across Ireland using the average income per person. The two fusion maps indicated there may be a relationship worth investigating. To investigate further I used RStudio to create a linear model.

Data

The data for the population was taken from http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/ and the data for the average income was taken from http://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/Define.asp?maintable=CIA01. I used pivot tables to isolate just the counties and average income per person in one Excel sheet. On another Excel sheet I had a list of counties and total population. These sheets were saved as CSV files and edited to remove empty columns (Excel saved a series of commas after the population and income columns).

Inputting Data to R Studio

The next step was to read the data files into RStudio. The working directory was set to the location where I had saved my CSV files.

> setwd("C:/Users/Sat/Documents/R project")

 

I then read my data into data frames as follows:

> inc <- read.csv("justinc.csv")   # average income per person

> pop <- read.csv("justpop.csv")   # total population

 

The pop and inc data frames were then merged together

> rel <- merge(x = pop, y = inc)

 

and the data was then plotted.

> plot(rel$Pop, rel$Income)


Figure 1 – Plot of Population and Income

The test for correlation was run

> cor.test(rel$Pop, rel$Income)

Pearson’s product-moment correlation

data:  rel$Pop and rel$Income
t = 4.835, df = 24, p-value = 6.318e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4328738 0.8566950
sample estimates:
      cor
0.7024438

 

This shows a p-value of 6.318e-05, which is significantly lower than the cut-off point of 0.05. The Pearson product-moment correlation coefficient is approximately 0.7, which indicates a reasonably strong positive correlation between the two sets of data. Both of these indicate there is a relationship between the average income of a county and its population.

To investigate the correlation further, a linear model is examined. The linear model is created by setting income as a linear function of population.

> popinc.lm = lm(Income ~ Pop, data = rel)

> coeffs = coefficients(popinc.lm); coeffs

 (Intercept)          Pop
2.124751e+04 6.038183e-03
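The fitted line shown in Figure 2 can then be drawn on top of the scatter plot from Figure 1, for example:

> plot(rel$Pop, rel$Income)
> abline(popinc.lm)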


Figure 2 – Abline (Model Inserted)

The summary command was then run on the model:

>summary(popinc.lm)

Call:

lm(formula = Income ~ Pop, data = rel)

Residuals:

Min      1Q  Median      3Q     Max

-3386.5  -953.5  -124.2   460.3  4203.3

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.125e+04  3.707e+02  57.320  < 2e-16 ***
Pop         6.038e-03  1.249e-03   4.835 6.32e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1520 on 24 degrees of freedom

Multiple R-squared:  0.4934,   Adjusted R-squared:  0.4723

F-statistic: 23.38 on 1 and 24 DF,  p-value: 6.318e-05

 

While the p-value is quite low and shows there is a correlation, the R-squared value is 0.4934. This can be interpreted as county population having a moderate influence on the average income: population accounts for approximately 50% of the variation in income, but a large amount of variation is still due to “errors” or other unaccounted-for factors. This would be expected, as a large number of factors are involved in determining average income. Further investigation could involve adding the quality of education in each county as a variable to the linear model. There are published lists which rank colleges, and these could be used to derive a score for each county. Then, by adding the education score as a second variable, the impact on the adjusted R-squared value could be measured. It is important to compare adjusted R-squared values because the number of degrees of freedom decreases when adding more variables. If only the plain R-squared were considered, every added variable would appear significant, as R-squared can only increase as degrees of freedom decrease. Adjusted R-squared takes the degrees of freedom into account and gives a better idea of whether the model is actually improving.
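A sketch of how such an extension might look, assuming a hypothetical column Edu.Score holding an education score for each county:

> popinc2.lm = lm(Income ~ Pop + Edu.Score, data = rel)   # Edu.Score is a hypothetical education score per county
> summary(popinc2.lm)                                     # compare the adjusted R-squared with the single-variable model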

The conclusion is that a larger population correlates with a larger average income per person, but that a significant amount of the variance is still due to other factors. One possible reason, as explained in my previous post, is the synergy of large numbers of people working together, with everyone benefitting from each other’s work. This would also help explain how cities came to be built.

Fusion Tables


The aim of this project was to create a fusion table using the Irish County 2011 Population data which can be found at http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/ . This data was to be displayed on a heat map of Ireland using a KML file found at http://www.independent.ie/editorial/test/map_lead.kml .

Cleansing the data

The Irish County 2011 Population table contains data on the number of males, females and total population in each county of the Republic of Ireland. The table was copied and pasted into an Excel file. However, the data at this stage was not ideal, as it included not only the data for each county but also the data for each province. This can lead to problems as Google can interpret the data incorrectly (e.g. when analysing the row for Munster, it would interpret Munster as a location in Germany). To avoid this, these rows were deleted from the Excel file. Another problem was that some counties were split into regions; for example, Tipperary was split into North Tipperary and South Tipperary. These rows were combined into a single row named Tipperary, with its values found by adding together the values from the separate regions. Other data cleansing included correcting misspelled county names and incorrect data (e.g. Laois was misspelled, the total for males and females in Kerry did not match the total given, and the total male population in Waterford was given as 54,464 when the rest of the data indicated it was actually 56,464). After cleansing the data, I saved the Excel file as “cleansed population data”.

After the data was cleansed it could be used by the Fusion Tables software. This was done by signing into my Google account and visiting https://support.google.com/fusiontables/answer/2571232 . Clicking ‘create table’ led to a page where I could upload the Excel file. After opening the fusion table it could be seen that some counties were geocoded incorrectly (e.g. Longford was located beside London, U.K.). However, these errors were corrected by the KML file.

KML File and Merging

The next step was to create a new table using the KML file found at http://www.independent.ie/editorial/test/map_lead.kml . This file stores the shape and location of each county in a form that can be read by the Fusion Tables software. The steps used to create this table were the same as above, only this time selecting a KML file instead of an Excel file.

I confirmed the KML file was correct by viewing the map of the geometry. The next step was to select File > Merge and choose the file to merge with. As I was already in the KML table, I chose to merge with the “cleansed population data” table created earlier. I matched the county name column in both files and selected the data I was interested in (total population and KML geometry). After inspecting the map I noticed not all of the data had merged correctly: there were multiple copies of most counties, and the population data for Cavan and Carlow had not been copied correctly. By searching for Carlow and Cavan in the Rows tab, I could see that Google had given these rows a county name of Clare despite the KML code calling them Carlow or Cavan. Google does not allow editing of tables that have already been merged, so to correct this I deleted the merged table and went back to the KML-only table. Searching for Carlow in the Rows tab, I found that Carlow was in the KML code but was named Cavan in the county name column. After correcting these details the files merged without any problem.

Heat Map

After the files were merged, the heat map was created by selecting Feature map > Change feature styles > Polygons > Fill colour > Buckets and dividing the data into 8 buckets. The colours and distribution used can be seen in the image below. The legend was added by going to Change feature styles > Automatic legend and ticking the option to show the polygon fill legend.

Figure 1 – Heat map population

Analysis

It can be seen from this map that there are three key areas: Dublin, Cork and Galway. The population generally drops as you move away from these areas. To investigate how this affects the distribution of wealth within Ireland, a heat map of average income per person was generated. This heat map was produced using the same steps as the population heat map, using the data from http://www.cso.ie/px/pxeirestat/Statire/SelectVarVal/Define.asp?maintable=CIA01.

Figure 2 – Heat map Income

This map shows a somewhat similar pattern to the population distribution, especially around Dublin. This may be because higher earners in large corporations located in the capital bring up the average salary. Another reason may be the population benefiting from the synergy of many people working together, which is the very reason towns and cities exist.