## Introduction

The aim of this project was to analyse a dataset using several different software packages. For this project I created my own dataset using Excel. The advantage of using my own data is that I knew the true underlying model, so I could compare the model produced by multiple regression against the model actually used to generate the data.

The fictional situation was as follows: 1,013 professional basketball players were selected and put through an obstacle course. At 100 different points throughout the course each player attempted to score by shooting the ball through the hoop, and the number of successful shots was recorded. The data collected also measured a number of different attributes of the players (some related to sport, some not). The model aims to predict how a player will perform in the obstacle course given the relevant attributes.

The ten attributes measured were as follows:

- Height (in Feet)
- Weight (in Stone)
- Salary (in million per annum)
- Number of Years playing professionally
- Number of professional games
- Number of Hours training a week
- Number of Cars owned
- Number of Children
- Amount of money spent on sport supplements per year (in thousands)
- Calories consumed per day

## Data Analysis in R

The dataset was read into R. The pairs function was used to visualise the data. This gave the following output:

Figure 1 – Pairs

There are a number of relationships to note. There appears to be a strong correlation between height and weight, which makes sense as taller people generally weigh more than shorter people. There is another strong correlation between the number of games played and the number of years playing professionally; again this would be expected. The response variable, score, shows a strong correlation with salary, so we expect this factor to have a strong influence in the final model. To measure the correlations precisely, the `cor` function was used to produce a correlation matrix.
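These steps can be reproduced with a few lines of R (a sketch — the file name `players.csv` is an assumption; the data frame name `fs` matches the model call shown later):

```r
# Read the dataset (file name is illustrative) and visualise it
fs <- read.csv("players.csv")

pairs(fs)            # scatterplot matrix of every pair of variables (Figure 1)
round(cor(fs), 2)    # correlation matrix, rounded to two decimal places
```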

|                | Score | Height   | Weight   | Salary | Years    | Games    | Hours.per.week | Cars  | Children | Supplements | Calories |
|----------------|-------|----------|----------|--------|----------|----------|----------------|-------|----------|-------------|----------|
| Score          | 1.00  | 0.02     | 0.01     | 0.77   | 0.20     | 0.20     | 0.07           | -0.02 | -0.01    | 0.13        | 0.04     |
| Height         | 0.02  | 1.00     | **0.91** | -0.01  | 0.03     | 0.03     | 0.00           | -0.03 | 0.00     | 0.00        | -0.02    |
| Weight         | 0.01  | **0.91** | 1.00     | -0.01  | 0.03     | 0.03     | -0.02          | -0.01 | 0.00     | 0.01        | -0.01    |
| Salary         | 0.77  | -0.01    | -0.01    | 1.00   | 0.00     | 0.01     | -0.03          | 0.00  | -0.02    | 0.05        | 0.03     |
| Years          | 0.20  | 0.03     | 0.03     | 0.00   | 1.00     | **0.99** | -0.03          | -0.02 | 0.02     | 0.01        | -0.01    |
| Games          | 0.20  | 0.03     | 0.03     | 0.01   | **0.99** | 1.00     | -0.03          | -0.02 | 0.02     | 0.01        | -0.01    |
| Hours.per.week | 0.07  | 0.00     | -0.02    | -0.03  | -0.03    | -0.03    | 1.00           | -0.02 | 0.01     | 0.03        | -0.09    |
| Cars           | -0.02 | -0.03    | -0.01    | 0.00   | -0.02    | -0.02    | -0.02          | 1.00  | -0.04    | 0.01        | -0.04    |
| Children       | -0.01 | 0.00     | 0.00     | -0.02  | 0.02     | 0.02     | 0.01           | -0.04 | 1.00     | -0.04       | 0.03     |
| Supplements    | 0.13  | 0.00     | 0.01     | 0.05   | 0.01     | 0.01     | 0.03           | 0.01  | -0.04    | 1.00        | -0.05    |
| Calories       | 0.04  | -0.02    | -0.01    | 0.03   | -0.01    | -0.01    | -0.09          | -0.04 | 0.03     | -0.05       | 1.00     |

As expected, height and weight have a correlation of 0.91, while the games and years correlation was measured at 0.99. This is known as multicollinearity: strongly correlated predictors can inflate the variance of the coefficient estimates and make the model unreliable. To avoid this, both weight and years were removed from the model.
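With weight and years dropped, the model can be fitted with `lm` (a sketch matching the call reported in the output):

```r
# Fit the multiple regression, excluding the collinear
# predictors Weight and Years
model <- lm(Score ~ Height + Salary + Games + Hours.per.week +
              Cars + Children + Supplements + Calories, data = fs)
summary(model)   # coefficient table, R-squared and F-statistic
```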

The following output was produced:

```
Call:
lm(formula = Score ~ Height + Salary + Games + Hours.per.week +
    Cars + Children + Supplements + Calories, data = fs)

Residuals:
     Min       1Q   Median       3Q      Max
-28.1619  -6.1410  -0.1649   6.2595  27.6160

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.601e+01  9.175e+00  -1.745   0.0813 .
Height          1.587e+00  1.375e+00   1.154   0.2486
Salary          5.718e+00  1.398e-01  40.903  < 2e-16 ***
Games           4.855e-02  4.545e-03  10.682  < 2e-16 ***
Hours.per.week  2.994e-01  5.511e-02   5.433 6.94e-08 ***
Cars           -8.140e-02  1.968e-01  -0.414   0.6792
Children       -3.091e-04  1.481e-01  -0.002   0.9983
Supplements     7.798e-02  1.568e-02   4.972 7.79e-07 ***
Calories        1.262e-03  5.916e-04   2.133   0.0332 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.067 on 1004 degrees of freedom
Multiple R-squared:  0.6501,    Adjusted R-squared:  0.6473
F-statistic: 233.2 on 8 and 1004 DF,  p-value: < 2.2e-16
```

This shows a multiple R-squared of 0.6501. In multiple linear regression the more reliable value is the adjusted R-squared, here 0.6473, meaning the model explains roughly 65% of the variation in score. The most important factors are salary, games and supplements, each with a p-value far below the cut-off point of 0.05, so the chance that these factors have no effect on the result is negligible. Calories also had a low p-value of 0.033, meaning there is only a 3.3% chance of seeing an effect this large if calories had no real influence. Height showed a p-value of 0.08, which is above the cut-off point and may be worth further investigation.

## Testing Normality of the Residuals

To test the "goodness of fit" of the model, we can examine the residuals (the distances between the predicted and actual values). The residuals are expected to be normally distributed about the predicted values. There are a number of ways to test or visualise this.

Figure 2 – First Normality Test

For Figure 2 the residuals were standardised by dividing each residual by the standard deviation of all the residuals. Approximately 95% of the standardised residuals should lie within two standard deviations. Figure 2 looks reasonable and shows homoscedasticity (constant spread across the range of fitted values).

Figure 3 – QQ Plot

Figure 3 is known as a quantile–quantile (QQ) plot. To produce it, all the residuals are sorted from lowest to highest. The black line represents the position each residual would occupy if the residuals were perfectly normally distributed: out of roughly 1,000 residuals, the lowest would be expected to fall around three standard deviations below the mean, the median at the mean, and the highest around three standard deviations above it. The closer the residuals lie to this line, the closer they are to a normal distribution. Figure 3 shows a very good fit, so we are satisfied that the residuals are normally distributed.
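Both diagnostics can be produced in R (a sketch, assuming the fitted model is stored in a variable named `model`):

```r
# Standardise the residuals by the standard deviation of all residuals
std.res <- resid(model) / sd(resid(model))

# Figure 2: standardised residuals against fitted values;
# roughly 95% of points should fall between the dashed lines
plot(fitted(model), std.res)
abline(h = c(-2, 2), lty = 2)

# Figure 3: QQ plot with a reference line for a normal sample
qqnorm(std.res)
qqline(std.res)
```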

## Rapid Miner

This model can also be built using RapidMiner, a software package that provides data analytics through a GUI. It offers an easy-to-use interface for people who may not be as comfortable coding.

Figure 4 – RapidMiner Design

Figure 4 shows how the data was fed into an operator that created the linear model.

Figure 5 – RapidMiner Results

Figure 5 and Figure 6 show the results found by RapidMiner. These results match those found in R, with the only differences due to rounding.

Figure 6 – Squared Correlation

## Excel

The Data Analysis add-in for Excel can also carry out a regression analysis. Excel gave the following output.

Figure 7 – Excel Output

## Further Investigation

When creating the dataset I used a function to generate random values that were normally distributed with a chosen mean and standard deviation. I also added an error term to account for other factors not listed. Weight and games played were linear functions of height and years respectively, with normally distributed errors added.

To generate a score I assigned a coefficient to each attribute. Each coefficient was multiplied by the corresponding attribute value, and the results were summed together with the error term. This raw score was then standardised with a min–max transformation, *y = (x − min)/(max − min)*, and rescaled to values between 1 and 100.
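A minimal sketch of the generation process in R (all means, standard deviations and coefficients here are placeholders, not the exact values used):

```r
set.seed(1)
n <- 1013

# Normally distributed attributes with chosen means and sds
height <- rnorm(n, mean = 6.2, sd = 0.2)
salary <- rnorm(n, mean = 5,   sd = 2)      # placeholder mean and sd
weight <- 2 * height + rnorm(n, sd = 0.5)   # linear function of height + noise

# Raw score: coefficient * attribute, summed, plus an error term
error <- rnorm(n, mean = 0, sd = 300)
raw   <- 200 * height + 100 * salary + error

# Min-max standardisation, rescaled to the range 1-100
score <- 1 + 99 * (raw - min(raw)) / (max(raw) - min(raw))
```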

I estimated how much variation could be attributed to each attribute in my model. Multiplying each coefficient by the mean of the corresponding attribute gives an "importance" value. For example, the mean height (6.2 ft) multiplied by the chosen coefficient for height (200) gives 6.2 × 200 = 1,240, so on average height contributes 1,240 to the raw score; its importance value is therefore 1,240. The importance of each attribute can then be expressed as a percentage of the sum of all importance values. The following table gives the importance percentage I found for each attribute:

| Attribute    | Height | Salary | Games | Hours | Cars | Children | Supplements | Calories | Weight | Years |
|--------------|--------|--------|-------|-------|------|----------|-------------|----------|--------|-------|
| Importance % | 22%    | 44%    | 14%   | 13%   | 0%   | 0%       | 4%          | 2%       | 0%     | 0%    |
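The importance percentages can be computed directly from the coefficients and attribute means; the values below are placeholders that show the mechanism, not the actual coefficients used:

```r
# Importance = coefficient * attribute mean, as a share of the total
coefs <- c(Height = 200, Salary = 80, Games = 0.5)   # placeholder coefficients
means <- c(Height = 6.2, Salary = 5,  Games = 300)   # placeholder means

imp <- coefs * means
round(100 * imp / sum(imp), 1)   # importance as a percentage of the total
```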

I expected height to be given more significance in the model, and supplements and calories less. Height was given a mean of 6.2 and a standard deviation of 0.2, while supplements had a mean of 30 and a standard deviation of 20. This means supplements had a much larger range and variation, so the change in score attributable to money spent on supplements is easier to detect. Height varied far less, and the changes attributable to it are lost among the other factors, including the added random error.

To investigate further I increased the sample size from 1,013 to 102,823. This helps overcome the problem of noise or random error. The larger sample gave the following output:

```
Call:
lm(formula = Score ~ Height + Salary + Games + Hours.per.week +
    Cars + Children + Supplements + Calories, data = fs2)

Residuals:
    Min      1Q  Median      3Q     Max
-32.753  -4.773  -0.033   4.766  32.060

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -5.777e+00  7.212e-01  -8.010 1.16e-15 ***
Height          1.800e+00  1.100e-01  16.363  < 2e-16 ***
Salary          4.712e+00  1.110e-02 424.671  < 2e-16 ***
Games           3.776e-02  3.631e-04 103.990  < 2e-16 ***
Hours.per.week  2.737e-01  4.417e-03  61.965  < 2e-16 ***
Cars            2.339e-02  1.470e-02   1.591    0.112
Children       -1.672e-02  1.132e-02  -1.477    0.140
Supplements     6.499e-02  1.171e-03  55.522  < 2e-16 ***
Calories        2.606e-04  4.417e-05   5.900 3.65e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.081 on 102813 degrees of freedom
Multiple R-squared:  0.6587,    Adjusted R-squared:  0.6586
F-statistic: 2.48e+04 on 8 and 102813 DF,  p-value: < 2.2e-16
```

The p-values of all the relevant attributes have now dropped well below the cut-off point of 0.05. The t-value of height was 1.154 in the 1,013-sample set but has increased to 16.363 with roughly 100,000 samples. In the smaller sample the t-value of height was lower than that of calories, but it has now overtaken it, and I would expect it to eventually overtake supplements with a large enough sample.

This illustrates the impact of good data when trying to build the most accurate model: the less an attribute varies, the more important it is to collect a larger dataset.