21.1 Activity

1. Conduct an Exploratory Data Analysis

  • Classify the variables in the dataset. Using descriptive statistics, what are the data telling us?
  • Make a histogram and box plot for \(Y\). Are there any outliers? What is the IQR?
  • Create a new variable considering age:

\[ X_{11} = \begin{cases} 1 & \text{if } 21 \leq \text{ age } \leq 30,\\ 2 & \text{if } 31 \leq \text{ age } \leq 40, \\ 3 & \text{if age} > 40 \end{cases} \]

  • Compare systolic blood pressure across the age ranges (hint: box plot).
  • Consider two groups of people: (1) Weight \(\leq 65\) and (2) Weight \(> 65\). Do you observe a significant difference in mean systolic blood pressure between them? (hint: test on the mean).
  • Considering the two weight groups defined above, is the mean forearm skinfold statistically different between them?
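
The coding of \(X_{11}\), the IQR, and a two-group comparison of means can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: the function names `age_group`, `iqr` and `welch_t` are hypothetical, the numbers are made up for illustration and are not the activity's data, ages are assumed to be whole years, and Welch's unequal-variance \(t\) statistic is only one possible choice for the "test on the mean".

```python
import math
import statistics


def age_group(age):
    """Code age into the three ranges of X_11 (whole-year ages assumed)."""
    if 21 <= age <= 30:
        return 1
    if 31 <= age <= 40:
        return 2
    if age > 40:
        return 3
    raise ValueError(f"age {age} is outside the ranges defined for X_11")


def iqr(values):
    """Interquartile range Q3 - Q1 (quartile conventions differ between packages)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def welch_t(a, b):
    """Welch t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up illustrative values, NOT the activity's data
ages = [22, 30, 31, 40, 41, 54]
sbp_light = [110, 114, 118, 120, 108]  # hypothetical Y for Weight <= 65
sbp_heavy = [124, 130, 128, 136, 122]  # hypothetical Y for Weight > 65

groups = [age_group(a) for a in ages]
t_stat, df = welch_t(sbp_light, sbp_heavy)
print("age groups:", groups)
print("IQR of the pooled Y values:", iqr(sbp_light + sbp_heavy))
print(f"Welch t = {t_stat:.2f} on about {df:.1f} degrees of freedom")
```

The \(t\) statistic is then compared with a \(t\) distribution on the reported degrees of freedom; a statistics package would return the corresponding p-value directly.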

2. Perform a detailed analysis of the following multiple regression model

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_9X_{9i} + \epsilon_i\]

where

  • \(Y\) = systolic blood pressure
  • \(X_1\) = age
  • \(X_2\) = years in urban area
  • \(X_3\) = \(\dfrac{X_2}{X_1}\) = fraction of life in urban area
  • \(X_4\) = weight (kg)
  • \(X_5\) = height (mm)
  • \(X_6\) = chin skinfold
  • \(X_7\) = forearm skinfold
  • \(X_8\) = calf skinfold
  • \(X_9\) = resting pulse rate
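
A least-squares fit of a model with this structure can be sketched with NumPy as follows. The data here are synthetic stand-ins generated from invented coefficients, not the activity's dataset, and the sample size `n = 40` is an arbitrary assumption; only the shape of the design matrix (an intercept plus nine regressors, with \(X_3 = X_2 / X_1\)) follows the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40  # hypothetical sample size

# Synthetic stand-ins for X1..X9 (NOT the activity's data)
age = rng.integers(21, 55, size=n).astype(float)  # X1
years = rng.uniform(1, 20, size=n)                # X2
fraction = years / age                            # X3 = X2 / X1
others = rng.normal(size=(n, 6))                  # placeholders for X4..X9

# Design matrix: a column of ones for the intercept, then the nine regressors
X = np.column_stack([np.ones(n), age, years, fraction, others])

# Invented coefficients, used only to generate an illustrative response
beta_true = np.array([100, 0.2, -0.5, 10, 1, -1, 0.5, 0, 0, 0.3])
y = X @ beta_true + rng.normal(scale=5, size=n)

# Ordinary least squares estimate of (beta_0, ..., beta_9)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals and coefficient of determination
resid = y - X @ beta_hat
r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

print("estimated coefficients:", np.round(beta_hat, 2))
print("R^2:", round(float(r2), 3))
```

With the real data, the columns of `X` would be read from the dataset instead of generated, and a statistics package would additionally report standard errors and test statistics.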

3. Considering the results, answer these questions

  • What is the interpretation of the intercept?
  • What is the interpretation of the coefficients of \(X_4\) and \(X_9\)?
  • Interpret the \(R^2\)
  • Can we assume that residuals are normally distributed?
  • Can we assume that all the slope parameters \(\beta_1, \ldots, \beta_9\) are equal to zero? (hint: test for all variables)
  • Are \(\beta_1\) and \(\beta_8\) statistically significant? (hint: tests for individual variables)
  • Do you recommend the full model to estimate \(Y\)? Why?
  • Before fitting any new models, which model would you recommend analyzing next?
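
As a reminder of the statistics behind the two hints in this list, the usual forms are given below, assuming the standard linear model with independent normal errors. The symbols \(n\) (number of observations), \(p = 9\) (number of regressors), \(SSR\) (regression sum of squares) and \(SSE\) (residual sum of squares) are notation introduced here, not taken from the activity.

\[ F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)} \sim F_{p,\,n-p-1} \quad \text{under } H_0: \beta_1 = \ldots = \beta_9 = 0 \]

\[ t_j = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \sim t_{n-p-1} \quad \text{under } H_0: \beta_j = 0 \]

A large \(F\) leads to rejecting the hypothesis that all slopes are zero, while each \(t_j\) assesses one coefficient given that the other regressors are in the model.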

4. Working with regressors

  • Make the XY (matrix) scatterplot of Height, Chin, Forearm, Calf, and Pulse.
  • Can we state that multicollinearity is not a problem in this case? Why?
  • Do you recommend transforming any of the variables? How?
  • Do you recommend removing any of the variables from the analysis? Why?
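
One common numerical complement to the scatterplot matrix is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing regressor \(j\) on all the others. The activity does not ask for VIFs explicitly, so treat the following as an optional sketch: the function name `vif` is hypothetical and the data are synthetic, built so that two columns are nearly identical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (regressors only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)
    return out


# Synthetic regressors (NOT the activity's data)
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print("VIFs:", np.round(vifs, 1))
```

A VIF close to 1 indicates little linear dependence on the other regressors; values above roughly 5 or 10 are often read as a warning sign, although these thresholds are rules of thumb rather than formal tests.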

5. Final model

  • Based on the previous results, fit and present a model that you recommend to estimate \(Y\) (hint: all parameters are significant). Why do you select this model?
  • Diagnose the model (\(R^2\), \(F\)-test, \(t\)-tests, residuals).
  • Simulate three cases (3 new people). Use this information to estimate their \(Y\)
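
Estimating \(Y\) for new people amounts to multiplying their regressor values by the fitted coefficients. The sketch below assumes, purely for illustration, a final model with two regressors (weight and pulse); the training data, the invented coefficients in `beta_true` and the three simulated people are all hypothetical and are not results from the activity's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50  # hypothetical sample size

# Synthetic training data (NOT the activity's data); the two regressors
# stand in for whichever variables survive into the final model.
weight = rng.uniform(50, 80, size=n)  # kg
pulse = rng.uniform(55, 90, size=n)   # beats per minute
X = np.column_stack([np.ones(n), weight, pulse])
beta_true = np.array([60.0, 1.0, 0.1])  # invented coefficients
y = X @ beta_true + rng.normal(scale=2.0, size=n)

# Fit the model by ordinary least squares
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Three simulated people: a leading 1 for the intercept, then weight and pulse,
# chosen inside the range of the training data to avoid extrapolation
new_people = np.array([
    [1.0, 55.0, 60.0],
    [1.0, 65.0, 72.0],
    [1.0, 78.0, 85.0],
])
y_pred = new_people @ beta_hat

for row, pred in zip(new_people, y_pred):
    print(f"weight={row[1]:.0f} kg, pulse={row[2]:.0f} -> estimated Y = {pred:.1f}")
```

Keeping the simulated cases within the observed range of each regressor matters because a fitted linear model gives no guarantee about its behaviour outside the data used to estimate it.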

6. Conclusions

  • Present your final comments and conclusions on the overall project.