16.8 Dummy Variables

16.8.1 A Dummy Variable

An indicator variable, or dummy variable, is an input variable that represents qualitative data, such as gender, race, etc. Dummy variables are often referred to as binary variables because they take just two values, \(1\) or \(0\), with \(1\) generally representing the presence of a characteristic and \(0\) representing its absence.

For example, if we had a dummy variable called male, \(1\) would indicate that the individual is male and \(0\) would indicate that the individual is female (or non-male to be more precise). Using a \(1\) and \(0\) to represent the presence or absence of a particular characteristic is arbitrary, but as we will see, it is quite convenient. Generally, we can define a dummy variable \(D\) as

\[ \begin{aligned} D = \begin{cases} 1 & \text{if characteristic is present} \\ 0 & \text{if characteristic is not present} \end{cases} \end{aligned} \]
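As a quick sketch, such a \(0\)/\(1\) encoding can be built directly from a qualitative variable. The labels and data below are hypothetical, purely for illustration:

```python
# Sketch: encoding a qualitative characteristic as a 0/1 dummy variable.
# The category labels below are invented for illustration.
degree = ["yes", "no", "yes", "yes", "no"]

# D = 1 if the characteristic (here, a college degree) is present, 0 otherwise
D = [1 if d == "yes" else 0 for d in degree]
print(D)  # [1, 0, 1, 1, 0]
```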

16.8.2 In the Model

Perhaps the most common use of dummy variables in a linear regression model is to modify the intercept parameter. Adding an indicator variable \(D\) to the linear model, along with a new parameter \(\beta_2\) gives us

\[ y_i = \beta_{0} + \beta_{1}x_i + \beta_2 D_i + \epsilon_i \] Suppose we had a regression model where the variable \(D = \text{DEGREE}\) represents whether or not an individual obtained a college degree, with \(1\) indicating that he or she did, and \(0\) otherwise. In this model, the dependent variable \(y\) is income, and another independent variable, \(x = \text{EXPERIENCE}\), indicates how many years of work experience the individual has. Such a model can be represented as follows:

\[ \text{INCOME}_i = \beta_{0} + \beta_{1}\text{EXPERIENCE}_i + \beta_2 \text{DEGREE}_i + \epsilon_i \]

That is, for those with a college degree, the intercept of the fitted model is \((\hat \beta_{0} + \hat \beta_2)\), while the intercept for those without a college degree is simply \(\hat \beta_{0}\):

\[ \begin{align} \hat{\text{INCOME}} = \begin{cases} (\hat \beta_{0} + \hat \beta_2) + \hat \beta_{1}\text{EXPERIENCE}, & \text{If DEGREE}=1 \\ \hat \beta_{0} + \hat \beta_{1}\text{EXPERIENCE}, & \text{If DEGREE} = 0 \end{cases} \end{align} \]
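The intercept shift implied by the two cases above can be sketched numerically. Everything in this example (variable names, coefficient values, the noise level) is invented for illustration, not taken from the text's data:

```python
import numpy as np

# Sketch: fitting y = b0 + b1*x + b2*D by OLS on synthetic data, to show
# that the dummy D shifts the intercept while the slope is shared.
rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)    # years of experience (continuous x)
degree = rng.integers(0, 2, n)        # dummy: 1 = has a college degree
# true model: intercept 30, slope 2 per year, degree premium 15 (all made up)
income = 30 + 2.0 * experience + 15 * degree + rng.normal(0, 1, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), experience, degree])
b0, b1, b2 = np.linalg.lstsq(X, income, rcond=None)[0]

print(round(b0 + b2, 1))  # estimated intercept for DEGREE = 1
print(round(b0, 1))       # estimated intercept for DEGREE = 0
```

Both groups share the estimated slope \(\hat\beta_1\); only the intercept differs, by \(\hat\beta_2\).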

16.8.3 An Example

We have information about the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members. The available data set has 397 observations on the following 6 variables:

Salaries for Professors: Variables

| label | description |
|---|---|
| rank | a factor with levels AssocProf, AsstProf and Prof |
| discipline | a factor with levels A ("theoretical" departments) or B ("applied" departments) |
| yrs.since.phd | years since PhD |
| yrs.service | years of service |
| sex | a factor with levels Female and Male |
| salary | nine-month salary, in dollars |

Source: carData package in R
Salaries for Professors: Data

| rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|
| Prof | B | 27 | 14 | Male | 147349 |
| AsstProf | B | 2 | 0 | Male | 78000 |
| AssocProf | B | 48 | 53 | Male | 90000 |
| Prof | A | 33 | 31 | Male | 111350 |
| Prof | A | 39 | 36 | Male | 117515 |
| Prof | A | 49 | 43 | Male | 72300 |
| Prof | A | 16 | 11 | Male | 88175 |
| Prof | A | 23 | 23 | Male | 108200 |
| Prof | A | 25 | 15 | Male | 95329 |
| Prof | A | 30 | 26 | Male | 136660 |

Note: First 10 observations

16.8.3.1 Categorical variables with two levels

Suppose we wish to investigate differences in salary between males and females.

Based on the gender variable, we can create a new dummy variable, \(D\), that takes the value:

\[ \begin{align} D = \begin{cases} 1 & \text{if a person is male} \\ 0 & \text{if a person is female} \end{cases} \end{align} \] and use this variable as a predictor in the regression model \(Y_i = \beta_0 + \beta_1 D_i + \epsilon_i\), leading to:

\[ \begin{align} E(Y_i) = \begin{cases} \beta_{0} + \beta_1, & \text{if person is male: } D=1 \\ \beta_{0}, & \text{if person is female: } D=0 \end{cases} \end{align} \] The coefficients can be interpreted as follows:

\[ \begin{align} \begin{cases} \beta_{0} + \beta_1 & \text{is the mean salary among males}\\ \beta_{0} & \text{is the mean salary among females}\\ \beta_1 & \text{is the mean difference in salary between males and females} \end{cases} \end{align} \]
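This interpretation can be checked numerically: when the only regressor is a dummy, the OLS estimates are exactly the two group means and their difference. A minimal sketch with invented toy salaries (not the carData values):

```python
import numpy as np

# Sketch: with a single dummy regressor, OLS recovers the group means exactly.
# Salaries below are invented toy numbers (in $1000s), not the actual data.
salary = np.array([90.0, 95.0, 100.0, 110.0, 120.0, 130.0])
D = np.array([0, 0, 0, 1, 1, 1])  # 1 = male, 0 = female

X = np.column_stack([np.ones_like(D), D]).astype(float)
beta0, beta1 = np.linalg.lstsq(X, salary, rcond=None)[0]

print(round(beta0, 1))          # mean salary among females: 95.0
print(round(beta0 + beta1, 1))  # mean salary among males: 120.0
print(round(beta1, 1))          # mean difference: 25.0
```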


Dependent variable: salary (in $1000s)

| Regressor | \(\hat \beta\) | Std. Error | \(t_{stat}\) | \(p\)-value |
|---|---|---|---|---|
| (Intercept) | 101.00 | 4.81 | 21.00 | <0.001 |
| sex [Male] | 14.09 | 5.06 | 2.78 | 0.0057 |

Observations: 397
R² / R² adjusted: 0.019 / 0.017


From the output above (where salary is measured in thousands of dollars), the mean salary for females is estimated to be \(101.002\), whereas the mean salary for males is estimated to be \(101.002 + 14.088 = 115.090\). The \(p\)-value for the dummy variable \(\text{sexMale}\) is significant, suggesting statistical evidence of a difference in mean salary between the genders.

The decision to code males as \(1\) and females as \(0\) (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.

Let’s see the output when \(D=1\) if the person is female:


Dependent variable: salary (in $1000s)

| Regressor | \(\hat \beta\) | Std. Error | \(t_{stat}\) | \(p\)-value |
|---|---|---|---|---|
| (Intercept) | 115.09 | 1.59 | 72.50 | <0.001 |
| sex [Female] | -14.09 | 5.06 | -2.78 | 0.0057 |

Observations: 397
R² / R² adjusted: 0.019 / 0.017


Now the estimates for \(\beta_0\) and \(\beta_1\) are \(115.090\) and \(-14.088\), respectively, leading once again to a predicted mean salary of \(115.090\) for males and an estimate of \(115.090 - 14.088 = 101.002\) for females.
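This coding flip can be sketched with the same invented toy numbers as before: reversing the dummy negates the slope and moves the intercept to the other group's mean, while the fitted values are unchanged.

```python
import numpy as np

# Sketch: flipping the dummy coding (female = 1 instead of male = 1) changes
# the coefficients but not the fit. Toy salaries (in $1000s), invented.
salary = np.array([90.0, 95.0, 100.0, 110.0, 120.0, 130.0])
male = np.array([0, 0, 0, 1, 1, 1])
female = 1 - male

def fit(d):
    X = np.column_stack([np.ones_like(d, dtype=float), d])
    return np.linalg.lstsq(X, salary, rcond=None)[0]

b0_m, b1_m = fit(male)    # intercept = female mean, slope = male - female
b0_f, b1_f = fit(female)  # intercept = male mean,   slope = female - male

print(round(b0_f, 1), round(b1_f, 1))  # 120.0 -25.0

# The two codings produce identical fitted values
print(np.allclose(b0_m + b1_m * male, b0_f + b1_f * female))  # True
```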

Extending the model with years of service as a second regressor (keeping the dummy coded as \(1\) for females) gives:

Dependent variable: salary000 (OLS linear regression)
Observations: 397; F(2, 394) = 26.821; R² = 0.120; Adj. R² = 0.115

| Regressor | Est. | 2.5% | 97.5% | t val. | p |
|---|---|---|---|---|---|
| (Intercept) | 101.429 | 96.451 | 106.407 | 40.060 | 0.000 |
| sex_Female | -9.072 | -18.630 | 0.486 | -1.866 | 0.063 |
| yrs.service | 0.748 | 0.529 | 0.967 | 6.711 | 0.000 |

Standard errors: OLS

Once we control for years of service, the estimated gender difference shrinks to \(-9.072\) and is no longer statistically significant at the 5% level (\(p = 0.063\)).

16.8.3.2 Categorical variables with more than two levels

WORK IN PROGRESS
