16.8 Dummy Variables

16.8.1 A Dummy Variable

An indicator variable, or dummy variable, is an input variable that represents qualitative data, such as gender or race. Dummy variables are sometimes referred to as binary variables because they usually take just two values, 1 or 0, with 1 generally representing the presence of a characteristic and 0 representing its absence.

For example, if we had a dummy variable called male, 1 would indicate that the individual is male and 0 would indicate that the individual is female (or non-male to be more precise). Using a 1 and 0 to represent the presence or absence of a particular characteristic is arbitrary, but as we will see, it is quite convenient. Generally, we can define a dummy variable D as

$$D = \begin{cases} 1 & \text{if characteristic is present} \\ 0 & \text{if characteristic is not present} \end{cases}$$
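As a minimal sketch of this definition, the snippet below turns a two-level categorical variable into a 0/1 dummy. The sample values and the helper name `make_dummy` are hypothetical and used only for illustration.

```python
# Hypothetical observations of a two-level categorical variable.
sex = ["Male", "Female", "Female", "Male", "Male"]


def make_dummy(values, present):
    """Return 1 where the value equals `present`, and 0 otherwise."""
    return [1 if v == present else 0 for v in values]


# Dummy variable "male": 1 if the individual is male, 0 otherwise.
male = make_dummy(sex, "Male")
print(male)  # [1, 0, 0, 1, 1]
```

Passing `"Female"` as the second argument would give the complementary coding, with males as the baseline.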

16.8.2 In the Model

Perhaps the most common use of dummy variables in a linear regression model is to modify the intercept parameter. Adding an indicator variable $D$ to the linear model, along with a new parameter $\beta_2$, gives us

$$y_i = \beta_0 + \beta_1 x + \beta_2 D + \epsilon_i$$

Suppose we had a regression model where the variable $D = \text{DEGREE}$ represented whether or not an individual obtained a college degree, with 1 indicating that he or she did, and 0 otherwise. In this model, our dependent variable, $y$, is income, along with another independent variable, $x = \text{EXPERIENCE}$, indicating how many years of work experience that individual has. Such a model can be represented as follows:

$$\text{INCOME}_i = \beta_0 + \beta_1 \text{EXPERIENCE} + \beta_2 \text{DEGREE} + \epsilon_i$$

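that is, for those with a college degree, the intercept of the model is $(\hat{\beta}_0 + \hat{\beta}_2)$, while the intercept of the model for those without a college degree is simply $\hat{\beta}_0$.

$$\widehat{\text{INCOME}} = \begin{cases} (\hat{\beta}_0 + \hat{\beta}_2) + \hat{\beta}_1 \text{EXPERIENCE}, & \text{if DEGREE} = 1 \\ \hat{\beta}_0 + \hat{\beta}_1 \text{EXPERIENCE}, & \text{if DEGREE} = 0 \end{cases}$$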
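The intercept shift can be illustrated with a short sketch. The coefficient values below are hypothetical, chosen only to show that the gap between the two fitted lines equals the DEGREE coefficient at every level of experience.

```python
# Hypothetical coefficient estimates, chosen only for illustration.
b0 = 30.0  # intercept for DEGREE = 0
b1 = 2.0   # slope on EXPERIENCE (shared by both groups)
b2 = 15.0  # intercept shift for DEGREE = 1


def predict_income(experience, degree):
    """Fitted income for given years of experience and DEGREE in {0, 1}."""
    return b0 + b1 * experience + b2 * degree


for years in (0, 5, 10):
    without_degree = predict_income(years, 0)
    with_degree = predict_income(years, 1)
    # The difference does not depend on `years`: the lines are parallel.
    print(years, without_degree, with_degree, with_degree - without_degree)
```

Because the dummy enters additively, the two groups share the same slope and differ only in their intercepts.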

16.8.3 An Example

We have information about the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members. The available data set has 397 observations on the following 6 variables:

Salaries for Professors: Variables

label           description
rank            a factor with levels AssocProf, AsstProf and Prof
discipline      a factor with levels A (“theoretical” departments) or B (“applied” departments)
yrs.since.phd   years since PhD
yrs.service     years of service
sex             a factor with levels Female and Male
salary          nine-month salary, in dollars

Source: carData package in R

Salaries for Professors: Data

rank        discipline   yrs.since.phd   yrs.service   sex    salary
Prof        B                       27            14   Male   147349
AsstProf    B                        2             0   Male    78000
AssocProf   B                       48            53   Male    90000
Prof        A                       33            31   Male   111350
Prof        A                       39            36   Male   117515
Prof        A                       49            43   Male    72300
Prof        A                       16            11   Male    88175
Prof        A                       23            23   Male   108200
Prof        A                       25            15   Male    95329
Prof        A                       30            26   Male   136660

Note: First 10 observations

16.8.3.1 Categorical variables with two levels

Suppose that we wish to investigate differences in salaries between males and females.

Based on the sex variable, we can create a new dummy variable, $D$, that takes the value:

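$$D = \begin{cases} 1 & \text{if a person is male} \\ 0 & \text{if a person is female} \end{cases}$$

and use this variable as a predictor in the regression model $Y_i = \beta_0 + \beta_1 D + \epsilon_i$, leading to the following mean salary for each group:

$$E(Y_i \mid D) = \begin{cases} \beta_0 + \beta_1, & \text{if person is male: } D = 1 \\ \beta_0, & \text{if person is female: } D = 0 \end{cases}$$

The coefficients can be interpreted as follows:

$$\begin{cases} \beta_0 + \beta_1 & \text{is the mean salary among males} \\ \beta_0 & \text{is the mean salary among females} \\ \beta_1 & \text{is the mean difference in salary between males and females} \end{cases}$$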
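This interpretation can be checked numerically. The sketch below fits a simple least-squares line with a single dummy regressor and compares the coefficients with the group means. The five salaries (in thousands of dollars) are hypothetical toy values, not the Salaries data, and `ols_simple` is an illustrative helper built on the usual closed-form formulas for simple regression.

```python
def ols_simple(x, y):
    """Closed-form least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data: salary in thousands, male = 1 / female = 0.
salary = [100.0, 120.0, 110.0, 90.0, 95.0]
male = [1, 1, 1, 0, 0]

b0, b1 = ols_simple(male, salary)

# Group means computed directly from the data.
male_salaries = [s for s, d in zip(salary, male) if d == 1]
female_salaries = [s for s, d in zip(salary, male) if d == 0]
mean_male = sum(male_salaries) / len(male_salaries)
mean_female = sum(female_salaries) / len(female_salaries)

print("intercept b0:", round(b0, 6), "| female mean:", mean_female)
print("b0 + b1:", round(b0 + b1, 6), "| male mean:", mean_male)
print("slope b1:", round(b1, 6), "| difference in means:", mean_male - mean_female)
```

With only a dummy on the right-hand side, the fitted intercept matches the baseline group's mean and the slope matches the difference between the two group means, up to floating-point rounding.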


Dependent variable: salary (in thousands of dollars)

Regressor      $\hat{\beta}$   Std. Error   t-stat   p-value
(Intercept)           101.00         4.81    21.00    <0.001
sex [Male]             14.09         5.06     2.78    0.0057

Observations            397
R2 / R2 adjusted        0.019 / 0.017


From the output above (where salary is divided by 1000), the mean salary for females is estimated to be 101.002, whereas the mean salary for males is estimated to be 101.002 + 14.088 = 115.090. The p-value for the dummy variable sexMale (0.0057) is below 0.05, suggesting that there is statistical evidence of a difference in mean salary between the genders.

The decision to code males as 1 and females as 0 (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.

Let’s see the output when D=1 if the person is female:


Dependent variable: salary (in thousands of dollars)

Regressor      $\hat{\beta}$   Std. Error   t-stat   p-value
(Intercept)           115.09         1.59    72.50    <0.001
sex [Female]          -14.09         5.06    -2.78    0.0057

Observations            397
R2 / R2 adjusted        0.019 / 0.017


Now the estimates for $\beta_0$ and $\beta_1$ are 115.090 and -14.088, respectively, leading once again to a predicted mean salary of 115.090 for males and an estimate of 115.090 - 14.088 = 101.002 for females.
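The link between the two codings follows from substituting $D' = 1 - D$ into the model: $\beta_0 + \beta_1 D = (\beta_0 + \beta_1) - \beta_1 D'$. The sketch below applies this identity to the estimates reported in the text (101.002 and 14.088, in thousands of dollars); the helper name `recode` is hypothetical.

```python
# Estimates reported in the text for the coding D = 1 if male.
b0_male_coded = 101.002
b1_male_coded = 14.088


def recode(b0, b1):
    """Coefficients after replacing the dummy D by 1 - D."""
    return b0 + b1, -b1


b0_female_coded, b1_female_coded = recode(b0_male_coded, b1_male_coded)
print(round(b0_female_coded, 3), round(b1_female_coded, 3))  # 115.09 -14.088

# Fitted group means are the same under either coding.
mean_male = b0_female_coded                      # D' = 0 for males
mean_female = b0_female_coded + b1_female_coded  # D' = 1 for females
print(round(mean_male, 3), round(mean_female, 3))  # 115.09 101.002
```

Only the labels attached to the coefficients change: the intercept becomes the mean of the new baseline group, and the slope flips sign.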


Adding yrs.service as a second regressor, with sex_Female as the dummy variable, gives the following output:

Observations          397
Dependent variable    salary000
Type                  OLS linear regression

F(2,394)              26.821
R²                    0.120
Adj. R²               0.115

                  Est.      2.5%     97.5%   t val.       p
(Intercept)    101.429    96.451   106.407   40.060   0.000
sex_Female      -9.072   -18.630     0.486   -1.866   0.063
yrs.service      0.748     0.529     0.967    6.711   0.000

Standard errors: OLS

Figure: Years of Service vs Salary (x-axis: Years of Service; y-axis: Salary (dollars); series: Is Female, Is Male)
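As a rough sketch of how the fitted model with both regressors is used, the snippet below plugs the reported estimates (101.429, -9.072 and 0.748, with salary in thousands of dollars) into the regression equation. The function name `predict_salary` and the choice of 18 years of service are illustrative assumptions.

```python
# Estimates reported in the output above (salary in thousands of dollars).
INTERCEPT = 101.429
B_FEMALE = -9.072   # coefficient on the sex_Female dummy
B_SERVICE = 0.748   # coefficient on yrs.service


def predict_salary(yrs_service, is_female):
    """Fitted salary (thousands) for years of service and a 0/1 female dummy."""
    return INTERCEPT + B_FEMALE * is_female + B_SERVICE * yrs_service


male_18 = predict_salary(18, 0)
female_18 = predict_salary(18, 1)
print(round(male_18, 3), round(female_18, 3))  # 114.893 105.821

# The gap between the two parallel lines equals the dummy coefficient.
print(round(female_18 - male_18, 3))  # -9.072
```

The dummy shifts the fitted line for females down by 9.072 at every level of service. The reported 95% interval for sex_Female, from -18.630 to 0.486, includes zero, and the p-value is 0.063.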

16.8.3.2 Categorical variables with more than two levels

WORK IN PROGRESS

References: