16.8 Dummy Variables

16.8.1 A Dummy Variable

An indicator variable, or dummy variable, is an input variable that represents qualitative data, such as gender, race, etc. Dummy variables are often referred to as binary variables because they take just two values, \(1\) or \(0\), with \(1\) generally representing the presence of a characteristic and \(0\) representing its absence.

For example, if we had a dummy variable called male, \(1\) would indicate that the individual is male and \(0\) would indicate that the individual is female (or non-male to be more precise). Using a \(1\) and \(0\) to represent the presence or absence of a particular characteristic is arbitrary, but as we will see, it is quite convenient. Generally, we can define a dummy variable \(D\) as

\[ \begin{aligned} D = \begin{cases} 1 & \text{if characteristic is present} \\ 0 & \text{if characteristic is not present} \end{cases} \end{aligned} \]
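As a quick sketch, such a \(0\)/\(1\) encoding can be built directly from a qualitative variable. The labels and data below are hypothetical, purely for illustration:

```python
# Sketch: encoding a qualitative characteristic as a 0/1 dummy variable.
# The category labels below are invented for illustration.
degree = ["yes", "no", "yes", "yes", "no"]

# D = 1 if the characteristic (here, a college degree) is present, 0 otherwise
D = [1 if d == "yes" else 0 for d in degree]
print(D)  # [1, 0, 1, 1, 0]
```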

16.8.2 In the Model

Perhaps the most common use of dummy variables in a linear regression model is to modify the intercept parameter. Adding an indicator variable \(D\) to the linear model, along with a new parameter \(\beta_2\) gives us

\[ y_i = \beta_{0} + \beta_{1}x_i + \beta_2 D_i + \epsilon_i \] Suppose we had a regression model where the variable \(D = \text{DEGREE}\) represents whether or not an individual obtained a college degree, with \(1\) indicating that he or she did, and \(0\) otherwise. In this model, the dependent variable \(y\) is income, and another independent variable, \(x = \text{EXPERIENCE}\), indicates how many years of work experience the individual has. Such a model can be represented as follows:

\[ \text{INCOME}_i = \beta_{0} + \beta_{1}\text{EXPERIENCE}_i + \beta_2 \text{DEGREE}_i + \epsilon_i \]

That is, for those with a college degree, the intercept of the fitted model is \((\hat \beta_{0} + \hat \beta_2)\), while the intercept for those without a college degree is simply \(\hat \beta_{0}\):

\[ \begin{align} \hat{\text{INCOME}} = \begin{cases} (\hat \beta_{0} + \hat \beta_2) + \hat \beta_{1}\text{EXPERIENCE}, & \text{If DEGREE}=1 \\ \hat \beta_{0} + \hat \beta_{1}\text{EXPERIENCE}, & \text{If DEGREE} = 0 \end{cases} \end{align} \]
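The intercept shift implied by the two cases above can be sketched numerically. Everything in this example (variable names, coefficient values, the noise level) is invented for illustration, not taken from the text's data:

```python
import numpy as np

# Sketch: fitting y = b0 + b1*x + b2*D by OLS on synthetic data, to show
# that the dummy D shifts the intercept while the slope is shared.
rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)    # years of experience (continuous x)
degree = rng.integers(0, 2, n)        # dummy: 1 = has a college degree
# true model: intercept 30, slope 2 per year, degree premium 15 (all made up)
income = 30 + 2.0 * experience + 15 * degree + rng.normal(0, 1, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), experience, degree])
b0, b1, b2 = np.linalg.lstsq(X, income, rcond=None)[0]

print(round(b0 + b2, 1))  # estimated intercept for DEGREE = 1
print(round(b0, 1))       # estimated intercept for DEGREE = 0
```

Both groups share the estimated slope \(\hat\beta_1\); only the intercept differs, by \(\hat\beta_2\).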

16.8.3 An Example

We have information about the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members. The available data set has 397 observations on the following 6 variables:

Salaries for Professors: Variables

| label | description |
|---|---|
| rank | a factor with levels AssocProf, AsstProf and Prof |
| discipline | a factor with levels A ("theoretical" departments) or B ("applied" departments) |
| yrs.since.phd | years since PhD |
| yrs.service | years of service |
| sex | a factor with levels Female and Male |
| salary | nine-month salary, in dollars |

Source: carData package in R
Salaries for Professors: Data

| rank | discipline | yrs.since.phd | yrs.service | sex | salary |
|---|---|---|---|---|---|
| Prof | B | 27 | 14 | Male | 147349 |
| AsstProf | B | 2 | 0 | Male | 78000 |
| AssocProf | B | 48 | 53 | Male | 90000 |
| Prof | A | 33 | 31 | Male | 111350 |
| Prof | A | 39 | 36 | Male | 117515 |
| Prof | A | 49 | 43 | Male | 72300 |
| Prof | A | 16 | 11 | Male | 88175 |
| Prof | A | 23 | 23 | Male | 108200 |
| Prof | A | 25 | 15 | Male | 95329 |
| Prof | A | 30 | 26 | Male | 136660 |

Note: First 10 observations

16.8.3.1 Categorical variables with two levels

Suppose we wish to investigate differences in salary between males and females.

Based on the gender variable, we can create a new dummy variable, \(D\), that takes the value:

\[ \begin{align} D = \begin{cases} 1 & \text{if a person is male} \\ 0 & \text{if a person is female} \end{cases} \end{align} \] and use this variable as a predictor in the regression model \(Y_i = \beta_0 + \beta_1 D_i + \epsilon_i\), leading to:

\[ \begin{align} E(Y_i) = \begin{cases} \beta_{0} + \beta_1, & \text{if person is male: } D=1 \\ \beta_{0}, & \text{if person is female: } D=0 \end{cases} \end{align} \] The coefficients can be interpreted as follows:

\[ \begin{align} \begin{cases} \beta_{0} + \beta_1 & \text{is the mean salary among males}\\ \beta_{0} & \text{is the mean salary among females}\\ \beta_1 & \text{is the mean difference in salary between males and females} \end{cases} \end{align} \]
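This interpretation can be checked numerically: when the only regressor is a dummy, the OLS estimates are exactly the two group means and their difference. A minimal sketch with invented toy salaries (not the carData values):

```python
import numpy as np

# Sketch: with a single dummy regressor, OLS recovers the group means exactly.
# Salaries below are invented toy numbers (in $1000s), not the actual data.
salary = np.array([90.0, 95.0, 100.0, 110.0, 120.0, 130.0])
D = np.array([0, 0, 0, 1, 1, 1])  # 1 = male, 0 = female

X = np.column_stack([np.ones_like(D), D]).astype(float)
beta0, beta1 = np.linalg.lstsq(X, salary, rcond=None)[0]

print(round(beta0, 1))          # mean salary among females: 95.0
print(round(beta0 + beta1, 1))  # mean salary among males: 120.0
print(round(beta1, 1))          # mean difference: 25.0
```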


Dependent variable: salary (in $1000s)

| Regressor | \(\hat \beta\) | Std. Error | \(t_{stat}\) | \(p\)-value |
|---|---|---|---|---|
| (Intercept) | 101.00 | 4.81 | 21.00 | <0.001 |
| sex [Male] | 14.09 | 5.06 | 2.78 | 0.0057 |

Observations: 397
R² / R² adjusted: 0.019 / 0.017


From the output above (where salary is measured in thousands of dollars), the mean salary for females is estimated to be \(101.002\), whereas the mean salary for males is estimated to be \(101.002 + 14.088 = 115.090\). The \(p\)-value for the dummy variable \(\text{sexMale}\) is significant, suggesting statistical evidence of a difference in mean salary between the genders.

The decision to code males as \(1\) and females as \(0\) (baseline) is arbitrary, and has no effect on the regression computation, but does alter the interpretation of the coefficients.

Let’s see the output when \(D=1\) if the person is female:


Dependent variable: salary (in $1000s)

| Regressor | \(\hat \beta\) | Std. Error | \(t_{stat}\) | \(p\)-value |
|---|---|---|---|---|
| (Intercept) | 115.09 | 1.59 | 72.50 | <0.001 |
| sex [Female] | -14.09 | 5.06 | -2.78 | 0.0057 |

Observations: 397
R² / R² adjusted: 0.019 / 0.017


Now the estimates for \(\beta_0\) and \(\beta_1\) are \(115.090\) and \(-14.088\), respectively, leading once again to a predicted mean salary of \(115.090\) for males and an estimate of \(115.090 - 14.088 = 101.002\) for females.
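This coding flip can be sketched with the same invented toy numbers as before: reversing the dummy negates the slope and moves the intercept to the other group's mean, while the fitted values are unchanged.

```python
import numpy as np

# Sketch: flipping the dummy coding (female = 1 instead of male = 1) changes
# the coefficients but not the fit. Toy salaries (in $1000s), invented.
salary = np.array([90.0, 95.0, 100.0, 110.0, 120.0, 130.0])
male = np.array([0, 0, 0, 1, 1, 1])
female = 1 - male

def fit(d):
    X = np.column_stack([np.ones_like(d, dtype=float), d])
    return np.linalg.lstsq(X, salary, rcond=None)[0]

b0_m, b1_m = fit(male)    # intercept = female mean, slope = male - female
b0_f, b1_f = fit(female)  # intercept = male mean,   slope = female - male

print(round(b0_f, 1), round(b1_f, 1))  # 120.0 -25.0

# The two codings produce identical fitted values
print(np.allclose(b0_m + b1_m * male, b0_f + b1_f * female))  # True
```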

Extending the model with years of service as a second regressor (keeping the dummy coded as \(1\) for females) gives:

Dependent variable: salary000 (OLS linear regression)
Observations: 397; F(2, 394) = 26.821; R² = 0.120; Adj. R² = 0.115

| Regressor | Est. | 2.5% | 97.5% | t val. | p |
|---|---|---|---|---|---|
| (Intercept) | 101.429 | 96.451 | 106.407 | 40.060 | 0.000 |
| sex_Female | -9.072 | -18.630 | 0.486 | -1.866 | 0.063 |
| yrs.service | 0.748 | 0.529 | 0.967 | 6.711 | 0.000 |

Standard errors: OLS

Once we control for years of service, the estimated gender difference shrinks to \(-9.072\) and is no longer statistically significant at the 5% level (\(p = 0.063\)).

16.8.3.2 Categorical variables with more than two levels

WORK IN PROGRESS
