What is a Dummy variable?
A dummy variable, or indicator variable, is an artificial variable created to represent an attribute with two or more distinct categories (levels). In other words, dummy variables are the binary (0/1) variables derived from a categorical variable that has multiple levels.
Why is it used?
Regression analysis treats all independent (X) variables in the analysis as numerical. Numerical variables are interval or ratio scale variables whose values are directly comparable, e.g. ‘10 is twice as much as 5’, or ‘3 minus 1 equals 2’. Often, however, you might want to include an attribute or nominal scale variable such as ‘Product Brand’ or ‘Type of Defect’ in your study. Say you have three types of defects, numbered ‘1’, ‘2’ and ‘3’. In this case, ‘3 minus 1’ doesn’t mean anything: you can’t subtract defect 1 from defect 3. The numbers here only identify the levels of ‘Defect Type’ and have no intrinsic meaning of their own. Dummy variables are created in this situation to ‘trick’ the regression algorithm into correctly analysing attribute variables.
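To make the idea concrete, here is a minimal sketch in plain Python that recodes the ‘Defect Type’ example into 0/1 columns. The function name `dummy_code`, the column names and the sample data are all made up for illustration; one level is left uncoded and serves as the baseline, so a row belonging to that level has 0 in every column.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per level found in `values`, except `reference`.

    Hypothetical helper written for this example; not a library function.
    """
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    return {
        f"defect_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Made-up observations: the codes 1, 2, 3 are labels, not quantities.
defect_type = [1, 3, 2, 1, 3]

# Leave defect type 1 uncoded so it acts as the baseline.
dummies = dummy_code(defect_type, reference=1)

for name, column in dummies.items():
    print(name, column)
```

Each resulting column answers a yes/no question (‘is this row defect type 2?’), which is something a regression can treat as a number without implying that one defect type is ‘larger’ than another.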
Things to keep in mind about dummy variables:
- Dummy variables use the numbers ‘0’ and ‘1’ to indicate membership in one of a set of mutually exclusive and exhaustive categories.
- The number of dummy variables necessary to represent a single attribute variable is equal to the number of levels (categories) in that variable minus one.
- For a given attribute variable, none of the dummy variables constructed can be redundant. That is, one dummy variable cannot be a constant multiple or a simple linear relation of another.
- The interaction of two attribute variables (e.g. Gender and Marital Status) is represented by a third dummy variable which is simply the product of the two individual dummy variables.
- The decision as to which level is not coded is often arbitrary. The level which is not coded is the reference category to which all other categories will be compared. As such, the biggest group is often chosen as the not-coded category.
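The points above can be sketched in a few lines of plain Python. The data, the helper names `largest_level` and `indicator`, and the choice of Gender and Marital Status levels are assumptions made for illustration only. The sketch picks the biggest group of each attribute as the not-coded level, builds the remaining dummy variables, forms the interaction as their product, and shows why coding every level would be redundant.

```python
from collections import Counter


def largest_level(values):
    """Return the most frequent level (hypothetical helper)."""
    return Counter(values).most_common(1)[0][0]


def indicator(values, level):
    """Return a 0/1 column marking membership in `level` (hypothetical helper)."""
    return [1 if v == level else 0 for v in values]


# Made-up sample of six respondents.
gender = ["F", "M", "F", "F", "M", "F"]
marital = ["single", "married", "married", "married", "single", "married"]

# The biggest group of each attribute becomes the not-coded (reference) level.
ref_gender = largest_level(gender)
ref_marital = largest_level(marital)

# Two levels each, so one dummy variable per attribute (levels minus one).
is_male = indicator(gender, "M")
is_single = indicator(marital, "single")

# Interaction of the two attributes: element-wise product of the two dummies.
male_and_single = [g * m for g, m in zip(is_male, is_single)]

# Redundancy check: if the reference level were coded too, the two gender
# columns would always sum to 1, so one is fully determined by the other.
is_female = indicator(gender, "F")
always_sums_to_one = all(f + m == 1 for f, m in zip(is_female, is_male))

print("reference levels:", ref_gender, ref_marital)
print("is_male:         ", is_male)
print("is_single:       ", is_single)
print("male_and_single: ", male_and_single)
print("redundant if all levels coded:", always_sums_to_one)
```

The interaction column is 1 only for rows that are both male and single, and the last check illustrates why one level per attribute is left out: a full set of indicators carries no information beyond the other columns.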