When do we use linear models and when do we use tree-based models? This is a common question in data science job interviews.
Either kind of algorithm can work. The choice depends on the type of business problem being solved, who the end user of the model is, and how they are going to consume the model’s output. Here are some key factors that will help you decide which model to use:
- If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will generally outperform a tree-based model. If the relationship is strongly non-linear, a tree model is usually the better choice, because making the relationship linear may require many complicated transformations of the independent variables.
- How do you check for linearity? Create a bivariate plot of the dependent variable against each independent variable and study the plots to see what kind of relationship exists between Y and the chosen X variable.
- Decision tree models typically need less data cleaning, since they are less sensitive to missing values and outliers than linear regression. This makes them easy and fast to develop, and easy to explain to customers as well.
- If the business problem calls for explaining the possible cause or path that leads to a value of the target variable, a tree is easier to explain. If the goal is to understand the nature of the relationship between the predictor variables and the target variable, linear regression is the better choice.
- Decision tree models are also easier to interpret from a layperson's point of view.
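
The first two points can be checked empirically. Below is a minimal sketch, assuming scikit-learn and NumPy are available. It fits both model types on synthetic data and compares held-out R² scores. The data, the `compare_models` helper, and the `max_depth=3` setting are illustrative assumptions, not part of the original discussion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, seed=0):
    """Fit a linear model and a shallow tree; return held-out R^2 for each.

    Hypothetical helper written for this illustration only.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    linear = LinearRegression().fit(X_train, y_train)
    # max_depth=3 is an arbitrary choice that keeps the tree easy to read.
    tree = DecisionTreeRegressor(max_depth=3, random_state=seed).fit(X_train, y_train)
    return {
        "linear": r2_score(y_test, linear.predict(X_test)),
        "tree": r2_score(y_test, tree.predict(X_test)),
    }


# Synthetic data (an assumption for illustration): one predictor, two targets.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
noise = rng.normal(0, 0.3, size=500)

y_linear = 2.0 * X[:, 0] + 1.0 + noise   # straight-line relationship
y_nonlinear = X[:, 0] ** 2 + noise       # U-shaped relationship

scores_linear = compare_models(X, y_linear)
scores_nonlinear = compare_models(X, y_nonlinear)

print("Linear target:    ", scores_linear)
print("Non-linear target:", scores_nonlinear)
```

On data like this, the linear model should score higher on the straight-line target and the tree on the U-shaped one. Real data is rarely this clean, so treat such a comparison as a sanity check alongside the bivariate plots, not as a replacement for them.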