Immanuel Kant, one of the most influential philosophers, combined the rationalist and empiricist perspectives to advance the way humankind understood the world. Rationalists emphasized the power of the human mind to reach objective truth, while empiricists focused on experiments to test a hypothesis. Kant is known for bringing the two together from a philosophical standpoint. Similarly, econometrics cannot be studied in isolation; it is an integral tool for understanding any discipline with substantial evidence. We have come a long way in making the tool more dynamic and applying it across disciplines. This series covers a few important concepts of econometrics that remain fundamental to any decision-making process. A sound econometric approach simply reiterates the belief in science and logic. While economics is the heart, I believe econometrics is the skeleton: it provides a framework for understanding ideas and theories better.

The application of regression has become integral to modern-day data analytics and decision making. To know when to use regression analysis, we must first understand exactly what it does. Here is a simple answer that pops up when you Google its use:

*Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.* Don’t worry if any of these terms sound unfamiliar. This article will help you understand their meaning, application and interpretation.

## Data science is ideologically neutral

A progressive society holds ideological debates on many things, and a data-driven framework helps us turn those ideas into meaningful policy decisions. To illustrate, there is an ongoing debate about the gender pay gap: a recent study shows that women get paid 34% less than men. But this number is a raw, unadjusted comparison, and it needs further inquiry to identify the factors that actually drive the difference in pay. Although the categories being compared are defined by gender, we know that price setting in the labour market depends on many other things. Concluding from this figure alone that the difference in pay is due only to gender would oversimplify the problem, and econometricians would not recommend it.

As a neutral party, we want to understand whether gender plays a role in determining an individual’s pay. If it does, the next question is the extent to which gender affects compensation. The first step is to define the scope of our model, since in practice numerous factors could affect a phenomenon. To perform a robust analysis, we need to restrict our equation to the most prominent factors that influence compensation.

In this example, data is collected on the education, work experience and gender of the individuals concerned. These are the *independent variables*. We are interested in finding out how far the salary received by individuals is *dependent* on these independent variables. Remember that education, work experience and gender should not be correlated with one another. There are multiple statistical tests that can be used to check for correlation between variables. In the following graph, there is an apparent correlation between salary and coffee consumption. An individual’s coffee consumption may well move in the same direction as their salary. However, that does not mean that an increase in salary has led to an increase in coffee consumption. Confusing correlation with causation gives a misleading picture, which we want to avoid in our analysis. Correlation does not imply causation.

In a similar vein, the independent variables in the model must not be correlated with one another. Education and work experience in our data set should ideally not exhibit any statistical relationship (for example, moving in the same direction), because this would create noise when estimating the causal effect. As researchers, we are interested in the causal relationship between the independent variables (education, work experience and gender) and the dependent variable (compensation).
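One simple way to run such a check is to compute the pairwise Pearson correlation between the independent variables. The sketch below is only an illustration: the sample values and the `pearson` helper are made up for this example, and a real study would use its own data and possibly more formal tests.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical sample: years of education, years of experience, gender code.
education = [12, 16, 14, 18, 12, 16, 14, 18]
experience = [10, 2, 4, 6, 3, 9, 7, 5]
gender = [0, 1, 0, 1, 1, 0, 1, 0]

regressors = {"education": education, "experience": experience, "gender": gender}
names = list(regressors)
correlations = {}
# Compare every pair of independent variables exactly once.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        correlations[(a, b)] = pearson(regressors[a], regressors[b])
        print(f"corr({a}, {b}) = {correlations[(a, b)]:+.2f}")
```

Values close to zero suggest the variables do not move together; values close to +1 or -1 are a warning sign that the model may struggle to separate their individual effects.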

We can use the data only when these conditions are satisfied. A model is then constructed to find out the extent to which education, work experience and gender affect the compensation received by individuals. The sample also plays a crucial role in determining our results: if you collect data only from a particular region, the results are limited to that region. Generalizing them beyond it carries the risk of a higher standard error. In effect, the analysis has limits on how universally it can be applied.

Now we build a regression model with the data available. How does an equation help us get a deeper understanding of the issue at hand? It is the coefficients that help us understand the impacts better.

The dependent variable gets its value from the independent ones. To put this mathematically:

Compensation Package = Base Value + Education + Experience + Gender

What regression does is attach values that show the impact of each of these variables on the dependent variable (compensation package). So our result would look like:

$$\text{Compensation package } (Y) = \beta_0 + \beta_1(\text{Education}) + \beta_2(\text{Experience}) + \beta_3(\text{Gender}) + U_i \;(\text{error term})$$

where the **β** values describe the impact each variable has on Y, i.e. your compensation package.

It is important that we interpret the β values correctly, so as to understand our data and the bigger picture better. Here’s the general way of reading the results:

For a unit change in an independent variable, with the others held fixed, Y changes by that variable’s β units. So, in our example, given that education and experience are fixed, a one-unit change in the gender variable changes compensation by β₃ units.
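To make the reading of the β values concrete, here is a minimal sketch that estimates them by ordinary least squares using only the Python standard library. Everything in it is an assumption made for illustration: the data are invented, the salaries are generated without noise from made-up coefficients (20, 2, 1.5 and 3), gender uses only the codes 0 and 1, and the helper names `solve_linear_system` and `fit_ols` are hypothetical. In practice one would normally rely on a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, y):
    """Ordinary least squares: solve the normal equations (X'X) beta = X'y."""
    x = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (education years, experience years, gender code 0/1).
people = [(12, 10, 0), (16, 2, 1), (14, 4, 0), (18, 6, 1),
          (12, 3, 1), (16, 9, 0), (14, 7, 1), (18, 5, 0)]
# Salaries built from made-up coefficients with no noise, so the fit
# should recover them: 20 + 2*education + 1.5*experience + 3*gender.
salary = [20 + 2 * e + 1.5 * x + 3 * g for e, x, g in people]

b0, b1, b2, b3 = fit_ols(people, salary)
print(f"b0={b0:.2f}  b1={b1:.2f}  b2={b2:.2f}  b3={b3:.2f}")
```

Because the invented salaries contain no noise, the estimates should come out very close to the coefficients used to generate them. Here `b3` would be read as the difference in compensation associated with the gender variable when education and experience are held fixed.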

To verify this and take a stance, we can plug in the same values for education and experience and see whether people with the same education and experience are paid differently. The only input that differs is gender. Since we cannot feed words like “Male”, “Female” and “Other” into our model, we assign numerical values to them.

| Female | Male | Other |
| --- | --- | --- |
| 0 | 1 | 2 |
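The coding above can be applied with a simple lookup. The sketch below also shows a common alternative that is not part of the coding above: a single 0/1/2 code treats the categories as evenly spaced steps on one scale, whereas dummy (indicator) variables give each non-baseline category its own coefficient. The function names and the choice of "Female" as the baseline are assumptions made for illustration.

```python
# Numeric codes for the gender categories, as assigned above.
GENDER_CODES = {"Female": 0, "Male": 1, "Other": 2}

def encode_gender(label):
    """Return the single numeric code for a gender label."""
    return GENDER_CODES[label]

def dummy_encode_gender(label, baseline="Female"):
    """Return one 0/1 indicator per non-baseline category.

    Alternative to the single code: each category gets its own
    coefficient, measured relative to the baseline category.
    """
    if label not in GENDER_CODES:
        raise KeyError(label)
    return [1 if label == category else 0
            for category in GENDER_CODES if category != baseline]

labels = ["Female", "Male", "Other", "Male"]
print([encode_gender(g) for g in labels])        # [0, 1, 2, 1]
print([dummy_encode_gender(g) for g in labels])  # [[0, 0], [1, 0], [0, 1], [1, 0]]
```

With the dummy version, the regression would carry one β for "Male" and one for "Other", each read as the difference from the baseline category.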

If the output Y (compensation package) differs for the same education and experience, we can say that gender plays a role in determining an individual’s compensation package. It is also important to note the sign of the coefficient. A + sign denotes a positive relation, i.e. as the variable increases by a unit, Y moves in the positive direction; a − sign denotes the opposite. The standard error of the regression measures how wrong the model is on average, in the units of the response variable. Estimation typically works by minimizing the error term (for ordinary least squares, the sum of squared errors) to obtain a better-fitting model.
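The comparison and the standard error described here can be sketched in a few lines. The coefficients in `BETA` are assumed, illustrative values standing in for an already fitted model, and the observations are invented; none of these numbers come from real data.

```python
from math import sqrt

# Hypothetical fitted coefficients (b0, education, experience, gender).
# Illustrative numbers only, not estimates from real data.
BETA = (20.0, 2.0, 1.5, 3.0)

def predict(education, experience, gender, beta=BETA):
    """Predicted compensation from the fitted linear equation."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * education + b2 * experience + b3 * gender

# Same education and experience; only the gender code differs.
gap = predict(16, 5, gender=1) - predict(16, 5, gender=0)
print(f"Predicted gap for identical education and experience: {gap:+.2f}")

# Hypothetical observations: (education, experience, gender, actual salary).
observed = [(12, 10, 0, 60.0), (16, 2, 1, 57.0), (14, 4, 0, 55.5),
            (18, 6, 1, 67.0), (12, 3, 1, 52.0), (16, 9, 0, 64.5)]

residuals = [actual - predict(e, x, g) for e, x, g, actual in observed]
n, k = len(observed), len(BETA)
# Standard error of the regression: typical size of a residual, in salary
# units, using n - k degrees of freedom (k estimated coefficients).
ser = sqrt(sum(r * r for r in residuals) / (n - k))
print(f"Standard error of the regression: {ser:.2f}")
```

The sign of `gap` matches the sign of the gender coefficient, and the standard error is expressed in the same units as compensation, which is what makes it easy to interpret.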

Moving on to the second part of the Google result above, which reads, “*If the dependent variable is dichotomous, then logistic regression should be used.*” To understand this, imagine that you are a bank and want to analyse which applicants are likely to default on loan repayments. A credit score already exists, right? So what are we talking about now? The credit score is one of many parameters taken into account to determine an applicant’s creditworthiness. Analysts need a model that simply highlights how likely an individual is to default. In this model there are two outcomes, 0 and 1: zero indicates no default, and one indicates a default. This is the *dichotomous* dependent variable that the result talks about.

A *logistic regression* model is used when the outcome in the data takes binary values, such as yes or no. The model tells us how likely an individual is to default, given the parameters. Based on the results, a bank could then take policy decisions such as setting a credit score floor. The type of regression model is chosen based on what the researcher wants to achieve, and the results are interpreted accordingly. Logistic regression is used to classify items, and is hence often known as a ‘classifier model’.
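As a rough illustration of the bank example, the sketch below fits a one-variable logistic regression by plain gradient descent, using only the Python standard library. The applicant data, the rescaling of the credit score and the function names are all assumptions made for this example; a production model would use more variables and a dedicated library.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient descent on the log-loss."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical applicants: credit score and whether they defaulted (1) or not (0).
scores = [520, 560, 600, 620, 650, 680, 700, 730, 760, 800]
defaults = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rescale scores so gradient descent behaves well (centre at 650, units of 100).
scaled = [(s - 650) / 100 for s in scores]
b0, b1 = fit_logistic(scaled, defaults)

def default_probability(score):
    """Estimated probability of default for a given credit score."""
    return sigmoid(b0 + b1 * (score - 650) / 100)

for s in (550, 650, 750):
    print(f"score {s}: P(default) = {default_probability(s):.2f}")
```

A negative `b1` means that a higher credit score lowers the estimated probability of default. Turning the probability into a yes/no classification requires a cut-off (0.5 is a common default), and choosing that cut-off is the kind of policy decision described above.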

There are various other models that can be adopted depending on the type of data and the nature of the study. Data science has spread into other disciplines and progressed significantly in recent times. A conceptual understanding of regression analysis remains the foundation for moving on to applications of big data, machine learning and artificial intelligence. Our upcoming series of articles will discuss types of data, methods of regression analysis, different econometric models and their applications.
