Feature Selection: A/B Test With Tableau

In a data science project, the data must be prepared before it is analyzed or used to train a model that generalizes from it. This is, in fact, the phase that demands the most time, usually between 60% and 80% of the project [1].

When the objective is to build a robust model that makes accurate predictions, selecting the significant variables is very important. The more the training dataset is restricted to the most significant variables, the better the model generalizes from them.

Conversely, the more insignificant features the training dataset contains, the more noise is introduced into the model, which lowers its final performance (in computing, this phenomenon is known as the “garbage in, garbage out” principle).

So, how do we tell the significant dataset variables from the insignificant ones? There are different feature selection techniques, and which one to use depends on the project scope and the expertise of the data scientist in charge.

In this post, we use the visualization tool Tableau and the popular Titanic dataset to apply a kind of A/B test and select the passenger characteristics that have the most influence on the survival rate.

The Idea Behind A/B Test Feature Selection

The main idea of the method I will explain is to compare the survival rate observed for each value of a dataset variable against the global survival rate (survival being the target variable of the prediction).

Global Survival Rate vs Sex

Thus, for example, we see that the reference line is a 41% chance of surviving the accident. The ‘Sex’ attribute shifts this reference line in favor of women, which makes a lot of sense since the protocol for this type of accident is “women and children first”. From the test, we obtain that, on gender alone, women increase their chances of survival by 25% compared to men.

The ‘Sex’ attribute is, therefore, a significant variable for the modeling process and should be taken into account. Selecting the variables consists of repeating the same analysis for the rest of the features in the dataset. If a feature shows patterns that raise or lower that 41% reference line, then that feature is considered significant for the model.
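The comparison against the reference line can also be sketched outside Tableau. The following is a minimal Python sketch, not the Tableau workflow used in this post: the rows are invented toy data (not the real Titanic dataset), and the helper names survival_rate, rate_by_feature and lift_over_baseline are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Toy rows invented for illustration only; this is NOT the real Titanic data.
passengers = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 2, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 1, "Survived": 1},
    {"Sex": "male", "Pclass": 2, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
]


def survival_rate(rows):
    """Fraction of rows with Survived == 1 (the reference line)."""
    return sum(r["Survived"] for r in rows) / len(rows)


def rate_by_feature(rows, feature):
    """Survival rate computed separately for each value of one feature."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r)
    return {value: survival_rate(group) for value, group in groups.items()}


def lift_over_baseline(rows, feature):
    """Difference between each group's rate and the global rate."""
    baseline = survival_rate(rows)
    return {
        value: rate - baseline
        for value, rate in rate_by_feature(rows, feature).items()
    }


if __name__ == "__main__":
    print("global rate:", survival_rate(passengers))
    for feature in ("Sex", "Pclass"):
        print(feature, lift_over_baseline(passengers, feature))
```

A feature whose groups all sit close to zero in this output would leave the reference line unchanged; groups far from zero are the ones the A/B comparison flags. With small groups the differences can be due to chance, so group sizes are worth checking alongside the rates.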

The scientific literature says that a feature is relevant to a target if there exists an observation for which changing only that feature changes the final target value [2].
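This definition can be approximated on a sample by looking for two observations that agree on every other feature, differ on the feature in question, and have different target values. The sketch below is one possible reading of the definition, under the assumption that the data are noise-free; the function name is_relevant and the toy rows are hypothetical and do not come from the Titanic dataset.

```python
from itertools import combinations

# Toy rows invented for illustration only; this is NOT the real Titanic data.
sample = [
    {"Sex": "female", "Embarked": "S", "Survived": 1},
    {"Sex": "male", "Embarked": "S", "Survived": 0},
    {"Sex": "female", "Embarked": "C", "Survived": 1},
    {"Sex": "male", "Embarked": "C", "Survived": 0},
]


def is_relevant(rows, feature, target):
    """True if two rows differ only in `feature` (among the non-target
    columns) and also differ in `target`."""
    others = [k for k in rows[0] if k not in (feature, target)]
    # Group rows that agree on all the other features.
    groups = {}
    for r in rows:
        key = tuple(r[k] for k in others)
        groups.setdefault(key, set()).add((r[feature], r[target]))
    # Within a group, look for a change in the feature that flips the target.
    for pairs in groups.values():
        for (f1, t1), (f2, t2) in combinations(pairs, 2):
            if f1 != f2 and t1 != t2:
                return True
    return False


if __name__ == "__main__":
    print("Sex relevant:", is_relevant(sample, "Sex", "Survived"))
    print("Embarked relevant:", is_relevant(sample, "Embarked", "Survived"))
```

On real, noisy data a single such pair proves little, which is why the A/B comparison in this post looks at rates over whole groups rather than at individual observations.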