Correlation and Significance

How do you make a preliminary choice of the variables to use for training? This tutorial shows how to interpret the traffic light analysis in Arkangel AI.

Choosing the supporting variables before training is one of the most important steps in building your algorithm.

The traffic light bars show the degree to which each feature is correlated with the target. The calculation can detect non-linear relationships with the target but, because it is univariate, it cannot detect interaction effects between features. It uses Phi-k correlation and significance, both computed with an algorithm that measures the information content of the variable, independently for each feature in the dataset.
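To make the univariate limitation concrete, here is a minimal sketch. It uses the plain Pearson chi-squared statistic as a stand-in association score, which is an assumption for illustration and not Arkangel AI's actual calculation. The data is hypothetical: the target is the XOR of two binary features, so each feature looks unrelated to the target on its own, while the two taken together determine it completely.

```python
from collections import Counter
from itertools import product


def chi_squared(xs, ys):
    """Pearson chi-squared statistic for two equally long categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x, y in product(x_totals, y_totals):
        expected = x_totals[x] * y_totals[y] / n
        # Counter returns 0 for combinations that never occur
        stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat


# Hypothetical dataset: the target is the XOR of two binary features
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 25
feature_a = [a for a, _ in rows]
feature_b = [b for _, b in rows]
target = [a ^ b for a, b in rows]
combined = [f"{a}{b}" for a, b in rows]  # both features viewed together

chi_a = chi_squared(feature_a, target)
chi_b = chi_squared(feature_b, target)
chi_combined = chi_squared(combined, target)

print("feature_a alone:", chi_a)
print("feature_b alone:", chi_b)
print("both together:  ", chi_combined)
```

Each feature scores zero on its own, yet the combined feature is strongly associated with the target. A feature with a weak univariate score can therefore still be useful in combination with others.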

Tip: As you iterate with different configurations you might want to remove features that are unrelated to the target.

What is correlation?

Correlation is a statistical measure that expresses the extent to which two variables are related, that is, how much one changes in relation to the other. It identifies a pattern but does not establish a cause-and-effect relationship.

Phi-k correlation coefficient

The Phi-k correlation coefficient works consistently across categorical, ordinal, and interval variables. It is derived from Pearson’s chi-squared contingency test. Its values are bounded in the range [0, 1] (0% to 100%):

Values close to 0% represent no association.
Values close to 100% represent complete association.

What is significance?

Significance quantifies whether an observed relationship between variables is likely to be caused by something other than chance. It provides evidence that a correlation exists between the variables.

Phi-k significance

Phi-k significance is obtained with a hybrid approach: a G-test is used and the result is expressed as a one-sided Z-score, which is then transformed into a p-value.

This value is interpreted as a hypothesis test score: the null hypothesis is that the variables are independent, so any observed correlation is due to chance, while the alternative hypothesis is that the variables are related.
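The chain from G-test to one-sided Z to p-value can be sketched with the Python standard library. This is a simplified illustration and not the hybrid procedure itself: it assumes a 2x2 contingency table, so the G statistic is compared against a chi-squared distribution with 1 degree of freedom, and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def g_statistic(table):
    """G-test (log-likelihood ratio) statistic for a contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += 2.0 * observed * math.log(observed / expected)
    return g


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical 2x2 counts: rows are feature categories, columns are target classes
table = [[30, 20], [20, 30]]

g = g_statistic(table)
p_value = chi2_sf_1dof(g)                      # valid for 2x2 tables only
z_score = NormalDist().inv_cdf(1.0 - p_value)  # one-sided Z with the same tail area
p_from_z = 1.0 - NormalDist().cdf(z_score)     # Z mapped back to a p-value

print(f"G = {g:.3f}, Z = {z_score:.3f}, p = {p_from_z:.4f}")
```

The Z-score and the p-value carry the same information: a larger Z corresponds to a smaller p-value, and therefore to stronger evidence against independence.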

A p-value lower than 0.025 is generally considered statistically significant; the lower the p-value, the stronger the evidence that the variable is related to the target.

For p-values close to 1, there is no evidence of a statistically significant relationship.
P-values below 0.025 are statistically significant.
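As a small sketch of how this threshold could be applied when deciding which features to keep, the snippet below splits features by p-value. The function name, feature names, and p-values are all hypothetical and invented for illustration; only the 0.025 cutoff comes from this section.

```python
SIGNIFICANCE_THRESHOLD = 0.025  # cutoff quoted in this section


def split_by_significance(p_values, threshold=SIGNIFICANCE_THRESHOLD):
    """Split {feature: p_value} into (significant, not_significant) dicts."""
    significant = {}
    not_significant = {}
    for feature, p_value in p_values.items():
        if p_value < threshold:
            significant[feature] = p_value
        else:
            not_significant[feature] = p_value
    return significant, not_significant


# Hypothetical p-values, invented purely for illustration
example_p_values = {"age": 0.001, "blood_pressure": 0.020, "record_id": 0.850}

keep, review = split_by_significance(example_p_values)
print("significant:", sorted(keep))
print("candidates to remove:", sorted(review))
```

Features in the second group are candidates for removal rather than automatic rejections, since a univariate test cannot see interaction effects between features.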


