TY - GEN
T1 - Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy
AU - Niño-Adan, Iratxe
AU - Landa-Torres, Itziar
AU - Portillo, Eva
AU - Manjarres, Diana
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Normalization methods are widely employed for transforming the variables or features of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM) and Median Absolute Deviation (MAD), are studied in different synthetic datasets from the UCI repository. An exhaustive analysis of the transformed features’ ranges and their influence on the Euclidean distance is performed, concluding that knowledge about the group structure captured by each feature is needed to select the best normalization method for a given dataset. In order to effectively capture the features’ importance and adjust their contribution, this paper proposes a two-stage methodology for normalization and supervised feature weighting based on the Pearson correlation coefficient and on a Random Forest Feature Importance estimation method. Simulations on five different datasets reveal that, in terms of accuracy, our proposed two-stage methodology outperforms or at least matches the K-means performance obtained when only normalization is applied.
KW - K-means
KW - Normalization
KW - Pearson correlation
KW - Random Forest
KW - Standardization
KW - Weighted Euclidean Distance
UR - http://www.scopus.com/inward/record.url?scp=85065923886&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-20055-8_2
DO - 10.1007/978-3-030-20055-8_2
M3 - Conference contribution
AN - SCOPUS:85065923886
SN - 9783030200541
T3 - Advances in Intelligent Systems and Computing
SP - 14
EP - 24
BT - 14th International Conference on Soft Computing Models in Industrial and Environmental Applications SOCO 2019, Proceedings
A2 - Martínez Álvarez, Francisco
A2 - Troncoso Lora, Alicia
A2 - Quintián, Héctor
A2 - Sáez Muñoz, José António
A2 - Corchado, Emilio
PB - Springer Verlag
T2 - 14th International Conference on Soft Computing Models in Industrial and Environmental Applications, SOCO 2019
Y2 - 13 May 2019 through 15 May 2019
ER -