TY - JOUR
T1 - Duplicate product record detection engine for e-commerce platforms
AU - Albayrak, Osman Semih
AU - Aytekin, Tevfik
AU - Kalaycı, Tolga Ahmet
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2022/5/1
Y1 - 2022/5/1
N2 - Maintaining a clean product catalog that complies with industry standards is one of the primary concerns of e-commerce companies. Integrating product data from multiple providers confronts these companies with a challenging issue: duplicate product records. Because a product can be described with a variety of different words, images, and attributes, detecting duplicate product records is a difficult task. In this work, a novel duplicate record detection engine is proposed for an e-commerce company, Hepsiburada. The engine is developed based on a real-world dataset. To build a training set, we use text similarity and domain-specific distance metrics to generate candidate duplicate product pairs, which are then labeled by human experts. We performed extensive feature engineering and applied state-of-the-art classification models to determine whether any two products are duplicates. The experimental results show that our engine detects duplicate product records with high precision and outperforms non-adaptive methodologies in accuracy.
AB - Maintaining a clean product catalog that complies with industry standards is one of the primary concerns of e-commerce companies. Integrating product data from multiple providers confronts these companies with a challenging issue: duplicate product records. Because a product can be described with a variety of different words, images, and attributes, detecting duplicate product records is a difficult task. In this work, a novel duplicate record detection engine is proposed for an e-commerce company, Hepsiburada. The engine is developed based on a real-world dataset. To build a training set, we use text similarity and domain-specific distance metrics to generate candidate duplicate product pairs, which are then labeled by human experts. We performed extensive feature engineering and applied state-of-the-art classification models to determine whether any two products are duplicates. The experimental results show that our engine detects duplicate product records with high precision and outperforms non-adaptive methodologies in accuracy.
KW - Classification
KW - Duplicate record detection
KW - Feature engineering
KW - Text similarity
UR - http://www.scopus.com/inward/record.url?scp=85122988788&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2021.116420
DO - 10.1016/j.eswa.2021.116420
M3 - Article
AN - SCOPUS:85122988788
SN - 0957-4174
VL - 193
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 116420
ER -