TY - JOUR
T1 - A novel approach for standardizing clinical laboratory categorical test results using machine learning and string distance similarity
AU - Ahmmed, Syed
AU - Mondal, M. Rubaiyat Hossain
AU - Mia, Md Raihan
AU - Adibuzzaman, Mohammad
AU - Hoque, Abu Sayed Md Latiful
AU - Ahamed, Sheikh Iqbal
N1 - Publisher Copyright: © 2023 The Author(s)
PY - 2023/11
Y1 - 2023/11
N2 - Standardizing clinical laboratory test results is critical for conducting clinical data science research and analysis. However, tools and guidelines for standardized data processing remain inadequate. This paper proposes a novel approach for standardizing categorical test results based on supervised machine learning and the Jaro-Winkler similarity algorithm. The supervised machine learning model categorizes test results into predefined groups or clusters at scale, while Jaro-Winkler similarity maps text terms to standard clinical terms within the corresponding group. The proposed method is applied to 75,062 test results from two private hospitals in Bangladesh. When categorizing test results, the Support Vector Classification algorithm with a linear kernel achieves a classification accuracy of 98%, outperforming the Random Forest algorithm. The experimental results show that Jaro-Winkler similarity achieves a 99.93% success rate in standardizing test results for the majority of groups under manual validation. The proposed method outperforms previous studies, which standardized test results using rule-based classifiers on a smaller number of groups and similarity measures such as cosine similarity or Levenshtein distance. Furthermore, the approach also performs well when applied to the publicly available MIMIC-III dataset. These findings suggest that the proposed standardization technique can benefit clinical big data research, particularly for national clinical research data hubs in low- and middle-income countries.
AB - Standardizing clinical laboratory test results is critical for conducting clinical data science research and analysis. However, tools and guidelines for standardized data processing remain inadequate. This paper proposes a novel approach for standardizing categorical test results based on supervised machine learning and the Jaro-Winkler similarity algorithm. The supervised machine learning model categorizes test results into predefined groups or clusters at scale, while Jaro-Winkler similarity maps text terms to standard clinical terms within the corresponding group. The proposed method is applied to 75,062 test results from two private hospitals in Bangladesh. When categorizing test results, the Support Vector Classification algorithm with a linear kernel achieves a classification accuracy of 98%, outperforming the Random Forest algorithm. The experimental results show that Jaro-Winkler similarity achieves a 99.93% success rate in standardizing test results for the majority of groups under manual validation. The proposed method outperforms previous studies, which standardized test results using rule-based classifiers on a smaller number of groups and similarity measures such as cosine similarity or Levenshtein distance. Furthermore, the approach also performs well when applied to the publicly available MIMIC-III dataset. These findings suggest that the proposed standardization technique can benefit clinical big data research, particularly for national clinical research data hubs in low- and middle-income countries.
KW - Data quality
KW - Data science
KW - Electronic health records
KW - LOINC
KW - Machine learning
KW - SNOMED CT
KW - Standardization
KW - String distance similarity
UR - http://www.scopus.com/inward/record.url?scp=85178236818&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85178236818&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2023.e21523
DO - 10.1016/j.heliyon.2023.e21523
M3 - Article
AN - SCOPUS:85178236818
SN - 2405-8440
VL - 9
JO - Heliyon
JF - Heliyon
IS - 11
M1 - e21523
ER -