Automatic identification of variables in epidemiological datasets using logic regression

Matthias W. Lorenz; Negin Ashtiani Abdi; Frank Scheckenbach; Anja Pflug; Alpaslan Bülbül; Alberico L. Catapano; Stefan Agewall; Marat Ezhov; Michiel L. Bots; Stefan Kiechl; Andreas Orth; Giuseppe D. Norata; Jean Philippe Empana; Hung Ju Lin; Stela McLachlan; Lena Bokemark; Kimmo Ronkainen; Mauro Amato; Ulf Schminke; Sathanur R. Srinivasan; Lars Lind; Akihiko Kato; Chrystosomos Dimitriadis; Tadeusz Przewlocki; Shuhei Okazaki; C. D.A. Stehouwer; Tatjana Lazarevic; Peter Willeit; David N. Yanez; Helmuth Steinmetz; Dirk Sander; Holger Poppert; Moise Desvarieux; M. Arfan Ikram; Sebastjan Bevc; Daniel Staub; Cesare R. Sirtori; Bernhard Iglseder; Gunnar Engström; Giovanni Tripepi; Oscar Beloqui; Moo Sik Lee; Alfonsa Friera; Wuxiang Xie; Liliana Grigore; Matthieu Plichart; Ta Chen Su; Christine Robertson; Caroline Schmidt; Tomi Pekka Tuomainen; Fabrizio Veglia; Henry Völzke; Giel Nijpels; Aleksandar Jovanovic; Johann Willeit; Ralph L. Sacco; Oscar H. Franco; Radovan Hojs; Heiko Uthoff; Bo Hedblad; Hyun Woong Park; Carmen Suarez; Dong Zhao; Pierre Ducimetiere; Kuo Liong Chien; Jackie F. Price; Göran Bergström; Jussi Kauhanen; Elena Tremoli; Marcus Dörr; Gerald Berenson; Aikaterini Papagianni; Anna Kablak-Ziembicka; Kazuo Kitagawa; Jaqueline M. Dekker; Radojica Stolic; Joseph F. Polak; Matthias Sitzer; Horst Bickel; Tatjana Rundek; Albert Hofman; Robert Ekart; Beat Frauchiger; Samuela Castelnuovo; Maria Rosvall; Carmine Zoccali; Manuel F. Landecho; Jang Ho Bae; Rafael Gabriel; Jing Liu; Damiano Baldassarre; Maryam Kavousi

doi:10.1186/s12911-017-0429-1

School Of Public Health

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a given target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable.

Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.

Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample.

Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
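To make the Methods more concrete, the following is a minimal Python sketch of the general idea: simple hand-written rules on variable names are combined into Boolean expressions, the best expression is picked on a construction subset, and PPV and NPV are then computed on a validation subset. Everything in it is an assumption made for illustration: the rule names, the toy variable names, the target variable (systolic blood pressure), and the exhaustive search over pairs of rules. The search is only a stand-in; logic regression as used in the study searches a much larger space of Boolean trees, and the rules and data of the study are not reproduced here.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]
Labelled = List[Tuple[str, bool]]  # (source variable name, is it the target?)

# Hypothetical hand-written rules for the target "systolic blood pressure".
RULES: Dict[str, Rule] = {
    "has_sys": lambda name: "sys" in name.lower(),
    "has_bp": lambda name: "bp" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}


def candidate_expressions(rules: Dict[str, Rule]) -> Dict[str, Rule]:
    """Single rules plus AND / OR / AND NOT combinations of every rule pair."""
    exprs: Dict[str, Rule] = dict(rules)
    for (a, ra), (b, rb) in combinations(rules.items(), 2):
        exprs[f"{a} AND {b}"] = lambda s, ra=ra, rb=rb: ra(s) and rb(s)
        exprs[f"{a} OR {b}"] = lambda s, ra=ra, rb=rb: ra(s) or rb(s)
        exprs[f"{a} AND NOT {b}"] = lambda s, ra=ra, rb=rb: ra(s) and not rb(s)
        exprs[f"{b} AND NOT {a}"] = lambda s, ra=ra, rb=rb: rb(s) and not ra(s)
    return exprs


def confusion(expr: Rule, data: Labelled) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) of a Boolean expression on labelled names."""
    tp = fp = tn = fn = 0
    for name, is_match in data:
        predicted = expr(name)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and not is_match:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ppv_npv(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """PPV = TP / (TP + FP); NPV = TN / (TN + FN); NaN if undefined."""
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


def best_expression(rules: Dict[str, Rule], data: Labelled) -> Tuple[str, Rule]:
    """Pick the candidate with the fewest misclassifications (FP + FN)."""
    def errors(item: Tuple[str, Rule]) -> int:
        _, fp, _, fn = confusion(item[1], data)
        return fp + fn
    return min(candidate_expressions(rules).items(), key=errors)


# Toy construction and validation subsets (invented variable names).
construction: Labelled = [
    ("SYSBP", True), ("sbp_mean", True), ("sys_bp1", True),
    ("DIABP", False), ("diabetes", False), ("age", False),
]
validation: Labelled = [
    ("SBP", True), ("syst", True),
    ("bpmed", False), ("dbp", False), ("weight", False),
]

best_name, best_rule = best_expression(RULES, construction)
val_counts = confusion(best_rule, validation)
val_ppv, val_npv = ppv_npv(*val_counts)
print(best_name, val_counts, round(val_ppv, 2), round(val_npv, 2))
```

In this toy setup the expression chosen on the construction subset fits it perfectly but produces false positives and a false negative on the validation subset, which is the kind of gap a separate validation subset is meant to expose.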

Original language	English (US)
Article number	40
Journal	BMC Medical Informatics and Decision Making
Volume	17
Issue number	1
DOIs	https://doi.org/10.1186/s12911-017-0429-1
State	Published - Apr 13 2017

Keywords

Data management
Epidemiology
Logic regression
Meta-analysis

ASJC Scopus subject areas

Health Policy
Health Informatics
Computer Science Applications

Access to Document

10.1186/s12911-017-0429-1

Cite this

Lorenz, M. W., Abdi, N. A., Scheckenbach, F., Pflug, A., Bülbül, A., Catapano, A. L., Agewall, S., Ezhov, M., Bots, M. L., Kiechl, S., Orth, A., Norata, G. D., Empana, J. P., Lin, H. J., McLachlan, S., Bokemark, L., Ronkainen, K., Amato, M., Schminke, U., ... Kavousi, M. (2017). Automatic identification of variables in epidemiological datasets using logic regression. BMC Medical Informatics and Decision Making, 17(1), Article 40. https://doi.org/10.1186/s12911-017-0429-1

Lorenz, MW, Abdi, NA, Scheckenbach, F, Pflug, A, Bülbül, A, Catapano, AL, Agewall, S, Ezhov, M, Bots, ML, Kiechl, S, Orth, A, Norata, GD, Empana, JP, Lin, HJ, McLachlan, S, Bokemark, L, Ronkainen, K, Amato, M, Schminke, U, Srinivasan, SR, Lind, L, Kato, A, Dimitriadis, C, Przewlocki, T, Okazaki, S, Stehouwer, CDA, Lazarevic, T, Willeit, P, Yanez, DN, Steinmetz, H, Sander, D, Poppert, H, Desvarieux, M, Ikram, MA, Bevc, S, Staub, D, Sirtori, CR, Iglseder, B, Engström, G, Tripepi, G, Beloqui, O, Lee, MS, Friera, A, Xie, W, Grigore, L, Plichart, M, Su, TC, Robertson, C, Schmidt, C, Tuomainen, TP, Veglia, F, Völzke, H, Nijpels, G, Jovanovic, A, Willeit, J, Sacco, RL, Franco, OH, Hojs, R, Uthoff, H, Hedblad, B, Park, HW, Suarez, C, Zhao, D, Ducimetiere, P, Chien, KL, Price, JF, Bergström, G, Kauhanen, J, Tremoli, E, Dörr, M, Berenson, G, Papagianni, A, Kablak-Ziembicka, A, Kitagawa, K, Dekker, JM, Stolic, R, Polak, JF, Sitzer, M, Bickel, H, Rundek, T, Hofman, A, Ekart, R, Frauchiger, B, Castelnuovo, S, Rosvall, M, Zoccali, C, Landecho, MF, Bae, JH, Gabriel, R, Liu, J, Baldassarre, D & Kavousi, M 2017, 'Automatic identification of variables in epidemiological datasets using logic regression', BMC Medical Informatics and Decision Making, vol. 17, no. 1, 40. https://doi.org/10.1186/s12911-017-0429-1

@article{fc8ff671251e412c8f2d2b5da4806817,

title = "Automatic identification of variables in epidemiological datasets using logic regression",

abstract = "Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a given target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.",

keywords = "Data management, Epidemiology, Logic regression, Meta-analysis",

author = "Lorenz, {Matthias W.} and Abdi, {Negin Ashtiani} and Frank Scheckenbach and Anja Pflug and Alpaslan B{\"u}lb{\"u}l and Catapano, {Alberico L.} and Stefan Agewall and Marat Ezhov and Bots, {Michiel L.} and Stefan Kiechl and Andreas Orth and Norata, {Giuseppe D.} and Empana, {Jean Philippe} and Lin, {Hung Ju} and Stela McLachlan and Lena Bokemark and Kimmo Ronkainen and Mauro Amato and Ulf Schminke and Srinivasan, {Sathanur R.} and Lars Lind and Akihiko Kato and Chrystosomos Dimitriadis and Tadeusz Przewlocki and Shuhei Okazaki and Stehouwer, {C. D.A.} and Tatjana Lazarevic and Peter Willeit and Yanez, {David N.} and Helmuth Steinmetz and Dirk Sander and Holger Poppert and Moise Desvarieux and Ikram, {M. Arfan} and Sebastjan Bevc and Daniel Staub and Sirtori, {Cesare R.} and Bernhard Iglseder and Gunnar Engstr{\"o}m and Giovanni Tripepi and Oscar Beloqui and Lee, {Moo Sik} and Alfonsa Friera and Wuxiang Xie and Liliana Grigore and Matthieu Plichart and Su, {Ta Chen} and Christine Robertson and Caroline Schmidt and Tuomainen, {Tomi Pekka} and Fabrizio Veglia and Henry V{\"o}lzke and Giel Nijpels and Aleksandar Jovanovic and Johann Willeit and Sacco, {Ralph L.} and Franco, {Oscar H.} and Radovan Hojs and Heiko Uthoff and Bo Hedblad and Park, {Hyun Woong} and Carmen Suarez and Dong Zhao and Pierre Ducimetiere and Chien, {Kuo Liong} and Price, {Jackie F.} and G{\"o}ran Bergstr{\"o}m and Jussi Kauhanen and Elena Tremoli and Marcus D{\"o}rr and Gerald Berenson and Aikaterini Papagianni and Anna Kablak-Ziembicka and Kazuo Kitagawa and Dekker, {Jaqueline M.} and Radojica Stolic and Polak, {Joseph F.} and Matthias Sitzer and Horst Bickel and Tatjana Rundek and Albert Hofman and Robert Ekart and Beat Frauchiger and Samuela Castelnuovo and Maria Rosvall and Carmine Zoccali and Landecho, {Manuel F.} and Bae, {Jang Ho} and Rafael Gabriel and Jing Liu and Damiano Baldassarre and Maryam Kavousi",

note = "Publisher Copyright: {\textcopyright} 2017 The Author(s).",

year = "2017",

month = apr,

day = "13",

doi = "10.1186/s12911-017-0429-1",

language = "English (US)",

volume = "17",

journal = "BMC Medical Informatics and Decision Making",

issn = "1472-6947",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Automatic identification of variables in epidemiological datasets using logic regression

AU - Lorenz, Matthias W.

AU - Abdi, Negin Ashtiani

AU - Scheckenbach, Frank

AU - Pflug, Anja

AU - Bülbül, Alpaslan

AU - Catapano, Alberico L.

AU - Agewall, Stefan

AU - Ezhov, Marat

AU - Bots, Michiel L.

AU - Kiechl, Stefan

AU - Orth, Andreas

AU - Norata, Giuseppe D.

AU - Empana, Jean Philippe

AU - Lin, Hung Ju

AU - McLachlan, Stela

AU - Bokemark, Lena

AU - Ronkainen, Kimmo

AU - Amato, Mauro

AU - Schminke, Ulf

AU - Srinivasan, Sathanur R.

AU - Lind, Lars

AU - Kato, Akihiko

AU - Dimitriadis, Chrystosomos

AU - Przewlocki, Tadeusz

AU - Okazaki, Shuhei

AU - Stehouwer, C. D.A.

AU - Lazarevic, Tatjana

AU - Willeit, Peter

AU - Yanez, David N.

AU - Steinmetz, Helmuth

AU - Sander, Dirk

AU - Poppert, Holger

AU - Desvarieux, Moise

AU - Ikram, M. Arfan

AU - Bevc, Sebastjan

AU - Staub, Daniel

AU - Sirtori, Cesare R.

AU - Iglseder, Bernhard

AU - Engström, Gunnar

AU - Tripepi, Giovanni

AU - Beloqui, Oscar

AU - Lee, Moo Sik

AU - Friera, Alfonsa

AU - Xie, Wuxiang

AU - Grigore, Liliana

AU - Plichart, Matthieu

AU - Su, Ta Chen

AU - Robertson, Christine

AU - Schmidt, Caroline

AU - Tuomainen, Tomi Pekka

AU - Veglia, Fabrizio

AU - Völzke, Henry

AU - Nijpels, Giel

AU - Jovanovic, Aleksandar

AU - Willeit, Johann

AU - Sacco, Ralph L.

AU - Franco, Oscar H.

AU - Hojs, Radovan

AU - Uthoff, Heiko

AU - Hedblad, Bo

AU - Park, Hyun Woong

AU - Suarez, Carmen

AU - Zhao, Dong

AU - Ducimetiere, Pierre

AU - Chien, Kuo Liong

AU - Price, Jackie F.

AU - Bergström, Göran

AU - Kauhanen, Jussi

AU - Tremoli, Elena

AU - Dörr, Marcus

AU - Berenson, Gerald

AU - Papagianni, Aikaterini

AU - Kablak-Ziembicka, Anna

AU - Kitagawa, Kazuo

AU - Dekker, Jaqueline M.

AU - Stolic, Radojica

AU - Polak, Joseph F.

AU - Sitzer, Matthias

AU - Bickel, Horst

AU - Rundek, Tatjana

AU - Hofman, Albert

AU - Ekart, Robert

AU - Frauchiger, Beat

AU - Castelnuovo, Samuela

AU - Rosvall, Maria

AU - Zoccali, Carmine

AU - Landecho, Manuel F.

AU - Bae, Jang Ho

AU - Gabriel, Rafael

AU - Liu, Jing

AU - Baldassarre, Damiano

AU - Kavousi, Maryam

PY - 2017/4/13

Y1 - 2017/4/13

N2 - Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important, because it allows the creation of software that, for a given target variable, presents a choice of source variables from which a user can choose the matching one, with only a low risk of missing a correct source variable. Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated. Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34%, and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less in 63% of all variables in the construction sample and in 71% of all variables in the validation sample. Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.

KW - Data management

KW - Epidemiology

KW - Logic regression

KW - Meta-analysis

UR - http://www.scopus.com/inward/record.url?scp=85018523489&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018523489&partnerID=8YFLogxK

U2 - 10.1186/s12911-017-0429-1

DO - 10.1186/s12911-017-0429-1

M3 - Article

C2 - 28407816

AN - SCOPUS:85018523489

SN - 1472-6947

VL - 17

JO - BMC Medical Informatics and Decision Making

JF - BMC Medical Informatics and Decision Making

IS - 1

M1 - 40

ER -
