Automatic identification of variables in epidemiological datasets using logic regression

Matthias W. Lorenz, Negin Ashtiani Abdi, Frank Scheckenbach, Anja Pflug, Alpaslan Bülbül, Alberico L. Catapano, Stefan Agewall, Marat Ezhov, Michiel L. Bots, Stefan Kiechl, Andreas Orth, Giuseppe D. Norata, Jean Philippe Empana, Hung Ju Lin, Stela McLachlan, Lena Bokemark, Kimmo Ronkainen, Mauro Amato, Ulf Schminke, Sathanur R. Srinivasan, Lars Lind, Akihiko Kato, Chrystosomos Dimitriadis, Tadeusz Przewlocki, Shuhei Okazaki, C. D.A. Stehouwer, Tatjana Lazarevic, Peter Willeit, David N. Yanez, Helmuth Steinmetz, Dirk Sander, Holger Poppert, Moise Desvarieux, M. Arfan Ikram, Sebastjan Bevc, Daniel Staub, Cesare R. Sirtori, Bernhard Iglseder, Gunnar Engström, Giovanni Tripepi, Oscar Beloqui, Moo Sik Lee, Alfonsa Friera, Wuxiang Xie, Liliana Grigore, Matthieu Plichart, Ta Chen Su, Christine Robertson, Caroline Schmidt, Tomi Pekka Tuomainen, Fabrizio Veglia, Henry Völzke, Giel Nijpels, Aleksandar Jovanovic, Johann Willeit, Ralph L. Sacco, Oscar H. Franco, Radovan Hojs, Heiko Uthoff, Bo Hedblad, Hyun Woong Park, Carmen Suarez, Dong Zhao, Pierre Ducimetiere, Kuo Liong Chien, Jackie F. Price, Göran Bergström, Jussi Kauhanen, Elena Tremoli, Marcus Dörr, Gerald Berenson, Aikaterini Papagianni, Anna Kablak-Ziembicka, Kazuo Kitagawa, Jaqueline M. Dekker, Radojica Stolic, Joseph F. Polak, Matthias Sitzer, Horst Bickel, Tatjana Rundek, Albert Hofman, Robert Ekart, Beat Frauchiger, Samuela Castelnuovo, Maria Rosvall, Carmine Zoccali, Manuel F. Landecho, Jang Ho Bae, Rafael Gabriel, Jing Liu, Damiano Baldassarre, Maryam Kavousi

Research output: Contribution to journal › Article › peer-review

1 Scopus citations


Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. by using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable has been missed.

Methods: For each variable in a set of target variables, a number of simple rules were created manually. Logic regression was then used to search, for every target variable, for an optimal Boolean combination of these rules, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal rule combinations were then validated in a second subset of the database (validation subset).

Results: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33%, whereas NPV remained at 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% in the validation sample.

Conclusions: We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
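The workflow described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rules, the toy variables, and the target variable "age" are hypothetical, and an exhaustive search over single rules and two-rule AND/OR combinations stands in for logic regression, which searches a much larger space of Boolean expressions. The score used here (number of misclassified variables) is likewise an assumption. The sketch selects a combination on a construction subset and then reports PPV and NPV on a separate validation subset.

```python
from itertools import combinations

def confusion_metrics(predicted, actual):
    """Return (PPV, NPV) for parallel lists of booleans; None where undefined."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv

# Hypothetical hand-written rules for the target variable "age".
RULES = {
    "name_has_age": lambda v: "age" in v["name"].lower(),
    "is_numeric": lambda v: v["numeric"],
    "max_le_120": lambda v: v["numeric"] and v["max"] <= 120,
}

OPS = {"AND": lambda a, b: a and b, "OR": lambda a, b: a or b}

def candidates():
    """Yield (label, predicate) for single rules and two-rule combinations."""
    for name, rule in RULES.items():
        yield name, rule
    for (n1, r1), (n2, r2) in combinations(RULES.items(), 2):
        for op_name, op in OPS.items():
            # Default arguments bind the current rules and operator.
            yield (f"{n1} {op_name} {n2}",
                   lambda v, r1=r1, r2=r2, op=op: op(r1(v), r2(v)))

def best_combination(variables):
    """Pick the candidate with the fewest misclassified variables."""
    def errors(pred):
        return sum(bool(pred(v)) != v["is_age"] for v in variables)
    return min(candidates(), key=lambda c: errors(c[1]))

def evaluate(pred, variables):
    """Apply a predicate to every variable and return (PPV, NPV)."""
    predicted = [bool(pred(v)) for v in variables]
    actual = [v["is_age"] for v in variables]
    return confusion_metrics(predicted, actual)

# Toy source variables (invented); "is_age" is the manually assigned truth.
construction = [
    {"name": "AGE", "numeric": True, "max": 89, "is_age": True},
    {"name": "age_bl", "numeric": True, "max": 92, "is_age": True},
    {"name": "stage", "numeric": False, "max": None, "is_age": False},
    {"name": "sbp", "numeric": True, "max": 210, "is_age": False},
    {"name": "bmi", "numeric": True, "max": 48, "is_age": False},
    {"name": "dosage", "numeric": True, "max": 400, "is_age": False},
]
validation = [
    {"name": "Age_years", "numeric": True, "max": 95, "is_age": True},
    {"name": "alter", "numeric": True, "max": 88, "is_age": True},
    {"name": "page_no", "numeric": True, "max": 40, "is_age": False},
    {"name": "ldl", "numeric": True, "max": 310, "is_age": False},
    {"name": "sex", "numeric": False, "max": None, "is_age": False},
]

best_label, best_rule = best_combination(construction)
construction_ppv, construction_npv = evaluate(best_rule, construction)
validation_ppv, validation_npv = evaluate(best_rule, validation)

print("selected:", best_label)
print("construction PPV/NPV:", construction_ppv, construction_npv)
print("validation PPV/NPV:", validation_ppv, validation_npv)
```

In this toy data the selected combination fits the construction subset perfectly but loses precision on the validation subset, because it misses a differently named age variable and accepts an unrelated one. This is the kind of drop that a separate validation subset is meant to expose.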

Original language: English (US)
Article number: 40
Journal: BMC Medical Informatics and Decision Making
Issue number: 1
State: Published - Apr 13 2017


Keywords

  • Data management
  • Epidemiology
  • Logic regression
  • Meta-analysis

ASJC Scopus subject areas

  • Health Policy
  • Health Informatics

