TY - JOUR
T1 - Measuring the reproducibility and quality of Hi-C data
AU - Yardımcı, Galip Gürkan
AU - Ozadam, Hakan
AU - Sauria, Michael E.G.
AU - Ursu, Oana
AU - Yan, Koon Kiu
AU - Yang, Tao
AU - Chakraborty, Abhijit
AU - Kaul, Arya
AU - Lajoie, Bryan R.
AU - Song, Fan
AU - Zhan, Ye
AU - Ay, Ferhat
AU - Gerstein, Mark
AU - Kundaje, Anshul
AU - Li, Qunhua
AU - Taylor, James
AU - Yue, Feng
AU - Dekker, Job
AU - Noble, William S.
N1 - Funding Information:
G.G.Y. and W.S.N. are supported by awards NIH U41HG007000 and U24HG009446. H.O. is supported by DK107980. B.R.L. is supported by awards HG004592 and HG003143. J.D. is supported by awards HG004592, HG003143, and DK107980. M.E.G.S. and J.T. are supported by awards NIH R24 DK106766 and U41 HG006620. O.U. is supported by a Howard Hughes Medical Institute International Student Research Fellowship and a Gabilan Stanford Graduate Fellowship award. A.K. is supported by awards NIH DP2OD022870, U24HG009397, and R01ES025009-02S1. T.Y. is supported by NIH T32 GM102057 (CBIOS training program to The Pennsylvania State University) and a Huck Graduate Research Innovation Grant. Q.L. is supported by award NIH R01GM109453.
Publisher Copyright:
© 2019 The Author(s).
PY - 2019/3/19
Y1 - 2019/3/19
N2 - Background: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. Results: Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. Conclusions: In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.
UR - http://www.scopus.com/inward/record.url?scp=85063156719&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063156719&partnerID=8YFLogxK
U2 - 10.1186/s13059-019-1658-7
DO - 10.1186/s13059-019-1658-7
M3 - Article
C2 - 30890172
AN - SCOPUS:85063156719
SN - 1474-7596
VL - 20
JO - Genome Biology
JF - Genome Biology
IS - 1
M1 - 57
ER -