NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery
National Science Foundation Award Number: NSF IIS 1320617 (08-01-2013 to 07-31-2016)
E-mail: hanj at cs.uiuc.edu
List of Supported Students and Staff
§ Shi Zhi, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign
§ Jingbo Shang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Jingjing Wang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Wenqi He, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Quan Yuan, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
Big data leads to big challenges, not only in the volume of data but also in its dynamics and variety. Multiple descriptions of the same objects or events from different sources will unavoidably lead to inconsistent data or information. Among conflicting pieces of data or information, which one is more trustworthy, or represents the true fact? Given the daunting scale of the data, it is unrealistic to expect humans to “label” or tell which data source is more reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating source reliability estimation and truth finding. Recent studies follow a general principle: truth is obtained by weighted voting among multiple sources, where more reliable sources have higher weights, and sources that tell the truth more often are regarded as more reliable. Many issues nevertheless remain unsolved. We propose to integrate previous studies on this issue and develop a unified approach that combines probabilistic and optimization models with multidimensional trustworthiness factors. This will lead to a set of efficient and effective methods, technologies and software systems for truth inference from multiple conflicting sources of heterogeneous, disparate, correlated, gigantic, scattered, and streaming data.
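The general principle stated above (weighted voting in which source weights and inferred truths reinforce each other) can be illustrated with a minimal sketch. This is not the project's TruthMine method; it is a generic iteration written only for illustration. The function name discover_truths, the smoothed-accuracy weight update, and the toy claims are all assumptions introduced here.

```python
from collections import defaultdict

def discover_truths(claims, iterations=10):
    """Iterative weighted voting over (source, object, value) triples.

    Hypothetical helper for illustration only. Returns (truths, weights).
    """
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with equally trusted sources
    truths = {}
    for _ in range(iterations):
        # Step 1: each object takes the value with the largest total weight.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        # sorted() makes tie-breaking deterministic.
        truths = {o: max(sorted(vals), key=vals.get) for o, vals in votes.items()}
        # Step 2: a source's weight is its smoothed agreement with current truths.
        correct, total = defaultdict(int), defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            correct[s] += int(truths[o] == v)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights

# Toy data (assumed): A and B agree with each other; C and D rarely agree.
claims = []
for obj in ["o1", "o2", "o3", "o5"]:
    claims += [("A", obj, "a"), ("B", obj, "a"), ("C", obj, "b"), ("D", obj, "c")]
# On o4, source B is silent, so plain majority voting sides with C and D.
claims += [("A", "o4", "a"), ("C", "o4", "b"), ("D", "o4", "b")]

majority_truths, _ = discover_truths(claims, iterations=1)
truths, weights = discover_truths(claims)
print(majority_truths["o4"], truths["o4"])
```

With a single pass the result for o4 is the simple majority value; after further iterations the lower weights of C and D let the claim from A prevail. The sketch ignores the open issues named in the abstract, such as multiple simultaneous truths, heterogeneous data types, and source dependencies.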
Intellectual Merit: This proposal addresses several important research questions for the task of jointly conducting trustworthiness estimation and truth discovery: (i) how to model cases in which multiple values can be true simultaneously for one entity, (ii) how to characterize heterogeneous data types in the process, (iii) how to detect truths effectively in streaming, distributed and large-scale data sets, and (iv) how to derive truths and trustworthiness when data dependencies exist. To address these research questions, we propose to develop an integrated truth discovery framework with the following integral components: (1) a generative model that incorporates two-sided trustworthiness for effective detection of multiple truths; (2) probabilistic and optimization frameworks that allow any loss function for any data type in truth and trustworthiness inference; (3) models that integrate time factors as well as efficient incremental and parallel computation approaches for streaming, distributed and large-scale data; (4) an integrated model that accounts for data and source dependencies in trustworthiness estimation and truth discovery; (5) methods for integrative analysis of conflicting data sources that incorporate trustworthiness analysis; and (6) a systematic study of the connections between various models and techniques, leading to a unified TruthMine framework. The success of this project will solve several difficult research problems and greatly improve the state of the art in conflict resolution and data fusion.
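Component (2), an optimization framework with a pluggable loss function, can be sketched for numeric data as an alternating update. The sketch below is an assumption-laden illustration, not the project's implementation: the names resolve, squared_loss and weighted_mean are hypothetical, and the negative-log weight update is one common choice in the truth discovery literature rather than a formula taken from this page. It assumes at least two sources, and that the aggregate function passed in minimizes the chosen loss (the weighted mean does so for squared loss).

```python
import math

def squared_loss(claim, truth):
    """Loss for continuous values (assumed choice)."""
    return (claim - truth) ** 2

def weighted_mean(values, weights):
    """Minimizer of the weighted squared loss."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def resolve(claims, loss=squared_loss, aggregate=weighted_mean,
            iterations=20, eps=1e-9):
    """claims maps source -> {object: numeric value}. Hypothetical helper."""
    sources = sorted(claims)
    objects = sorted({o for s in sources for o in claims[s]})
    weights = {s: 1.0 for s in sources}
    truths = {}
    for _ in range(iterations):
        # Truth step: for fixed weights, pick the value minimizing weighted loss.
        for o in objects:
            holders = [s for s in sources if o in claims[s]]
            truths[o] = aggregate([claims[s][o] for s in holders],
                                  [weights[s] for s in holders])
        # Weight step: sources with a smaller share of the total loss get
        # larger weights; eps keeps the logarithm finite.
        losses = {s: sum(loss(v, truths[o]) for o, v in claims[s].items()) + eps
                  for s in sources}
        total = sum(losses.values())
        weights = {s: -math.log(losses[s] / total) for s in sources}
    return truths, weights

# Toy data (assumed): S1 and S2 roughly agree, S3 is far off on both objects.
claims = {
    "S1": {"x": 10.0, "y": 20.0},
    "S2": {"x": 10.2, "y": 19.8},
    "S3": {"x": 15.0, "y": 30.0},
}
truths, weights = resolve(claims)
print(truths, weights)
```

For categorical data, one could pass a 0/1 loss together with a weighted-vote aggregate instead; the alternating structure stays the same.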
Broader Impacts: Identifying trustworthy information sources and truths from massive, diverse, conflicting, complex and noisy data is critical for data integration, information understanding and decision making. It is a cornerstone of big data management and data analytics. Our proposed TruthMine framework, resulting from our previous long-term collaborative work in this and other related domains, will make tangible contributions to this critical research issue. The new theory, methodologies, algorithms, and software prototype will advance the state of the art and benefit many applications where critical decisions have to be made based on correct information extracted from diverse sources. Application fields may include healthcare, business intelligence, cyber-security, military and intelligence decision making, bioinformatics, crowd-sourcing, cyber-physical systems, social media, and beyond. Moreover, the proposed work will be integrated with the training of students and a new generation of researchers, especially female and minority students. The new research results will be integrated into course materials and projects in data mining, for graduate, undergraduate, and K-12 outreach activities. Tutorials, workshops, and research publications will be made available for broad access. Datasets and software prototypes of this project will be made publicly available.
The research results are to be published in various research and application forums and be integrated into the educational programs at UIUC. The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/conflicts.htm).
Selected Publications and Products:
Journal and Refereed Conference Publications
1. Meng Jiang, Christos Faloutsos, Jiawei Han, "CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
2. Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Jiawei Han, "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
3. Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao, "An Uncertainty-Aware Model to Summarize Trustworthy Quantitative Information", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
4. Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Mining Reliable Information from Passively and Actively Crowdsourced Data" (conf. tutorial), in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
5. Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst: Real-time Local Event Detection in Geo-tagged Tweet Stream", in Proc. of 2016 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'16), Pisa, Italy, July 2016
Education: Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to both undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Moreover, the research results have been, and will continue to be, published promptly in international conferences and journals and distributed worldwide for education and research. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.
Collaborations: For this project we have established collaborations with Boeing, ARL, NASA, IBM T.J. Watson Research Center, Yahoo! Labs, Microsoft Research, Google Research, and NCSA (National Center for Supercomputing Applications). Through such collaborations we expect to gain access to real datasets and applications and to produce more research results.
Current and Future Activities
The following are some of the highlights of our ongoing work. Please refer to the Selected Publications and Products section for related references.
· This project is based on the previous research on data mining, information network analysis, spatiotemporal data analysis, and data cube and multidimensional analysis.
Many research papers have been published on these themes. Several textbooks on data mining, information retrieval and information network analysis provide good overviews of the principles and algorithms, including (Han, Kamber and Pei, 2011) and (Sun and Han, 2012).
· X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. VLDB Journal, 2010.
· X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562–573, 2009.
· J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. IEEE Transactions on Knowledge and Data Engineering, 25(1):15–28, 2013.
· M. Gupta and J. Han. Heterogeneous network-based trust analysis: A survey. SIGKDD Explorations, 13(1):54–71, 2011.
· A. D. Sarma, X. L. Dong, and A. Halevy. Data integration with dependent sources. In Proc. of the International Conference on Extending Database Technology (EDBT’11), pages 401–412, 2011.
· D. Wang, L. Kaplan, H. Le, and T. Abdelzaher. On truth discovery in social sensing: A maximum likelihood estimation approach. In Proc. of the ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN’12), pages 233–244, 2012.
· V. Vydiswaran, C. Zhai, and D. Roth. Content-driven trust propagation framework. In Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 974–982, 2011.
· P. Yu, J. Han, and C. Faloutsos, editors. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
Potential Related Projects
Any project related to truth discovery, information fusion, information and social network analysis, spatiotemporal data mining, and knowledge discovery.
Project Web site URL: http://www.cs.uiuc.edu/~hanj/projs/conflicts.htm
Online software: Online software related to this project can be downloaded at www.illimine.cs.uiuc.edu
Online resources: Research publications related to this project are listed in the Selected Publications and Products section above.