NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery
National Science Foundation Award Number: NSF IIS 1320617 (08-01-2013 to 07-31-2017)
E-mail: hanj at cs.uiuc.edu
URL: http://www.cs.uiuc.edu/~hanj
Jing Gao, PI
Department of Computer Science and Engineering
University at Buffalo, Buffalo,
NY 14260, U.S.A.
Office: (716)645-1586
URL: https://www.cse.buffalo.edu/~jing/
List of Supported Students and Staff at UIUC
§ Shi Zhi, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign
§ Jingbo Shang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Jingjing Wang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Wenqi He, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
§ Quan Yuan, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)
Acknowledgement
· This material is based upon work supported by the National Science Foundation under Grant No. NSF IIS 1320617 (08/01/2013 to 07/31/2017), “NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Collaborative project website:
· PI: Professor Jing Gao (https://www.cse.buffalo.edu/~jing)
· The collaborative project website: https://www.cse.buffalo.edu//~jing/truth.htm.
Project Summary
Big data leads to big challenges, not only in the volume of data but also in its dynamics and variety. Multiple descriptions of the same sets of objects or events from different sources will unavoidably lead to inconsistent data or information. Among conflicting pieces of data or information, which one is more trustworthy, or represents the true fact? Given the daunting scale of the data, it is unrealistic to expect humans to “label” which data source is more reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating source reliability estimation and truth finding. Recent studies follow a general principle: truth is obtained by weighted voting among multiple sources, where more reliable sources have higher weights, and sources that tell the truth more often are regarded as more reliable. However, many issues remain unsolved. We propose to integrate previous studies on this issue and develop a unified approach by integrating probabilistic and optimization models with multidimensional trustworthiness factors. This will lead to a set of efficient and effective methods, technologies and software systems for truth inference from multiple conflicting sources of heterogeneous, disparate, correlated, gigantic, scattered, and streaming data.
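The weighted-voting principle stated in the Project Summary can be illustrated with a minimal sketch. It assumes categorical claims given as (source, object, value) triples and uses a smoothed per-source accuracy as the weight. The function name discover_truths, the weight update rule, and the toy data are illustrative assumptions, not the TruthMine framework or any published algorithm from this project.

```python
from collections import defaultdict


def discover_truths(claims, n_iters=10):
    """Iterative weighted voting over (source, object, value) claims.

    Hypothetical sketch: alternates between (1) picking, per object, the
    value with the largest total source weight and (2) re-estimating each
    source's weight as its smoothed agreement rate with those picks.
    """
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with equal trust
    truths = {}
    for _ in range(n_iters):
        # Step 1: weighted vote for each object.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        truths = {o: max(vals, key=vals.get) for o, vals in votes.items()}

        # Step 2: sources that agree with the current truths more often
        # receive higher weights (Laplace smoothing avoids 0 and 1).
        correct = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            correct[s] += int(truths[o] == v)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Toy data (assumed): C disagrees with A and B on x, y, z;
# A and C are the only sources for w, so the first vote on w is a tie.
claims = [
    ("A", "x", 1), ("B", "x", 1), ("C", "x", 2),
    ("A", "y", 5), ("B", "y", 5), ("C", "y", 6),
    ("A", "z", 7), ("B", "z", 7), ("C", "z", 9),
    ("C", "w", "q"), ("A", "w", "p"),
]
truths, weights = discover_truths(claims)
print(truths)
print(weights)
```

In this toy run the tie on object w is initially broken in favor of C's value, but once C's weight drops because of its disagreements on x, y, and z, the vote on w shifts to A's value. Real truth discovery methods handle heterogeneous data types, multiple truths, and source dependencies, none of which this sketch attempts.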
Intellectual Merit: This proposal addresses several important research questions for the task of jointly conducting trustworthiness estimation and truth discovery: (i) how to model cases in which multiple values can be true simultaneously for one entity, (ii) how to characterize heterogeneous data types in the process, (iii) how to detect truths effectively in streaming, distributed and large-scale data sets, and (iv) how to derive truths and trustworthiness when data dependencies exist. To address these research questions, we propose to develop an integrated truth discovery framework with the following integral components: (1) a generative model which incorporates two-sided trustworthiness for effective detection of multiple truths; (2) probabilistic and optimization frameworks which allow any loss function for any data type in truth and trustworthiness inference; (3) models that integrate time factors as well as efficient incremental and parallel computation approaches for streaming, distributed and large-scale data; (4) an integrated model that takes care of data and source dependencies in trustworthiness estimation and truth discovery; (5) methods to conduct integrative analysis of conflicting data sources that integrates trustworthiness analysis; and (6) a systematic study on the connections between various models and techniques leading to a unified TruthMine framework. The success of this project will solve several difficult research problems and greatly improve the state-of-the-art in conflict resolution and data fusion.
Broader Impacts: Identifying trustworthy information sources and truths from massive, diverse, conflicting, complex and noisy data is critical for data integration, information understanding and decision making. It is a cornerstone for big data management and data analytics. Our proposed TruthMine framework, resulting from our previous long-term collaborative work in this and other related domains, will make tangible contributions to this critical research issue. The new theory, methodologies, algorithms, and software prototype developed will advance the state-of-the-art and benefit many applications where critical decisions have to be made based on correct information extracted from diverse sources. Application fields may include healthcare, business intelligence, cyber-security, military and intelligence decision making, bioinformatics, crowd-sourcing, cyber-physical systems, social media, and well beyond. Moreover, the proposed work will be integrated with the training of students and new-generation researchers, especially female and minority students. The new research results will be integrated into course materials and projects in data mining, for graduate, undergraduate, and K-12 outreach activities. Tutorials, workshops, and research publications will be made available for broad access. Datasets and software prototypes of this project will be made publicly available.
The research results are to be published in various research and application forums and integrated into the educational programs at UIUC. The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/conflicts.htm).
Selected Publications and Products:
Books (authored)
· Jialu Liu, Jingbo Shang, and Jiawei Han, Phrase Mining from Massive Text and Its Applications, Synthesis Lectures on Data Mining and Knowledge Discovery, Morgan & Claypool Publishers, 2017.
Journal and Refereed Conference Publications
13. Meng Jiang, Christos Faloutsos, Jiawei Han, "CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
14. Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Jiawei Han, "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
15. Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao, "An Uncertainty-Aware Model to Summarize Trustworthy Quantitative Information", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
16. Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Mining Reliable Information from Passively and Actively Crowdsourced Data" (conf. tutorial), in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
17. Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst: Real-time Local Event Detection in Geo-tagged Tweet Stream", in Proc. of 2016 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'16), Pisa, Italy, July 2016
Project Impact
· Education: Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to both undergraduate and graduate students in the Department of Computer Science, the University of Illinois at Urbana-Champaign. Moreover, the research results have been and will continue to be published promptly in international conferences and journals and distributed worldwide for education and research. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.
· Collaborations: For this project we have established collaborations with Boeing, ARL, NASA, IBM T.J. Watson Research Center, Yahoo! Labs, Microsoft Research, Google Research, and NCSA (National Center for Supercomputing Applications). Through such collaborations we expect to gain access to real datasets and applications and to produce more research results.
Current and Future Activities
· The following are some of the highlights of our ongoing work. Please refer to the Publications and Products section for related references.
Area Background
· This project is based on the previous research on data mining, information network analysis, spatiotemporal data analysis, and data cube and multidimensional analysis.
· There have been many research papers published on these themes. Several textbooks on data mining, information retrieval and information network analysis provide good overviews of the principles and algorithms, including (Han, Kamber and Pei, 2011) and (Sun and Han, 2012).
Area References
· X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. VLDB Journal, 2010.
· X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562–573, 2009.
· J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. IEEE Transactions on Knowledge and Data Engineering, 25(1):15–28, 2013.
· M. Gupta and J. Han. Heterogeneous network-based trust analysis: A survey. SIGKDD Explorations, 13(1):54–71, 2011.
· A. D. Sarma, X. L. Dong, and A. Halevy. Data integration with dependent sources. In Proc. of the International Conference on Extending Database Technology (EDBT’11), pages 401–412, 2011.
· D. Wang, L. Kaplan, H. Le, and T. Abdelzaher. On truth discovery in social sensing: A maximum likelihood estimation approach. In Proc. of the ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN’12), pages 233–244, 2012.
· V. Vydiswaran, C. Zhai, and D. Roth. Content-driven trust propagation framework. In Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 974–982, 2011.
· P. Yu, J. Han, and C. Faloutsos, editors. Link Mining: Models, Algorithms, and Applications. Springer, 2010.
Potential Related Projects
· Any project related to truth discovery, information fusion, information and social network analysis, spatiotemporal data mining, and knowledge discovery.
Project Web site URL:
http://www.cs.uiuc.edu/~hanj/projs/conflicts.htm
Online software (code): downloadable at www.illimine.cs.uiuc.edu, or on GitHub under the first author of the corresponding publication
Online resources: Research publications related to this project can be downloaded at Selected Publications
Education Materials: CS412+CS512 (http://hanj.cs.illinois.edu) and Coursera: UIUC, Data Mining Specialization
Date of last update: Nov. 2017