NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery

National Science Foundation Award Number: NSF IIS 1320617 (08-01-2013—07-31-2016)

 

Contact Information

 

Jiawei Han, co-PI
Department of Computer Science
University of Illinois, Urbana-Champaign
201 N. Goodwin Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903

Fax: (217) 265-6494

E-mail: hanj at cs.uiuc.edu

URL: http://www.cs.uiuc.edu/~hanj

 

List of Supported Students and Staff

 

·         Shi Zhi, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

·         Jingbo Shang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

·         Jingjing Wang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

·         Wenqi He, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

·         Quan Yuan, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

Project Award Information

Project Summary

Big data leads to big challenges, not only in the volume of data but also in its dynamics and variety. Multiple descriptions of the same sets of objects or events from different sources will unavoidably lead to data or information inconsistency. Then, among conflicting pieces of data or information, which one is more trustworthy, or represents the true fact? Given the daunting scale of data, it is unrealistic to expect humans to “label” or tell which data source is more reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating source reliability estimation and truth finding. Recent studies follow a general principle: truth is obtained by weighted voting among multiple sources, where more reliable sources have higher weights, and sources that tell the truth more often are regarded as more reliable. However, many issues remain unsolved. We propose to integrate previous studies on this issue and develop a unified approach that combines probabilistic and optimization models with multidimensional trustworthiness factors. This will lead to a set of efficient and effective methods, technologies and software systems for truth inference from multiple conflicting sources of heterogeneous, disparate, correlated, gigantic, scattered, and streaming data.
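
The general principle of reliability-weighted voting can be illustrated with a small iterative procedure. The following is a minimal sketch only and is not the project's TruthMine models: the function name `discover_truths`, the (source, object, value) claim format, the smoothed-accuracy weight update, the fixed iteration count, and the toy claims are all assumptions made for the example.

```python
from collections import defaultdict


def discover_truths(claims, n_iters=10):
    """Illustrative truth discovery for categorical claims.

    claims: list of (source, object, value) triples (hypothetical format).
    Returns (truths, weights): the estimated true value per object and
    the estimated reliability weight per source.
    """
    sources = {s for s, _, _ in claims}
    weights = {s: 1.0 for s in sources}  # start with equal reliability
    truths = {}
    for _ in range(n_iters):
        # Step 1: truth estimation by reliability-weighted voting.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        truths = {o: max(vals, key=vals.get) for o, vals in votes.items()}

        # Step 2: reliability estimation. A source that agrees with the
        # current truths more often gets a higher weight. The +1/+2
        # smoothing is an assumption that keeps weights strictly
        # between 0 and 1.
        agree = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            agree[s] += int(truths[o] == v)
        weights = {s: (agree[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Toy data: sources A and B agree on x and y; C disagrees everywhere.
# Object z has one vote each from C and A, so plain voting is a tie.
claims = [
    ("A", "x", "1"), ("B", "x", "1"), ("C", "x", "2"),
    ("A", "y", "3"), ("B", "y", "3"), ("C", "y", "4"),
    ("C", "z", "6"), ("A", "z", "5"),
]
truths, weights = discover_truths(claims)
print(truths)
print(weights)
```

In this toy input the tie on object z is resolved in favor of source A's value once A's agreement on x and y has raised its weight above C's, which is the behavior the principle describes.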

 

Intellectual Merit:  This proposal addresses a few important research questions for the task of jointly conducting trustworthiness estimation and truth discovery: (i) how to model the cases when multiple values can be true simultaneously for one entity, (ii) how to characterize heterogeneous data types in the process, (iii) how to detect truths effectively in streaming, distributed and large-scale data sets, and (iv) how to derive truths and trustworthiness when data dependencies exist. To address these research questions, we propose to develop an integrated truth discovery framework with the following integral components: (1) a generative model which incorporates two-sided trustworthiness for effective detection of multiple truths; (2) probabilistic and optimization frameworks which allow any loss function for any data type in truth and trustworthiness inference; (3) models that integrate time factors as well as efficient incremental and parallel computation approaches for streaming, distributed and large-scale data; (4) an integrated model that takes care of data and source dependencies in trustworthiness estimation and truth discovery; (5) methods to conduct integrative analysis of conflicting data sources that integrates trustworthiness analysis; and (6) a systematic study on the connections between various models and techniques leading to a unified TruthMine framework. The success of this project will solve several difficult research problems and greatly improve the state-of-the-art in conflict resolution and data fusion.

 

Broader Impacts:  Identifying trustworthy information sources and truths from massive, diverse, conflicting, complex and noisy data is critical for data integration, information understanding and decision making. It is a cornerstone for big data management and data analytics. Our proposed TruthMine framework, which results from our previous long-term collaborative work in this and other related domains, will make tangible contributions to this critical research issue. The newly developed theory, methodologies, algorithms, and software prototype will advance the state-of-the-art and benefit many applications where critical decisions have to be made based on the correct information extracted from diverse sources. Application fields may include healthcare, business intelligence, cyber-security, military and intelligence decision making, bioinformatics, crowd-sourcing, cyber physical systems, social media, and well beyond. Moreover, the proposed work will be integrated with the training of students and new-generation researchers, especially female and minority students. The new research results will be integrated into course materials and projects in data mining, for graduate, undergraduate, and K-12 outreach activities. Tutorials, workshops, and research publications will be made available for broad access. Datasets and software prototypes of this project will be made publicly available.

 

The research results are to be published in various research and application forums and be integrated into the educational programs at UIUC.  The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/conflicts.htm).

Selected Publications and Products:

Books (authored)

 

  1. Yizhou Sun and Jiawei Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012.
  2. Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011.

 

Journal and Refereed Conference Publications

 

1.    Meng Jiang, Christos Faloutsos, Jiawei Han, "CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

2.    Xiang Ren,  Wenqi He,  Meng Qu, Clare R. Voss, Heng Ji, Jiawei Han, "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

3.    Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao, "An Uncertainty-Aware Model to Summarize Trustworthy Quantitative Information", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

4.    Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Mining Reliable Information from Passively and Actively Crowdsourced Data" (conf. tutorial), in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

5.    Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst: Real-time Local Event Detection in Geo-tagged Tweet Stream",  in Proc. of 2016 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'16), Pisa, Italy, July 2016

  6. Xiang Ren, Ahmed El-Kishky, Chi Wang, and Jiawei Han, "Automatic Entity Recognition and Typing in Massive Text Corpora", (Conference tutorial), 2016 Int. World-Wide Web Conf. (WWW'16), Montreal, Canada, April 2016
  7. Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare Voss and Jiawei Han, "Representing Documents via Latent Keyphrase Inference", in Proc. of 2016 Int. World-Wide Web Conf. (WWW'16), Montreal, Canada, April 2016
  8. Min Li, Jingjing Wang, Wenzhu Tong, Hongkun Yu, Xiuli Ma, Yucheng Chen, Haoyan Cai, Jiawei Han, "EKNOT: Event Knowledge from News and Opinions in Twitter", Proc. of AAAI Conf. on Artificial Intelligence (AAAI'16) (system demo), Phoenix, AZ, Feb. 2016
  9. Jingjing Wang, Wenzhu Tong, Hongkun Yu, Min Li, Xiuli Ma, Haoyan Cai, Tim Hanratty, and Jiawei Han, "Mining Multi-Aspect Reflection of News Events in Twitter: Discovery, Linking and Presentation", in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015
  10. Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang and Jiawei Han, "KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks", in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015
  11. Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", PVLDB 8(3): 305-316, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.
  12. Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han, "A Confidence-Aware Approach for Truth Discovery on Long-Tail Data", PVLDB 8(4): 425-436, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.
  13. Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective", PVLDB 8(12): 2048-2059, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15) (conference tutorial), Kohala Coast, Hawaii, Sept. 2015.
  14. Shi Zhi, Jiawei Han, and Quanquan Gu, "Robust Classification of Information Networks by Consistent Graph Learning", in Proc. of 2015 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECMLPKDD'15), Porto, Portugal, Sept. 2015.
  15. Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, and Jiawei Han, "Fast Inbound Top-K Query for Random Walk with Restart", in Proc. of 2015 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECMLPKDD'15), Porto, Portugal, Sept. 2015. (Received the Best Student Paper Runner-Up award at ECML/PKDD 2015)
  16. Chao Zhang, Yu Zheng, Xiuli Ma, Jiawei Han, "Assembler: Efficient Discovery of Spatial Coevolving Patterns in Massive Geosensory Data", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  17. Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  18. Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, and Jiawei Han, "FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  19. Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han, "Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  20. Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han, "On the Discovery of Evolving Truth", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  21. Shi Zhi, Bo Zhao, Wenzhu Tong, Jing Gao, Dian Yu, Heng Ji, and Jiawei Han, "Modeling Truth Existence in Truth Discovery", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  22. Younghoon Kim, Jiawei Han, Cangzhou Yuan, "TOPTRAC: Topical Trajectory Pattern Mining", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  23. Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, Dan Roth, Ming Zhang, Jiawei Han, "Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  24. Honglei Zhuang, Aditya Parameswaran, Dan Roth, Jiawei Han, "Debiasing Crowdsourced Batches", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  25. Xiang Ren, Ahmed El-Kishky, Chi Wang, and Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach" (conference tutorial), 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  26. Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", in Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015 (won Grand Prize in Yelp Dataset Challenge, 2015)
  27. Fangbo Tao, Bo Zhao, Ariel Fuxman, Yang Li, Jiawei Han, "Leveraging Pattern Semantics for Constructing Entity Taxonomies in Enterprises", in Proc. of 2015 Int. Conf. on World-Wide Web (WWW'15), Florence, Italy, May 2015
  28. Huan Gui, Ya Xu, Anmol Bhasin, Jiawei Han, "Network A/B Testing: From Sampling to Estimation", in Proc. of 2015 Int. Conf. on World-Wide Web (WWW'15), Florence, Italy, May 2015
  29. Jialu Liu, Chi Wang, Jing Gao, Quanquan Gu, Charu Aggarwal, Lance Kaplan, and Jiawei Han, "GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data", in Proc. of 2015 SIAM Int. Conf. on Data Mining (SDM'15), Vancouver, Canada, Apr. 2015 (selected as one of the best papers in the conference and invited to journal Statistical Analysis and Data Mining (SADM) special issue "Best of SDM 2015")
  30. Mengting Wan, Yunbo Ouyang, Lance Kaplan, Jiawei Han, "Graph Regularized Meta-path Based Transductive Regression in Heterogeneous Information Network", in Proc. of 2015 SIAM Int. Conf. on Data Mining (SDM'15), Vancouver, Canada, Apr. 2015
  31. Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han, "Outlier Detection for Temporal Data: A Survey", IEEE Trans. on Knowledge and Data Engineering, 26(9):2250-2267, 2014.
  32. Wei Shen, Jiawei Han, and Jianyong Wang, "A Probabilistic Model for Linking Named Entities in Web Text with Heterogeneous Information Networks", Proc. of 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'14), Snowbird, UT, June 2014
  33. Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han, "Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation", Proc. of 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'14), Snowbird, UT, June 2014

 

Project Impact

·         Education:  Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to both undergraduate and graduate students in the Department of Computer Science, University of Illinois at Urbana-Champaign. Moreover, the research results have been, and will continue to be, published in a timely manner in international conferences and journals and distributed world-wide for education and research. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.

·         Collaborations: For this project we have established collaborations with Boeing, ARL, NASA, IBM T.J. Watson Research Center, Yahoo! Labs, Microsoft Research, Google Research, and NCSA (National Center for Supercomputing Applications). Through such collaborations we expect to have access to real datasets and applications and produce more research results.

 

Current and Future Activities

·         The following are some of the highlights of our ongoing work.  Please refer to the Selected Publications and Products section for related references.

Area Background

·         This project builds on previous research on data mining, information network analysis, spatiotemporal data analysis, and data cube and multidimensional analysis.

·         Many research papers have been published on these themes. Several textbooks on data mining, information retrieval and information network analysis provide good overviews of the principles and algorithms, including (Han, Kamber and Pei, 2011) and (Sun and Han, 2012).

 

Area References

·         Xin Luna Dong, Alon Halevy, and Cong Yu, “Data integration with uncertainty”, VLDB Journal, 2010.

·         X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562–573, 2009.

·         J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. IEEE Transactions on Knowledge and Data Engineering, 25(1):15–28, 2013.

·         M. Gupta and J. Han. Heterogeneous network-based trust analysis: A survey. SIGKDD Explorations, 13(1):54–71, 2011.

·         A. D. Sarma, X. L. Dong, and A. Halevy. Data integration with dependent sources. In Proc. of the International Conference on Extending Database Technology (EDBT’11), pages 401–412, 2011.

·         D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, “On truth discovery in social sensing: A maximum likelihood estimation approach”, in Proc. of the ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN’12), pages 233–244, 2012.

·         V. Vydiswaran, C. Zhai, and D. Roth, “Content-driven trust propagation framework”, in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 974–982, 2011.

·         P. Yu, J. Han, and C. Faloutsos, editors. Link Mining: Models, Algorithms, and Applications. Springer, 2010.

Potential Related Projects

·         Any project related to truth discovery, information fusion, information and social network analysis, spatiotemporal data mining, and knowledge discovery.

Project Web site URL:  http://www.cs.uiuc.edu/~hanj/projs/conflicts.htm

Online software:  Online software related to this project can be downloaded at www.illimine.cs.uiuc.edu

Online resources:  Research publications related to this project can be downloaded via the Selected Publications list.