NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery

National Science Foundation Award Number: NSF IIS 1320617 (08-01-2013—07-31-2017)

 

Contact Information

 

Jiawei Han, co-PI
Department of Computer Science
University of Illinois, Urbana-Champaign
201 N. Goodwin Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903

Fax: (217) 265-6494

E-mail: hanj at cs.uiuc.edu

URL: http://www.cs.uiuc.edu/~hanj

 

Jing Gao,  PI
Department of Computer Science and Engineering
University at Buffalo,
Buffalo, NY 14260, U.S.A.
Office: (716) 645-1586

E-mail: jing at buffalo.edu

URL: https://www.cse.buffalo.edu/~jing/

 

List of Supported Students and Staff at UIUC

 

§  Shi Zhi, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

§  Jingbo Shang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

§  Jingjing Wang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

§  Wenqi He, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

§  Quan Yuan, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

Project Award Information

Acknowledgement

·         This material is based upon work supported by the National Science Foundation under Grant No. NSF IIS 1320617 (08/01/2013–07/31/2017), “NSF III: Small: Collaborative Research: Conflicts to Harmony: Integrating Massive Data by Trustworthiness Estimation and Truth Discovery”.  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Collaborative project website:

·         PI: Professor Jing Gao (https://www.cse.buffalo.edu/~jing)

·         The collaborative project website: https://www.cse.buffalo.edu//~jing/truth.htm. 

Project Summary

Big data brings big challenges, not only in the volume of data but also in its dynamics and variety. Multiple descriptions of the same objects or events from different sources will unavoidably lead to inconsistent data or information. Among conflicting pieces of data or information, which one is more trustworthy, or represents the true fact? Given the daunting scale of the data, it is unrealistic to expect humans to “label” which data source is more reliable or which piece of information is correct. Our position is to detect truths without supervision, by integrating source reliability estimation and truth finding. Recent studies follow a general principle: truth is obtained by weighted voting among multiple sources, where more reliable sources have higher weights and sources that tell the truth more often are regarded as more reliable. However, many issues remain unsolved. We propose to integrate previous studies on this issue and develop a unified approach that combines probabilistic and optimization models with multidimensional trustworthiness factors. This will lead to a set of efficient and effective methods, technologies and software systems for truth inference from multiple conflicting sources of heterogeneous, disparate, correlated, gigantic, scattered, and streaming data.
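The weighted-voting principle described in the summary can be illustrated with a small sketch. This is a generic toy illustration, not the project's TruthMine implementation: the function name `discover_truths`, the `claims` layout, the add-one smoothing, and the toy data are all assumptions made for the example. It alternates between (1) picking, for each object, the value with the largest total source weight and (2) re-estimating each source's weight from how often it agrees with the current truths.

```python
from collections import defaultdict


def discover_truths(claims, n_iter=10):
    """Iterative weighted voting over categorical claims.

    claims maps source -> {object: claimed value} (hypothetical layout).
    Returns (truths, weights).
    """
    weights = {source: 1.0 for source in claims}  # start with equal trust
    truths = {}
    for _ in range(n_iter):
        # Step 1: each object takes the value with the largest total weight.
        votes = defaultdict(lambda: defaultdict(float))
        for source, observed in claims.items():
            for obj, value in observed.items():
                votes[obj][value] += weights[source]
        truths = {obj: max(tally, key=tally.get) for obj, tally in votes.items()}

        # Step 2: a source's weight is its smoothed agreement with the
        # current truths (add-one smoothing is an illustrative choice).
        for source, observed in claims.items():
            hits = sum(truths[obj] == value for obj, value in observed.items())
            weights[source] = (hits + 1) / (len(observed) + 2)
    return truths, weights


# Toy data: A and B agree everywhere; C, D, and E are wrong on o1..o4 in
# different ways but jointly outvote A and B on o5.
objects = ["o1", "o2", "o3", "o4", "o5"]
claims = {
    "A": {obj: "t" for obj in objects},
    "B": {obj: "t" for obj in objects},
    "C": {"o1": "c", "o2": "c", "o3": "c", "o4": "c", "o5": "f"},
    "D": {"o1": "d", "o2": "d", "o3": "d", "o4": "d", "o5": "f"},
    "E": {"o1": "e", "o2": "e", "o3": "e", "o4": "e", "o5": "f"},
}

majority, _ = discover_truths(claims, n_iter=1)  # one pass = plain majority vote
truths, weights = discover_truths(claims)
print(majority["o5"], truths["o5"])
```

With a single pass the result for `o5` is the plain majority answer `f`; after further iterations, sources C, D, and E lose weight because they disagree with the consensus on the other objects, and the answer for `o5` switches to `t`.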

 

Intellectual Merit:  This proposal addresses several important research questions for the task of jointly conducting trustworthiness estimation and truth discovery: (i) how to model cases where multiple values can be true simultaneously for one entity, (ii) how to characterize heterogeneous data types in the process, (iii) how to detect truths effectively in streaming, distributed and large-scale data sets, and (iv) how to derive truths and trustworthiness when data dependencies exist.  To address these research questions, we propose to develop an integrated truth discovery framework with the following integral components: (1) a generative model that incorporates two-sided trustworthiness for effective detection of multiple truths; (2) probabilistic and optimization frameworks that allow any loss function for any data type in truth and trustworthiness inference; (3) models that integrate time factors, together with efficient incremental and parallel computation approaches for streaming, distributed and large-scale data; (4) an integrated model that handles data and source dependencies in trustworthiness estimation and truth discovery; (5) methods for integrative analysis of conflicting data sources that incorporate trustworthiness analysis; and (6) a systematic study of the connections between various models and techniques, leading to a unified TruthMine framework. The success of this project will solve several difficult research problems and greatly improve the state of the art in conflict resolution and data fusion.
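As a generic illustration of what an optimization framework with a pluggable loss function can look like, one common way to write such an objective is sketched below. It is not necessarily the formulation used in this project, and the symbols are assumptions introduced for the example: v_i^k is the value source k claims for object i, v_i^* is the estimated truth, w_k is the weight of source k, and d is a loss chosen for the data type (for instance 0-1 loss for categorical values or squared loss for continuous values).

```latex
\min_{\{v_i^*\},\,\{w_k\}} \; \sum_{k=1}^{K} w_k \sum_{i=1}^{N} d\bigl(v_i^*, v_i^{k}\bigr)
\quad \text{s.t.} \quad \sum_{k=1}^{K} \exp(-w_k) = 1

% Alternating updates (block coordinate descent):
w_k \leftarrow -\log \frac{\sum_{i=1}^{N} d\bigl(v_i^*, v_i^{k}\bigr)}
                          {\sum_{k'=1}^{K} \sum_{i=1}^{N} d\bigl(v_i^*, v_i^{k'}\bigr)},
\qquad
v_i^* \leftarrow \arg\min_{v} \sum_{k=1}^{K} w_k \, d\bigl(v, v_i^{k}\bigr)
```

Under these assumptions, a source whose claims incur less total loss receives a larger weight. The truth update reduces to a weighted vote under 0-1 loss and to a weighted mean under squared loss.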

 

Broader Impacts:  Identifying trustworthy information sources and truths from massive, diverse, conflicting, complex and noisy data is critical for data integration, information understanding and decision making.  It is a cornerstone of big data management and data analytics. Our proposed TruthMine framework, which results from our long-term collaborative work in this and related domains, will make tangible contributions to this critical research issue. The new theory, methodologies, algorithms, and software prototype will advance the state of the art and benefit many applications where critical decisions have to be made based on correct information extracted from diverse sources. Application fields may include healthcare, business intelligence, cyber-security, military and intelligence decision making, bioinformatics, crowd-sourcing, cyber-physical systems, social media, and well beyond. Moreover, the proposed work will be integrated with the training of students and a new generation of researchers, especially female and minority students.  The new research results will be integrated into course materials and projects in data mining for graduate, undergraduate, and K-12 outreach activities. Tutorials, workshops, and research publications will be made available for broad access. Datasets and software prototypes of this project will be made publicly available.

 

The research results are to be published in various research and application forums and be integrated into the educational programs at UIUC.  The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/conflicts.htm).

Selected Publications and Products:

Books (authored)

 

·         Jialu Liu, Jingbo Shang, and Jiawei Han, Phrase Mining from Massive Text and Its Applications, Synthesis Lectures on Data Mining and Knowledge Discovery, Morgan & Claypool Publishers, 2017.

 

Journal and Refereed Conference Publications

 

  1. Jingbo Shang, Meng Jiang, Wenzhu Tong, Jinfeng Xiao, Jian Peng, Jiawei Han. "DPPred: An Effective Prediction Framework with Concise Discriminative Patterns", accepted by IEEE Transactions on Knowledge and Data Engineering, Sept. 2017
  2. Chao Zhang, Liyuan Liu, Dongming Lei, Quan Yuan, Honglei Zhuang, Tim Hanratty and Jiawei Han, "TrioVecEvent: Embedding-Based Online Local Event Detection in Geo-Tagged Tweet Streams", in Proc. of 2017 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'17), Halifax, Nova Scotia, Canada, Aug. 2017
  3. Chao Zhang, Keyang Zhang, Quan Yuan, Fangbo Tao, Luming Zhang, Tim Hanratty, Jiawei Han, "ReAct: Online Multimodal Embedding for Recency-Aware Spatiotemporal Activity Modeling", In Proc. of 2017 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'17), Tokyo, Japan, Aug. 2017 
  4. Jingbo Shang, Xiang Ren, Meng Jiang, and Jiawei Han, "Mining Entity-Relation-Attribute Structures from Massive Text Data" (conference tutorial), Proc. of 2017 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'17), Halifax, Nova Scotia, Canada, August 2017.
  5. Chao Zhang, Dongming Lei, Quan Yuan, Honglei Zhuang, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst+: Effective and Real-Time Local Event Detection in Geo-Tagged Tweet Streams", accepted by ACM Transactions on Intelligent Systems and Technology (ACM TIST), 2017
  6. Xiang Ren, Meng Jiang, Jingbo Shang and Jiawei Han, "Building Structured Databases of Factual Knowledge from Massive Text Corpora" (conference tutorial), Proc. of 2017 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'17), June 2017.
  7. Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare Voss, Heng Ji, Tarek Abdelzaher and Jiawei Han, "CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases", in Proc. of 2017 World-Wide Web Conf. (WWW'17), Perth, Australia, Apr. 2017.
  8. Chao Zhang, Keyang Zhang, Quan Yuan, Haoruo Peng, Yu Zheng, Tim Hanratty, Shaowen Wang and Jiawei Han, "Regions, Periods, Activities: Uncovering Urban Dynamics via Cross-Modal Representation Learning",  in Proc. of 2017 World-Wide Web Conf. (WWW'17), Perth, Australia, Apr. 2017. 
  9. Xiang Ren, Meng Jiang, Jingbo Shang and Jiawei Han, "Constructing Structured Information Networks from Massive Text Corpora" (conference tutorial), Proc. of 2017 World-Wide Web Conf. (WWW'17), Perth, Australia, Apr. 2017.
  10. Chao Zhang, Quan Yuan, and Jiawei Han, "Bringing Semantics to Spatiotemporal Data Mining: Challenges, Methods, and Applications" (conference tutorial), Proc. of 2017 IEEE Int. Conf. on Data Engineering (ICDE'17), San Diego, California, Apr. 2017.
  11. Xiang Ren, Yuanhua Lv, Kuansan Wang and Jiawei Han, "Comparative Document Analysis for Large Text Corpora", in Proc. of 2017 ACM  Int. Conf. on Web Search and Data Mining (WSDM'17), Cambridge UK, Feb. 2017
  12. Quan Yuan, Wei Zhang, Chao Zhang, Xinhe Geng, Gao Cong and Jiawei Han, "Periodic Region Detection for Mobility Modeling of Social Media Users", in Proc. of 2017 ACM  Int. Conf. on Web Search and Data Mining (WSDM'17), Cambridge UK, Feb. 2017.

  13. Meng Jiang, Christos Faloutsos, Jiawei Han, "CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
  14. Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Jiawei Han, "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
  15. Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao, "An Uncertainty-Aware Model to Summarize Trustworthy Quantitative Information", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
  16. Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Mining Reliable Information from Passively and Actively Crowdsourced Data" (conf. tutorial), in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016
  17. Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst: Real-time Local Event Detection in Geo-tagged Tweet Stream", in Proc. of 2016 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'16), Pisa, Italy, July 2016

  18. Xiang Ren, Ahmed El-Kishky, Chi Wang, and Jiawei Han, "Automatic Entity Recognition and Typing in Massive Text Corpora" (conference tutorial), 2016 Int. World-Wide Web Conf. (WWW'16), Montreal, Canada, April 2016
  19. Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare Voss and Jiawei Han, "Representing Documents via Latent Keyphrase Inference", in Proc. of 2016 Int. World-Wide Web Conf. (WWW'16), Montreal, Canada, April 2016
  20. Min Li, Jingjing Wang, Wenzhu Tong, Hongkun Yu, Xiuli Ma, Yucheng Chen, Haoyan Cai, Jiawei Han, "EKNOT: Event Knowledge from News and Opinions in Twitter", Proc. of AAAI Conf. on Artificial Intelligence (AAAI'16) (system demo), Phoenix, AZ, Feb. 2016
  21. Jingjing Wang, Wenzhu Tong, Hongkun Yu, Min Li, Xiuli Ma, Haoyan Cai, Tim Hanratty, and Jiawei Han, "Mining Multi-Aspect Reflection of News Events in Twitter: Discovery, Linking and Presentation", in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015
  22. Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang and Jiawei Han, "KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks", in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015
  23. Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", PVLDB 8(3): 305-316, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.
  24. Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han, "A Confidence-Aware Approach for Truth Discovery on Long-Tail Data", PVLDB 8(4): 425-436, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.
  25. Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective", PVLDB 8(12): 2048-2059, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15) (conference tutorial), Kohala Coast, Hawaii, Sept. 2015.
  26. Shi Zhi, Jiawei Han, and Quanquan Gu, "Robust Classification of Information Networks by Consistent Graph Learning", in Proc. of 2015 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECMLPKDD'15), Porto, Portugal, Sept. 2015.
  27. Chao Zhang, Shan Jiang, Yucheng Chen, Yidan Sun, and Jiawei Han, "Fast Inbound Top-K Query for Random Walk with Restart", in Proc. of 2015 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECMLPKDD'15), Porto, Portugal, Sept. 2015.  (Received the Best Student Paper Runner-Up award at ECML/PKDD 2015)
  28. Chao Zhang, Yu Zheng, Xiuli Ma, Jiawei Han, "Assembler: Efficient Discovery of Spatial Coevolving Patterns in Massive Geosensory Data", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  29. Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  30. Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, and Jiawei Han, "FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  31. Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han, "Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  32. Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han, "On the Discovery of Evolving Truth", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  33. Shi Zhi, Bo Zhao, Wenzhu Tong, Jing Gao, Dian Yu, Heng Ji, and Jiawei Han, "Modeling Truth Existence in Truth Discovery", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  34. Younghoon Kim, Jiawei Han, Cangzhou Yuan, "TOPTRAC: Topical Trajectory Pattern Mining", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  35. Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, Dan Roth, Ming Zhang, Jiawei Han, "Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  36. Honglei Zhuang, Aditya Parameswaran, Dan Roth, Jiawei Han, "Debiasing Crowdsourced Batches", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  37. Xiang Ren, Ahmed El-Kishky, Chi Wang, and Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach" (conference tutorial), 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015
  38. Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", in Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15), Melbourne, Australia, May 2015 (won Grand Prize in Yelp Dataset Challenge, 2015)
  39. Fangbo Tao, Bo Zhao, Ariel Fuxman, Yang Li, Jiawei Han, "Leveraging Pattern Semantics for Constructing Entity Taxonomies in Enterprises", in Proc. of 2015 Int. Conf. on World-Wide Web (WWW'15), Florence, Italy, May 2015
  40. Huan Gui, Ya Xu, Anmol Bhasin, Jiawei Han, "Network A/B Testing: From Sampling to Estimation", in Proc. of 2015 Int. Conf. on World-Wide Web (WWW'15), Florence, Italy, May 2015
  41. Jialu Liu, Chi Wang, Jing Gao, Quanquan Gu, Charu Aggarwal, Lance Kaplan, and Jiawei Han, "GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data", in Proc. of 2015 SIAM Int. Conf. on Data Mining (SDM'15), Vancouver, Canada, Apr. 2015 (selected as one of the best papers in the conference and invited to journal Statistical Analysis and Data Mining (SADM) special issue "Best of SDM 2015")
  42. Mengting Wan, Yunbo Ouyang, Lance Kaplan, Jiawei Han, "Graph Regularized Meta-path Based Transductive Regression in Heterogeneous Information Network", in Proc. of 2015 SIAM Int. Conf. on Data Mining (SDM'15), Vancouver, Canada, Apr. 2015

Project Impact

·         Education:  Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to both undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign.  Moreover, the research results have been and will continue to be published in a timely manner in international conferences and journals and distributed worldwide for education and research.  The new progress will also be integrated into the new edition of our data mining textbook and other research collections.

·         Collaborations: For this project we have established collaborations with Boeing, ARL, NASA, IBM T.J. Watson Research Center, Yahoo! Labs, Microsoft Research, Google Research, and NCSA (National Center for Supercomputing Applications).  Through these collaborations we expect to gain access to real datasets and applications and to produce more research results.

 

Current and Future Activities

·         The following are some of the highlights of our ongoing work.  Please refer to the Selected Publications and Products section for related references.

Area Background

·         This project is based on the previous research on data mining, information network analysis, spatiotemporal data analysis, and data cube and multidimensional analysis.   

·         There have been many research papers published on these themes.   Several textbooks on data mining, information retrieval and information network analysis provide good overviews of the principles and algorithms, including (Han, Kamber and Pei, 2011) and (Sun and Han 2012).

 

Area References

·         Xin Luna Dong, Alon Halevy, and Cong Yu, “Data integration with uncertainty”, VLDB Journal, 2010.

·         X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562–573, 2009.

·         J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. A graph-based consensus maximization approach for combining multiple supervised and unsupervised models. IEEE Transactions on Knowledge and Data Engineering, 25(1):15–28, 2013.

·         M. Gupta and J. Han. Heterogeneous network-based trust analysis: A survey. SIGKDD Explorations, 13(1):54–71, 2011.

·         A. D. Sarma, X. L. Dong, and A. Halevy. Data integration with dependent sources. In Proc. of the International Conference on Extending Database Technology (EDBT’11), pages 401–412, 2011

·         D. Wang, L. Kaplan, H. Le, and T. Abdelzaher, “On truth discovery in social sensing: A maximum likelihood estimation approach”, in Proc. of the ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN’12), pages 233–244, 2012.

·         V. Vydiswaran, C. Zhai, and D. Roth, “Content-driven trust propagation framework”, in Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’11), pages 974–982, 2011.

·         P. Yu, J. Han, and C. Faloutsos, editors. Link Mining: Models, Algorithms, and Applications. Springer, 2010

Potential Related Projects

·         Any project related to truth discovery, information fusion, information and social network analysis, spatiotemporal data mining, and knowledge discovery.

Project Web site URL:  http://www.cs.uiuc.edu/~hanj/projs/conflicts.htm

Online software (code):  downloadable at www.illimine.cs.uiuc.edu, and on GitHub under each paper's first author

Online resources:  Research publications related to this project can be downloaded at Selected Publications 

Education Materials: CS412+CS512 (http://hanj.cs.illinois.edu) and Coursera: UIUC, Data Mining Specialization

Date of last update: Nov. 2017