NSF III: Small: Multi-Dimensional Structuring, Summarizing and Mining of Social Media Data

National Science Foundation Award Number: NSF IIS 16-18481 (08-01-2016 to 07-31-2019)

 

Contact Information

 

         Jiawei Han, PI
Department of Computer Science
University of Illinois, Urbana-Champaign
201 N. Goodwin Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903

Fax: (217) 265-6494

E-mail: hanj at cs.uiuc.edu

URL: http://www.cs.uiuc.edu/~hanj

 

List of Supported Students and Staff

 

         Meng Jiang, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign

         Quan Yuan, Postdoc Research Fellow, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

         Ahmed El-Kishky, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

         Xiang Ren, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign (collaborative)

         Jiaming Shen, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

         Chao Zhang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

         Honglei Zhuang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign

Project Award Information

         Award Number: NSF IIS 16-18481

         Duration: 08/01/2016 - 07/31/2019

         Title: NSF III: Small: Multi-Dimensional Structuring, Summarizing and Mining of Social Media Data

         Keywords: Big data; data mining; social media analysis; data integration; text mining; text summarization and OLAP; information trustworthiness analysis; information network analysis; efficiency and scalability; applications

Project Summary

         Social media of many kinds have changed how billions of users across the globe obtain and share information. This creates great opportunities but also poses tremendous challenges for understanding, summarizing, and mining such data, owing to its huge volume and the dynamic, unstructured nature of its text content. In response to these challenges, this project focuses on text-based social media and proposes a multi-dimensional data structuring approach, which mines unstructured social media data to uncover its hidden multi-dimensional structures. The project investigates principles, methodologies, and algorithms for social media structuring, summarizing, and mining, and develops effective and scalable technology for multi-dimensional social media data analysis. The principles and methodologies developed in this study can also be extended to scalable, multi-dimensional analysis of other kinds of massive unstructured data.

 

         To conduct effective multi-dimensional social media structuring, this project develops a distant supervision-based methodology that requires minimal human curation and labeling. It takes data in Wikipedia, Freebase, or other knowledge bases as references, integrates social media data with the corresponding news or other relevant documents, conducts phrase mining and entity and event discovery and typing, and uncovers critical aspects, attributes, and values associated with such entities and events in social media. Once social media data is organized in a structured way, massive social media can be summarized effectively in a context-aware semantic OLAP (online analytical processing) framework and analyzed systematically under a general multi-dimensional social media querying and mining framework for many tasks, such as modeling behavioral patterns, uncovering bursty events, and detecting social fraud or anomalies.
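
         As a purely illustrative sketch (not the project's implementation), the idea of summarizing structured social media along chosen dimensions can be shown with a simple count-based roll-up. The sample records, the dimension names (topic, location, month), and the function roll_up are all hypothetical; the sketch assumes each post has already been mapped to dimension values by an upstream structuring step.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence

# Hypothetical structured records: each post is assumed to have been
# assigned dimension values by an upstream extraction/typing step.
POSTS = [
    {"topic": "flood", "location": "Chicago", "month": "2016-08"},
    {"topic": "flood", "location": "Chicago", "month": "2016-08"},
    {"topic": "flood", "location": "Urbana", "month": "2016-09"},
    {"topic": "election", "location": "Chicago", "month": "2016-09"},
]


def roll_up(records: Iterable[Dict[str, str]], dims: Sequence[str]) -> Counter:
    """Count records grouped by the chosen dimensions (an OLAP-style roll-up).

    Dimensions left out of `dims` are aggregated away, so an empty `dims`
    yields a single cell holding the total number of records.
    """
    return Counter(tuple(record[d] for d in dims) for record in records)


# Coarse view: posts per topic.
by_topic = roll_up(POSTS, ["topic"])
# Finer view (drill-down): posts per (topic, location) cell.
by_topic_location = roll_up(POSTS, ["topic", "location"])
```

         Choosing a shorter or longer list of dimensions corresponds to rolling up or drilling down; a real system would replace the plain count with a richer, context-aware summary of each cell.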

 

Intellectual Merit:

         We propose a multi-dimensional data structuring approach, which mines unstructured social media data to uncover its hidden multi-dimensional structures. Multi-dimensional structuring will involve integrating social media data with news, Wikipedia, Freebase, and other knowledge-base data; conducting phrase mining and entity/event discovery and typing; and uncovering aspects associated with such entities and events. Organizing massive social media data in a conceptually structured way will make it easier to understand and summarize social media information effectively, support context-aware semantic OLAP, and facilitate multi-dimensional mining of social media data, such as finding bursty events and detecting anomalies.

         To systematically develop this approach, we organize the proposal into three themes: (1) multidimensional structuring of social media data, (2) context-aware summarization in multi-dimensional space, and (3) a general framework for multidimensional social media mining. We will develop principles, methodologies, and algorithms along these three lines of research and generate effective and scalable technology for multi-dimensional social media data structuring, summarization, and mining.

         Building on our existing work, this project has the following intellectual merit. (1) Developing new principles, methods, and technologies for structuring, summarizing, and mining of massive, time-evolving social media data: New technologies will be developed for entity extraction/typing, aspect discovery, context-aware semantic OLAP, and multidimensional event discovery and anomaly mining, thus advancing the state of the art; (2) Enriching the principles and technologies of data mining: Structuring and mining massive, dynamic, and unstructured data, such as social media data, is a major challenge in data mining.

 

Broader Impacts:

         With tremendous amounts of social media data being generated in all aspects of our society, this project will have the following broad impacts: (1) Benefits our social-media-permeated society: Social media penetrates every aspect of our life. By enhancing our power to analyze social media, the project will benefit our society in many ways; (2) Benefits data mining and information technology: New technologies and tools will be generated for mining massive unstructured data and will be transferred to ARL and others, as we have done before; (3) Benefits education and training: The project will train a good number of researchers, especially female and minority students, and will educate a great number of undergraduates and graduates via our research publications, tutorials, massive online courses, workshops, and demo systems.

         This project focuses on text-based social media, not on in-depth analysis of image, audio, and video data. Also, we will use publicly accessible social media data (e.g., publicly released tweets) with no links to users' personal information.

         The research results are to be published in various research and application forums and be integrated into the educational programs at UIUC. The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/social_media.htm).

Selected Publications and Products:

Books (authored)

 

1.      Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011.

2.      Yizhou Sun and Jiawei Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012.

3.      Chi Wang and Jiawei Han, Mining Latent Entity Structures, Morgan & Claypool Publishers, 2015.

 

Journal and Refereed Conference Publications

 

1.      Xiang Ren,  Wenqi He,  Meng Qu, Clare R. Voss, Heng Ji, Jiawei Han, "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

2.      Meng Jiang, Christos Faloutsos, Jiawei Han, "CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

3.      Mengting Wan, Xiangyu Chen, Lance Kaplan, Jiawei Han, Jing Gao, Bo Zhao, "An Uncertainty-Aware Model to Summarize Trustworthy Quantitative Information", in Proc. of 2016 ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'16), San Francisco, CA, Aug. 2016

4.      Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, Jiawei Han, "GeoBurst: Real-time Local Event Detection in Geo-tagged Tweet Stream",  in Proc. of 2016 ACM SIGIR Conf. on Research & Development in Information Retrieval (SIGIR'16), Pisa, Italy, July 2016

5.      Jialu Liu, Xiang Ren, Jingbo Shang, Taylor Cassidy, Clare Voss and Jiawei Han, "Representing Documents via Latent Keyphrase Inference", in Proc. of 2016 Int. World-Wide Web Conf. (WWW'16), Montreal, Canada, April 2016

6.      Min Li, Jingjing Wang, Wenzhu Tong, Hongkun Yu, Xiuli Ma, Yucheng Chen, Haoyan Cai, Jiawei Han, "EKNOT: Event Knowledge from News and Opinions in Twitter", Proc. of AAAI Conf. on Artificial Intelligence (AAAI'16) (system demo), Phoenix, AZ, Feb. 2016

7.      Jingjing Wang, Wenzhu Tong, Hongkun Yu, Min Li, Xiuli Ma, Haoyan Cai, Tim Hanratty, and Jiawei Han, "Mining Multi-Aspect Reflection of News Events in Twitter: Discovery, Linking and Presentation",  in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015

8.      Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang and Jiawei Han, "KnowSim: A Document Similarity Measure on Structured Heterogeneous Information Networks", in Proc. of 2015 IEEE Int. Conf. on Data Mining (ICDM'15), Atlantic City, NJ, Nov. 2015

9.      Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", PVLDB 8(3): 305-316, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.

10.  Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han, "A Confidence-Aware Approach for Truth Discovery on Long-Tail Data", PVLDB 8(4): 425-436, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015.

11.  Jing Gao, Qi Li, Bo Zhao, Wei Fan, and Jiawei Han, "Truth Discovery and Crowdsourcing Aggregation: A Unified Perspective", PVLDB 8(12): 2048-2059, 2015. Also, in Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15) (conference tutorial), Kohala Coast, Hawaii, Sept. 2015.

12.  Shi Zhi, Jiawei Han, and Quanquan Gu, "Robust Classification of Information Networks by Consistent Graph Learning",  in Proc. of 2015 European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery in Databases (ECMLPKDD'15), Porto, Portugal, Sept. 2015.

13.  Chao Zhang, Yu Zheng, Xiuli Ma,  Jiawei Han, "Assembler: Efficient Discovery of Spatial Coevolving Patterns in Massive Geosensory Data", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

14.  Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Heng Ji, Jiawei Han, "ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

15.  Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, and Jiawei Han, "FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

16.  Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han, "Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

17.  Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han, "On the Discovery of Evolving Truth", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

18.  Shi Zhi, Bo Zhao, Wenzhu Tong, Jing Gao, Dian Yu, Heng Ji, and Jiawei Han, "Modeling Truth Existence in Truth Discovery", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

19.  Younghoon Kim, Jiawei Han, Cangzhou Yuan, "TOPTRAC: Topical Trajectory Pattern Mining", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

20.  Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, Dan Roth, Ming Zhang, Jiawei Han, "Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

21.  Honglei Zhuang, Aditya Parameswaran, Dan Roth, Jiawei Han, "Debiasing Crowdsourced Batches", in Proc. of 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

22.  Xiang Ren, Ahmed El-Kishky, Chi Wang, and Jiawei Han, "Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach" (conference tutorial), 2015 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'15), Sydney, Australia, Aug. 2015

23.  Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han, "Mining Quality Phrases from Massive Text Corpora",  in Proc. of 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15),  Melbourne, Australia, May 2015 (won Grand Prize in Yelp Dataset Challenge, 2015)

24.  Fangbo Tao, Bo Zhao, Ariel Fuxman, Yang Li, Jiawei Han, "Leveraging Pattern Semantics for Constructing Entity Taxonomies in Enterprises", in Proc. of 2015 Int. Conf. on World-Wide Web (WWW'15), Florence, Italy, May 2015

25.  Jialu Liu, Chi Wang, Jing Gao, Quanquan Gu, Charu Aggarwal, Lance Kaplan, and Jiawei Han, "GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data", in Proc. of 2015 SIAM Int. Conf. on Data Mining (SDM'15), Vancouver, Canada, Apr. 2015 (selected as one of the best papers in the conference and invited to journal Statistical Analysis and Data Mining (SADM) special issue "Best of SDM 2015")

26.  Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han, "Outlier Detection for Temporal Data: A Survey",  IEEE Trans. on Knowledge and Data Engineering, 26(9):2250-2267, 2014.

27.  Wei Shen, Jiawei Han, and Jianyong Wang, "A Probabilistic Model for Linking Named Entities in Web Text with Heterogeneous Information Networks", Proc. of 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'14), Snowbird, UT, June 2014

28.  Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han, "Resolving Conflicts in Heterogeneous Data by Truth Discovery and Source Reliability Estimation", Proc. of 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'14), Snowbird, UT, June 2014

 

Project Impact

 

         Education: Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Moreover, the research results have been, and will continue to be, published promptly in international conferences and journals and distributed worldwide for education and research. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.

         Collaborations: For this project we have established collaborations with Boeing, ARL, NASA, IBM T.J. Watson Research Center, Yahoo! Labs, Microsoft Research, Google Research, and NCSA (National Center for Supercomputing Applications). Through such collaborations we expect to gain access to real datasets and applications and to produce more research results.

 

Current and Future Activities

         The following are some of the highlights of our ongoing work. Please refer to the Selected Publications and Products section for related references.

Area Background

         This project builds on previous research on data mining, information network analysis, spatiotemporal data analysis, and data cube and multidimensional analysis.

         Many research papers have been published on these themes. Several textbooks on data mining, information retrieval, and information network analysis provide good overviews of the principles and algorithms, including (Han, Kamber and Pei, 2011) and (Sun and Han, 2012).

 

Area References

         P. Yu, J. Han, and C. Faloutsos, editors. Link Mining: Models, Algorithms, and Applications. Springer, 2010.

         X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562-573, 2009.

         Xiaoxin Yin, Jiawei Han and Philip S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", IEEE Transactions on Knowledge and Data Engineering, 20(6):796-808, 2008.

         Bo Zhao, Benjamin I. P. Rubinstein, Jim Gemmell, and Jiawei Han, "A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration", PVLDB 5(6): 550-561 (2012) (Also, Proc. 2012 Int. Conf. on Very Large Data Bases (VLDB'12), Istanbul, Turkey, Aug. 2012)

Potential Related Projects

         Any project related to social media analysis, information fusion, information and social network analysis, spatiotemporal data mining, and knowledge discovery.

Project Web site URL:  http://www.cs.uiuc.edu/~hanj/projs/social_media.htm

Online software:  Online software related to this project can be downloaded at www.illimine.cs.uiuc.edu

Online resources:  Research publications related to this project can be downloaded at Selected Publications