NSF III: Medium: Collaborative Research: Mining and Leveraging Knowledge Hypercubes for Complex Applications: NSF-IIS 19-56151

(10/01/2020-09/30/2023)

 

 

Contact Information

 

Jiawei Han,  Co-PI, Michael Aiken Chair Professor 
Department of Computer Science
University of Illinois, Urbana-Champaign
201 N. Goodwin Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903, Fax: (217) 265-6494

E-mail: hanj at illinois.edu, URL: http://hanj.cs.illinois.edu

 

List of Supported Students and Staff

 

§  Liyuan Liu, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign 

§  Xiaotao Gu, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign 

§  Yunyi Zhang, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign 

Project Award Information

 

·         Award Number: NSF-IIS 19-56151

·         Duration: 10/01/2020-09/30/2023

·         Title: NSF III: Medium: Collaborative Research: Mining and Leveraging Knowledge Hypercubes for Complex Applications

·         Keywords:  text/data mining; knowledge bases; unsupervised and weakly supervised learning; multi-dimensional information extraction and analysis; knowledge discovery; efficiency and scalability

Project Summary

Recent years have witnessed the proliferation of various machine-readable knowledge repositories, such as general knowledge bases and domain-specific ontologies.  Although existing knowledge repositories have shown their power in simple search and question answering, their usage in complex problem solving is very limited.  In many domains, knowledge varies with context, and the flat structure commonly adopted by existing knowledge repositories cannot capture the complicated knowledge associated with different contexts.  To make knowledge resources more findable, accessible, interoperable, and reusable (FAIR), this project proposes to conceptualize a new structure, the Knowledge Hypercube (K-CUBE), for organizing and retrieving knowledge that could support complex applications in various domains.  A knowledge hypercube organizes knowledge with respect to selected important dimensions or aspects, and thus it allows people to easily access knowledge in any context, encapsulate distinctive entities and relationships, and conduct cross-dimensional comparison and inference.  The major objective of this proposal is to form a paradigm of mining knowledge hypercubes from massive collections of text documents and leveraging such hypercubes for complex exploration and prediction tasks.  The progress of the project and the research results are also disseminated via the project Web site (http://hanj.cs.illinois.edu/projs/hypercube.htm).

 

Intellectual Merit:

 

The proposed research bridges the gap between the empirical success of network embedding and existing statistical learning and optimization theories. The core of this proposed research is the integration of modern network mining techniques with sophisticated statistical learning and optimization tools, which lays a foundation to design a new generation of network embedding algorithms with strong theoretical guarantees, and to derive new theories for various setups of network embedding. Extensive empirical evaluations ensure the proposed algorithms' applicability in various application domains. The proposed research is expected to advance the frontier of network embedding and to enable it to tame modern massive networks in the wild.

 

Broader Impacts:

 

The successful completion of this project will lead to a new, advanced way to store, retrieve, share and exploit knowledge for complex applications. It will have immediate impact on the process of knowledge distillation, organization and exploitation, and will broadly impact the field of data science, which centers around finding and using knowledge.  The proposed research will provide an important source to advance knowledge-based machine learning approaches. Furthermore, by building a bridge between complex tasks and text collections, the proposed research to mine and leverage knowledge can potentially benefit a wide range of domains that have gigantic literature and unsolved complex tasks, such as drug repurposing and fake news detection.  A repository of the developed software and constructed knowledge hypercubes for the proposed domains will be built, and the results of this project will be disseminated both within computer science and to many other disciplines.  This project has the potential to promote the adoption of knowledge hypercubes by industry, making knowledge resources more findable, accessible, interoperable, and reusable (FAIR).  Moreover, the proposed research work will be integrated tightly with education, as we plan to leverage knowledge hypercubes for educational tasks such as knowledge tracing.  We will also encourage the participation of undergraduate and minority students in data mining research at all three institutions.

 

The research results are to be published in various research and application forums and be integrated into the educational programs at UIUC.  The progress of the project and the research results are also disseminated via the project Web site (http://www.cs.uiuc.edu/homes/hanj/projs/hypercube.htm).

Publications and Products: (Note: major publications closely related to this project are in bold font)

Note:  Please search and download all the papers in PDF, if available, at our group’s publication website by following the link: Selected research publications.

Books

·         Xiang Ren and Jiawei Han, Mining Structures of Factual Knowledge from Text: An Effort-Light Approach, Morgan & Claypool Publishers, 2018 (Series: Synthesis Lectures on Data Mining and Knowledge Discovery)

·         Chao Zhang and Jiawei Han, Multidimensional Mining of Massive Text Data, Morgan & Claypool Publishers, 2019 (Series: Synthesis Lectures on Data Mining and Knowledge Discovery)

 

 

Journal articles

·         Yu Meng, Jiaxin Huang, Guangyuan Wang, Zihan Wang, Chao Zhang, and Jiawei Han, “Unsupervised Word Embedding Learning by Incorporating Local and Global Contexts”, Frontiers in Big Data, 3:9, 2020

·         Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, "Automated Phrase Mining from Massive Text Corpora", IEEE Transactions on Knowledge and Data Engineering, 30(10):1825-1837 (2018)

·         Jingbo Shang, Meng Jiang, Wenzhu Tong, Jinfeng Xiao, Jian Peng, Jiawei Han. "DPPred: An Effective Prediction Framework with Concise Discriminative Patterns", IEEE Transactions on Knowledge and Data Engineering, 30(7): 1226-1239 (2018)

 

Refereed Conference Publications

 

1.        Xiaotao Gu, Zihan Wang, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han, Jingbo Shang, "UCPhrase: Unsupervised Context-aware Quality Phrase Tagging", in Proc. of 2021 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'21), Aug. 2021 

2.        Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han, "On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining" (Conference Tutorial), in Proc. of 2021 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'21), Aug. 2021

3.       Sha Li, Heng Ji and Jiawei Han, "Document-Level Event Argument Extraction by Conditional Generation", in Proc. 2021 Annual Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT'21), June 2021

4.       Jiaming Shen, Wenda Qiu, Yu Meng, Jingbo Shang, Xiang Ren and Jiawei Han, "TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names", in Proc. 2021 Annual Conf. of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT'21), June 2021

5.       Xinyang Zhang, Chenwei Zhang, Xin Luna Dong, Jingbo Shang and Jiawei Han, “Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks”, in Proc. The Web Conf. 2021 (WWW’21), April 2021

6.       Yu Zhang, Zhihong Shen, Yuxiao Dong, Kuansan Wang and Jiawei Han, “MATCH: Metadata-Aware Text Classification in a Large Hierarchy”, in Proc. The Web Conf. 2021 (WWW’21), April 2021

7.       Qi Zhu, Fang Guo, Jingjing Tian, Yuning Mao, Jiawei Han, "SUMDocS: Surrounding-aware Unsupervised Multiple Document Summarization", in Proc. 2021 SIAM Int. Conf. on Data Mining (SDM'21), April 2021

8.       Yu Zhang, Xiusi Chen, Yu Meng and Jiawei Han, "Hierarchical Metadata-Aware Document Categorization under Weak Supervision", in Proc. 2021 ACM Int. Conf. on Web Search and Data Mining (WSDM'21), Feb. 2021

9.       Di Jin, Xiangchen Song, Zhizhi Yu, Ziyang Liu, Heling Zhang, Zhaomeng Cheng and Jiawei Han, "BiTe-GCN: A New GCN Architecture via Bidirectional Convolution of Topology and Features on Text-Rich Networks",  in Proc. 2021 ACM Int. Conf. on Web Search and Data Mining (WSDM'21), Feb. 2021

10.     Carl Yang, Yuxin Xiao, Yu Zhang, Yizhou Sun and Jiawei Han, "Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark",  IEEE Transactions on Knowledge and Data Engineering, 2021

11.      Xuan Wang, Xiangchen Song, Bangzheng Li, Kang Zhou, Qi Li, and Jiawei Han, "Fine-Grained Named Entity Recognition with Distant Supervision in COVID-19 Literature", in Proc. 2020 IEEE Int. Conf. on Bioinformatics and Biomedicine (IEEE BIBM 2020), Dec. 2020

12.     Xuan Wang, Yu Zhang, Aabhas Chauhan, Qi Li, and Jiawei Han, "Textual Evidence Mining via Spherical Heterogeneous Information Network Embedding", in Proc. 2020 IEEE Int. Conf. on Big Data (IEEE BigData'20), Dec. 2020 

13.     Xuan Wang, Yingjun Guan, Yu Zhang, Qi Li, and Jiawei Han, "Pattern-enhanced Named Entity Recognition with Distant Supervision", in Proc. 2020 IEEE Int. Conf. on Big Data (IEEE BigData'20), Dec. 2020

14.     Carl Yang, Liyuan Liu, Mengxiong Liu, Zongyi Wang, Chao Zhang, and Jiawei Han, "Graph Clustering with Embedding Propagation", in Proc. 2020 IEEE Int. Conf. on Big Data (IEEE BigData'20), Dec. 2020  

15.     Jiaxin Huang, Yu Meng, Fang Guo, Heng Ji and Jiawei Han, "Aspect-Based Sentiment Analysis by Aspect-Sentiment Joint Embedding", in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20), Nov. 2020

16.     Yuning Mao, Yanru Qu, Yiqing Xie, Xiang Ren and Jiawei Han, "Multi-document Summarization with Maximal Marginal Relevance-guided Reinforcement Learning", in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20), Nov. 2020

17.     Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang and Jiawei Han, "Text Classification Using Label Names Only: A Language Model Self-Training Approach", in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20), Nov. 2020

18.     Jiaming Shen, Wenda Qiu, Jingbo Shang, Michelle Vanni, Xiang Ren and Jiawei Han, "SynSetExpan: An Iterative Framework for Joint Entity Set Expansion and Synonym Discovery", in Proc. 2020 Conf. on Empirical Methods in Natural Language Processing (EMNLP'20), Nov. 2020

19.     Edouard Fouche, Yu Meng, Fang Guo, Honglei Zhuang, Klemens Boehm, and Jiawei Han, "Mining Text Outliers in Document Directories", in Proc. 2020 IEEE Int. Conf. on Data Mining (ICDM'20), Nov. 2020

20.    Carl Yang, Jieyu Zhang, and Jiawei Han, "Co-Embedding Network Nodes and Hierarchical Labels with Taxonomy Based Generative Adversarial Networks", in Proc. 2020 IEEE Int. Conf. on Data Mining (ICDM'20), Nov. 2020 (Best Paper Award)

21.     Yu Meng, Jiaxin Huang, Jiawei Han, “Embedding-Driven Multi-Dimensional Topic Mining and Text Analysis”, (Conference tutorial), 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020

22.    Jiaxin Huang, Yiqing Xie, Yu Meng, Yunyi Zhang and Jiawei Han, “CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring”, in Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020

23.    Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos and Jiawei Han, “Octet: Online Catalog Taxonomy Enrichment with Self-Supervision”, in Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020 

24.    Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Chao Zhang and Jiawei Han, “Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding”, in Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020 

25.    Chanyoung Park, Carl Yang, Qi Zhu, Donghyun Kim, Hwanjo Yu and Jiawei Han, “Unsupervised Differentiable Multi-aspect Network Embedding”, in Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020 

26.    Carl Yang, Aditya Pal, Andrew Zhai, Nikil Pancha, Jiawei Han, Chuck Rosenburg and Jure Leskovec, “MultiSage: Empowering GCN with Contextualized Multi-Embeddings on Web-Scale Multipartite Networks”, in Proc. of 2020 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’20), San Diego, CA, August 2020

27.    Yiqing Xie, Sha Li, Carl Yang, Raymond Chi-Wing Wong, Jiawei Han, “When Do GNNs Work: Understanding and Improving Neighborhood Aggregation”, in Proc. of 2020 Int. Joint Conf. on Artificial Intelligence and Pacific Rim Int. Conf. on Artificial Intelligence (IJCAI-PRICAI’20), Yokohama, Japan, July 2020

28.    Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang and Jiawei Han, “Minimally Supervised Categorization of Text with Metadata”, in Proc. 2020 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval (SIGIR’20), Xi’an, China, July 2020

29.    Honglei Zhuang, Fang Guo, Chao Zhang, Liyuan Liu and Jiawei Han, “Joint Aspect-Sentiment Analysis with Minimal User Guidance”, in Proc. 2020 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval (SIGIR’20), Xi’an, China, July 2020

30.    Carl Yang, Jieyu Zhang, Haonan Wang, Bangzheng Li, Jiawei Han, "Neural Concept Map Generation for Effective Document Classification with Interpretable Structured Summarization" (short paper), in Proc. 2020 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval (SIGIR'20), Xi'an, China, July 2020

31.     Yuning Mao, Liyuan Liu, Qi Zhu, Xiang Ren and Jiawei Han, “Facet-Aware Evaluation for Extractive Summarization”, in Proc. 2020 Annual Conf. of the Association for Computational Linguistics (ACL’20), Seattle, WA, July 2020

32.    Yunyi Zhang, Jiaming Shen, Jingbo Shang and Jiawei Han, “Empower Entity Set Expansion via Language Model Probing”, in Proc. 2020 Annual Conf. of the Association for Computational Linguistics (ACL’20), Seattle, WA, July 2020 

33.    Xuan Wang, Yingjun Guan, Weili Liu, Aabhas Chauhan, Enyi Jiang, Qi Li, David Liem, Dibakar Sigdel, John Caufield, Peipei Ping and Jiawei Han, “EVIDENCEMINER: Textual Evidence Discovery for Life Sciences”, in Proc. 2020 Annual Conf. of the Association for Computational Linguistics (ACL’20) (System demo), Seattle, WA, July 2020

34.    Xiaotao Gu, Yuning Mao, Jiawei Han, Jialu Liu, You Wu, Cong Yu, Daniel Finnie, Hongkun Yu, Jiaqi Zhai and Nicholas Zukoski, “Generating Representative Headlines for News Stories”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

35.    Jiaxin Huang, Yiqing Xie, Yu Meng, Jiaming Shen, Yunyi Zhang and Jiawei Han, “Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

36.    Yu Meng, Jiaxin Huang, Guangyuan Wang, Zihan Wang, Chao Zhang, Yu Zhang and Jiawei Han, “Discriminative Topic Mining via Category-Name Guided Text Embedding”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

37.    Jingbo Shang, Xinyang Zhang, Liyuan Liu, Sha Li and Jiawei Han, “NetTaxo: Automated Topic Taxonomy Construction from Large-Scale Text-Rich Network”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

38.    Jiaming Shen, Zhihong Shen, Chenyan Xiong, Chi Wang, Kuansan Wang and Jiawei Han, “TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

39.    Qi Zhu, Hao Wei, Bunyamin Sisman, Da Zheng, Christos Faloutsos, Xin Luna Dong and Jiawei Han, “Collective Multi-type Entity Alignment Between Knowledge Graphs”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020

40.    Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han, "On the Variance of the Adaptive Learning Rate and Beyond", in Proc. 2020 Int. Conf. on Learning Representations (ICLR'20), Addis Ababa, Ethiopia, Apr. 2020

41.     Chanyoung Park, Donghyun Kim, Hwanjo Yu, Jiawei Han, “Unsupervised Attributed Multiplex Network Embedding”, in Proc. 2020 AAAI Int. Conf. on Artificial Intelligence (AAAI’20), New York, NY, Feb. 2020

42.    Aravind Sankar, Xinyang Zhang, Adit Krishnan and Jiawei Han, "A Deep Generative Approach to Integrate Social Homophily and Temporal Influence in Diffusion Prediction", in Proc. 2020 ACM Int. Conf. on Web Search and Data Mining (WSDM'20), Houston, TX, Feb. 2020

43.    Carl Yang, Jieyu Zhang, Haonan Wang, Sha Li, Myunghwan Kim, Matthew Walker, Yiou Xiao and Jiawei Han, "Relation Learning on Social Networks with Multi-Modal Graph Edge Variational Autoencoders", in Proc. 2020 ACM Int. Conf. on Web Search and Data Mining (WSDM'20), Houston, TX, Feb. 2020


 

Ph.D. Dissertations

 

·         Xiang Ren, Ph.D., January 2018, thesis title: “Mining Entity and Relation Structures from Text: An Effort-Light Approach”, Ph.D. Thesis won 2018 ACM SIGKDD Doctoral Dissertation Award

·         Chao Zhang, Ph.D., Nov. 2018, thesis title: “Multi-dimensional Mining of Unstructured Data with Limited Supervision”, Ph.D. Thesis won 2019 ACM SIGKDD Doctoral Dissertation Award Runner-Up

·         Jingbo Shang, Ph.D., Nov. 2019, thesis title: “Constructing and Mining Structured Heterogeneous Information Networks from Massive Text Corpora”, Ph.D. Thesis won 2020 ACM SIGKDD Doctoral Dissertation Award Runner-Up

 

 

Project Impact

 

§  Education: Parts of the new research results are used in Data Mining courses (CS412, CS512, CS412 MCD-DS online Coursera courses) taught to both undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign.  The research results have been, and will continue to be, published in a timely manner in international conferences and journals and distributed worldwide for education and research.  Most of the software developed in this project has been released as open source on GitHub. The new progress will also be integrated into the new edition of our data mining textbook and other research collections.

§  Collaborations: For this project we have established collaborations with ARL, Google Research, Amazon, Adobe, IAI, Microsoft Research, UCLA Medical School, LinkedIn, Facebook, and other industry and research centers.  Through such collaborations we expect to explore many real applications and produce greater research impact.

 

 

Current and Future Activities

The following are some of the highlights of our ongoing work.  Please refer to the Publications and Products section for related references.

1.        Study effective and scalable methods for embedding and mining text and heterogeneous information networks

2.        Study effective and scalable methods for embedding and text mining in the construction of heterogeneous knowledge cubes from unstructured data

3.       Study effective and scalable methods for exploration of multidimensional text and knowledge hypercubes to support new applications

 

Area Background

 

This project is based on the previous research on data mining, text mining, embedding in networks, and data cube and multidimensional analysis.    There have been many research papers published on these themes.   Several textbooks on data mining, text mining, information retrieval and information network analysis provide good overviews of the principles and algorithms.

 

Area References

·         Jiawei Han, Jian Pei, and Hanghang Tong, Data Mining: Concepts and Techniques, 4th edition, Morgan Kaufmann, 2021

·         C. Aggarwal, Machine Learning for Text, Springer 2017

·         Xiang Ren and Jiawei Han, Mining Structures of Factual Knowledge from Text: An Effort-Light Approach, Morgan & Claypool Publishers, 2018 

·         Jialu Liu, Jingbo Shang and Jiawei Han,  Phrase Mining from Massive Text and Its Applications, Morgan & Claypool, 2017

·         Yizhou Sun and Jiawei Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool, 2012

 

 

Potential Related Projects

·         Information Network Academic Research Center, Network Science-Collaborative Technology Alliance

·         NIH BD2K: KnowEng (Knowledge Engine for Genomics) Center: Construction and Mining of Biological Networks

·         Multi-Dimensional Structuring, Summarizing and Mining of Social Media Data (NSF/IIS)

·         StructNet: Constructing and Mining Structure-Rich Information Networks for Scientific Research (NSF/IIS)

 

Project Web site URL:  http://hanj.cs.illinois.edu/projs/hypercube.htm

Online software:  Online software can be downloaded at http://illimine.cs.uiuc.edu, and on GitHub under the first author's name

Online resources:  Research publications related to this project can be downloaded at Selected Publications