NSF/SGER: CS-BibCube: OLAPing and Mining of Computer Science Literature
National Science Foundation Award Number: NSF IIS 08-42769 (September 1, 2008 to Feb. 28, 2010)
Contact Information
Jiawei Han, PI
Department of Computer Science
University of Illinois, Urbana-Champaign
1304 West Springfield Ave., Urbana, Illinois 61801 U.S.A.
Office: (217) 333-6903, Fax: (217) 265-6494
E-mail: hanj at cs.uiuc.edu, URL: http://www.cs.uiuc.edu/~hanj
List of Supported Students and Staff
- Zhijun Yin, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign
- Yintao Yu, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign
- Bo Zhao, Ph.D. student, Department of Computer Science, University of Illinois at Urbana-Champaign
Project Award Information
- Award Number: NSF IIS 08-42769
- Duration: September 1, 2008 to Feb. 28, 2010
- Title: NSF/SGER: CS-BibCube: OLAPing and Mining of Computer Science Literature
- Keywords: text databases, text mining, information network analysis, online analytical processing, scalable OLAP and mining algorithms, data mining applications
Project Summary
This research project investigates issues in the design and development of CS-BibCube, a multidimensional text data cube built on categorical dimensions (e.g., author list, venue, and date) and unstructured text attributes (e.g., title, abstract, and contents), to facilitate multidimensional online analytical processing (OLAP) and mining of computer science literature. The data cube has become an essential engine in the data warehouse industry and has been extended to handle relatively structured non-relational data, including spatiotemporal data, sequences, graphs, and data streams. However, handling unstructured text data remains challenging. This project explores the possibilities and alternatives in the design, multidimensional modeling, implementation, performance improvement, and deployment of text-cubing and text-OLAP. The work will integrate approaches from multiple disciplines, including data cube and OLAP, information retrieval, text mining, and machine learning, and further study is expected to extend to other multidimensional text databases with broad applications in business, industry, government agencies, scientific research, and education. The research results are to be published in research forums on information retrieval, data mining, and database systems, and integrated into the educational program at UIUC. The progress of the project and the research results will be disseminated via the project Web site (http://www.cs.uiuc.edu/~hanj/projs/csbibcube.htm).
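To make the idea of a text cube concrete, the following is a minimal illustrative sketch in Python, not the project's implementation. It assumes a toy set of bibliographic records with two categorical dimensions (venue, year) and one text attribute (title), and it uses raw term frequency as the aggregated measure. The record contents and the names `PAPERS`, `build_text_cube`, and `term_frequency` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: categorical dimensions plus an unstructured text attribute.
PAPERS = [
    {"venue": "ICDM", "year": 2008, "title": "text cube for text database analysis"},
    {"venue": "ICDM", "year": 2008, "title": "graph olap on graphs"},
    {"venue": "SDM", "year": 2009, "title": "topic cube for olap on text databases"},
]
DIMENSIONS = ("venue", "year")
ALL = "*"  # marks a dimension that has been aggregated away (rolled up)


def build_text_cube(records, dimensions=DIMENSIONS):
    """Materialize every cell of every cuboid with aggregated term counts."""
    cube = defaultdict(Counter)
    for rec in records:
        terms = Counter(rec["title"].split())
        # Each subset of dimensions that is kept defines one cuboid.
        for k in range(len(dimensions) + 1):
            for kept in combinations(dimensions, k):
                cell = tuple(rec[d] if d in kept else ALL for d in dimensions)
                cube[cell] += terms
    return cube


def term_frequency(cube, term, **fixed):
    """Look up a term count in the cell selected by the fixed dimension values."""
    cell = tuple(fixed.get(d, ALL) for d in DIMENSIONS)
    return cube.get(cell, Counter())[term]


cube = build_text_cube(PAPERS)
# Drill down to one (venue, year) cell, then roll up to coarser cells.
print(term_frequency(cube, "cube", venue="ICDM", year=2008))
print(term_frequency(cube, "olap", venue="ICDM"))
print(term_frequency(cube, "text"))
```

This sketch fully materializes all cuboids, which is only practical for toy data; choosing which cells to precompute and which measures to store is one of the design questions a real text cube has to address.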
Publications and Products:
Journal articles (including accepted)
- Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, “Graph OLAP: A Multi-Dimensional Framework for Graph Data Analysis”, Knowledge and Information Systems (KAIS) (Special Issue of Selected Papers from ICDM'08), 2009.
- Jing Gao, Bolin Ding, Wei Fan, Jiawei Han, and Philip S. Yu, “Classifying Data Streams with Skewed Class Distribution and Concept Drifts”, IEEE Internet Computing (Special Issue on Data Stream Management), 12(6):37-49, 2008.
- Chao Liu, Xiangyu Zhang, and Jiawei Han, “A Systematic Study of Failure Proximity”, IEEE Transactions on Software Engineering, 34(6):826-843, 2008.
Books and Book Chapters
- H. J. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, 2nd ed., Springer Verlag, 2009.
- Hillol Kargupta, Jiawei Han, Philip S. Yu, and Rajeev Motwani (eds.), Next Generation of Data Mining (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series), 2009 (605 + xxiv pages).
- Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang, “Multi-Dimensional Analysis of Data Streams Using Stream Cubes”, in C. C. Aggarwal (ed.), Data Streams: Models and Algorithms, Kluwer Academic Publishers, pp. 103-126, 2006.
- Jiawei Han, “Data Mining”, in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
- Hong Cheng and Jiawei Han, “Frequent Itemsets and Association Rules”, in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
- Hong Cheng and Jiawei Han, “Pattern-Growth Methods”, in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
- Jiawei Han and Bolin Ding, “Stream Mining”, in M. Tamer Ozsu and Ling Liu (eds.), Encyclopedia of Database Systems, Springer, 2009.
- Ronnie Alves, Joel Ribeiro, Orlando Belo, and Jiawei Han, “Ranking Gradients in Multi-Dimensional Spaces”, in T. M. Nguyen (ed.), Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development: Innovative Methods and Applications, IGI Global, 2009.
- Jiawei Han and Jing Gao, “Research Challenges for Data Mining in Science and Engineering”, in H. Kargupta, et al. (eds.), Next Generation of Data Mining, Chapman & Hall/CRC, 2009, pp. 3-28.
- Feida Zhu, Xifeng Yan, Jiawei Han, and Philip S. Yu, “Mining Frequent Approximate Sequential Patterns”, in H. Kargupta, et al. (eds.), Next Generation of Data Mining, Chapman & Hall/CRC, 2009, pp. 69-90.
- Jiawei Han and Xiaolei Li, “Classification and Clustering for Homeland Security”, in John G. Voeller (ed.), Wiley Handbook of Science and Technology for Homeland Security, John Wiley & Sons, 2009.
- Jiawei Han, “OLAP, Spatial”, in Shashi Shekhar and Hui Xiong (eds.), Encyclopedia of GIS, Springer, 2008.
Refereed Conference Publications
- Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams”, Proc. 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'09), Bled, Slovenia, Sept. 2009.
- Min-Soo Kim and Jiawei Han, “A Particle-and-Density Based Evolutionary Clustering Method for Dynamic Networks”, Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
- Tianyi Wu, Dong Xin, Qiaozhu Mei, and Jiawei Han, “Promotion Analysis in Multi-Dimensional Space”, Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
- Chen Chen, Cindy Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, and Jiawei Han, “Mining Graph Patterns Efficiently via Randomized Summaries”, Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09), Lyon, France, Aug. 2009.
- Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang Zhai, Duo Zhang, and Bo Zhao, “iNextCube: Information Network-Enhanced Text Cube”, Proc. 2009 Int. Conf. on Very Large Data Bases (VLDB'09) (system demo), Lyon, France, Aug. 2009.
- David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, and Chengnian Sun, “Classification of Software Behaviors for Failure Detection: A Discriminative Pattern Mining Approach”, Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
- Yizhou Sun, Yintao Yu, and Jiawei Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
- Zhijun Yin, Rui Li, Qiaozhu Mei, and Jiawei Han, “Exploring Social Tagging Graph for Web Object Classification”, Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
- Jing Gao, Wei Fan, Yizhou Sun, and Jiawei Han, “Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation”, Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'09), Paris, France, June 2009.
- Deng Cai, Xiaofei He, Xuanhui Wang, Hujun Bao, and Jiawei Han, “Locality Preserving Nonnegative Matrix Factorization”, Proc. 2009 Int. Joint Conf. on Artificial Intelligence (IJCAI-09), Pasadena, CA, July 2009.
- Mohammad Maifi Hasan Khan, Tarek Abdelzaher, Jiawei Han, and Hossein Ahmadi, “Finding Symbolic Bug Patterns in Sensor Networks”, Proc. 2009 IEEE Int. Conf. on Distributed Computing in Sensor Systems (DCOSS '09), Marina Del Rey, CA, June 2009.
- Jing Gao, Guofei Jiang, Haifeng Chen, and Jiawei Han, “Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems”, Proc. 2009 Int. Conf. on Distributed Computing Systems (ICDCS'09), Montreal, Quebec, Canada, June 2009.
- Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams”, Proc. 2009 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'09), Bangkok, Thailand, Apr. 2009.
- Xin Jin, Sangkyum Kim, Jiawei Han, Liangliang Cao, and Zhijun Yin, “GAD: General Activity Detection for Fast Clustering on Large Data”, Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.
- Marisa Thoma, Hong Cheng, Arthur Gretton, Jiawei Han, Hans-Peter Kriegel, Alexander J. Smola, Le Song, Philip S. Yu, Xifeng Yan, and Karsten M. Borgwardt, “Near-Optimal Supervised Feature Selection among Frequent Subgraphs”, Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.
- Duo Zhang, ChengXiang Zhai, and Jiawei Han, “Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases”, Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009. (One of “Best of SDM’09”)
- Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu, “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, Proc. 2009 Int. Conf. on Extending Data Base Technology (EDBT'09), Saint-Petersburg, Russia, Mar. 2009.
- Jiawei Han, Xifeng Yan, and Philip S. Yu, “Scalable OLAP and Mining of Information Networks”, 2009 Int. Conf. on Extending Data Base Technology (EDBT'09), Saint-Petersburg, Russia, Mar. 2009.
- Bolin Ding, David Lo, Jiawei Han, and Siau-Cheng Khoo, “Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database”, Proc. 2009 Int. Conf. on Data Engineering (ICDE'09), Shanghai, China, Mar. 2009.
- Xiaolei Li, Zhenhui Li, Jiawei Han, and Jae-Gil Lee, “Temporal Outlier Detection in Vehicle Traffic Data”, Proc. 2009 Int. Conf. on Data Engineering (ICDE'09), Shanghai, China, Mar. 2009.
- Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, “Graph OLAP: Towards Online Analytical Processing on Graphs”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Deng Cai, Xiaofei He, Xiaoyun Wu, and Jiawei Han, “Non-negative Matrix Factorization on Manifold”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Cindy Xide Lin, Bolin Ding, Jiawei Han, Feida Zhu, and Bo Zhao, “Text Cube: Computing IR Measures for Multidimensional Text Database Analysis”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Luiz Mendes, Bolin Ding, and Jiawei Han, “Stream Sequential Pattern Mining with Precise Error Bounds”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, “A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Mohammad Maifi Hasan Khan, Hieu Le, Hossein Ahmadi, Tarek Abdelzaher, and Jiawei Han, “DustMiner: Troubleshooting Interactive Complexity Bugs in Sensor Networks”, Proc. 2008 ACM Int. Conf. on Embedded Networked Sensor Systems (SenSys'08), Raleigh, NC, Nov. 2008.
- Deng Cai, Qiaozhu Mei, Jiawei Han, and ChengXiang Zhai, “Modeling Hidden Topics on Document Manifold”, Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), Napa Valley, CA, Oct. 2008.
Project Impact
- Education: Parts of the new research results are used in the Data Mining courses (CS412, CS512) taught to undergraduate and graduate students in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Moreover, the research results have been and will continue to be published promptly in international conferences and journals and distributed worldwide for education and research. New progress will also be integrated into the new edition of our data mining textbook and other research collections.
- Collaborations: For this project we have established collaborations with NASA, HP Labs, IBM T.J. Watson Research Center, Yahoo! Research, Microsoft Research, Boeing, and NCSA (National Center for Supercomputing Applications). Through these collaborations we expect to gain access to real datasets and applications and to produce more research results.
Current and Future Activities
The following are some highlights of our ongoing work. Please refer to the Publications and Products section for related references.
- Development of efficient and scalable mechanisms for OLAP and mining of networks: see the ICDM’08, EDBT’09, SDM’09, KDD’09, and VLDB’09 papers.
- Development of multi-dimensional text database analysis techniques: see ICDM’08 (Text Cube), SDM’09 (Topic Cube), and the VLDB’09 demo (iNextCube).
- Development of efficient methods for data-intensive knowledge discovery and data mining: see the SDM’09, KDD’09, and VLDB’09 papers.
Area Background
This project builds on previous research on data mining, text data analysis, and data cubes and multidimensional analysis. Many research papers have been published on these themes. Several textbooks on data mining, information retrieval, and information network analysis provide good overviews of the principles and algorithms, including (Han and Kamber, 2006), (Hastie, Tibshirani, and Friedman, 2001), and (Manning, Raghavan, and Schutze, 2008).
Area References
- Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, and Philip S. Yu, “Graph OLAP: Towards Online Analytical Processing on Graphs”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- Cindy Xide Lin, Bolin Ding, Jiawei Han, Feida Zhu, and Bo Zhao, “Text Cube: Computing IR Measures for Multidimensional Text Database Analysis”, Proc. 2008 Int. Conf. on Data Mining (ICDM'08), Pisa, Italy, Dec. 2008.
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.
- C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.
- Duo Zhang, ChengXiang Zhai, and Jiawei Han, “Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases”, Proc. 2009 SIAM Int. Conf. on Data Mining (SDM'09), Sparks, NV, April 2009.
Potential Related Projects
This project is related to most research on data mining, text databases, and OLAP. In particular, it is related to the PI's NSF IIS 020-9199 (Mining Sequential and Structured Patterns: Scalability, Flexibility, Extensibility and Applicability), the PI's NSF IIS-03-08215 (Mining Dynamics of Data Streams in Multi-Dimensional Space), and the PI's NASA project NNX08AC35A (Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events). We wish to collaborate or exchange research ideas with research projects related to knowledge discovery in databases, text information systems, and OLAP analysis, and their applications.
Project Web site URL: http://www.cs.uiuc.edu/~hanj/projs/csbibcube.htm
Online software: Software related to this project can be downloaded at www.illimine.cs.uiuc.edu
Online resources: Research publications related to
this project can be downloaded at Selected Publications