Contents

People involved
About Machine Learning for Data Streams
References
Links

COMP2SYS people involved in Machine Learning for Data Streams are:
  • Yann-Aël Le Borgne
Senior scientists involved in Machine Learning for Data Streams are:
  • Gianluca Bontempi

Machine learning for data streams

Machine learning aims at developing automated algorithms that can transform raw data into useful knowledge.

Useful knowledge is a task-dependent notion, and machine learning methods cover a wide range of techniques and tools whose final purposes are to solve classification, regression, prediction and clustering problems.

Together with advances in electronics, computer science and information technology, the number and variety of recordable data sources constantly increase. Many scientific communities and large commercial organizations now routinely collect on the order of gigabytes of data per day. Data rates of this level have significant consequences, notably in terms of storage.

All these data-generating sources are referred to as data streams.

A data stream can be defined as a real-time, continuous and ordered sequence of items. This definition outlines the key constraints associated with data streams: it is impossible to control the order in which the items arrive, and it is not feasible to locally store a stream in its entirety. These characteristics constrain the learning algorithm to integrate data as it becomes available, in an on-line and real-time fashion.
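The one-pass, constant-memory constraint can be illustrated with a minimal sketch. It uses Welford's algorithm for a running mean and variance, which is a standard technique chosen here for illustration and not a method taken from this project. The names RunningStats and sensor_stream, and the sample values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class RunningStats:
    """One-pass mean and variance (Welford's algorithm), O(1) memory."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Integrate a single item; the item itself is never stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; reported as 0.0 for fewer than two items.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def sensor_stream() -> Iterator[float]:
    # Hypothetical stand-in for a real data source; items arrive one by one.
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for item in sensor_stream():
    stats.update(item)  # each item is seen exactly once, in arrival order
```

Each item is processed once and then discarded, so memory use does not grow with the length of the stream.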

Moreover, in most cases, the underlying structure of the data distribution within the data stream changes over time. Such changes, referred to as concept drift, must be detected by the learning algorithm, so that the outdated model can be replaced with a new one. A desirable data stream learning algorithm would detect when changes occur, and identify how many models are necessary to accurately represent the data stream over time.
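As a rough illustration of change detection, the sketch below implements the Page-Hinkley test, a classical sequential detector for an upward shift in the mean of a signal (for example, the error of the current model). It is only one possible detector and is not presented as the approach of this project. The class name, the parameter values and the synthetic stream are assumptions made for the example.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0) -> None:
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (assumed value)
        self.reset()

    def reset(self) -> None:
        # Called after an alarm, when the outdated model is replaced.
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.min_cum = 0.0  # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Integrate one item; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm_index(stream: Iterable[float]) -> Optional[int]:
    # Return the 0-based position of the first alarm, or None if none occurs.
    detector = PageHinkley()
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None


# Synthetic stream: the mean jumps from 0 to 1 after 50 items.
drifting = [0.0] * 50 + [1.0] * 50
alarm = first_alarm_index(drifting)
```

In a full system, an alarm would typically trigger the replacement or retraining of the current model, followed by a call to reset().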

These constraints invalidate many classical data mining algorithms, which are based either on multiple scans of a training set, or on the assumption that data are independently and identically distributed. The adaptation of these learning algorithms, together with new approaches that are better suited to data stream management, has given rise to the data stream research community [5,6,7,8].

In this research, particular attention will be given to lazy learning methods and distributed learning algorithms.
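To give an idea of what lazy learning can look like on a stream, here is a minimal sketch of a nearest-neighbour regressor that keeps only a bounded window of recent examples and defers all computation to query time. It is a generic illustration, not the specific local learning techniques of [1]. The class name WindowedKNN, the window size and the toy data are assumptions.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedKNN:
    """Lazy k-nearest-neighbour regression over a sliding window."""

    def __init__(self, k: int = 3, window: int = 100) -> None:
        self.k = k
        # A bounded buffer: the oldest example is dropped automatically
        # once the window is full, so memory stays constant.
        self.buffer: Deque[Tuple[float, float]] = deque(maxlen=window)

    def learn(self, x: float, y: float) -> None:
        # "Training" only stores the example; no model is fitted.
        self.buffer.append((x, y))

    def predict(self, x: float) -> Optional[float]:
        # The local model is built at query time from the k closest examples.
        if not self.buffer:
            return None
        nearest = sorted(self.buffer, key=lambda item: abs(item[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)


model = WindowedKNN(k=2, window=4)
for i in range(10):
    model.learn(float(i), 2.0 * i)  # toy stream following y = 2x

prediction = model.predict(8.4)  # uses only the 4 most recent examples
```

Because old examples leave the window, the predictor follows the most recent behaviour of the stream without an explicit retraining step.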


References

[1] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA, Université Libre de Bruxelles, 1999.

[2] L. Golab, M. T. Özsu. Issues in data stream management. SIGMOD Record, Volume 32, Number 2, June 2003, pp. 5-14.

[3] P. Domingos, G. Hulten. Catching up with the data: research issues in mining data streams. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara, CA, USA, May 2001.

[4] B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and issues in data stream systems. Proceedings of the 21st ACM Symposium on Principles of Database Systems, 2002.

[5] P. Domingos, G. Hulten. A general framework for mining massive data streams. Journal of Computational and Graphical Statistics, 12, 2003.

[6] P. Domingos, G. Hulten. Mining high speed data streams. Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001. ACM Press.

[7] M. Gaber, S. Krishnaswamy, A. Zaslavsky. Ubiquitous data stream mining. Current Research and Future Directions Workshop Proceedings, held in conjunction with the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia, May 26, 2004.

[8] M. Garofalakis, J. Gehrke, R. Rastogi. Querying and mining data streams: you only get one look (a tutorial). SIGMOD (Special Interest Group on Management of Data) Conference, 2002.

 


Links

  • The Knowledge Discovery Network of Excellence website
  • Last modified: June 27 2014 11:17:34.  e-mail: est@iridia.ulb.ac.be