10 October 2012

Data Mining of Big Data using tools like SVMs, clustering, decision trees, MCMC, and neural networks has emerged in place of conventional statistics to handle large size and complexity. These methods are well suited to *spot patterns in large, static data sets*.

But the inevitable demand for *rapid analysis of streaming data* reveals the limitations of Data Mining methods, especially where data streams are unstable, chaotic, non-stationary, or subject to concept drift. And that covers many important areas (!), like human behavior (economic, social, commercial, etc.). My focus is mostly computational finance.
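To make concept drift concrete, here is a minimal Python sketch (my own toy example, not tied to any package or dataset): a stream whose underlying relationship flips halfway through, so a model fit on the early data quietly degrades on the later data.

```python
# Toy concept-drift illustration (all numbers are made up).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)

# Regime 1: y = 2*x + noise; Regime 2 (after the drift point): y = -1*x + noise
drift_point = n // 2
beta_true = np.where(np.arange(n) < drift_point, 2.0, -1.0)
y = beta_true * x + rng.normal(scale=0.5, size=n)

# A "static" estimate fit on the first regime only
beta_hat = np.sum(x[:drift_point] * y[:drift_point]) / np.sum(x[:drift_point] ** 2)

err_before = np.mean((y[:drift_point] - beta_hat * x[:drift_point]) ** 2)
err_after = np.mean((y[drift_point:] - beta_hat * x[drift_point:]) ** 2)
print(f"MSE before drift: {err_before:.3f}, after drift: {err_after:.3f}")
```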

Data Mining methods lag in adjusting to changes in the observed behavior. A key problem is uncertainty about whether estimates calculated from past data are still good enough to apply to an often-changing future. How frequently, and when, does the model for prediction or classification need to be updated? How responsive to incoming data should the estimating procedure be to achieve the needed reliability, without getting whipsawed by noise or lagging behind shifts?
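The whipsaw-versus-lag trade-off shows up even in the simplest streaming estimator. Here is a small Python sketch (my own illustration, with arbitrary weights): an exponentially weighted moving average tracking a level that shifts mid-stream. A large smoothing weight reacts quickly to the shift but gets whipsawed by noise; a small weight is smooth but lags behind the shift.

```python
# EWMA responsiveness trade-off (illustrative weights only).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
level = np.where(np.arange(n) < n // 2, 0.0, 3.0)   # true level shifts halfway
stream = level + rng.normal(scale=1.0, size=n)

def ewma(data, alpha):
    """One-pass exponentially weighted mean with smoothing weight alpha."""
    out = np.empty_like(data)
    est = data[0]
    for i, obs in enumerate(data):
        est = alpha * obs + (1 - alpha) * est
        out[i] = est
    return out

for alpha in (0.5, 0.05):
    est = ewma(stream, alpha)
    track_err = np.mean((est - level) ** 2)          # error against the true level
    print(f"alpha={alpha}: mean squared tracking error = {track_err:.3f}")
```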

The challenge is a machine learning tool that self-corrects to minimize prediction and classification errors. A forgetting factor, as in the dynaTree R package, could be effective if it adjusted automatically. A gain factor, as in Kalman filtering, can be set reasonably well for steady systems (physics), but is sluggish in chaotic settings. GARCH and its relatives provide particularly clumsy structures. Many other approaches exist, like Dynamic Model Averaging and adaptive ensembles. Some models must work well, like the real-time demand estimators within Google's Borg, which load-balances its servers.
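As one concrete direction, here is a rough Python sketch of the "self-correcting forgetting" idea: scalar recursive least squares whose forgetting factor is lowered whenever recent prediction errors run above their long-run level, so the estimator forgets faster just after an apparent break. This is my own illustration, not dynaTree's or any package's actual method; the adaptation rule, window, and constants are assumptions.

```python
# Scalar recursive least squares with an error-driven forgetting factor.
# The adaptation rule and constants below are illustrative assumptions.
import numpy as np

def adaptive_rls(x, y, lam_min=0.90, lam_max=0.999, window=30):
    """Track y ~ beta * x, forgetting faster when recent errors grow."""
    beta, p = 0.0, 1000.0            # coefficient estimate and its "variance"
    errs, betas = [], []
    for xi, yi in zip(x, y):
        e = yi - beta * xi           # one-step-ahead prediction error
        errs.append(e * e)
        # Shrink lambda (forget faster) when recent errors exceed the long-run level
        recent = np.mean(errs[-window:])
        overall = np.mean(errs)
        ratio = min(recent / max(overall, 1e-12), 5.0)
        lam = np.clip(lam_max - 0.02 * (ratio - 1.0), lam_min, lam_max)
        # Standard RLS update with forgetting factor lam
        k = p * xi / (lam + xi * p * xi)
        beta = beta + k * e
        p = (p - k * xi * p) / lam
        betas.append(beta)
    return np.array(betas)

# Example use on a stream whose coefficient breaks from 2 to -1 (toy data):
rng = np.random.default_rng(2)
x = rng.normal(size=2000)
beta_true = np.where(np.arange(2000) < 1000, 2.0, -1.0)
y = beta_true * x + rng.normal(scale=0.5, size=2000)
betas = adaptive_rls(x, y)
print(f"estimate near the end: {betas[-1]:.2f} (true coefficient is -1.0)")
```

Dropping lambda after a break shortens the effective memory to roughly 1/(1 - lambda) observations, so the coefficient can move toward a new regime far faster than a fixed long-memory fit, then settle back to slow forgetting once errors normalize.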

Have you had success in this area? Can you cite methods, sources, examples or software? I would be glad to discuss this more if you have interest. Thanks.

Tom
