I don't know what this particular Data Science program is teaching, but research topics could be much more varied than just statistics and data mining. Consider Dr. Deming's notion that whenever data is collected for a specific purpose, there is a process involved to accomplish that, and wherever there are processes, there is natural variation. One research question from this premise is: How much variation is acceptable for data to be "good" data? What are the characteristics of "good" data? How are they measured? What are reasonable standards, in different domains of data collected for different purposes, for the quality of data? Other research questions that flow from these include: How do the characteristics of data quality impact decision-making, in theory and then again in practice? Also, what should the qualification standards be for a new measurement system before it is deployed for use in production? Here's an example from my work in the areas mentioned above: https://www.researchgate.net/publication/13780659_Comprehensive_reliability_assessment_and_comparison_of_quality_indicators_and_their_components?ev=prf_pub
Just a different name for dear old Statistics (leaning more toward its descriptive, multidimensional territories than toward the inferential, probabilistic lands), but these days scientists need advertising to live, and advertising asks for new brands... very sad indeed.
The closest domain to data science is data mining. Consider, for example:
* Principles and Theory for Data Mining and Machine Learning by Bertrand Clarke, Ernest Fokoué, and Hao Helen Zhang
* Data Mining Methods and Models by Daniel T. Larose
* Principles of Data Mining by Max Bramer
The focus there is more on an algorithmic perspective than on a data-oriented one, which I think is the focus of Data Science. Where it gets fuzzy is in references such as:
Data Mining Concepts and Techniques by Jiawei Han and Micheline Kamber
where they also cover Data Warehouse and OLAP Technology, which I would consider a departure from the above references and a step toward data science. While this reference does not cover everything that would be included, it starts to blur the lines around what I would consider data science topics.
And I would agree with Alessandro that this is more about re-branding than anything else, since the other topics in data science are already covered by established disciplines.
I agree with Arturo Geigel that, in the sense described by Michael Brimacombe, Data Science may be taken as another name for data mining, also known as Knowledge Discovery. The main issue is that a huge number of measurements are being taken at every moment. Can we make sense of them? Can we transform them into information, and model this information so that we obtain new knowledge?
Statistical methods have taken us a long way in data mining or data science. But because statistical methods are so closely tied to measures, this has somewhat limited the research. Thus we are expanding the mathematics and looking for applied problems. Let me explain.
Machine learning may be separated into Bayesian Belief models and Rough Set models. If you have studied Bayesian theory, then you should already be aware that it rests on a different fundamental understanding than Fisher's frequentist model. Rough Sets are different again.
The motivation for rough sets is readily understood. I can normally distinguish human gender by a simple observation. However, there is Androgynous Pat from Saturday Night Live, for whom it is difficult to know with certainty. What set of observed attributes improves my classification? Rough sets are defined by the concept that things may {belong, not-belong, cannot-distinguish}. The distinction from fuzzy sets is the absence of an appropriate measure.
The mathematical methods of rough sets then are about how to weaken the axioms of "measure" and improve the odds of making correct decisions about belonging.
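To make the three outcomes above concrete, here is a minimal sketch of the standard lower and upper approximations from rough set theory. The toy objects, the attribute names (`color`, `size`), and the function name `approximations` are all hypothetical, chosen only for illustration. Objects that look identical on the chosen attributes are grouped together; a group entirely inside the target set "belongs", a group entirely outside "does not belong", and a mixed group "cannot be distinguished".

```python
from collections import defaultdict


def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of `target` for the given attributes."""
    # Group objects that are indiscernible, i.e. identical on the chosen attributes.
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # whole group inside the target: certainly belongs
            lower |= block
        if block & target:    # group overlaps the target: possibly belongs
            upper |= block
    return lower, upper


# Hypothetical toy data: five objects described by two attributes.
observations = {
    "a": {"color": "red", "size": "big"},
    "b": {"color": "red", "size": "big"},
    "c": {"color": "blue", "size": "small"},
    "d": {"color": "blue", "size": "big"},
    "e": {"color": "blue", "size": "big"},
}
target = {"a", "b", "d"}  # the concept we are trying to recognise

for attrs in (["color"], ["color", "size"]):
    lower, upper = approximations(observations, attrs, target)
    boundary = upper - lower  # the "cannot-distinguish" region
    print(attrs, sorted(lower), sorted(boundary))
```

With only `color`, objects c, d and e fall in the boundary region. Adding `size` moves c out of the boundary (it clearly does not belong), while d and e remain indistinguishable. Choosing attributes that shrink the boundary is one way to read the question of which observations improve the classification.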
Since rough set theory is still in its adolescence, it requires reading some pretty academic content. It does show promise over the Bayesian models in some circumstances. Some methods have been shown to have faster convergence than the Bayesian training methods.
Data science is largely a fast-evolving new discipline for tackling the complex and very large data sets of the current age. It has its foundations in data warehousing, statistics and high performance computing. It is developing new and sophisticated methods, algorithms and techniques to analyse huge data sets in simplified and user-friendly ways.
I agree with you that there is more to data science than just data mining; the emphasis in data science is "data". But in this regard, the boundary between what should be covered in DM courses and in DS courses is blurry at best. Data quality, for instance, could fall under data preprocessing and knowledge extraction in DM. Even the security of data is still blurry (which I think is a distinction between DM and DS), and some of the security issues are surfacing as adversarial data mining.
Both fields are very young and I think we need to give them time to settle down into their respective boundaries, and I think having this discussion and different opinions is very healthy in defining the field.
Robert Ferguson really added new knowledge to my existing 'rough set'! At the moment we are at a tipping point in terms of data generation, management and storage, let alone consuming the data through descriptive and inferential analysis at the same pace. Conventional statistical software such as SAS, SPSS, Stata and R all struggle even before a huge data repository has finished loading. New tools are a must for venturing into such new territories.
I agree, Arturo, about the newness of these fields and about the extent to which other disciplines can inform them: metrology, for learning how to characterize a measurement system (its reliability, precision, repeatability, reproducibility, etc.), and signal processing, from engineering, which can help data scientists learn how to separate signal from noise and find patterns.
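As a rough illustration of the metrology terms above, the sketch below summarises repeated readings of one part taken by two operators. The readings, the operator labels, and the function names are made up for illustration. This is a crude summary only, not a full gauge R&R study: the pooled within-operator standard deviation stands in for repeatability, and the spread of the operator means is only a rough indicator of reproducibility.

```python
from math import sqrt
from statistics import mean, stdev, variance


def repeatability_sd(readings):
    """Pooled within-operator standard deviation (same part, same operator)."""
    # Weight each operator's sample variance by its degrees of freedom.
    num = sum((len(v) - 1) * variance(v) for v in readings.values())
    den = sum(len(v) - 1 for v in readings.values())
    return sqrt(num / den)


def operator_mean_sd(readings):
    """Standard deviation of the operator means (rough reproducibility indicator)."""
    return stdev([mean(v) for v in readings.values()])


# Hypothetical data: two operators each measure the same part three times.
readings = {
    "operator_A": [10.0, 10.2, 9.8],
    "operator_B": [10.5, 10.7, 10.3],
}

print("repeatability sd:", round(repeatability_sd(readings), 3))
print("operator-mean sd:", round(operator_mean_sd(readings), 3))
```

In this made-up data each operator is internally consistent, but the two disagree with each other by more than their own scatter, which is the kind of finding one would want before qualifying a measurement system for production use.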
I think Data Science provides an opportunity to integrate and apply some of the highly specialised and technical domains mentioned. To me it would be most valuable to consider it more of an umbrella term since 'science' is a broader classification than KD or DM. Innovation is not shaped like an I, but like a T where we see deep specific knowledge, but also the bridging and linking to different fields. Data Science as a new field has the opportunity to be innovative in this way.
In healthcare, linking Informatics research, KD, and DM with classical epidemiological research methods would be highly beneficial, because epidemiology has a greater understanding of the data in the field of medicine and healthcare than the computational sciences alone can offer: a greater emphasis on the data and its applied use.
To me, a research agenda should include questions from Implementation Research and Applied Science about how we can better combine the mathematical/computational side with the applied decision-making side. For example: how do we improve the ability of healthcare to use the tools already developed to create a more informed, data-driven system? What are the practical risks and benefits (e.g., the data quality issues mentioned above), and how do we quantify these in practical use? I think this is where we will see the greatest benefit to society.
From a data science meeting I attended recently, my understanding is that it integrates Applied Math/Stats, Computer Science and Business (the goal). It does cover data mining, but I think it is more than that, especially in practical problems. It depends on the goal of the project. Some problems need more computer science techniques, such as some big data issues. Others need innovations in methods and algorithms from operations research or quantitative analysis, which is definitely within the scope of statistics and applied math. As for the research focus, I think it depends on the goal of the problem. I agree with Dr. Deming's comments, as Edwin mentioned.
Thank you for all your answers so far. I'm on holiday at the moment and will respond more fully as recreation permits (!). However, looking at the subtext to my question (above), quite a lot got missed off that would have clarified my question. So here it is in full:
If 'Data Science' is indeed a new, separate discipline, then it must have a research agenda that distinguishes it from other disciplines. What then distinguishes Data Science from its nearest neighbours? What are the fundamental research topics that make it distinctive? Or is it that its methods and objects of study are fundamentally different?
Without some clarity here, Data Science may be seen merely as a re-branding.
It may of course just fill a gap that the Venn diagram of other disciplines leaves vacant or has opened up, and therefore borrow from many but add something of its own, so that the whole is greater than the sum of the parts... and therefore worth pursuing. I think Data Science goes beyond technique and algorithm to the processes of a data value pipeline, which would include issues of data accuracy/uncertainty, metadata, security, privacy, legal oversight, business models, big data, open data and so on. I also view 'data' very broadly as being numerical, text, spatial, audio, image, video: anything that can be analysed and co-analysed in the production of new understandings and knowledge.
Edwin Huff says "I don't know what this particular Data Science program is teaching"...well, as with any question there is an ulterior motive in asking. I have recently opened a Professional Doctorate programme in Data Science (D.DataSc.), the content of which was crowd-sourced from a previous question I asked the Research Gate community, though the structure has to conform to my university's model for such programmes. Details can be found here:
Could a snappy definition of Data Science be "the production of value from data"? To extract value requires attention to the whole process chain and the relevant technologies, from data gathering/harvesting to information consumption.
Data Science IS a rebranding. All the activities mentioned in this debate (quality control of data, visualization, data mining...) are required of any average student in Statistics taking his or her degree. Or at least this was the situation here in Roma until a few years ago. I hope that, notwithstanding the global degradation of culture taking place in recent years, this is still the case...