One of the common problems in data science is gathering data from various sources in a reasonably clean (semi-structured) format and combining metrics from those sources for higher-level analysis. Looking at other people's efforts, especially at other questions on this site, it appears that many people in this field do rather repetitive work. For example, analyzing tweets, Facebook posts, Wikipedia articles, etc. is part of many big data problems.

Some of these data sets are accessible through public APIs offered by the hosting site, but usually some valuable information or metrics are missing from those APIs, so everyone has to repeat the same analyses again and again. For example, although clustering users depends on the use case and the choice of features, a base clustering of Twitter/Facebook users would be useful in many big data applications, yet it is neither provided by the APIs nor available publicly in independent data sets.
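To make the repetition concrete, here is a minimal sketch of the kind of "base clustering" step that every team currently re-derives from raw API data. The feature columns and the choice of k-means are purely illustrative assumptions, not part of any existing API or shared data set:

    # Hypothetical user features assembled from a public API
    # (e.g. follower count, posts per day, average likes) -- illustrative only.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    user_features = np.array([
        [1200,   3.5,  40],
        [15,     0.2,   1],
        [98000, 12.0, 850],
        [340,    1.1,  12],
    ])

    # Standardize, then compute a simple base clustering of the users.
    scaled = StandardScaler().fit_transform(user_features)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
    print(labels)  # the kind of derived result that could be shared instead of recomputed

If such derived data sets were hosted publicly, this step (and the costly data collection behind it) would not need to be redone for every project.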

Is there any index or publicly available hosting site containing valuable data sets that can be reused in solving other big data problems? I mean something like GitHub (or a group of sites, public data sets, or at least a comprehensive listing) for data science. If not, what are the reasons for not having such a platform: the commercial value of data, the need to update data sets frequently, ...? Can we not have an open-source model for sharing data sets, devised for data scientists?
