It's not easy to give a good answer without knowing details on the type of problem at hand. With clustering you are referring to a broad set of statistical techniques that try to identify homogeneous subgroups among the available data/observations.
Two popular clustering methods are K-means clustering, which seeks to partition the data into a pre-specified number k of subgroups by minimizing the within-group variation, and hierarchical clustering, which does not require the number of clusters to be specified in advance and builds subgroups by iteratively merging similar data points.
These may be starting points for your work (there are, however, several other clustering methods).
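To make the K-means idea concrete, here is a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) in plain Python. The function name `kmeans`, its parameters, and the toy data are my own assumptions for illustration; for real work a library implementation would be the sensible choice.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centers, labels) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k of the observations
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # nothing moved, so the partition is stable
        centers = new_centers
    return centers, labels


# Toy data (made up for illustration): two well-separated groups.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(data, k=2)
print(labels)
```

Note that k has to be chosen up front, and each point is assigned purely by distance to a center, which is why the method favors compact, similarly sized groups.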
Thank you for your answer. Actually, I'm looking for a good hierarchical algorithm that separates normal traffic from attack traffic. A point-assignment algorithm such as k-means doesn't cope with clusters of different shapes and sizes. As you know, the attacker sends a flood of data at specific times, which leads to a large attack cluster that differs from the normal cluster during those periods. This is what I'm looking for. Please let me have your comments.
Well, I have never dealt with such a problem before, but once the clusters have been identified (say through bottom-up hierarchical clustering) you could do some post-processing of the results. I would start off with some simple descriptive statistics and plots: for instance, you could plot the number of data points within the clusters in order to have an idea of the size of the clusters. Or you could look at the distance between clusters. This might give you an idea of the characteristics of specific clusters (attack/no attack).
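As a rough sketch of that workflow, the snippet below does a naive bottom-up clustering with single linkage and then the two post-processing checks (cluster sizes and distance between clusters). Everything here is an assumption for illustration: the function name `single_linkage`, the choice of single linkage, and the made-up (time, packets-per-second) observations. The naive loop is far too slow for large traffic captures; a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the practical choice there.

```python
import math


def single_linkage(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(gap(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] += clusters[b]  # merge b into a (a < b, so index a stays valid)
        del clusters[b]
    return clusters


# Made-up (time, packets-per-second) observations, for illustration only.
normal = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0)]
attack = [(10.0, 200.0), (10.5, 210.0), (11.0, 205.0),
          (11.5, 215.0), (12.0, 220.0), (12.5, 208.0)]
points = normal + attack

clusters = single_linkage(points, n_clusters=2)

# Post-processing 1: number of data points in each cluster.
sizes = [len(c) for c in clusters]


# Post-processing 2: distance between the cluster centroids.
def centroid(cluster):
    members = [points[i] for i in cluster]
    return tuple(sum(c) / len(members) for c in zip(*members))


separation = math.dist(centroid(clusters[0]), centroid(clusters[1]))
print(sizes, round(separation, 1))
```

A cluster that is much larger than the other, or that sits far from it, would then be a candidate for the attack group, but that interpretation still has to be checked against the data.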
Thanks for your answer. I have a large data set of attack and normal packets, which means there is a large distance (no similarity) between normal and attack data points. So I can't use spectral clustering, because I can't use the similarity matrix to reduce the dimensionality of the data. In general, the attacker sends flooding packets at some times and pauses at others. This means the attack clusters come in different sizes and shapes, even if we assume there are only two clusters (one for attack and one for normal). I have read a lot of papers looking for a good algorithm that deals with different shapes and sizes, but I have not found one so far.
For this you will need to represent your data in two dimensions: 1. The output of a shape classifier. 2. The overall size, measured either as the perimeter or as the area of the contour.
The overall size is easy to calculate, but shape classification is a non-trivial job. One way to reduce it to one dimension is to treat the circle as the '0' shape and then find the distance of each contour from the circular shape using this algorithm: http://docs.opencv.org/trunk/modules/shape/doc/shape.html .
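To illustrate the "circle as the '0' shape" idea without OpenCV, here is a small sketch that uses a different, simpler measure: the isoperimetric quotient 4πA/P², which equals 1 for a circle. This is a stand-in I am assuming for illustration, not the OpenCV shape-distance algorithm linked above, and the function names are hypothetical.

```python
import math


def polygon_area_perimeter(vertices):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    area2 = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter


def circle_deviation(vertices):
    """0 for a perfect circle, growing toward 1 as the contour gets less circular."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 1.0 - 4.0 * math.pi * area / perimeter ** 2


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
# A 360-sided regular polygon as a close approximation of a circle.
near_circle = [(math.cos(math.radians(d)), math.sin(math.radians(d))) for d in range(360)]

print(circle_deviation(square), circle_deviation(near_circle))
```

The same helper also returns the area and perimeter, so both dimensions (size and a one-number shape score) come out of one pass over the contour.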
For multi-dimensional shape classification, you can build a starting bin of pre-assigned basic shapes and represent each shape by its distance from each basic shape. So if the number of basic shapes is N, you have the vector (s1, ..., sN). Add the size to it, which gives the vector (Area, s1, ..., sN). Then apply recursive K-means clustering. If you need more sophistication, the new shape centers of the clusters can then be used to rebuild the basic shape data.
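"Recursive K-means" can be read in more than one way. The sketch below takes one reading, which is my assumption: split the (Area, s1, ..., sN) vectors in two with K-means, then keep splitting each half until every cluster is tighter than a chosen threshold. The function names, the `max_spread` threshold, and the example vectors (N = 2) are all made up for illustration.

```python
import math


def two_means(points, iters=50):
    """K-means with k=2, seeded with the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))
    for _ in range(iters):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not left or not right:
            break  # degenerate split, give up
        n0 = tuple(sum(c) / len(left) for c in zip(*left))
        n1 = tuple(sum(c) / len(right) for c in zip(*right))
        if (n0, n1) == (c0, c1):
            break  # centers stopped moving
        c0, c1 = n0, n1
    return left, right


def spread(points):
    """Largest distance from the cluster mean to any member."""
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    return max(math.dist(p, mean) for p in points)


def recursive_kmeans(points, max_spread):
    """Keep bisecting until every cluster is tighter than max_spread."""
    if len(points) < 2 or spread(points) <= max_spread:
        return [points]
    left, right = two_means(points)
    if not left or not right:
        return [points]
    return recursive_kmeans(left, max_spread) + recursive_kmeans(right, max_spread)


# Made-up (Area, s1, s2) vectors for three kinds of shape.
vectors = [(1.0, 0.1, 0.9), (1.1, 0.2, 0.8),
           (5.0, 0.9, 0.1), (5.2, 0.8, 0.2),
           (10.0, 0.5, 0.5), (10.3, 0.4, 0.6)]
groups = recursive_kmeans(vectors, max_spread=1.0)
print(len(groups))
```

In this toy example the area dominates the distances because it is on a larger scale than the shape distances, so in practice you would probably want to rescale the components before clustering.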
There is no truly satisfying clustering algorithm for every sort of criterion. However, I found one text related to variable cluster shapes for you. You may check the following:
K-Means for Spherical Clusters with Large Variance in Sizes by
A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey, and M. A. Ramadan
I agree with Malay that there is no single method that will always give good performance. However, if you have some facility with R, I suggest the following functions:
gpcm(...) - package(mixture)
pgmmEM(...) - package(pgmm)
Mclust(...) - package(mclust) (this contains a subset of the models available in mixture)
Given the dimensionality problems, pgmmEM(...) will probably work best, as it is designed to handle high-dimensional data (see McNicholas and Murphy (2008) Parsimonious Gaussian Mixture Models and the subsequent McNicholas and Murphy (2010) Clustering Gene Expression Microarray Data).
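All three R functions fit Gaussian mixture models. For readers who want to see the underlying idea, here is a much simplified sketch in Python rather than R: EM for a one-dimensional mixture of two Gaussians. It is not the parsimonious model family from the papers above, and the function names and the packet-rate numbers are assumptions made up for illustration. In Python, scikit-learn's `GaussianMixture` would be the practical tool.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(xs, iters=200):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    means = [min(xs), max(xs)]  # crude but deterministic starting guess
    overall_mean = sum(xs) / len(xs)
    var0 = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 0.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            resp.append(p0 / (p0 + p1))
        # M-step: re-estimate each component from its soft assignments.
        for k, r in enumerate((resp, [1.0 - v for v in resp])):
            total = sum(r)
            weights[k] = total / len(xs)
            means[k] = sum(ri * x for ri, x in zip(r, xs)) / total
            var_k = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, xs)) / total
            variances[k] = max(var_k, 1e-6)  # floor to avoid a zero variance
    return weights, means, variances


# Made-up packet rates: a tight "normal" group and a wide "attack" group.
rates = [9.0, 10.0, 11.0, 10.5, 9.5, 40.0, 55.0, 70.0, 45.0, 60.0, 50.0, 65.0]
weights, means, variances = fit_two_gaussians(rates)
print([round(m, 1) for m in means], [round(v, 1) for v in variances])
```

Because each component has its own variance and weight, a mixture model can represent one small, tight cluster next to one large, spread-out cluster, which is the situation described in the question.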