One of the questions you should ask yourself is whether you want to compare a continuous (or approximately continuous) curve, or a discrete set of datapoints. It seems to me that you would like to do the latter.
In the continuous case, you could consider the area between the two curves or mathematically approximate the functions.
In the discrete case, you could calculate the Hausdorff-distance, Fréchet distance, Kolmogorov-Smirnov statistic, or any kind of Sum-of-squares statistic.
None of these approaches are trivial. Imagine the following situations:
1) At a low value of x, the curves start to diverge. However, for all other values of x, the difference remains equal, so that they run parallel.
2) You repeat the experiment and find completely different curves. The maximum vertical distance is the same across both curves, albeit at other values of x.
3) The curves cross one or more times.
In (1), do you care about the moment at which the divergence occurs? In (2), do you care about the fact that the curves are so different, but possess the same maximum y-difference? In (3), do you consider the intersections to compensate or to add up? The choice of your statistic depends on what you find important.
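To make the discrete-case statistics above concrete, here is a minimal Python sketch computing a symmetrised Hausdorff distance, a Kolmogorov-Smirnov-style maximum vertical distance, and a sum-of-squares statistic for two curves sampled at the same x values. The curves themselves are made-up illustration data, not from any real experiment:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

# Two discrete curves sampled at the same x values (hypothetical data).
x = np.linspace(0.0, 10.0, 101)
y1 = np.exp(-0.3 * x)
y2 = np.exp(-0.35 * x)

# Hausdorff distance between the two point sets (symmetrised over
# both directions, since directed_hausdorff is one-sided).
a = np.column_stack([x, y1])
b = np.column_stack([x, y2])
hausdorff = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

# Kolmogorov-Smirnov-style statistic: maximum vertical distance.
ks_like = np.max(np.abs(y1 - y2))

# Sum-of-squares statistic over the shared grid.
ss = np.sum((y1 - y2) ** 2)

print(hausdorff, ks_like, ss)
```

Note that the three numbers answer different questions, which is exactly the point of scenarios (1)-(3): the Hausdorff distance ignores *where* along x the curves differ, while the sum of squares accumulates every pointwise gap.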
If you already have a structural definition of the response (e.g. an Emax model, a sigmoidal model, or whatever), then you have two alternatives.
First, your dataset is suitably created (it contains an additive 0-1 coded column) to be implemented in a simple nonlinear least squares method. Then you pick the appropriate parameter(s) from your model (e.g. ED50, Emax) and describe each as a linear function of your additive column (Emax = a + b*x, where x is your 0-1 coded column).
Second, you use nonlinear mixed-effects modelling, e.g. the nlme package in R or PROC NLMIXED in SAS.
I do not have a definition of the response or a standard model for the dose-volume curve... their shapes all depend on how the treatment plan is generated. You can imagine they are just two curves that should be quite similar to each other (it doesn't matter what exactly X and Y represent), but I would like to quantify how much they look alike in a statistical way. Of course a comparison can be done by selecting some points on each curve, but I would like to compare the curves as a whole.
Sorry, I am not a statistics geek. Thanks, Istvan and Jochen!
Winky, if you want to "quantify how much they look alike in a statistical way" then you will need a well-defined statistical model. It is a bit like the wish to express an idea in a language when it is not absolutely clear in *which* language (note that in this allegory the language compares to the statistical model).
In my opinion, it will be most instructive to compare the curves visually.
If you would like to compare the curves, you have to have a model that describes the relationship between X and Y; otherwise, what will you compare? You probably misunderstand my point. For example, take an Emax model: E = (Emax*Dose)/(ED50 + Dose). In this function, Emax and ED50 are the parameters of your response curve. Your goal is to detect a difference (or dare to confirm similarity) in one or both of these parameters. It's pretty easy: Emax is expressed as a + b*x, or ED50 is expressed as a + b*x, or both. The Emax model was just an example; you can also choose another model, e.g. a three-parameter logistic, a four-parameter logistic, and so on. But to compare, you need to choose the same structural part for both treatments. So you can't choose an Emax model for treatment A and a four-parameter logistic for treatment B.
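The indicator-column trick described above can be sketched in Python with scipy's curve_fit (the thread mentions nlme/NLMIXED; this is just an ordinary nonlinear least squares analogue). All data and parameter values here are simulated for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Simulated design: 7 doses, replicated; z is the 0-1 treatment indicator.
dose = np.tile(np.array([1, 2, 5, 10, 20, 50, 100], float), 6)
z = np.repeat([0.0, 1.0], dose.size // 2)

def emax_model(X, a, b, ed50):
    d, ind = X
    emax = a + b * ind          # Emax as a linear function of the 0-1 column
    return emax * d / (ed50 + d)

# Hypothetical "true" parameters: Emax differs by b = 15 between treatments.
y = emax_model((dose, z), 100.0, 15.0, 8.0) + rng.normal(0.0, 2.0, dose.size)

popt, pcov = curve_fit(emax_model, (dose, z), y, p0=[80.0, 0.0, 5.0])
a_hat, b_hat, ed50_hat = popt
se_b = np.sqrt(pcov[1, 1])
print(b_hat, se_b)
```

If the estimate of b is large relative to its standard error, the two treatments differ in Emax; the same coding works for ED50 or both parameters at once.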
All the above scholarly suggestions are quite valid; however, your problem is not well defined. Your later response, "You can imagine they are only two curves which should be quite similar to each other", suggests that dose-volume in the two cases should be similar. In this case, two methods, one parametric and one non-parametric, may be used. In the first case, differences in the response variable between the two curves may be considered and their significance tested as in bivariate regression models. In the second case, a Chi-square test may be applied.
I do not think there is a common approach that does not depend on what kind of curves you compare. But there are two approaches to similar problems that can be used.
1. The Kolmogorov-Smirnov and omega-squared (Cramér-von Mises) tests, which are used to compare sample homogeneity. The essence of these criteria is to compare distribution functions, which is a particular case of your problem.
2. Methods of residual analysis used in checking model adequacy. These verify the hypothesis that the model residuals are independent random variables with a normal distribution and zero mean. In your case, the differences between the functions at a set of values of the independent variable can serve as the residuals.
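A minimal Python sketch of the second idea, treating the pointwise differences between two curves as residuals and testing them for zero mean and normality. The curves and noise levels are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two noisy realisations of the same underlying curve (hypothetical data).
x = np.linspace(0.0, 5.0, 40)
y1 = np.sin(x) + rng.normal(0.0, 0.05, x.size)
y2 = np.sin(x) + rng.normal(0.0, 0.05, x.size)

# Treat the pointwise differences as "residuals".
resid = y1 - y2

t_stat, p_mean = stats.ttest_1samp(resid, 0.0)   # zero-mean hypothesis
w_stat, p_norm = stats.shapiro(resid)            # normality hypothesis
print(p_mean, p_norm)
```

Small p-values on either test would cast doubt on the hypothesis that the two curves differ only by well-behaved noise; independence of the residuals should be checked separately, e.g. with a runs test.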
In my case, I have several curves and I need to quantify their shapes. So, I am trying statistical parameters like kurtosis, RMS, and peak value.
My goal is to propose a simple identification method. I know that there are sophisticated methods in the literature, but... for now, it is not my way...
The problem is not trivial at all (it has to do with pattern recognition). If your y values are at the same x for each curve, you can still try some non-parametric statistical tests. The problem arises when the curves are sampled at non-coincident x values. Of course, the easiest way is to replace the curves with some characteristics (like the area underneath, suggested before, moments of the distribution, etc.) and then run multivariate ANOVA on them. If you really want to assess the similarity between the curves, you may want to check these papers:
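For the non-coincident-x case mentioned above, one common workaround is to interpolate both curves onto a shared grid over their overlapping range and then apply a paired test. A Python sketch, with invented logistic curves (note the interpolated points are not independent, so the p-value is descriptive rather than strictly valid):

```python
import numpy as np
from scipy import stats

# Two curves sampled at different, non-coincident x values (made-up data).
x1 = np.linspace(0.0, 10.0, 37)
x2 = np.linspace(0.3, 9.7, 52)
y1 = 1.0 / (1.0 + np.exp(-(x1 - 5.0)))
y2 = 1.0 / (1.0 + np.exp(-(x2 - 5.5)))

# Resample both onto a common grid restricted to the overlapping range.
lo, hi = max(x1.min(), x2.min()), min(x1.max(), x2.max())
grid = np.linspace(lo, hi, 100)
g1 = np.interp(grid, x1, y1)
g2 = np.interp(grid, x2, y2)

# Paired nonparametric (Wilcoxon signed-rank) test on the resampled values.
stat, p = stats.wilcoxon(g1, g2)
print(stat, p)
```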
In my research I compared two curves (data vs. model), and I used weighted least squares (WLS).
For the WLS method, the criterion was defined as the sum of the absolute differences between observed and expected values, divided by the observed values (Lika et al. 2011).
The second part is to analyse the correlation (in MATLAB: corrcoef).
Regards
Article The “covariation method” for estimating the parameters of th...
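The two-part criterion described above (a relative-deviation sum plus a correlation) can be sketched in Python; NumPy's corrcoef plays the role of MATLAB's. The observed/expected numbers are placeholders, and the formula is one reading of the WLS-style criterion quoted from Lika et al. (2011):

```python
import numpy as np

# Hypothetical observed data and model predictions.
observed = np.array([2.1, 3.9, 6.2, 7.8, 9.1])
expected = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Sum of absolute differences divided by the observed values.
wls = np.sum(np.abs(observed - expected) / observed)

# Correlation between the two curves (MATLAB corrcoef equivalent).
r = np.corrcoef(observed, expected)[0, 1]
print(wls, r)
```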
Hi. You can consider either the Hausdorff distance (also called the Pompeiu-Hausdorff distance) or the L_p norm applied to the difference between the functions corresponding to the two curves. They are distances in the usual mathematical sense (applicable to metric spaces).
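A short Python sketch of the L_p idea: approximate the L_p norm of the difference between the two functions by numerical integration, with the L-infinity norm as the limiting case. The test functions are arbitrary illustrations:

```python
import numpy as np
from scipy.integrate import trapezoid

# Two functions sampled on a common grid (illustrative choice).
x = np.linspace(0.0, 1.0, 201)
f1 = x ** 2
f2 = x ** 3

def lp_distance(y1, y2, x, p=2.0):
    # L_p norm of the difference, via the trapezoidal rule.
    return trapezoid(np.abs(y1 - y2) ** p, x) ** (1.0 / p)

d2 = lp_distance(f1, f2, x, p=2.0)
dinf = np.max(np.abs(f1 - f2))   # L_infinity norm
print(d2, dinf)
```

For f1 = x^2 and f2 = x^3 on [0, 1], the exact L_2 distance is sqrt(1/105) ≈ 0.0976, which the sketch recovers.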
You can calculate the area under each curve and then compare it between groups. But be careful: sometimes the AUC is not the best metric, especially when handling negative values, because a function that oscillates between 1 and -1 will yield the same value as f(x) = 0, even though they are different. In that case, integrating the squared curve could work.
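A quick Python demonstration of that caveat: the plain AUC of an oscillating curve cancels to zero, while integrating the square of the curve separates it from the flat curve. The functions are chosen purely to exhibit the effect:

```python
import numpy as np
from scipy.integrate import trapezoid

x = np.linspace(0.0, 2.0 * np.pi, 400)
f = np.sin(x)            # oscillates between 1 and -1
g = np.zeros_like(x)     # identically zero

auc_f = trapezoid(f, x)  # positive and negative areas cancel to ~0
auc_g = trapezoid(g, x)  # exactly 0: indistinguishable from f by AUC

# Integrating the squared curves keeps the two apart.
auc_f2 = trapezoid(f ** 2, x)   # ~pi for sin^2 over a full period
auc_g2 = trapezoid(g ** 2, x)   # still 0
print(auc_f, auc_f2)
```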
If the two curves were obtained by regression, I would suggest a statistical approach such as Monte Carlo analysis. Pick several sets of random points on the x axis and calculate the values of f1(x) and f2(x). Calculating, let's say, the r² between the values of f1 and f2 for each random set of x, you obtain a single value of r². Repeating this for a large number of random sets, you will be able to give a range of r² for a given level of significance.
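That Monte Carlo procedure can be sketched in a few lines of Python. The two fitted curves below are hypothetical stand-ins; the percentile interval on r² is what the procedure delivers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical regression curves (placeholders for real fitted models).
def f1(x):
    return 2.0 + 0.5 * x

def f2(x):
    return 2.1 + 0.48 * x + 0.02 * x ** 2

# Repeatedly draw random x sets, compute r^2 between f1 and f2 on each.
r2_samples = []
for _ in range(1000):
    xs = rng.uniform(0.0, 10.0, 30)
    r = np.corrcoef(f1(xs), f2(xs))[0, 1]
    r2_samples.append(r ** 2)

# Empirical 95% range of r^2 over the random x sets.
lo, hi = np.percentile(r2_samples, [2.5, 97.5])
print(lo, hi)
```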