I have the following results when I tried to use simple linear regression between a dependent variable (Timeliness) and an independent variable (Semantic Accuracy) using SPSS:
I seem not to be able to read your file on my phone, but I suspect that your relationship is curvilinear. A quadratic linear regression is still technically a linear regression, and you might try that, depending upon how your data look on a graph ... which, as I said, I have not seen.
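To illustrate why a quadratic fit still counts as a linear regression: the model y = b0 + b1*x + b2*x^2 is linear in its coefficients, so ordinary least squares applies once x^2 is added as an extra column. Here is a minimal sketch in Python with NumPy. The data are synthetic and noise-free, made up purely for illustration; they are not the Timeliness data from this thread.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical; not the data from this thread).
x = np.linspace(0.0, 1.0, 11)
y = 1.0 + 2.0 * x - 3.0 * x**2

# Design matrix with columns 1, x, x^2: the model is linear in b0, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares, the same machinery used for a straight-line fit.
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

Because these points lie exactly on a parabola, the fit should recover the coefficients used to generate them. With real data, a plot is still the first thing to look at before adding the squared term.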
Hello James R Knaub. The .spv file Mohamed Amine Ferradji uploaded is an SPSS output file. I opened it, changed a couple of tables to display more decimals, and exported to PDF. See attached.
Mohamed Amine Ferradji, your Semantic Accuracy variable appears to have only 3 possible values. But the REGRESSION command treats it as a continuous variable.
I cannot see the file because I am in the process of uninstalling my SPSS license from Walden and hopefully switching back to SAS, or I might be working in R and Python. However, we don't necessarily need to open the file to answer your question. Remember that in linear regression the explanatory or independent variable (IV) is intended to move the response or dependent variable (DV). I only bring this up because, just visualizing from what you have provided, it makes more sense to me that each unit shift in timeliness (IV) will result in the change in semantic accuracy (DV) you intend to observe. Having said all that, you don't want to make a standard practice of using time as an IV in a simple linear regression engine; you would instead opt for a time series regression alternative.
Here is Mohamed Amine Ferradji's scatter-plot with a bit more info added to it. The linear fit is very nearly perfect. But as you can see, there are only 3 possible values of X (semantic accuracy), and only 3 very tight clusters of data points.
If I had to guess, I'd guess that semantic accuracy was measured as the proportion of 3 semantic tasks that were done correctly (with possible values 0, 1/3, 2/3, or 1). I have no idea what Timeliness is or how it was measured.
Bruce, can you assign semantic accuracy to the y-axis and timeliness to the x-axis and rerun? Also, exclude those first n=30 data points (0,0), and can you change the scale of both axis values to increments of 0.10?
Bruce Weaver, since I found a correlation between Timeliness and Semantic Accuracy, I presumed that regression analysis was the next step. (I'm studying the assessment of linked data quality dimensions, trying to evaluate one quality dimension, in this case Timeliness, from another, Semantic Accuracy.)
-the Semantic accuracy formula I used is: msemTriple = |G ∧ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
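For readers who want to see the measure in action, here is a minimal Python sketch of msemTriple. It assumes that |G ∧ S| means the number of triples in G that also appear, with identical values, in S, i.e. a plain set intersection. The function name `sem_accuracy` and the example triples are hypothetical.

```python
def sem_accuracy(repository, gold_standard):
    """Return |G ∧ S| / |G|: the share of triples in G that also appear in S."""
    g = set(repository)
    s = set(gold_standard)
    if not g:
        raise ValueError("repository G must contain at least one triple")
    # Assumption: "same values" means an exact match on the whole triple.
    return len(g & s) / len(g)


# Hypothetical (subject, predicate, object) triples for a single entity.
G = {
    ("ex:Entity1", "ex:name", "Alpha"),
    ("ex:Entity1", "ex:year", "1990"),
    ("ex:Entity1", "ex:place", "Beta"),
}
S = {
    ("ex:Entity1", "ex:name", "Alpha"),
    ("ex:Entity1", "ex:year", "1991"),  # differs from G
    ("ex:Entity1", "ex:place", "Beta"),
}

accuracy = sem_accuracy(G, S)
print(round(accuracy, 3))
```

Two of the three triples match here, so the result is 2/3. Note that when an entity has exactly three triples, the measure can only take the values 0, 1/3, 2/3, or 1.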
(de is the entity document of the datum in the linked data dataset (in my case study, DBpedia), and pe is the corresponding entity document in the data source (Wikipedia).)
Charles Naney, the raw data have not been provided. But judging from the output that was provided, this will give you the raw data (or a close approximation to it, at least):
DATA LIST LIST / x y n.
BEGIN DATA
0 0 30
0.333 0.32 2
0.333 0.33 2
0.333 0.34 8
0.667 0.65 1
0.667 0.68 9
0.667 0.7 1
END DATA.
WEIGHT BY n.
Note that you need to use variable n as a frequency weight. I'm sure whatever software you are using has that capability.
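For anyone not using SPSS, here is a rough Python/NumPy sketch of the same idea. It takes the approximate values listed above, expands each row by its frequency n (the equivalent of WEIGHT BY n for integer counts), and fits a straight line. The variable names are mine, and the data are still only the approximation reconstructed from the output, not the real raw data.

```python
import numpy as np

# Approximate data reconstructed from the SPSS output: (x, y, frequency n).
rows = [
    (0.0,   0.00, 30),
    (0.333, 0.32, 2),
    (0.333, 0.33, 2),
    (0.333, 0.34, 8),
    (0.667, 0.65, 1),
    (0.667, 0.68, 9),
    (0.667, 0.70, 1),
]
x_unique, y_unique, freq = (np.array(col) for col in zip(*rows))

# Frequency weighting: repeat each (x, y) pair n times.
x = np.repeat(x_unique, freq.astype(int))
y = np.repeat(y_unique, freq.astype(int))

# Simple linear regression of y on x, plus the squared correlation.
slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]
print(f"N={x.size}, slope={slope:.3f}, intercept={intercept:.3f}, R^2={r**2:.4f}")
```

Repeating rows works only because the weights are whole-number counts. Swapping x and y in the `np.polyfit` call gives the regression with the axes reversed.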
See, the problem with using time in linear regression this way is that you end up with a regression with autoregressive errors. Sorry for the basic chart; I had to use a spreadsheet app on my phone because the semester hasn’t quite started yet :)
Either my question is unclear or I didn't provide enough information. What I'm interested in is knowing whether linear regression is suitable for this study case, or whether I should use another type of regression.
Jochen Wilhelm, since I found a correlation between Timeliness and Semantic Accuracy, I presumed that regression analysis was the next step. (I'm studying the assessment of linked data quality dimensions, trying to evaluate one quality dimension, in this case Timeliness, from another, Semantic Accuracy.)
-the Semantic accuracy formula I used is: msemTriple = |G ∧ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
Currency(de) = (1 - (lastmodificationTime(de) - lastmodificationTime(pe)) / (currentTime - startTime)) * Ratio (the Ratio measures the extent to which the triples in the LOD dataset (in my case Wikidata) and in the gold standard (Wikipedia) have the same values.)
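As a worked example of the Currency formula, here is a small Python sketch. The function name, argument names, and timestamps are hypothetical. It also assumes that startTime marks the beginning of the observation period, which the formula itself does not spell out.

```python
from datetime import datetime, timedelta


def currency(last_mod_de, last_mod_pe, current_time, start_time, ratio):
    """(1 - (lastmod(de) - lastmod(pe)) / (currentTime - startTime)) * Ratio."""
    window = (current_time - start_time).total_seconds()
    if window <= 0:
        raise ValueError("currentTime must be later than startTime")
    lag = (last_mod_de - last_mod_pe).total_seconds()
    return (1 - lag / window) * ratio


# Hypothetical timestamps: a 100-day window, with de modified 10 days after pe.
start = datetime(2020, 1, 1)
now = start + timedelta(days=100)
pe_modified = start + timedelta(days=40)
de_modified = start + timedelta(days=50)

value = currency(de_modified, pe_modified, now, start, ratio=0.5)
print(round(value, 3))
```

Here the lag is one tenth of the window, so the result is (1 - 0.1) * 0.5 = 0.45. Because Ratio multiplies the whole expression, Currency is tied to the agreement measure by construction.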
Am I looking at the plot correctly in that this should be presented as a 3x3 contingency where all the data are on the diagonal? If so, why are you doing statistics on this? What research question does this address?
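The contingency-table reading can be checked against the approximate data posted earlier in the thread. This Python sketch bins both variables to the nearest third, which is my own assumption about how the three levels would be defined, and counts the cases in each cell.

```python
import numpy as np

# Approximate data reconstructed from the SPSS output: (x, y, frequency n).
rows = [
    (0.0,   0.00, 30),
    (0.333, 0.32, 2),
    (0.333, 0.33, 2),
    (0.333, 0.34, 8),
    (0.667, 0.65, 1),
    (0.667, 0.68, 9),
    (0.667, 0.70, 1),
]

# Assumed binning: level 0, 1, or 2 = value rounded to the nearest third.
table = np.zeros((3, 3), dtype=int)
for x_val, y_val, n in rows:
    table[round(x_val * 3), round(y_val * 3)] += n

print(table)  # rows: semantic accuracy level, columns: timeliness level
off_diagonal = table.sum() - np.trace(table)
print("cases off the diagonal:", off_diagonal)
```

Under that binning every case lands on the diagonal, which matches the three tight clusters in the scatter-plot.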
Daniel Wright, there are many similar values among these variables, as you can see; that's why there are 3 clusters on the scatterplot line. My objective is to know which regression type (linear, robust, etc.) is best suited to these data.
It's an interesting estimation, but I now see the intent is just this: the conversation about the convoluted association approximated from a fractal of imperfect, nebulous data.