To persist a user rating matrix, some kind of native matrix object is the best choice. Storing a user rating matrix in a graph database is not a straightforward conversion, and it will add a lot of overhead, both conceptual and implementation-related.
Depending on your programming language and the rest of your infrastructure, the first step is to find an efficient way to store your potentially very large user rating matrix. As a second step, figure out how to serialise that data structure to a file.
In Java, you could use e.g. jBLAS (http://mikiobraun.github.io/jblas/) as a fast matrix library, and then use the Java serialisation API to write the matrix to a file when your app starts/ends.
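As a minimal sketch of that approach, the class below serialises a plain `double[][]` (rather than a jBLAS `DoubleMatrix`, to keep the example dependency-free); the class and file names are illustrative:

```java
import java.io.*;

public class MatrixStore {
    // Write the matrix to disk, e.g. when the app shuts down.
    static void save(double[][] matrix, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(matrix);
        }
    }

    // Read it back, e.g. when the app starts.
    static double[][] load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (double[][]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] ratings = {{5.0, 0.0}, {0.0, 3.5}};
        File f = File.createTempFile("ratings", ".ser");
        save(ratings, f);
        double[][] restored = load(f);
        System.out.println(restored[1][1]); // 3.5
        f.delete();
    }
}
```

The same pattern works with any `Serializable` matrix type; for very large matrices you may prefer a binary format you control over Java's default serialisation.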
The answer to this question depends heavily on the language, the matrix dimensions, and the required operations. Relational database, R data.frame, CSV or ARFF file, serialized data object, simple array in memory, graph DB: all are possible. Can you share a bit more context? It's like asking which car would be best. :-)
You are right, here is a bit of what we would like to accomplish.
It's a news recommendation system, but for a small domain. I don't expect the data to scale much: news stories arrive at about 100 per week, and I estimate about 3000 users in total. The main idea is to record user clicks and likes per news story. In addition, news stories are tagged with keywords that describe them, drawn from a fixed dictionary (54 keywords in total). At the moment, a user's interest is derived from the keywords of the news stories the user has liked or clicked on in the past.

We expect to maintain a cube representing the following information: user x item (how relevant a news story is to the user, as a numerical value), user x keywords (how interesting certain keywords are to a user, as a numerical value), and item x keywords (a boolean indicating whether the news story has been assigned the keyword). We have designed strategies to compute this information and were wondering what would be the best way to persist it and update it while the system is in production.
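For concreteness, the three relations above could be held as sparse in-memory maps; this is only a sketch, and the class, field, and method names (`InterestModel`, `click`, etc.) are made up for illustration, not part of the actual design:

```java
import java.util.*;

public class InterestModel {
    // user -> item -> relevance score (sparse)
    final Map<Integer, Map<Integer, Double>> userItem = new HashMap<>();
    // user -> keyword -> interest score (sparse)
    final Map<Integer, Map<String, Double>> userKeyword = new HashMap<>();
    // item -> keywords from the fixed 54-word dictionary
    final Map<Integer, Set<String>> itemKeywords = new HashMap<>();

    // Record a click: bump the item's relevance for this user,
    // and propagate the item's keywords into the user's interests.
    void click(int user, int item, double weight) {
        userItem.computeIfAbsent(user, u -> new HashMap<>())
                .merge(item, weight, Double::sum);
        for (String kw : itemKeywords.getOrDefault(item, Set.of())) {
            userKeyword.computeIfAbsent(user, u -> new HashMap<>())
                       .merge(kw, weight, Double::sum);
        }
    }

    public static void main(String[] args) {
        InterestModel m = new InterestModel();
        m.itemKeywords.put(7, Set.of("politics", "economy"));
        m.click(42, 7, 1.0);
        m.click(42, 7, 1.0); // second click on the same story
        System.out.println(m.userItem.get(42).get(7));            // 2.0
        System.out.println(m.userKeyword.get(42).get("economy")); // 2.0
    }
}
```

Whatever store you choose eventually, this shows how small the working set is: three sparse mappings rather than a dense 3000 x 5200 x 54 cube.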
That would be 3000 x 5200 x 54 entries in one year, right? That is a moderate size. Although relational databases might offer the best performance and easy human-readable access, I would not go for them. My impression is that this is a classical graph DB problem (because I would also expect the matrix to be very sparse). Users, items, and keywords are nodes of the graph, and the connections between them are edges. The graph DB gives you the ability to easily extend that schema by introducing new nodes (with new labels) and new edges. Matrices are always an option, but they make little sense if the matrix is very sparse, and adding new rows/columns might (!) be a very costly operation, depending on the implementation. I don't know how frequently your 3000 users will click, but every user interaction is a write operation to the data structure, right? In that case you may want a distributed data structure with transactional support anyway. I repeat myself, but to me this sounds like a classical graph DB problem: modelling dependencies between objects.
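To make the node/edge model concrete, here is what it might look like in Cypher, Neo4j's query language; the labels, relationship types, and property names here are assumptions for illustration, not a prescribed schema:

```cypher
// Nodes: one per user, item, and keyword.
CREATE (u:User {id: 42})
CREATE (i:Item {id: 7, title: "Some story"})
CREATE (k:Keyword {name: "politics"})
// Edges: interactions and tags, with numeric weights as properties.
CREATE (u)-[:CLICKED {weight: 1.0}]->(i)
CREATE (i)-[:TAGGED]->(k)
CREATE (u)-[:INTERESTED_IN {score: 0.8}]->(k);

// A recommendation-style query: stories tagged with keywords
// the user is interested in, ranked by accumulated interest.
MATCH (u:User {id: 42})-[r:INTERESTED_IN]->(:Keyword)<-[:TAGGED]-(i:Item)
RETURN i, sum(r.score) AS relevance
ORDER BY relevance DESC;
```

Note how adding a new keyword or user is just another `CREATE`, with no schema migration.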
Thanks. Additionally, new keywords and users could be added. Given the variability of the data, I also believe that a graph database might be better. I was checking out Neo4j, which seems promising. Still, I have doubts, especially because the organisation the product is meant for is very traditional and is not familiar with graph databases. I don't know if it's worth the trouble.
There are not many options among open-source graph databases. Neo4j is by far the most advanced and the one with the best performance. The only issue is that Neo4j is released under the AGPL, which is quite restrictive and strong. But if that matches your needs, I can recommend Neo4j. I also used Titan once---not bad, but Neo4j is better. You can also use Gremlin etc. with Neo4j, since Neo4j sticks to the standards and implements Blueprints: https://github.com/tinkerpop.
Guys, having developed recommender systems that are currently live in production, I have to recommend the "classical" RDBMS approach. I have used MySQL with great success as the underlying persistence engine for user-item rating matrices, which in the environments my systems had to work in were much more complicated, including for example timestamp information (when the user purchased an item), id information (as the same user may purchase the same item more than once), price information, and so on. In fact, you should look at other well-known open-source systems such as Apache Mahout, LensKit, or ELF, to see what persistence they support: almost all offer support for relational databases, flat files (CSV), and very little else. Eventually, you should evaluate your persistence approach against a number of criteria: performance (how fast your persistence layer can load data from disk, and perhaps how fast it can store data back), reliability, recovery, backup, support, and so on. Look at the numbers, and base your decision on them.
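A minimal MySQL schema along the lines described above might look like the following sketch; the table and column names are illustrative assumptions, not the actual production schema:

```sql
-- Hypothetical ratings table: one row per interaction, so the same
-- user can rate/purchase the same item more than once.
CREATE TABLE ratings (
    id         BIGINT        NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT           NOT NULL,
    item_id    INT           NOT NULL,
    rating     DECIMAL(3,1)  NOT NULL,
    price      DECIMAL(8,2)  NULL,      -- purchase price, if applicable
    created_at TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY idx_user (user_id),             -- fast "all ratings by user" scans
    KEY idx_item (item_id)              -- fast "all ratings of item" scans
);
```

The two secondary indexes cover the row- and column-slice accesses a recommender typically needs when loading the matrix.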
IMHO, RDBMSs have been optimised in every aspect of their operation for the past 50 years to provide superior performance and reliability, something that cannot be said of most other approaches, including the NoSQL databases that are a "tempting" alternative...
This is all very true, and it is very nice to hear from the experience of others. Nonetheless, there is a reason for the NoSQL movement and I was wondering: have there been successful implementations of recommendation systems based on NoSQL persistence?
NoSQL vs Relational for the new generation of recommender systems?
Andrea, I could recommend TinkerGraph, which is the default implementation in Apache TinkerPop, if, of course, RAM is not a concern. It is an in-memory store.
You can refer to some benchmark results in a couple of my studies (these were primarily conducted to compare the performance of SPARQL [an RDF query language] and Gremlin [a property graph query language] over different RDF and graph stores):
1. Article A Stitch in Time Saves Nine -- SPARQL querying of Property G...
2. Article Trying Not to Die Benchmarking – Orchestrating RDF and Graph...
Though I am not sure how strictly you need a distributed graph database.
I didn't pursue the RecSys + graph approach much further, though it is still an active and interesting topic.
I believe Neo4j would be a good choice. However, the big-picture question for me is how to transfer matrix-based RecSys solutions directly onto the graph: is it worthwhile? I think that representing the data in graph form really pays off only if graph-based algorithms are going to be leveraged. This opens the opportunity to connect research on graph analytics and RecSys.