You mean 300 million edges? So on my monitor, this would be about 145 edges per pixel. What do you expect from a visualization? It seems unlikely that you will be able to see anything on a simple drawing of this graph. I guess the interesting question is: What methods can be used to simplify such large networks so that visualization actually gives us some information?
Christian Staudt is correct. You would typically have to 'dissect' (slice / dice) the graph and select particular views of interest, perhaps by grouping similar nodes into a hierarchical clustering that lets you drill down to individual nodes.
Veslava, as Albert correctly points out, the slice/dice view-creation procedure depends on the domain of your graph data and the problem you are trying to solve. If you could share a few more details about the nature of the data and the problem at hand, it would be easier to provide helpful pointers in the right direction.
To start with, perhaps you could map these weights between 0.0005 and 1 to something meaningful in your domain (say, from the weakest relation between two nodes to the strongest; e.g. the number of mails or tweets exchanged between those two nodes, or how close they are in their opinion on a given topic).
Then work backwards from your goal (e.g. predicting all people who would respond positively to a discount coupon and generate a sales lead). That gives you a range of weights to concentrate on (say, 0.6 to 0.9). You can then filter the graph based on those criteria.
This is just an outline, and you can adapt it to your own domain and problem. Or, if you could share those details, others should be able to help with the right kind of cut/view procedure.
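To make the filtering step concrete, here is a minimal Python sketch, assuming the edges sit in a plain text file with one "source target weight" triple per line; the file name and the 0.6-0.9 range are just the placeholders from the example above:

import networkx as nx

LOW, HIGH = 0.6, 0.9               # hypothetical weight range of interest

G = nx.Graph()
with open("edges.txt") as f:       # assumed format: "source target weight" per line
    for line in f:
        src, dst, w = line.split()
        if LOW <= float(w) <= HIGH:          # keep only edges in the range of interest
            G.add_edge(src, dst, weight=float(w))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges kept")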
With so many nodes and so much variance in the variables, I don't think you will see any pattern if you simply put it on a monitor. With so many nodes I doubt Gephi will handle it; the easiest way is to try either RapidMiner or MATLAB with a social-media add-on, or Python scikit-learn or R with an add-on module for SNA.
It is the similarity of abstracts measured by TF-IDF. Nodes: 34 K, edges: 283 M. After reduction by weight threshold (the maximum weight is 1):
weight > 0.03  : 271 M edges
weight > 0.02  : 214 M edges
weight > 0.09  :  22 M edges
weight > 0.019 : 821 K edges
And now I am considering which threshold to choose. Should I upload the data to, for example, Gephi and estimate just by eye? Are there other methods to estimate the required precision?
Sorry, it is still not clear to me. Could you please also specify what you mean by nodes? Are they abstracts, or the words whose frequency you measured in these abstracts?
In any case, you could probably try to take one step back, to when you had a data table like:
          abstract_1   abstract_2   ...   abstract_n
word_1    value        value              value
word_2    value        value              value
...
word_n    value        value              value
Data in this form could be used for clustering. You can cluster the words based on correlation or Euclidean distance and select only the clusters of interest. In this way you will reduce the number of words used for comparing abstracts. Most likely you will also get a significant drop in the number of edges, and the resulting network can be visualised via, for instance, Cytoscape.
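If it helps, here is a rough Python sketch of that idea, assuming the word-by-abstract matrix is available as a NumPy array; the file name, the number of clusters and the chosen cluster are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# word_abstract: hypothetical matrix of shape (n_words, n_abstracts),
# e.g. the TF-IDF value of each word in each abstract
word_abstract = np.load("word_abstract.npy")

# cluster the words by their profiles across abstracts (Euclidean distance)
km = KMeans(n_clusters=50, random_state=0).fit(word_abstract)

# keep only the words of one cluster of interest (here cluster 0)
keep = np.where(km.labels_ == 0)[0]
reduced = word_abstract[keep, :]

# recompute abstract-to-abstract similarity on the reduced vocabulary
similarity = cosine_similarity(reduced.T)   # abstracts as rows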
If this approach is irrelevant for your study, maybe you can use a trimmed mean or median to define the threshold (keeping all values higher than the trimmed mean or median, for instance).
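A minimal sketch of the trimmed-mean threshold, assuming the weights are the third column of a plain-text edge list; the file name and the 10% trimming proportion are arbitrary:

import numpy as np
from scipy.stats import trim_mean

# load the third column (weights) of a hypothetical "edges.txt" edge list
weights = np.loadtxt("edges.txt", usecols=2)

threshold = trim_mean(weights, proportiontocut=0.1)   # or np.median(weights)
keep = weights > threshold        # boolean mask of edges above the trimmed mean
print(keep.sum(), "edges kept of", len(weights))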
Another possibility is to build the network as it is in Cytoscape and then apply the MCODE plugin to select the most interconnected regions and analyse only those.
From your description, it looks like your nodes are 'abstracts' / 'documents' and the edges are weighted by their TF-IDF similarity scores.
If your goal is calculating precision and recall, then perhaps you already have a 'retrieval algorithm' that uses these TF-IDF scores. In that case, your graph view could be defined as two sets: 1. abstracts retrieved by your algorithm, 2. relevant abstracts as per the TF-IDF matrix. Perhaps you could use a different color for each, which would give you a visual idea of precision and recall.
As you might already be aware of the definitions:
Precision: How many retrieved are relevant?
Recall: How many relevant are retrieved?
The overlap of the colors gives a good indication of the performance of your retrieval algorithm. You might want to inspect only the top 10 or top 100 results to keep the view visually clear.
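In case it is useful, here is a tiny Python sketch of that two-set view; the abstract IDs, the two sets and the colors are made-up placeholders for whatever your retrieval algorithm and your ground truth actually give you:

# hypothetical example sets of abstract IDs
retrieved_ids = {"a1", "a2", "a3", "a4"}   # returned by your retrieval algorithm
relevant_ids  = {"a2", "a3", "a5"}         # relevant as per the TF-IDF matrix

hits = retrieved_ids & relevant_ids
precision = len(hits) / len(retrieved_ids)   # how many retrieved are relevant
recall    = len(hits) / len(relevant_ids)    # how many relevant are retrieved

# assign a color to each node for the two-set view
color = {n: ("green" if n in hits
             else "orange" if n in retrieved_ids
             else "blue")
         for n in retrieved_ids | relevant_ids}
print(precision, recall, color)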
For drilling down further, you could filter your documents by 'category' or 'keywords'. For example, consider only the documents that have 'sentiment analysis' as their topic. This will greatly reduce the number of documents to consider. Latent Dirichlet Allocation topic modelling is one way to get this additional 'drill down' filtering capability for your inputs.
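A rough sketch of that LDA-based drill-down with scikit-learn, assuming you have the raw abstract texts; the toy texts, the number of topics and the chosen topic index are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy stand-ins for your abstract texts
abstracts = ["sentiment analysis of product reviews",
             "graph visualization of citation networks"]

counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

doc_topics = lda.transform(counts)    # per-document topic proportions
topic_of_interest = 3                 # e.g. whichever topic looks like 'sentiment analysis'
selected = [i for i, row in enumerate(doc_topics)
            if row.argmax() == topic_of_interest]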
Also, Albert's suggestion of clustering could greatly help boost the TF-IDF rankings. More generally, you could pre-process your abstracts with standard text-processing techniques to drastically reduce the input space. You might already be doing these pre-processing steps; if not, the following could be a good starting point (a small sketch follows the list):
1. Removing stop words (such as a, the, has, is, etc.)
2. Stemming the words (this reduces each word to its base form)
3. If possible, replacing all synonyms with a single base form
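A minimal NLTK-based sketch of steps 1-3, assuming NLTK and its punkt/stopwords/wordnet data are installed; the synonym step here just maps each word to its first WordNet lemma, which is only a rough approximation:

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer

# first run only: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in stop]           # 1. remove stop words
    tokens = [stemmer.stem(t) for t in tokens]              # 2. stem to base form
    merged = []
    for t in tokens:                                        # 3. crude synonym merge
        syns = wordnet.synsets(t)
        merged.append(syns[0].lemmas()[0].name() if syns else t)
    return merged

print(preprocess("The cats were sitting on the mats"))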
It is not such a new tool, but to my mind one of the best, at least for my biological networks. However, Cytoscape is not magic. I think Christian Staudt made a very important point, and Gopalakrishna Palem and I tried to suggest how to reduce this huge amount of data. Otherwise, to my mind, it will be very difficult to visualise and interpret your results.
I just started with Cytoscape and was surprised that it does not understand the NET format, which in my opinion is the standard network format. Could you tell me what the main advantages of Cytoscape are compared with Gephi?
I did not compare the two programs carefully, but what I like in Cytoscape is that it can work with .txt files. To my mind this is the easiest way to create a network manually or to transform huge networks to .txt via R. I found Cytoscape more user-friendly, there are some nice plugins, and it has good compatibility with R, which I am using for network analysis. The version of Gephi that I tried had only a few network properties (maybe they have added more by now).
Regarding working in Cytoscape... as far as I know, Gephi can also export .txt or .csv files. So, if you can transform your data to .txt via Gephi, you will be able to import it into Cytoscape.
In the NET file I have 3 columns (the ARC section): the identifiers of the linked nodes (first two columns) and the weight of the link (third column). How should I prepare the data for Cytoscape? Could you kindly advise? It does not understand this format.
Can you transform your NET file to txt or csv, via Gephi or R? I am away from my computer now, but if you send me a small part of the data I can try to do it and then explain it to you. However, it will take time, as I am travelling, so it would be best if you could transform the data yourself. Basically, for Cytoscape you need a very similar shape of data with 3 columns, but instead of the weight in the 3rd column you just need the other node of each connected pair. The weight will be attached later as an edge property.
Here is a sample of my data in NET format. The 1st part is the list of vertices with # and id; the 2nd is the links list: out node, in node and weight (above 0.19). I have 821 K nodes and 3.4 M links. The last number creates the problem: I cannot open my file in Gephi because there are too many links. Could you see how I can convert this format to one acceptable for Cytoscape?
Sorry, as I said I am away from my computer, but as far as I can see you can easily do it yourself from this type of data. Just create a .txt file with the same structure as the file in my attachment: the 1st column is the source node, the 3rd column is the target node, and the 2nd column is the interaction between them. Do not include your weight; for that Cytoscape will use another .txt file. Your first goal is to check whether you can load this huge network into the software at all.
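If it helps, networkx can read the Pajek .net format directly, so a small conversion script could look like the sketch below; the file names and the "sim" interaction label are placeholders, and the exact column layout may need adjusting to the Cytoscape importer you use:

import networkx as nx

# networkx can parse the Pajek .net format directly
G = nx.read_pajek("network.net")          # placeholder file name

with open("network.txt", "w") as out:
    for u, v, data in G.edges(data=True):
        # source node, interaction label, target node, as described above
        out.write(f"{u}\tsim\t{v}\n")

The weights could then be written to a second file in the same spirit once the bare network loads.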
You are welcome to contact me by e-mail or private message so that we do not clutter the Q&A section.
Again, I am not sure that Cytoscape will handle such a huge network either; I have never tried this. What about reducing the data? Did you find a solution?
I agree with @Christian_Staudt... why would you want to visualize such a dense network? You will not learn anything from a hairball diagram.
Yet, inside the hairball will be many interesting clusters/communities/sub-networks. Spend your time extracting those and then almost any visualization tool will be able to map those more reasonable sub-networks.
See this blog post for an example of extracting meaning/pattern from a hairball.
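For what it is worth, one quick way to pull such sub-networks out in Python is greedy modularity community detection in networkx; the input file and the size cut-off below are placeholders, and on a graph anywhere near 300 M edges you would need to run this on an already reduced version or with a more scalable tool:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.read_weighted_edgelist("edges.txt")        # placeholder edge list
communities = greedy_modularity_communities(G, weight="weight")

# write out the larger communities as separate sub-networks to visualise
big = [c for c in communities if len(c) >= 20]    # arbitrary size cut-off
for i, nodes in enumerate(big):
    nx.write_weighted_edgelist(G.subgraph(nodes), f"community_{i}.txt")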
For viewing large networks without the hairball, you might take a look at BioFabric (www.BioFabric.org). The hairball is avoided by drawing nodes as horizontal lines. Quick demo is at: http://www.biofabric.org/gallery/pages/SuperQuickBioFabric.html That said, 821 K nodes and 3.4M links will be too much for it to handle in the current implementation, but it can handle 1M links pretty well.
As an example of a larger network, here is one with 282K nodes and 2.3M links: http://www.biofabric.org/gallery/index.html#Stanford Even at that size, interesting patterns are visible, and you can zoom in for detail, which I feel validates the basic approach. But performance on that particular network was pretty awful.
We've been researching streaming algorithms and visualization methods and tools for online social networks. Feel free to check our research. If you need any help, I'll be glad to provide further assistance.