At the end of Part I, I showed some word clouds derived from Google Web History data. A word cloud is useful for identifying important terms, but its simple one-dimensional representation cannot answer many questions that matter to us. For instance, it doesn’t show relationships between different terms or how topics have evolved over time.
To find answers to some of these questions, I started playing with semantic networks and explored graph-based visualization. Below are some results of my quest.
To build the graph above, I randomly selected a term (here, “jquery”) and extracted all the terms within two degrees of separation from it. Nodes represent search terms and edges represent relations between those terms. Node size encodes the frequency of the search term, and color encodes temporal information: the darker the color (yellow, then orange, then red), the more recently I searched for that term. For instance, about two years ago I worked with PHP, but for the last year I have been using Ruby instead. You can see this in the graph: the PHP node is yellow, whereas the ruby node is red. Edge thickness indicates the frequency of the bi-gram.
Below is another example, this time for the term “clustering”.
Note on the process
- Extract Google Web History data
- Clean queries: lowercase transformation, tokenization, stemming, etc.
- Use Hadoop/MapReduce to compute uni-grams and bi-grams
- Use R to explore and build graph visualizations
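To make the steps above a little more concrete, here is a minimal single-machine sketch in Python. It is not the Hadoop/R code I actually used. The function names (`tokenize`, `count_ngrams`, `neighborhood`) are made up for illustration, stemming is left out, and I am assuming that a relation between two terms simply means they appeared next to each other in a query.

```python
import re
from collections import Counter, deque


def tokenize(query):
    """Lowercase a query and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", query.lower())


def count_ngrams(queries):
    """Count uni-grams and adjacent-word bi-grams over all queries."""
    unigrams, bigrams = Counter(), Counter()
    for query in queries:
        tokens = tokenize(query)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def neighborhood(bigrams, seed, max_hops=2):
    """Return {term: hops from seed} for terms within max_hops of seed.

    Bi-grams are treated as undirected edges (an assumption of this sketch)
    and the graph is walked breadth-first from the seed term.
    """
    adjacency = {}
    for first, second in bigrams:
        adjacency.setdefault(first, set()).add(second)
        adjacency.setdefault(second, set()).add(first)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        term = queue.popleft()
        if hops[term] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(term, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[term] + 1
                queue.append(neighbor)
    return hops
```

Calling `count_ngrams` on a list of query strings gives the counts that drive node size (uni-grams) and edge thickness (bi-grams), and `neighborhood(bigrams, "jquery")` gives the set of terms to draw for a graph centered on “jquery”.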