In 1945, Vannevar Bush proposed the idea of the Memex (Memory Extender), a machine that can capture every nugget of information that ever comes our way. Even by today’s standards the idea of the Memex sounds profound, but thanks to Google and Facebook, it is now very close to becoming a reality. In 2005, Google released “Web History”, a feature that allows us to go back in time and explore our search and browsing history. Essentially, Google simply stores all the web searches that we perform on any of its vertical search engines (web, image, blog, news, etc.) along with the search results that we click. On the surface, this data doesn’t look very interesting, but there is a wealth of information hidden in it. In many ways it is a gold mine of personal information that lets us travel back in time.
In this post, I explore my Google Web History in order to learn more about myself (or rather, to learn what Google can know about me).
My first challenge was downloading my complete Web History. Google only allows you to download that data in batches of 1,000 records at a time. Given that I had tons of records, manually downloading them 1,000 at a time would have been a very slow process. Luckily, I found a nifty tool that let me download all of my Google history data with just one click. The data came in CSV format with the following 10 fields: Title, keywords, Link, Category, Pubdate, Description, Guid, QueryUid, VideoLength, and ImgThumbnail.
Once the data was on my desktop, I fired up R and started on my quest to know myself through Google’s eyes. The data ranged from 3rd July 2005 to 8th May 2012 (note that Web History was launched in April 2005). There are 121,869 records in the dataset. For the purpose of this analysis, I selected data only for complete years, i.e., from the beginning of 2006 to the end of 2011. Below are some statistics for the filtered data:
- Date Range: 1st Jan 2006 to 31st Dec 2011
- Number of Records: 109,072
- Average Number of Searches/Clicks per Day: 59.25 (Median = 40, SD = 67)
- Average Number of Searches/Clicks per Month: 1515 (Median = 1383, SD = 1404)
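The filtering and per-day summary can be sketched as follows. This is a minimal illustration in Python rather than the R used for the post, and it makes several assumptions: the inline sample data is made up, only two of the ten CSV fields are shown, and the `Pubdate` timestamp format is a guess, since the post does not show the raw export.

```python
import csv
import io
import statistics
from collections import Counter
from datetime import date, datetime

# Hypothetical sample standing in for the exported CSV. Only two of the
# ten fields are shown, and the Pubdate format is an assumption.
SAMPLE_CSV = """Title,Pubdate
cakephp tutorial,2005-12-31 23:10:00
information overload,2006-01-01 09:00:00
cakephp acl,2006-01-01 09:05:00
r ggplot,2006-01-02 14:30:00
phd thesis template,2012-01-01 08:00:00
"""


def daily_counts(csv_text, start, end):
    """Count records per calendar day, keeping only days in [start, end]."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["Pubdate"], "%Y-%m-%d %H:%M:%S").date()
        if start <= day <= end:
            counts[day] += 1
    return counts


counts = daily_counts(SAMPLE_CSV, date(2006, 1, 1), date(2011, 12, 31))
per_day = list(counts.values())
print("records:", sum(per_day))
print("mean per active day:", statistics.mean(per_day))
print("median per active day:", statistics.median(per_day))
print("sd per active day:", statistics.stdev(per_day))
```

Note that days with no records never enter the counter, so this sketch averages over active days only. The post does not say whether its daily figures treat empty days the same way.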
Once I had the data in R, I started by probing my browsing habits to see whether my usage had increased over the years. From the plot below, it seems clear that my usage of Google increased over the first three years and then stabilized after peaking in 2008. This is interesting, as during this time I was frantically trying to finish my PhD and was doing a lot of job hunting.
In particular, the Monthly Distribution Plot (Plot 2) shows exceptionally high usage of Google around the middle of the year, especially in June 2008 and July 2008. Compared to the monthly average of 1515 searches/clicks, there were 7460 and 8625 searches in June and July of 2008, respectively. Both counts lie more than four standard deviations above the monthly mean. Statistically, the chance of seeing such a high number of searches/clicks in a month is less than 0.01%. Interesting!
Continuing my quest to understand why I had so many searches during that period, I looked at the daily distribution of searches for these two months (Plot 3). Plot 3 again shows a few days with exceptionally high usage of Google’s search engine. In particular, the following dates stand out:
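One way to sanity-check a probability like this is a z-score with a normal tail approximation. The sketch below is in Python for illustration and assumes monthly counts are roughly normally distributed, which the post does not state. It uses the monthly mean and SD reported in the summary statistics.

```python
import math


def upper_tail_probability(value, mean, sd):
    """Return (z, P(X >= value)) for X ~ Normal(mean, sd)."""
    z = (value - mean) / sd
    # Upper tail of the standard normal via the complementary error function.
    return z, 0.5 * math.erfc(z / math.sqrt(2))


MONTHLY_MEAN = 1515  # from the summary statistics
MONTHLY_SD = 1404

for label, count in [("June 2008", 7460), ("July 2008", 8625)]:
    z, p = upper_tail_probability(count, MONTHLY_MEAN, MONTHLY_SD)
    print(f"{label}: z = {z:.2f}, tail probability = {p:.2e}")
```

The normal assumption is rough here: the SD is almost as large as the mean, and it was computed with the two outlier months included, so the result is best read as an order-of-magnitude check.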
- 19th June 2008
- 16th July 2008
- 24th July 2008
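Picking out such days amounts to counting records per day within a date window and ranking the days. A minimal Python sketch, using made-up dates in place of the real records (the function name `busiest_days` is hypothetical):

```python
from collections import Counter
from datetime import date

# Hypothetical record timestamps, already reduced to calendar days.
record_days = [
    date(2008, 6, 18),
    date(2008, 6, 19), date(2008, 6, 19), date(2008, 6, 19),
    date(2008, 7, 16), date(2008, 7, 16),
    date(2008, 7, 17),
    date(2008, 8, 1),
]


def busiest_days(days, start, end, top_n=3):
    """Return the top_n (day, count) pairs within [start, end], busiest first."""
    counts = Counter(d for d in days if start <= d <= end)
    return counts.most_common(top_n)


window = busiest_days(record_days, date(2008, 6, 1), date(2008, 7, 31), top_n=2)
for day, n in window:
    print(day.isoformat(), n)
```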
I then drilled down to determine what I was searching for on these days. For this, I used standard natural language processing techniques (namely text normalization and stemming) and then calculated the most frequently occurring search terms. The three figures below show the tag clouds for each of the three dates.
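The term-counting step might look like the sketch below. It is written in Python for illustration and is not the code behind the tag clouds. The stop-word list is a tiny made-up sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# A small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "how"}


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes. This is NOT the Porter stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(queries):
    """Count stemmed, non-stop-word terms across a list of query strings."""
    counts = Counter()
    for query in queries:
        for token in normalize(query):
            if token not in STOP_WORDS:
                counts[naive_stem(token)] += 1
    return counts


# Hypothetical search queries for one day.
queries = [
    "CakePHP tutorials",
    "cakephp tutorial for beginners",
    "Downloading CakePHP",
]
frequencies = term_frequencies(queries)
print(frequencies.most_common(3))
```

The resulting counts are what a tag cloud would scale its words by. Because nothing here corrects spelling, misspelled queries show up as separate terms.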
I find each of the three tag clouds very interesting, as they accurately capture what I was doing at that time. As I said earlier, around 2008 I was actively trying to finish my PhD, and the first two tag clouds are related to that. In my thesis, I developed a system using “CakePHP” (as can be seen from the first tag cloud), and the thesis was also about “Information Overload” (as can be seen from the second tag cloud). Furthermore, I was considering purchasing Nuance’s “Dragonfly” dictation software to help me write my thesis quickly. I actually ended up buying it towards the end of July. That is why you see such a high volume of searches related to “dragonfly” (and the misspelled “drangofly”), as well as my attempts to download a trial version of the software. Such real-time insights are especially useful to Google and others, as they can help with targeted advertising.
The above analysis is just the tip of the iceberg of all the analyses you can do. See my PhD thesis for the various kinds of modeling that can be done on an individual’s interests based on such data.
In the next few posts, I will expand on this analysis and focus on interest modeling and personal knowledge evolution. In particular, I am interested to see if we can effectively use Google Web History as a tool to analyze our “subjective, situated and evolving” knowledge and learn more about ourselves. Our knowledge is subjective, as it is based on our personal experiences. It is situated in the web resources that we are able to find, and it is constantly evolving as we try to (un)learn new concepts.
I look forward to your comments.