Google Web History: A Gold Mine of Personal Information (Part I)

In 1945, Vannevar Bush proposed the Memex (Memory Extender), a machine that could capture every nugget of information we ever come across. Even by today’s standards the idea sounds profound, but thanks to Google and Facebook it is now very close to becoming a reality. In 2005, Google released “Web History”, a feature that allows us to go back in time and explore our search and browsing history. Essentially, Google simply stores all the searches we perform on any of its vertical search engines (web, image, blog, news, etc.) along with the search results we click. On the surface this data doesn’t look very interesting, but there is a wealth of information hidden in it. In many ways it is a gold mine of personal information that allows us to travel back in time.

In this post, I explore my Google Web History in order to learn more about myself (or rather, to learn what Google can know about me).

My first challenge was downloading my complete Web History. Google only lets you download the data in batches of 1000 records at a time. Given that I had tons of records, manually downloading them 1000 at a time would have been a very slow process. Luckily I found this nifty tool, which let me download all of my Google history data with just one click. The data came in CSV format with the following 10 fields: Title, keywords, Link, Category, Pubdate, Description, Guid, QueryUid, VideoLength, and ImgThumbnail.
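For readers who want to follow along, here is a minimal sketch of reading such a file. It is written in Python rather than the R used for this post. The ten column names come from the export described above, but the sample rows, the Category values, and the assumption that Pubdate is an RFC 2822 style timestamp are made up for illustration.

```python
import csv
import io
from email.utils import parsedate_to_datetime

# Column names as listed in the post.
FIELDS = ["Title", "keywords", "Link", "Category", "Pubdate",
          "Description", "Guid", "QueryUid", "VideoLength", "ImgThumbnail"]


def load_history(handle):
    """Return one dict per record, with Pubdate parsed into a datetime."""
    records = []
    for row in csv.DictReader(handle):
        # Assumption: Pubdate looks like "Thu, 19 Jun 2008 10:15:00 GMT".
        row["Pubdate"] = parsedate_to_datetime(row["Pubdate"])
        records.append(row)
    return records


# Two made-up rows standing in for the real export.
sample = io.StringIO(
    ",".join(FIELDS) + "\n"
    'cakephp tutorial,,http://example.com/a,web query,'
    '"Thu, 19 Jun 2008 10:15:00 GMT",,g1,q1,,\n'
    'CakePHP docs,,http://example.com/b,web result,'
    '"Thu, 19 Jun 2008 10:16:00 GMT",,g2,q1,,\n'
)
history = load_history(sample)
print(len(history), history[0]["Pubdate"].year)
```

With a real export, `open("history.csv", newline="", encoding="utf-8")` would replace the in-memory sample; the file name here is hypothetical.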

Once the data was on my desktop, I fired up R and started on my quest to know myself through Google’s eyes. The data ranged from 3rd July 2005 to 8th May 2012 (note that Web History was launched in April 2005). There are 121,869 records in the dataset. For the purpose of this analysis, I selected data only for complete years, i.e. from the beginning of 2006 to the end of 2011. Below are some statistics for the filtered data:

  1. Date Range: 1st Jan 2006  to 31st Dec 2011
  2. Number of Records: 109,072
  3. Average Number of Searches/Clicks per Day: 59.25 (Median = 40, SD = 67)
  4. Average Number of Searches/Clicks per Month: 1515 (Median = 1383, SD = 1404)
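The filtering and the per-day figures above can be reproduced along these lines. This is a Python sketch under stated assumptions, not the original R code: the function name `daily_summary` and the toy timestamps are hypothetical. The post does not say whether days with no activity were counted, so this version averages over active days only.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median, stdev


def daily_summary(timestamps, first_year=2006, last_year=2011):
    """Keep records from complete years and summarise records per active day."""
    kept = [t for t in timestamps if first_year <= t.year <= last_year]
    per_day = Counter(t.date() for t in kept)
    counts = list(per_day.values())
    return {
        "records": len(kept),
        "mean": mean(counts),
        "median": median(counts),
        # The sample standard deviation needs at least two days of data.
        "sd": stdev(counts) if len(counts) > 1 else 0.0,
    }


toy = [
    datetime(2005, 7, 3, 9, 0),   # dropped: before 2006
    datetime(2006, 1, 1, 9, 0),
    datetime(2006, 1, 1, 9, 5),
    datetime(2006, 1, 1, 9, 9),
    datetime(2006, 1, 2, 14, 0),
    datetime(2012, 5, 8, 8, 0),   # dropped: after 2011
]
summary = daily_summary(toy)
print(summary)
```

Grouping by `(t.year, t.month)` instead of `t.date()` gives the monthly version of the same summary.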

Plot 1. Yearly Distribution of searches/clicks

Once I had the data in R, I first looked at whether my browsing had increased over the years. From Plot 1 it seems clear that my usage of Google increased over the first three years and then stabilized after peaking in 2008. This is interesting, as during this time I was frantically trying to finish my PhD and was doing a lot of job hunting.

In particular, the Monthly Distribution Plot (Plot 2) shows exceptionally high usage of Google around the middle of the year, especially in June and July 2008. Compared to the monthly average of 1515 searches/clicks, there were 7460 and 8625 in June and July of 2008, respectively. Statistically, the chance of seeing such a high number of searches/clicks in a month is less than 0.01%. Interesting!
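One way to arrive at a figure like this is to treat the monthly counts as roughly normal, using the mean (1515) and SD (1404) reported earlier, and compute the upper-tail probability. The normality assumption is mine and is only a rough approximation, since an SD nearly as large as the mean suggests the counts are quite skewed. A small Python sketch:

```python
from statistics import NormalDist

MONTHLY_MEAN = 1515   # from the summary statistics in the post
MONTHLY_SD = 1404


def upper_tail(count, mu=MONTHLY_MEAN, sigma=MONTHLY_SD):
    """Return the z-score of `count` and P(X >= count) under a normal model."""
    z = (count - mu) / sigma
    p = 1.0 - NormalDist().cdf(z)
    return z, p


results = {}
for label, count in [("June 2008", 7460), ("July 2008", 8625)]:
    results[label] = upper_tail(count)
    z, p = results[label]
    print(f"{label}: z = {z:.2f}, upper-tail probability = {p:.2e}")
```

Both months sit more than four standard deviations above the mean, which under this model puts each tail probability below 0.01%.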


Plot 2: Monthly distribution of Searches/Clicks

Continuing my quest to find out why I made so many searches during that period, I looked at the daily distribution of searches for these two months (Plot 3). Plot 3 again shows a few days with exceptionally high usage of Google’s search engine. In particular, the following dates stand out:

  • 19th June 2008
  • 16th July 2008
  • 24th July 2008
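These dates were picked by eye from the plot. A simple programmatic alternative, shown here as a Python sketch with made-up counts, is to flag any day whose count exceeds the mean by more than k standard deviations. The threshold rule, the function name `unusual_days`, and the value of k are assumptions, not something used in the original analysis.

```python
from datetime import date
from statistics import mean, stdev


def unusual_days(per_day, k=2.0):
    """Return the days whose count is above mean + k * standard deviation."""
    counts = list(per_day.values())
    threshold = mean(counts) + k * stdev(counts)
    return sorted(day for day, n in per_day.items() if n > threshold)


# Made-up data: a flat month with one very busy day.
toy_counts = {date(2008, 6, d): 40 for d in range(1, 31)}
toy_counts[date(2008, 6, 19)] = 900

flagged = unusual_days(toy_counts)
print(flagged)
```

Because a single extreme day inflates both the mean and the standard deviation, a median-based threshold may be more robust on real data.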

Plot 3. Daily distribution of searches/clicks from 1st June 2008 to 1st August 2008. During this period, I made an exceptionally high number of searches/clicks.

I then drilled down to determine what I was searching for on these days. For this, I used standard natural language processing techniques (namely text normalization and stemming) and then calculated the most frequently occurring search terms. Plots 4.a to 4.c show the tag cloud for each of the three dates.
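The term counting behind a tag cloud can be sketched as follows in Python. Normalization here means lowercasing and splitting on non-alphanumeric characters. The stemmer is a deliberately crude suffix stripper standing in for a real one such as Porter’s, so it should not be taken as the method used for the plots. The queries are made up.

```python
import re
from collections import Counter

# Crude stand-in for a real stemmer: strip a few common English suffixes.
SUFFIXES = ("ing", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(queries):
    """Count stemmed, lowercased terms across all queries."""
    counts = Counter()
    for query in queries:
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        counts.update(crude_stem(t) for t in tokens)
    return counts


toy_queries = [
    "Dragonfly dictation software",
    "download dragonfly trial",
    "drangofly downloads",
]
freq = term_frequencies(toy_queries)
print(freq.most_common(3))
```

Note that a misspelling such as "drangofly" is counted as its own term, which is why misspelled words can show up in a tag cloud alongside the correct ones.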

I find each of the three tag clouds very interesting, as they accurately capture what I was doing at that time. As I said earlier, around 2008 I was actively trying to finish my PhD, and the first two tag clouds are related to that. In my thesis, I developed a system using “CakePHP” (as can be seen in the first tag cloud), and the thesis was about “Information Overload” (as can be seen in the second tag cloud). Furthermore, I was considering purchasing Nuance’s “Dragonfly” dictation software to help me write my thesis quickly. I actually ended up buying it towards the end of July. That is why you see such a high volume of searches related to dragonfly (and the misspelled drangofly) and to my attempts to download a trial version of the software. Such real-time insights are especially useful to Google and others, as they can help with targeted advertising.

Plot 4.a: Tag Cloud for 19th June 2008

Plot 4.b: Tag Cloud for 16th July 2008

Plot 4.c: Tag Cloud for 24th July 2008

The above analysis is just the tip of the iceberg of what you can do with this data. See my PhD thesis for the various ways an individual’s interests can be modeled from such data.

In the next few posts, I will expand on this analysis and focus on interest modeling and personal knowledge evolution. In particular, I am interested to see whether we can effectively use Google Web History as a tool to analyze our “subjective, situated and evolving” knowledge and learn more about ourselves. Our knowledge is subjective, as it is based on our personal experiences. It is situated in the web resources that we are able to find, and it is constantly evolving as we try to (un)learn new concepts.

I look forward to your comments.


About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large data analysis, data mining, machine learning and data visualization.

4 Responses to Google Web History: A Gold Mine of Personal Information (Part I)

  1. Abhishek says:

    Very insightful. Does google provide users with a tool to do this kind of analysis themselves? It will be an interesting product, especially for people to review their own “online history”.

  2. Pingback: Google Web History: A Gold Mine of Personal Information (Part II) | Memento

  3. Mark Soper says:

    Thanks for writing about this experiment, Ritesh. I’m researching how browser history and other personal data can serve us while we’re away from the screen – e.g. as wearable computing evolves. I enjoyed reading this!
