“Size does matter” – using a cursor to sequentially process a large database

Recently, I started working on a new project that uses a Postgres database containing more than 3 million addresses. One of my first tasks was to geocode these addresses. To do so, I wanted to sequentially access all the records ordered by address. As I had never dealt with such a large database before, my first attempt was the traditional one I had been using on my smaller projects: I created a B-Tree index on the “address” field and used the following logic for geocoding the addresses.

//My First Logic
1. Open Database Connection
2. For (int i=0; i<3000; i++)
         2.1. Fetch 1000 records  // SQL Statement – select id, address from mytable order by address limit 1000 offset (i*1000)
         2.2. For each record, geocode the address and store the information in the database
3. Close database connection
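
For reference, here is a minimal JDBC sketch of that first logic, assuming the standard PostgreSQL JDBC driver; the connection URL, credentials, and the geocode-and-store step are placeholders rather than the actual project code.

// JDBC sketch of the first logic (offset-based paging)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OffsetGeocoder {
    public static void main(String[] args) throws Exception {
        // placeholder connection details -- substitute your own
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            String sql = "select id, address from mytable order by address limit 1000 offset ?";
            for (int i = 0; i < 3000; i++) {
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setInt(1, i * 1000);                  // the offset grows with every batch
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            long id = rs.getLong("id");
                            String address = rs.getString("address");
                            // geocode(address) and store the result -- omitted here
                        }
                    }
                }
            }
        }
    }
}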

Theoretically, the above approach is fine, and it was working. The initial queries (step 2.1 above) each took a reasonable amount of time, less than 20 seconds, to fetch 1000 records. However, as the “offset” increased, the database performance quickly deteriorated. As can be seen from the graph below, after about 160 iterations (i.e. offset = 160,000) a single query was taking more than 20 minutes, and it kept getting worse.

[Figure: database query time per 1000-record fetch vs. iteration number]

After doing some analysis, I realized that “offset” requires a sequential scan of the index (assuming you are fetching records in order). For each query, the database was starting that scan from the beginning, so as the offset increased, the query time increased with it.
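
A rough back-of-the-envelope calculation (assuming each batch really does rescan the index from the first entry, as described above) shows why this adds up so quickly: batch i first skips i*1000 rows, so over 3000 batches the database walks past roughly 1000 * (0 + 1 + … + 2999) = 1000 * (2999 * 3000 / 2) ≈ 4.5 billion index entries, compared with only about 3 million rows if each batch could resume where the previous one stopped.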

An obvious solution is to keep track of the last fetched record and resume from there in the next query. This is where a “cursor” becomes useful: a cursor maintains a pointer to the last fetched record. After realizing this, I re-implemented my Java program to use a cursor; below is the modified logic:

//My Second Logic – Using Cursor
1. Open Database Connection
2. Begin a transaction (a cursor only lives inside a transaction)  // Syntax – begin work;
3. Declare a cursor     // Syntax – Declare address cursor for select id, address from mytable order by address;
4. Move the cursor to the desired location (as I wanted to resume from offset 326000)  // Syntax – move forward 326000 in address;
5. For (int i=327; i<3000; i++)
          5.1 Fetch forward 1000 records from the cursor   // syntax – fetch forward 1000 from address;
          5.2. For each record, geocode the address and store information in the database
6. Close the cursor, end the transaction, and close the database connection   // Syntax – close address; end work;
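
Here is a minimal JDBC sketch of the cursor version, under the same assumptions as before (placeholder connection details, geocoding omitted). Turning autocommit off keeps the implicit transaction, and therefore the cursor, open across the repeated fetches.

// JDBC sketch of the second logic (cursor-based paging)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CursorGeocoder {
    public static void main(String[] args) throws Exception {
        // placeholder connection details -- substitute your own
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            conn.setAutoCommit(false);                        // the cursor only lives inside a transaction
            try (Statement st = conn.createStatement()) {
                st.execute("declare address cursor for "
                        + "select id, address from mytable order by address");
                st.execute("move forward 326000 in address"); // skip the rows already processed
                boolean more = true;
                while (more) {
                    try (ResultSet rs = st.executeQuery("fetch forward 1000 from address")) {
                        more = false;
                        while (rs.next()) {
                            more = true;
                            long id = rs.getLong("id");
                            String address = rs.getString("address");
                            // geocode(address) and store the result -- omitted here
                        }
                    }
                }
                st.execute("close address");
            }
            conn.commit();                                    // end work
        }
    }
}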

As you can see from the graph above (iteration 327 and beyond), using a cursor the performance improved dramatically: subsequent queries took less than 30 seconds to sequentially fetch each batch of 1000 records.

Yahoo !!

About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large-scale data analysis, data mining, machine learning, and data visualization.

3 Responses to “Size does matter” – using a cursor to sequentially process a large database

  1. Ian says:

    Well done, good detective work on figuring this out. What I’m not clear on is why postgres was making a sequential scan for the offset. I would have thought it could have done a binary search which would be much quicker.

  2. ragrawal says:

Ahhh!!!.. I was under the same impression. But later I realized it's not a search problem. Indexes (especially B-Tree) are good for searching and for sequentially ordering records. The problem is that an index does not keep a count of the leaf nodes under each subtree. Therefore, to get to the right offset, the database has to sequentially scan the leaf nodes until it reaches that offset.

  3. ragrawal says:

Now I am guessing that there could be another explanation. I would need to check the low-level query tree that the database generates for the query “select * from mytable limit 10 offset 500000”. If the low-level query tree is ordered as follows

    1. Get all the records from my table (almost 3 million)
    2. Move the cursor to the 500,000th record
    3. Fetch next 10 records

    then the performance will also decrease as the offset increases.
