March 19, 2009

Indexing geo-data

Recently we started wondering what the most effective way is to index data based on Longitude and Latitude. Although we're not yet seeing performance problems, we're definitely anticipating them without an effective index. We're using MySQL for anything mission critical, so (some of) this information specifically applies to MySQL.

For many the obvious thing to do might be to add a mysql index on those two numbers:

ALTER TABLE geo ADD INDEX (longitude,latitude)

The problem with how B-TREE indexes work, is that columns within the index will be used in order. Only if an exact match is found for the left-most column (longitude in this case) the latitude column is used. Since we're always selecting on a range of values, in practice this means that the latitude column in the index will in fact always be ignored.

This could be very inefficient if you're zoomed in quite a bit on for example a city on the east-coast (where I'm at). There will be a lot of matches for cities way north or south from here.

Splitting the earth up in rectangles

We figured a better way to do this is to just split up the earth in smaller rectangles. We could round the longitude and latitude numbers off to an integer and index on these.

CREATE TABLE geo (
  longitude DOUBLE,
  latitude DOUBLE,
  idxlong SMALLINT,
  idxlat SMALLINT,
  INDEX (idxlong,idxlat);
);

When inserting you'd just a idxlong = round(longitude) and you should be good to go.

The problem with this approach is that we split the earth up in 360 x-coordinates, and 180 y-coordinates. Whenever we're on a zoom-level higher than a single one of these sections, the index will not be used effectively. Furthermore, if we zoom in very deep (times square) we run the risk there's a lot of rows matching this area that will need to be evaluated. In short: the index is ineffective if you zoom to much smaller or bigger areas.

Dividing the earth up further

We could divide the earth up in 4 squares, and store that information instead. Every square could then divided up in 4 more squares, and so on.. We end up with what's called a Quadtree. To do this effectively, and not create new columns for every 'zoom level' we might need, we instead attempt to convert the longitude and latitude to a single value.

Simply put, if our X coordinate is 111111 and our Y coordinate is 000000, we want to end up with 101010101010. This is called the Morton number.

We can do this with the following (pseudo-)code:

latitude = 43.63556267294633
longitude = -79.4249939918518

// Since these can both be negative, we should convert them to an unsigned number
// longitude goes from -180 to 180 and latitude from -90 to 90

latitude += 90;
longitude += 180;

// Now we need to turn them into integers. It makes sense to fit them in a 32bit integer.
// The maximum value for a 32bit integer is 4294967295
// Since the numbers now go up to 360, we use round(4294967295/360) = 11930464.

latitude = (int)latitude * 11930464;
longitude = (int)longitude * 11930464;

// The 'morton number' 
morton = 0
// The current bit we're interleaving
bit = 1
// The position of the bit we're interleaving
position = 0

while(bit <= latitude or bit <= longitude) {

  if (bit & latitude) morton = morton | 1 << (2*position+1)
  if (bit & longitude) morton = morton | 1 << (2*position)

  position += 1
  bit = 1 << position

}

We can now easily index on our 'morton' number.

The big flaw of using a Quadtree

This would be the most effective index if we're ever only interested in the contents of '1 square' regardless of the size. But we are in fact usually interested in a range (Everything between a top-left and bottom-right coordinate) that could cover multiple squares.

The best example is near the international dateline. Because we increased our numbers with 180, the international dateline now lies at x-coordinate 0 and 360. If we would like to SELECT items from both sides of this line, we would need to do two queries (or two ranges). This is a simple example, but the problem in fact occurs at every edge of a square. If we select from a random place in europe, and we happen to go across a square from the 3rd significant bit in our morton number, it means we will end up effectively splitting our table in 4 major segments and we'll end up with scanning a higher number of items for matches.

Solutions

If the top-left and bottom-right are close enough to each other ('close' will need to be defined), we can find out if the query could be problematic by getting the morton numbers for both and comparing the most significant bits we care about:

// This example assumes both the numbers are 64 bit, and we really care about the top 16.   
problematic = ((morton1 ^ morton2) >> 48) != 0

// Note that the ^ operator is XOR (I had to look it up myself, because I rarely ever need it).

Another solution is to throw all of this out the window, and go with MySQL's Spatial extensions. The spatial extensions provides much more features beyond my need, so I've yet to find out if this is the best solution for myself. MySQL provides a spatial index, which is based off an R-Tree, which effectively uses overlapping rectangles. The other bonus is that things like selecting by radius (e.g.: everything in the range of x km) is possible.

Anything wrong with this logic? Do you have experience with this? I'd like to hear problems and solutions you've encountered!

Comments

Menno • Mar 19, 2009
Hi Evert,

note that it is quite easy to convert your long,lat columns to a 'spatially enabled' column, using:

--Add a spatially aware column
ALTER TABLE airport ADD coordinate POINT NOT NULL;

-- Convert the common lat,lon data to a spatially aware column
UPDATE airport SET coordinate = GeomFromText(CONCAT('POINT(', CAST(latitude AS CHARACTER(20)),' ', CAST(longitude AS CHARACTER(20)),')'));

--Add a spatial index on the column
ALTER TABLE airport ADD SPATIAL INDEX(coordinate);

And indeed, this enables the possibility to query your data by radius, for example with (copy - paste of my notes, may require some interpretation):

//Replace the @placehorders in the actual query
SET @center = GeomFromText('Point(5.0 5.0)');
SET @radius = 30;
SET @bbox = CONCAT('POLYGON((',
X(@center) - @radius, ' ', Y(@center) - @radius, ',',
X(@center) + @radius, ' ', Y(@center) - @radius, ',',
X(@center) + @radius, ' ', Y(@center) + @radius, ',',
X(@center) - @radius, ' ', Y(@center) + @radius, ',',
X(@center) - @radius, ' ', Y(@center) - @radius, '))'
);

SELECT AsText(coordinate), SQRT(POW( ABS( X(coordinate) - X(@center)), 2) + POW( ABS(Y(coordinate) - Y(@center)), 2 )) AS distance
FROM location
WHERE Intersects( coordinate, GeomFromText(@bbox) )
AND SQRT(POW( ABS( X(coordinate) - X(@center)), 2) + POW( ABS(Y(coordinate) - Y(@center)), 2 )) < @radius
ORDER BY distance;

So, why not give it a shot with the spatial extensions?

With regards
Claudia • Mar 19, 2009
I have exactly the same problem so I will be very interested in your findings...

Instead of morton numbers I currently use cSquares (http://www.cmar.csiro.au/csquares/about-csquares.htm),
but as they work pretty similar the problem with selecting from a range stays the same.

I had major problems using the spatial extension of MySQL because
a) phpmyadmin does not cope with spatial columns very well (which would have been annoying but no show stopper)
b) spatial indexes work only MyISAM - this did not matter to me, but might be important for you
c) a column with a spatial index must be NOT NULL which did not work out at all for me because I always have some entries where I don't know the geographic coordinates. I then tried to maintain a table that contains only the entries where I have the coordinates but this did not integrate well with the rest of my application.

Also note that all the spatial functions on MySQL work with Euclidean (planar) geometry - so the shape of earth is not taken into account.

I once tried Postgresql with PostGis support which works perfectly fine and is easy to use, but of course it is not so easy to switch to a different database system just like that. Also lots of postgres hosting companies do not offer the postgis extension.

Good luck

Claudia
Avi • Mar 19, 2009
@Claudia
To get around you "not null" problem, you can store the spatial information in another table and retrieve it with an outer join.
Also, the spatial functions are intended to work with GIS data so where did you get that "all the spatial functions on MySQL work with Euclidean (planar) geometry - so the shape of earth is not taken into account." from?