subscribe

Indexing geo-data 2 : simple benchmark

After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.

Test 1: Selecting on longitude, latitude

SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;

No index 1.73s. With B-Tree index on latitude 0.72s.

Test 2: Using spatial extensions and POINT field

SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));

Time taken without index: 9.52s. With a spatial index: 0.73s.

Test 3: Using morton number

SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
  morton > 3667198027933142835 AND morton < 3671111582099533095 AND
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;

Time taken without index: 0.78s, with index on on morton: 0.65s.

Conclusion

In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.

methodsmallmediumlarge
plain select 1.73s
index on latitude0.72s
using point field9.52s
using point field + spatial index0.00s0.73s18.82s
using morton number0.78s
index on morton 0.00s0.65s3.23s

So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.

Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.

Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).

Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.

Web mentions

Comments

  • Victor Sumner

    Victor Sumner

    In your Test 3: Using morton number, do you really need to include the boundary box in your query? What use is the motron if your continuing to include the boundary box?
  • Evert

    Evert

    You do,

    Using just the morton number can also include items from outside the box, because the type of indexing is a Z-curve.

    The absolute worst case scenario is getting 50% of all items on the map, if you are looking at the exact center of the map (regardless of your 'zoom level').
  • Daniel

    Daniel

    Hi, I came across your blog when doing some searches on indexing geodata... Your posts have been extremely insightful, but I have a few questions (which may/may not be stupid)

    My current situation is that I'm storing lat/lng as regular float columns in mysql. I'm searching this data based on a center point and a radius.

    1. My query uses the spherical law of cosines to calculate distance from the center point
    2. In my application code I calculate a bounding box based on the earth's radius/center point/search radius
    3. Not surprisingly, the query rarely uses my lat/lng indexes

    The bounding box is essentially useless in my current code, but after reading about your experiments with the Morton number I wonder if it would be useful if I converted those coordinates to Morton numbers and used that instead of the bounding box (or in addition to)?

    My understanding of this and the math behind it is quite slim, and just enough to get my application working so bear with me ;)

    Also, PHP is a bit of a pain when it comes to dealing with unsigned integers, any tips for calculating the Morton number in this case?

    Thank you!
  • Evert

    Evert

    @Daniel,

    The biggest benefit of this technique is speeding up the bounding-box index. So yes, I do believe you will benefit from this.

    You do need to include the bounding box on top of this though, as I mentioned in my first comment, you will definitely retrieve items outside your box.

  • suri

    Hello there,

    How did you do your benchmarking? did you used your own system or is there any software for benchmarking?

    • Evert

      Evert

      Pretty sure it was a custom set up at the time.

  • StijnDeWitt

    I found that for these huge datasets (in MySQL at least) it can make a huge difference what you fill in for lat / lng value. And not just based on how many results it yields, but also (I suspect) on where in the table those results are. I had a GeoIP query that could take anywhere from a few ms to a few seconds based on what IP address I filled in. So benchmark with a whole bunch of different numbers and then average the results.