Indexing geo-data 2 : simple benchmark
After my last post, I decided to do some benchmarking. For this benchmark I used the US data from Geonames.org. I inserted all the data (1,886,420 records) and searched for a big area around new york (between 41.3665028663272, -72.41912841796875 and 40.113789191575236, -75.83038330078125). We're expecting to get 38259 records back for this query.
Test 1: Selecting on longitude, latitude
SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;
No index 1.73s. With B-Tree index on latitude 0.72s.
Test 2: Using spatial extensions and POINT field
SET @rect = 'POLYGON((41.3665028663272 -72.41912841796875,41.3665028663272 -75.83038330078125,40.113789191575236 -75.83038330078125,40.113789191575236 -72.41912841796875,41.3665028663272 -72.41912841796875))';
SELECT SQL_NO_CACHE astext(location) from geotest where intersects(location,GeomFromText(@rect));
Time taken without index: 9.52s. With a spatial index: 0.73s.
Test 3: Using morton number
SELECT SQL_NO_CACHE lat,lng FROM geotest WHERE
morton > 3667198027933142835 AND morton < 3671111582099533095 AND
lat < 41.3665028663272 AND
lng < -72.41912841796875 AND
lat > 40.113789191575236 AND
lng > -75.83038330078125;
Time taken without index: 0.78s, with index on on morton: 0.65s.
Conclusion
In the table below 'small' is around times square, 'medium' is new york city and 'large' is about 2/3rd of the US. I didn't bother doing all benchmarks for the ones I knew were slower.
method | small | medium | large |
---|---|---|---|
plain select | 1.73s | ||
index on latitude | 0.72s | ||
using point field | 9.52s | ||
using point field + spatial index | 0.00s | 0.73s | 18.82s |
using morton number | 0.78s | ||
index on morton | 0.00s | 0.65s | 3.23s |
So it seems like using the morton number is a bit faster than using the spatial index, but there's not a huge difference considering this relatively large dataset. Using the spatial index has a number of benefits, the biggest being that you're easily able to select on much more complex queries (polygons and such). The major benefit of the morton number methodology is that it's significantly faster, especially as your dataset grows and you're able to use InnoDB, which can be much better performing if you're expecting a lot of updates.
Early update: my coworker kevin mentions the spatial queries are likely slowed down because 'astext' is called for every row. I'll have to do these again with separate lat/lng fields.
Update 2: Adding a lat and lng field and selecting on those is actually even slower (consistently 0.91s).
Update 3: With a smaller resultset both the spatial index and the morton index are both pegged at 0.00s. With a much larger resultset (big chunk of the US) I got 18.82s for the spatial index, and 3.23s for the morton index.
Comments
Victor Sumner •
In your Test 3: Using morton number, do you really need to include the boundary box in your query? What use is the motron if your continuing to include the boundary box?Evert •
You do,Using just the morton number can also include items from outside the box, because the type of indexing is a Z-curve.
The absolute worst case scenario is getting 50% of all items on the map, if you are looking at the exact center of the map (regardless of your 'zoom level').
Daniel •
Hi, I came across your blog when doing some searches on indexing geodata... Your posts have been extremely insightful, but I have a few questions (which may/may not be stupid)My current situation is that I'm storing lat/lng as regular float columns in mysql. I'm searching this data based on a center point and a radius.
1. My query uses the spherical law of cosines to calculate distance from the center point
2. In my application code I calculate a bounding box based on the earth's radius/center point/search radius
3. Not surprisingly, the query rarely uses my lat/lng indexes
The bounding box is essentially useless in my current code, but after reading about your experiments with the Morton number I wonder if it would be useful if I converted those coordinates to Morton numbers and used that instead of the bounding box (or in addition to)?
My understanding of this and the math behind it is quite slim, and just enough to get my application working so bear with me ;)
Also, PHP is a bit of a pain when it comes to dealing with unsigned integers, any tips for calculating the Morton number in this case?
Thank you!
Evert •
@Daniel,The biggest benefit of this technique is speeding up the bounding-box index. So yes, I do believe you will benefit from this.
You do need to include the bounding box on top of this though, as I mentioned in my first comment, you will definitely retrieve items outside your box.
suri •
Hello there,
How did you do your benchmarking? did you used your own system or is there any software for benchmarking?
Evert •
Pretty sure it was a custom set up at the time.
StijnDeWitt •
I found that for these huge datasets (in MySQL at least) it can make a huge difference what you fill in for lat / lng value. And not just based on how many results it yields, but also (I suspect) on where in the table those results are. I had a GeoIP query that could take anywhere from a few ms to a few seconds based on what IP address I filled in. So benchmark with a whole bunch of different numbers and then average the results.