subscribe

Indexing geo-data 3: In practice

Since my last post I've found out that using the 'morton' number to index spatial number is also referred to as the Z-order.

To index using this order, you can use this stored function:

CREATE FUNCTION getGeoMorton(lat DOUBLE, lng DOUBLE) RETURNS BIGINT UNSIGNED DETERMINISTIC 
BEGIN

  -- 11930464 is round(maximum value of a 32bit integer / 360 degrees) 
  
  DECLARE bit, morton, pos BIGINT UNSIGNED DEFAULT 0;  
 
  SET @lat = CAST((lat + 90) * 11930464 AS UNSIGNED);
  SET @lng = CAST((lng + 180) * 11930464 AS UNSIGNED);
  SET bit = 1;
  
  WHILE bit <= @lat || bit <= @lng DO 
    
    IF(bit & @lat) THEN SET morton = morton | ( 1 << (2 * pos + 1)); END IF;
    IF(bit & @lng) THEN SET morton = morton | ( 1 << (2 * pos)); END IF;
  
    SET pos = pos + 1;
    
    SET bit = 1 << pos;
  
  END WHILE; 
  
  RETURN morton;
END;

Some caveats

  • Since the function is using floating-point numbers, there will be rounding errors. These are generally very small, and for us well within the acceptable margin of error.
  • More significantly, this function assumes euclidean geometry (i.e.: a flat surface). The earth obviously isn't, so as you get closer to the poles you might get back results from outside your rectangle, or miss results from within.
  • It's not recommended to use this function directly in your WHERE clause. Even though the function is marked deterministic (i.e.: it will always yield the same results for the same arguments), the MySQL query optimizer currently ignored that modifier. So, set it in a temporary variable first (SET @number =) and use the variable in your where clause.
  • Don't forget dealing with queries spanning over the -180°, 180° longitude line.
  • Also store your actual longitude and latitude values in the DB. This function is not intended to be exact.
  • When you do your queries, select both on the morton number, and longitude and latitude.

A better way to do it?

In my research I've found the Hilbert Curve to be an even better algorithm, but haven't yet gone through the effort of trying to express it in SQL.

Web mentions

Comments

  • Roland Bouman

    Roland Bouman

    Hi!

    I just read your posts on geo indexing, and from there i ended up reading wikipedia on the matter. Interesting stuff, thanks!

    I just want to point out there is an error in your MySQL getGeoMorton function. The line

    RETURN @lng;

    Should not be there. I also think there are a number of things that could be improved in your implementation. For example, there is no need to use the user-defined varibles (the @ variables). You will find that these are quite a bit slower than proper local variables.

    There are more performance optimizations possible (and with MySQL stored functions, you need all the speed you can get). Here's what I came up with:

    DELIMITER go
    DROP FUNCTION IF EXISTS f_morton;
    go

    CREATE FUNCTION f_morton(
    p_longitude DOUBLE
    , p_latitude DOUBLE
    )
    RETURNS BIGINT UNSIGNED
    DETERMINISTIC
    BEGIN
    DECLARE v_bit BIGINT UNSIGNED
    DEFAULT 1;
    DECLARE v_shift BIGINT UNSIGNED
    DEFAULT 1;
    DECLARE v_morton BIGINT UNSIGNED
    DEFAULT 0;
    DECLARE v_latitude BIGINT UNSIGNED
    DEFAULT CAST((p_latitude + 90) * 11930464 AS UNSIGNED);
    DECLARE v_longitude BIGINT UNSIGNED
    DEFAULT CAST((p_longitude + 180) * 11930464 AS UNSIGNED);

    WHILE v_bit <= v_latitude || v_bit <= v_longitude DO
    IF v_bit & v_longitude THEN
    SET v_morton := v_morton | v_shift;
    END IF;
    SET v_shift := v_shift << 1;
    IF v_bit & v_latitude THEN
    SET v_morton := v_morton | v_shift;
    END IF;
    SET v_shift := v_shift << 1;
    SET v_bit := v_bit << 1;
    END WHILE;

    RETURN v_morton;
    END;
    go

    DELIMITER ;

    Some testing indicated this is about 20-25% faster.

    I hope it helps!

    Cheers, Roland


  • Evert

    Evert

    Dropped that return line. I must have left it in after debugging; Thanks for that Roland!
  • Evert

    Evert

    jbland,

    From what I read there is that geohashing uses the exact same concept for encoding lat/long and shares the same limitations. The difference is that I encode to a 64bit integer, geohashing encodes this number further as a string.

    Using the integer will give you better performance.

    Evert
  • jbland

    jbland

    Evert,
    im thinking of going with morton numbers to help with clustering, but im unsure about how to dial up/down precision... any pointers
  • Evert

    Evert

    It's not really easy to calculate range, what you could do is simply drop bits off the bottom, like: morton XOR 0x1 to drop the last bit, morton XOR 0x2 for the 2 last bits, etc..

    That way you might not get the best possible clusters, but it could definitely push you in the right direction.
  • kellogs

    Very good stuff, especially since you have had a "real-life" set of data to test against. Will attempt an Hilbert curve implementation and share my experience on SO.

  • James Peterson

    Hello, thank you very much for this! Here is some extra resources for PostgreSQL Users:

    http://www.spatialdbadvisor...

    I made some tweaks to the above link to support large inputs (latitude/longitude coordinates in my case) to support precision up to 4 decimal places. As spatialdbadvisor.com's script threw "int out of range" errors.

    See edit here:

    http://pastebin.com/zzSU4X7J

    And further proofs on random set of coordinates:

    Sample size 10,000: http://oi62.tinypic.com/244...

    Sample size 1,000: http://oi60.tinypic.com/r0p...

    Sample size 100: http://oi58.tinypic.com/10p...

    I use the procedure like so to improve record locality on hard disk:

    ---

    alter my_table add column morton_col int8;

    update my_table set morton_col = morton(cast(trunc(least(longitude_col+180,360)*10000) as int8), cast(trunc(least(latitude_col+90,180)*10000) as int8));

    create index idx_morton_col on my_table(morton_col) with (fillfactor=100);

    cluster my_table using idx_morton_col;

    ---

    The multiplication by 10,000 ensures that the coordinate is represented to a precision of 4 decimal places when encoded as a morton value (which is going to represent an accuracy of around 11.1 meters even at the equator - more than precise enough!)

  • SpatialDBAdvisor

    The morton key implementation on my website is an almost direct translation of the original David Mark algorithm. That algorithm was implemented in C and used unsigned 32 bit integers.

    I have used the morton implementation on my site a lot on a lot of data and have not come across the overflow errors. I am not saying that they won't exist for I have never tried creating morton keys for lat/long data, or the merits of treating unequal lat/long ranges as one would planar data. But let's assume that it is OK to do so, what data range is being indexed (is it the whole world)? How big are the Col/Row grids (4 decimal digits of a degree); How many grids are to be created in x and y direction? Does the morton grid start with 0 0 as the lower left corner? Whether the Row/Col ranges are equal (a recommended requirement for a morton key)?

    The only comment I have is why the introduction of the (Oracle implementation) Left_Shift function? Postgresql's "<<" operator implements a bitwise shift left:

    select 2::int8 << 2 as result;
    result
    bigint
    8

    Which is same as:

    select left_shift(2,2) as result;result
    bigint
    8

    I will update the parameters to int8 in the function on my site.

    Thanks
    Simon

  • SpatialDBAdvisor

    Morton use is subject to the restrictions I record on my site:

    "There is a restriction to how a Morton space curve can be applied: the grid must be regular. This means that while the X and Y dimensions of the grid may be different (ie giving a rectangle not a square) the number of grids either side must be the *same* (ie a 100x100 grid not a 150x100 grid). Another requirement that affects the generation of a correct space curve is that the (row,column) index values must always be the same. So, in a 100x100 grid the lower left grid should be referenced as (0,0) or (1234,1234) and not (0,1) or (1234, 3456)."

  • SpatialDBAdvisor

    Here is David Mark's implementation of a Hilbert Curve in C

    #define XMAX 255
    #define YMAX 255
    #define LEVEL 8
    unsigned int xy_to_hilbert (x,y)
    unsigned int x,y;
    /**************************************************************/
    /* */
    /* converts xy to Hilbert-Peano key */
    /* */
    /* Written: David M. Mark, CSIRO-DIT CSIS, Sept 1988 */
    /* */
    /**************************************************************/
    /**************************************************************/
    /* first, an 'almost dummy' procedure to add parameters to */
    /* a simple call */
    /**************************************************************/
    {
    unsigned int order[4];
    unsigned int xy_to_hlbt ();
    order[0]=0; order[1]=1; order[2]=3; order[3]=2;
    return (xy_to_hlbt (x,y,XMAX,YMAX,order,LEVEL));
    }

    /**************************************************************/
    /* and next, the actual recursive procedure, renamed */
    /* xy_to_hlbt() */
    /**************************************************************/
    unsigned int xy_to_hlbt (x,y,xmax,ymax,order,level)
    unsigned int x,y,xmax,ymax,order[],level;
    {
    unsigned int half_side,add_key,child,temp,q,i;
    if (level==0)
    return (0);
    else
    /* determine q, the Morton quad containing x,y */
    {
    half_side = 01<<(level-1);
    if (x > (xmax-half_side))
    if (y > (ymax-half_side)) q=3;
    else q=1;
    elseif (y > (ymax-half_side)) q=2;
    else q=0;
    /* now determine which 'child' of the parent quad the current x,y falls in */
    for (i=0; i<4; i++)
    if (order[i]==q) child=i;
    /* add_key is the contribution to the key of the current bit_pair */
    add_key = child<<(2*(level-1));
    /* modify xmax and xmax to those of the current q */
    if ((q==0)||(q==2)) xmax -= half_side;
    if ((q==0)||(q==1)) ymax -= half_side;
    /* now, if we are in the first (0) or last (3) child
    of the parent quad, we must change the order of the
    children for the next level of recursion */
    if (child==0)
    {
    temp = order[1];
    order[1] = order[3];
    order[3] = temp;
    }
    if (child==3)
    {
    temp = order[0];
    order[0] = order[2];
    order[2] = temp;
    }
    /* finally, make a recursive call, and add 'add_key' to the result */
    return ((xy_to_hlbt(x,y,xmax,ymax,order,(level-1))) + add_key);
    }
    }

    regards
    Simon