2025-05-25 Dev Blog - Introducing GeoFox, Heloid and More

Author: Indigo Curnick

Date: 2025-05-25

#dev-blog   #rustlang   #geodesics   #programming  



In my job I need to work with lots of geographic data, as you'd expect for someone in the navigation industry. However, I'm less than impressed with the current ecosystem of tools available for working with geographic data. Finally, about two weeks ago, the dam burst and I spawned quite a few new projects.

GeoFox

I started by making a GeoJson command line tool. I was sick of people making newline-delimited "GeoJsons", and I wanted to convert them to the correct format. The thing is, the existence of the GeoJson implies the existence of the GeoToml (and even the GeoYaml).

I already have a good chunk of this developed. Right now I can do things like

geofox convert --input file.ndgeojson --to geojson --dest file.geojson

and so on and so forth. The project isn't public yet, because it's a little too early for it.
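For the curious, the core of that conversion is simple enough to sketch. This is not GeoFox's actual code, just a minimal illustration using serde_json: read each line as a GeoJSON feature and wrap the lot in a FeatureCollection.

use serde_json::{json, Value};

/// Minimal sketch: turn newline-delimited GeoJSON features into a single
/// FeatureCollection. GeoFox itself does much more (validation, other formats).
fn ndgeojson_to_geojson(input: &str) -> Result<Value, serde_json::Error> {
    let features = input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(serde_json::from_str::<Value>)
        .collect::<Result<Vec<Value>, _>>()?;

    Ok(json!({
        "type": "FeatureCollection",
        "features": features,
    }))
}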

I then started to think about other things I might like to do with GeoJsons. Right now you more or less have to use GeoPandas for anything geographic that's complex. This is nonsense! For a start I hate Python, and I love Rust. Wouldn't it be nice to do everything in Rust!

Heloid

It's very frustrating to have runtime dependencies. I do not like the fact that Proj needs to be installed. Ideally it would just be a pure-Rust library, compiled straight into a single binary. Then it all gets integrated into the ecosystem.

I downloaded all the necessary data for the transformations from EPSG. They're in a WKT-CRS format, so I will need to write a parser for that. I've never written a parser before, and there doesn't appear to be a WKT-CRS parser in Rust. My current plan is to have a parser to extract the necessary data from the files, and then have another piece of software turn that into a Rust file, maybe using perfect hash functions. This way it's all compiled into the binary, with no runtime database lookups, which is ideal. (See next section for details on that parser!)
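To make that concrete, here's a rough sketch of what the generated Rust file might look like. This is an assumption about the eventual design rather than code that exists yet; it uses the phf crate for a compile-time perfect hash map keyed by EPSG code, and the WGS 84 ellipsoid values are the ones that appear in the EPSG definition quoted later in this post.

use phf::phf_map;

pub struct Ellipsoid {
    pub semi_major_axis_m: f64,
    pub inverse_flattening: f64,
}

// Generated at build time from the EPSG data - no runtime database lookups.
// 7030 is the EPSG code for the WGS 84 ellipsoid.
pub static ELLIPSOIDS: phf::Map<u32, Ellipsoid> = phf_map! {
    7030u32 => Ellipsoid {
        semi_major_axis_m: 6378137.0,
        inverse_flattening: 298.257223563,
    },
};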

In terms of the mapping itself, it was a lot of fun to learn how it works. It turns out there's a very systematised way of doing projections. First, you have a selection of different ellipsoids. Each is a set of parameters that defines the shape of the earth. The most famous and most widely used is now WGS84, but there are others. There are then several different transforms which take you from this oblate spheroid onto various projection surfaces - for example a flat plane, so you can make a map.

Right now, I'm focusing on the Transverse Mercator (TM) projection. This is a very common projection, and armed with the WGS84 ellipsoid and TM you already have most of the most common projections.

The primary source for this right now is Map Projections Used by the US Geological Survey by John Snyder. Here are the formulas he presents for the forward conversion.

\[x = k_0 N [A + (1 - T + C)A^3/6 + (5 - 18T + T^2 + 72C - 58e^{\prime 2})A^5/120]\] \[y = k_0 \{M - M_0 + N \tan \phi [A^2/2 + (5 - T + 9C + 4C^2) A^4/24 + (61 -58T + T^2 + 600C - 330 e^{\prime 2})A^6/720] \}\] \[k = k_0 [1 + (1+C)A^2/2 + (5-4T +42C + 13C^2 - 28 e^{\prime 2})A^4/24 + (61 -148T + 16T^2) A^6/720]\]

Where \(k_0\) is the scale factor on the central meridian, \(\phi\) and \(\lambda\) are the latitude and longitude of the point being projected, \(\lambda_0\) is the longitude of the central meridian, and \(M\) is the true distance along the central meridian from the equator to \(\phi\).

Furthermore

\[e^{\prime 2} = \frac{e^2}{1-e^2}\] \[N = \frac{a}{\sqrt{1 - e^2 \sin^2 \phi}}\] \[T = \tan^2 \phi\] \[C = e^{\prime 2} \cos^2 \phi\] \[A = \cos \phi (\lambda - \lambda_0)\] \[M = a[(1 - e^2/4 - 3e^4/64 - 5e^6/256 - \dots) \phi - (3e^2/8 + 3e^4/32 + 45e^6/1024 + \dots) \sin 2\phi + (15e^4/256 + 45e^6/1024 + \dots) \sin 4\phi - (35e^6/3072 + \dots) \sin 6\phi + \dots]\]

I'm not sure if this is enough accuracy. There's a much newer paper called Transverse Mercator with an Accuracy of a Few Nanometers by Karney, which might update some of these values and expansions. For now I will use the above truncated formula for \(M\).

\(M_0\) is the same as \(M\), but with all instances of \(\phi\) replaced by \(\phi_0\).

If \(\phi = \pm \pi / 2\) then \(x=0\), \(y = k_0(M-M_0)\), \(k=k_0\).
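Here is how those forward formulas translate into Rust. This is just a sketch of the maths above, not Heloid's final API: the function name and signature are mine, all angles are in radians, and the ellipsoid and projection parameters are passed in as plain numbers.

/// Forward Transverse Mercator, following the Snyder formulas above.
/// All angles in radians; `a` is the semi-major axis, `e2` is e^2,
/// `k0` the central meridian scale factor, `lam0`/`phi0` the origin.
pub fn tm_forward(phi: f64, lam: f64, a: f64, e2: f64, k0: f64, lam0: f64, phi0: f64) -> (f64, f64) {
    // Meridional distance M (truncated series, as above); e^4 = (e^2)^2, e^6 = (e^2)^3
    let m = |phi: f64| -> f64 {
        a * ((1.0 - e2 / 4.0 - 3.0 * e2.powi(2) / 64.0 - 5.0 * e2.powi(3) / 256.0) * phi
            - (3.0 * e2 / 8.0 + 3.0 * e2.powi(2) / 32.0 + 45.0 * e2.powi(3) / 1024.0) * (2.0 * phi).sin()
            + (15.0 * e2.powi(2) / 256.0 + 45.0 * e2.powi(3) / 1024.0) * (4.0 * phi).sin()
            - (35.0 * e2.powi(3) / 3072.0) * (6.0 * phi).sin())
    };

    let ep2 = e2 / (1.0 - e2); // e'^2
    let n = a / (1.0 - e2 * phi.sin().powi(2)).sqrt();
    let t = phi.tan().powi(2);
    let c = ep2 * phi.cos().powi(2);
    let big_a = phi.cos() * (lam - lam0);

    let x = k0 * n
        * (big_a
            + (1.0 - t + c) * big_a.powi(3) / 6.0
            + (5.0 - 18.0 * t + t * t + 72.0 * c - 58.0 * ep2) * big_a.powi(5) / 120.0);

    let y = k0
        * (m(phi) - m(phi0)
            + n * phi.tan()
                * (big_a.powi(2) / 2.0
                    + (5.0 - t + 9.0 * c + 4.0 * c * c) * big_a.powi(4) / 24.0
                    + (61.0 - 58.0 * t + t * t + 600.0 * c - 330.0 * ep2) * big_a.powi(6) / 720.0));

    (x, y)
}

On the WGS84 ellipsoid with UTM-style parameters (\(k_0 = 0.9996\), \(\phi_0 = 0\)) this gives the standard UTM coordinates, before the false easting and northing are applied.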


So in summary, the TM is not a single map system but rather a technique which gives you a few free parameters. First you pick an ellipsoid, defined by a semi-major axis \(a\) and an eccentricity \(e\). Most datasets list the flattening \(f\) rather than \(e\), but thankfully it is easy enough to compute one from the other by \(e^2 = 2f - f^2\).

Then, you pick the transformation parameters: \(\lambda_0\), \(\phi_0\) and \(k_0\). You can also later introduce a false northing and false easting, which simply shift the \(x, y\) coordinates you get out. For example, you might like the bottom left hand corner of the map to be \((0,0)\).
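Tying that together with the sketch above (all names are still mine, not Heloid's): compute \(e^2\) from the inverse flattening, pick the TM parameters, and apply the false easting and northing at the end. The UTM zone 30N parameters here (\(\lambda_0 = -3^\circ\), \(k_0 = 0.9996\), false easting 500000 m) are just a familiar example; this assumes the tm_forward sketch from earlier.

fn main() {
    // WGS84, using the EPSG values quoted later in this post
    let a = 6378137.0_f64;
    let inv_f = 298.257223563_f64;
    let f = 1.0 / inv_f;
    let e2 = 2.0 * f - f * f; // e^2 = 2f - f^2

    // UTM zone 30N as an example parameter set
    let k0 = 0.9996;
    let lam0 = (-3.0_f64).to_radians();
    let phi0 = 0.0;
    let (false_easting, false_northing) = (500_000.0, 0.0);

    // Project a point in, say, central England (uses the tm_forward sketch above)
    let (phi, lam) = (52.0_f64.to_radians(), (-1.5_f64).to_radians());
    let (x, y) = tm_forward(phi, lam, a, e2, k0, lam0, phi0);

    println!("easting  = {:.1} m", x + false_easting);
    println!("northing = {:.1} m", y + false_northing);
}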

The inverse transformations are given by

\[\phi = \phi_1 - (N_1 \tan \phi_1 / R_1)[D^2 /2 - (5 + 3T_1 + 10C_1 - 4C_1^2 - 9 e^{\prime 2}) D^4/24 + (61 + 90T_1 + 298C_1 + 45T_1^2 -252e^{\prime 2} - 3C_1^2) D^6/720]\] \[\lambda = \lambda_0 + [D - (1 + 2T_1 + C_1)D^3/6 + (5 - 2C_1 + 28T_1 - 3C_1^2 + 8 e^{\prime 2} + 24T_1^2) D^5 / 120]/ \cos \phi_1\]

Where

\[\phi_1 = \mu + (3e_1/2 - 27e_1^3/32 + \dots) \sin 2 \mu + (21 e^2_1/16 - 55 e_1^4 /32 - \dots) \sin 4 \mu + (151e_1^3/96 + \dots) \sin 6 \mu + \dots\]

Again, I think the terms presented give a sufficient degree of precision, but we may need more in future. Also note that there is what I believe to be an error in the original text: it clearly says \(-55e_1^432\) but I strongly believe this is intended to be \(-55e_1^4/32\).

And furthermore

\[e_1 = \frac{1 - \sqrt{1-e^2}}{1 + \sqrt{1-e^2}}\] \[\mu = M / [a (1 - e^2/4 - 3e^4/64 - 5e^6/256 - \dots)]\]

This time we can compute \(M\) by

\[M = M_0 + y / k_0\]

Since \(M_0\) only depends on parameters of the transformation itself, this is still possible to compute.

\[e^{\prime 2} = \frac{e^2}{1-e^2}\] \[C_1 = e^{\prime 2} \cos^2 \phi_1\] \[T_1 = \tan^2 \phi_1\] \[N_1 = \frac{a}{\sqrt{1 - e^2 \sin^2 \phi_1}}\] \[R_1 = \frac{a (1- e^2)}{(1 - e^2 \sin^2 \phi_1)^\frac{3}{2}}\] \[D = x / (N_1 k_0)\]
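And here is the corresponding inverse, again as a rough sketch of the formulas rather than anything final. The function name and signature are mine; \(M_0\) is passed in precomputed, using the same series for \(M\) evaluated at \(\phi_0\).

/// Inverse Transverse Mercator, following the Snyder formulas above.
/// Returns (phi, lambda) in radians.
pub fn tm_inverse(x: f64, y: f64, a: f64, e2: f64, k0: f64, lam0: f64, m0: f64) -> (f64, f64) {
    let ep2 = e2 / (1.0 - e2); // e'^2

    let m = m0 + y / k0;
    let e1 = (1.0 - (1.0 - e2).sqrt()) / (1.0 + (1.0 - e2).sqrt());
    let mu = m / (a * (1.0 - e2 / 4.0 - 3.0 * e2.powi(2) / 64.0 - 5.0 * e2.powi(3) / 256.0));

    // Footpoint latitude phi_1
    let phi1 = mu
        + (3.0 * e1 / 2.0 - 27.0 * e1.powi(3) / 32.0) * (2.0 * mu).sin()
        + (21.0 * e1.powi(2) / 16.0 - 55.0 * e1.powi(4) / 32.0) * (4.0 * mu).sin()
        + (151.0 * e1.powi(3) / 96.0) * (6.0 * mu).sin();

    let c1 = ep2 * phi1.cos().powi(2);
    let t1 = phi1.tan().powi(2);
    let n1 = a / (1.0 - e2 * phi1.sin().powi(2)).sqrt();
    let r1 = a * (1.0 - e2) / (1.0 - e2 * phi1.sin().powi(2)).powf(1.5);
    let d = x / (n1 * k0);

    let phi = phi1
        - (n1 * phi1.tan() / r1)
            * (d.powi(2) / 2.0
                - (5.0 + 3.0 * t1 + 10.0 * c1 - 4.0 * c1 * c1 - 9.0 * ep2) * d.powi(4) / 24.0
                + (61.0 + 90.0 * t1 + 298.0 * c1 + 45.0 * t1 * t1 - 252.0 * ep2 - 3.0 * c1 * c1)
                    * d.powi(6) / 720.0);

    let lam = lam0
        + (d - (1.0 + 2.0 * t1 + c1) * d.powi(3) / 6.0
            + (5.0 - 2.0 * c1 + 28.0 * t1 - 3.0 * c1 * c1 + 8.0 * ep2 + 24.0 * t1 * t1)
                * d.powi(5) / 120.0)
            / phi1.cos();

    (phi, lam)
}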

wkt-crs-rs

As explained above, the next steps are really to get a fully functional parser for the wkt-crs format. I got a copy of Geographic Information - Well-known Text Representation of Coordinate Reference Systems by the Open Geospatial Consortium. I've been working through and implementing it.

Let's look at some examples of wkt-crs.

CS[ellipsoidal,3],
AXIS["latitude",north,ORDER[1],ANGLEUNIT["degree",0.017]],
AXIS["longitude",east,ORDER[2],ANGLEUNIT["degree",0.017]],
AXIS["ellipsoidal height (h)",up,ORDER[3],LENGTHUNIT["metre",1.0]]

That one describes a three-dimensional coordinate system with latitude, longitude and height.

GEOGCRS["WGS 84",ENSEMBLE["World Geodetic System 1984 ensemble", MEMBER["World Geodetic System 1984 (Transit)", ID["EPSG",1166]], MEMBER["World Geodetic System 1984 (G730)", ID["EPSG",1152]], MEMBER["World Geodetic System 1984 (G873)", ID["EPSG",1153]], MEMBER["World Geodetic System 1984 (G1150)", ID["EPSG",1154]], MEMBER["World Geodetic System 1984 (G1674)", ID["EPSG",1155]], MEMBER["World Geodetic System 1984 (G1762)", ID["EPSG",1156]], MEMBER["World Geodetic System 1984 (G2139)", ID["EPSG",1309]], MEMBER["World Geodetic System 1984 (G2296)", ID["EPSG",1383]], ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1,ID["EPSG",9001]],ID["EPSG",7030]], ENSEMBLEACCURACY[2],ID["EPSG",6326]],CS[ellipsoidal,2,ID["EPSG",6422]],AXIS["Geodetic latitude (Lat)",north],AXIS["Geodetic longitude (Lon)",east],ANGLEUNIT["degree",0.0174532925199433,ID["EPSG",9102]],ID["EPSG",4326]]

That one is really important! It describes WGS 84 - by far the most common coordinate system, and the base ellipsoid for most coordinate systems in use. This is exactly the kind of thing I will need to parse and convert into data for Heloid.

I've never written a parser before. I skimmed through a few books like The Theory and Practice of Compiler Writing by Jean-Paul Tremblay and Paul G. Sorenson and Crafting Interpreters by Robert Nystrom, as I knew they would involve some kind of parser for the language in question.

It didn't take too long to figure out the basic structure. The first step is to tokenise the text. The second step is to parse those tokens into an abstract syntax tree. The third step is to validate the structure against the specification and convert it into something meaningful in Rust.

The tokens for wkt-crs are as follows:

// Debug is needed for the error messages in the parser below
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Ident(String),
    String(String),
    Number(f64),
    DateTime(Temporal),
    LBracket,
    RBracket,
    Comma,
}

Actually tokenising is done as follows

pub fn tokenize(mut s: &str) -> Vec<Token> {
    let mut tokens = Vec::new();

    while let Some(c) = s.chars().next() {
        match c {
            '[' => {
                tokens.push(Token::LBracket);
                s = &s[1..];
            }
            ']' => {
                tokens.push(Token::RBracket);
                s = &s[1..];
            }
            ',' => {
                tokens.push(Token::Comma);
                s = &s[1..];
            }
            '"' => {
                let end = s[1..].find('"').unwrap() + 1;
                let content = &s[1..end];
                tokens.push(Token::String(content.to_string()));
                s = &s[end + 1..];
            }
            c if c.is_ascii_digit() || c == '-' || c == '.' => {
                let len = s
                    .find(|ch: char| {
                        !ch.is_ascii_digit()
                            && ch != '.'
                            && ch != '-'
                            && ch != '+'
                            && ch != 'e'
                            && ch != 'E'
                    }) // TODO: we might need more robust scientific notation handling in future
                    .unwrap_or(s.len());
                let num_str = &s[..len];

                // ! This is some premium jank
                // TODO: Integer support?
                if let Ok(num) = num_str.parse::<f64>() {
                    tokens.push(Token::Number(num));
                    s = &s[len..];
                } else if let Ok(date) = Temporal::try_from(num_str) {
                    tokens.push(Token::DateTime(date));
                    s = &s[len..];
                } else {
                    panic!("Unknown number type")
                };
            }
            c if c.is_ascii_alphabetic() => {
                let len = s
                    .find(|ch: char| !ch.is_ascii_alphabetic())
                    .unwrap_or(s.len());
                let ident = &s[..len];

                // All uppercase is a keyword. Lowercase strings without quotes
                // indicate essentially enums, like `ellipsoidal` to delineate
                // CS type
                if is_all_upper(ident) {
                    tokens.push(Token::Ident(ident.to_string()));
                } else {
                    tokens.push(Token::String(ident.to_string()));
                }

                s = &s[len..];
            }
            c if c.is_whitespace() => {
                s = &s[1..];
            }
            _ => panic!("unhandled char {:?}", c),
        }
    }

    tokens
}

It's a little messy but I think it works. I definitely need better error handling.
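(The is_all_upper helper referenced above isn't shown; a minimal version might look like this.)

fn is_all_upper(s: &str) -> bool {
    // Keywords like GEOGCRS are entirely upper-case ASCII letters
    s.chars().all(|c| c.is_ascii_uppercase())
}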

Once we've done that, we need to parse the tokens into nodes. I use the following node structure

#[derive(Debug, PartialEq)]
pub struct WktNode {
    pub keyword: String,
    pub args: Vec<WktArg>,
}

#[derive(Debug, PartialEq)]
pub enum WktArg {
    String(String),
    Number(f64),
    Node(WktNode),
    DateTime(Temporal),
}

(Don't worry about Temporal - I'll discuss that in the next section.)

We can then parse these nodes with

fn parse_nodes(tokens: &mut Vec<Token>) -> Vec<WktNode> {
    let mut nodes = Vec::new();

    loop {
        match tokens.first() {
            Some(Token::Ident(_)) => {
                nodes.push(parse_node(tokens));
            }
            Some(Token::Comma) => {
                tokens.remove(0);
            }
            None => {
                break;
            }
            _ => panic!("Unexpected token"),
        }
    }

    nodes
}

pub fn parse_node(tokens: &mut Vec<Token>) -> WktNode {
    let keyword = match tokens.remove(0) {
        Token::Ident(s) => s,
        _ => panic!("expected keyword"),
    };

    assert!(matches!(tokens.remove(0), Token::LBracket));

    let mut args = Vec::new();
    loop {
        match tokens.first() {
            Some(Token::RBracket) => {
                tokens.remove(0);
                break;
            }
            Some(Token::Comma) => {
                tokens.remove(0);
            }
            Some(Token::String(s)) => {
                args.push(WktArg::String(s.clone()));
                tokens.remove(0);
            }
            Some(Token::Number(n)) => {
                args.push(WktArg::Number(*n));
                tokens.remove(0);
            }
            Some(Token::DateTime(d)) => {
                args.push(WktArg::DateTime(d.clone()));
                tokens.remove(0);
            }
            Some(Token::Ident(_)) => {
                let node = parse_node(tokens);
                args.push(WktArg::Node(node));
            }
            other => panic!("unexpected token: {:?}", other),
        }
    }

    WktNode { keyword, args }
}

pub fn parse_wkt(s: &str) -> Vec<WktNode> {
    let mut tokens = tokenize(s);
    parse_nodes(&mut tokens)
}

Finally we can actually validate and parse based on this AST. Let's look at ID. An example might be

ID["EuroGeographics","ES_ED50 (BAL90) to ETRS89","2001-04-20"]

Which I would parse into this

Id {
	authority_name: "EuroGeographics".to_string(),
	authority_unique_identifier: NumText::Text("ES_ED50 (BAL90) to ETRS89".to_string()),
	version: Some(NumText::Text("2001-04-20".to_string())),
	authority_citation: None,
	id_uri: None,
};

I've been handling all of these with an impl TryFrom<&WktNode> for Id, which lets us do the conversion. I won't paste the whole code for the validation as it's rather long!
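As a rough illustration of the shape of it, though: something like the sketch below. The error type, the assumption that NumText has both Number and Text variants, and the handling of the optional fields are all made up for the sketch; the real code does considerably more validation.

// Hypothetical error type for the sketch; the real crate's error handling differs.
#[derive(Debug)]
pub struct WktError(String);

impl TryFrom<&WktNode> for Id {
    type Error = WktError;

    fn try_from(node: &WktNode) -> Result<Self, Self::Error> {
        if node.keyword != "ID" {
            return Err(WktError(format!("expected ID, got {}", node.keyword)));
        }

        // First argument must be the authority name as a string
        let authority_name = match node.args.first() {
            Some(WktArg::String(s)) => s.clone(),
            other => return Err(WktError(format!("expected authority name, got {:?}", other))),
        };

        // Second argument is the unique identifier, which may be a number or text
        let authority_unique_identifier = match node.args.get(1) {
            Some(WktArg::Number(n)) => NumText::Number(*n),
            Some(WktArg::String(s)) => NumText::Text(s.clone()),
            other => return Err(WktError(format!("expected identifier, got {:?}", other))),
        };

        // Remaining optional arguments (version, citation, URI) elided in this sketch
        Ok(Id {
            authority_name,
            authority_unique_identifier,
            version: None,
            authority_citation: None,
            id_uri: None,
        })
    }
}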

Whether this is the best design, I don't know, but it seems to be working for now!

One really nice thing about developing this parser is it's so easy to test. I can simply implement the spec as a struct or an enum, and then produce a test for it. They even provide examples in the spec! It's almost a meditative process to just move through the spec section by section.

It's also a good place to experiment with test driven development. I've been generally kind of sceptical of it, but I think it might actually work quite well here. The desired results are really well defined. I'm a little less convinced in situations where I'm reaching out to a database or something. For this project though, having lots of tests is making it a breeze (though the spec is huge, at 120 pages).

Horologium

One aspect of the wkt-crs spec which is a little annoying is the dates. Sometimes they're provided as a string, which makes handling them easy. However, they're often provided as an object. This means they can occupy many different formats: a bare date, a bare time, a date-time without an offset, or a date-time with a UTC offset.

In other words, just any random format which an SQL database might spit out. As far as I can see, there's no good wrapper which can handle deserialising arbitrary date formats right now.

So I made it! It's called Horologium and, of the projects discussed so far, this is the first which has been released publicly. It's super early days, but it gives you the following enum

pub enum Temporal {
    Date(Date),
    Time(Time),
    DateTime(PrimitiveDateTime),
    OffsetDateTime(OffsetDateTime),
}

You can make one from a string like Temporal::try_from("2025-05-25"), which is useful if you don't know in advance what the date format might look like.
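A quick example of what that looks like in practice. The import path is a guess (check the crate docs), but the variants are the ones from the enum above.

use horologium::Temporal; // assumed import path

fn main() {
    match Temporal::try_from("2025-05-25") {
        Ok(Temporal::Date(_)) => println!("parsed as a plain date"),
        Ok(Temporal::Time(_)) => println!("parsed as a time of day"),
        Ok(Temporal::DateTime(_)) => println!("parsed as a date-time without offset"),
        Ok(Temporal::OffsetDateTime(_)) => println!("parsed as a date-time with offset"),
        Err(_) => println!("not a recognised temporal format"),
    }
}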

Lofty Goals

So the order of operations for me is something like this: finish the wkt-crs-rs parser, use it to turn the EPSG data into compiled-in Rust data for Heloid, implement the projections themselves in Heloid (starting with the Transverse Mercator), and then build GeoFox on top of all of that.

It's a very lofty goal but the day I don't have to touch Python in any way to do the geographic stuff at my job will be a very happy day indeed.

My Other Stuff?

Those of you who are avid readers of my blog will know that I was steadily working through Mathematical Methods in the Physical Sciences by Mary L. Boas. I do intend to finish that still, though it will likely take much longer than I originally anticipated due to this project now taking up a fair amount of my time. I have several articles on integration and vector calculus which are half finished, but I hope will still see the light of day soon!

Similarly I wanted to work through Introduction to Algorithms by Thomas H. Cormen et al, but it seems unlikely that will happen this year now.

In terms of Latin, I'm still translating Lingua Latina by Hans H. Ørberg chapter by chapter, though I haven't actually done that in a while. I intend to finish off the chapter I'm about halfway through immediately after publishing this blog. Theoretically I should still complete my entire translation of it into English by the end of the year, which has by far been the most effective way to study Latin I've found. In terms of Latin blogs, I really don't see many on the horizon over the next few months.