How to map yurts in Ulaanbaatar and save time using open source AI tools

Aliaksandr
May 11, 2022

Until now, there have been many challenges on Kaggle and other platforms that aim to develop algorithms for mapping features on the Earth's surface: ships, building footprints, and so on. However, the actual business need is not just the development of good algorithms, but saving time and money.

I will solve a real-world case: map yurts in Ulaanbaatar and measure how much time I can save using AI approaches. The challenge is that algorithm development costs time, so I need to balance time between AI development and just mapping by hand to make it profitable. So, I will log all the time I spend to see how well automation works.

Another challenge is that I only have a personal laptop with an Intel i5 processor, 16GB RAM, and an Nvidia 3060 6GB GPU. This means I can't run many parallel computational experiments; moreover, training blocks any other activity on the laptop. But this is probably closer to the real environment in mapping companies, and it makes the experiment independent and fair.

Dataset usage should comply with the Mapbox imagery usage license; it looks like it is suitable only for OSM improvement. The dataset and training pipelines are available at https://github.com/aliaksandr960/ulaanbaatar_yurts

Why yurts?

In my work as a CV/AI engineer in remote sensing, I look at many satellite images every day and sometimes spot something unusual. In the case of Mongolia, it was yurts.

Ulaanbaatar (https://map.openaerialmap.org/#/106.92721590399742,47.937303861545516,19/square/1302310/59e62b7a3d6412ef7220952e?_k=nkhprp)

It is unusual that yurts are frequently found not only in deserts and remote areas, but in the capital, Ulaanbaatar, as well. While in the desert this traditional portable dwelling migrates with nomads and their flocks, in Ulaanbaatar these buildings stay put for years.

Ulaanbaatar (https://commons.wikimedia.org/wiki/File:The_private_sector_with_yurts_against_the_backdrop_of_high-rise_new_buildings_in_Ulaanbaatar.jpg)

There is a special tag in OpenStreetMap, "building=ger", and many yurts are already mapped. But I have done much more and covered the whole of Ulaanbaatar, not just downtown: that is more than 4,700 sq km.

Ulaanbaatar city area

Existing solutions

While looking for similar solutions, I found that ESRI and L3Harris embed good machine learning tools in their products, and it is now possible to automate such problems even without computer vision expertise. It is just a matter of time before machine learning becomes a tool that every remote sensing engineer uses. Unfortunately, as of now (May 2022), the L3Harris and ESRI solutions are paid.

I will use only open source tools and see how convenient they can be.

Phase 1. Let’s make a simple model and try to use it!

Dataset

In a real-world scenario, I have no time to dig too deep into model development, because I could potentially spend more time on the model than on just mapping everything by hand. So, I decided to make the first step as simple as possible and look at the results.

To process the data, I divided the whole Ulaanbaatar territory into square images of 2.5x2.5 km (EPSG:3857 projection) and used zoom level 18 (about 0.6 m spatial resolution). I got 1,915 squares covering 11,968 sq km in EPSG:3857 (the projection matters here: in reality the Ulaanbaatar area is 4,700 sq km, but the source images are projected this way).

All squares 2.5x2.5km (EPSG:3857)
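For the curious, a grid like this takes only a few lines to generate. Here is a minimal sketch with shapely, where the bounding box coordinates are illustrative rather than the exact extent I used:

```python
from shapely.geometry import box

# Illustrative EPSG:3857 bounding box around Ulaanbaatar,
# not the exact extent used in the project.
XMIN, YMIN, XMAX, YMAX = 11_840_000, 6_020_000, 11_960_000, 6_100_000
STEP = 2_500  # 2.5 km grid step, in projected meters

def make_grid(xmin, ymin, xmax, ymax, step):
    """Yield square cells covering the bounding box."""
    x = xmin
    while x < xmax:
        y = ymin
        while y < ymax:
            yield box(x, y, x + step, y + step)
            y += step
        x += step

squares = list(make_grid(XMIN, YMIN, XMAX, YMAX, STEP))
print(len(squares), "squares")
```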

In the examples below, you can easily tell that some circles are not yurts, thanks to their shadows and the real yurts nearby for comparison. But in some cases it looks nearly impossible to tell. Unfortunately, it seems things will be just as hard for the model.

I manually chose 16 squares covering various surface types and labeled them using QGIS.

This produced 100 sq km of markup with 1,981 yurts. Labeling speed depends heavily on the territory type: a neat row of yurts goes quickly, but sparse or very dense yurt clusters take more time.

It took about 12 hours, which means I marked about 160 yurts/hour.

Training

I sliced the dataset images into 1024x1024 pixel tiles, split them into train/valid sets with a 0.7/0.3 proportion (78/33), and removed all tiles without yurts. Admittedly, the absence of a test set and the possibility of the same image contributing to both train and valid sets are controversial, but under a strict time limit this decision was reasonable: I am trying to improve total time, not model metrics.

There are two common approaches, object detection and segmentation; I chose segmentation. I used a U-Net with an EfficientNet-B1 backbone: I don't have much data, so a lightweight backbone looks reasonable, and U-Net is one of the more stable architectures for semantic segmentation, widely used over the last years. As the loss function I used Jaccard loss; it may not be the best choice, but it let me skip writing metric code, since Jaccard loss = 1 - IoU, so I could easily derive IoU. As the optimizer I used Adam with default settings.
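This whole setup can be expressed in a few lines. A minimal sketch assuming the segmentation_models_pytorch library (the library choice is my assumption here; the actual pipeline is in the repo linked above):

```python
import torch
import segmentation_models_pytorch as smp

# U-Net with a lightweight EfficientNet-B1 encoder and one output
# channel for the binary "yurt / not yurt" mask.
model = smp.Unet(
    encoder_name="efficientnet-b1",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

# Jaccard loss = 1 - IoU, so validation IoU is simply 1 - loss.
criterion = smp.losses.JaccardLoss(mode="binary")
optimizer = torch.optim.Adam(model.parameters())  # default settings
```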

On the first training run I got 0.85 IoU on the validation set, which is a pretty good result.

So, let’s infer the whole city! Inference was done in a sliding-window fashion with overlap; the window size was 512x512 pixels.
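A rough sketch of the sliding-window loop (the 64-pixel overlap is my assumption; the article only fixes the 512x512 window size):

```python
import numpy as np
import torch

def sliding_window_predict(model, image, window=512, overlap=64, device="cuda"):
    """Predict a probability mask for a large CHW float image by running
    the model over overlapping windows and averaging the overlaps.
    Assumes the image is at least one window in each dimension and
    the model is already on `device`."""
    _, h, w = image.shape
    prob = np.zeros((h, w), dtype=np.float32)
    hits = np.zeros((h, w), dtype=np.float32)
    step = window - overlap

    def starts(size):
        # Window origins along one axis, clamped so the last window
        # always touches the image edge.
        s = list(range(0, size - window + 1, step))
        if s[-1] != size - window:
            s.append(size - window)
        return s

    model.eval()
    with torch.no_grad():
        for y in starts(h):
            for x in starts(w):
                patch = torch.from_numpy(image[:, y:y + window, x:x + window])
                logits = model(patch.unsqueeze(0).to(device))
                p = torch.sigmoid(logits)[0, 0].cpu().numpy()
                prob[y:y + window, x:x + window] += p
                hits[y:y + window, x:x + window] += 1
    return prob / np.maximum(hits, 1)
```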

It took about 1 hour to convert the dataset, 2 hours to build the training pipeline, 1 hour to train, and 4 hours for inference: 8 hours total.

Validation

All yurts are shaped as circles, so I needed to convert the polygons I got from mask polygonization into circles. I loaded all polygons into QGIS and computed the area and centroid of each polygon. Then I buffered every centroid with radius sqrt(“area” / pi()) to make circles with the same area as the original polygons.
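The same conversion as a quick Python sketch with shapely:

```python
import math
from shapely.geometry import Polygon

def polygon_to_circle(poly):
    """Replace a polygon with a circle of equal area centered on its centroid."""
    radius = math.sqrt(poly.area / math.pi)
    return poly.centroid.buffer(radius)

# Example: a 10x10 square becomes a circle of (nearly) the same area;
# the tiny difference comes from the buffer's polygonal approximation.
square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
print(round(polygon_to_circle(square).area, 1))  # close to 100
```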

QGIS polygons to centroids to circles processing

This way I got nice-looking circles.

Before — red, after — blue

I started looking at the results. They are pretty good in rural areas, and it seems the network is robust enough to work across the whole city.

Inference in rural areas

But there are many false positives in uninhabited areas, forests, and mountain landscapes, because the 2.5 km squares were inferred without any additional logic to merge them and without enough data augmentation to handle cropped images.

False positives

I validated part of the city (green) with the first model and calculated the stats.

Green — phase 1 territory, blue — phase 2 territory

Processed area: 992 squares (2.5x2.5 km), about 6,200 sq km in EPSG:3857 (don't be confused: in this projection the area comes out significantly larger than in reality, but I used it because many online mapping services do).

After validation I counted the stats (‘k’ means thousands):

  • 35k true positives
  • 65k false positives
  • 2k false negatives
  • precision 0.35
  • recall 0.94
  • f1 0.51

My solution is quite unbalanced: very high recall but low precision.

But this has its own advantage: deleting an element takes much less time than adding one. You can select many points and delete them with one click, while adding can only be done one by one.

Validation took 3 days, about 24 hours of work. I added 35,000 yurts, which means about 1,400 yurts/hour.

More than 8 times faster than without AI.

Phase 2. Let’s remove false positives and do better!

Model and training

Actually, the recall is not bad; the main issue is precision: there are too many false positives in forest and mountain areas.

There are many ways to improve this metric: I could rebalance the dataset by adding more samples without yurts, use losses better suited to hard samples, like Focal Loss, apply active learning and label additional territories, and so on. But, unfortunately, I had limited hardware and time, and it seemed I couldn't afford those options.

I decided to use a chip classification approach (similar to this project: https://github.com/azavea/raster-vision). It means dividing the large image into “cells”, passing each cell to a classification network, and simply asking whether it contains a target object.
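A minimal sketch of the idea, assuming 256-pixel cells (about 150 m at 0.6 m/px, matching the chip mosaic shown below; the exact cell size is my assumption):

```python
import numpy as np

def classify_chips(image, classifier, cell=256):
    """Split an HWC image into square cells and flag the ones that,
    according to the classifier, contain at least one yurt.
    Returns a boolean grid of cell verdicts."""
    h, w, _ = image.shape
    grid = np.zeros((h // cell, w // cell), dtype=bool)
    for i in range(h // cell):
        for j in range(w // cell):
            chip = image[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            grid[i, j] = classifier(chip)  # True if a yurt is present
    return grid
```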

I have lots of examples from Phase 1, and they are suitable for building a chip classifier dataset. I randomly chose 24 images, sliced them into 1024x1024 tiles, and got 527 train tiles and 161 valid tiles.

Several months ago I experimented with flower classification (https://www.kaggle.com/code/aliaksandr960/flower17clstrainrepvgg-b2-aa2), and I reused that same pipeline to classify yurts. I rewrote the dataloader and changed the number of classes from 16 to 2. That is all that was done: just one training loop, total. No fine-tuning, in order to stay within the time limits.
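If the backbone comes from timm, which is my assumption (the linked Kaggle notebook has the actual pipeline), the adaptation really is a couple of lines:

```python
import timm
import torch

# Same RepVGG-B2 backbone as the flower classifier, with the head
# replaced for two classes: "yurt" / "no yurt".
model = timm.create_model("repvgg_b2", pretrained=True, num_classes=2)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()
```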

I got 0.92 F1 on the validation set. The result looks good enough to use.

Inference

I inferred chips for the whole territory.

Red — positive chips

If you look closer, it is a mosaic of squares with a side of about 150 m.

Chip squares

I got inference results for the whole territory and kept only the Phase 2 part.

Phase 1 model with Phase 2 chips

Then I clipped the detections, keeping only those within positive chips.

Phase 1 model with yurts within Phase 2 chips
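That clipping step can be done with geopandas; a sketch with illustrative file names (sjoin with the predicate argument requires geopandas 0.10 or newer):

```python
import geopandas as gpd

# Illustrative file names: yurt circles from the Phase 1 model
# and the positive chip squares from the Phase 2 classifier.
yurts = gpd.read_file("phase1_yurts.geojson")
chips = gpd.read_file("positive_chips.geojson")

# Keep only yurts that intersect at least one positive chip.
joined = gpd.sjoin(yurts, chips, how="inner", predicate="intersects")
filtered = yurts.loc[joined.index.unique()]
filtered.to_file("yurts_within_positive_chips.geojson", driver="GeoJSON")
```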

Validation

As in Phase 1, I reviewed all results manually and calculated the metrics.

  • 62k true positives
  • 11k false positives
  • 6k false negatives
  • precision 0.84
  • recall 0.91
  • f1 0.87

So, I got slightly lower recall, but precision rose by more than 2 times.

How does it feel? It became significantly less boring, because I no longer had to reject thousands of false positives.

I spent 2 hours generating the dataset, 2 hours adapting the classification model, 1 hour embedding it into the inference pipeline, and 4 hours inferring the whole territory.

Validation took 2 days, about 16 hours of work. I added 68,000 yurts, which means about 4,250 yurts/hour.

More than 3 times faster than the Phase 1 AI.

Phase 3. Let’s merge it together!

I took the Phase 1 data and the Phase 2 data and simply merged the vector layers in QGIS.

Merged result
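The same merge can be scripted with geopandas instead of clicking through QGIS; file names are illustrative:

```python
import pandas as pd
import geopandas as gpd

# Concatenate the two vector layers into one.
phase1 = gpd.read_file("phase1_yurts.geojson")
phase2 = gpd.read_file("yurts_within_positive_chips.geojson")
merged = gpd.GeoDataFrame(
    pd.concat([phase1, phase2], ignore_index=True), crs=phase1.crs
)
merged.to_file("all_yurts.geojson", driver="GeoJSON")
```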

There are places where the model produces an ideal result and places where the error count is significantly higher, and it was difficult to predict where that would happen. There should be room for unsupervised learning to predict and eliminate such issues and make the model better.

It took about 1 day, roughly 8 hours of work.

Totals

  • AI-assisted mapping: 12+1+2+1+4+24+2+2+1+4+16+8 = 77 hours
  • Mapping by hand: 100k items / 0.16k items per hour = 625 hours

The whole city was processed 8 times faster with the AI approach.

And it looks like open source AI approaches will soon become just another common tool for every remote sensing engineer.

https://github.com/aliaksandr960/ulaanbaatar_yurts
