An open dataset that enhances number-of-floors coverage in OSM for major Belarusian cities

Aliaksandr
4 min read · Dec 28, 2021


Image: before/after comparison

I was looking for interesting ways to experiment with OpenStreetMap and machine learning during the Russian 4-day holidays in early November. Some time ago I worked with object-tracking systems, and I realized there are ideas from that field that could speed up some tedious mapping processes — one of them is similarity computation.

A Siamese network could extend OpenStreetMap building-levels data by finding buildings that look similar to ones with marked levels and assuming they have the same number of floors. Potentially, this approach could be extended to any other attribute.
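The assignment step described above can be sketched as a nearest-neighbour lookup in embedding space. This is a minimal illustration with toy 2-D vectors, not the code from the actual notebooks:

```python
import math

def nearest_levels(unmarked_vec, marked):
    """Assign an unmarked building the level count of the closest
    marked building, by Euclidean distance between feature vectors."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(marked, key=lambda m: dist(unmarked_vec, m["vec"]))
    return best["levels"]

# Toy embeddings (hypothetical values): the 5-storey building is closest
marked = [{"vec": [0.9, 0.1], "levels": 5},
          {"vec": [0.1, 0.9], "levels": 9}]
print(nearest_levels([0.8, 0.2], marked))  # 5
```

In practice the embeddings would come from the trained network and the search would use an approximate-nearest-neighbour index, but the logic is the same.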

Siamese networks are popular in face-verification systems (https://github.com/timesler/facenet-pytorch) and object tracking (https://github.com/huanglianghua/siamfc-pytorch). The idea is to generate a multidimensional feature vector for each item and train the network so that feature vectors of similar objects have a small Euclidean distance, while those of different objects are far apart. One of the most popular losses for this is the triplet loss (https://en.wikipedia.org/wiki/Triplet_loss), and that is what I used.
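For reference, the triplet loss on embedding vectors can be written in a few lines of NumPy. This is a generic sketch of the standard formula, not the training code from the repo; the margin value is an arbitrary assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss: pull the positive embedding toward the anchor and
    push the negative away until it is at least `margin` farther."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to similar building
    d_neg = np.linalg.norm(anchor - negative)  # distance to different building
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-D embeddings: the positive is close, the negative is far
a = np.array([0.0, 0.0, 0.0])
p = np.array([0.1, 0.0, 0.0])
n = np.array([3.0, 0.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 — this triplet is already well separated
```

A deep-learning framework such as PyTorch provides the same thing with gradients (`torch.nn.TripletMarginLoss`), which is what you would actually train with.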

At least, I did not find any similar solutions on GitHub, so it seemed worth a try. If you know of any, please post a link in the comments section.

I chose Belarus because I live there, and if I help someone map or explore my homeland, I benefit from my own work. The whole territory is too large to process by myself, so I covered all the major cities: Minsk, Brest, Grodno, Mogilev, Vitebsk, and Gomel. These cities are also the administrative centers of the country's regions, and more than 40% of the population lives there.

How it was done:

  • Data was downloaded via Overpass Turbo; for floor-count extraction I used the ‘building:levels’ attribute
  • The network was trained with existing OSM building-levels data as ground truth and Mapbox Maxar imagery (whose license does not prohibit use for OSM)
  • Results were generated for each city
  • Some rough mistakes were cleaned up manually (most, but not all)
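The download step above can be sketched as an Overpass QL query wrapped in Python. The bounding-box coordinates below are a rough placeholder for the Minsk area, not the actual extent used in the experiment:

```python
# Sketch of fetching buildings with a building:levels tag via the Overpass API.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def build_query(south, west, north, east):
    """Return an Overpass QL query for building ways tagged with building:levels."""
    bbox = f"{south},{west},{north},{east}"
    return (
        "[out:json][timeout:120];"
        f'way["building"]["building:levels"]({bbox});'
        "out tags center;"
    )

query = build_query(53.84, 27.41, 53.97, 27.70)  # rough Minsk bbox (assumption)
print(query)
# To actually run it: requests.post(OVERPASS_URL, data={"data": query})
```

The same query can be pasted into Overpass Turbo directly; the Python wrapper is only convenient for batch downloads.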

Dataset

The original data was downloaded from OSM via Overpass on 24 September 2021.

GitHub may not be the best place to host a dataset, but unlike my paid Google Drive account, the data will not be wiped out if I stop paying: https://github.com/aliaksandr960/by_osm_siam/tree/main

The fields ‘id’, ‘@id’, ‘building’, and ‘building:levels’ are the original OSM fields; you can merge the data back with OSM by these fields.

‘osm_levels’ is ‘building:levels’ converted to an integer.

‘new_levels’ contains the new level values produced by this experiment.
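Merging the dataset back with OSM data by the shared id field could look like the following minimal sketch, using plain dicts with hypothetical rows instead of a real OSM extract:

```python
def merge_by_id(dataset_rows, osm_rows, key="@id"):
    """Attach 'new_levels' from the dataset to OSM rows sharing the same id."""
    new_levels = {row[key]: row["new_levels"] for row in dataset_rows}
    merged = []
    for row in osm_rows:
        row = dict(row)  # copy, leave the input rows untouched
        if row[key] in new_levels:
            row["new_levels"] = new_levels[row[key]]
        merged.append(row)
    return merged

# Hypothetical rows, not real data from the repository
dataset = [{"@id": "way/1", "osm_levels": 5, "new_levels": 5}]
osm = [{"@id": "way/1", "building": "apartments"},
       {"@id": "way/2", "building": "garage"}]
print(merge_by_id(dataset, osm))
```

With pandas, the equivalent would be a left join on ‘@id’, but the dict version keeps the example dependency-free.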

The Jupyter notebooks I used, with the model architecture and training pipeline, can be found in the ‘notebooks’ folder of the repo.

Numbers

  • in total, more than 156 000 floor numbers added
  • in total, more than 1 700 sq km processed

Minsk

  • 25 258 floor numbers found
  • 39 422 floor numbers added
  • About 400 sq km processed

Brest

  • 2 364 floor numbers found
  • 28 077 floor numbers added
  • About 160 sq km processed

Gomel

  • 4 219 floor numbers found
  • 35 642 floor numbers added
  • About 300 sq km processed

Grodno

  • 13 921 floor numbers found
  • 10 253 floor numbers added
  • About 160 sq km processed

Mogilev

  • 6 354 floor numbers found
  • 40 241 floor numbers added
  • About 300 sq km processed

Vitebsk

  • 56 535 floor numbers found (yes, it is well mapped)
  • 4 044 floor numbers added
  • About 400 sq km processed

Accuracy

Accuracy is hard to calculate, because the closest reference data I could find is for Saint Petersburg (Russia), and in this experiment I decided to finish the Belarusian dataset first.

I manually cleaned some rough mistakes from the dataset, so it is not raw model output. But I did not spend much time on it, so mistakes remain.

The models are experimental, and nearly every city (and sometimes part of a city) was processed with a different setup, but the manual post-processing should make the dataset reasonably consistent.

There are no exact numbers, but I expect the share of rough mistakes (off by more than 2 floors) to be around 20%. That is too much to merge into OSM directly, but it could significantly speed up markup and is sufficient for some statistics.

In the training environment I calculated accuracy, MAE, and MSE. In the ideal scenario, accuracy could reach 0.90, but reality is, of course, not that good: the raw model-output accuracy seems to be somewhere between 0.50 and 0.90, depending on the environment. Still, validating this data is less tedious and faster than producing it from scratch.
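For concreteness, the three metrics mentioned above can be computed like this on predicted versus ground-truth floor counts (the numbers here are toy values, not the real evaluation data):

```python
def floor_metrics(pred, true):
    """Exact-match accuracy, MAE, and MSE for predicted floor counts."""
    n = len(pred)
    accuracy = sum(p == t for p, t in zip(pred, true)) / n
    mae = sum(abs(p - t) for p, t in zip(pred, true)) / n
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / n
    return accuracy, mae, mse

# Toy predictions for five buildings (hypothetical values)
pred = [5, 9, 2, 1, 12]
true = [5, 9, 3, 1, 9]
print(floor_metrics(pred, true))  # (0.6, 0.8, 2.0)
```

Note that MSE punishes the rough mistakes (off by 2+ floors) much harder than MAE, which is why it is useful alongside plain accuracy here.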

The main factors that affect accuracy:

  • errors in the OSM data that I treat as ground truth, since it is not always correct
  • industrial areas: it is nearly impossible to determine the correct floor count of a particular building, because the buildings look like big boxes without windows
  • in the Mapbox Maxar mosaic, many satellite images are stitched into one big tile mosaic, and it is significantly harder to train the network to find similarities between images taken from different angles at different times
  • garages: somehow, the neural network finds them similar to blocks of flats
  • the distributions of levels and building types in the “source” markup do not match the “target” distributions well

Conclusion

  • training an AI for floor-number prediction using only OSM and Maxar data within several weekends is possible
  • it is not always accurate, but validation is easier than creating the markup from scratch
  • Siamese networks can help with this task, even though I found nothing popular on GitHub
  • there is big room for improvement

Today is 29 December, and I am glad to finish one stage of my experimentation this year rather than carry it unfinished into 2022 :-)

Good luck!
