GeoGrapher tutorial - creating a ML dataset

This tutorial shows how to use geographer to create a remote sensing dataset from a dataset of vector data and large raster tiles. If you are reading the html version in the documentation and would prefer the actual ipynb file you can find it here. We will use the example dataset of stadiums created in the tutorial_nb_basics.ipynb notebook. More advanced examples of the very general cutting capabilities that GeoGrapher provides can be found in the documentation and in a soon forthcoming notebook.

Contents:

  1. Basic cutting

  2. Making segmentation labels

  3. Updating the new dataset

  4. Creating a train/val split

1. Basic cutting

First, we import geographer, as well as some other imports we will need.

[ ]:
# !pip install geographer matplotlib
[20]:
import geographer as gg
import shutil
from pathlib import Path
import rioxarray
from matplotlib import pyplot as plt

First, we load the connector for the dataset:

[21]:
from geographer import Connector

SOURCE_DATA_DIR = Path("gg_example_dataset")

connector = Connector.from_data_dir(SOURCE_DATA_DIR)

We now show how to do basic dataset cutting on our example dataset. There are three reasons we want to cut the datset:

  • The Sentinel-2 raster tiles in this dataset are very large. Each tile is 10980 by 10980 pixels or (given that the resolution is 10m) around 100km by 100km. Sentinel-2 tiles also can have a lot of channels. The resulting files are too large to load into a GPU to do deep learning. To be able to comfortably work with them, we need to create a dataset of smaller rasters.

  • We do not know if the large tiles contain other stadiums and we want to avoid false negatives in the labels (i.e. stadiums that have not been labeled as such and are counted as belonging to the background). False negatives in the labels would confuse ML models and make it harder for them to learn.

  • Since almost all of the pixels in a tile belong to the background class the foreground/backgroun class imbalance will be extreme, which will make it hard for ML models to learn. Even if we would like our ML model to eventually be able to identify stadiums on real life data with a high class imbalance in production, we want to train on a dataset with a more manageable class imbalance.

GeoGrapher’s provides two very general templates for cutting that provide a very large amount of flexibility for how datasets can be cut: iterating over vector features and iterating over rasters. To simplify the complexity of using these for basic use cases, GeoGrapher provides two functions that should provide sensible default behavior for a wide variety of applications. One returns a cutter that creates cutouts around vector features, and one that cuts every raster in the source dataset to a grid of rasters. In this example we want to use the former. We have two rasters per vector feature in our source dataset, and let us say that we want to create two cutouts (from separate rasters) for each vector feature.

Let us start with the imports:

[22]:
from geographer.cutters import get_cutter_rasters_around_every_vector

We use this function to get a dataset cutter.

[23]:
IMG_SIZE = 512
TARGET_DATA_DIR = Path(f"gg_example_dataset_cutouts{IMG_SIZE}")

cutter = get_cutter_rasters_around_every_vector(
    source_data_dir=SOURCE_DATA_DIR,
    target_data_dir=TARGET_DATA_DIR,
    name="Our first cutter",  # for saving, see below
    new_raster_size=IMG_SIZE,  # in pixels
    target_raster_count=2,  # we want to create two different raster cutouts per vector feature
    bands={"rasters": [1, 2, 3]},  # we only want the RGB (or BGR for Sentinel-2) bands
    mode="random",  # defines the way cutouts are chosen. this is the default value, but you can choose other modes.
)

Now we can cut the dataset (i.e. create a new dataset in TARGET_DATA_DIR composed of cutouts from the dataset in SOURCE_DATA_DIR) using the cutter’s cut method. The return value is the connector of the newly created target dataset.

[24]:
target_connector = cutter.cut()
Cutting dataset: 100%|██████████| 9/9 [00:02<00:00,  3.94it/s]

Let’s take a look at the rasters in the new dataset …

[25]:
target_connector.rasters
[25]:
geometry raster_processed? timestamp orig_crs_epsg_code
raster_name
S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854_Munich Olympiastadion.tif POLYGON ((11.58895 48.14840, 11.58895 48.19594... True 2022-08-04-10:15:59 32632
S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810_Munich Olympiastadion.tif POLYGON ((11.56448 48.17054, 11.56448 48.21807... True 2022-06-27-10:06:11 32632
S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810_Munich Staedtisches Stadion Dantestr.tif POLYGON ((11.56403 48.12907, 11.56403 48.17660... True 2022-06-27-10:06:11 32632
S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632_Vasil Levski National Stadium.tif POLYGON ((23.37083 42.68422, 23.37083 42.73156... True 2022-04-13-09:20:31 32634
S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859_Vasil Levski National Stadium.tif POLYGON ((23.35795 42.66350, 23.35795 42.71084... True 2022-07-22-09:20:41 32634
S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632_Bulgarian Army Stadium.tif POLYGON ((23.40157 42.67061, 23.40157 42.71797... True 2022-04-13-09:20:31 32634
S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315_Jingu Baseball Stadium.tif POLYGON ((139.76736 35.65410, 139.76736 35.700... True 2022-04-12-01:27:01 32654
S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318_Jingu Baseball Stadium.tif POLYGON ((139.75258 35.66693, 139.75258 35.713... True 2022-07-01-01:27:11 32654

… and at the vectors:

[26]:
target_connector.vectors
[26]:
geometry raster_count location download_exception type
vector_name
Munich Olympiastadion POLYGON Z ((11.54677 48.17472 0.00000, 11.5446... 3 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football
Munich Track and Field Stadium1 POLYGON Z ((11.54382 48.17279 0.00000, 11.5438... 3 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football
Munich Olympia Track and Field2 POLYGON Z ((11.54686 48.17892 0.00000, 11.5468... 2 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football
Munich Staedtisches Stadion Dantestr POLYGON Z ((11.52913 48.16874 0.00000, 11.5291... 2 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football
Vasil Levski National Stadium POLYGON Z ((23.33410 42.68813 0.00000, 23.3340... 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football
Bulgarian Army Stadium POLYGON Z ((23.34065 42.68492 0.00000, 23.3406... 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football
Arena Sofia POLYGON Z ((23.34018 42.68318 0.00000, 23.3401... 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football
Jingu Baseball Stadium POLYGON Z ((139.71597 35.67490 0.00000, 139.71... 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... baseball
Japan National Stadium POLYGON Z ((139.71482 35.67644 0.00000, 139.71... 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... football

Note: We see in the raster_count column that most stadiums are contained in exactly two rasters, but (depending on the rasters in your dataset) some might be contained in three rasters. Choosing how to create cutouts to reach the targeted raster_count is a combinatorially hard problem and this cutter uses a greedy heuristic to select images to approximatively achieve the targeted raster_count which in practice works well (one can also experiment with different random seeds, which can help). What can happen is that at some stage in the iteration the target_raster_count for some vector features V in a cluster C has been reached, but when creating cutouts for the remaining rasters C-V in the cluster the cutouts contain vector features from V. The raster_count may also be less than target_raster_count if there are not enough rasters in the source source dataset to reach the targeted raster_count (e.g. in this example if we had set target_raster_count to 10).

If instead of creating cutouts around the vector stadiums we want to just cut each raster in the source directory to a grid of rasters, we can use the get_cutter_every_raster_to_grid function from geographer.cutters.

The stadiums have now been added to the connector’s vectors GeoDataFrame:

2. Making segmentation labels

To make segmentation labels, we use LabelMakers. The particular LabelMaker we want to use is SegLabelMakerCategorical. This is the reason we set the task_vector_classes argument in the tutorial notebook on GeoGrapher basics in which we created the dataset. The label maker uses the connector.task_vector_classes attribute for a safety check.

[27]:
from geographer.label_makers import SegLabelMakerCategorical

label_maker = SegLabelMakerCategorical()
label_maker.make_labels(connector=target_connector)
Making labels: 100%|██████████| 8/8 [00:00<00:00, 34.09it/s]

If we had made labels for our source dataset in SOURCE_DATA_DIR then because the dataset cutters will also create cutouts for the labels we would not have had to create labels for the new dataset in TARGET_DATA_DIR again.

Let’s take a look at a raster and it’s segmentation mask:

[28]:
raster_paths = target_connector.rasters_containing_vector(
    "Munich Olympiastadion", mode="paths"
)
raster_path = raster_paths[0]
label_path = target_connector.labels_dir / raster_path.name

raster = (
    rioxarray.open_rasterio(raster_path).sel(band=[1, 2, 3]).values.transpose(1, 2, 0)
    / 65535
)
label = rioxarray.open_rasterio(label_path).values.transpose(1, 2, 0) / 255

fig, ax = plt.subplots(1, 2, figsize=(13, 13))

ax[0].imshow(raster)
ax[0].axis("off")

ax[1].imshow(label, cmap="Blues")
ax[1].axis("off")
[28]:
(-0.5, 511.5, 511.5, -0.5)
_images/tutorial_nb_cut_label_cluster_19_1.png

3. Updating the new dataset

If the source dataset grows to contain more vector features or rasters we can update the dataset of cutouts. Let’s simulate this situation by copying the source dataset, deleting some vector features and rasters, cutting the dataset as above, and adding them back again to the source dataset. Then we will see how easy it is to update the dataset of cutouts.

Here we copy the source dataset to TRUNCATED_SOURCE_DATA_DIR:

[29]:
TRUNCATED_SOURCE_DATA_DIR = SOURCE_DATA_DIR.parent / (
    SOURCE_DATA_DIR.name + "_truncated"
)

shutil.copytree(SOURCE_DATA_DIR, TRUNCATED_SOURCE_DATA_DIR)
[29]:
PosixPath('gg_example_dataset_truncated')

Let’s delete some vector features and rasters from the truncated source dataset. We use the drop_vectors and drop_rasters methods. The label_maker arguments are optional. If given they will in the case of drop_vectors update labels for rasters from which vector features have been removed and in the case of drop_rasters delete the segmentation labels corresponding to the rasters.

[30]:
truncated_source_connector = Connector.from_data_dir(TRUNCATED_SOURCE_DATA_DIR)
vectors_to_drop = ["Munich Olympiastadion", "Munich Track and Field Stadium1"]
truncated_source_connector.drop_vectors(vectors_to_drop, label_maker=label_maker)
candidate_rasters_to_drop = truncated_source_connector.rasters_containing_vector(
    "Munich Olympia Track and Field2"
)
raster_to_drop = candidate_rasters_to_drop[0]
truncated_source_connector.drop_rasters(
    raster_to_drop,
    remove_rasters_from_disk=True,  # default value
    label_maker=label_maker,
)
Deleting labels: 100%|██████████| 1/1 [00:00<00:00, 7825.19it/s]

Make sure to save the changes:

[31]:
truncated_source_connector.save()

Let’s make sure vectors and rasters look as they should. Notice how the raster_count column in vectors is consistent with the removals we’ve made.

[32]:
truncated_source_connector.vectors
[32]:
raster_count location download_exception type geometry
vector_name
Munich Olympia Track and Field2 1 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.54686 48.17892 0.00000, 11.5468...
Munich Staedtisches Stadion Dantestr 1 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.52913 48.16874 0.00000, 11.5291...
Vasil Levski National Stadium 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.33410 42.68813 0.00000, 23.3340...
Bulgarian Army Stadium 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.34065 42.68492 0.00000, 23.3406...
Arena Sofia 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.34018 42.68318 0.00000, 23.3401...
Jingu Baseball Stadium 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... baseball POLYGON Z ((139.71597 35.67490 0.00000, 139.71...
Japan National Stadium 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((139.71482 35.67644 0.00000, 139.71...
[33]:
truncated_source_connector.rasters
[33]:
raster_processed? timestamp orig_crs_epsg_code geometry
raster_name
S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.tif True 2022-07-22-09:20:41 32634 POLYGON ((23.54663 42.33578, 23.58754 43.32358...
S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632.tif True 2022-04-13-09:20:31 32634 POLYGON ((23.54663 42.33578, 23.58754 43.32358...
S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.tif True 2022-08-04-10:15:59 32632 POLYGON ((11.79809 47.73104, 11.85244 48.71769...
S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315.tif True 2022-04-12-01:27:01 32654 POLYGON ((140.00972 35.15084, 139.99743 36.140...
S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318.tif True 2022-07-01-01:27:11 32654 POLYGON ((140.00972 35.15084, 139.99743 36.140...

Now, we can cut the truncated dataset and create labels for the cutouts as before:

[34]:
# cutting

IMG_SIZE = 512
TARGET_DATA_DIR2 = Path(f"gg_example_dataset_truncated_cutouts{IMG_SIZE}")
TRUNCATED_CUTTER_NAME = "truncated_cutter"

cutter = get_cutter_rasters_around_every_vector(
    source_data_dir=SOURCE_DATA_DIR,
    target_data_dir=TARGET_DATA_DIR2,
    name=TRUNCATED_CUTTER_NAME,  # for saving, see below
    new_raster_size=IMG_SIZE,  # in pixels
    target_raster_count=2,  # we want to create two different raster cutouts per vector feature
    bands={"rasters": [1, 2, 3]},  # we only want the RGB (or BGR for Sentinel-2) bands
    mode="random",  # defines the way cutouts are chosen. this is the default value, but you can choose other modes.
)
truncated_target_connector = cutter.cut()
Cutting dataset: 100%|██████████| 9/9 [00:02<00:00,  3.21it/s]
[35]:
# making labels

label_maker.make_labels(connector=truncated_target_connector)
Making labels: 100%|██████████| 8/8 [00:00<00:00, 36.00it/s]

Now, let’s add the dropped data back:

[36]:
source_connector = Connector.from_data_dir(SOURCE_DATA_DIR)

# copy over deleted raster
shutil.copy(
    source_connector.rasters_dir / raster_to_drop,
    truncated_source_connector.rasters_dir / raster_to_drop,
)

# add entries to rasters and vectors GeoDataFrames
truncated_source_connector.add_to_rasters(
    source_connector.rasters.loc[[raster_to_drop]]
)
truncated_source_connector.add_to_vectors(source_connector.vectors.loc[vectors_to_drop])

# save
truncated_source_connector.save()

Now, update the truncated target dataset. The cutter returned by get_cutter_rasters_around_every_vector is a DSCutterIterOverVectors. To load the saved cutter (we don’t actually have to do that here, since the cutter variable is still bound, but in a normal workflow we would), we run:

[37]:
from geographer.cutters import DSCutterIterOverVectors

cutter_json_path = TARGET_DATA_DIR2 / "connector" / f"{TRUNCATED_CUTTER_NAME}.json"
cutter = DSCutterIterOverVectors.from_json_file(cutter_json_path)
[38]:
cutter.update()
Cutting dataset: : 0it [00:00, ?it/s]
/home/rustam/dida/GeoGrapher/geographer/utils/utils.py:177: FutureWarning: Behavior when concatenating bool-dtype and numeric-dtype arrays is deprecated; in a future version these will cast to object dtype (instead of coercing bools to numeric values). To retain the old behavior, explicitly cast bool-dtype arrays to numeric dtype.
  concatenated_gdf = GeoDataFrame(pd.concat(objs, **kwargs), crs=objs[0].crs)
[38]:
<geographer.connector.Connector at 0x7fcf5ed644f0>

Now, we can check that the truncated target dataset has been updated. Note that we have to reload the truncated_source_connector.

[39]:
truncated_target_connector = Connector.from_data_dir(TARGET_DATA_DIR2)
truncated_target_connector.vectors
[39]:
raster_count location download_exception type geometry
vector_name
Munich Olympiastadion 3 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.54677 48.17472 0.00000, 11.5446...
Munich Track and Field Stadium1 3 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.54382 48.17279 0.00000, 11.5438...
Munich Olympia Track and Field2 2 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.54686 48.17892 0.00000, 11.5468...
Munich Staedtisches Stadion Dantestr 2 Munich, Germany NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((11.52913 48.16874 0.00000, 11.5291...
Vasil Levski National Stadium 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.33410 42.68813 0.00000, 23.3340...
Bulgarian Army Stadium 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.34065 42.68492 0.00000, 23.3406...
Arena Sofia 2 Sofia, Bulgaria NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((23.34018 42.68318 0.00000, 23.3401...
Jingu Baseball Stadium 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... baseball POLYGON Z ((139.71597 35.67490 0.00000, 139.71...
Japan National Stadium 2 Tokyo, Japan NoImgsForVectorFeatureFoundError('No images fo... football POLYGON Z ((139.71482 35.67644 0.00000, 139.71...
[40]:
truncated_target_connector.rasters
[40]:
raster_processed? timestamp orig_crs_epsg_code geometry
raster_name
S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854_Munich Olympiastadion.tif 1 2022-08-04-10:15:59 32632 POLYGON ((11.58895 48.14840, 11.58895 48.19594...
S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810_Munich Olympiastadion.tif 1 2022-06-27-10:06:11 32632 POLYGON ((11.56448 48.17054, 11.56448 48.21807...
S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810_Munich Staedtisches Stadion Dantestr.tif 1 2022-06-27-10:06:11 32632 POLYGON ((11.56403 48.12907, 11.56403 48.17660...
S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632_Vasil Levski National Stadium.tif 1 2022-04-13-09:20:31 32634 POLYGON ((23.37083 42.68422, 23.37083 42.73156...
S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859_Vasil Levski National Stadium.tif 1 2022-07-22-09:20:41 32634 POLYGON ((23.35795 42.66350, 23.35795 42.71084...
S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632_Bulgarian Army Stadium.tif 1 2022-04-13-09:20:31 32634 POLYGON ((23.40157 42.67061, 23.40157 42.71797...
S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315_Jingu Baseball Stadium.tif 1 2022-04-12-01:27:01 32654 POLYGON ((139.76736 35.65410, 139.76736 35.700...
S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318_Jingu Baseball Stadium.tif 1 2022-07-01-01:27:11 32654 POLYGON ((139.75258 35.66693, 139.75258 35.713...

If we want to, we can now check that the truncated target dataset has been updated correctly.

4. Creating a train/val split

To train a ML model it is essential to have a clean train/val split. If you just naively split your rasters into a train and a validation set there might be data leakage. Some vector features might intersect several rasters. Also, rasters can overlap and there might be vector features in the overlaps. GeoGrapher provides a function get_raster_clusters that returns the minimal clusters of rasters such that if all rasters in each cluster are assigned to the same dataset (training or validation) the resulting split has no data leakage.

[41]:
from __future__ import annotations

from geographer.utils.cluster_rasters import get_raster_clusters

clusters: list[set[str]] = get_raster_clusters(
    connector=connector,
    clusters_defined_by="rasters_that_share_vectors",
    preclustering_method="y then x-axis",
)
clusters
[41]:
[{'S2A_MSIL2A_20220627T100611_N0400_R022_T32UPU_20220627T162810.tif',
  'S2B_MSIL2A_20220804T101559_N0400_R065_T32UPU_20220804T130854.tif'},
 {'S2A_MSIL2A_20220413T092031_N0400_R093_T34TFN_20220413T123632.tif',
  'S2A_MSIL2A_20220722T092041_N0400_R093_T34TFN_20220722T134859.tif'},
 {'S2A_MSIL2A_20220412T012701_N0400_R074_T54SUE_20220412T042315.tif',
  'S2A_MSIL2A_20220701T012711_N0400_R074_T54SUE_20220701T043318.tif'}]

As one might expect the clusters correspond to the different locations in our dataset: Munich, Tokyo, Sofia.