HealSparseMap File Specification v1.8.0

A HealSparseMap file can be either a standard FITS file or a Parquet dataset. Each provides a way to record the data from a HealSparseMap object for efficient retrieval of individual coverage pixels. This document describes the memory layout of the key components of the HealSparseMap objects, as well as the specific file formats in FITS (HealSparseMap FITS Serialization) and Parquet (HealSparseMap Parquet Serialization).

HealSparseMap Memory Layout

A HealSparseMap object has two primary components, the coverage map and the sparse map.

Terminology

This is a list of terminology used in a HealSparseMap object:

  • nside_sparse: The HEALPix nside for the fine-grained (sparse) map

  • nside_coverage: The HEALPix nside for the coverage map

  • bit_shift: The number of bits to shift to convert from nside_sparse to nside_coverage in the HEALPix NEST scheme. bit_shift = 2*log_2(nside_sparse/nside_coverage).

  • valid_pixels: The list of pixels with defined values (> sentinel) in the sparse map.

  • sentinel: The sentinel value that notes if a pixel is not a valid pixel. Default is hpgeom.UNSEEN for floating-point maps, -MAXINT for integer maps, and 0 for wide mask maps.

  • nfine_per_cov: The number of fine (sparse) pixels per coverage pixel. nfine_per_cov = 2**bit_shift.

  • wide_mask_width: The width of a wide mask, in bytes.
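The relationship between these quantities can be checked with a few lines of Python. This is a minimal sketch; the function name sparse_geometry and the example nside values are illustrative assumptions, not part of the specification.

```python
import math


def sparse_geometry(nside_sparse, nside_coverage):
    """Return (bit_shift, nfine_per_cov) for a pair of HEALPix nside values."""
    # Both nside values are powers of two, so their ratio is an exact power of two.
    ratio = nside_sparse // nside_coverage
    bit_shift = 2 * int(round(math.log2(ratio)))
    nfine_per_cov = 2 ** bit_shift
    return bit_shift, nfine_per_cov


# Example values chosen for illustration only.
bit_shift, nfine_per_cov = sparse_geometry(nside_sparse=4096, nside_coverage=32)
```

Each coverage pixel contains (nside_sparse/nside_coverage)**2 fine pixels, which is why nfine_per_cov is 2**bit_shift.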

Pixel Lookups

The HealSparseMap file format is derived from the method of encoding fast look-ups of arbitrary pixel values using the HEALPix nest pixel scheme.

Given a nest-encoded sparse (high resolution) pixel value, pix_nest, we can compute the coverage (low resolution) pixel value with a simple bit shift operation: ipnest_cov = right_shift(pix_nest, bit_shift), where bit_shift is defined in Terminology.

The sparse map itself is stored in blocks of data, each of which includes nfine_per_cov contiguous pixels for each coverage pixel that contains valid data. To find the location within a given coverage block, we need to subtract the first fine (sparse) nest pixel value for the given coverage pixel, which is first_pixel = nfine_per_cov*ipnest_cov.

Next, if we have a map of offsets, cov_map_raw_offset, which points to the location of the proper sparse map block, the look-up index is index = pix_nest - nfine_per_cov*ipnest_cov + cov_map_raw_offset[ipnest_cov]. In practice, we can combine the final two terms. The look-up index is then index = pix_nest + cov_map[ipnest_cov], where cov_map[ipnest_cov] = cov_map_raw_offset[ipnest_cov] - nfine_per_cov*ipnest_cov.

As described in Sparse Map, the first block in the sparse map is special, and is always filled with sentinel values. All cov_map indices for coverage pixels outside the coverage map point to this sentinel block.
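The look-up described above can be sketched with small arrays. This is an illustration under stated assumptions, not the library implementation: the tiny nside values, the sentinel of -1.0, and the helper name lookup are all hypothetical.

```python
import numpy as np

# Hypothetical, deliberately tiny geometry.
nside_coverage, nside_sparse = 1, 4
bit_shift = 2 * int(np.log2(nside_sparse // nside_coverage))
nfine_per_cov = 2 ** bit_shift
sentinel = -1.0  # illustrative sentinel, not hpgeom.UNSEEN

ncov = 12 * nside_coverage ** 2
# Empty map: every coverage pixel points at the sentinel block (block 0).
cov_map = -np.arange(ncov, dtype=np.int64) * nfine_per_cov
sparse_map = np.full(nfine_per_cov, sentinel)

# Append one data block for coverage pixel 5; it becomes block 1.
cov_pix = 5
sparse_map = np.concatenate([sparse_map, np.arange(nfine_per_cov, dtype=np.float64)])
cov_map_raw_offset = 1 * nfine_per_cov
cov_map[cov_pix] = cov_map_raw_offset - nfine_per_cov * cov_pix


def lookup(pix_nest):
    """Return sparse map values for nest-ordered fine pixel numbers."""
    pix_nest = np.asarray(pix_nest, dtype=np.int64)
    ipnest_cov = np.right_shift(pix_nest, bit_shift)
    return sparse_map[pix_nest + cov_map[ipnest_cov]]
```

Fine pixels inside coverage pixel 5 resolve to the appended block, while any pixel in an uncovered coverage pixel lands in the sentinel block without a separate bounds check.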

Coverage Map

The coverage map encodes the mapping from raw pixel number to location within the sparse map. It is an integer (numpy.int64) map with 12*nside_coverage*nside_coverage values, all of which are filled.

As described in Pixel Lookups, the coverage map contains indices that are offset pointers to the block in the sparse map with associated values for that coverage pixel. An empty HealSparseMap is initialized with the following coverage pixel values, which all point to the first (sentinel) block in the sparse map.

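import numpy as np
import hpgeom as hpg

# cov_map, nside_coverage and nfine_per_cov are assumed to be defined already.
# With these values, index = pix_nest + cov_map[ipnest_cov] always falls in
# [0, nfine_per_cov), i.e. inside the sentinel block.
cov_map[:] = -1*np.arange(hpg.nside_to_npixel(nside_coverage), dtype=np.int64)*nfine_per_cov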

Sparse Map

The sparse map contains the map data, split into blocks, each of which is nfine_per_cov elements long. The first block is special, and is always filled with sentinel values.

The following datatypes are allowed:

  • 1-byte unsigned integer (numpy.uint8)

  • 1-byte signed integer (numpy.int8)

  • 2-byte unsigned integer (numpy.uint16)

  • 2-byte signed integer (numpy.int16)

  • 4-byte unsigned integer (numpy.uint32)

  • 4-byte signed integer (numpy.int32)

  • 8-byte signed integer (numpy.int64)

  • 4-byte floating point (numpy.float32)

  • 8-byte floating point (numpy.float64)

  • Numpy record array of numeric types that can be serialized with FITS

  • The WIDE_MASK special encoding

Sparse Map Image

If the sparse map is a single datatype, it is held in memory as a one-dimensional image array. The first block of nfine_per_cov values is set to sentinel. Each additional block of nfine_per_cov values is associated with a single element in the coverage map. These blocks may be in any arbitrary order, allowing for easy appending of new coverage pixels. All invalid pixels must be set to sentinel.
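Because blocks may appear in any order, recovering valid_pixels means inverting the look-up relation index = pix_nest + cov_map[ipnest_cov]. The following is a sketch of one way to do that, assuming a sentinel of -1.0 and a hand-built map with two data blocks; it is not presented as the library's own method.

```python
import numpy as np

nfine_per_cov = 16
sentinel = -1.0  # illustrative sentinel
ncov = 12        # nside_coverage = 1

# Sentinel block, then coverage pixel 7, then coverage pixel 2 (arbitrary order).
cov_map = -np.arange(ncov, dtype=np.int64) * nfine_per_cov
sparse_map = np.full(3 * nfine_per_cov, sentinel)
cov_map[7] = 1 * nfine_per_cov - 7 * nfine_per_cov
cov_map[2] = 2 * nfine_per_cov - 2 * nfine_per_cov
sparse_map[16 + 4] = 10.0  # fine pixel 7*16 + 4 = 116
sparse_map[32 + 9] = 20.0  # fine pixel 2*16 + 9 = 41

# Undo the combined offset to get the raw block offset of each coverage pixel.
raw_offset = cov_map + np.arange(ncov, dtype=np.int64) * nfine_per_cov
covered = np.flatnonzero(raw_offset > 0)

# Map each block number back to the coverage pixel that owns it.
block_to_cov = np.zeros(sparse_map.size // nfine_per_cov, dtype=np.int64)
block_to_cov[raw_offset[covered] // nfine_per_cov] = covered

# Valid entries are those above the sentinel; invert index = pix + cov_map[cov].
valid_index = np.flatnonzero(sparse_map > sentinel)
valid_cov = block_to_cov[valid_index // nfine_per_cov]
valid_pixels = np.sort(valid_index - cov_map[valid_cov])
```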

Sparse Map Wide Mask

If the sparse map is a wide mask map, the sparse map is held in memory as a wide_mask_width * npix array. The sentinel value for wide masks must be 0, and all invalid pixels must be set to 0.

Sparse Map Table

If the sparse map is a numpy record array type, it is held in memory as a one-dimensional table array. In the first block of nfine_per_cov values, the primary field must be set to sentinel. As with the sparse map image, each additional block of nfine_per_cov values is associated with a single element in the coverage map. These blocks may be in any arbitrary order, allowing for easy appending of new coverage pixels. All invalid pixels must have the primary field set to sentinel.
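As a rough illustration of the table layout, the sketch below builds blocks of a numpy structured array whose primary field is initialized to the sentinel. The field names flux and nexp, the helper new_block, and the numeric value used for UNSEEN (assumed here to be the conventional HEALPix value of -1.6375e30) are assumptions made for the example.

```python
import numpy as np

UNSEEN = -1.6375e30  # assumed value of hpgeom.UNSEEN
nfine_per_cov = 4
dtype = np.dtype([("flux", "f8"), ("nexp", "i4")])  # hypothetical fields
primary = "flux"


def new_block():
    """Return one block in which every pixel is marked invalid."""
    block = np.zeros(nfine_per_cov, dtype=dtype)
    block[primary] = UNSEEN  # only the primary field defines validity
    return block


# Sentinel block followed by one data block.
sparse_map = np.concatenate([new_block(), new_block()])
sparse_map[5] = (12.5, 3)

# A pixel is valid when its primary field is above the sentinel.
valid = sparse_map[primary] > UNSEEN
```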

Sparse Map Bit-Packed Mask

If the sparse map is a bit-packed mask, the sparse map is held in memory as an array of numpy.uint8, bit-packed with least significant bit (LSB) ordering. It is this array of numpy.uint8 that is serialized, with an additional flag in the header. The sentinel value for sparse bit-packed masks must be False.
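LSB-ordered bit packing can be illustrated with numpy.packbits and numpy.unpackbits using bitorder="little". This is a sketch of the encoding only; whether the library uses these exact calls internally is an assumption.

```python
import numpy as np

# A boolean mask of 16 pixels with three pixels set.
mask = np.zeros(16, dtype=bool)
mask[[0, 3, 9]] = True

# Pack eight pixels per byte; the first pixel of each group goes to bit 0.
packed = np.packbits(mask, bitorder="little")

# Round trip back to one boolean per pixel.
unpacked = np.unpackbits(packed, bitorder="little").astype(bool)
```

Here pixels 0 and 3 set bits 0 and 3 of the first byte (value 9), and pixel 9 sets bit 1 of the second byte (value 2).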

HealSparseMap FITS Serialization

A HealSparseMap FITS file is a standard FITS file with two extensions. The primary (zeroth) extension is an integer image that describes the coverage map, and the first extension is an image or binary table that describes the sparse map. This section describes the file format specification of these two extensions in the FITS file.

Coverage Map

Coverage Map Header

The coverage map header must contain the following keywords:

  • EXTNAME must be "COV"

  • PIXTYPE must be "HEALSPARSE"

  • NSIDE is equal to nside_coverage

Coverage Map Image

The FITS coverage map is a direct serialization of the coverage map image in-memory layout described in Coverage Map.

Sparse Map Header

The sparse map header must contain:

  • EXTNAME must be "SPARSE"

  • PIXTYPE must be "HEALSPARSE"

  • SENTINEL is equal to sentinel

  • NSIDE is equal to nside_sparse

If the sparse map is a numpy record array, it must contain:

  • PRIMARY is equal to the name of the “primary” field which defines the valid pixels.

If the sparse map is a wide mask, it must contain:

  • WIDEMASK must be True

  • WWIDTH must be the width (in bytes) of the wide mask.

If the sparse map is a bit-packed mask it must contain:

  • BITPACK must be True
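The header requirements above can be expressed as a small checker. This is a sketch that works on a plain dict of keywords rather than a real FITS header object; the function name check_sparse_header is hypothetical and is not part of the healsparse API.

```python
def check_sparse_header(header):
    """Return a list of problems found in a sparse-map header (a plain dict)."""
    problems = []
    if header.get("EXTNAME") != "SPARSE":
        problems.append("EXTNAME must be 'SPARSE'")
    if header.get("PIXTYPE") != "HEALSPARSE":
        problems.append("PIXTYPE must be 'HEALSPARSE'")
    for key in ("SENTINEL", "NSIDE"):
        if key not in header:
            problems.append(f"{key} is missing")
    # WWIDTH is only required for wide mask maps.
    if header.get("WIDEMASK") and "WWIDTH" not in header:
        problems.append("WWIDTH is required when WIDEMASK is True")
    return problems


# Illustrative headers: one complete wide-mask header and one with three problems.
good = {"EXTNAME": "SPARSE", "PIXTYPE": "HEALSPARSE", "SENTINEL": 0,
        "NSIDE": 4096, "WIDEMASK": True, "WWIDTH": 2}
bad = {"EXTNAME": "SPARSE", "PIXTYPE": "HEALPIX", "NSIDE": 4096, "WIDEMASK": True}
```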

Sparse Map Image

If the sparse map is not of a numpy record array type, it is stored as a one dimensional image array. If the image is an integer type with 32 bits or fewer, it may be stored with FITS tile compression, with the tile size set to the block size (nfine_per_cov). If the image is a floating-point image, it may be stored with FITS tile compression, with quantization_level=0 and GZIP_2 (lossless gzip compression), with the tile size set to the block size (nfine_per_cov).

Sparse Map Wide Mask

If the sparse map is a wide mask map, the sparse map is stored as a flattened version of the in-memory wide_mask_width * npix array. This should be flattened on storage, and reshaped on read, using the default numpy memory ordering. The wide mask image may be stored with FITS tile compression, with the tile size set to the block size times the width (wide_mask_width * nfine_per_cov).
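The flatten-and-reshape round trip might look like the following sketch. It assumes the in-memory array has shape (npix, wide_mask_width), so that the bytes of one pixel are adjacent under numpy's default (C) ordering; that axis order is an assumption and is not stated by this specification.

```python
import numpy as np

wide_mask_width = 2
nfine_per_cov = 4
npix = 2 * nfine_per_cov  # sentinel block plus one data block

# Assumed in-memory layout: one row per pixel, one column per mask byte.
in_memory = np.zeros((npix, wide_mask_width), dtype=np.uint8)
in_memory[5, 1] = 4  # set an arbitrary byte value for pixel 5

# Flatten on storage with the default (C) ordering.
stored = in_memory.ravel()

# Reshape on read using the width recorded in the header (WWIDTH).
restored = stored.reshape((stored.size // wide_mask_width, wide_mask_width))
```

With this layout each block occupies wide_mask_width * nfine_per_cov consecutive stored values, which matches the tile size given above.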

Sparse Map Table

If the sparse map is a numpy record array type, it is stored as a one dimensional table array.

Sparse Map Bit-Packed Mask

If the sparse map is a bit-packed mask, the sparse map is stored as an array of unsigned 8-bit integers. This will be used directly as the data buffer backing the bit-packing array structure.

HealSparseMap Parquet Serialization

A HealSparseMap serialized as Parquet is sharded as a Parquet dataset for efficient access to sub-regions of very large area, high resolution maps. The HealSparseMap Parquet dataset consists of a directory with two metadata files; the coverage map; and a list of low resolution “i/o pixel” directories (default nside_io=4). In each i/o pixel directory is a Parquet file with all of the sparse map data from that i/o pixel, divided into Parquet row groups for each coverage pixel. The HealSparseMap Parquet format uses the default snappy per-column compression.

Parquet File Structure

Parquet Metadata

The metadata is stored separately in the _metadata and _common_metadata files in the dataset directory, as per the Parquet dataset specification. The metadata is stored as a set of key-value pairs, each of which is a binary string. For simplicity we describe the strings here as regular strings, but beware that behind the scenes they are stored in the python b'string' format.

The following metadata strings are required:

  • 'healsparse::version': '1'

  • 'healsparse::nside_sparse': str(nside_sparse)

  • 'healsparse::nside_coverage': str(nside_coverage)

  • 'healsparse::nside_io': str(nside_io)

  • 'healsparse::filetype': 'healsparse'

  • 'healsparse::primary': 'primary' or ''

  • 'healsparse::sentinel': str(sentinel) or 'UNSEEN'

  • 'healsparse::widemask': 'True' or 'False'

  • 'healsparse::wwidth': str(wide_mask_width) or '1'

  • 'healsparse::bitpacked': str(is_bit_mask_map)

Note that the string 'UNSEEN' will use the special value hpgeom.UNSEEN to fill empty/overflow pixels.
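The required key-value pairs can be assembled as binary strings along these lines. This is a minimal sketch; the function build_metadata and its argument names are hypothetical, and actually writing the result into the _metadata files is left out.

```python
def build_metadata(nside_sparse, nside_coverage, nside_io=4, primary="",
                   sentinel="UNSEEN", wide_mask_width=None, bitpacked=False):
    """Assemble the required metadata as a dict of binary strings."""
    is_wide = wide_mask_width is not None
    meta = {
        "healsparse::version": "1",
        "healsparse::nside_sparse": str(nside_sparse),
        "healsparse::nside_coverage": str(nside_coverage),
        "healsparse::nside_io": str(nside_io),
        "healsparse::filetype": "healsparse",
        "healsparse::primary": primary,
        "healsparse::sentinel": str(sentinel),
        "healsparse::widemask": str(is_wide),
        "healsparse::wwidth": str(wide_mask_width) if is_wide else "1",
        "healsparse::bitpacked": str(bitpacked),
    }
    # Parquet key-value metadata is stored as bytes.
    return {key.encode(): value.encode() for key, value in meta.items()}


# Illustrative floating-point map with the default sentinel.
metadata = build_metadata(nside_sparse=4096, nside_coverage=32)
```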

Additional metadata from the map is stored as a FITS header string (for compatibility with the FITS serialization) such that:

  • 'healsparse::header': header_string

Parquet Coverage Map

The coverage map is a Parquet file with the name _coverage.parquet, stored in the dataset directory. The coverage map has two columns:

  • cov_pix: Valid coverage pixels (nside = nside_coverage) for the sparse map.

  • row_group: The row group index within the appropriate i/o pixel file to find the sparse data for the given coverage pixel.

Parquet Map Files

Each sparse map Parquet file covers one i/o pixel. The name of each file is iopix=###/###.parquet, where ### is a zero-padded three digit number for the given i/o pixel. By putting each pixel in its own directory with this naming scheme we allow pyarrow to use the hive partitioning and only touch the files as necessary.

Each file is written as a series of Parquet row groups. Each row group contains all the data for a single coverage pixel, with nfine_per_cov rows per row group. The row group number within the given i/o pixel is recorded in the _coverage.parquet coverage map file for quick access to individual and groups of coverage pixels.
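Resolving a coverage pixel to a file and row group could be sketched as follows. Two assumptions are made here that this specification does not state explicitly: the i/o pixel is taken to be the NEST-scheme parent of the coverage pixel (a bit shift, as in Pixel Lookups), and nside_coverage is at least nside_io. The table contents and the helper name locate are hypothetical.

```python
import math

nside_coverage, nside_io = 32, 4

# Hypothetical contents of _coverage.parquet, as two parallel columns.
coverage = {"cov_pix": [64, 65, 130, 1000], "row_group": [0, 1, 0, 0]}


def locate(cov_pix):
    """Return (relative file path, row group) for one covered coverage pixel."""
    # Assumed NEST parent relation between coverage pixels and i/o pixels.
    shift = 2 * int(round(math.log2(nside_coverage // nside_io)))
    iopix = cov_pix >> shift
    row = coverage["cov_pix"].index(cov_pix)  # raises ValueError if not covered
    return f"iopix={iopix:03d}/{iopix:03d}.parquet", coverage["row_group"][row]
```

A reader would then open only the returned file and read the single row group, which holds the nfine_per_cov rows for that coverage pixel.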

The exact format of the data depends on whether the map is a simple image, a wide mask, a record array, or a bit-packed mask.

Sparse Map Image

If the sparse map is not of a numpy record array type, it is stored in a two-column Parquet table. The schema is given by:

  • cov_pix: int32

  • sparse: Datatype of the sparse image data.

The cov_pix gives the coverage pixel of the data, and is redundant with the data in _coverage.parquet. It compresses very efficiently, and can be used to reconstruct the _coverage.parquet from the data files if necessary.

The sparse column has the sparse map data (with sparse map image datatype).

Unlike the FITS serialization, the initial “overflow” coverage pixel is not serialized. Instead, on read this is filled in with the sentinel value from the Parquet metadata.
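Rebuilding the in-memory layout on read might proceed as in this sketch: the sentinel block is created from the metadata value and each row group is appended as a new block. The row groups are mocked here as numpy arrays instead of being read with a Parquet library, and the sentinel of -1.0 is illustrative.

```python
import numpy as np

nside_coverage = 1
nfine_per_cov = 4
sentinel = -1.0  # illustrative value taken from the metadata

# Hypothetical row groups: (cov_pix column, sparse column) per coverage pixel.
row_groups = [
    (np.full(4, 3, dtype=np.int32), np.array([1.0, -1.0, 2.0, -1.0])),
    (np.full(4, 8, dtype=np.int32), np.array([-1.0, -1.0, 5.0, 6.0])),
]

ncov = 12 * nside_coverage ** 2
cov_map = -np.arange(ncov, dtype=np.int64) * nfine_per_cov

# Block 0 is not serialized; it is rebuilt from the sentinel.
blocks = [np.full(nfine_per_cov, sentinel)]
for block_number, (cov_pix, sparse) in enumerate(row_groups, start=1):
    cov = int(cov_pix[0])  # cov_pix is constant within a row group
    cov_map[cov] = block_number * nfine_per_cov - cov * nfine_per_cov
    blocks.append(sparse)
sparse_map = np.concatenate(blocks)
```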

Sparse Map Wide Mask

If the sparse map is a wide mask map, the schema is the same as for a regular sparse map image. In this case, as with the FITS serialization, the sparse map is stored as a flattened version of the in-memory wide_mask_width * npix array. This means that there will be wide_mask_width * nfine_per_cov rows per row group in each wide mask Parquet file.

Sparse Map Table

If the sparse map is a numpy record array type, it is stored as a multi-column Parquet table with the following schema:

  • cov_pix: int32

  • column_1: Datatype of column 1.

  • column_2: Datatype of column 2.

  • Etc.

Unlike the FITS serialization, the initial “overflow” coverage pixel is not serialized. Instead, on read this is filled in with the sentinel value from the Parquet metadata for the primary column. The other columns in the overflow coverage pixel are filled with the default sentinel for that datatype (e.g., hpgeom.UNSEEN for floating-point columns and -MAXINT for integer columns).

Sparse Map Bit-Packed Mask

If the sparse map is a bit-packed mask, the schema is the same as for a regular sparse map image. In this case, as with the FITS serialization, the sparse map is stored as an array of unsigned 8-bit integers which is the in-memory backing of the bit-packed array.