
Introduction to Geometric Data Science
Abstract:
Geometric Data Science builds geographic-style maps on moduli spaces of data objects under important equivalences. The key example is a discrete set of unordered points under isometry, which is any distance-preserving transformation in a metric space. For unordered point clouds, persistent homology is an isometry invariant that turned out to be weaker than previously anticipated (APCT 2024), but it was extended to the complete and Lipschitz continuous Simplexwise Centered Distribution (CVPR 2023). Because the number of permutations of unordered points is exponential, the challenge was to design algorithms whose complexity is polynomial in the input size for a fixed dimension.
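To make the notion of an isometry invariant of unordered points concrete, here is a minimal Python sketch. It uses the sorted list of all pairwise distances, which is a much simpler invariant than the Simplexwise Centered Distribution and is not claimed to be complete. The function names and the sample cloud are illustrative assumptions, not taken from the cited papers.

```python
import math
from itertools import combinations

def sorted_pairwise_distances(points):
    """Return all pairwise Euclidean distances of a cloud in ascending order."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))

def rotate_and_shift(points, angle, shift):
    """Apply a planar isometry: rotation by `angle`, then translation by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]

# Hypothetical sample cloud of four points in the plane.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]

# Move the cloud by an isometry and also reverse the order of its points.
moved = rotate_and_shift(cloud, angle=0.7, shift=(5.0, -2.0))[::-1]

original = sorted_pairwise_distances(cloud)
transformed = sorted_pairwise_distances(moved)

# The sorted distances agree up to floating-point error.
invariant_holds = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, transformed)
)
print(invariant_holds)
```

Sorting removes the dependence on point order, so no permutations need to be enumerated. The cost is quadratic in the number of points, in line with the polynomial-time goal stated above.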
Another practical case is a periodic set of points at atomic centers of crystals, which cannot be reduced to finite clouds because the smallest pattern (a unit cell) of a periodic crystal changes discontinuously under almost any perturbation of atoms. This ambiguity was resolved by the generically complete Pointwise Distance Distribution (NeurIPS 2022), which can be computed in near-linear time (ICML 2023). The new invariants distinguished all periodic crystals in major datasets, including the Cambridge Structural Database CSD (200+ billion pairwise comparisons within one hour on a modest desktop) and Google's GNoME, where thousands of entries (more than 20%) were found to be near-duplicates (IUCrJ 2024). The duplicates found in the CSD are truly isometric but differ by one atom replaced with a different one, which seems physically impossible without perturbing the geometry; these entries are now under investigation by five journals for data integrity.
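The idea behind a pointwise distance distribution can be sketched for a finite cloud as follows. This is a simplified analogue under stated assumptions, not the implementation from the cited papers: for a periodic crystal the neighbors of each motif point must be searched across translated unit cells, which this sketch omits. The function name `pdd_finite`, the rounding parameter `ndigits` and the sample points are hypothetical.

```python
import math
from collections import Counter

def pdd_finite(points, k, ndigits=9):
    """Simplified finite-cloud sketch of a pointwise distance distribution.

    Each point contributes a row of distances to its k nearest neighbors in
    ascending order. Identical rows (after rounding, an assumption made here
    to absorb floating-point noise) are collapsed, and each distinct row is
    weighted by the fraction of points that produce it. Rows are returned in
    lexicographic order, so the result does not depend on the point order.
    """
    rows = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        rows.append(tuple(round(d, ndigits) for d in dists[:k]))
    counts = Counter(rows)
    n = len(points)
    return sorted((row, count / n) for row, count in counts.items())

# Hypothetical example: the four vertices of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
reordered = [square[2], square[0], square[3], square[1]]

pdd_square = pdd_finite(square, k=2)
pdd_reordered = pdd_finite(reordered, k=2)
print(pdd_square)
```

Every vertex of the square has its two nearest neighbors at distance 1, so all four rows collapse into a single row of weight 1, and reordering the points gives the same output.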