The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Standard tools run on individual workstations are impractical for analyzing and synthesizing data sets of this size. However, new algorithmic approaches that effectively exploit the complex memory hierarchies and extremely high levels of parallelism available in state-of-the-art high-performance computing platforms can enable such analysis. We describe pKluster, an open-source tool we have developed for accelerated k-means clustering of geospatiotemporal data, and discuss its use in several ecological applications.
Results/Conclusions
pKluster runs on platforms ranging from laptops to massively parallel systems, allowing it to scale to massive geospatial data sets. It has recently been enhanced with optimizations that boost computational intensity and improve utilization of wide SIMD lanes on state-of-the-art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture. We describe some of these developments in detail and present performance studies that demonstrate their impact and the size of data sets that can be practically analyzed with the tool. We also demonstrate application of the tool in ecological studies, including quantitative delineation of ecoregions, forest cover change detection, and classification of forest canopy structures from LiDAR point clouds, and we speculate on new kinds of analysis of climatic and ecological data sets that these capabilities could enable.