Abstract:
Graph processing is fundamental to web analysis, yet library selection is rarely grounded in empirical comparison on real-world web data. We present a performance comparison of graph-tool and NetworkX on Common Crawl web graph data, focusing on domain-level subgraph analysis. Through systematic benchmarking of seven core operations across thousands of domain subgraphs, we challenge the assumption that C++ libraries with Python bindings always outperform pure Python implementations. Our results reveal operation-dependent performance patterns: NetworkX excels in graph traversal operations (2.5-4.6× faster for connected components, shortest path, and degree distribution) and community detection (7.5× faster), while graph-tool dominates computationally intensive algorithms (35× faster for betweenness centrality, 4× faster for clustering coefficient). Memory usage also differs significantly: NetworkX maintains a consistent 900-970 MB baseline, whereas graph-tool's overhead is operation-dependent and reaches 2.5 GB. The domain-based decomposition methodology enables statistical analysis across diverse website structures, showing that the optimal library choice depends on the specific operation, graph size, and available system resources rather than on blanket performance assumptions.
