When Small Parquet Files Become a Big Problem (and How I Wrote a Compactor) (datobra.com)

The article describes how an AWS/Kafka pipeline writing Parquet microbatches accumulated hundreds of tiny Parquet files in S3, inflating object counts, per-query scan overhead, and storage costs. To fix this, the author built a small PyArrow-based “compactor” that reads the fragmented Parquet inputs and rewrites them into fewer, better-structured files, tuning row-group and file-size settings and processing data in batches to avoid out-of-memory errors in Kubernetes. The result: large numbers of small objects consolidated into a small set of compressed Parquet outputs.

April 07, 2026 15:45 Source: Hacker News