Querying Large Parquet Files with Pandas

27th August 2021 · By Michael A

The Scalability Challenges of Pandas

Many would agree that Pandas is the go-to tool for analysing small to medium-sized data in Python on a single machine. It excels at handling data that fits in memory, but this is also one of its biggest limitations. Attempts to use Pandas to directly query data files with hundreds of millions of rows are typically met with slow performance followed by out-of-memory errors. Even with techniques like chunking, loading and working with subsets of large data is often too slow to be considered interactive.
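To make the chunking technique mentioned above concrete, here is a minimal sketch. It assumes a CSV source, because `pandas.read_csv` accepts a `chunksize` argument while `pandas.read_parquet` does not. The function name `chunked_mean`, the column names, and the sample data are hypothetical and only serve the illustration.

```python
import io

import pandas as pd


def chunked_mean(csv_source, column, chunksize):
    """Compute the mean of one column without loading the whole file at once."""
    total = 0.0
    count = 0
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # each holding at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        total += chunk[column].sum()
        count += chunk[column].count()  # count() skips missing values
    return total / count if count else float("nan")


# Hypothetical in-memory sample standing in for a large file on disk.
sample_csv = io.StringIO(
    "trip_id,fare\n1,10.0\n2,20.0\n3,30.0\n4,40.0\n5,50.0\n"
)
mean_fare = chunked_mean(sample_csv, "fare", chunksize=2)
print(mean_fare)
```

Only one chunk is held in memory at a time, which avoids the out-of-memory errors, but every query still scans the full file. That repeated scan is why chunking alone rarely feels interactive on very large data.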

When working with large data files, data analysts and data scientists may consider more scalable alternatives such as Dask, Koalas, and Vaex to do the heavy lifting. In exchange for solving the scalability issues of Pandas, these libraries often deviate from the...

Continue reading this article on our Open Data Blend Blog.