Querying Large Parquet Files with Pandas

27th August 2021 · By Michael A

The Scalability Challenges of Pandas

Many would agree that Pandas is the go-to tool for analysing small to medium-sized data in Python on a single machine. It excels at handling data that fits in memory, but this is also one of its biggest limitations. Attempts to use Pandas to directly query data files with hundreds of millions of rows are typically met with slow performance followed by out-of-memory errors. Even with techniques like chunking, loading and working with subsets of large data often takes too long to be considered interactive.
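To make the chunking technique mentioned above concrete, here is a minimal sketch that aggregates a file piece by piece so that only one chunk is held in memory at a time. It is an illustration under stated assumptions, not the approach this article goes on to describe: the helper `chunked_group_sum`, the column names, and the tiny in-memory sample are all hypothetical, and CSV is used only because `pandas.read_csv` exposes a `chunksize` parameter.

```python
import io

import pandas as pd


def chunked_group_sum(source, key, value, chunksize=100_000):
    """Sum `value` per `key` while holding only one chunk in memory at a time.

    Hypothetical helper for illustration; `source` is anything
    pandas.read_csv accepts (a path or a file-like object).
    """
    totals = None
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, usecols=[key, value], chunksize=chunksize):
        partial = chunk.groupby(key)[value].sum()
        # Merge this chunk's partial sums into the running totals;
        # fill_value=0 handles keys that appear in only one of the two.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals


# A tiny in-memory stand-in for a large file (made-up data).
sample = io.StringIO("region,sales\nnorth,10\nsouth,5\nnorth,7\nsouth,3\neast,1\n")
result = chunked_group_sum(sample, "region", "sales", chunksize=2)
print(result)
```

Chunking keeps memory use bounded, but every query still scans the whole file, which is why it tends to fall short of interactive speeds on very large data.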

When working with large data files, data analysts and data scientists may consider more scalable alternative libraries such as Dask, Koalas, and Vaex to do the heavy lifting. In exchange for solving the scalability issues of Pandas, these alternative libraries often deviate from the...

Continue reading this article on our Open Data Blend Blog.
