When I started out learning Python as a Data Scientist and Data Engineer, I naturally ran into Pandas. It is the default when beginning to explore the language from a data perspective. I really liked Pandas' straightforward approach, but something always felt slightly off, and I could not place it.
This changed when I first encountered PySpark.
That was back in the days when we ran a Hadoop Cluster on-premise and had Spark installed on top. I began exploring PySpark because we had a ton of data we could not handle with Pandas. While it often required more code to achieve the same goals as Pandas, it felt more natural to me. And over time, I began to understand why.
One of the first things that clicked was how PySpark encourages chaining operations. I love that style: transforming data in a series of clearly separated steps, each returning a new DataFrame. No surprises. No accidental overwriting. No mysterious inplace=True flags that behave inconsistently. In PySpark, every transformation is explicit and composable. The whole pipeline feels clear and easy to follow, especially when revisiting your code months later.
With Pandas, I often found myself juggling side effects and debugging strange behaviors. Something about the way Pandas allows (and sometimes encourages) mutation of data structures in place, combined with its rich but sometimes inconsistent syntax, just didn’t sit well with me.
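To make the kind of side effect I mean concrete, here is a minimal sketch of a typical Pandas trap (a made-up example with inplace=True):

import pandas as pd

df = pd.DataFrame({"value": [10, 20, None, 40]})

# With inplace=True, dropna mutates df and returns None,
# so assigning the result is a classic source of confusion
result = df.dropna(inplace=True)
print(result)  # None
print(df)      # df itself has quietly changed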
I could say that Pandas is just not my style, but I tried to dig a bit deeper.
Pandas and PySpark in Action
To illustrate the difference, let's take the simple operation of adding a column to a DataFrame. We will look at the Pandas way and at the PySpark way.
We create a simple dataset and add a new column by multiplying an existing one by a factor of 2.
Pandas
import pandas as pd
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "value": [10, 20, None, 40]
    }
)
# Create a new column
df['value_doubled'] = df['value'] * 2
print(df)
This modifies df directly. The original version of df no longer exists.
PySpark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create (or reuse) a SparkSession; in many environments it already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10), (2, 20), (3, None), (4, 40)],
    ["id", "value"]
)
# Transformations always return a new DataFrame
df_new = df.withColumn('value_doubled', F.col('value') * 2)
With PySpark, the original df is preserved. You always assign the result of a transformation to a new variable. Even more, PySpark encourages chaining of commands, which I especially like. Say you want to add another column; you can simply adjust the code to this:
df_new = (
    df
    .withColumn('value_doubled', F.col('value') * 2)  # first new column
    .withColumn('department', F.lit('Marketing'))     # second new column
)
The dot syntax allows for clean chaining of commands. This is also possible in Pandas, as the sketch below shows, but not always smart. To find out why, we have to take a little look under the hood.
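First, though, here is roughly what the chained Pandas version could look like (a sketch using assign, which returns a new DataFrame on each call):

df_new = (
    df
    .assign(value_doubled=lambda d: d['value'] * 2)  # first new column
    .assign(department='Marketing')                  # second new column
)

Each assign call hands back a fresh copy of the data, which is a first hint at where the trouble starts.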
The Technical Difference
The core difference between Pandas and PySpark is how they run operations and where the data lives during processing.
Pandas runs eagerly. Every time you write a line like df['value'] * 2, the operation is executed right away. The result is held in memory. If your dataset is small, this is fast and efficient. But as data grows, so does memory usage. And unless you explicitly copy your DataFrame, you're often modifying it in place. That can be risky. You might overwrite something by accident or struggle to track what changed and when.
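A small sketch of what I mean, building on the Pandas DataFrame from above:

df2 = df             # no copy at all: df2 is just another name for the same object
df2['value'] = 0     # this also changes df

df3 = df.copy()      # an explicit copy costs memory, but changes to df3 leave df alone
df3['value'] = 99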
PySpark works the other way around. It uses lazy execution. When you write withColumn or filter, nothing happens immediately. You're not processing data yet. You're building a plan. Spark collects all the steps you've described and waits until you trigger an action. That could be something like .show(), .collect(), or writing the data out. Only then does it start doing the actual work.
This is a big difference. Because Spark waits, it can look at the full plan and find the most efficient way to run it. It can skip steps that don’t matter. It can reorder operations. It can push filters down to the data source so less data needs to be loaded in the first place.
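A quick way to see this, reusing the PySpark df from above, is to build a small pipeline and inspect it before anything runs (a sketch):

# Nothing is executed here: these lines only describe the transformations
pipeline = (
    df
    .withColumn('value_doubled', F.col('value') * 2)
    .filter(F.col('value').isNotNull())
)

# explain() prints the optimized plan without running it;
# an action like show() finally triggers the actual work
pipeline.explain()
pipeline.show()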
Another important point is mutability. Pandas allows you to change data in place. That can save memory, but it also opens the door to side effects. PySpark takes a different approach. Each transformation creates a new DataFrame. The original stays untouched. This leads to cleaner code and fewer surprises, especially in larger projects or shared environments.
So the technical difference isn’t just about memory or performance. It’s also about how they execute. Pandas is immediate and flexible, great for quick work. PySpark is deliberate and structured, built for scale and clarity.
Wrap Up
Technically, Pandas and PySpark are very different, of course. Pandas runs in-memory on a single machine and is great for small to medium-sized datasets. PySpark was designed for distributed computing from the beginning. It can handle massive datasets by spreading the work across a cluster. That adds complexity, but also enables scalability.
Pandas works in memory, which makes it very fast. But this comes with limits, too: if Pandas runs out of memory, the processing fails. One strategy to avoid this is to change objects in place without replicating them. From a memory point of view, this is the smartest thing you can do.
PySpark brings a more functional mindset. You don’t mutate, you transform. And although the syntax can be more verbose, it enforces a discipline that leads to cleaner code structure. That creates a predictable data flow and allows easier debugging. At least for me.
Today, I always use PySpark. Not just because of the scale it offers, but because I genuinely prefer the structure, clarity, and coding style. Even when my data would fit into memory, I find myself reaching for PySpark. It just feels right.