Intermediate data analysis on Panthaion
Once you are comfortable running cells and loading data in your Panthaion notebook, you are ready to go deeper. This guide covers the techniques that turn a basic script into a repeatable, shareable analysis — from reshaping messy datasets to building multi-panel charts and writing cleaner code.
Raw datasets from the Panthaion Ecosystem are often close to analysis-ready, but real work still requires cleaning, reshaping, and combining data before you can draw conclusions.
Step 1: Reshape and clean your data
The most common tasks are renaming columns, handling missing values, filtering rows, and converting data types.
import pandas as pd
df = df.dropna(subset=["temperature_c"]) # drop rows with missing key fields (assign the result; dropna does not modify df in place)
df["humidity_pct"] = df["humidity_pct"].fillna(df["humidity_pct"].mean()) # fill gaps with column mean
df["date"] = pd.to_datetime(df["date"]) # parse string to datetime
Run df.info() and df.describe() at the start of every analysis. These two commands give you a full picture of column types, missing value counts, and basic statistics before you write a single line of cleaning code.
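As a rough end-to-end sketch of this step, the snippet below inspects a small frame and then applies the cleaning operations in one chain. The frame, its raw column names ("Date", "Temp (C)") and the "station" column are hypothetical and exist only for illustration; your own dataset will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical raw readings; column names and values are illustrative only.
raw = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "Temp (C)": [4.5, np.nan, 6.0, 5.5],
    "humidity_pct": [80.0, 82.0, np.nan, 78.0],
    "station": ["A", "A", "B", "B"],
})

# Inspect first: column types and non-null counts, then summary statistics.
raw.info()
print(raw.describe())

clean = (
    raw.rename(columns={"Date": "date", "Temp (C)": "temperature_c"})
       .dropna(subset=["temperature_c"])  # drop rows missing the key field
       .assign(
           date=lambda d: pd.to_datetime(d["date"]),  # string -> datetime
           humidity_pct=lambda d: d["humidity_pct"].fillna(d["humidity_pct"].mean()),
           station=lambda d: d["station"].astype("category"),  # convert data type
       )
)
print(clean.dtypes)
```

Note that the humidity mean here is computed after the incomplete rows are dropped, so the order of the steps affects the fill value.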
Step 2: Filter, group, and aggregate
Filtering and grouping are the core of most climate and environmental analyses. Select only the rows you need, then summarise:
df.groupby("region")["temperature_c"].mean()
df.groupby("region").agg(
    mean_temp=("temperature_c", "mean"),
    max_temp=("temperature_c", "max"),
)
df.set_index("date").resample("M")["temperature_c"].mean() # monthly averages; pandas 2.2+ prefers the alias "ME" over "M"
Chain operations rather than creating a new variable at every step. df.dropna().groupby("region").mean(numeric_only=True) is easier to follow and leaves fewer intermediate variables in your workspace.
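The filter-then-summarise pattern can be sketched as a single chain. The data, the region names, and the mid-January cutoff below are made up for illustration.

```python
import pandas as pd

# Hypothetical daily readings for two regions; values are illustrative only.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-02-05", "2024-02-15"] * 2),
    "region": ["North"] * 4 + ["South"] * 4,
    "temperature_c": [2.0, 4.0, 6.0, None, 12.0, 14.0, 16.0, 18.0],
})

summary = (
    df.dropna(subset=["temperature_c"])               # clean
      .loc[lambda d: d["date"] >= "2024-01-15"]       # filter: keep readings from mid-January on
      .groupby("region")                              # group
      .agg(mean_temp=("temperature_c", "mean"),       # aggregate with named outputs
           max_temp=("temperature_c", "max"))
)
print(summary)
```

Each step in the chain reads top to bottom in the order it runs, and only the final result is kept as a variable.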
Step 3: Merge and join datasets
One of the most powerful features of the Panthaion Ecosystem is the ability to combine datasets from different sources. Once you have two dataframes loaded, merge them on a shared column with DataFrame.merge().
Use how="left" to keep all rows from your primary dataset, how="inner" for only matching rows, and how="outer" to keep everything and investigate gaps. After merging, always check the row count and look for unexpected nulls: duplicate keys silently multiply rows, and this is one of the most common sources of errors in data analysis.
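A minimal sketch of a left merge with the post-merge checks described above. The two frames and the shared key "region_id" are assumptions for illustration; validate and indicator are standard pandas merge parameters.

```python
import pandas as pd

# Hypothetical frames; "region_id" is assumed to be the shared key.
readings = pd.DataFrame({
    "region_id": [1, 1, 2, 3],
    "temperature_c": [4.0, 5.0, 12.0, 20.0],
})
regions = pd.DataFrame({
    "region_id": [1, 2, 4],
    "region_name": ["North", "South", "West"],
})

merged = readings.merge(
    regions,
    on="region_id",
    how="left",
    validate="many_to_one",  # raises MergeError if region_id repeats in `regions`
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# A left merge against a unique key should leave the row count unchanged.
assert len(merged) == len(readings)

# Rows with no match on the right are marked left_only and have a null region_name.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

With validate set, a duplicated key in the lookup table fails loudly at merge time instead of quietly inflating the result.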
Step 4: Build multi-panel charts
Single charts are fine for exploration, but publication-ready analyses usually need several panels. Use matplotlib subplots to arrange them:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2) # one row, two side-by-side panels
df.plot(ax=axes[0])
df2.plot(ax=axes[1])
fig.suptitle("Regional temperature comparison")
plt.tight_layout()
For interactive multi-panel charts, use Plotly with facet columns. Each region gets its own panel with hover, zoom, and comparison built in.
Always label axes with units. A chart that says "temperature" is ambiguous. One that says "temperature (°C)" is citable.
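To illustrate the labelling advice with a matplotlib layout, here is a small sketch that sets a title and unit-bearing axis labels on each panel. The two series and the region names are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly means for two regions.
dates = pd.date_range("2024-01-01", periods=4, freq="MS")
north = pd.Series([2.0, 3.5, 6.0, 9.0], index=dates)
south = pd.Series([12.0, 13.0, 15.5, 18.0], index=dates)

# sharey=True puts both panels on the same scale so they can be compared directly.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, series, name in zip(axes, [north, south], ["North", "South"]):
    series.plot(ax=ax)
    ax.set_title(name)
    ax.set_xlabel("Month")
    ax.set_ylabel("Temperature (°C)")  # quantity and unit on the axis

fig.suptitle("Regional temperature comparison")
fig.tight_layout()
```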
Step 5: Write reusable functions
Once you find yourself repeating the same cleaning or plotting steps across multiple cells, wrap them in a function. Define it once at the top of your workspace and call it anywhere:
def monthly_mean(df, col):
    """Return monthly mean for a given column."""
    return df.set_index("date").resample("M")[col].mean()

monthly_mean(df_ocean, "salinity_ppt")
Keep functions short and focused on one task. If a function is doing three things, split it into three functions.
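As a sketch of the one-task-per-function idea, cleaning and aggregation can live in separate functions that are composed at the call site. The helper drop_missing and the sample frame are hypothetical. This version of monthly_mean groups by calendar month with dt.to_period instead of resample, which avoids frequency-alias differences between pandas versions; unlike resample, it omits months with no data rather than returning NaN for them.

```python
import pandas as pd

def drop_missing(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df without rows where col is missing."""
    return df.dropna(subset=[col]).copy()

def monthly_mean(df: pd.DataFrame, col: str, date_col: str = "date") -> pd.Series:
    """Return the mean of col for each calendar month present in the data."""
    return df.groupby(df[date_col].dt.to_period("M"))[col].mean()

# Hypothetical ocean readings; values are illustrative only.
df_ocean = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-10", "2024-02-15"]),
    "salinity_ppt": [35.0, 35.4, None, 34.8],
})

result = monthly_mean(drop_missing(df_ocean, "salinity_ppt"), "salinity_ppt")
print(result)
```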
Step 6: Add narrative with markdown
A workspace that is only code is hard to share and harder to cite. Use markdown cells to turn your analysis into a readable document — covering what question each section answers, where the data comes from, and what the output shows. Aim for at least one markdown cell before each major code block.
Step 7: Parameterise and re-run cleanly
Hardcoded values scattered through a workspace make it fragile. Define all parameters in a single cell near the top:
END_DATE = "2024-12-31"
REGION = "North Atlantic"
DATA_FILE = "ocean_temp_v2.parquet"
Every cell below reads from these variables. To re-run for a different region or time window, you change one cell and hit Run all.
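A sketch of how later cells might read from the parameters cell. START_DATE and its value are assumptions added for illustration, and a small in-memory frame stands in for the data that would normally be loaded from DATA_FILE.

```python
import pandas as pd

# --- Parameters cell ---
START_DATE = "2024-01-01"  # hypothetical value, chosen only for this example
END_DATE = "2024-12-31"
REGION = "North Atlantic"

# --- Analysis cells read only from the parameters above ---
# Stand-in for the loaded dataset; values are illustrative only.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-12-15", "2024-03-01", "2024-07-01", "2024-07-01", "2025-01-02"]
    ),
    "region": ["North Atlantic", "North Atlantic", "North Atlantic",
               "South Pacific", "North Atlantic"],
    "temperature_c": [8.1, 7.4, 15.2, 22.3, 8.0],
})

# between() is inclusive at both ends by default.
in_window = df["date"].between(START_DATE, END_DATE)
selected = df[(df["region"] == REGION) & in_window]
print(selected)
```

Because no later cell repeats a date or region literal, changing the parameters cell is enough to redirect the whole analysis.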
Troubleshooting
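Merge returns more rows than expected: check for duplicate keys with df["region_id"].duplicated().sum() before merging.
Resampling raises an error: the index must be a datetime type; check df.index.dtype to confirm.
Chart labels overlap or are cut off: add plt.tight_layout() after your plot commands, or rotate labels with plt.xticks(rotation=45).
Import fails for a library: add a %pip install cell at the top for any non-standard libraries.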
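The first two checks can be run as quick diagnostics before merging or resampling. The frames below are hypothetical and deliberately contain the problems being tested for.

```python
import pandas as pd

# Hypothetical lookup table with a repeated key.
regions = pd.DataFrame({
    "region_id": [1, 2, 2],
    "region_name": ["North", "South", "South (duplicate)"],
})
n_dupes = int(regions["region_id"].duplicated().sum())
print(f"duplicate keys: {n_dupes}")  # anything above 0 will multiply rows in a merge

# Hypothetical readings whose index holds date strings, not datetimes.
readings = pd.DataFrame(
    {"temperature_c": [4.0, 5.0]},
    index=["2024-01-01", "2024-01-02"],
)
print(readings.index.dtype)
was_datetime = isinstance(readings.index, pd.DatetimeIndex)

# Convert the index so that resample() can be used.
if not was_datetime:
    readings.index = pd.to_datetime(readings.index)
```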