Using pandas apply function to augment the aggregation functions

Posted by Saugata Chatterjee on July 03, 2022 · 3 mins read

Aggregation functions like mean, median, etc. can be modified using the pandas apply() function. For example, the common pandas mean method finds the mean for the entire data series. We can extend this functionality to find the mean of a 95%-quantile range instead of the entire data range. We will call this trimmed_mean.

First we write the fucntion which is an extension of the standard mean.

def trimmed_mean(x: pd.Series, trim=0.05):
    if trim>1 or trim<0:
        print("Error: Trim value must be between 0 and 1")
        return x
    if not isinstance(x, pd.Series):
        return [np.nan]
    try:
        x = x.astype(float)
    except:
        return [np.nan]
    q1, q2 = np.quantile(x, [trim/2, 1-trim/2])
    mask = (x >= q1) & (x <= q2)
    tm = x[mask].mean()
    return tm

We can apply this function column by column

df = pd.read_csv('https://raw.githubusercontent.com/saugatach/WQU/main/MScFE%20650%20MLF%20GWP%20Data.csv')
trimmed_mean(df['MSCI USA'])
2523.87

We can also apply trimmed_mean to the entire dataframe using the apply method.

df.apply(trimmed_mean, trim=0.25)
LUXXX                      1489.462043
MSCI ARGENTINA             2477.794043
BLP ORIENTE MEDIO          3087.156948

The above piece of code calculates the mean of 75-percentile of the data i.e. mean of data in the top 75%. This method is useful when we want to know how each quantile performed.

How did the top 75% perform?
Trim off the bottom 25% and find the mean of the rest df.apply(trimmed_mean, trim=0.25).

How did the top 10% perform?
Trim off the bottom 90% and find the mean of the rest df.apply(trimmed_mean, trim=0.9).

The behaviour of the modified trimmed_mean is similar to the pandas in-built mean method. However, we are able to achieve a substantial amount of customization. Note the keyword argument trim to the method trimmed_mean(x, trim) is passed as a **kwargs following the definition.

 DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Reference
Pandas API reference