aggregate

Utility Function(s) Related to a Grouped/Aggregated Data Frame

The basic syntax for a groupby aggregation is pd.DataFrame.groupby().agg({}), and the utility functions provided here can be passed as aggregators inside the .agg() call.

pandaswizard.aggregate.percentile(n: float, outname: str | None = None, **kwargs) → float

Compute the n-th Percentile for the Grouped Data Series

In statistics, an n-th percentile, also known as a centile score, is a score below which a given percentage n of scores in its frequency distribution falls, or a score at or below which the given percentage falls. More information is available [here](https://en.wikipedia.org/wiki/Percentile). Percentiles are a type of [quantile](https://en.wikipedia.org/wiki/Quantile): the n-th percentile corresponds to the quantile at probability n/100.

Internally, the function uses the pd.Series.quantile() method to calculate the n-th percentile of the grouped series.
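Because pd.Series.quantile() takes a probability in [0, 1] while a percentile is given on a 0 to 100 scale, a conversion is needed between the two. The sketch below shows that relationship with plain pandas and NumPy; dividing by 100 is an assumption about how such a conversion would typically be done, not a statement about the library's internals.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])

n = 75  # percentile on the 0-100 scale

# pandas expects a probability between 0 and 1
via_pandas = values.quantile(n / 100)

# NumPy's percentile function expects the 0-100 scale directly
via_numpy = np.percentile(values.to_numpy(), n)

print(via_pandas, via_numpy)
```

With the default linear interpolation, both calls return the same value.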

Parameters:
  • n (int or float) – Percentage value to compute. Must lie in the range [0, 100], both ends inclusive.

  • outname (str) – Output name of the aggregated feature when the method is used in conjunction with other functions. It has no effect in standalone usage (CASE-I in the example below). The outname defaults to f"P{n:.2f}".

Keyword Arguments

  • method (str): Method used to estimate the percentile. Accepts any value supported by the method parameter of np.percentile(), some of which are unique to NumPy, and defaults to “linear”. In pd.Series.quantile() the equivalent argument is named interpolation, and its allowed values are: {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}.

  • interpolation (str): Same as method, using the pandas name for the quantile calculation option. The arguments method and interpolation cannot both be passed at the same time; doing so raises an AssertionError.

  • basemod (str): Abbreviation for “base module”; lets the user choose between pandas and numpy for calculating the percentile. With numpy, the default behaviour is np.nanquantile(), matching pd.Series.quantile(); passing dropna = False instead calculates with np.percentile and returns np.nan if the input contains nan values. Defaults to pandas. Allowed terms: {‘pd’, ‘pandas’, ‘np’, ‘numpy’}.

  • dropna (bool): Whether to drop nan values before calculating the percentile. Dropping them mimics the np.nanpercentile function, which is also the default behaviour of pd.Series.quantile(). More information: https://stackoverflow.com/a/70002786.
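To make the interpolation options listed above concrete, the following sketch calls pd.Series.quantile() directly with each of the five pandas values. The sample data is an illustrative assumption, chosen so that the requested percentile falls between two observations.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# The 40th percentile sits at fractional position 0.4 * (4 - 1) = 1.2,
# i.e. between 20 (position 1) and 30 (position 2).
options = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {
    opt: float(values.quantile(0.4, interpolation=opt)) for opt in options
}

for opt, value in results.items():
    print(f"{opt:>8}: {value}")
```

Here “linear” interpolates between the two neighbours (about 22), “lower” and “higher” pick 20 and 30, “midpoint” averages them to 25, and “nearest” picks 20 because position 1.2 is closer to position 1.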

Example and Usages

Assuming the end user has a basic understanding of pandas and percentiles, the percentile for a group can be computed as follows:

import pandas as pd
import pandaswizard as pdw

data = pd.DataFrame({"G" : ["A", "B", "B"], "V" : [1, 2, 3]})

# CASE-I: standalone usage, can be used on multiple features
percentile = data.groupby("G").agg({"V" : pdw.percentile(50)})

# CASE-II: usage in conjunction with any other function
# `.agg({})` passing dictionary of values, or by passing
# named tuples like `.agg(outname = ("column", func))`
# more details: https://stackoverflow.com/a/53619715/6623589

percentile = data.groupby("G").agg({
    "V" : ["sum", pdw.percentile(50, outname = "P50")]
})

Both approaches calculate the percentile for each group. In CASE-I the argument “outname” has no effect, because pandas by default labels the result with the original column name; in CASE-II the feature name can be set using the argument outname.
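The named-tuple form mentioned in the code comments is pandas' named aggregation, where the keyword itself becomes the output column label. A minimal sketch with plain pandas follows; the helper `median_like` is a hypothetical stand-in for a percentile aggregator and is not part of pandaswizard.

```python
import pandas as pd

data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})


def median_like(series: pd.Series) -> float:
    """Hypothetical stand-in for a 50th-percentile aggregator."""
    return float(series.quantile(0.5))


# Named aggregation: keyword = (column, aggregator). The keyword becomes
# the output column label, so the result has flat, explicit column names.
summary = data.groupby("G").agg(
    total=("V", "sum"),
    P50=("V", median_like),
)
print(summary)
```

This form avoids the two-level column index that the list-based dictionary form produces.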

pandaswizard.aggregate.quantile(n: float, outname: str | None = None, **kwargs) → float

Compute the n-th Quantile for the Grouped Data Series

In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. More information is available [here](https://en.wikipedia.org/wiki/Quantile).

Internally, the function uses the pd.Series.quantile() to calculate the n-th quantile of the grouped series.
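For reference, the same per-group calculation can be written with pandas alone by applying pd.Series.quantile() inside `.agg()`. This is a sketch of the underlying pandas call on illustrative data, not the pandaswizard implementation.

```python
import pandas as pd

data = pd.DataFrame({
    "G": ["A", "A", "A", "A", "B", "B", "B"],
    "V": [1, 2, 3, 4, 10, 20, 30],
})

# Quantile at probability 0.5 (the median) of "V" within each group.
# Each group's values arrive in the lambda as a Series.
medians = data.groupby("G")["V"].agg(lambda s: s.quantile(0.5))
print(medians)
```

Group A has an even number of values, so the default linear interpolation returns the midpoint of 2 and 3.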

Parameters:
  • n (int or float) – Probability value for the quantile to compute. Must lie in the range [0, 1], both ends inclusive.

  • outname (str) – Output name of the aggregated feature when the method is used in conjunction with other functions. It has no effect in standalone usage (CASE-I in the example below). The outname defaults to f"Q{n:.2f}".

Keyword Arguments

  • method (str): Method used to estimate the quantile. Accepts any value supported by the method parameter of np.quantile(), some of which are unique to NumPy, and defaults to “linear”. In pd.Series.quantile() the equivalent argument is named interpolation, and its allowed values are: {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}.

  • interpolation (str): Same as method, using the pandas name for the quantile calculation option. The arguments method and interpolation cannot both be passed at the same time; doing so raises an AssertionError.

  • basemod (str): Abbreviation for “base module”; lets the user choose between pandas and numpy for calculating the quantile. With numpy, the default behaviour is np.nanquantile(), matching pd.Series.quantile(); passing dropna = False instead calculates with np.percentile and returns np.nan if the input contains nan values. Defaults to pandas. Allowed terms: {‘pd’, ‘pandas’, ‘np’, ‘numpy’}.

  • dropna (bool): Whether to drop nan values before calculating the quantile. Dropping them mimics the np.nanpercentile function, which is also the default behaviour of pd.Series.quantile(). More information: https://stackoverflow.com/a/70002786.
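The difference between the NaN-dropping and NaN-propagating behaviours described above can be seen by calling the underlying pandas and NumPy functions directly. The sketch below uses only those functions; how pandaswizard dispatches between them is not shown here.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0])

# pandas drops NaN before computing: median of 1, 2 and 4
skipping = values.quantile(0.5)

# NumPy's NaN-aware variant gives the same answer as pandas
nan_aware = np.nanquantile(values.to_numpy(), 0.5)

# The plain NumPy function propagates NaN to the result
propagating = np.quantile(values.to_numpy(), 0.5)

print(skipping, nan_aware, propagating)
```

Propagating NaN can be useful when a missing value should invalidate the whole group rather than be silently ignored.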

Example and Usages

Assuming the end user has a basic understanding of pandas and quantiles, the quantile for a group can be computed as follows:

import pandas as pd
import pandaswizard as pdw

data = pd.DataFrame({"G" : ["A", "B", "B"], "V" : [1, 2, 3]})

# CASE-I: standalone usage, can be used on multiple features
quantile = data.groupby("G").agg({"V" : pdw.quantile(0.5)})

# CASE-II: usage in conjunction with any other function
# `.agg({})` passing dictionary of values, or by passing
# named tuples like `.agg(outname = ("column", func))`
# more details: https://stackoverflow.com/a/53619715/6623589

quantile = data.groupby("G").agg({
    "V" : ["sum", pdw.quantile(0.5, outname = "Q0.5")]
})

Both approaches calculate the quantile for each group. In CASE-I the argument “outname” has no effect, because pandas by default labels the result with the original column name; in CASE-II the feature name can be set using the argument outname.
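One way an output name can reach the result in CASE-II is through the callable's `__name__`, which pandas uses to label columns when a list of aggregators is passed. The factory below, `make_quantile`, is a hypothetical illustration of that mechanism under this assumption; it is not the pandaswizard implementation.

```python
from typing import Callable, Optional

import pandas as pd


def make_quantile(
    q: float, outname: Optional[str] = None
) -> Callable[[pd.Series], float]:
    """Hypothetical factory returning a named quantile aggregator."""

    def aggregator(series: pd.Series) -> float:
        return float(series.quantile(q))

    # pandas labels list-style aggregations by the callable's __name__,
    # so setting it controls the output column label.
    aggregator.__name__ = outname or f"Q{q:.2f}"
    return aggregator


data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})
result = data.groupby("G").agg(
    {"V": ["sum", make_quantile(0.5, outname="Q0.5")]}
)
print(result.columns.tolist())
```

In the standalone dictionary form, pandas keeps the original column name, which is consistent with the name having no effect in CASE-I.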