aggregate
Utility Function(s) Related to a Grouped/Aggregated Data Frame
The basic syntax for a groupby aggregation is DataFrame.groupby().agg({}), and the utility functions provided here can be passed inside the aggregation call.
- pandaswizard.aggregate.percentile(n: float, outname: str | None = None, **kwargs) → float
Compute the n-th Percentile for the Grouped Data Series
In statistics, the n-th percentile, also known as a centile score, is a score below which a given percentage n of scores in its frequency distribution falls, or a score at or below which the given percentage falls. More information is available [here](https://en.wikipedia.org/wiki/Percentile). Percentiles are a type of [quantile](https://en.wikipedia.org/wiki/Quantile), so the two terms can be used interchangeably once the scale is accounted for: the n-th percentile is the n/100 quantile.
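The relationship between the two scales can be checked directly with NumPy and pandas. The snippet below is a minimal sketch on made-up sample values; it only illustrates that the 50th percentile and the 0.5 quantile describe the same cut point.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]  # illustrative sample, not taken from the library

# A percentile is expressed on a 0-100 scale ...
p50 = np.percentile(values, 50)

# ... while a quantile is expressed on a 0-1 scale.
q50_numpy = np.quantile(values, 0.5)
q50_pandas = pd.Series(values).quantile(0.5)

# All three describe the same cut point (the median of the sample).
print(p50, q50_numpy, q50_pandas)
```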
Internally, the function uses the pd.Series.quantile() method to calculate the n-th percentile of the grouped series.
Parameters:
n (int or float) – Percentage value to compute. Values must lie in the range [0, 100], both inclusive.
outname (str) – Output name of the aggregated feature when the function is used in conjunction with other functions. It has no effect when the function is used standalone, as in CASE-I of the example below. The outname defaults to f"P{n:.2f}".
Keyword Arguments
- method (str): Specifies the method used to estimate the percentile. Several methods exist, some of which are unique to NumPy. Accepts any value supported by the method parameter of np.percentile() and defaults to “linear”. Note that pd.Series.quantile() names this argument interpolation, and its allowed values are {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}.
- interpolation (str): Same as method, under the name pandas uses for the quantile calculation method. The arguments method and interpolation cannot be passed at the same time; doing so raises an AssertionError.
- basemod (str): Abbreviation for “base module”; lets the user choose between pandas and numpy to calculate the percentile. With numpy, the default behaviour is np.nanquantile(), matching pd.Series.quantile(). Passing dropna = False instead uses np.percentile, which returns np.nan if the input contains nan values. Defaults to pandas. Allowed terms: {‘pd’, ‘pandas’, ‘np’, ‘numpy’}.
- dropna (bool): Calculate the percentile after dropping the nan values. This mimics the np.nanpercentile function, which is also the default behaviour of pd.Series.quantile(). More information: https://stackoverflow.com/a/70002786.
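To make the interpolation options listed above concrete, the following sketch calls pd.Series.quantile() directly with each allowed value. The data and the probability 0.4 are illustrative choices, not taken from pandaswizard; the point is only that the options differ when the requested cut point falls between two observations.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4])  # illustrative data

# q = 0.4 falls between the 2nd and 3rd sorted values (2 and 3),
# so each interpolation rule resolves the gap differently.
options = ["linear", "lower", "higher", "midpoint", "nearest"]
results = {
    opt: float(series.quantile(0.4, interpolation=opt)) for opt in options
}

for opt, value in results.items():
    print(f"{opt:>8}: {value}")
```

In recent NumPy releases the same choice is exposed through the method keyword of np.percentile(), which is why the two argument names coexist here.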
Example and Usages
Assuming the end user has a basic understanding of pandas and percentiles, the percentile for a group can be computed like this:
    import pandas as pd
    import pandaswizard as pdw

    data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})

    # CASE-I: standalone usage, can be used on multiple features
    percentile = data.groupby("G").agg({"V": pdw.percentile(50)})

    # CASE-II: usage in conjunction with any other function
    # `.agg({})` passing a dictionary of values, or by passing
    # named tuples like `.agg(outname=("column", func))`
    # more details: https://stackoverflow.com/a/53619715/6623589
    percentile = data.groupby("G").agg({
        "V": ["sum", pdw.percentile(50, outname="P50")]
    })
Both approaches calculate the percentile for the grouped values. In CASE-I the outname argument has no effect, because pandas returns the result under the original column name by default; in CASE-II the feature name can be set using outname.
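The role of outname in CASE-II follows from how pandas labels list-style aggregations: it uses the `__name__` of each callable. The sketch below illustrates that mechanism with a hypothetical helper, make_percentile, written only for this example; it is an assumption about how such a function could be built, not pandaswizard's actual implementation.

```python
import pandas as pd


def make_percentile(n, outname=None):
    """Hypothetical helper (not pandaswizard's code): build a named aggregator."""

    def aggregator(series):
        # pd.Series.quantile expects a probability, so rescale from 0-100.
        return series.quantile(n / 100)

    # pandas labels list-style aggregations with the callable's __name__.
    aggregator.__name__ = outname or f"P{n:.2f}"
    return aggregator


data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})
result = data.groupby("G").agg(
    {"V": ["sum", make_percentile(50, outname="P50")]}
)
print(result.columns.tolist())
```

Without an explicit name, every such closure would share the inner function's name, which is why a distinct default label is useful when several percentiles are requested for one column.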
- pandaswizard.aggregate.quantile(n: float, outname: str | None = None, **kwargs) → float
Compute the n-th Quantile for the Grouped Data Series
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. More information is available [here](https://en.wikipedia.org/wiki/Quantile).
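The "cut point" definition can be illustrated with a small NumPy sketch. The sample below (the integers 1 to 100) is an illustrative choice; the quartiles computed from it split the observations into four groups of equal size.

```python
import numpy as np

sample = np.arange(1, 101)  # illustrative sample: the integers 1..100

# The three quartiles are the cut points at probabilities 0.25, 0.5 and 0.75.
cuts = np.quantile(sample, [0.25, 0.5, 0.75])

# Assign each observation to the interval it falls in, then count per interval.
interval = np.searchsorted(cuts, sample)
counts = np.bincount(interval)

print(cuts)    # the three cut points
print(counts)  # number of observations in each of the four intervals
```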
Internally, the function uses the pd.Series.quantile() method to calculate the n-th quantile of the grouped series.
Parameters:
n (int or float) – Probability value of the quantile to compute. Values must lie in the range [0, 1], both inclusive.
outname (str) – Output name of the aggregated feature when the function is used in conjunction with other functions. It has no effect when the function is used standalone, as in CASE-I of the example below. The outname defaults to f"Q{n:.2f}".
Keyword Arguments
- method (str): Specifies the method used to estimate the quantile. Several methods exist, some of which are unique to NumPy. Accepts any value supported by the method parameter of np.quantile() and defaults to “linear”. Note that pd.Series.quantile() names this argument interpolation, and its allowed values are {‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}.
- interpolation (str): Same as method, under the name pandas uses for the quantile calculation method. The arguments method and interpolation cannot be passed at the same time; doing so raises an AssertionError.
- basemod (str): Abbreviation for “base module”; lets the user choose between pandas and numpy to calculate the quantile. With numpy, the default behaviour is np.nanquantile(), matching pd.Series.quantile(). Passing dropna = False instead uses np.percentile, which returns np.nan if the input contains nan values. Defaults to pandas. Allowed terms: {‘pd’, ‘pandas’, ‘np’, ‘numpy’}.
- dropna (bool): Calculate the quantile after dropping the nan values. This mimics the np.nanpercentile function, which is also the default behaviour of pd.Series.quantile(). More information: https://stackoverflow.com/a/70002786.
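The effect of dropping or keeping nan values, as described for basemod and dropna above, can be seen by calling the underlying NumPy and pandas functions directly. This is a minimal sketch on made-up data and does not call pandaswizard itself.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, np.nan, 3.0])  # illustrative data with one NaN

plain = np.quantile(values, 0.5)                  # NaN propagates into the result
nan_aware = np.nanquantile(values, 0.5)           # NaN is ignored
pandas_default = pd.Series(values).quantile(0.5)  # pandas skips NaN as well

print(plain, nan_aware, pandas_default)
```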
Example and Usages
Assuming the end user has a basic understanding of pandas and quantiles, the quantile for a group can be computed like this:
    import pandas as pd
    import pandaswizard as pdw

    data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})

    # CASE-I: standalone usage, can be used on multiple features
    # note: n is a probability in [0, 1], so the median is 0.5
    quantile = data.groupby("G").agg({"V": pdw.quantile(0.5)})

    # CASE-II: usage in conjunction with any other function
    # `.agg({})` passing a dictionary of values, or by passing
    # named tuples like `.agg(outname=("column", func))`
    # more details: https://stackoverflow.com/a/53619715/6623589
    quantile = data.groupby("G").agg({
        "V": ["sum", pdw.quantile(0.5, outname="Q0.5")]
    })
Both approaches calculate the quantile for the grouped values. In CASE-I the outname argument has no effect, because pandas returns the result under the original column name by default; in CASE-II the feature name can be set using outname.
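The named-tuple form mentioned in the example comments sets the output name through the keyword itself rather than through the callable. The sketch below uses plain pandas with a stand-in aggregator, median_quantile, which is a hypothetical function defined only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({"G": ["A", "B", "B"], "V": [1, 2, 3]})


def median_quantile(series):
    # Stand-in aggregator for illustration; any callable returning a scalar works.
    return series.quantile(0.5)


# Named aggregation: keyword = (column, aggregation function).
result = data.groupby("G").agg(
    total=("V", "sum"),
    Q50=("V", median_quantile),
)
print(result)
```

With this form the output column is named by the keyword, so an outname-style argument on the aggregator is not needed.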