pyspark.sql.DataFrame.approxQuantile
- DataFrame.approxQuantile(col, probabilities, relativeError)
- Calculates the approximate quantiles of numerical columns of a DataFrame.
- The result of this algorithm has the following deterministic bound: if the DataFrame has N elements and we request the quantile at probability p up to error err, then the algorithm returns a sample x from the DataFrame such that the exact rank of x is close to (p * N). More precisely,
  floor((p - err) * N) <= rank(x) <= ceil((p + err) * N).
- This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in "Space-efficient Online Computation of Quantile Summaries" by Greenwald and Khanna (https://doi.org/10.1145/375663.375670).
- New in version 2.0.0.
- Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- col: str, tuple or list
- Can be a single column name, or a list of names for multiple columns.
- Changed in version 2.2.0: Added support for multiple columns.
- probabilities: list or tuple of floats
- A list of quantile probabilities. Each number must be a float in the range [0, 1]. For example, 0.0 is the minimum, 0.5 is the median, and 1.0 is the maximum.
- relativeError: float
- The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
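To make the effect of relativeError concrete, the following plain-Python sketch evaluates the documented rank bound, floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), for two error values. The helper name rank_window is hypothetical and is not part of PySpark; it only performs the arithmetic and does not run any quantile algorithm.

```python
import math


def rank_window(p, err, n):
    """Return the (lowest, highest) rank allowed by the documented bound.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    low = math.floor((p - err) * n)
    high = math.ceil((p + err) * n)
    return low, high


# N = 8 elements, median requested (p = 0.5).
exact = rank_window(0.5, 0.0, 8)    # err = 0 leaves a single admissible rank
loose = rank_window(0.5, 0.25, 8)   # a larger err widens the window
print(exact, loose)
```

The window collapses to a single rank when err is 0 and grows by roughly err * N on each side otherwise, so any element whose rank falls inside it is an acceptable answer.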
 
- Returns
- list
- The approximate quantiles at the given probabilities.
- If the input col is a string, the output is a list of floats.
- If the input col is a list or tuple of strings, the output is also a list, but each element in it is a list of floats, i.e., the output is a list of lists of floats.
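When several columns are requested, the nested result is positional: the i-th inner list belongs to the i-th column name. A minimal sketch of pairing them up follows. It uses a hard-coded stand-in shaped like the multi-column return value rather than a live Spark call, and the variable names are illustrative.

```python
# Column names and probabilities as they would be passed to approxQuantile.
cols = ["col1", "col2"]
probabilities = [0.0, 0.5, 1.0]

# Stand-in for the list-of-lists return value (one inner list per column).
result = [[1.0, 3.0, 5.0], [10.0, 30.0, 50.0]]

# Map column -> {probability: quantile} for easier lookup.
by_column = {
    name: dict(zip(probabilities, quantiles))
    for name, quantiles in zip(cols, result)
}

print(by_column["col2"][0.5])  # 30.0
```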
 
 
 
- Notes
- Null values will be ignored in numerical columns before calculation. For columns only containing null values, an empty list is returned.
- Examples
- Example 1: Calculating quantiles for a single column
  >>> data = [(1,), (2,), (3,), (4,), (5,)]
  >>> df = spark.createDataFrame(data, ["values"])
  >>> quantiles = df.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
  >>> quantiles
  [1.0, 3.0, 5.0]
- Example 2: Calculating quantiles for multiple columns
  >>> data = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
  >>> df = spark.createDataFrame(data, ["col1", "col2"])
  >>> quantiles = df.approxQuantile(["col1", "col2"], [0.0, 0.5, 1.0], 0.05)
  >>> quantiles
  [[1.0, 3.0, 5.0], [10.0, 30.0, 50.0]]
- Example 3: Handling null values
  >>> data = [(1,), (None,), (3,), (4,), (None,)]
  >>> df = spark.createDataFrame(data, ["values"])
  >>> quantiles = df.approxQuantile("values", [0.0, 0.5, 1.0], 0.05)
  >>> quantiles
  [1.0, 3.0, 4.0]
- Example 4: Calculating quantiles with low precision
  >>> data = [(1,), (2,), (3,), (4,), (5,)]
  >>> df = spark.createDataFrame(data, ["values"])
  >>> quantiles = df.approxQuantile("values", [0.0, 0.2, 1.0], 0.1)
  >>> quantiles
  [1.0, 1.0, 5.0]
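For small inputs it can help to have a plain-Python reference to compare against. The sketch below is not the Greenwald-Khanna variant that Spark uses. It drops None values as described in the Notes, sorts the rest, and picks the element at 1-based rank ceil(p * N), with p = 0 mapped to the minimum. For p > 0 that rank lies inside the documented window with err = 0. The function name exact_quantiles is hypothetical and is not a PySpark API.

```python
import math


def exact_quantiles(values, probabilities):
    """Naive sort-based quantiles; hypothetical helper, not a PySpark API."""
    # Nulls are ignored before the calculation.
    kept = sorted(float(v) for v in values if v is not None)
    if not kept:
        # A column with only nulls yields an empty list.
        return []
    n = len(kept)
    out = []
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be in [0, 1]")
        # 1-based rank ceil(p * n); p = 0 is clamped to the minimum.
        rank = max(1, math.ceil(p * n))
        out.append(kept[rank - 1])
    return out


print(exact_quantiles([1, None, 3, 4, None], [0.0, 0.5, 1.0]))
print(exact_quantiles([None, None], [0.5]))
```

On the null-handling data from Example 3 this returns [1.0, 3.0, 4.0]. Agreement on other inputs is not guaranteed, since approxQuantile may return any value whose rank falls inside its error window.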