Changes

Summary

  1. [SPARK-34965][BUILD] Remove .sbtopts that duplicately sets the default (commit: 2eda1c6) (details)
  2. [SPARK-34968][TEST][PYTHON] Add the `-fr` argument to xargs rm (commit: 390d5bd) (details)
  3. [SPARK-34922][SQL] Use a relative cost comparison function in the CBO (commit: 7c8dc5e) (details)
  4. [SPARK-34678][SQL] Add table function registry (commit: 0aa2c28) (details)
  5. [SPARK-34969][SPARK-34906][SQL] Followup for Refactor TreeNode's (commit: 06c09a7) (details)
  6. [SPARK-27658][SQL] Add FunctionCatalog API (commit: 3c7d6c3) (details)
  7. [SPARK-34970][SQL][SERCURITY] Redact map-type options in the output of (commit: f208d80) (details)
  8. [SPARK-34972][PYTHON] Make pandas-on-Spark doctests work (commit: 2635c38) (details)
  9. [SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets (commit: 5a3f41a) (details)
  10. [SPARK-34668][SQL] Support casting of day-time intervals to strings (commit: 3dfd456) (details)
  11. [SPARK-34493][DOCS] Add "TEXT Files" page for Data Source documents (commit: 8e15ac1) (details)
Commit 2eda1c65ea122b21c75d19bd3a6c58c905cb7304 by dhyun
[SPARK-34965][BUILD] Remove .sbtopts that duplicately sets the default memory

### What changes were proposed in this pull request?

This PR removes `.sbtopts` (added in https://github.com/apache/spark/pull/29286) that duplicately sets the default memory. The default memories are set:

https://github.com/apache/spark/blob/3b634f66c3e4a942178a1e322ae65ce82779625d/build/sbt-launch-lib.bash#L119-L124

### Why are the changes needed?

This file disables the memory option from the `build/sbt` script:

```bash
./build/sbt -mem 6144
```
```
.../jdk-11.0.3.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
```

because it adds these memory options at the last:

```bash
/.../bin/java -Xms6144m -Xmx6144m -XX:ReservedCodeCacheSize=256m -Xmx4G -Xss4m -jar build/sbt-launch-1.5.0.jar
```

and Java respects the rightmost memory configurations.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually ran SBT. It will be tested in the CIs in this Pr.

Closes #32062 from HyukjinKwon/SPARK-34965.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 2eda1c6)
The file was modifiedbuild/sbt (diff)
The file was removed.sbtopts
The file was modifiedbuild/sbt-launch-lib.bash (diff)
Commit 390d5bde81ce282a016c6cf36fb0d57012515388 by dhyun
[SPARK-34968][TEST][PYTHON] Add the `-fr` argument to xargs rm

### What changes were proposed in this pull request?
This patch add  the `-fr` argument to `xargs rm`.

### Why are the changes needed?

This cmd is unavailable in basic case. If the find command does not get any search results, the rm command is invoked with an empty argument list, and then we will get a `rm: missing operand` and break, then the coverage report does not generate.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python

The coverage report result is generated without break.

Closes #32064 from Yikun/patch-1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 390d5bd)
The file was modifiedpython/run-tests-with-coverage (diff)
Commit 7c8dc5e0b54731c5790bb94274eb9308774772ff by yamamuro
[SPARK-34922][SQL] Use a relative cost comparison function in the CBO

### What changes were proposed in this pull request?

Changed the cost comparison function of the CBO to use the ratios of row counts and sizes in bytes.

### Why are the changes needed?

In #30965 we changed to CBO cost comparison function so it would be "symetric": `A.betterThan(B)` now implies, that `!B.betterThan(A)`.
With that we caused a performance regressions in some queries - TPCDS q19 for example.

The original cost comparison function used the ratios `relativeRows = A.rowCount / B.rowCount` and `relativeSize = A.size / B.size`. The changed function compared "absolute" cost values `costA = w*A.rowCount + (1-w)*A.size` and `costB = w*B.rowCount + (1-w)*B.size`.

Given the input from wzhfy we decided to go back to the relative values, because otherwise one (size) may overwhelm the other (rowCount). But this time we avoid adding up the ratios.

Originally `A.betterThan(B) => w*relativeRows + (1-w)*relativeSize < 1` was used. Besides being "non-symteric", this also can exhibit one overwhelming other.
For `w=0.5` If `A` size (bytes) is at least 2x larger than `B`, then no matter how many times more rows does the `B` plan have, `B` will allways be considered to be better - `0.5*2 + 0.5*0.00000000000001 > 1`.

When working with ratios, then it would be better to multiply them.
The proposed cost comparison function is: `A.betterThan(B) => relativeRows^w  * relativeSize^(1-w) < 1`.

### Does this PR introduce _any_ user-facing change?

Comparison of the changed TPCDS v1.4 query execution times at sf=10:

  | absolute | multiplicative |   | additive |  
-- | -- | -- | -- | -- | --
q12 | 145 | 137 | -5.52% | 141 | -2.76%
q13 | 264 | 271 | 2.65% | 271 | 2.65%
q17 | 4521 | 4243 | -6.15% | 4348 | -3.83%
q18 | 758 | 466 | -38.52% | 480 | -36.68%
q19 | 38503 | 2167 | -94.37% | 2176 | -94.35%
q20 | 119 | 120 | 0.84% | 126 | 5.88%
q24a | 16429 | 16838 | 2.49% | 17103 | 4.10%
q24b | 16592 | 16999 | 2.45% | 17268 | 4.07%
q25 | 3558 | 3556 | -0.06% | 3675 | 3.29%
q33 | 362 | 361 | -0.28% | 380 | 4.97%
q52 | 1020 | 1032 | 1.18% | 1052 | 3.14%
q55 | 927 | 938 | 1.19% | 961 | 3.67%
q72 | 24169 | 13377 | -44.65% | 24306 | 0.57%
q81 | 1285 | 1185 | -7.78% | 1168 | -9.11%
q91 | 324 | 336 | 3.70% | 337 | 4.01%
q98 | 126 | 129 | 2.38% | 131 | 3.97%

All times are in ms, the change is compared to the situation in the master branch (absolute).
The proposed cost function (multiplicative) significantlly improves the performance on q18, q19 and q72. The original cost function (additive) has similar improvements at q18 and q19. All other chagnes are within the error bars and I would ignore them - perhaps q81 has also improved.

### How was this patch tested?

PlanStabilitySuite

Closes #32014 from tanelk/SPARK-34922_cbo_better_cost_function.

Lead-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 7c8dc5e)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q81.sf100/simplified.txt (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/JoinReorderSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q98.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q73.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q55.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q98.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q12.sf100/explain.txt (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q19.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q52.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-modified/q73.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q20.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q12.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q52.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q18a.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q20.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q72.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q72.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q19.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24b.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24a.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q91.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q20.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24b.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q72.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q13.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q18.sf100/explain.txt (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q72.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q12.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q13.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q98.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q18a.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q20.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24a.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q18.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q98.sf100/simplified.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q81.sf100/explain.txt (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/StarJoinCostBasedReorderSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q55.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q91.sf100/explain.txt (diff)
The file was modifiedsql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q12.sf100/explain.txt (diff)
Commit 0aa2c284e4052cb57ebf7276ecc4867ea2d5f94f by wenchen
[SPARK-34678][SQL] Add table function registry

### What changes were proposed in this pull request?
This PR extends the current function registry and catalog to support table-valued functions by adding a table function registry. It also refactors `range` to be a built-in function in the table function registry.

### Why are the changes needed?
Currently, Spark resolves table-valued functions very differently from the other functions. This change is to make the behavior for table and non-table functions consistent. It also allows Spark to display information about built-in table-valued functions:
Before:
```scala
scala> sql("describe function range").show(false)
+--------------------------+
|function_desc             |
+--------------------------+
|Function: range not found.|
+--------------------------+
```
After:
```scala
Function: range
Class: org.apache.spark.sql.catalyst.plans.logical.Range
Usage:
  range(start: Long, end: Long, step: Long, numPartitions: Int)
  range(start: Long, end: Long, step: Long)
  range(start: Long, end: Long)
  range(end: Long)

// Extended
Function: range
Class: org.apache.spark.sql.catalyst.plans.logical.Range
Usage:
  range(start: Long, end: Long, step: Long, numPartitions: Int)
  range(start: Long, end: Long, step: Long)
  range(start: Long, end: Long)
  range(end: Long)

Extended Usage:
  Examples:
    > SELECT * FROM range(1);
      +---+
      | id|
      +---+
      |  0|
      +---+
    > SELECT * FROM range(0, 2);
      +---+
      |id |
      +---+
      |0  |
      |1  |
      +---+
    > SELECT range(0, 4, 2);
      +---+
      |id |
      +---+
      |0  |
      |2  |
      +---+

    Since: 2.0.0
```

### Does this PR introduce _any_ user-facing change?
Yes. User will not be able to create a function with name `range` in the default database:
Before:
```scala
scala> sql("create function range as 'range'")
res3: org.apache.spark.sql.DataFrame = []
```
After:
```
scala> sql("create function range as 'range'")
org.apache.spark.sql.catalyst.analysis.FunctionAlreadyExistsException: Function 'default.range' already exists in database 'default'
```

### How was this patch tested?
Unit test

Closes #31791 from allisonwang-db/spark-34678-table-func-registry.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0aa2c28)
The file was modifiedsql/core/src/test/resources/sql-tests/results/postgreSQL/int8.sql.out (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveTableValuedFunctions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/postgreSQL/numeric.sql.out (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/table-valued-functions.sql.out (diff)
The file was modifiedsql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionInfo.java (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/table-valued-functions.sql (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala (diff)
Commit 06c09a79b371c5ac3e4ebad1118ed94b460f48d1 by herman
[SPARK-34969][SPARK-34906][SQL] Followup for Refactor TreeNode's children handling methods into specialized traits

### What changes were proposed in this pull request?

This is a followup for https://github.com/apache/spark/pull/31932.
In this PR we:
- Introduce the `QuaternaryLike` trait for node types with 4 children.
- Specialize more node types
- Fix a number of style errors that were introduced in the original PR.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

This is a refactoring, passes existing tests.

Closes #32065 from dbaliafroozeh/FollowupSPARK-34906.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(commit: 06c09a7)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowCurrentNamespaceExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ReplaceTableExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropNamespaceExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RenamePartitionExec.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TruncateTableExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/bitwiseAggregates.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DeleteFromTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/TestingTypedCount.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropTableExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AddPartitionExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RenameTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2CommandExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/TypedImperativeAggregateSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AlterTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeColumnExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V1FallbackWriters.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropPartitionExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablePropertiesExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AlterNamespaceSetPropertiesExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateNamespaceExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/SetCatalogAndNamespaceExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountIf.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TruncatePartitionExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Covariance.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PartitionTransforms.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RefreshTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala (diff)
Commit 3c7d6c38e819c23b548d42dbd37da4d6ee2380d3 by wenchen
[SPARK-27658][SQL] Add FunctionCatalog API

## What changes were proposed in this pull request?

This adds a new API for catalog plugins that exposes functions to Spark. The API can list and load functions. This does not include create, delete, or alter operations.
- [Design Document](https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit?usp=sharing)

There are 3 types of functions defined:
* A `ScalarFunction` that produces a value for every call
* An `AggregateFunction` that produces a value after updates for a group of rows

Functions loaded from the catalog by name as `UnboundFunction`. Once input arguments are determined `bind` is called on the unbound function to get a `BoundFunction` implementation that is one of the 3 types above. Binding can fail if the function doesn't support the input type. `BoundFunction` returns the result type produced by the function.

## How was this patch tested?

This includes a test that demonstrates the new API.

Closes #24559 from rdblue/SPARK-27658-add-function-catalog-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3c7d6c3)
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/BoundFunction.java
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Function.java
The file was addedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/functions/AggregateFunctionSuite.scala
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/FunctionCatalog.java
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/AggregateFunction.java
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala (diff)
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ScalarFunction.java
The file was addedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/UnboundFunction.java
Commit f208d808819dae38739fa9acda9510a4b9763dde by ltnwgl
[SPARK-34970][SQL][SERCURITY] Redact map-type options in the output of explain()

### What changes were proposed in this pull request?

The `explain()` method prints the arguments of tree nodes in logical/physical plans. The arguments could contain a map-type option that contains sensitive data.
We should map-type options in the output of `explain()`. Otherwise, we will see sensitive data in explain output or Spark UI.
![image](https://user-images.githubusercontent.com/1097932/113719178-326ffb00-96a2-11eb-8a2c-28fca3e72941.png)

### Why are the changes needed?

Data security.

### Does this PR introduce _any_ user-facing change?

Yes, redact the map-type options in the output of `explain()`

### How was this patch tested?

Unit tests

Closes #32066 from gengliangwang/redactOptions.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: f208d80)
The file was modifiedsql/core/src/test/resources/sql-tests/results/describe.sql.out (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
Commit 2635c3894ff935bf1cd2d86648a28dcb4dc3dc73 by gurwls223
[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work

### What changes were proposed in this pull request?

Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure.

### Why are the changes needed?

Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Enabled the whole doctests.

Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 2635c38)
The file was modifiedpython/pyspark/pandas/generic.py (diff)
The file was modifiedpython/pyspark/pandas/ml.py (diff)
The file was modifiedpython/pyspark/pandas/base.py (diff)
The file was modifiedpython/pyspark/pandas/categorical.py (diff)
The file was modifieddev/run-tests.py (diff)
The file was modifiedpython/pyspark/pandas/groupby.py (diff)
The file was modifiedpython/pyspark/pandas/utils.py (diff)
The file was modifiedpython/pyspark/pandas/internal.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/category.py (diff)
The file was modifiedpython/pyspark/pandas/accessors.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/base.py (diff)
The file was removedpython/pyspark/pandas/sql.py
The file was modifiedpython/pyspark/pandas/indexing.py (diff)
The file was modified.github/workflows/build_and_test.yml (diff)
The file was modifiedpython/pyspark/pandas/__init__.py (diff)
The file was modifiedpython/pyspark/pandas/config.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/numeric.py (diff)
The file was modifiedpython/pyspark/pandas/typedef/typehints.py (diff)
The file was modifiedpython/pyspark/pandas/spark/accessors.py (diff)
The file was modifiedpython/pyspark/pandas/strings.py (diff)
The file was modifiedpython/pyspark/pandas/mlflow.py (diff)
The file was modifiedpython/pyspark/pandas/usage_logging/__init__.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/multi.py (diff)
The file was modifiedpython/pyspark/pandas/numpy_compat.py (diff)
The file was modifiedpython/pyspark/pandas/frame.py (diff)
The file was modifiedpython/pyspark/pandas/series.py (diff)
The file was addedpython/pyspark/pandas/sql_processor.py
The file was modifiedpython/pyspark/pandas/window.py (diff)
The file was modifiedpython/pyspark/pandas/indexes/datetimes.py (diff)
The file was modifiedpython/pyspark/pandas/datetimes.py (diff)
The file was modifiedpython/pyspark/pandas/namespace.py (diff)
The file was modifiedpython/pyspark/pandas/extensions.py (diff)
The file was modifiedpython/pyspark/pandas/spark/utils.py (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was modifiedpython/pyspark/pandas/exceptions.py (diff)
Commit 5a3f41a017bf0fb07cf242e5c6ba25fc231863c6 by wenchen
[SPARK-34976][SQL] Rename GroupingSet to BaseGroupingSets

### What changes were proposed in this pull request?
Current trait `GroupingSet` is ambiguous, since `grouping set` in parser level means one set of a group.
Rename this to `BaseGroupingSets` since cube/rollup is syntax sugar for grouping sets.`

### Why are the changes needed?
Refactor class name

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32073 from AngersZhuuuu/SPARK-34976.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 5a3f41a)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupingAnalyticsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
Commit 3dfd456b2c4133f751a67e4132196d2d1470af29 by wenchen
[SPARK-34668][SQL] Support casting of day-time intervals to strings

### What changes were proposed in this pull request?
1. Added new method `toDayTimeIntervalString()` to `IntervalUtils` which converts a day-time interval as a number of microseconds to a string in the form **"INTERVAL '[sign]days hours:minutes:secondsWithFraction' DAY TO SECOND"**.
2. Extended the `Cast` expression to support casting of `DayTimeIntervalType` to `StringType`.

### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.

### Does this PR introduce _any_ user-facing change?
Should not because new day-time interval has not been released yet.

### How was this patch tested?
Added new tests for casting:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32070 from MaxGekk/cast-dt-interval-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3dfd456)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
Commit 8e15ac1c67cac2e8d76a4117a51e2ed7d80e9aa8 by max.gekk
[SPARK-34493][DOCS] Add "TEXT Files" page for Data Source documents

### What changes were proposed in this pull request?

This PR aims to add a documentation on how to read and write TEXT files through various APIs such as Scala, Python and JAVA in Spark to [Data Source documents](https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources).

### Why are the changes needed?

Documentation on how Spark handles TEXT files is missing. It should be added to the document for user convenience.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new page to Data Sources documents.

### How was this patch tested?

Manually build documents and check the page on local as below.

![Screen Shot 2021-04-07 at 4 05 01 PM](https://user-images.githubusercontent.com/44108233/113824674-085e2c00-97bb-11eb-91ae-d2cc19dfd369.png)

Closes #32053 from itholic/SPARK-34491-TEXT.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 8e15ac1)
The file was addeddocs/sql-data-sources-text.md
The file was modifiedexamples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java (diff)
The file was modifieddocs/sql-data-sources.md (diff)
The file was modifiedexamples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala (diff)
The file was modifiedexamples/src/main/python/sql/datasource.py (diff)
The file was modifieddocs/_data/menu-sql.yaml (diff)