Changes

Summary

  1. [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to (commit: 634f96d) (details)
  2. [SPARK-36266][SHUFFLE] Rename classes in shuffle RPC used for block push (commit: c4aa54e) (details)
  3. [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and (commit: 55971b7) (details)
  4. [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series (commit: d52c2de) (details)
  5. [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex (commit: c40d9d4) (details)
  6. [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents (commit: 9a47483) (details)
  7. [SPARK-36247][SQL] Check string length for char/varchar and apply type (commit: 068f8d4) (details)
  8. [SPARK-36211][PYTHON] Correct typing of `udf` return value (commit: ede1bc6) (details)
  9. [SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job (commit: 674202e) (details)
  10. [SPARK-36241][SQL] Support creating tables with null column (commit: 8e7e14d) (details)
  11. [SPARK-34619][SQL][DOCS] Describe ANSI interval types at the `Data (commit: f483796) (details)
  12. Revert "[SPARK-36136][SQL][TESTS] Refactor (commit: 22ac98d) (details)
  13. [SPARK-34249][DOCS] Add documentation for ANSI implicit cast rules (commit: df98d5b) (details)
  14. [SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark (commit: f90eb6a) (details)
  15. [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any() (commit: bcc595c) (details)
  16. [SPARK-36314][SS] Update Sessionization examples to use native support (commit: 1fafa8e) (details)
  17. [SPARK-36318][SQL][DOCS] Update docs about mapping of ANSI interval (commit: 1614d00) (details)
  18. [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up (commit: c8dd97d) (details)
  19. [SPARK-36275][SQL] ResolveAggregateFunctions should works with nested (commit: 23a6ffa) (details)
  20. [SPARK-35639][SQL] Add metrics about coalesced partitions to (commit: 41a16eb) (details)
  21. [SPARK-36312][SQL] ParquetWriterSupport.setSchema should check inner (commit: 59e0c25) (details)
  22. [SPARK-34399][DOCS][FOLLOWUP] Add docs for the new metrics of task/job (commit: c9a7ff3) (details)
  23. [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands (commit: 809b88a) (details)
  24. [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema (commit: 86f4457) (details)
  25. [SPARK-36312][SQL][FOLLOWUP] Add back (commit: f086c17) (details)
  26. [SPARK-35320][SQL] Improve error message for unsupported key types in (commit: f12793d) (details)
  27. [SPARK-36270][BUILD][FOLLOWUP] Fix typo in the sbt memory setting (commit: aff1826) (details)
  28. [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns (commit: 3c76a92) (details)
  29. [SPARK-36095][CORE] Grouping exception in core/rdd (commit: af6d04b) (details)
  30. [SPARK-36229][SQL] conv() inconsistently handles invalid strings with (commit: e1c50ff) (details)
  31. [SPARK-36143][PYTHON] Adjust `astype` of fractional Series with missing (commit: 0121309) (details)
  32. [SPARK-36236][SS] Additional metrics for RocksDB based state store (commit: eb4d1c0) (details)
Commit 634f96dde40639df5a2ef246884bedbd48b3dc69 by viirya
[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package

### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. A bit of refactoring was done to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in the package `org.apache.spark.sql.hive.execution`, which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same package where the rule `PruneFileSourcePartitions` is defined.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 634f96d)
The file was removed sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala
The file was removed sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PrunePartitionSuiteBase.scala
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala (diff)
Commit c4aa54ed4e8cf7942335bfafdeacf57b5d148f2a by mridulatgmail.com
[SPARK-36266][SHUFFLE] Rename classes in shuffle RPC used for block push operations

### What changes were proposed in this pull request?
This is a follow-up to #29855 according to the [comments](https://github.com/apache/spark/pull/29855/files#r505536514)
In this PR, the following changes are made:

1. A new `BlockPushingListener` interface is created specifically for block push. The existing `BlockFetchingListener` interface is left as is, since it might be used by external shuffle solutions. These 2 interfaces are unified under `BlockTransferListener` to enable code reuse.
2. `RetryingBlockFetcher`, `BlockFetchStarter`, and `RetryingBlockFetchListener` are renamed to `RetryingBlockTransferor`, `BlockTransferStarter`, and `RetryingBlockTransferListener` respectively. This makes their names more generic to be reused across both block fetch and push.
3. Comments in `OneForOneBlockPusher` are further clarified to better explain how we handle retries for block push.

### Why are the changes needed?
To make code cleaner without sacrificing backward compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #33340 from Victsm/SPARK-32915-followup.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: c4aa54e)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java (diff)
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockTransferListener.java
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/OneForOneBlockPusherSuite.java (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/ShuffleBlockPusher.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/shuffle/ShuffleBlockPusherSuite.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockPusher.java (diff)
The file was added common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RetryingBlockTransferorSuite.java
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.java
The file was added common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockPushingListener.java
The file was modified core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/network/netty/NettyBlockTransferServiceSuite.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockFetchingListener.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ErrorHandler.java (diff)
The file was removed common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RetryingBlockFetcherSuite.java
The file was removed common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockFetcher.java
Commit 55971b70fe3c899d4516e4955bc7c9ebd4b4af70 by ueshin
[SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex

### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.
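
A minimal sketch of how the new method might be used, assuming the pandas-equivalent signature (the data and category values below are made up):
```python
import pyspark.pandas as ps

psser = ps.Series(["a", "b", "b"], dtype="category")
psser.cat.set_categories(["b", "c", "a"])      # replace/reorder the categories of the Series
psidx = ps.CategoricalIndex(["a", "b", "b"])
psidx.set_categories(["a", "b", "c"])          # add an unused category on the index
```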

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 55971b7)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modified python/docs/source/reference/pyspark.pandas/indexing.rst (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/categorical.py (diff)
The file was modified python/pyspark/pandas/indexes/category.py (diff)
The file was modified python/pyspark/pandas/missing/indexes.py (diff)
The file was modified python/docs/source/reference/pyspark.pandas/series.rst (diff)
Commit d52c2de08b60930a129825d15e8f822c07e8bd31 by gurwls223
[SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal

### What changes were proposed in this pull request?

Set the result to 1 when the exponent is 0 (or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
>>> pser = pd.Series([1, 2, np.nan], dtype=float)
>>> psser = ps.from_pandas(pser)
>>> pser ** False
0 1.0
1 1.0
2 1.0
dtype: float64
>>> psser ** False
0 1.0
1 1.0
2 NaN
dtype: float64
```
We ought to adjust that.

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a user-facing change: pow between a fractional Series with missing values and a bool literal now returns a different result, following pandas behavior.

### How was this patch tested?
- Add test_pow_with_float_nan UT
- Existing tests in test_pow

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: d52c2de)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
Commit c40d9d46f12d5909bfe18be6376d5216ef320782 by gurwls223
[SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex

### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: c40d9d4)
The file was modified python/docs/source/reference/pyspark.pandas/series.rst (diff)
The file was modified python/docs/source/reference/pyspark.pandas/indexing.rst (diff)
The file was modified python/pyspark/pandas/indexes/category.py (diff)
The file was modified python/pyspark/pandas/categorical.py (diff)
Commit 9a47483f740138d7df4d0f254a935088c78ae72c by gurwls223
[SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents

### What changes were proposed in this pull request?

Update API usage examples in the PySpark pandas API documents.

### Why are the changes needed?

If users try to use the PySpark pandas API by following the documents, they will see some API deprecation warnings.
Updating those documents helps users avoid confusion.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
make html
```

Closes #33519 from yoda-mon/update-pyspark-configurations.

Authored-by: Leona <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 9a47483)
The file was modified python/docs/source/user_guide/pandas_on_spark/best_practices.rst (diff)
The file was modified python/docs/source/user_guide/pandas_on_spark/from_to_dbms.rst (diff)
Commit 068f8d434ad9a6651006151de521d0799db8af52 by wenchen
[SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command

### What changes were proposed in this pull request?

We added the char/varchar support in 3.1, but the string length check is only applied to INSERT, not UPDATE/MERGE. This PR fixes it. This PR also adds the missing type coercion for UPDATE/MERGE.
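
Since no built-in source supports UPDATE/MERGE (as noted in the testing section below), the following is only a conceptual sketch against a hypothetical DSv2 table `target` with a `CHAR(5)` column `c`:
```python
# Hypothetical DSv2 table; with this change the char(5) length check applies to UPDATE as well,
# and compatible-but-mismatched assignment types are coerced during resolution.
spark.sql("UPDATE target SET c = 'value that is too long' WHERE id = 1")
```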

### Why are the changes needed?

Complete the char/varchar support and make UPDATE/MERGE easier to use by applying type coercion.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

New UT. No built-in source supports UPDATE/MERGE, so an end-to-end test is not applicable here.

Closes #33468 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 068f8d4)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala (diff)
Commit ede1bc6b51c23b2d857b497d335b8e7fe3a5e0cc by mszymkiewicz
[SPARK-36211][PYTHON] Correct typing of `udf` return value

The following code should type-check:

```python3
import uuid

import pyspark.sql.functions as F

my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```

### What changes were proposed in this pull request?

The `udf` function should return a more specific type.

### Why are the changes needed?

Right now, `mypy` will throw spurious errors, such as for the code given above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?

Closes #33399 from luranhe/patch-1.

Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(commit: ede1bc6)
The file was modified python/pyspark/sql/_typing.pyi (diff)
The file was modified python/pyspark/sql/functions.pyi (diff)
Commit 674202e7b6d640ff7d9ee1787ef4e8b3ed822207 by gurwls223
[SPARK-36285][INFRA][TESTS] Skip MiMa in PySpark/SparkR/Docker GHA job

### What changes were proposed in this pull request?
This PR aims to skip MiMa in PySpark/SparkR/Docker GHA job.

### Why are the changes needed?
This will save GHA resource because MiMa is irrelevant to Python.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the GHA.

Closes #33532 from williamhyun/mima.

Lead-authored-by: William Hyun <william@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 674202e)
The file was modified dev/run-tests.py (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 8e7e14dc0d182cfe136e103d0b2370844ff661de by wenchen
[SPARK-36241][SQL] Support creating tables with null column

### What changes were proposed in this pull request?
Previously, we blocked creating tables with null-type columns to follow the Hive behavior in PR #28833.
In this PR, I propose to restore the previous behavior and support null-type columns in a table.

### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens to the input query of CTAS, the query will fail because Spark doesn't allow creating a table with a null-type column. From the user's perspective, it's hard to figure out why the null-type column is produced in the complicated query and how to fix it. So removing this constraint is friendlier to users.

### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833. For example, the command below will succeed after this PR:
```sql
CREATE TABLE t (col_1 void, col_2 int)
```
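
Likewise, a CTAS whose input query yields a null-typed column (the scenario described above) should succeed. A minimal sketch, with a made-up table name:
```python
# The bare NULL literal is inferred as a null (void) type column, which is now allowed.
spark.sql("CREATE TABLE t_ctas AS SELECT 1 AS col_1, null AS col_2")
```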

### How was this patch tested?
Newly added and existing test cases.

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8e7e14d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
Commit f4837961a9c4c35eaf71406c22874984b454e8fd by sarutak
[SPARK-34619][SQL][DOCS] Describe ANSI interval types at the `Data types` page of the SQL reference

### What changes were proposed in this pull request?
In the PR, I propose to update the page https://spark.apache.org/docs/latest/sql-ref-datatypes.html and add information about the year-month and day-time interval types introduced by SPARK-27790.

<img width="932" alt="Screenshot 2021-07-27 at 10 38 23" src="https://user-images.githubusercontent.com/1580697/127115289-e633ca3a-2c18-49a0-a7c0-22421ae5c363.png">

### Why are the changes needed?
To inform users about new ANSI interval types, and improve UX with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Should be tested by a GitHub action.

Closes #33518 from MaxGekk/doc-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: f483796)
The file was modified docs/sql-ref-datatypes.md (diff)
Commit 22ac98dcbf48575af7912dab2583e38a2a1b751d by gurwls223
Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package"

This reverts commit 634f96dde40639df5a2ef246884bedbd48b3dc69.

Closes #33533 from viirya/revert-SPARK-36136.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 22ac98d)
The file was added sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneFileSourcePartitionsSuite.scala
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala (diff)
The file was removed sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PrunePartitionSuiteBase.scala
The file was added sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala
The file was removed sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala
Commit df98d5b5f13c95b283b446e4f9f26bf1ec3e4d97 by wenchen
[SPARK-34249][DOCS] Add documentation for ANSI implicit cast rules

### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from https://github.com/apache/spark/pull/31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: df98d5b)
The file was modified docs/sql-ref-ansi-compliance.md (diff)
The file was added docs/img/type-precedence-list.png
Commit f90eb6a5db0778fd18b0b544f93eac3103bbf03b by wenchen
[SPARK-36263][SQL][PYTHON] Add Dataframe.observation to PySpark

### What changes were proposed in this pull request?
With SPARK-34806 we can now easily add an equivalent for `Dataset.observe(Observation, Column, Column*)` to PySpark's `DataFrame` API.

### Why are the changes needed?
This further aligns the Python DataFrame API with Scala Dataset API.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the `Observation` class and the `DataFrame.observe` method.
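
A minimal sketch of the new API, mirroring the Scala `Dataset.observe` usage; the DataFrame `df` and the metric names are assumptions, not part of the change itself:
```python
from pyspark.sql import Observation
from pyspark.sql import functions as F

observation = Observation("my_metrics")
observed = df.observe(observation, F.count(F.lit(1)).alias("cnt"), F.max("id").alias("max_id"))
observed.count()    # metrics become available once an action has consumed the rows
observation.get     # e.g. {'cnt': ..., 'max_id': ...}
```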

### How was this patch tested?
Adds test `test_observe` to `pyspark.sql.tests.test_dataframe`.

Closes #33484 from EnricoMi/branch-observation-python.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f90eb6a)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/sql/__init__.py (diff)
The file was modified python/pyspark/sql/__init__.pyi (diff)
The file was added python/pyspark/sql/observation.pyi
The file was modified python/pyspark/sql/dataframe.pyi (diff)
The file was added python/pyspark/sql/observation.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was modified python/pyspark/sql/tests/test_dataframe.py (diff)
Commit bcc595c112a23d8e3024ace50f0dbc7eab7144b2 by gurwls223
[SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()

### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.

### Why are the changes needed?

`IndexOpsMixin.hasnans` has a potential issue that can cause the `a window function inside an aggregate function` error.
Also it returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.any()` is safe for both cases.
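
A short sketch of the expected behavior after the fix, assuming the same API as the snippets above:
```python
import pyspark.pandas as ps

ps.Series([], dtype=float).hasnans   # False, matching pandas
ps.Series([1.0, None]).hasnans       # True
```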

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: bcc595c)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
Commit 1fafa8e191430d7a8d6a6eb5aa19056108f310c9 by viirya
[SPARK-36314][SS] Update Sessionization examples to use native support of session window

### What changes were proposed in this pull request?

This PR proposes to update Sessionization examples to use native support of session window. It also adds the example for PySpark as native support of session window is available to PySpark as well.
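
A minimal sketch of what native session windows look like in PySpark; the `events` DataFrame and its `timestamp`/`userId` columns are assumptions, not the example's actual schema:
```python
from pyspark.sql import functions as F

sessionized = (events
    .groupBy(F.session_window(F.col("timestamp"), "5 minutes"), F.col("userId"))
    .count())
```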

### Why are the changes needed?

We should demonstrate the simplest way to achieve the same workload. I'll provide another example for cases we can't handle with native support of session window.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #33548 from HeartSaVioR/SPARK-36314.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 1fafa8e)
The file was modified examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (diff)
The file was added examples/src/main/python/sql/streaming/structured_sessionization.py
Commit 1614d004174c1aeda0c1511d3cba92cf55fc14b0 by sarutak
[SPARK-36318][SQL][DOCS] Update docs about mapping of ANSI interval types to Java/Scala/SQL types

### What changes were proposed in this pull request?
1. Update the tables at https://spark.apache.org/docs/latest/sql-ref-datatypes.html about mapping ANSI interval types to Java/Scala/SQL types.
2. Remove `CalendarIntervalType` from the table of mapping Catalyst types to SQL types.

<img width="1028" alt="Screenshot 2021-07-27 at 20 52 57" src="https://user-images.githubusercontent.com/1580697/127204790-7ccb9c64-daf2-427d-963e-b7367aaa3439.png">
<img width="1017" alt="Screenshot 2021-07-27 at 20 53 22" src="https://user-images.githubusercontent.com/1580697/127204806-a0a51950-3c2d-4198-8a22-0f6614bb1487.png">

### Why are the changes needed?
To inform users which types from language APIs should be used as ANSI interval types.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually checking by building the docs:
```
$ SKIP_RDOC=1 SKIP_API=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```

Closes #33543 from MaxGekk/doc-interval-type-lang-api.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 1614d00)
The file was modified docs/sql-ref-datatypes.md (diff)
Commit c8dd97d4566e4cd6865437c2640467c9c16080d4 by wenchen
[SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up

### What changes were proposed in this pull request?
Update the Java docs and the JDBC data source doc, and address follow-up comments.

### Why are the changes needed?
Update docs and address follow-up comments.

### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.
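
A hedged sketch of turning the option on for a JDBC read; the connection details and table/column names are placeholders:
```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "sales")
      .option("pushDownAggregate", "true")
      .load())
df.groupBy("region").sum("amount")   # eligible MIN/MAX/COUNT/SUM aggregates may be pushed to the database
```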

### How was this patch tested?
manually checked

Closes #33526 from huaxingao/aggPD_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c8dd97d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Min.java (diff)
The file was modified docs/sql-data-sources-jdbc.md (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Aggregation.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownAggregates.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Sum.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala (diff)
Commit 23a6ffa5dc6d2330ea1c3e2b0890328e7d2d0f5d by wenchen
[SPARK-36275][SQL] ResolveAggregateFunctions should works with nested fields

### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.
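
A hedged sketch of the kind of query affected, where a grouped nested field also appears in HAVING and ORDER BY (the table and its struct column are made up):
```python
spark.sql("""
    SELECT s.a, count(*) AS cnt
    FROM t
    GROUP BY s.a
    HAVING s.a > 0
    ORDER BY s.a
""")
```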

### Why are the changes needed?
To fix an analyzer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33498 from allisonwang-db/spark-36275-resolve-agg-func.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 23a6ffa)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 41a16ebf1196bec86aec104e72fd7fb1597c0073 by wenchen
[SPARK-35639][SQL] Add metrics about coalesced partitions to AQEShuffleRead in AQE

### What changes were proposed in this pull request?

AQEShuffleReadExec already reports "number of skewed partitions" and "number of skewed partition splits".
It would be useful to also report "number of coalesced partitions", and for ShuffleExchange to report "number of partitions".
This way it's clear what happened on the map side and on the reduce side.

![Metrics](https://user-images.githubusercontent.com/4297661/126729820-cf01b3fa-7bc4-44a5-8098-91689766a68a.png)

### Why are the changes needed?

Improves usability

### Does this PR introduce _any_ user-facing change?

Yes, it now provides more information about `AQEShuffleReadExec` operator behavior in the metrics system.

### How was this patch tested?

Existing tests

Closes #32776 from ekoifman/PRISM-91635-customshufflereader-sql-metrics.

Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 41a16eb)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEShuffleReadExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit 59e0c25376e1b3d227a1dc9ed93a7593314eddb3 by wenchen
[SPARK-36312][SQL] ParquetWriterSupport.setSchema should check inner field

### What changes were proposed in this pull request?
The previous PR only added the inner field name check for Hive DDL; this PR adds the check to the Parquet data source write API as well.
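
A hedged sketch of the case the check targets: writing a nested field whose name Parquet does not accept (the struct field name and output path are made up):
```python
df = spark.sql("SELECT named_struct('a b', 1) AS s")   # inner field name contains a space
df.write.parquet("/tmp/spark-36312-out")               # now rejected up front with an AnalysisException
```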

### Why are the changes needed?
To fail earlier, with an `AnalysisException` at analysis time instead of a `SparkException` at write time.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Without this check, the UT failed as:
```
[info] - SPARK-36312: ParquetWriteSupport should check inner field *** FAILED *** (8 seconds, 29 milliseconds)
[info]   Expected exception org.apache.spark.sql.AnalysisException to be thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
[info]   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info]   Cause: org.apache.spark.SparkException: Job aborted.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
[info]   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
[info]   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
[info]   at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:115)
[info]   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
[info]   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
[info]   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
[info]   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[info]   at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
[in
```

Closes #33531 from AngersZhuuuu/SPARK-36312.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 59e0c25)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (diff)
Commit c9a7ff3f36838fad5b62fa5d9be020aa465e4193 by wenchen
[SPARK-34399][DOCS][FOLLOWUP] Add docs for the new metrics of task/job commit time

### What changes were proposed in this pull request?

This is follow-up of https://github.com/apache/spark/pull/31522.
It adds docs for the new metrics of task/job commit time

### Why are the changes needed?

So that users can understand the metrics better and know that the new metrics are only for file table writes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/127198210-2ab201d3-5fca-4065-ace6-0b930390380f.png)

Closes #33542 from gengliangwang/addDocForMetrics.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c9a7ff3)
The file was modified docs/web-ui.md (diff)
Commit 809b88a16287ffb87835b72419fdc9150b9de0e0 by wenchen
[SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier

### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33200 from imback82/alter_add_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 809b88a)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/CreateTablePartitioningValidationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
Commit 86f44578e5204487930f334aecdd97255681a3fc by wenchen
[SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too

### What changes were proposed in this pull request?
Unify the schema check code of `FileFormat` and also check Avro schema field names at CREATE TABLE DDL time.
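
A hedged sketch of a DDL the new check should reject at CREATE TABLE time, since Avro field names cannot contain spaces (the table name is made up, and which DDL paths are covered depends on the unified check):
```python
spark.sql("CREATE TABLE avro_bad (`a b` INT) USING avro")   # fails the Avro schema field name check
```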

### Why are the changes needed?
Refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33441 from AngersZhuuuu/SPARK-36202.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 86f4457)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
Commit f086c17b8e1b74a3493b7381b36323afe0be3df5 by wenchen
[SPARK-36312][SQL][FOLLOWUP] Add back ParquetSchemaConverter.checkFieldNames

### What changes were proposed in this pull request?
Add back ParquetSchemaConverter.checkFieldNames()

### Why are the changes needed?
Fix code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #33552 from AngersZhuuuu/SPARK-36312-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f086c17)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala (diff)
Commit f12793de205a46c75529ac20649e2e61edc0b144 by wenchen
[SPARK-35320][SQL] Improve error message for unsupported key types in MapType in from_json expression

### What changes were proposed in this pull request?

Currently, when a map is parsed by the from_json function, only StringType keys are supported. If you try to parse another key type, it results in a cast exception.
For example:
```scala
Seq((s"""{"2021-05-05T20:05:08": "sampleValue"}"""))
  .toDF("value")
  .withColumn("value1", from_json(col("value"),  MapType(TimestampType, StringType)))
  .show
```
```
Exception in thread "main" java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class java.lang.Long (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; java.lang.Long is in module java.base of loader 'bootstrap')
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$8$adapted(Cast.scala:297)
at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$7(Cast.scala:297)
```
This PR proposes to improve the error message.
```
org.apache.spark.sql.AnalysisException: cannot resolve 'entries' due to data type mismatch: Input schema map<timestamp,string> can only contain StringType as a key type for a MapType.;
'Project [unresolvedalias(from_json(MapType(TimestampType,StringType,true), value#1, Some(America/Los_Angeles)), Some(org.apache.spark.sql.Column$$Lambda$1496/54693608710e5bf9c))]
+- LocalRelation [value#1]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:197)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:182)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
...
```
In https://github.com/apache/spark/pull/32599 we decided to improve the error message instead of supporting this.

### Why are the changes needed?

Avoids confusion when interpreting the error.

### Does this PR introduce _any_ user-facing change?

Yes, the error message returned in this case changes.

### How was this patch tested?

Unit testing and manual testing

Closes #33525 from planga82/feature/spark35320_improve_error_message.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f12793d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala (diff)
Commit aff182633100e1e9ae329375422a7a7d41d6a275 by gurwls223
[SPARK-36270][BUILD][FOLLOWUP] Fix typo in the sbt memory setting

### What changes were proposed in this pull request?
remove the `$` in the `-Xms$256m` memory options

### Why are the changes needed?
fix typo

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not needed

Closes #33557 from linhongliu-db/minor-fix.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: aff1826)
The file was modified build/sbt-launch-lib.bash (diff)
Commit 3c76a924ce2c930a561afcc3e017d9ce5b54cdf5 by gurwls223
[SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns

### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps a copy of the anchor DataFrame, which holds unnecessary columns.
We can drop those columns during `Series`/`Index.copy()`, as sketched below.
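
A short sketch of the scenario, with made-up column names: a Series sliced from a wider DataFrame previously kept the whole anchor frame, and after this fix the copy keeps only what it needs.
```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
psser = psdf["a"].copy()      # internal frame no longer carries the unused column "b"
psidx = psdf.index.copy()     # same for Index.copy()
```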

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 3c76a92)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/categorical.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
Commit af6d04b65ca45c0d8e1aacc93da7be271b586506 by wenchen
[SPARK-36095][CORE] Grouping exception in core/rdd

### What changes were proposed in this pull request?
This PR groups exception messages in core/src/main/scala/org/apache/spark/rdd.

### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #33317 from dgd-contributor/SPARK-36095_GroupExceptionCoreRdd.

Lead-authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: af6d04b)
The file was modified core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/BlockRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/LocalCheckpointRDD.scala (diff)
The file was added core/src/main/scala/org/apache/spark/errors/SparkCoreErrors.scala
The file was modified core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/RDD.scala (diff)
Commit e1c50ff779f78b22c983dbe5db4550b6dc8daa18 by wenchen
[SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and return wrong value on overflow

### What changes were proposed in this pull request?
1/ conv() has inconsistent behavior: the returned value differs once the input string exceeds the 64-character threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ conv() should return a result equal to the maximum unsigned long value, expressed in base toBase, when there is overflow

```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix. This pull request aims to make the conv function behave similarly to the conv function in MySQL.
### Does this PR introduce _any_ user-facing change?
Change in the result of the conv() function.
### How was this patch tested?
Added tests.

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e1c50ff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala (diff)
Commit 01213095e2c26a4b0940735daf290c90d3267f51 by ueshin
[SPARK-36143][PYTHON] Adjust `astype` of fractional Series with missing values to follow pandas

### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of fractional Series with missing values doesn't behave the same as pandas; for example, a float Series returns itself when cast to integer with `astype`, while pandas raises a ValueError.

We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 0121309)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
Commit eb4d1c03326cf84d5588023d17744f1d81faf21f by viirya
[SPARK-36236][SS] Additional metrics for RocksDB based state store implementation

### What changes were proposed in this pull request?

Proposing to add new metrics to `customMetrics` under `stateOperators` in the `StreamingQueryProgress` event. These metrics give better visibility into the RocksDB-based state store in streaming jobs. For full details of the metrics, refer to https://issues.apache.org/jira/browse/SPARK-36236.
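
A hedged sketch of where the new metrics surface on the Python side (the exact metric keys are listed in the JIRA and are not reproduced here):
```python
progress = query.lastProgress                # `query` is a running StreamingQuery
for op in progress["stateOperators"]:
    print(op.get("customMetrics", {}))       # the new RocksDB metrics appear here
```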

### Why are the changes needed?

The current metrics available for the RocksDB state store do not provide observability into many operations, such as how much time RocksDB spends in compaction and what the cache hit ratio is. These metrics help compare performance differences in state store operations between slow and fast microbatches.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #33455 from vkorukanti/rocksdb-metrics.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: eb4d1c0)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreIntegrationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala (diff)