Changes

Summary

  1. [SPARK-35490][BUILD] Update json4s to 3.7.0-M11 (commit: e3c6907) (details)
  2. [SPARK-35509][DOCS] Move text data source options from Python and Scala (commit: 79a6b0c) (details)
  3. [SPARK-35527][SQL][TESTS] Fix HiveExternalCatalogVersionsSuite to pass (commit: 50fefc6) (details)
  4. [SPARK-35501][SQL][TESTS] Add a feature for removing pulled container (commit: 116a97e) (details)
  5. [SPARK-35529][SQL] Add fallback metrics for hash aggregate (commit: dd67777) (details)
  6. [SPARK-35282][SQL] Support AQE side shuffled hash join formula using (commit: dc7b5a9) (details)
  7. [SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps (commit: 266608d) (details)
  8. [SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType (commit: 8cc7232) (details)
  9. [SPARK-35537][PYTHON] Introduce a util function spark_column_equals (commit: d6d3209) (details)
  10. [SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases (commit: 79a2a46) (details)
  11. [SPARK-35351][SQL][FOLLOWUP] Avoid using `loaded` variable for LEFT ANTI (commit: 5cc17ba) (details)
  12. [SPARK-35057][SQL] Group exception messages in hive/thriftserver (commit: 3e19080) (details)
  13. [SPARK-35535][SQL] New data source V2 API: LocalScan (commit: 5bcd1c2) (details)
  14. [SPARK-33428][SQL] Match the behavior of conv function to MySQL's (commit: 52a1f8c) (details)
  15. [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata (commit: f98a063) (details)
  16. [SPARK-35541][SQL] Simplify OptimizeSkewedJoin (commit: 29ed1a2) (details)
  17. [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA (commit: 0a74ad6) (details)
  18. [SPARK-35530][ML][TESTS] Fix rounding error in (commit: 3267b17) (details)
  19. Revert "[SPARK-35483][INFRA] Add docker-integration-tests to (commit: d189cf7) (details)
  20. [SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA (commit: 2de19e4) (details)
  21. [SPARK-35510][PYTHON] Fix and reenable (commit: 7eb7448) (details)
  22. [SPARK-35552][SQL] Make query stage materialized more readable (commit: 3b94aad) (details)
  23. [SPARK-35194][SQL] Refactor nested column aliasing for readability (commit: e863166) (details)
Commit e3c6907c99ff8c1d4b127b2c98944b7c0407cb02 by max.gekk
[SPARK-35490][BUILD] Update json4s to 3.7.0-M11

### What changes were proposed in this pull request?
This PR aims to upgrade json4s from   3.7.0-M5  to 3.7.0-M11

Note: json4s version greater than 3.7.0-M11 is not binary compatible with Spark third party jars

### Why are the changes needed?
Multiple defect fixes and improvements  like

https://github.com/json4s/json4s/issues/750
https://github.com/json4s/json4s/issues/554
https://github.com/json4s/json4s/issues/715

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran with the existing UTs

Closes #32636 from vinodkc/br_build_upgrade_json4s.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: e3c6907)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala (diff)
The file was modifieddev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modifieddev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala (diff)
The file was modifiedpom.xml (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQueryStatusAndProgressSuite.scala (diff)
Commit 79a6b0cc8a75875329849650227120aa1324403b by gurwls223
[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes move text data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for text data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "Text Files" page
<img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png">

- Python
<img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png">

- Scala
<img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png">

- Java
<img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32660 from itholic/SPARK-35509.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 79a6b0c)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modifiedpython/pyspark/sql/streaming.py (diff)
The file was modifieddocs/sql-data-sources-text.md (diff)
The file was modifiedpython/pyspark/sql/readwriter.py (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
Commit 50fefc64472a4d9295d6870e2eb918a52f3a5d19 by gurwls223
[SPARK-35527][SQL][TESTS] Fix HiveExternalCatalogVersionsSuite to pass with Java 11

### What changes were proposed in this pull request?

This PR fixes `HiveExternalCatalogVersionsSuite`.
With this change, only <major>.<minor> version is set to `spark.sql.hive.metastore.version`.

### Why are the changes needed?

I'm personally checking whether all the tests pass with Java 11 for the current `master` and I found `HiveExternalCatalogVersionsSuite` fails.
The reason is that Spark 3.0.2 and 3.1.1 doesn't accept `2.3.8` as a hive metastore version.

`HiveExternalCatalogVersionsSuite` downloads Spark releases from https://dist.apache.org/repos/dist/release/spark/ and run test for each release. The Spark releases are `3.0.2` and `3.1.1` for the current `master` for now.
https://github.com/apache/spark/blob/e47e615c0ede9692fd3aa1098155e92e4fb50b7f/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L239-L259

With Java 11, the suite run with a hive metastore version which corresponds to the builtin Hive version and it's `2.3.8` for the current `master`.
https://github.com/apache/spark/blob/20750a3f9e13a2f02860859f87bbc38a18cba85e/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala#L62-L66

But `branch-3.0` and `branch-3.1` doesn't accept `2.3.8`, the suite with Java 11 fails.

Another solution would be backporting SPARK-34271 (#31371) but after [a discussion](https://github.com/apache/spark/pull/32668#issuecomment-848435170), we prefer to fix the test,

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests with CI.

Closes #32670 from sarutak/fix-version-suite-for-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 50fefc6)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 116a97e1534ccde8b8cc4df7560d988c5552fb91 by gurwls223
[SPARK-35501][SQL][TESTS] Add a feature for removing pulled container image for docker integration tests

### What changes were proposed in this pull request?

This PR adds a feature for removing pulled container image after every docker integration test finish.
This feature is enabled by the new propoerty `spark.tes.docker.removePulledImage`.

### Why are the changes needed?

For idempotent.
I'm trying to add docker integration tests to GA in SPARK-35483 (#32631) but I noticed that `jdbc.OracleIntegrationSuite` consistently fails(https://github.com/sarutak/spark/runs/2646707235?check_suite_focus=true).
I investigated the reason and I found it's short of the storage capacity of the host on GA.
```
ORACLE PASSWORD FOR SYS AND SYSTEM: oracle
The location '/opt/oracle' specified for database files has insufficient space.
Database creation needs at least '4.5GB' disk space.
Specify a different database file destination that has enough space in the configuration file '/etc/sysconfig/oracle-xe-18c.conf'.
mv: cannot stat '/opt/oracle/product/18c/dbhomeXE/dbs/spfileXE.ora': No such file or directory
mv: cannot stat '/opt/oracle/product/18c/dbhomeXE/dbs/orapwXE': No such file or directory
ORACLE_HOME = [/home/oracle] ? ORACLE_BASE environment variable is not being set since this
information is not available for the current user ID .
You can set ORACLE_BASE manually if it is required.
Resetting ORACLE_BASE to its previous value or ORACLE_HOME
The Oracle base remains unchanged with value /opt/oracle
#####################################
########### E R R O R ###############
DATABASE SETUP WAS NOT SUCCESSFUL!
Please check output for further info!
########### E R R O R ###############
#####################################
The following output is now a tail of the alert.log:
tail: cannot open '/opt/oracle/diag/rdbms/*/*/trace/alert*.log' for reading: No such file or directory
tail: no files remaining
```

With this feature, pulled container image is removed and keep the capacity for `jdbc.OracleIntegrationSuite` in GA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the following things.

* A container image which is absent in the local repository is removed after test finished if `spark.test.container.removePulledImage` is `true`.
* A container image which is present in the local repository is not removed after the finished even if `spark.test.container.removePulledImage` is `true`.
* A container image is not removed regardless of presence of the container image in the local repository even if `spark.test.container.removePulledImage` is `true`.

Closes #32652 from sarutak/docker-image-rm.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 116a97e)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
The file was modifiedpom.xml (diff)
Commit dd677770d88324bdbac13e48ecb9597c0f176a81 by wenchen
[SPARK-35529][SQL] Add fallback metrics for hash aggregate

### What changes were proposed in this pull request?

Add the metrics to record how many tasks fallback to sort-based aggregation for hash aggregation. This will help developers and users to debug and optimize query. Object hash aggregation has similar metrics already.

### Why are the changes needed?

Help developers and users to debug and optimize query with hash aggregation.

### Does this PR introduce _any_ user-facing change?

Yes, the added metrics will show up in Spark web UI.
Example:
<img width="604" alt="Screen Shot 2021-05-26 at 12 17 08 AM" src="https://user-images.githubusercontent.com/4629931/119618437-bf3c5880-bdb7-11eb-89bb-5b88db78639f.png">

### How was this patch tested?

Changed unit test in `SQLMetricsSuite.scala`.

Closes #32671 from c21/agg-metrics.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: dd67777)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectHashAggregateExec.scala (diff)
Commit dc7b5a99f07e172de2c7530e6a7c14dcf6e4f217 by wenchen
[SPARK-35282][SQL] Support AQE side shuffled hash join formula using rule

### What changes were proposed in this pull request?

The main code change is:
* Change rule `DemoteBroadcastHashJoin` to `DynamicJoinSelection` and add shuffle hash join selection code.
* Specify a join strategy hint `SHUFFLE_HASH` if AQE think a join can be converted to SHJ.
* Skip `preferSortMerge` config check in AQE side if a join can be converted to SHJ.

### Why are the changes needed?

Use AQE runtime statistics to decide if we can use shuffled hash join instead of sort merge join. Currently, the formula of shuffled hash join selection dose not work due to the dymanic shuffle partition number.

Add a new config spark.sql.adaptive.shuffledHashJoinLocalMapThreshold to decide if join can be converted to shuffled hash join safely.

### Does this PR introduce _any_ user-facing change?

Yes, add a new config.

### How was this patch tested?

Add test.

Closes #32550 from ulysses-you/SPARK-35282-2.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: dc7b5a9)
The file was removedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DemoteBroadcastHashJoin.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala (diff)
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 266608d50e866d12543fc13808c02d324a043cef by ueshin
[SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps

### What changes were proposed in this pull request?

The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.

### Why are the changes needed?

StructType, ArrayType, and MapType are not accepted by DataTypeOps now.

We should handle these complex types. Among them:

- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.

- StructOps will be helpful to make to/from pandas conversion data-type-based.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0    [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0    [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```

### How was this patch tested?

Unit tests.

Closes #32626 from xinrong-databricks/datatypeop_complex.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 266608d)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was modifiedpython/pyspark/pandas/data_type_ops/base.py (diff)
The file was addedpython/pyspark/pandas/data_type_ops/complex_ops.py
The file was addedpython/pyspark/pandas/tests/data_type_ops/test_complex_ops.py
Commit 8cc7232ffaa7f5de2c25a2f7fa172f1e111addac by ueshin
[SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType

### What changes were proposed in this pull request?

BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.

### Why are the changes needed?

The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its special way of addition, which means concatenation.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.

```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0    [49, 49]
1    [50, 50]
2    [51, 51]
dtype: object
>>> psser + b'1'
0    [49, 49]
1    [50, 49]
2    [51, 49]
dtype: object
```

### How was this patch tested?

Unit tests.

Closes #32665 from xinrong-databricks/datatypeops_binary.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 8cc7232)
The file was modifiedpython/pyspark/pandas/data_type_ops/base.py (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was addedpython/pyspark/pandas/tests/data_type_ops/test_binary_ops.py
The file was addedpython/pyspark/pandas/data_type_ops/binary_ops.py
Commit d6d3209c2f4beb51529c89318fc0070140c0dc84 by gurwls223
[SPARK-35537][PYTHON] Introduce a util function spark_column_equals

### What changes were proposed in this pull request?

Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not.

### Why are the changes needed?

In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one.
We should introduce a util function for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The existing tests.

Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: d6d3209)
The file was modifiedpython/pyspark/pandas/tests/test_internal.py (diff)
The file was modifiedpython/pyspark/pandas/tests/test_namespace.py (diff)
The file was modifiedpython/pyspark/pandas/utils.py (diff)
The file was modifiedpython/pyspark/pandas/internal.py (diff)
The file was modifiedpython/pyspark/pandas/indexing.py (diff)
Commit 79a2a46cdb56dfe15ccaa1fc9be6dee6c31adf2e by gurwls223
[SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases

### What changes were proposed in this pull request?

Re-enable some pandas-on-Spark test cases.

### Why are the changes needed?

pandas version in GitHub Actions is upgraded now so we can re-enable  some pandas-on-Spark test cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32682 from xinrong-databricks/enable_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 79a2a46)
The file was modifiedpython/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modifiedpython/pyspark/pandas/tests/test_series.py (diff)
Commit 5cc17ba0c7d81278a622fb474cf18f8e8530335f by wenchen
[SPARK-35351][SQL][FOLLOWUP] Avoid using `loaded` variable for LEFT ANTI SMJ code-gen

### What changes were proposed in this pull request?

This is a followup from https://github.com/apache/spark/pull/32547#discussion_r639916474, where for LEFT ANTI join, we do not need to depend on `loaded` variable, as in `codegenAnti` we only load `streamedAfter` no more than once (i.e. assign column values from streamed row which are not used in join condition).

### Why are the changes needed?

Avoid unnecessary processing in code-gen (though it's just `boolean $loaded = false;`, and `if (!$loaded) { $loaded = true; }`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unite tests in `ExistenceJoinSuite`.

Closes #32681 from c21/join-followup.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 5cc17ba)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala (diff)
Commit 3e190807bcbc90ec8105c44e38cf10eeeecde788 by wenchen
[SPARK-35057][SQL] Group exception messages in hive/thriftserver

### What changes were proposed in this pull request?
This PR group exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32646 from beliefer/SPARK-35057.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3e19080)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala (diff)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/ThriftServerTab.scala (diff)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkOperation.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
The file was addedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServerErrors.scala
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala (diff)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIService.scala (diff)
Commit 5bcd1c29f0e8442d73d3985253ccadb6f4078044 by gurwls223
[SPARK-35535][SQL] New data source V2 API: LocalScan

### What changes were proposed in this pull request?

Add a new data source V2 API: `LocalScan`. It is a special Scan that will happen on Driver locally instead of Executors.

### Why are the changes needed?

The new API improves the flexibility of the DSV2 API. It allows developers to implement connectors for data sources of small data sizes.
For example, we can build a data source for Spark History applications from Spark History Server RESTFUL API. The result set is small and fetching all the results from the Spark driver is good enough. Making it a data source allows us to operate SQL queries with filters or table joins.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test

Closes #32678 from gengliangwang/LocalScan.

Lead-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 5bcd1c2)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was addedsql/core/src/main/java/org/apache/spark/sql/connector/read/LocalScan.java
The file was addedsql/core/src/test/scala/org/apache/spark/sql/connector/LocalScanSuite.scala
Commit 52a1f8c0003d1713d661d197b39737f545723309 by wenchen
[SPARK-33428][SQL] Match the behavior of conv function to MySQL's

### What changes were proposed in this pull request?
Spark conv function is from MySQL and it's better to follow the MySQL behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it.

However, seems Spark has different behavior in two cases:

MySQL allows leading spaces but Spark does not.
If the input string is way too long, Spark fails with ArrayIndexOutOfBoundException

This patch now help conv follow behavior in those two cases
conv allows leading spaces
conv will return the max unsigned long when the input string is way too long

### Why are the changes needed?
fixing it to match the behavior of conv function to the (almost) only one reference of another DBMS, MySQL

### Does this PR introduce _any_ user-facing change?
Yes, as pointed out above

### How was this patch tested?
Add test

Closes #32684 from dgd-contributor/SPARK-33428.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 52a1f8c)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/NumberConverter.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (diff)
Commit f98a063a4b9f9930a9c27a78617daa7569506151 by kabhwan.opensource
[SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata

### What changes were proposed in this pull request?
Initial implementation of RocksDBCheckpointMetadata. It persists the metadata for RocksDBFileManager.

### Why are the changes needed?
The RocksDBCheckpointMetadata persists the metadata for each committed batch in JSON format. The object contains all RocksDB file names and the number of total keys.
The metadata binds closely with the directory structure of RocksDBFileManager, as described in the design doc - [Directory Structure and Format for Files stored in DFS](https://docs.google.com/document/d/10wVGaUorgPt4iVe4phunAcjU924fa3-_Kf29-2nxH6Y/edit#heading=h.zgvw85ijoz2).

### Does this PR introduce _any_ user-facing change?
No. Internal implementation only.

### How was this patch tested?
New UT added.

Closes #32272 from xuanyuanking/SPARK-35172.

Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: f98a063)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala
Commit 29ed1a2de42e7a663f764192fce157a9f23029b3 by viirya
[SPARK-35541][SQL] Simplify OptimizeSkewedJoin

### What changes were proposed in this pull request?

Various small code simplification/cleanup for OptimizeSkewedJoin

### Why are the changes needed?

code refactor

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #32685 from cloud-fan/skew-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 29ed1a2)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
Commit 0a74ad66b34534b164739422004cbbc4500db68d by gurwls223
[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA

### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
`doker-integration-tests` can't run if docker is not installed so it run only if `docker-integration-tests` is specified with `--module`.

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32631 from sarutak/docker-integration-test-ga.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 0a74ad6)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala (diff)
The file was modifieddev/run-tests.py (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresNamespaceSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala (diff)
The file was addedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerIntegrationFunSuite.scala
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerKrbJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MariaDBKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala (diff)
The file was modified.github/workflows/build_and_test.yml (diff)
Commit 3267b17713c7e4b5fb12db8d3f02dc57c4a49bf1 by gurwls223
[SPARK-35530][ML][TESTS] Fix rounding error in DifferentiableLossAggregatorSuite with Java 11

### What changes were proposed in this pull request?

This PR fixes an test failure of `DifferentiableLossAggregatorSuite` with Java 11.

### Why are the changes needed?

I'm personally checking whether all the tests pass with Java 11 for the current master and I found DifferentiableLossAggregatorSuite fails.
https://github.com/sarutak/spark/runs/2661859541?check_suite_focus=true#step:9:13895

The reason seems that the implementation of Blas.daxpy is different between for Java 8 and Java 11. For Java 11, `Math.fma` is used.

https://github.com/luhenry/netlib/blob/v2.2.0/blas/src/main/java/dev/ludovic/netlib/blas/Java8BLAS.java#L92
https://github.com/luhenry/netlib/blob/0053ea30b11686336cbdb8c7fceb41d59d268fa2/blas/src/main/java/dev/ludovic/netlib/blas/Java11BLAS.java#L40

To remove the rounding error, this PR changes `TestAggregator.add` with fma.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed `DifferentiableLossAggregatorSuite` passes with both Java 8 and Java 11.

Closes #32673 from sarutak/fix-rounding-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 3267b17)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/optim/aggregator/DifferentiableLossAggregatorSuite.scala (diff)
Commit d189cf75f99dff4211afc4e318b4399c5c6e8219 by gurwls223
Revert "[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA"

This reverts commit 0a74ad66b34534b164739422004cbbc4500db68d.
(commit: d189cf7)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala (diff)
The file was modified.github/workflows/build_and_test.yml (diff)
The file was removedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerIntegrationFunSuite.scala
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresNamespaceSuite.scala (diff)
The file was modifieddev/run-tests.py (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerKrbJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MariaDBKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala (diff)
Commit 2de19e460bb3b305d4d155709fb88ad97efd6142 by gurwls223
[SPARK-35483][INFRA] Add docker-integration-tests to run-tests.py and GA

### What changes were proposed in this pull request?

This PR proposes to add `docker-integratin-tests` to `run-tests.py` and GA.
Once #32631 was merged but there was a lack of consideration.

Diff between this change and https://github.com/apache/spark/pull/32631/commits/692d95d1458993cbb9cbd47014202e84cd6aa328 merged in #32631 is as follows.

```
       if: github.repository != 'apache/spark'
       id: sync-branch
       run: |
+        apache_spark_ref=`git rev-parse HEAD`
         git fetch https://github.com/$GITHUB_REPOSITORY.git ${GITHUB_REF#refs/heads/}
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' merge --no-commit --progress --squash FETCH_HEAD
         git -c user.name='Apache Spark Test Account' -c user.email='sparktestaccgmail.com' commit -m "Merged commit"
+        echo "::set-output name=APACHE_SPARK_REF::$apache_spark_ref"
     - name: Cache Scala, SBT and Maven
       uses: actions/cachev2
       with:
```

### Why are the changes needed?

CI for `docker-integration-tests` is absent for now.

### Does this PR introduce _any_ user-facing change?

GA.

### How was this patch tested?

Closes #32691 from sarutak/docker-integration-test-ga-take2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 2de19e4)
The file was modified.github/workflows/build_and_test.yml (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2KrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was modifieddev/run-tests.py (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MsSqlServerIntegrationSuite.scala (diff)
The file was addedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerIntegrationFunSuite.scala
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MySQLIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCNamespaceTest.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerKrbJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/MariaDBKrbIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/OracleIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresNamespaceSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
The file was modifiedexternal/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DB2IntegrationSuite.scala (diff)
Commit 7eb74482a709c6f7ff51ed82d3e1049e040fad4f by gurwls223
[SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true

### What changes were proposed in this pull request?

This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657.

Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.

pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude boolean case alone in percentile/quartile test case

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I roughly locally tested. But it should pass in CI.

Closes #32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 7eb7448)
The file was modifiedpython/pyspark/pandas/tests/test_stats.py (diff)
Commit 3b94aad5e72a6b96e4a8f517ac60e0a2fed2590b by gengliang
[SPARK-35552][SQL] Make query stage materialized more readable

### What changes were proposed in this pull request?

Add a new method `isMaterialized` in `QueryStageExec`.

### Why are the changes needed?

Currently, we use `resultOption().get.isDefined` to check if a query stage has materialized. The code is not readable at a glance. It's better to use a new method like `isMaterialized` to define it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass CI.

Closes #32689 from ulysses-you/SPARK-35552.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 3b94aad)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEPropagateEmptyRelation.scala (diff)
Commit e8631660ecf316e4333210650d1f40b5912fb11b by wenchen
[SPARK-35194][SQL] Refactor nested column aliasing for readability

### What changes were proposed in this pull request?

Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.

### Why are the changes needed?

Improves readability for future maintenance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32301 from karenfeng/refactor-nested-column-aliasing.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e863166)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala-2.12/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala (diff)
The file was modifiedsql/catalyst/src/main/scala-2.13/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)