Changes

Summary

  1. [SPARK-34474][SQL] Remove unnecessary Union under Distinct/Deduplicate (commit: f7ac2d6) (details)
  2. [SPARK-34152][SQL][FOLLOWUP] Do not uncache the temp view if it doesn't exist (commit: dffb01f) (details)
  3. [SPARK-34505][BUILD] Upgrade Scala to 2.13.5 (commit: 1967760) (details)
  4. [SPARK-34535][SQL] Cleanup unused symbol in Orc related code (commit: 0d3a9cd) (details)
  5. Revert "[SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests" (commit: 4d428a8) (details)
  6. [SPARK-34543][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SET LOCATION` (commit: 5c7d019) (details)
  7. [SPARK-34551][INFRA] Fix credit related scripts to recover, drop Python 2 and work with Python 3 (commit: 5b92531) (details)
  8. [SPARK-34553][INFRA] Rename GITHUB_API_TOKEN to GITHUB_OAUTH_KEY in translate-contributors.py (commit: ac774ec) (details)
  9. [SPARK-34524][SQL] Simplify v2 partition commands resolution (commit: 73857cd) (details)
  10. [SPARK-34533][SQL] Eliminate LEFT ANTI join to empty relation in AQE (commit: 7d5021f) (details)
  11. [SPARK-34550][SQL] Skip InSet null value during push filter to Hive metastore (commit: 82267ac) (details)
  12. [SPARK-34549][BUILD] Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844 (commit: a9e8e05) (details)
  13. [SPARK-34554][SQL] Implement the copy() method in ColumnarMap (commit: c1beb16) (details)
  14. [SPARK-33971][SQL] Eliminate distinct from more aggregates (commit: 67ec4f7) (details)
  15. [MINOR] Add more known translations of contributors (commit: 8d68f3f) (details)
  16. [SPARK-34392][SQL] Support ZoneOffset +h:mm in DateTimeUtils.getZoneId (commit: 56e664c) (details)
  17. [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition (commit: 05069ff) (details)
  18. [SPARK-34557][BUILD] Exclude Avro's transitive zstd-jni dependency (commit: d758210) (details)
  19. [SPARK-34559][BUILD] Upgrade to ZSTD JNI 1.4.8-6 (commit: 1aeafb4) (details)
  20. [SPARK-34415][ML] Randomization in hyperparameter optimization (commit: 397b843) (details)
  21. [SPARK-34479][SQL] Add zstandard codec to Avro compression codec list (commit: 54c053a) (details)
  22. [SPARK-34415][ML] Python example (commit: 5a48eb8) (details)
  23. [SPARK-33687][SQL] Support analyze all tables in a specific database (commit: d07fc30) (details)
  24. [SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior (commit: 0216051) (details)
  25. [SPARK-34579][SQL][TEST] Fix wrong UT in SQLQuerySuite (commit: d574308) (details)
  26. [SPARK-34570][SQL] Remove dead code from constructors of [Hive]SessionStateBuilder (commit: 1afe284) (details)
  27. [SPARK-33212][FOLLOWUP] Add hadoop-yarn-server-web-proxy for Hadoop 3.x (commit: f494c5c) (details)
  28. [SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink (commit: 3d0ee96) (details)
  29. [SPARK-34556][SQL] Checking duplicate static partition columns should respect case sensitive conf (commit: 62737e1) (details)
  30. [SPARK-34574][DOCS] Jekyll fails to generate Scala API docs for Scala 2.13 (commit: a6cc5e6) (details)
Commit f7ac2d655c756100c33e652402cefc507d2493b7 by viirya
[SPARK-34474][SQL] Remove unnecessary Union under Distinct/Deduplicate

### What changes were proposed in this pull request?

This patch proposes to let the optimizer remove an unnecessary `Union` under `Distinct`/`Deduplicate`.

### Why are the changes needed?

For a `Union` under `Distinct`/`Deduplicate`, if its children are all the same, we can keep just one of them and remove the `Union`.
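
A hedged illustration of the effect (runnable in spark-shell, where `spark` is predefined; the names are made up, not from the patch):
```scala
// Distinct(Union(df, df)) returns the same rows as Distinct(df), so the
// optimizer can drop the Union when all of its children are identical.
val df = spark.range(3).toDF("id")
val deduped = df.union(df).distinct()  // now optimized to df.distinct()
```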

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #31595 from viirya/remove-union.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: f7ac2d6)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveNoopUnionSuite.scala
The file was modified sql/core/src/test/resources/sql-tests/inputs/explain.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/explain.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out (diff)
Commit dffb01f28a1f5fb6445e59e1c8eefdd683fe4a29 by dhyun
[SPARK-34152][SQL][FOLLOWUP] Do not uncache the temp view if it doesn't exist

### What changes were proposed in this pull request?

This PR fixes a mistake in https://github.com/apache/spark/pull/31273. When running CREATE OR REPLACE on a temp view, we need to uncache the to-be-replaced existing temp view. However, we shouldn't uncache anything if there is no existing temp view.

This doesn't cause real issues because the uncache action is failure-safe, but it produces a lot of warning messages.
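
A hedged illustration of the two paths (view name made up; runnable in spark-shell):
```scala
// First call: no existing temp view `v`, so nothing should be uncached
// (previously this path still attempted an uncache and logged warnings).
spark.sql("CREATE OR REPLACE TEMP VIEW v AS SELECT 1 AS id")
// Second call: `v` exists and is being replaced, so it is uncached.
spark.sql("CREATE OR REPLACE TEMP VIEW v AS SELECT 2 AS id")
```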

### Why are the changes needed?

Avoid unnecessary warning logs.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Manually ran tests and checked the warning messages.

Closes #31650 from cloud-fan/warnning.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: dffb01f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
Commit 1967760277595af5c842402c6f2d1f28dfb18728 by dhyun
[SPARK-34505][BUILD] Upgrade Scala to 2.13.5

### What changes were proposed in this pull request?

This PR aims to update from Scala 2.13.4 to Scala 2.13.5 for Apache Spark 3.2.

### Why are the changes needed?

Scala 2.13.5 is a maintenance release for the 2.13 line and improves Java 13, 14, 15, 16, and 17 support.
- https://github.com/scala/scala/releases/tag/v2.13.5

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the GitHub Action `Scala 2.13` job and manual test.

I verified the following locally and all passed.
```
$ dev/change-scala-version.sh 2.13
$ build/sbt test -Pscala-2.13
```

Closes #31620 from dongjoon-hyun/SPARK-34505.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 1967760)
The file was modified pom.xml (diff)
Commit 0d3a9cd3c9d25fdd35ddf04d0fe2ed1f8fead2a5 by gurwls223
[SPARK-34535][SQL] Cleanup unused symbol in Orc related code

### What changes were proposed in this pull request?
Clean up unused symbols in ORC-related code as follows:

- `OrcDeserializer`: parameter `dataSchema` in the constructor
- `OrcFilters`: parameter `schema` in the method `convertibleFilters`
- `OrcPartitionReaderFactory`: ignore the return value of `OrcUtils.orcResultSchemaString` in the method `buildReader(file: PartitionedFile)`

### Why are the changes needed?
Cleanup code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31644 from LuciferYang/cleanup-orc-unused-symbol.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 0d3a9cd)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala (diff)
Commit 4d428a821b2117789d0a2c61c7229d00af1704eb by dhyun
Revert "[SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests"

This reverts commit b17754a8cbd2593eb2b1952e95a7eeb0f8e09cdb.
(commit: 4d428a8)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/backend/minikube/Minikube.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh (diff)
Commit 5c7d019b609c87a9427fa9309f3aa03d02f61878 by wenchen
[SPARK-34543][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SET LOCATION`

### What changes were proposed in this pull request?
Preprocess the partition spec passed to the V1 `ALTER TABLE .. SET LOCATION` implementation `AlterTableSetLocationCommand`, and normalize the passed spec according to the partition columns w.r.t. the case sensitivity flag **spark.sql.caseSensitive**.

### Why are the changes needed?
V1 `ALTER TABLE .. SET LOCATION` is in fact case sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl (id INT, part INT) PARTITIONED BY (part);
spark-sql> INSERT INTO tbl PARTITION (part=0) SELECT 0;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl' PARTITION (part=0);
Location: file:/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0
spark-sql> ALTER TABLE tbl ADD PARTITION (part=1);
spark-sql> SELECT * FROM tbl;
0 0
```
Create new partition folder in the file system:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa
```
Set new location for the partition part=1:
```sql
spark-sql> ALTER TABLE tbl PARTITION (part=1) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/aaa';
spark-sql> SELECT * FROM tbl;
0 0
0 1
spark-sql> ALTER TABLE tbl ADD PARTITION (PART=2);
spark-sql> SELECT * FROM tbl;
0 0
0 1
```
Set location for a partition in the upper case:
```
$ cp -r /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb
```
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
Error in query: Partition spec is invalid. The spec (PART) must match the partition spec (part) defined in table '`default`.`tbl`'
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> ALTER TABLE tbl PARTITION (PART=2) SET LOCATION '/Users/maximgekk/proj/set-location-case-sense/spark-warehouse/tbl/bbb';
spark-sql> SELECT * FROM tbl;
0 0
0 1
0 2
```

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31651 from MaxGekk/set-location-case-sense.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 5c7d019)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
Commit 5b925319374b11fa30f6d00b9c2e92fbee3aa343 by gurwls223
[SPARK-34551][INFRA] Fix credit related scripts to recover, drop Python 2 and work with Python 3

### What changes were proposed in this pull request?

This PR proposes to make the scripts work by:
- Recovering credit-related scripts that were broken by https://github.com/apache/spark/pull/29563
    (`raw_input`, used in `releaseutils`, exists only in Python 2)
- Dropping Python 2 in these scripts because we dropped Python 2 in https://github.com/apache/spark/pull/28957
- Making these scripts work with Python 3

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only change.

### How was this patch tested?

I manually tested against Spark 3.1.1 RC3.

Closes #31660 from HyukjinKwon/SPARK-34551.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 5b92531)
The file was modified dev/create-release/releaseutils.py (diff)
The file was modified dev/create-release/generate-contributors.py (diff)
The file was modified dev/create-release/translate-contributors.py (diff)
The file was modified dev/requirements.txt (diff)
Commit ac774ec0c2cdc9a5d2e20e5f751ef2e753df352f by gurwls223
[SPARK-34553][INFRA] Rename GITHUB_API_TOKEN to GITHUB_OAUTH_KEY in translate-contributors.py

### What changes were proposed in this pull request?

This PR proposes to add an alias environment variable `GITHUB_OAUTH_KEY` for `GITHUB_API_TOKEN` in `translate-contributors.py` script.

### Why are the changes needed?

```
dev/github_jira_sync.py:GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY")
dev/github_jira_sync.py:        request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/github_jira_sync.py:        request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/merge_spark_pr.py:GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY")
dev/merge_spark_pr.py:        if GITHUB_OAUTH_KEY:
dev/merge_spark_pr.py:            request.add_header('Authorization', 'token %s' % GITHUB_OAUTH_KEY)
dev/run-tests-jenkins.py:    github_oauth_key = os.environ["GITHUB_OAUTH_KEY"]
```

Spark uses `GITHUB_OAUTH_KEY` for the GitHub token, but the `translate-contributors.py` script alone uses `GITHUB_API_TOKEN`. They should match, to make the script easier to run.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

I manually tested by running this script.

Closes #31662 from HyukjinKwon/minor-gh-token-name.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: ac774ec)
The file was modified dev/create-release/translate-contributors.py (diff)
Commit 73857cdd87757d2888bd92f6b7c2fad709701484 by wenchen
[SPARK-34524][SQL] Simplify v2 partition commands resolution

### What changes were proposed in this pull request?

This PR simplifies the resolution of v2 partition commands:
1. Add a common trait for v2 partition commands, so that we don't need to match them one by one in the rules.
2. Make partition spec an expression, so that it's easier to resolve them via tree node transformation.
3. Add `TruncatePartition` so that `TruncateTable` doesn't need to be a v2 partition command.
4. Simplify `CheckAnalysis` to only check if the table is partitioned. For partitioned tables, the partition spec is always resolved, so we don't need to check it. The `SupportsAtomicPartitionManagement` check is also done at runtime. Since Spark eagerly executes commands, an exception at runtime will also be thrown at analysis time.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31637 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 73857cd)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowPartitionsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/TruncateTableParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TruncateTableExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Implicits.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/TruncateTableSuite.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TruncatePartitionExec.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
Commit 7d5021f5eed2b9c48bd02b92cce1535edc46d0e4 by wenchen
[SPARK-34533][SQL] Eliminate LEFT ANTI join to empty relation in AQE

### What changes were proposed in this pull request?

From the review discussion https://github.com/apache/spark/pull/31630#discussion_r581774000 , I discovered that we can eliminate a LEFT ANTI join (with no join condition) to an empty relation if the right side is known to be non-empty. With AQE, this is doable similar to https://github.com/apache/spark/pull/29484 .
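
A hedged sketch of a query shape this applies to (table names made up; runnable in spark-shell against existing tables):
```scala
// With no join condition, a left row survives a LEFT ANTI join only if
// the right side is empty. Once AQE observes at runtime that t2 is
// non-empty, the whole join can be replaced by an empty relation, and
// the left side never has to be executed.
val result = spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2")
```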

### Why are the changes needed?

This can help eliminate the join operator during logical plan optimization.
Before this PR, [the left side's physical plan `execute()` would be called](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L192), so if the left side is complicated (e.g. contains a broadcast exchange operator), some computation would happen. After this PR, the join operator is removed during logical planning, and nothing is computed on the left side. This can potentially save resources for these kinds of queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests for positive and negative queries in `AdaptiveQueryExecSuite.scala`.

Closes #31641 from c21/left-anti-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7d5021f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit 82267acfe8c78a70d56a6ae6ab9a1135c0dc0836 by gurwls223
[SPARK-34550][SQL] Skip InSet null value during push filter to Hive metastore

### What changes were proposed in this pull request?

Skip `InSet` null values when pushing filters to the Hive metastore.

### Why are the changes needed?

If `InSet` contains a null value, we should skip it and push the other values to the metastore, to keep the same behavior as `In`.
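
A hedged sketch of the intended behavior (illustrative only, not the actual `HiveShim` code):
```scala
// Null can never match a partition value, so it is dropped before the
// metastore filter string is built; the remaining values are still pushed.
val inSetValues: Set[Any] = Set(1, 2, null)
val filter = inSetValues
  .filter(_ != null)
  .map(v => s"part = $v")
  .mkString("(", " or ", ")")   // "(part = 1 or part = 2)"
```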

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31659 from ulysses-you/SPARK-34550.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 82267ac)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala (diff)
Commit a9e8e0528a52d19103463bae0a9420127a99bf59 by gurwls223
[SPARK-34549][BUILD] Upgrade aws kinesis to 1.14.0 and java sdk 1.11.844

### What changes were proposed in this pull request?

This patch upgrades the AWS Kinesis and Java SDK versions.

### Why are the changes needed?

Upgrade AWS Kinesis and the Java SDK to meet the minimum requirement for new features like IAM roles for service accounts: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #31658 from viirya/upgrade-aws-sdk.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a9e8e05)
The file was modified pom.xml (diff)
Commit c1beb16cc8db9f61f1b86b5bfa4cd4d603c9b990 by gurwls223
[SPARK-34554][SQL] Implement the copy() method in ColumnarMap

### What changes were proposed in this pull request?
Implement `ColumnarMap.copy()` by using the `copy()` method of `ColumnarArray`.
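
A hedged sketch of the delegation (the actual change is in the Java class; this is an illustrative Scala equivalent):
```scala
import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, MapData}
import org.apache.spark.sql.vectorized.ColumnarMap

// copy() can materialize the key and value ColumnarArrays via their own
// copy() and wrap the results, instead of throwing
// UnsupportedOperationException.
def copyMap(m: ColumnarMap): MapData =
  new ArrayBasedMapData(m.keyArray().copy(), m.valueArray().copy())
```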

### Why are the changes needed?
To eliminate `java.lang.UnsupportedOperationException` while using `ColumnarMap`.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new tests in `ColumnarBatchSuite`.

Closes #31663 from MaxGekk/columnar-map-copy.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: c1beb16)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarMap.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala (diff)
Commit 67ec4f7f67dc494c2619b7faf1b1145f2200b65c by yamamuro
[SPARK-33971][SQL] Eliminate distinct from more aggregates

### What changes were proposed in this pull request?

Add more aggregate expressions to `EliminateDistinct` rule.

### Why are the changes needed?

Distinct aggregation can add a significant overhead. It's better to remove distinct whenever possible.
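
A hedged example (table and column names made up; runnable in spark-shell against an existing table):
```scala
// DISTINCT cannot change the result of duplicate-insensitive aggregates
// such as max/min, so the optimizer can drop it:
spark.sql("SELECT max(DISTINCT v), min(DISTINCT v) FROM t")
// ...is rewritten to the equivalent, cheaper:
spark.sql("SELECT max(v), min(v) FROM t")
```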

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #30999 from tanelk/SPARK-33971_eliminate_distinct.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 67ec4f7)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinctSuite.scala (diff)
Commit 8d68f3f74658c7e0c12ee1d6f09a1aae14d9e04f by gurwls223
[MINOR] Add more known translations of contributors

### What changes were proposed in this pull request?

This PR adds some more known translations of contributors who contributed multiple times in Spark 3.1.1.

### Why are the changes needed?

To make release process easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

N/A (auto-generated)

Closes #31665 from HyukjinKwon/minor-add-known-translations.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 8d68f3f)
The file was modified dev/create-release/known_translations (diff)
Commit 56e664c7179eadeb5134b4418f3aaa6a9d742ef6 by srowen
[SPARK-34392][SQL] Support ZoneOffset +h:mm in DateTimeUtils.getZoneId

### What changes were proposed in this pull request?
To support `+8:00` in Spark 3 when executing SQL such as
`select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")`.
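
A hedged sketch of the normalization idea (not the actual `DateTimeUtils` code): pad the single-digit hour so `java.time` accepts the offset.
```scala
import java.time.ZoneId

// java.time.ZoneId.of rejects the single-digit "+h:mm" form, so pad it
// to "+hh:mm" before parsing; already-padded inputs are left unchanged.
def toZoneId(tz: String): ZoneId =
  ZoneId.of(tz.replaceFirst("(?<sign>[+-])(?<hour>\\d):", "${sign}0${hour}:"))

toZoneId("GMT+8:00")  // GMT+08:00; plain ZoneId.of("GMT+8:00") throws
```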

### Why are the changes needed?
The `+8:00` format is supported in PostgreSQL, Hive, and Presto, but not in Spark 3.
https://issues.apache.org/jira/browse/SPARK-34392

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
unit test

Closes #31624 from Karl-WangSK/zone.

Lead-authored-by: ShiKai Wang <wskqing@gmail.com>
Co-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 56e664c)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
Commit 05069ff4ce1bbacd88b0b8497c97d8a8ca23d5a7 by yamamuro
[SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has 0/1 partition

### What changes were proposed in this pull request?
If the child RDD has zero or one partition, skip the shuffle.
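
A hedged illustration of a case where the optimization applies (runnable in spark-shell):
```scala
// The input has exactly one partition, so CollectLimitExec can take the
// first 10 rows directly instead of shuffling into a single partition.
val rows = spark.range(0L, 100L, step = 1L, numPartitions = 1)
  .limit(10)
  .collect()
```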

### Why are the changes needed?
skip shuffle if possible

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #31468 from zhengruifeng/collect_limit_single_partition.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 05069ff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/TakeOrderedAndProjectSuite.scala (diff)
Commit d75821038f88144918b0814830ba4cb03f739433 by dongjoon
[SPARK-34557][BUILD] Exclude Avro's transitive zstd-jni dependency

### What changes were proposed in this pull request?

This PR aims to exclude `Apache Avro`'s transitive zstd-jni dependency.

### Why are the changes needed?

When SPARK-27733 upgraded Apache Avro from 1.8 to 1.10, a transitive
`zstd-jni` dependency was introduced.

This PR explicitly prevents dependency conflicts.

**BEFORE**
```
$ build/sbt "core/evicted" | grep zstd
[info] * com.github.luben:zstd-jni:1.4.8-5 is selected over 1.4.5-12
```

**AFTER**
```
$ build/sbt "core/evicted" | grep zstd
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31670 from dongjoon-hyun/SPARK-34557.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: d758210)
The file was modified pom.xml (diff)
Commit 1aeafb485298f87c64c5c09ec3a70aad4171209f by dongjoon
[SPARK-34559][BUILD] Upgrade to ZSTD JNI 1.4.8-6

### What changes were proposed in this pull request?

This PR aims to upgrade ZSTD JNI to 1.4.8-6.

### Why are the changes needed?

This fixes the following issue and will unblock SPARK-34479 (Support ZSTD at Avro data source).
- https://github.com/luben/zstd-jni/issues/161

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31674 from dongjoon-hyun/SPARK-34559.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 1aeafb4)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
Commit 397b843890db974a0534394b1907d33d62c2b888 by srowen
[SPARK-34415][ML] Randomization in hyperparameter optimization

### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective technique than a grid search since min/max points can fall between the grid and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
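
A hedged usage sketch (the method and helper names `addRandom` and `Limits` are inferred from the description above and may differ in detail):
```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{Limits, ParamRandomBuilder}

val lr = new LogisticRegression()
// Drop-in replacement for ParamGridBuilder: grid methods still work, and
// one extra method samples a hyperparameter uniformly from a range.
val params = new ParamRandomBuilder()
  .addRandom(lr.regParam, Limits(0.001, 0.1), 5)  // 5 random draws
  .addGrid(lr.fitIntercept)
  .build()
```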

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 397b843)
The file was added mllib/src/test/scala/org/apache/spark/ml/tuning/ParamRandomBuilderSuite.scala
The file was modified docs/ml-tuning.md (diff)
The file was added examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala
The file was added mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
The file was added examples/src/main/java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java
The file was modified python/docs/source/reference/pyspark.ml.rst (diff)
The file was modified python/pyspark/ml/tuning.pyi (diff)
The file was added mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
The file was modified python/pyspark/ml/tuning.py (diff)
The file was modified python/pyspark/ml/tests/test_tuning.py (diff)
Commit 54c053afb0c9d3fcc7ac311100c8db9deeb163c0 by dongjoon
[SPARK-34479][SQL] Add zstandard codec to Avro compression codec list

### What changes were proposed in this pull request?

Avro has supported the zstandard codec since AVRO-2195. This PR adds the zstandard codec to the Avro compression codec list.

### Why are the changes needed?

To make the Avro data source support the zstandard codec.
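
A hedged usage sketch (path made up; runnable in spark-shell):
```scala
// Session-wide default codec for Avro output...
spark.conf.set("spark.sql.avro.compression.codec", "zstandard")
// ...or per write:
val df = spark.range(10).toDF("id")
df.write.format("avro").option("compression", "zstandard").save("/tmp/out-zstd")
```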

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31673 from wangyum/SPARK-34479.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 54c053a)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (diff)
Commit 5a48eb8d00faee3a7c8f023c0699296e22edb893 by srowen
[SPARK-34415][ML] Python example

Missing Python example file for [SPARK-34415][ML] Randomization in hyperparameter optimization
(https://github.com/apache/spark/pull/31535)

### What changes were proposed in this pull request?
For some reason (probably me being silly), examples/src/main/python/ml/model_selection_random_hyperparameters_example.py was not pushed in a previous PR.
This PR restores that file.

### Why are the changes needed?
A single file (examples/src/main/python/ml/model_selection_random_hyperparameters_example.py) should have been pushed as part of SPARK-34415 but was not. This was causing lint errors, as highlighted by dongjoon-hyun. Consequently, srowen asked for a new PR.

### Does this PR introduce _any_ user-facing change?
No, it merely restores a file that was overlooked in SPARK-34415.

### How was this patch tested?
By running:
`bin/spark-submit examples/src/main/python/ml/model_selection_random_hyperparameters_example.py`

Closes #31687 from PhillHenry/SPARK-34415_model_selection_random_hyperparameters_example.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 5a48eb8)
The file was added examples/src/main/python/ml/model_selection_random_hyperparameters_example.py
Commit d07fc3076b296e642ce321f2b2435f3059eeed4c by yamamuro
[SPARK-33687][SQL] Support analyze all tables in a specific database

### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:
```g4
ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```
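
A hedged usage sketch of the new statement (database name made up; the optional trailing identifier is `NOSCAN`, as with `ANALYZE TABLE`):
```scala
// Analyze every table in the database `mydb`.
spark.sql("ANALYZE TABLES IN mydb COMPUTE STATISTICS")
// Analyze all tables in the current database, skipping the row-count scan.
spark.sql("ANALYZE TABLES COMPUTE STATISTICS NOSCAN")
```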

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature is tested by unit tests.
The documentation was tested by regenerating it:

menu-sql.yaml |  sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes #30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: d07fc30)
The file was modified docs/sql-ref-syntax-aux-analyze.md (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified docs/sql-ref-syntax.md (diff)
The file was modified docs/_data/menu-sql.yaml (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala
The file was modified docs/sql-ref-syntax-aux-analyze-table.md (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was added docs/sql-ref-syntax-aux-analyze-tables.md
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (diff)
Commit 0216051acadedcc7e9bcd840aa78776159b200d1 by yamamuro
[SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior

### What changes were proposed in this pull request?
SPARK-33084 added the ability to use ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior, although I found a few cases where it doesn't:

1) The default value of the `transitive` parameter is false, both when the parameter is not specified in the coordinate and when its value is invalid. The Hive behavior is that transitive is [true if not specified](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L169) in the coordinate and [false for invalid values](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L124). Also, regardless of Hive, I think a default of true for the transitive parameter also matches [ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The value of the `transitive` parameter is regarded as case-sensitive, [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](https://github.com/apache/hive/blob/cb2ac3dcc6af276c6f64ee00f034f082fe75222b/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java#L122).

I propose that we be compatible with Hive for these behaviors.

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously, an ivy coordinate without the `transitive` parameter specified did not resolve transitive dependencies; now it does.
2. Previously, a `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` would be treated as false as it did not exactly match `true`. Now it is treated case-insensitively, as the sketch below shows.
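
A hedged sketch of the resulting behavior (the coordinate is illustrative; runnable in spark-shell):
```scala
// Transitive resolution is now the default, matching Hive and Ivy:
spark.sql("ADD JAR ivy://org.apache.logging.log4j:log4j-core:2.14.0")
// The parameter value is matched case-insensitively, so this disables it:
spark.sql("ADD JAR ivy://org.apache.logging.log4j:log4j-core:2.14.0?transitive=FALSE")
```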

### How was this patch tested?

Modified existing unit tests to test the new behavior.
Added a new unit test to cover usage of `exclude` with unspecified `transitive`.

Closes #31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 0216051)
The file was modified core/src/main/scala/org/apache/spark/util/DependencyUtils.scala (diff)
The file was modified docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/SparkContextSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
Commit d574308864816b74372346b1f0b497f2e71c2000 by dongjoon
[SPARK-34579][SQL][TEST] Fix wrong UT in SQLQuerySuite

### What changes were proposed in this pull request?
Some UTs in SQLQuerySuite are incorrect: they use the wrong table name in `withTable`. This PR corrects them.

### Why are the changes needed?
Fix UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #31681 from AngersZhuuuu/SPARK-34569.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: d574308)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala (diff)
Commit 1afe284ed899792a5230b0635ae11ff56ebd8f1b by yamamuro
[SPARK-34570][SQL] Remove dead code from constructors of [Hive]SessionStateBuilder

### What changes were proposed in this pull request?

The parameter `options` is never used. The change here was part of https://github.com/apache/spark/pull/30642; it got reverted as a hotfix by https://github.com/apache/spark/pull/30642/commits/dad24543aa7bb7cc81d2a8522112eb797b015633 to make backporting easier. This PR brings it back to master.

### Why are the changes needed?

Remove useless dead code.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing CI is enough.

Closes #31683 from yaooqinn/SPARK-34570.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 1afe284)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
Commit f494c5cff9d56744f8e7a2b646be6d01de8a09f4 by dongjoon
[SPARK-33212][FOLLOWUP] Add hadoop-yarn-server-web-proxy for Hadoop 3.x profile

### What changes were proposed in this pull request?

This adds `hadoop-yarn-server-web-proxy` as a dependency for YARN and the Hadoop 3.x profile (it is already a dependency for 2.x). It also excludes some dependencies from the module which are already covered by other Hadoop jars used by Spark.

### Why are the changes needed?

The class `org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter` is used by `ApplicationMaster`:
```scala
  private def addAmIpFilter(driver: Option[RpcEndpointRef], proxyBase: String) = {
    val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
    val params = client.getAmIpFilterParams(yarnConf, proxyBase)
    driver match {
      case Some(d) =>
        d.send(AddWebUIFilter(amFilter, params, proxyBase))
   ...
```
and will be loaded at runtime. Therefore, without the above jar, a Spark YARN app will fail with a `ClassNotFoundError`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests. Also tested manually and it worked with the fix, while was failing previously.

Closes #31642 from sunchao/SPARK-33212-followup-2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: f494c5c)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified assembly/pom.xml (diff)
The file was modified pom.xml (diff)
Commit 3d0ee9604eab3c01af469049d80b053bf2aaa636 by gurwls223
[SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/31636. There was one place missed in `GangliaSink`, and we should also remove `SecurityManager`.

### Why are the changes needed?

To make `GangliaSink` work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It was found in internal integration tests at the company I work for.

Closes #31688 from HyukjinKwon/SPARK-34520-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 3d0ee96)
The file was modified external/spark-ganglia-lgpl/src/main/scala/org/apache/spark/metrics/sink/GangliaSink.scala (diff)
Commit 62737e140c7b04805726a33c392c297335db7b45 by gurwls223
[SPARK-34556][SQL] Checking duplicate static partition columns should respect case sensitive conf

### What changes were proposed in this pull request?

This PR makes partition spec parsing respect the case sensitivity conf.

### Why are the changes needed?

When parsing the partition spec, Spark will call `org.apache.spark.sql.catalyst.parser.ParserUtils.checkDuplicateKeys` to check if there are duplicate partition column names in the list. But this method is always case sensitive and doesn't detect duplicate partition column names when using different cases.

### Does this PR introduce _any_ user-facing change?

Yep. This prevents users from writing incorrect queries such as `INSERT OVERWRITE t PARTITION (c='2', C='3') VALUES (1)` when they don't enable the case sensitivity conf.
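
A hedged illustration of the new behavior (runnable in spark-shell; assumes a table `t` partitioned by `c`, with the default `spark.sql.caseSensitive=false`):
```scala
// `c` and `C` name the same partition column when case sensitivity is
// off, so this statement is now rejected as a duplicate static partition
// spec instead of being silently accepted.
spark.sql("INSERT OVERWRITE t PARTITION (c='2', C='3') VALUES (1)")
```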

### How was this patch tested?

The newly added test fails without this change.

Closes #31669 from zsxwing/SPARK-34556.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 62737e1)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala (diff)
Commit a6cc5e625fcba2ef889f759207a01075cce3b38b by gurwls223
[SPARK-34574][DOCS] Jekyll fails to generate Scala API docs for Scala 2.13

### What changes were proposed in this pull request?

This PR fixes an issue where a `bundler exec jekyll` build fails to generate Scala API docs even after `dev/change-scala-version.sh 2.13` has been run.

### Why are the changes needed?

The reason for this issue is that `build/sbt` in `copy_api_dirs.rb` runs without `-Pscala-2.13`.
So, it's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I tested the following patterns manually.

* `dev/change-scala-version 2.13` and then `bundler exec jekyll build`
* `dev/change-scala-version 2.12` to change back to Scala 2.12 and then `bundler exec jekyll build`
* `dev/change-scala-version 2.13` two times to confirm the idempotency and then `bundler exec jekyll build`
* `dev/change-scala-version 2.12` two times to confirm the idempotency and then `bundler exec jekyll build`

Closes #31690 from sarutak/jekyll-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a6cc5e6)
The file was modified dev/change-scala-version.sh (diff)