Changes

Summary

  1. [SPARK-33654][SQL] Migrate CACHE TABLE to use UnresolvedRelation to resolve identifier (commit: 8f5db71)
  2. [SPARK-33706][SQL] Require fully specified partition identifier in partitionExists() (commit: 8b97b19)
  3. [SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver (commit: 5bab27e)
  4. [MINOR][INFRA] Add kubernetes-integration-tests to GitHub Actions for Scala 2.13 build (commit: 29cc5b3)
  5. [SPARK-33757][INFRA][R] Fix the R dependencies build error on GitHub Actions and AppVeyor (commit: fb2e3af)
  6. [SPARK-33729][SQL] When refreshing cache, Spark should not use cached plan when recaching data (commit: be09d37)
  7. [SPARK-32447][CORE][PYTHON][FOLLOW-UP] Fix other occurrences of 'python' to 'python3' (commit: e2cdfce)
  8. [MINOR][UI] Correct JobPage's skipped/pending tableHeaderId (commit: 0277fdd)
Commit 8f5db716fae1162e411750cd5d5380a399d410ae by wenchen
[SPARK-33654][SQL] Migrate CACHE TABLE to use UnresolvedRelation to resolve identifier

### What changes were proposed in this pull request?

This PR proposes to migrate `CACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in the Analyzer, as discussed in https://github.com/apache/spark/pull/30403/files#r532360022.
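
For context, here is a hedged sketch of the pattern (the exact plan node and resolution rule added by this PR may differ):

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

// Illustrative only: the parser leaves the identifier unresolved, and the
// Analyzer later resolves it to a table or view, producing the standard
// resolution errors if it does not exist.
val toCache = UnresolvedRelation(Seq("spark_catalog", "default", "t1"))
```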

### Why are the changes needed?

To resolve the table in the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30598 from imback82/cache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8f5db71)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 8b97b19ffad7ec78e4b1f05cb1168ef79dc647b2 by wenchen
[SPARK-33706][SQL] Require fully specified partition identifier in partitionExists()

### What changes were proposed in this pull request?
1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`.

### Why are the changes needed?
The method is supposed to check the existence of one partition, but currently it can return `true` for a partially specified partition. This can lead to incorrect command behavior; for instance, commands could modify or place data in the middle of a partition path.
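
To illustrate the new requirement, here is a minimal sketch (assumed shape; the actual check and message in `SupportsPartitionManagement` may differ):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Reject identifiers that do not specify a value for every partition field
// before probing for the partition's existence.
def checkFullySpecified(ident: InternalRow, partitionSchema: StructType): Unit = {
  require(ident.numFields == partitionSchema.length,
    s"Partition identifier has ${ident.numFields} fields, " +
      s"but the partition schema has ${partitionSchema.length} fields")
}
```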

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"
```

Closes #30667 from MaxGekk/check-len-partitionExists.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8b97b19)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryPartitionTable.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/SupportsPartitionManagementSuite.scala (diff)
Commit 5bab27e00bcad31400c952149ffd0389f841a992 by gurwls223
[SPARK-33526][SQL] Add config to control if cancel invoke interrupt task on thriftserver

### What changes were proposed in this pull request?

This PR adds a new config, `spark.sql.thriftServer.forceCancel`, to give users a way to interrupt tasks when a statement is cancelled.

### Why are the changes needed?

After [#29933](https://github.com/apache/spark/pull/29933), we support cancelling a query on timeout, but the default behavior of `SparkContext.cancelJobGroup` won't interrupt tasks and just lets them finish by themselves. In some cases this is dangerous, e.g., under data skew or a heavy shuffle: a task can keep running for a long time after the cancel, and its resources will not be released.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.
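
For example, a user might enable the new behavior as follows (the config name comes from this PR; `spark` is the active `SparkSession`, and the assumption here is that the default preserves the old non-interrupting behavior):

```scala
// Interrupt running tasks when a Thrift Server statement is cancelled.
spark.conf.set("spark.sql.thriftServer.forceCancel", "true")
```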

### How was this patch tested?

Add test.

Closes #30481 from ulysses-you/SPARK-33526.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses-you <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 5bab27e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
Commit 29cc5b3f235ff178cf888f16877e6e0fd44253cc by gurwls223
[MINOR][INFRA] Add kubernetes-integration-tests to GitHub Actions for Scala 2.13 build

### What changes were proposed in this pull request?

This PR adds `kubernetes-integration-tests` to GitHub Actions for Scala 2.13 build.

### Why are the changes needed?

Now that the build passes with `kubernetes-integration-tests` and Scala 2.13, it's better to keep it buildable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.
I also confirmed that the build passes with the following command.
```
$ build/sbt -Pscala-2.13 -Pkubernetes -Pkubernetes-integration-tests compile test:compile
```

Closes #30731 from sarutak/github-actions-k8s.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 29cc5b3)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit fb2e3af4b5d92398d57e61b766466cc7efd9d7cb by gurwls223
[SPARK-33757][INFRA][R] Fix the R dependencies build error on GitHub Actions and AppVeyor

### What changes were proposed in this pull request?

This PR fixes the R dependencies build error on GitHub Actions and AppVeyor.
The cause seems to be that the `usethis` package was updated on 2020/12/10.
https://cran.r-project.org/web/packages/usethis/index.html

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.

Closes #30737 from sarutak/fix-r-dependencies-build-error.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: fb2e3af)
The file was modified appveyor.yml (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit be09d37398f6b62c853e961df64b94b34fd3389d by dongjoon
[SPARK-33729][SQL] When refreshing cache, Spark should not use cached plan when recaching data

### What changes were proposed in this pull request?

This fixes `CatalogImpl.refreshTable` by using a new logical plan when recaching the target table.

### Why are the changes needed?

In `CatalogImpl.refreshTable`, we currently recache the target table via:
```scala
sparkSession.sharedState.cacheManager.cacheQuery(table, cacheName, cacheLevel)
```
However, here `table` is generated before the `tableRelationCache` in `SessionCatalog` is invalidated, and therefore it still refers to the old, stale logical plan, which is incorrect.
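
A hedged sketch of the shape of the fix (the actual code in `CatalogImpl` may differ; `tableIdent`, `cacheName`, and `cacheLevel` are assumed to come from the surrounding method):

```scala
// Refresh the catalog's relation cache first, then build a fresh logical
// plan for the table instead of reusing the pre-invalidation one.
sparkSession.sessionState.catalog.refreshTable(tableIdent)
val freshTable = sparkSession.table(tableIdent.quotedString)
sparkSession.sharedState.cacheManager.cacheQuery(freshTable, cacheName, cacheLevel)
```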

### Does this PR introduce _any_ user-facing change?

Yes, this fixes the behavior when a table is refreshed.

### How was this patch tested?

Added a unit test.

Closes #30699 from sunchao/SPARK-33729.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: be09d37)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala (diff)
Commit e2cdfcebd9b39a1104b34d8eafafbcdc6acf5d3e by gurwls223
[SPARK-32447][CORE][PYTHON][FOLLOW-UP] Fix other occurrences of 'python' to 'python3'

### What changes were proposed in this pull request?

This PR proposes to change `python` to `python3` in several places that were previously missed.

### Why are the changes needed?

To use Python 3 by default safely.

### Does this PR introduce _any_ user-facing change?

Yes, it will use `python3` as its default Python interpreter.

### How was this patch tested?

It was tested together in https://github.com/apache/spark/pull/30735. The test cases there verify this change as well.

Closes #30750 from HyukjinKwon/SPARK-32447.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: e2cdfce)
The file was modified core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (diff)
The file was modified python/pyspark/context.py (diff)
Commit 0277fddaef17b615354c735a2c89cdced5f1d8f6 by sarutak
[MINOR][UI] Correct JobPage's skipped/pending tableHeaderId

### What changes were proposed in this pull request?

Currently, the header links of the pending/skipped stage tables on the Spark Web UI's job page are inconsistent with their statuses. See the picture below:
![image](https://user-images.githubusercontent.com/9404831/101998894-1e843180-3c8c-11eb-8d94-10df9edb68e7.png)

### Why are the changes needed?

The code determining the `pendingOrSkippedTableId` has the wrong logic. As explained in the code:
> If the job is completed, then any pending stages are displayed as "skipped" [code pointer](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L266)

This PR fixes the logic for `pendingOrSkippedTableId` which aligns with the stage statuses.
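
In other words, the corrected logic should look roughly like this (variable name from the PR description; the surrounding code is simplified):

```scala
// Once the job has completed, any still-pending stages are rendered in the
// "skipped" table, so the header anchor must point there as well.
val pendingOrSkippedTableId = if (isComplete) "skipped" else "pending"
```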

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Verified that the header link is consistent with the stage status after the fix.

Closes #30749 from linzebing/ui_bug.

Authored-by: linzebing <linzebing1995@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 0277fdd)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala (diff)