Changes

Summary

  1. [MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataConsumer.scala (commit: 154f604) (details)
  2. [SPARK-33668][K8S][TEST] Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties" (commit: 6317ba2) (details)
  3. [SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache (commit: e857e06) (details)
  4. [SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation (commit: 5250841) (details)
  5. [SPARK-33667][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW PARTITIONS` (commit: 4829781) (details)
  6. [SPARK-33674][TEST] Show Slowpoke notifications in SBT tests (commit: b94ecf0) (details)
  7. [SPARK-33663][SQL] Uncaching should not be called on non-existing temp views (commit: 119539f) (details)
  8. [SPARK-33675][INFRA] Add GitHub Action job to publish snapshot (commit: e32de29) (details)
  9. [SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED (commit: 29096a8) (details)
  10. [SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build in GitHub Actions (commit: e88f0d4) (details)
  11. [SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite not to depend on the default conf (commit: 73412ff) (details)
Commit 154f6044033d1a3b4c19c64b206b168bf919cb3b by gurwls223
[MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataConsumer.scala

### What changes were proposed in this pull request?

This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`.
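
A minimal sketch of this class of bug, assuming a missing `s` interpolator (the variable and message here are illustrative):

```scala
val name = "t2"
// Without the `s` prefix, `$name` is emitted literally in the message.
val broken = "Exception when attempting to uncache $name" // "... uncache $name"
// With the interpolator, the variable is substituted as intended.
val fixed = s"Exception when attempting to uncache $name" // "... uncache t2"
```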

### Why are the changes needed?

To fix a string interpolation bug.

### Does this PR introduce _any_ user-facing change?

Yes, the string will be correctly constructed.

### How was this patch tested?

Existing tests, since the affected strings are used in exception/log messages.

Closes #30609 from imback82/fix_cache_str_interporlation.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 154f604)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (diff)
Commit 6317ba29a1bb1b7198fe8df71ddefcf47a55bd51 by dongjoon
[SPARK-33668][K8S][TEST] Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties"

### What changes were proposed in this pull request?
Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties."
The test has been flaky, with multiple failed instances; the reason for the failure has been similar to:

```
The code passed to eventually never returned normally. Attempted 109 times over 3.0079882413999997 minutes. Last failure message: Failure executing: GET at:
https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false. Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).. (KubernetesSuite.scala:402)
```
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console
From the above failures, it seems that the executor finishes too quickly and is removed by Spark before the test can complete. So, in order to mitigate this situation, one way is to disable the flag `spark.kubernetes.executor.deleteOnTermination` so that finished executor pods are retained long enough for their logs to be read.

### Why are the changes needed?

Fixes a flaky test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.
A few runs of the Jenkins integration tests may reveal whether the problem is resolved.

Closes #30616 from ScrapCodes/SPARK-33668/fix-flaky-k8s-integration-test.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 6317ba2)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/SparkConfPropagateSuite.scala (diff)
Commit e857e06452c2cf478beb31367f76d6950b660ebb by dongjoon
[SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache

### What changes were proposed in this pull request?

This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.
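
A rough sketch of the pattern (the class shape is simplified and illustrative; `SupportsDelete`, `Filter`, and the callback are the real moving parts):

```scala
import org.apache.spark.sql.connector.catalog.SupportsDelete
import org.apache.spark.sql.sources.Filter

// Illustrative physical node: the v2 delete runs first, then the callback
// invalidates/refreshes any caches built on the original relation.
class DeleteFromTableExecSketch(
    table: SupportsDelete,
    condition: Array[Filter],
    refreshCache: () => Unit) {
  def run(): Unit = {
    table.deleteWhere(condition) // push the delete down to the source
    refreshCache()               // refresh caches referencing the table
  }
}
```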

### Why are the changes needed?

Currently, DSv2 DELETE FROM does not refresh caches. This could lead to correctness issues if the stale cache is queried later.

### Does this PR introduce _any_ user-facing change?

Yes. DELETE FROM on a v2 table now also refreshes the cache.

### How was this patch tested?

Added a test case.

Closes #30597 from sunchao/SPARK-33652.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: e857e06)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DeleteFromTableExec.scala (diff)
Commit 5250841537d7a8c54fb451748e2a21d3bcc5d966 by dongjoon
[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style

### What changes were proposed in this pull request?

This PR adds a few lines about docstring style to document that PySpark follows the [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). The migration to the NumPy documentation style was completed in SPARK-32085.

Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html, but I would like to leave that as future work.

### Why are the changes needed?

To tell developers that PySpark now follows the NumPy documentation style.

### Does this PR introduce _any_ user-facing change?

No, the change only affects unreleased branches.

### How was this patch tested?

Manually tested via `make clean html` under `python/docs`:

![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png)

Closes #30622 from HyukjinKwon/SPARK-33256.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 5250841)
The file was modified python/docs/source/development/contributing.rst (diff)
Commit 48297818f37a8e02cc02ba6fa9ec04fe37540aca by dongjoon
[SPARK-33667][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW PARTITIONS`

### What changes were proposed in this pull request?
Preprocess the partition spec passed to the v1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, normalizing it against the table's partition columns with respect to the case-sensitivity flag **spark.sql.caseSensitive**.
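
A hedged sketch of the normalization idea with illustrative names (the PR's actual helper and call site may differ):

```scala
// Map each user-supplied key to the matching partition column using a
// resolver that honors spark.sql.caseSensitive; reject unknown keys.
def normalizePartitionSpec(
    spec: Map[String, String],
    partCols: Seq[String],
    resolver: (String, String) => Boolean): Map[String, String] = {
  spec.map { case (key, value) =>
    val col = partCols.find(resolver(_, key)).getOrElse(
      throw new IllegalArgumentException(
        s"$key is not a partitioning column of the table"))
    col -> value
  }
}
```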

### Why are the changes needed?
In fact, v1 SHOW PARTITIONS is case sensitive and doesn't respect the SQL config **spark.sql.caseSensitive**, which is false by default. For instance:
```sql
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
         > USING parquet
         > PARTITIONED BY (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS;
```
The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the command above works as expected:
```sql
spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1);
year=2015/month=1
```

### How was this patch tested?
By running the affected test suites:
- `v1/ShowPartitionsSuite`
- `v2/ShowPartitionsSuite`

Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 4829781)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowPartitionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowPartitionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
Commit b94ecf0734b829878956d98b74323e0c80822fec by yumwang
[SPARK-33674][TEST] Show Slowpoke notifications in SBT tests

### What changes were proposed in this pull request?
This PR is to show Slowpoke notifications in the log when running tests with SBT.

For example, suppose the test case "zero sized blocks" in ExternalShuffleServiceSuite enters an infinite loop. After this change, the log file will contain a notification message every 5 minutes once a test case has been running longer than two minutes. Below is an example message.

```
[info] ExternalShuffleServiceSuite:
[info] - groupByKey without compression (101 milliseconds)
[info] - shuffle non-zero block size (3 seconds, 186 milliseconds)
[info] - shuffle serializer (3 seconds, 189 milliseconds)
[info] *** Test still running after 2 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 7 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 12 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
[info] *** Test still running after 17 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks.
```
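
A hedged sketch of how this can be enabled in the SBT build, assuming ScalaTest's `-W <delay> <period>` Slowpoke reporter arguments (the 120/300 values mirror the example output above; the commit's exact values may differ):

```scala
// In project/SparkBuild.scala: ask ScalaTest to report any test still running
// after 120 seconds, and to repeat the notification every 300 seconds.
Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-W", "120", "300")
```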

### Why are the changes needed?
When the code under test has a bug and enters an infinite loop, it is hard to tell from the log which test cases hit the issue, especially when we are running the tests in parallel. It would be nice to show the Slowpoke notifications.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing in my local dev environment.

Closes #30621 from gatorsmile/addSlowpoke.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: b94ecf0)
The file was modified project/SparkBuild.scala (diff)
Commit 119539fd493af5ed0e37af79320787f145eaf3f1 by gurwls223
[SPARK-33663][SQL] Uncaching should not be called on non-existing temp views

### What changes were proposed in this pull request?

This PR proposes to fix a misleading log message that appears in the following scenario, when uncaching is attempted on a non-existing view:
```
scala> sql("CREATE TABLE table USING parquet AS SELECT 2")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.table("table")
df: org.apache.spark.sql.DataFrame = [2: int]

scala> df.createOrReplaceTempView("t2")
20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name
org.apache.spark.sql.AnalysisException: Table or view not found: t2;;
'UnresolvedRelation [t2], [], false

at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589)
at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476)
at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392)
at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124)
```
Since `t2` does not exist yet, it shouldn't try to uncache.
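
A hedged sketch of the guard, with illustrative wiring around the helpers named in the stack trace:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.execution.command.CommandUtils

// Only attempt to uncache when the temp view actually exists; otherwise the
// analyzer fails on the UnresolvedRelation and emits the warning shown above.
def uncacheIfExists(spark: SparkSession, catalog: SessionCatalog, name: String): Unit = {
  if (catalog.getTempView(name).isDefined) {
    CommandUtils.uncacheTableOrView(spark, name)
  }
}
```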

### Why are the changes needed?

To fix the misleading message.

### Does this PR introduce _any_ user-facing change?

Yes, the above message will not be displayed if the view doesn't exist yet.

### How was this patch tested?

Manually tested, since the change only affects a printed log message.

Closes #30608 from imback82/fix_cache_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 119539f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
Commit e32de29bcee6073a2d2b9bb4e5930459eaf460d9 by gurwls223
[SPARK-33675][INFRA] Add GitHub Action job to publish snapshot

### What changes were proposed in this pull request?

This PR aims to add a `GitHub Action` job to publish a daily snapshot for the **master** branch.
- https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.2.0-SNAPSHOT/

For the other branches, I'll make adjusted backports.
- For `branch-3.1`, we can specify the checkout `ref` to `branch-3.1`.
- For `branch-2.4` and `branch-3.0`, we can publish at every commit since the traffic is low.
  - https://github.com/apache/spark/pull/30630 (branch-3.0)
  - https://github.com/apache/spark/pull/30629 (branch-2.4 LTS)

### Why are the changes needed?

Once this series of jobs lands, it will permanently reduce our maintenance burden on AmpLab Jenkins by letting us remove the following completely:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/

For now, AmpLab Jenkins doesn't have a job for `branch-3.1`. We can do it ourselves with GitHub Actions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The snapshot publishing was tested here via a PR trigger. Since this PR adds a scheduled job, it cannot be fully tested within this PR:
- https://github.com/dongjoon-hyun/spark/runs/1505792859

The Apache Infra team finished the setup here:
- https://issues.apache.org/jira/browse/INFRA-21167

Closes #30623 from dongjoon-hyun/SPARK-33675.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: e32de29)
The file was added .github/workflows/publish_snapshot.yml
Commit 29096a8869c95221dc75ce7fd3d098680bef4f55 by gurwls223
[SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED

### What changes were proposed in this pull request?
Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from the V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified.

This PR is a follow-up to https://github.com/apache/spark/pull/16373 and https://github.com/apache/spark/pull/15515.
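
A hedged sketch of the wiring, assuming the existing `DDLUtils.verifyPartitionProviderIsHive` helper (the surrounding function is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.execution.command.DDLUtils

// Before listing partition metadata in v1 SHOW TABLE EXTENDED, fail fast with
// a user-friendly error when the table's partition metadata is not tracked by
// the Hive metastore.
def checkPartitionProvider(
    spark: SparkSession,
    table: CatalogTable,
    partitionSpec: Option[Map[String, String]]): Unit = {
  partitionSpec.foreach { _ =>
    DDLUtils.verifyPartitionProviderIsHive(spark, table, "SHOW TABLE EXTENDED")
  }
}
```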

### Why are the changes needed?
To output a user-friendly error with a recommendation like
**"
... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`
"**
instead of silently outputting an empty result.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the affected test suites, in particular:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
```

Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 29096a8)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala (diff)
Commit e88f0d4a2436cc47c8bf8ed2a739eab728ea3d81 by dongjoon
[SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build in GitHub Actions

### What changes were proposed in this pull request?

This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job.

In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows.
```
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile
```

However, the Scala 2.13 build uses Java 8 rather than Java 11, so let's remove `-Djava.version=11`.

### Why are the changes needed?

To build with consistent configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions' workflow.

Closes #30633 from sarutak/scala-213-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: e88f0d4)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 73412ffb3a857acda5dab41d7be3f7ae627f6eaf by dongjoon
[SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite not to depend on the default conf

### What changes were proposed in this pull request?

This PR updates `PrunePartitionSuiteBase`/`BucketedReadWithHiveSupportSuite` to set the required conf explicitly.
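
A hedged sketch of the pattern, assuming the `withSQLConf` helper from Spark's SQL test utilities and an illustrative conf (not necessarily the one pinned by the PR):

```scala
import org.apache.spark.sql.internal.SQLConf

// Pin the configuration the assertions rely on instead of inheriting the
// session default; withSQLConf restores the previous value afterwards.
withSQLConf(SQLConf.CASE_SENSITIVE.key -> "false") {
  // test body that depends on the pinned value goes here
}
```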

### Why are the changes needed?

Unit tests should not depend on the default configurations.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

According to https://github.com/apache/spark/pull/30628, these seem to be the only affected suites.

Pass the CIs.

Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 73412ff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/sources/BucketedReadWithHiveSupportSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala (diff)