Changes

Summary

  1. [SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub (commit: 5debde9) (details)
  2. [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl (commit: 339eec5) (details)
  3. [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec (commit: 14003d4) (details)
  4. [MINOR][SQL] Fixed approx_count_distinct rsd param description (commit: 10edeaf) (details)
  5. [SPARK-32616][SQL] Window operators should be added determinedly (commit: c6be207) (details)
  6. [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster (commit: 1a4c8f7) (details)
Commit 5debde94019d46d4ab66f7927d9e5e8c4d16a7ec by dongjoon
[SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub
Actions
### What changes were proposed in this pull request?
This PR proposes to report the failed and succeeded tests in GitHub
Actions in order to improve the development velocity by leveraging
[ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report).
See the example below:
![Screen Shot 2020-08-13 at 8 17 52 PM](https://user-images.githubusercontent.com/6477701/90128649-28f7f280-dda2-11ea-9211-e98e34332f6b.png)
Note that we cannot just use
[ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report)
in Apache Spark because PRs come from forked repositories, and GitHub
secrets are unavailable there for security reasons. This plugin, like all
similar plugins, requires a GitHub token with write access in order to
post test results, but such a token is unavailable in PRs.
To work around this limitation, I took this approach:
1. In workflow A, run the tests and upload the JUnit XML test results as an artifact (GitHub supports uploading and downloading artifacts).
2. GitHub introduced the new event type [`workflow_run`](https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/) 10 days ago. By leveraging this, workflow A triggers another workflow B.
3. Workflow B lives in the main repo instead of the fork, so it has the write access the plugin needs. In workflow B, it downloads the artifact uploaded from workflow A (from the forked repository).
4. Workflow B generates the test reports from the JUnit XML files.
5. Workflow B looks up the PR and posts the test reports.
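The "workflow B" side of this setup can be sketched as a minimal workflow file. Note this is an illustrative configuration, not the exact file added in this PR: the workflow A name, action versions, and glob patterns are assumptions.

```yaml
# Hypothetical sketch of "workflow B": runs in the base repository with
# write access, triggered when the test workflow ("workflow A") finishes.
name: Report test results
on:
  workflow_run:
    workflows: ["Build and test"]   # name of "workflow A" (assumed)
    types: [completed]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Download JUnit XML results uploaded by workflow A
        uses: dawidd6/action-download-artifact@v2   # the PR uses a forked variant
        with:
          run_id: ${{ github.event.workflow_run.id }}
      - name: Post the test report against the PR's commit
        uses: ScaCap/action-surefire-report@v1      # the PR uses a forked variant
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          report_paths: "**/target/test-reports/*.xml"
          commit: ${{ github.event.workflow_run.head_commit.id }}
```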
The `workflow_run` event is a very new feature, and not many GitHub
Actions plugins seem to support it yet. In order to make this work with
[ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report),
I had to fork two GitHub Actions plugins:
- [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report)
with this custom fix:
https://github.com/HyukjinKwon/action-surefire-report/commit/c96094cc35061fcf154a7cb46807f2f3e2339476
   It adds a `commit` argument to specify the commit to post the test
reports against. With `workflow_run`, workflow B can access the
commit from workflow A.
- [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact)
with this custom fix:
https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed
   It adds support for downloading all artifacts from workflow A in
workflow B. By default, the plugin only supports specifying the name of a
single artifact.
    Note that I was not able to use the official
[actions/download-artifact](https://github.com/actions/download-artifact)
because:
     - It does not support downloading artifacts between different
workflows; see also
https://github.com/actions/download-artifact/issues/3. Once this issue
is resolved, we can switch back to
[actions/download-artifact](https://github.com/actions/download-artifact).
I plan to make a pull request for both repositories so we don't have to
rely on forks.
### Why are the changes needed?
Currently, it's difficult to check the failed tests: you have to scroll
through long GitHub Actions logs.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested at: https://github.com/HyukjinKwon/spark/pull/17,
https://github.com/HyukjinKwon/spark/pull/18,
https://github.com/HyukjinKwon/spark/pull/19,
https://github.com/HyukjinKwon/spark/pull/20, and master branch of my
forked repository.
Closes #29333 from HyukjinKwon/SPARK-32357-fix.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by:
HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 5debde9)
The file was modified .github/workflows/master.yml (diff)
The file was added .github/workflows/test_report.yml
Commit 339eec5f321fac07e01441a3fd83024f4b92a0d6 by wenchen
[SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl
### What changes were proposed in this pull request?
See the discussion in this
[comment](https://github.com/apache/spark/pull/29244#issuecomment-671746329).
Add a `HiveVoidType` class in `HiveClientImpl` so that we can replace
`NullType` with `HiveVoidType` before calling the Hive client.
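The idea of mapping Spark's null type to Hive's `void` type name can be sketched as follows. This is a simplified illustration, not Spark's actual `HiveClientImpl` code: the enum and method names here are hypothetical stand-ins for Catalyst's data types.

```java
// Hedged sketch: before handing a schema to the Hive client, the null
// type is rendered as Hive's "void" type name instead of failing as an
// unsupported type. Names below are hypothetical, not Spark's API.
public class HiveTypeNameDemo {
    // Hypothetical stand-ins for a few Catalyst data types.
    enum CatalystType { INT, STRING, NULL }

    static String toHiveTypeName(CatalystType t) {
        switch (t) {
            case INT:    return "int";
            case STRING: return "string";
            case NULL:   return "void"; // the HiveVoidType case
            default:     throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHiveTypeName(CatalystType.NULL));
    }
}
```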
### Why are the changes needed?
Better compatibility with Hive.
More details in [#29244](https://github.com/apache/spark/pull/29244).
### Does this PR introduce _any_ user-facing change?
Yes, users can create views with a null type in Hive.
### How was this patch tested?
New test.
Closes #29423 from ulysses-you/add-HiveVoidType.
Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 339eec5)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveSQLViewSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
Commit 14003d4c30e881940b6e3bf2d978bc8dca941146 by wenchen
[SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec
### What changes were proposed in this pull request?
Remove `fullOutput` from `RowDataSourceScanExec`.
### Why are the changes needed?
`RowDataSourceScanExec` requires the full output instead of the scan
output after column pruning. However, in the v2 code path, we no longer
have the full output, so we just pass the pruned output.
`RowDataSourceScanExec.fullOutput` is therefore meaningless and should be
removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29415 from huaxingao/rm_full_output.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 14003d4)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit 10edeafc69250afef8c71ed7b3c77992f67aa4ff by yamamuro
[MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?
In the docs concerning `approx_count_distinct`, I changed the
description of the `rsd` parameter from **_maximum estimation error
allowed_** to _**maximum relative standard deviation allowed**_.
### Why are the changes needed?
"Maximum estimation error allowed" can be misleading: you set the target
relative standard deviation, which affects the estimation error, but on
any given run the estimation error can still exceed the `rsd` parameter.
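The distinction above can be illustrated numerically. This is a hedged sketch with made-up estimates, not output from Spark's estimator: it just shows that a target relative standard deviation of 5% does not bound each individual run's relative error.

```java
// Hypothetical estimates of a true distinct count of 1000 with a target
// rsd of 0.05. rsd constrains the estimator's spread on average; a
// single run can still miss by more than 5%.
public class RsdDemo {
    public static void main(String[] args) {
        double trueCount = 1000.0;
        double rsd = 0.05;                                  // target, not a hard cap
        double[] estimates = {980, 1030, 1065, 950, 1010};  // hypothetical runs
        int over = 0;
        for (double est : estimates) {
            double relErr = Math.abs(est - trueCount) / trueCount;
            if (relErr > rsd) over++;
        }
        // One of the five hypothetical runs exceeds the 5% target.
        System.out.println(over);
    }
}
```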
### Does this PR introduce _any_ user-facing change?
This PR should make it easier for users reading the docs to understand
that the `rsd` parameter in `approx_count_distinct` doesn't cap the
estimation error, but sets the target deviation instead.
### How was this patch tested?
No tests, as no code changes were made.
Closes #29424 from
Comonut/fix-approx_count_distinct-rsd-param-description.
Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 10edeaf)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified R/pkg/R/functions.R (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit c6be2074cc704ec1578eb05d1d8a696227cd2bbf by wenchen
[SPARK-32616][SQL] Window operators should be added determinedly
### What changes were proposed in this pull request?
Use the `LinkedHashMap` instead of `immutable.Map` to hold the `Window`
expressions in `ExtractWindowExpressions.addWindow`.
### Why are the changes needed?
This is a bug fix for https://github.com/apache/spark/pull/29270. In
that PR, the generated plans (especially for the queries q47, q49, q57)
on Jenkins could never match the golden plans generated on my laptop.
It happens because `ExtractWindowExpressions.addWindow` currently uses an
`immutable.Map` to hold the `Window` expressions, keyed by
`(spec.partitionSpec, spec.orderSpec,
WindowFunctionType.functionType(expr))`, and converts the map to a `Seq`
at the end. That `Seq` is then used to add Window operators on top of the
child plan. However, for the same query, the order of Window expressions
inside the `Seq` can be nondeterministic when the expression ids change
(which can affect the keys). As a result, the same query can produce
different plans because of the nondeterministic order of Window
operators.
Therefore, we use a `LinkedHashMap`, which records the insertion order of
entries, to make the adding order deterministic.
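The map-ordering point above can be demonstrated in isolation. This sketch uses Java's `LinkedHashMap` to stand in for Scala's `mutable.LinkedHashMap`; the "spec" keys are made up, simulating window-spec keys whose hash codes shift when expression ids change.

```java
import java.util.*;

public class WindowOrderDemo {
    public static void main(String[] args) {
        // Simulated window-spec keys; in the real analyzer these are
        // (partitionSpec, orderSpec, functionType) tuples whose hash
        // codes change when expression ids are renumbered.
        String[] specs = {"spec#12", "spec#7", "spec#45"};

        // HashMap: iteration order follows the hash layout, not insertion,
        // so renumbered ids can silently reorder the Window operators.
        Map<String, String> hashed = new HashMap<>();
        // LinkedHashMap: iteration order is the insertion order, so the
        // same query always yields the same operator order.
        Map<String, String> linked = new LinkedHashMap<>();
        for (String s : specs) {
            hashed.put(s, "Window(" + s + ")");
            linked.put(s, "Window(" + s + ")");
        }
        System.out.println(new ArrayList<>(linked.keySet()));
    }
}
```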
### Does this PR introduce _any_ user-facing change?
Maybe yes, users now always see the same plan for the same queries with
multiple Window operators.
### How was this patch tested?
It's really hard to make a reproducible demo. I just tested manually with
https://github.com/apache/spark/pull/29270 and it looks good.
Closes #29432 from Ngone51/fix-addWindow.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: c6be207)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 1a4c8f718f748d8c2a39f0a3b209ce606bffe958 by mridulatgmail.com
[SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster
and Kubernetes with --jars
### What changes were proposed in this pull request?
This PR changes Executor to load jars and files added by --jars and
--files on Executor initialization. To avoid downloading those
jars/files twice, they are associated with `startTime` as their upload
timestamp.
### Why are the changes needed?
ExecutorPlugin can't work with Standalone Cluster or Kubernetes when a
jar containing the plugins, and the files used by those plugins, are
added via the --jars and --files options of spark-submit.
This is because jars and files added by --jars and --files are not
loaded on Executor initialization. I confirmed it works with YARN
because there the jars/files are distributed via the distributed cache.
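The timestamp-based de-duplication described above can be sketched as follows. This is an illustrative model, not Spark's actual `Executor` code: the class and method names are hypothetical.

```java
import java.util.*;

// Hedged sketch: each file is tracked with the timestamp at which it was
// added, and a fetch happens only when that timestamp is newer than what
// the executor has already downloaded. Using the driver's startTime as
// the timestamp means files loaded at executor init are not re-fetched
// when the driver later announces the same jars/files.
public class FetchOnceDemo {
    private final Map<String, Long> fetchedAt = new HashMap<>();
    private int downloads = 0;

    void fetchIfNeeded(String path, long addedTime) {
        Long seen = fetchedAt.get(path);
        if (seen == null || seen < addedTime) {
            downloads++;                  // a real executor would download here
            fetchedAt.put(path, addedTime);
        }
    }

    public static void main(String[] args) {
        FetchOnceDemo exec = new FetchOnceDemo();
        long startTime = 1000L;
        // Executor init fetches the --jars entry stamped with startTime.
        exec.fetchIfNeeded("plugin.jar", startTime);
        // The same jar offered again with the same timestamp: no re-download.
        exec.fetchIfNeeded("plugin.jar", startTime);
        System.out.println(exec.downloads);
    }
}
```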
### Does this PR introduce _any_ user-facing change?
Yes. Jars/files added by --jars and --files are downloaded on each
executor at initialization.
### How was this patch tested?
Added a new test case.
Closes #28939 from sarutak/fix-plugin-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Mridul Muralidharan <mridul<at>gmail.com>
(commit: 1a4c8f7)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified core/src/test/java/test/org/apache/spark/JavaSparkContextSuite.java (diff)