Changes

Summary

  1. [MINOR][DOCS] Soften security warning and keep it in cluster management (commit: 2e1e1f8) (details)
  2. [SPARK-34787][CORE] Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX) (commit: 94849af) (details)
  3. [SPARK-35101][INFRA] Add GitHub status check in PR instead of a comment (commit: 2bdb26b) (details)
  4. [MINOR][INFRA] Upgrade Jira client to 2.0.0 (commit: 7f6dee8) (details)
  5. [SPARK-35116][SQL][TESTS] The generated data fits the precision of DayTimeIntervalType in spark (commit: 03191e8) (details)
  6. [SPARK-35114][SQL][TESTS] Add checks for ANSI intervals to `LiteralExpressionSuite` (commit: d04b467) (details)
  7. [SPARK-34716][SQL] Support ANSI SQL intervals by the aggregate function `sum` (commit: 12abfe7) (details)
  8. [SPARK-35115][SQL][TESTS] Check ANSI intervals in `MutableProjectionSuite` (commit: 074f770) (details)
  9. [SPARK-35092][UI] the auto-generated rdd's name in the storage tab should be truncated if it is too long (commit: 978cd0b) (details)
  10. [SPARK-35109][SQL] Fix minor exception messages of HashedRelation and HashJoin (commit: fd08c93) (details)
  11. [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function (commit: c8d78a7) (details)
  12. [SPARK-35122][SQL] Migrate CACHE/UNCACHE TABLE to use AnalysisOnlyCommand (commit: 7a06cdd) (details)
Commit 2e1e1f83e49f69e37229c65592c1a2e87b1efe56 by srowen
[MINOR][DOCS] Soften security warning and keep it in cluster management docs only

### What changes were proposed in this pull request?

Soften security warning and keep it in cluster management docs only, not in the main doc page, where it's not necessarily relevant.

### Why are the changes needed?

The statement is perhaps unnecessarily 'frightening' as the first section in the main docs page. It applies to clusters, not local mode, anyhow.

### Does this PR introduce _any_ user-facing change?

Just a docs change.

### How was this patch tested?

N/A

Closes #32206 from srowen/SecurityStatement.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 2e1e1f8)
The file was modified docs/security.md (diff)
The file was modified docs/quick-start.md (diff)
The file was modified docs/running-on-kubernetes.md (diff)
The file was modified docs/spark-standalone.md (diff)
The file was modified docs/running-on-mesos.md (diff)
The file was modified docs/running-on-yarn.md (diff)
The file was modified docs/index.md (diff)
Commit 94849afc546e6d30bc55c248c3647158b6cb5a0c by dhyun
[SPARK-34787][CORE] Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

### What changes were proposed in this pull request?
Make the attemptId in the history server log easier to read.
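For illustration, here is a minimal Scala sketch of the difference (illustrative only, not the exact patched line in `ApplicationCache.scala`):

```scala
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("example")
val attemptId: Option[String] = Some("1")

// Interpolating the Option directly prints the wrapper type:
log.info(s"attempt $attemptId")                     // attempt Some(1)
// Unwrapping it (with a default for None) prints the actual value:
log.info(s"attempt ${attemptId.getOrElse("none")}") // attempt 1
```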

### Why are the changes needed?
Option variables in the Spark history server log should be displayed as their actual values instead of Some(XX).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
manual test

Closes #32189 from kyoty/history-server-print-option-variable.

Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 94849af)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/ApplicationCache.scala (diff)
Commit 2bdb26b37448aa856494a3976b26e8ae72ce5d4c by gurwls223
[SPARK-35101][INFRA] Add GitHub status check in PR instead of a comment

### What changes were proposed in this pull request?

TL;DR: it now shows a green/yellow/red test status instead of relying on a comment in a PR, **see https://github.com/HyukjinKwon/spark/pull/41 for an example**.

This PR proposes GitHub status checks instead of a comment that links to the build (from the forked repository) in PRs.

This is how it works:

1. **forked repo**: the "Build and test" workflow is triggered when you create a branch to open a PR; it uses your own GitHub Actions resources.
1. **main repo**: the "Notify test workflow" workflow (which previously created a comment) now creates an in-progress (yellow) status as a GitHub Actions check on your current PR.
1. **main repo**: the "Update build status workflow" workflow regularly (every 15 minutes) checks open PRs and updates the status of the GitHub Actions checks on those PRs according to the status of the workflows in the forked repositories (status sync).

**NOTE** that creating/updating statuses in the PRs is only allowed from the main repo. That's why the flow is as above.

### Why are the changes needed?

The GitHub status shows green even while the tests are still running, which is confusing.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at:
- https://github.com/HyukjinKwon/spark/pull/41
- HyukjinKwon#42
- HyukjinKwon#43
- https://github.com/HyukjinKwon/spark/pull/37

**queued**:
<img width="861" alt="Screen Shot 2021-04-16 at 10 56 03 AM" src="https://user-images.githubusercontent.com/6477701/114960831-c9a73080-9ea2-11eb-8442-ddf3f6008a45.png">

**in progress**:
<img width="871" alt="Screen Shot 2021-04-16 at 12 14 39 PM" src="https://user-images.githubusercontent.com/6477701/114966359-59ea7300-9ead-11eb-98cb-1e63323980ad.png">

**passed**:
![Screen Shot 2021-04-16 at 2 04 07 PM](https://user-images.githubusercontent.com/6477701/114974045-a12c3000-9ebc-11eb-9be5-653393a863e6.png)

**failure**:
![Screen Shot 2021-04-16 at 10 46 10 PM](https://user-images.githubusercontent.com/6477701/115033584-90ec7300-9f05-11eb-8f2e-0fc2ef986a70.png)

Closes #32193 from HyukjinKwon/update-checks-pr-poc.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 2bdb26b)
The file was modified .github/workflows/notify_test_workflow.yml (diff)
The file was modified .github/workflows/labeler.yml (diff)
The file was added .github/workflows/update_build_status.yml
Commit 7f6dee8c861ceea85edbd3c263e98b41d751fd31 by gurwls223
[MINOR][INFRA] Upgrade Jira client to 2.0.0

### What changes were proposed in this pull request?

SPARK-10498 added the initial Jira client requirement, version 1.0.3, five years ago (January 2016). As of today, it causes a `dev/merge_spark_pr.py` failure with Python 3.9.4 due to this old dependency. This PR aims to upgrade it to the latest version, 2.0.0. The latest version is also a little old (July 2018).
- https://pypi.org/project/jira/#history

### Why are the changes needed?

`jira==2.0.0` works well with both Python 3.8 and 3.9, while `jira==1.0.3` fails with Python 3.9.

**BEFORE**
```
$ pyenv global 3.9.4
$ pip freeze | grep jira
jira==1.0.3
$ dev/merge_spark_pr.py
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/dev/merge_spark_pr.py", line 39, in <module>
    import jira.client
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/__init__.py", line 5, in <module>
    from .config import get_jira
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/config.py", line 17, in <module>
    from .client import JIRA
  File "/Users/dongjoon/.pyenv/versions/3.9.4/lib/python3.9/site-packages/jira/client.py", line 165
    validate=False, get_server_info=True, async=False, logging=True, max_retries=3):
                                          ^
SyntaxError: invalid syntax
```

**AFTER**
```
$ pip install jira==2.0.0
$ dev/merge_spark_pr.py
git rev-parse --abbrev-ref HEAD
Which pull request would you like to merge? (e.g. 34):
```

### Does this PR introduce _any_ user-facing change?

No. This is a committer-only script.

### How was this patch tested?

Manually.

Closes #32215 from dongjoon-hyun/jira.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 7f6dee8)
The file was modified dev/requirements.txt (diff)
Commit 03191e8d8f424f418cc464f855247b9907567b06 by max.gekk
[SPARK-35116][SQL][TESTS] The generated data fits the precision of DayTimeIntervalType in spark

### What changes were proposed in this pull request?
The precision of `java.time.Duration` is nanosecond, but when it is used as `DayTimeIntervalType` in Spark, it is microsecond.
At present, the `DayTimeIntervalType` data generated by `RandomDataGenerator` is accurate to the nanosecond, so converting a value to long (microseconds) and back to `DayTimeIntervalType` loses precision, which makes tests fail. For example: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137390/testReport/org.apache.spark.sql.hive.execution/HashAggregationQueryWithControlledFallbackSuite/udaf_with_all_data_types/
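For illustration, a self-contained Scala sketch of the precision loss (plain `java.time`, no Spark APIs):

```scala
import java.time.Duration
import java.util.concurrent.TimeUnit

// A Duration can carry nanosecond precision...
val d = Duration.ofSeconds(1, 123456789L)              // 1.123456789 s
// ...but DayTimeIntervalType stores microseconds, so a round trip through
// a microsecond-precision long drops the sub-microsecond digits.
val micros = TimeUnit.NANOSECONDS.toMicros(d.toNanos)  // 1123456 us
val back = Duration.ofNanos(TimeUnit.MICROSECONDS.toNanos(micros))
assert(back != d)  // the trailing 789 ns were lost in the round trip
```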

### Why are the changes needed?
Improve `RandomDataGenerator` so that the generated data fits the precision of `DayTimeIntervalType` in Spark.

### Does this PR introduce _any_ user-facing change?
No. It only changes test classes.

### How was this patch tested?
Jenkins test.

Closes #32212 from beliefer/SPARK-35116.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 03191e8)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGeneratorSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala (diff)
Commit d04b467690f1684728b3cc18f033be0d29f3f9e3 by max.gekk
[SPARK-35114][SQL][TESTS] Add checks for ANSI intervals to `LiteralExpressionSuite`

### What changes were proposed in this pull request?
In the PR, I propose to add additional checks for ANSI interval types `YearMonthIntervalType` and `DayTimeIntervalType` to `LiteralExpressionSuite`.

Also, I replaced some long literal values with `CalendarInterval` values so that the tests exercise `CalendarIntervalType` as they were supposed to.
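A hedged sketch of the kind of check involved, assuming `Literal.apply` maps `java.time.Duration` and `java.time.Period` values to the new ANSI interval types:

```scala
import java.time.{Duration, Period}

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{DayTimeIntervalType, YearMonthIntervalType}

// Assumption: Literal.apply maps java.time values to the ANSI interval types.
val dt = Literal(Duration.ofDays(1).plusHours(2))
assert(dt.dataType == DayTimeIntervalType)

val ym = Literal(Period.ofMonths(13))
assert(ym.dataType == YearMonthIntervalType)
```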

### Why are the changes needed?
To improve test coverage and have the same checks for ANSI types as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *LiteralExpressionSuite"
```

Closes #32213 from MaxGekk/interval-literal-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: d04b467)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala (diff)
Commit 12abfe79173f6ab00b3341f3b31cad5aa26aa6e4 by max.gekk
[SPARK-34716][SQL] Support ANSI SQL intervals by the aggregate function `sum`

### What changes were proposed in this pull request?
Extend the `Sum` expression to support `DayTimeIntervalType` and `YearMonthIntervalType` added by #31614.

Note: the expressions can throw an overflow exception regardless of the SQL config `spark.sql.ansi.enabled`. In this way, the modified expressions always behave in ANSI mode for intervals.
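A hedged usage sketch (assumes a local build that includes this patch, plus the `java.time.Duration` encoder):

```scala
import java.time.Duration

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[1]").appName("sum-intervals").getOrCreate()
import spark.implicits._

val df = Seq(Duration.ofDays(1), Duration.ofHours(12)).toDF("i")
// Expected: one DayTimeIntervalType value equal to 1 day 12 hours.
df.select(sum($"i")).show()
// An overflow while summing throws an exception even when
// spark.sql.ansi.enabled is false.
```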

### Why are the changes needed?
Extend `org.apache.spark.sql.catalyst.expressions.aggregate.Sum` to support `DayTimeIntervalType` and `YearMonthIntervalType`.

### Does this PR introduce _any_ user-facing change?
No. It should not, since the new types have not been released yet.

### How was this patch tested?
Jenkins test

Closes #32107 from beliefer/SPARK-34716.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 12abfe7)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ExpressionTypeCheckingSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala (diff)
Commit 074f7701372829d745417a80155f31ba6817e619 by yamamuro
[SPARK-35115][SQL][TESTS] Check ANSI intervals in `MutableProjectionSuite`

### What changes were proposed in this pull request?
Add checks for `YearMonthIntervalType` and `DayTimeIntervalType` to `MutableProjectionSuite`.

### Why are the changes needed?
To improve test coverage and to have the same checks for the ANSI interval types as for `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32225 from MaxGekk/test-ansi-intervals-in-MutableProjectionSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 074f770)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MutableProjectionSuite.scala (diff)
Commit 978cd0bf4991de81ebe8a82b67144c0029464fe6 by sarutak
[SPARK-35092][UI] the auto-generated rdd's name in the storage tab should be truncated if it is too long

### What changes were proposed in this pull request?
The auto-generated RDD's name in the Storage tab should be truncated to a single line if it is too long.

### Why are the changes needed?
To make the UI more friendly to read.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Just a simple modification in CSS; manual testing works well, as shown below:

Before the change:
![the RDD title in the Storage page is too long](https://user-images.githubusercontent.com/52202080/115009655-17da2500-9edf-11eb-86a7-088bed7ef8f7.png)

After the change, the title needs just one line:

![the storage title is truncated after the change](https://user-images.githubusercontent.com/52202080/114872091-8c07c080-9e2c-11eb-81a8-0c097b1a77bf.png)

Closes #32191 from kyoty/storage-rdd-titile-display-improve.

Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 978cd0b)
The file was modified core/src/main/scala/org/apache/spark/ui/UIUtils.scala (diff)
Commit fd08c931510b19d96acbbb445962200f776bdd99 by yamamuro
[SPARK-35109][SQL] Fix minor exception messages of HashedRelation and HashJoin

### What changes were proposed in this pull request?

It seems we missed classifying one `SparkOutOfMemoryError` in `HashedRelation`, so add the error classification for it. In addition, clean up two error definitions in `HashJoin`, as they are not used.
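A hedged sketch of the classification pattern (the method name below is hypothetical; the real definitions live in `QueryExecutionErrors.scala`):

```scala
import org.apache.spark.memory.SparkOutOfMemoryError

// Hypothetical example of the pattern: build classified errors in one central
// object so messages stay consistent across the code base.
object QueryExecutionErrorsSketch {
  def cannotAcquireMemoryToBuildHashedRelationError(got: Long, needed: Long): Throwable =
    new SparkOutOfMemoryError(
      s"Can't acquire $needed bytes of memory to build HashedRelation, got $got bytes")
}
```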

### Why are the changes needed?

Better error classification.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32211 from c21/error-message.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: fd08c93)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
Commit c8d78a70b43f8373ff3f2c170d6aa3eaece03b7c by wenchen
[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

### What changes were proposed in this pull request?
This PR:
- Adds a new expression `GroupingExprRef` that can be used in aggregate expressions of `Aggregate` nodes to refer to grouping expressions by index. These expressions capture the data type and nullability of the referred grouping expression (see the sketch after this list).
- Adds a new rule `EnforceGroupingReferencesInAggregates` that inserts the references at the beginning of the optimization phase.
- Adds a new rule `UpdateGroupingExprRefNullability` that updates the nullability of `GroupingExprRef` expressions, as the nullability of the referred grouping expression can change during optimization.
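A hedged, simplified sketch of the reference idea (hypothetical class name, not the exact implementation):

```scala
import org.apache.spark.sql.catalyst.expressions.{LeafExpression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Refers to the ordinal-th grouping expression and captures its data type and
// nullability, so optimizer rewrites of the grouping expression cannot break
// the match between grouping and aggregate expressions.
case class GroupingExprRefSketch(
    ordinal: Int,
    dataType: DataType,
    nullable: Boolean) extends LeafExpression with Unevaluable
```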

### Why are the changes needed?
If the aggregate expressions (without aggregate functions) in an `Aggregate` node are complex, the `Optimizer` can optimize grouping expressions out of them, making the aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
+- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [NOT groupingexprref(0) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #31913 from peter-toth/SPARK-34581-keep-grouping-expressions.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c8d78a7)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala (diff)
The file was removed sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UpdateAttributeNullability.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/group-by.sql.out (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UpdateNullability.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/group-by.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EnforceGroupingReferencesInAggregates.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
Commit 7a06cdd53b377ffcd58dacd63a1b96d6b422941c by wenchen
[SPARK-35122][SQL] Migrate CACHE/UNCACHE TABLE to use AnalysisOnlyCommand

### What changes were proposed in this pull request?

Now that `AnalysisOnlyCommand` is introduced in #32032, `CacheTable` and `UncacheTable` can extend `AnalysisOnlyCommand` to simplify the code base. Previously, the logic to handle these commands such that the tables are only analyzed was scattered across different places.
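A hedged sketch of the trait's shape (member names assumed from #32032, simplified):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

// A command whose children are analyzed but never optimized or planned:
// once the analyzer finishes, markAsAnalyzed() hides the children so the
// later phases skip them.
trait AnalysisOnlyCommandSketch extends Command {
  val isAnalyzed: Boolean
  def childrenToAnalyze: Seq[LogicalPlan]
  override def children: Seq[LogicalPlan] =
    if (isAnalyzed) Nil else childrenToAnalyze
  def markAsAnalyzed(): LogicalPlan
}
```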

### Why are the changes needed?

To simplify the code base to handle these two commands.

### Does this PR introduce _any_ user-facing change?

No, just internal refactoring.

### How was this patch tested?

The existing tests (e.g., `CachedTableSuite`) cover the changes in this PR. For example, if I make `CacheTable`/`UncacheTable` extend `LeafCommand`, there are a few failures in `CachedTableSuite`.

Closes #32220 from imback82/cache_cmd_analysis_only.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7a06cdd)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)