Changes

Summary

  1. [SPARK-35567][SQL] Fix: Explain cost is not showing statistics for all (commit: cd2ef9c) (details)
  2. [SPARK-35576][SQL] Redact the sensitive info in the result of Set (commit: 8e11f5f) (details)
  3. [SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor (commit: 7e27173) (details)
  4. [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+ (commit: 1ba1b70) (details)
  5. [SPARK-35578][SQL][TEST] Add a test case for a bug in janino (commit: bb2a074) (details)
  6. [SPARK-35433][DOCS] Move CSV data source options from Python and Scala (commit: 73d4f67) (details)
  7. [SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules (commit: 1dd0ca2) (details)
  8. [SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents (commit: fe09def) (details)
  9. [SPARK-35586][K8S][TESTS] Set a default value for (commit: e048838) (details)
  10. [SPARK-35584][CORE][TESTS] Increase the timeout in FallbackStorageSuite (commit: d773373) (details)
  11. [SPARK-35516][WEBUI] Storage UI tab Storage Level tool tip correction (commit: b7dd4b3) (details)
  12. [SPARK-35581][SQL] Support special datetime values in typed literals (commit: a59063d) (details)
  13. [SPARK-35577][TESTS] Allow to log container output for docker (commit: 08e6f63) (details)
  14. [SPARK-35402][WEBUI] Increase the max thread pool size of jetty server (commit: a127d91) (details)
  15. [SPARK-35314][PYTHON] Support arithmetic operations against bool (commit: 0ac5c16) (details)
  16. [SPARK-35589][CORE] BlockManagerMasterEndpoint should not ignore (commit: 35cfabc) (details)
  17. [SPARK-35600][TESTS] Move Set command related test cases to (commit: 6a277bb) (details)
  18. [SPARK-35539][PYTHON] Restore to_koalas to keep the backward (commit: 0ad5ae5) (details)
  19. [SPARK-35595][TESTS] Support multiple loggers in testing method (commit: 9d0d4ed) (details)
  20. [SPARK-35560][SQL] Remove redundant subexpression evaluation in nested (commit: dbf0b50) (details)
  21. [SPARK-35100][ML] Refactor AFT - support virtual centering (commit: c2de0a6) (details)
  22. [SPARK-35077][SQL] Migrate to transformWithPruning for leftover (commit: 3f6322f) (details)
  23. [SPARK-35583][DOCS] Move JDBC data source options from Python and Scala (commit: 48252ba) (details)
  24. [SPARK-35604][SQL] Fix condition check for FULL OUTER sort merge join (commit: 54e9999) (details)
  25. [SPARK-35585][SQL] Support propagate empty relation through (commit: daf9d19) (details)
  26. [SPARK-21957][SQL] Support current_user function (commit: 345d35e) (details)
  27. [SPARK-35059][SQL] Group exception messages in hive/execution (commit: 9f7cdb8) (details)
  28. [SPARK-34808][SQL][FOLLOWUP] Remove canPlanAsBroadcastHashJoin check in (commit: 8041aed) (details)
  29. [SPARK-35610][CORE] Fix the memory leak introduced by the Executor's (commit: 806edf8) (details)
Commit cd2ef9cb43f4e27302906ac2f605df4b66a72a2f by wenchen
[SPARK-35567][SQL] Fix: Explain cost is not showing statistics for all the nodes

### What changes were proposed in this pull request?
The `EXPLAIN COST` command in Spark currently doesn't show statistics for all the nodes. It misses some nodes in almost all the TPCDS queries.
In this PR, we collect all the plan nodes, including the subqueries, and compute the statistics for each node if it doesn't exist in the stats cache.
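As a rough usage sketch (not part of this PR's diff; the query here is only a placeholder), the per-node statistics can be inspected through `Dataset.explain("cost")`:

```scala
import org.apache.spark.sql.SparkSession

object ExplainCostExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explain-cost").getOrCreate()
    val df = spark.range(100).selectExpr("id", "id % 10 AS bucket").groupBy("bucket").count()
    // "cost" mode prints the optimized logical plan annotated with per-node statistics;
    // with this fix, every node (including those under subqueries) should carry them.
    df.explain("cost")
    spark.stop()
  }
}
```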

### Why are the changes needed?
**Before Fix**
For example, in Query1 the Project node doesn't have statistics:
![image](https://user-images.githubusercontent.com/23054875/120123442-868feb00-c1cc-11eb-9af9-3a87bf2117d2.png)

In Query15, the Aggregate node doesn't have statistics:

![image](https://user-images.githubusercontent.com/23054875/120123296-a4108500-c1cb-11eb-89df-7fddd651572e.png)

**After Fix:**
Query1:
![image](https://user-images.githubusercontent.com/23054875/120123559-1df53e00-c1cd-11eb-938a-53704f5240e6.png)
Query 15:
![image](https://user-images.githubusercontent.com/23054875/120123665-bb507200-c1cd-11eb-8ea2-84c732215bac.png)
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing

Closes #32704 from shahidki31/shahid/fixshowstats.

Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: cd2ef9c)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
Commit 8e11f5f00764b60c7c621aa9d53f3cef0d8709ac by dhyun
[SPARK-35576][SQL] Redact the sensitive info in the result of Set command

### What changes were proposed in this pull request?

Currently, the results of the following SQL queries are not redacted:
```
SET [KEY];
SET;
```
For example:

```
scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+

scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+

scala> spark.sql("set").show()
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|javax.jdo.option....|              123456|

```

We should hide the sensitive information and redact the query output.
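A hedged sketch of how the intended behavior can be checked (assuming the default `spark.redaction.regex`, which matches keys containing strings like "password"; the expected masking is described in comments rather than asserted as exact output):

```scala
import org.apache.spark.sql.SparkSession

object RedactSetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("redact-set").getOrCreate()
    spark.sql("SET javax.jdo.option.ConnectionPassword=123456")
    // After the fix, the value column should show a redaction placeholder instead of 123456,
    // both for SET <key> and for the parameterless SET command.
    spark.sql("SET javax.jdo.option.ConnectionPassword").show(truncate = false)
    spark.sql("SET").show(truncate = false)
    spark.stop()
  }
}
```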

### Why are the changes needed?

Security.

### Does this PR introduce _any_ user-facing change?

Yes, the sensitive information in the output of Set commands is redacted.

### How was this patch tested?

Unit test

Closes #32712 from gengliangwang/redactSet.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 8e11f5f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala (diff)
Commit 7e2717333b980b7c9a258ea3dff1619cf2351cbf by gurwls223
[SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor

### What changes were proposed in this pull request?

This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".

### Why are the changes needed?

We don't use the name "Koalas" anymore; we use "Pandas API on Spark" instead.

So, all the related code needs to be changed.

### Does this PR introduce _any_ user-facing change?

Yes, the usage of the pandas API on Spark accessor is changed from `df.koalas.[...]` to `df.pandas_on_spark.[...]`.

**Note:** `df.koalas.[...]` is still available, but with a deprecation warning.

### How was this patch tested?

Manually tested locally and checked one by one.

Closes #32674 from itholic/SPARK-35453.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 7e27173)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/strings.py (diff)
The file was modified python/pyspark/pandas/indexes/datetimes.py (diff)
The file was modified python/pyspark/pandas/datetimes.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
Commit 1ba1b70cfe24f94b882ebc2dcc6f18d8638596a2 by gurwls223
[SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+

### What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three issues, as below:

- Avoid a sort on DataFrame which isn't legitimate: https://github.com/apache/spark/pull/32709#discussion_r642458108
- Treat the empty timezone and local timezone as equivalent in SparkR: https://github.com/apache/spark/pull/32709#discussion_r642464454
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is https://github.com/apache/spark/pull/32709#discussion_r642510089

### Why are the changes needed?

Higher R versions have bug fixes and improvements. More importantly, R users tend to use the latest R versions.

### Does this PR introduce _any_ user-facing change?

Yes, SparkR will work with R 4.1.0+.

### How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...

utils:
functions in utils.R: ..............................................
```

Closes #32709 from HyukjinKwon/SPARK-35573.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 1ba1b70)
The file was modified R/pkg/tests/fulltests/test_sparkSQL_arrow.R (diff)
The file was modified R/pkg/tests/fulltests/test_utils.R (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL.R (diff)
Commit bb2a0747d242ce0d423235862dfbe55f4d505f91 by gurwls223
[SPARK-35578][SQL][TEST] Add a test case for a bug in janino

### What changes were proposed in this pull request?

This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries.

A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark.

### Why are the changes needed?

Make it easy for people to debug janino, as I'm not a janino expert.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32716 from cloud-fan/janino.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: bb2a074)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala (diff)
Commit 73d4f67145dd7fbad282a9608ac2ac0f31c4b385 by gurwls223
[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes moving CSV data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for CSV data source options has been split across the API documents of each language. This makes managing the many options inconvenient, so it is more efficient to document all the options on a single page and link to that page from each language's API.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "CSV Files" page
<img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png">

- Python
<img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png">

- Scala
<img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png">

- Java
<img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png">

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32658 from itholic/SPARK-35433.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 73d4f67)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified python/pyspark/sql/readwriter.py (diff)
The file was modified docs/sql-data-sources-csv.md (diff)
The file was modified docs/sql-data-sources-text.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
Commit 1dd0ca23f64acfc7a3dc697e19627a1b74012a2d by gengliang
[SPARK-35544][SQL] Add tree pattern pruning to Analyzer rules

### What changes were proposed in this pull request?

Added the following TreePattern enums:
- AGGREGATE_EXPRESSION
- ALIAS
- GROUPING_ANALYTICS
- GENERATOR
- HIGH_ORDER_FUNCTION
- LAMBDA_FUNCTION
- NEW_INSTANCE
- PIVOT
- PYTHON_UDF
- TIME_WINDOW
- TIME_ZONE_AWARE_EXPRESSION
- UP_CAST
- COMMAND
- EVENT_TIME_WATERMARK
- UNRESOLVED_RELATION
- WITH_WINDOW_DEFINITION
- UNRESOLVED_ALIAS
- UNRESOLVED_ATTRIBUTE
- UNRESOLVED_DESERIALIZER
- UNRESOLVED_ORDINAL
- UNRESOLVED_FUNCTION
- UNRESOLVED_HINT
- UNRESOLVED_SUBQUERY_COLUMN_ALIAS
- UNRESOLVED_FUNC

Added tree pattern pruning to the following Analyzer rules:
- ResolveBinaryArithmetic
- WindowsSubstitution
- ResolveAliases
- ResolveGroupingAnalytics
- ResolvePivot
- ResolveOrdinalInOrderByAndGroupBy
- LookupFunction
- ResolveSubquery
- ResolveSubqueryColumnAliases
- ApplyCharTypePadding
- UpdateOuterReferences
- ResolveCreateNamedStruct
- TimeWindowing
- CleanupAliases
- EliminateUnions
- EliminateSubqueryAliases
- HandleAnalysisOnlyCommand
- ResolveNewInstances
- ResolveUpCast
- ResolveDeserializer
- ResolveOutputRelation
- ResolveEncodersInUDF
- HandleNullInputsForUDF
- ResolveGenerate
- ExtractGenerator
- GlobalAggregates
- ResolveAggregateFunctions

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.
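The mechanism can be illustrated with a minimal, hedged sketch (the rule below is made up for illustration and is not one of the rules listed above): a rule guards its traversal with a `TreePattern` condition, so subtrees whose pattern bits show the pattern cannot occur are skipped without being visited.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnresolvedHint}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.UNRESOLVED_HINT

// Hypothetical rule: drop hints named "IGNORE_ME", but only traverse subtrees
// that can actually contain an UnresolvedHint node.
object DropIgnoredHints extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformUpWithPruning(_.containsPattern(UNRESOLVED_HINT)) {
      case hint: UnresolvedHint if hint.name.equalsIgnoreCase("IGNORE_ME") => hint.child
    }
}
```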

### How was this patch tested?

Existing tests.
Performance diff:
Rule | Baseline | Experiment | Experiment/Baseline
-- | -- | -- | --
ResolveBinaryArithmetic | 43264874 | 34707117 | 0.80
WindowsSubstitution | 3322996 | 2734192 | 0.82
ResolveAliases | 24859263 | 21359941 | 0.86
ResolveGroupingAnalytics | 39249143 | 25417569 | 0.80
ResolvePivot | 6393408 | 2843314 | 0.44
ResolveOrdinalInOrderByAndGroupBy | 10750806 | 3386715 | 0.32
LookupFunction | 22087384 | 15481294 | 0.70
ResolveSubquery | 1129139340 | 944402323 | 0.84
ResolveSubqueryColumnAliases | 5055038 | 2808210 | 0.56
ApplyCharTypePadding | 76285576 | 63785681 | 0.84
UpdateOuterReferences | 6548321 | 3092539 | 0.47
ResolveCreateNamedStruct | 38111477 | 17350249 | 0.46
TimeWindowing | 41694190 | 3739134 | 0.09
CleanupAliases | 48683506 | 39584921 | 0.81
EliminateUnions | 3405069 | 2372506 | 0.70
EliminateSubqueryAliases | 9626649 | 9518216 | 0.99
HandleAnalysisOnlyCommand | 2562123 | 2661432 | 1.04
ResolveNewInstances | 16208966 | 1982314 | 0.12
ResolveUpCast | 14067843 | 1868615 | 0.13
ResolveDeserializer | 17991103 | 2320308 | 0.13
ResolveOutputRelation | 5815277 | 2088787 | 0.36
ResolveEncodersInUDF | 14182892 | 1045113 | 0.07
HandleNullInputsForUDF | 19850838 | 1329528 | 0.07
ResolveGenerate | 5587345 | 1953192 | 0.35
ExtractGenerator | 120378046 | 3386286 | 0.03
GlobalAggregates | 16510455 | 13553155 | 0.82
ResolveAggregateFunctions | 1041848509 | 828049280 | 0.79


Closes #32686 from sigmod/analyzer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 1dd0ca2)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinals.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TimeWindow.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DeduplicateRelations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCommandsWithIfExists.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UpdateAttributeNullability.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Command.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/EventTimeWatermark.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/timeZoneAnalysis.scala (diff)
Commit fe09def3231600e52cb58aaba5f72af33ab4dc33 by gurwls223
[SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents

### What changes were proposed in this pull request?

This PR aims to move `# noqa` in the Python docstrings to the proper place so that they are hidden from the official documents.

### Why are the changes needed?

If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit weird, as below:
<img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png">

### Does this PR introduce _any_ user-facing change?

Yes, the `# noqa` is no longer shown in the documents, as below:
<img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png">

### How was this patch tested?

Manually build docs and check.

Closes #32728 from itholic/SPARK-35582.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: fe09def)
The file was modified python/pyspark/sql/functions.py (diff)
Commit e04883880ff8735dbc347b73447353f57afe3bf8 by dhyun
[SPARK-35586][K8S][TESTS] Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests

### What changes were proposed in this pull request?

This PR sets a default value for `spark.kubernetes.test.sparkTgz` in `kubernetes/integration-tests/pom.xml` for Kubernetes integration tests.

### Why are the changes needed?

In the current master, running the integration tests with the following command will fail because there is no default value set for the property.
```
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -Psparkr  -pl resource-managers/kubernetes/integration-tests integration-test
```
```
+ mkdir -p /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
```

According to `setup-integration-test-env.sh`, `N/A` is intended as the default value, so this PR chooses it.
```
SPARK_TGZ="N/A"
MVN="$TEST_ROOT_DIR/build/mvn"
EXCLUDE_TAGS=""
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The build and tests finish successfully with the command shown above.

Closes #32722 from sarutak/fix-pom-for-kube-integ.

Authored-by: Kousuke <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: e048838)
The file was modified resource-managers/kubernetes/integration-tests/pom.xml (diff)
Commit d77337307489978b0523fcca27e60093bcc2df38 by dhyun
[SPARK-35584][CORE][TESTS] Increase the timeout in FallbackStorageSuite

### What changes were proposed in this pull request?
```
- Upload multi stages *** FAILED ***
The code passed to eventually never returned normally. Attempted 20 times over 10.011176743 seconds. Last failure message: fallbackStorage.exists(0, file) was false. (FallbackStorageSuite.scala:243)
```
The error above was raised randomly on aarch64 and also in GitHub Actions tests [1][2].

[1] https://github.com/apache/spark/actions/runs/489319612
[2] https://github.com/apache/spark/actions/runs/479317320

### Why are the changes needed?
The timeout is too short; it needs to be increased so the test case can complete.
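A hedged sketch of the kind of change (the values and the simulated condition are placeholders, not the actual diff): widen the patience of ScalaTest's `eventually` so slower environments have time to finish.

```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object TimeoutSketch {
  def main(args: Array[String]): Unit = {
    val finishAt = System.currentTimeMillis() + 15000 // simulates a slow upload (~15s)
    // ~10s of patience was flaky on aarch64 / GitHub Actions; a larger timeout lets it pass.
    eventually(timeout(30.seconds), interval(1.second)) {
      assert(System.currentTimeMillis() >= finishAt, "upload has not finished yet")
    }
  }
}
```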

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.storage.FallbackStorageSuite -pl :spark-core_2.12

Closes #32719 from Yikun/SPARK-35584.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: d773373)
The file was modified core/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala (diff)
Commit b7dd4b37e5657070357c4e20bd2a26d754e5b3d1 by sarutak
[SPARK-35516][WEBUI] Storage UI tab Storage Level tool tip correction

### What changes were proposed in this pull request?
Fixed tooltip for "Storage" tab in UI

### Why are the changes needed?
Tooltip correction was needed

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Manually tested

Closes #32664 from lidiyag/storagewebui.

Authored-by: lidiyag <lidiya.nixon@huawei.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: b7dd4b3)
The file was modified core/src/main/scala/org/apache/spark/ui/storage/ToolTips.scala (diff)
Commit a59063d5446a78d713c19a0d43f86c9f72ffd77d by max.gekk
[SPARK-35581][SQL] Support special datetime values in typed literals only

### What changes were proposed in this pull request?
In the PR, I propose to support the special datetime values introduced by #25708 and #25716 only in typed literals, and not recognize them when parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.

For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```

Similarly, the following special date values are supported only in typed date literals:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.

For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
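A hedged sketch of the behavioral difference (assuming non-ANSI casting, where an unparsable string becomes NULL; this is not taken from the PR's tests):

```scala
import org.apache.spark.sql.SparkSession

object SpecialDatetimeLiterals {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("special-datetimes").getOrCreate()
    // Typed literal: still resolved to a concrete date when the query is compiled.
    spark.sql("SELECT DATE 'tomorrow' AS d").show()
    // Plain string cast: after this change 'tomorrow' is no longer a special value,
    // so under non-ANSI casting this yields NULL instead of tomorrow's date.
    spark.sql("SELECT CAST('tomorrow' AS DATE) AS d").show()
    spark.stop()
  }
}
```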

### Why are the changes needed?
In the current implementation, Spark supports the special date/timestamp values in any input string cast to dates/timestamps, which leads to the following problems:
- If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed.
- The special values play the role of distributed non-deterministic functions though users might think of the values as constants.

### Does this PR introduce _any_ user-facing change?
Yes but the probability should be small.

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #32714 from MaxGekk/remove-datetime-special-values.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: a59063d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateFormatterSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DatetimeFormatterSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/timestamp.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateFormatter.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HashExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClient.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/timestamp.sql.out (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/UnsafeArraySuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
Commit 08e6f633b5bc3a7d8d008db2a264b1607d269f25 by yamamuro
[SPARK-35577][TESTS] Allow to log container output for docker integration tests

### What changes were proposed in this pull request?

This PR proposes to add a feature that logs container output for docker integration tests.
With this change, if we run tests with SBT, we will get a log like the following in `unit-tests.log`.
```
===== CONTAINER LOGS FOR container Id: 3360c98eb28337d8b217fb614e47bf49aafa18a6cb60ecadf3178aee0c663021 =====
21/05/31 20:54:56.433 pool-1-thread-1 INFO PostgresIntegrationSuite: The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
sh: locale: not found
2021-05-31 11:54:49.892 UTC [29] WARNING:  no usable system locales were found
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok

Success. You can now start the database server using:

    pg_ctl -D /var/lib/postgresql/data -l logfile start

waiting for server to start....2021-05-31 11:54:50.284 UTC [34] LOG:  starting PostgreSQL 13.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
2021-05-31 11:54:50.287 UTC [34] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-05-31 11:54:50.296 UTC [35] LOG:  database system was shut down at 2021-05-31 11:54:50 UTC
2021-05-31 11:54:50.301 UTC [34] LOG:  database system is ready to accept connections
done
server started

/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*

waiting for server to shut down....2021-05-31 11:54:50.363 UTC [34] LOG:  received fast shutdown request
2021-05-31 11:54:50.366 UTC [34] LOG:  aborting any active transactions
2021-05-31 11:54:50.368 UTC [34] LOG:  background worker "logical replication launcher" (PID 41) exited with exit code 1
2021-05-31 11:54:50.368 UTC [36] LOG:  shutting down
2021-05-31 11:54:50.402 UTC [34] LOG:  database system is shut down
done
server stopped

PostgreSQL init process complete; ready for start up.

2021-05-31 11:54:50.510 UTC [1] LOG:  starting PostgreSQL 13.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
2021-05-31 11:54:50.510 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-05-31 11:54:50.510 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2021-05-31 11:54:50.517 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-05-31 11:54:50.526 UTC [43] LOG:  database system was shut down at 2021-05-31 11:54:50 UTC
2021-05-31 11:54:50.531 UTC [1] LOG:  database system is ready to accept connections
2021-05-31 11:54:54.226 UTC [54] ERROR:  relation "public.barcopy" does not exist at character 15
2021-05-31 11:54:54.226 UTC [54] STATEMENT:  SELECT 1 FROM public.barcopy LIMIT 1
2021-05-31 11:54:54.610 UTC [59] ERROR:  relation "public.barcopy2" does not exist at character 15
2021-05-31 11:54:54.610 UTC [59] STATEMENT:  SELECT 1 FROM public.barcopy2 LIMIT 1
2021-05-31 11:54:54.934 UTC [63] ERROR:  relation "shortfloat" does not exist at character 15
2021-05-31 11:54:54.934 UTC [63] STATEMENT:  SELECT 1 FROM shortfloat LIMIT 1
2021-05-31 11:54:55.675 UTC [75] ERROR:  relation "byte_to_smallint_test" does not exist at character 15
2021-05-31 11:54:55.675 UTC [75] STATEMENT:  SELECT 1 FROM byte_to_smallint_test LIMIT 1

21/05/31 20:54:56.434 pool-1-thread-1 INFO PostgresIntegrationSuite:

===== END OF CONTAINER LOGS FOR container Id: 3360c98eb28337d8b217fb614e47bf49aafa18a6cb60ecadf3178aee0c663021 =====
```

### Why are the changes needed?

If we have container logs, they are useful for debugging, especially for GitHub Actions (GA).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I ran the docker integration tests and got logs. The example shown above is for `PostgresIntegrationSuite`.

Closes #32715 from sarutak/log-docker-container.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 08e6f63)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
Commit a127d912927f17b7754b89f4e315a5e3fb6461d4 by yao
[SPARK-35402][WEBUI] Increase the max thread pool size of jetty server in HistoryServer UI

### What changes were proposed in this pull request?

For different UIs, e.g. the History Server or the Spark live UI, we may need different capacities to handle HTTP requests. Usually, a History Server serves multiple users and needs more threads to increase concurrency, while a live UI is per application and doesn't need such a large pool size.

In this PR, we increase the max thread pool size of the History Server's jetty backend.
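A standalone, hedged sketch of the sizing idea using plain Jetty (the numbers are placeholders, not the values used in this PR):

```scala
import org.eclipse.jetty.util.thread.QueuedThreadPool

object PoolSizingSketch {
  def main(args: Array[String]): Unit = {
    // A multi-user History Server gets a larger worker pool than a per-application live UI.
    val historyServerPool = new QueuedThreadPool(512)
    val liveUiPool = new QueuedThreadPool(200)
    println(s"history=${historyServerPool.getMaxThreads}, live=${liveUiPool.getMaxThreads}")
  }
}
```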

### Why are the changes needed?

Increase the client concurrency of the HistoryServer.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #32539 from yaooqinn/SPARK-35402.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: a127d91)
The file was modified core/src/test/scala/org/apache/spark/ui/UISuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/WebUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/JettyUtils.scala (diff)
Commit 0ac5c16177156e7720ed3e237761117655f2edfc by ueshin
[SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin

### What changes were proposed in this pull request?

Support arithmetic operations against bool IndexOpsMixin.

### Why are the changes needed?

Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors.

pandas takes True as 1 and False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index, whereas Koalas raises an AnalysisException no matter what the binary operation is.

We aim to match pandas' behaviors.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> 1 + psser
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> ps.Series([1, 2, 3]) + psser
Traceback (most recent call last):
...
TypeError: addition can not be applied to given types.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
0    2
1    2
2    1
dtype: int64
>>> 1 + psser
0    2
1    2
2    1
dtype: int64
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
0    2
1    3
2    3
dtype: int64
>>> ps.Series([1, 2, 3]) + psser
0    2
1    3
2    3
dtype: int64

```

### How was this patch tested?

Unit tests.

Closes #32611 from xinrong-databricks/datatypeop_arith_bool.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 0ac5c16)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
Commit 35cfabcf5c8a9a7a8f05e562080c83e3a4a959c9 by dhyun
[SPARK-35589][CORE] BlockManagerMasterEndpoint should not ignore index-only shuffle file during updating

### What changes were proposed in this pull request?

This PR aims to make `BlockManagerMasterEndpoint.updateBlockInfo` not ignore index-only shuffle files.
In addition, this PR fixes `IndexShuffleBlockResolver.getMigrationBlocks` to return data files first.

### Why are the changes needed?

When [SPARK-20629](https://github.com/apache/spark/commit/a4ca355af8556e8c5948e492ef70ef0b48416dc4) introduced worker decommissioning, index-only shuffle files were not considered properly.
- SPARK-33198 fixed `getMigrationBlocks` to handle index only shuffle files
- SPARK-35589 (this) aims to fix `updateBlockInfo` to handle index only shuffle files.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #32727 from dongjoon-hyun/SPARK-UPDATE-OUTPUT.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 35cfabc)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala (diff)
Commit 6a277bb7c62a7979f49b4c8df8d1c329543a5542 by gurwls223
[SPARK-35600][TESTS] Move Set command related test cases to SetCommandSuite

### What changes were proposed in this pull request?

Move `Set` command related test cases from `SQLQuerySuite` to a new test suite `SetCommandSuite`. There are 7 test cases in total.

### Why are the changes needed?

Code refactoring. `SQLQuerySuite` is becoming big.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #32732 from gengliangwang/setsuite.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 6a277bb)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/SetCommandSuite.scala
Commit 0ad5ae54b2814c41a3829ba9716c7fd8d7a51a6a by gurwls223
[SPARK-35539][PYTHON] Restore to_koalas to keep the backward compatibility

### What changes were proposed in this pull request?

This PR proposes restoring `to_koalas` to keep backward compatibility, with a deprecation warning thrown.

### Why are the changes needed?

If we remove `to_koalas`, existing Koalas code that includes `to_koalas` wouldn't work.

### Does this PR introduce _any_ user-facing change?

No. It's restoring the existing functionality.

### How was this patch tested?

Manually tested in local.

```shell
>>> sdf.to_koalas()
.../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead.
  warnings.warn(
```

Closes #32729 from itholic/SPARK-35539.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 0ad5ae5)
The file was modified python/pyspark/pandas/__init__.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
Commit 9d0d4edb43d51e151d3eb0b00e5acc3f5e00516d by gengliang
[SPARK-35595][TESTS] Support multiple loggers in testing method withLogAppender

### What changes were proposed in this pull request?

A test case of AdaptiveQueryExecSuite becomes flaky since there are too many debug logs in RootLogger:
https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=true
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/

To fix it, I suggest supporting multiple loggers in the testing method `withLogAppender`, so that the LogAppender gets clean target log output.
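A self-contained, hedged illustration of the idea with log4j 1.x (which Spark used at the time; the logger names are arbitrary examples, and this is not the actual `withLogAppender` change): attach one appender to a few specific loggers instead of the root logger, so only targeted output is captured.

```scala
import java.io.StringWriter
import org.apache.log4j.{Level, Logger, SimpleLayout, WriterAppender}

object TargetedLogCapture {
  def main(args: Array[String]): Unit = {
    val writer = new StringWriter()
    val appender = new WriterAppender(new SimpleLayout(), writer)
    // Attach to specific loggers rather than the root logger, so unrelated DEBUG
    // output from the rest of the JVM does not flood the appender.
    Seq("org.apache.spark.sql.execution.adaptive", "org.apache.spark.sql.catalyst")
      .map(name => Logger.getLogger(name))
      .foreach { logger =>
        logger.setLevel(Level.DEBUG)
        logger.addAppender(appender)
      }
    Logger.getLogger("org.apache.spark.sql.execution.adaptive").debug("captured")
    println(writer.toString)
  }
}
```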

### Why are the changes needed?

Fix a flaky test case.
Also, reduce unnecessary memory cost in tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #32725 from gengliangwang/fixFlakyLogAppender.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 9d0d4ed)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/SparkFunSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit dbf0b50757cecd6fe49276836eb3a045aedc5736 by viirya
[SPARK-35560][SQL] Remove redundant subexpression evaluation in nested subexpressions

### What changes were proposed in this pull request?

This patch proposes to improve subexpression evaluation under whole-stage codegen for the cases of nested subexpressions.

### Why are the changes needed?

In the cases of nested subexpressions, whole-stage codegen's subexpression elimination will do redundant subexpression evaluation. We should reduce it. For example, if we have two sub-exprs:

1. `simpleUDF($"id")`
2. `functions.length(simpleUDF($"id"))`

We should only evaluate `simpleUDF($"id")` once, i.e.

```java
subExpr1 = simpleUDF($"id");
subExpr2 = functions.length(subExpr1);
```
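A hedged sketch of a query shape that produces such nested subexpressions (`simpleUDF` here is a stand-in defined locally, mirroring the name used in the description):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{length, udf}

object NestedSubexprSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nested-subexpr").getOrCreate()
    import spark.implicits._

    val simpleUDF = udf((x: Long) => x.toString)
    val df = spark.range(10).select(simpleUDF($"id"), length(simpleUDF($"id")))
    // Inspect the whole-stage generated code; with this change the inner
    // simpleUDF($"id") should be evaluated once and reused by length(...).
    df.explain("codegen")
    spark.stop()
  }
}
```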

Snippets of generated code:

Before:
```java
/* 040 */   private int project_subExpr_1(long project_expr_0_0) {
/* 041 */     boolean project_isNull_6 = false;
/* 042 */     UTF8String project_value_6 = null;
/* 043 */     if (!false) {
/* 044 */       project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0));
/* 045 */     }
/* 046 */
/* 047 */     Object project_arg_1 = null;
/* 048 */     if (project_isNull_6) {
/* 049 */       project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(null);
/* 050 */     } else {
/* 051 */       project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(project_value_6);
/* 052 */     }
/* 053 */
/* 054 */     UTF8String project_result_1 = null;
/* 055 */     try {
/* 056 */       project_result_1 = (UTF8String)((scala.Function1[]) references[3] /* converters */)[1].apply(((scala.Function1) references[4] /* udf */).apply(project_arg_1));
/* 057 */     } catch (Throwable e) {
/* 058 */       throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError(
/* 059 */         "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e);
/* 060 */     }
/* 061 */
/* 062 */     boolean project_isNull_5 = project_result_1 == null;
/* 063 */     UTF8String project_value_5 = null;
/* 064 */     if (!project_isNull_5) {
/* 065 */       project_value_5 = project_result_1;
/* 066 */     }
/* 067 */     boolean project_isNull_4 = project_isNull_5;
/* 068 */     int project_value_4 = -1;
/* 069 */
/* 070 */     if (!project_isNull_5) {
/* 071 */       project_value_4 = (project_value_5).numChars();
/* 072 */     }
/* 073 */     project_subExprIsNull_1 = project_isNull_4;
/* 074 */     return project_value_4;
/* 075 */   }
...
/* 149 */   private UTF8String project_subExpr_0(long project_expr_0_0) {
/* 150 */     boolean project_isNull_2 = false;
/* 151 */     UTF8String project_value_2 = null;
/* 152 */     if (!false) {
/* 153 */       project_value_2 = UTF8String.fromString(String.valueOf(project_expr_0_0));
/* 154 */     }
/* 155 */
/* 156 */     Object project_arg_0 = null;
/* 157 */     if (project_isNull_2) {
/* 158 */       project_arg_0 = ((scala.Function1[]) references[1] /* converters */)[0].apply(null);
/* 159 */     } else {
/* 160 */       project_arg_0 = ((scala.Function1[]) references[1] /* converters */)[0].apply(project_value_2);
/* 161 */     }
/* 162 */
/* 163 */     UTF8String project_result_0 = null;
/* 164 */     try {
/* 165 */       project_result_0 = (UTF8String)((scala.Function1[]) references[1] /* converters */)[1].apply(((scala.Function1) references[2] /* udf */).apply(project_arg_0));
/* 166 */     } catch (Throwable e) {
/* 167 */       throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError(
/* 168 */         "DataFrameSuite$$Lambda$6418/1507986601", "string", "string", e);
/* 169 */     }
/* 170 */
/* 171 */     boolean project_isNull_1 = project_result_0 == null;
/* 172 */     UTF8String project_value_1 = null;
/* 173 */     if (!project_isNull_1) {
/* 174 */       project_value_1 = project_result_0;
/* 175 */     }
/* 176 */     project_subExprIsNull_0 = project_isNull_1;
/* 177 */     return project_value_1;
/* 178 */   }
```

After:
```java
/* 041 */   private void project_subExpr_1(long project_expr_0_0) {
/* 042 */     boolean project_isNull_8 = project_subExprIsNull_0;
/* 043 */     int project_value_8 = -1;
/* 044 */
/* 045 */     if (!project_subExprIsNull_0) {
/* 046 */       project_value_8 = (project_mutableStateArray_0[0]).numChars();
/* 047 */     }
/* 048 */     project_subExprIsNull_1 = project_isNull_8;
/* 049 */     project_subExprValue_0 = project_value_8;
/* 050 */   }
/* 056 */
...
/* 123 */
/* 124 */   private void project_subExpr_0(long project_expr_0_0) {
/* 125 */     boolean project_isNull_6 = false;
/* 126 */     UTF8String project_value_6 = null;
/* 127 */     if (!false) {
/* 128 */       project_value_6 = UTF8String.fromString(String.valueOf(project_expr_0_0));
/* 129 */     }
/* 130 */
/* 131 */     Object project_arg_1 = null;
/* 132 */     if (project_isNull_6) {
/* 133 */       project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(null);
/* 134 */     } else {
/* 135 */       project_arg_1 = ((scala.Function1[]) references[3] /* converters */)[0].apply(project_value_6);
/* 136 */     }
/* 137 */
/* 138 */     UTF8String project_result_1 = null;
/* 139 */     try {
/* 140 */       project_result_1 = (UTF8String)((scala.Function1[]) references[3] /* converters */)[1].apply(((scala.Function1) references[4] /* udf */).apply(project_arg_1));
/* 141 */     } catch (Throwable e) {
/* 142 */       throw QueryExecutionErrors.failedExecuteUserDefinedFunctionError(
/* 143 */         "DataFrameSuite$$Lambda$6430/2004847941", "string", "string", e);
/* 144 */     }
/* 145 */
/* 146 */     boolean project_isNull_5 = project_result_1 == null;
/* 147 */     UTF8String project_value_5 = null;
/* 148 */     if (!project_isNull_5) {
/* 149 */       project_value_5 = project_result_1;
/* 150 */     }
/* 151 */     project_subExprIsNull_0 = project_isNull_5;
/* 152 */     project_mutableStateArray_0[0] = project_value_5;
/* 153 */   }
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #32699 from viirya/improve-subexpr.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: dbf0b50)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
Commit c2de0a64e9dcca2e887af29595b903442db3032d by ruifengz
[SPARK-35100][ML] Refactor AFT - support virtual centering

### What changes were proposed in this pull request?
1, remove old agg;
2, apply new agg supporting virtual centering

### Why are the changes needed?
for better convergence

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes #32199 from zhengruifeng/refactor_aft.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(commit: c2de0a6)
The file was removed mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala
The file was added mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTBlockAggregator.scala
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff)
Commit 3f6322f9aaaa551119d8b6b713e45a6a58d41fdf by gengliang
[SPARK-35077][SQL] Migrate to transformWithPruning for leftover optimizer rules

### What changes were proposed in this pull request?

Migrate to transformWithPruning for the following rules:
- SimplifyExtractValueOps
- NormalizeFloatingNumbers
- PushProjectionThroughUnion
- PushDownPredicates
- ExtractPythonUDFFromAggregate
- ExtractPythonUDFFromJoinCondition
- ExtractGroupingPythonUDFFromAggregate
- ExtractPythonUDFs
- CleanupDynamicPruningFilters


### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.
Performance diff:
Rule | Baseline | Experiment | Experiment/Baseline
-- | -- | -- | --
SimplifyExtractValueOps | 99367049 | 3679579 | 0.04
NormalizeFloatingNumbers | 24717928 | 20451094 | 0.83
PushProjectionThroughUnion | 14130245 | 7913551 | 0.56
PushDownPredicates | 276333542 | 261246842 | 0.95
ExtractPythonUDFFromAggregate | 6459451 | 2683556 | 0.42
ExtractPythonUDFFromJoinCondition | 5695404 | 2504573 | 0.44
ExtractGroupingPythonUDFFromAggregate | 5546701 | 1858755 | 0.34
ExtractPythonUDFs | 58726458 | 1598518 | 0.03
CleanupDynamicPruningFilters | 26606652 | 15417936 | 0.58
OptimizeSubqueries | 3072287940 | 2876462708 | 0.94


Closes #32721 from sigmod/pushdown.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 3f6322f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/CleanupDynamicPruningFilters.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
Commit 48252bac95b66a22f4a3449b7e953716c6ed4fec by gurwls223
[SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes to move the missing JDBC data source options from the Python, Scala, and Java documentation into a single page.

### Why are the changes needed?

So far, the documentation for JDBC data source options has been separated into different pages for each language API. However, this makes it inconvenient to manage the many options, so it is more efficient to document all options on a single page and provide a link to that page from each language's API documentation.
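
For context, a hedged Scala sketch of how these options are used when reading over JDBC (the connection details are placeholders and `spark` is assumed to be an existing `SparkSession`; the option keys are the standard JDBC data source options that the unified page documents):

```
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb") // placeholder URL
  .option("dbtable", "public.employees")                     // placeholder table
  .option("user", "username")
  .option("password", "password")
  .load()
```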

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown below after this change:

- "JDBC To Other Databases" page
<img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png">

- Python
![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png)

- Scala
![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png)

- Java
![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png)

### How was this patch tested?

Manually build docs and confirm the page.

Closes #32723 from itholic/SPARK-35583.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 48252ba)
The file was modified python/pyspark/sql/readwriter.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified docs/sql-data-sources-jdbc.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
Commit 54e9999d39823c4fc236f328fe55e46607515cd0 by ltnwgl
[SPARK-35604][SQL] Fix condition check for FULL OUTER sort merge join

### What changes were proposed in this pull request?

The condition check for FULL OUTER sort merge join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1368) makes an unnecessary extra pass when `leftIndex == leftMatches.size` or `rightIndex == rightMatches.size`. Though this does not affect correctness (`scanNextInBuffered()` returns false anyway), we can avoid the extra work in the first place.

### Why are the changes needed?

Better readability for developers, and it avoids unnecessary execution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests, such as `OuterJoinSuite.scala`.

Closes #32736 from c21/join-bug.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: 54e9999)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala (diff)
Commit daf9d198dc9b71dbc79d7efa24c9336ed700e80c by wenchen
[SPARK-35585][SQL] Support propagate empty relation through project/filter

### What changes were proposed in this pull request?

Add the rule `ConvertToLocalRelation` to the AQE Optimizer.

### Why are the changes needed?

Support propagating an empty local relation through project and filter, for plan shapes like this:
```
Aggregate
  Project
    Join
      ShuffleStage
      ShuffleStage
```
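
For illustration, a hedged DataFrame sketch of the plan shape above (table and column names are made up, and `spark` is assumed to be an existing `SparkSession`): if one of the shuffle stages turns out to be empty at runtime, the join and the project/filter above it can be folded into an empty local relation instead of being executed.

```
import spark.implicits._

// Hypothetical tables: if t2's shuffle stage produces zero rows at runtime,
// the join, the filter/project above it, and the aggregate's input can all be
// replaced by an empty local relation.
val result = spark.table("t1")
  .join(spark.table("t2"), "key")
  .filter($"value" > 0)
  .select($"key")
  .groupBy("key")
  .count()
```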

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #32724 from ulysses-you/SPARK-35585.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: daf9d19)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit 345d35ed1aa508bd2228603f94cc8aa0ad998906 by wenchen
[SPARK-21957][SQL] Support current_user function

### What changes were proposed in this pull request?

Currently, we do not have a suitable definition of the `user` concept in Spark. We only have an application-wide `sparkUser`, but we do not support identifying or retrieving the user of a session in the Spark Thrift Server (STS) or of a runtime query execution.

`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.

This PR adds `current_user()` as a SQL function. In this PR, we define it without ambiguity:
1. For a normal single-threaded Spark application, the `sparkUser` is clearly always equivalent to `current_user()`.
2. For a multi-threaded Spark application, e.g. the Spark Thrift Server, we use a `ThreadLocal` variable to store the client-side user (after authentication) before running the query and retrieve it in the parser, as sketched below.
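
A minimal sketch of that `ThreadLocal` pattern (illustrative only, not the actual `CurrentUserContext` added by this PR):

```
// The serving thread sets the authenticated user before running the query;
// resolving current_user() reads it and falls back to the app-wide sparkUser
// when nothing is set (e.g. a plain single-threaded application).
object CurrentUserSketch {
  private val currentUser = new ThreadLocal[String] // null when unset

  def withUser[T](user: String)(body: => T): T = {
    currentUser.set(user)
    try body finally currentUser.remove()
  }

  def resolve(sparkUser: String): String =
    Option(currentUser.get()).getOrElse(sparkUser)
}

// A thrift-server-like handler would wrap execution per authenticated session:
// CurrentUserSketch.withUser("alice") { /* parse and run the query here */ }
```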

### Why are the changes needed?

`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.

### Does this PR introduce _any_ user-facing change?

Yes, added `current_user()` as a SQL function.

### How was this patch tested?

new tests in thrift server and sql/catalyst

Closes #32718 from yaooqinn/SPARK-21957.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 345d35e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/MiscFunctionsSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SharedThriftServer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkOperation.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CurrentUserContext.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
Commit 9f7cdb89f7614ef38ba2d8877d4c0f87ad9b6f5f by wenchen
[SPARK-35059][SQL] Group exception messages in hive/execution

### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`.
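
A hedged illustration of the grouping pattern (the object, methods, and messages below are examples, not the exact ones added by this PR):

```
// Call sites obtain their exceptions from a central errors object instead of
// formatting ad-hoc messages inline, keeping the wording consistent.
object HiveExecutionErrorsExample {
  def cannotWriteToTableError(table: String): Throwable =
    new RuntimeException(s"Cannot write to Hive table: $table")

  def unsupportedSerdeError(serde: String): Throwable =
    new UnsupportedOperationException(s"Unsupported Hive SerDe: $serde")
}

// At a call site:
// throw HiveExecutionErrorsExample.cannotWriteToTableError("db.tbl")
```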

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32694 from beliefer/SPARK-35059.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9f7cdb8)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/SaveAsHiveFile.scala (diff)
Commit 8041aed296cf2880db1f1864cd39f7fff80862c8 by wenchen
[SPARK-34808][SQL][FOLLOWUP] Remove canPlanAsBroadcastHashJoin check in EliminateOuterJoin

### What changes were proposed in this pull request?

This PR removes the `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin`.

### Why are the changes needed?

We can always remove the outer join if it only has DISTINCT on the streamed side.
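
A hedged example of a query shape where this applies (my reading of the rule; the tables are hypothetical and `spark` is assumed to be an existing `SparkSession`): only the streamed (left) side's columns are selected under DISTINCT, so the left outer join cannot change the result and can be removed whether or not it could be planned as a broadcast hash join.

```
// Assuming t1 and t2 are registered tables; the LEFT OUTER JOIN below can be
// eliminated because the output is a DISTINCT projection of t1 columns only.
val distinctIds = spark.sql(
  """SELECT DISTINCT t1.id
    |FROM t1
    |LEFT OUTER JOIN t2 ON t1.id = t2.id""".stripMargin)
```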

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32744 from wangyum/SPARK-34808-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8041aed)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala (diff)
Commit 806edf8f4460b81e5b71ea00f83c14c4f8134bd4 by dhyun
[SPARK-35610][CORE] Fix the memory leak introduced by the Executor's stop shutdown hook

### What changes were proposed in this pull request?

Fixing the memory leak by deregistering the shutdown hook when the executor is stopped. This way the garbage collector can release the executor object early, which is a huge win for our tests: the user's classloader, which keeps references to objects created for the jars on the classpath, can also be released.
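
A minimal, generic sketch of the idea (not the actual `Executor` code): keep a handle to the registered hook and deregister it in `stop()` so the hook no longer pins the stopped object until JVM exit.

```
// Hypothetical service standing in for the executor: the shutdown hook holds
// a strong reference to `this`, so without removing it in stop() the instance
// (and anything it references, such as a user classloader) cannot be collected.
class StoppableService {
  private val shutdownHook = new Thread(() => doStop())
  Runtime.getRuntime.addShutdownHook(shutdownHook)

  def stop(): Unit = {
    try Runtime.getRuntime.removeShutdownHook(shutdownHook)
    catch { case _: IllegalStateException => () } // JVM already shutting down
    doStop()
  }

  private def doStop(): Unit = {
    // release resources here
  }
}
```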

### Why are the changes needed?

I have identified this leak by running the Livy tests (I know it is close to the attic, but this leak causes a constant OOM there), and it is present in our Spark unit tests as well.

This leak can be identified by checking the number of `LeakyEntry` instances for Scala 2.12.14 (`ZipEntry` for Scala 2.12.10), which, together with their related data, can take up a considerable amount of memory (as those are created from the jars on the classpath).

I have my own tool for instrumenting JVM code, [trace-agent](https://github.com/attilapiros/trace-agent), and with it I am able to call JVM diagnostic commands at specific methods. Let me show it in action.

The tool's jar has a single embedded text file called actions.txt.
In this case the content of actions.txt is:

```
$ unzip -q -c trace-agent-0.0.7.jar actions.txt
diagnostic_command org.apache.spark.repl.ReplSuite runInterpreter  cmd:gcClassHistogram,limit_output_lines:8,where:beforeAndAfter,with_gc:true
diagnostic_command org.apache.spark.repl.ReplSuite afterAll  cmd:gcClassHistogram,limit_output_lines:8,where:after,with_gc:true
```

This creates a class histogram at the beginning and at the end of `org.apache.spark.repl.ReplSuite#runInterpreter()` (after triggering a GC, which might not finish as GC is done in a separate thread) and one histogram at the end of the `org.apache.spark.repl.ReplSuite#afterAll()` method.

The histograms on the master branch are the following:

```
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" |grep "ZipEntry\|LeakyEntry"
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
```

The header of the table is:

```
num     #instances         #bytes  class name
```

So the `LeakyEntry` instances in the end take about 75MB (173MB for Scala 2.12.10 and before, where the class is called `ZipEntry`), but the first item (char/byte arrays) and the second item (strings) in the histogram also relate to this leak:

```
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" |grep "1:\|2:\|3:"
   1:          2701        3496112  [B
   2:         21855        2607192  [C
   3:          4885         537264  java.lang.Class
   1:        480323       55970208  [C
   2:        480499       11531976  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        481825       56148024  [C
   2:        481998       11567952  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487056       57550344  [C
   2:        487179       11692296  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487054       57551008  [C
   2:        487176       11692224  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927823      107139160  [C
   2:        928072       22273728  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927793      107129328  [C
   2:        928041       22272984  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1361851      155555608  [C
   2:       1362261       32694264  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1361683      155493464  [C
   2:       1362092       32690208  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1803074      205157728  [C
   2:       1803268       43278432  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1802385      204938224  [C
   2:       1802579       43261896  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2236631      253636592  [C
   2:       2237029       53688696  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2236536      253603008  [C
   2:       2236933       53686392  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668892      301893920  [C
   2:       2669510       64068240  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668759      301846376  [C
   2:       2669376       64065024  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3101238      350101048  [C
   2:       3102073       74449752  java.lang.String
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3101240      350101104  [C
   2:       3102075       74449800  java.lang.String
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3533785      398371760  [C
   2:       3534835       84836040  java.lang.String
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3533759      398367088  [C
   2:       3534807       84835368  java.lang.String
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3967049      446893400  [C
   2:       3968314       95239536  java.lang.String
   3:       1773801       85142448  scala.reflect.io.FileZipArchive$LeakyEntry
[info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (8 seconds, 248 milliseconds)
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   1:       3966423      446709584  [C
   2:       3967682       95224368  java.lang.String
   3:       1773801       85142448  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       4399583      495097208  [C
   2:       4401050      105625200  java.lang.String
   3:       1970890       94602720  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       4399578      495070064  [C
   2:       4401040      105624960  java.lang.String
   3:       1970890       94602720  scala.reflect.io.FileZipArchive$LeakyEntry
```

The last three items add up to about 700MB altogether.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

I used the trace-agent tool with the same settings for the modified code:

```
$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" |grep "1:\|2:\|3:"
   1:          2701        3496112  [B
   2:         21855        2607192  [C
   3:          4885         537264  java.lang.Class
   1:        480323       55970208  [C
   2:        480499       11531976  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        481825       56148024  [C
   2:        481998       11567952  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487056       57550344  [C
   2:        487179       11692296  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        487054       57551008  [C
   2:        487176       11692224  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927823      107139160  [C
   2:        928072       22273728  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        927793      107129328  [C
   2:        928041       22272984  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1361851      155555608  [C
   2:       1362261       32694264  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1361683      155493464  [C
   2:       1362092       32690208  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1803074      205157728  [C
   2:       1803268       43278432  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1802385      204938224  [C
   2:       1802579       43261896  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2236631      253636592  [C
   2:       2237029       53688696  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2236536      253603008  [C
   2:       2236933       53686392  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668892      301893920  [C
   2:       2669510       64068240  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668759      301846376  [C
   2:       2669376       64065024  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3101238      350101048  [C
   2:       3102073       74449752  java.lang.String
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3101240      350101104  [C
   2:       3102075       74449800  java.lang.String
   3:       1379623       66221904  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3533785      398371760  [C
   2:       3534835       84836040  java.lang.String
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3533759      398367088  [C
   2:       3534807       84835368  java.lang.String
   3:       1576712       75682176  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       3967049      446893400  [C
   2:       3968314       95239536  java.lang.String
   3:       1773801       85142448  scala.reflect.io.FileZipArchive$LeakyEntry
[info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (8 seconds, 248 milliseconds)
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   1:       3966423      446709584  [C
   2:       3967682       95224368  java.lang.String
   3:       1773801       85142448  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       4399583      495097208  [C
   2:       4401050      105625200  java.lang.String
   3:       1970890       94602720  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       4399578      495070064  [C
   2:       4401040      105624960  java.lang.String
   3:       1970890       94602720  scala.reflect.io.FileZipArchive$LeakyEntry
[success] Total time: 174 s (02:54), completed Jun 2, 2021 2:00:43 PM
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/memoryLeak ‹SPARK-35610*›
╰─$ vim
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/memoryLeak ‹SPARK-35610*›
╰─$ ./build/sbt ";project repl;set Test/javaOptions += \"-javaagent:/Users/attilazsoltpiros/git/attilapiros/memoryLeak/trace-agent-0.0.7.jar\"; testOnly" |grep "1:\|2:\|3:"
   1:          2685        3457368  [B
   2:         21833        2606712  [C
   3:          4885         537264  java.lang.Class
   1:        480245       55978400  [C
   2:        480421       11530104  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        480460       56005784  [C
   2:        480633       11535192  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        486643       57537784  [C
   2:        486766       11682384  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        486636       57538192  [C
   2:        486758       11682192  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        501208       60411856  [C
   2:        501180       12028320  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        501206       60412960  [C
   2:        501177       12028248  java.lang.String
   3:        197089        9460272  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        934925      108773320  [C
   2:        935058       22441392  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:        934912      108769528  [C
   2:        935044       22441056  java.lang.String
   3:        394178       18920544  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1370351      156901296  [C
   2:       1370318       32887632  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1369660      156681680  [C
   2:       1369627       32871048  java.lang.String
   3:        591267       28380816  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1803746      205383136  [C
   2:       1803917       43294008  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       1803658      205353096  [C
   2:       1803828       43291872  java.lang.String
   3:        788356       37841088  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2235677      253608240  [C
   2:       2236068       53665632  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2235539      253560088  [C
   2:       2235929       53662296  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2667775      301799240  [C
   2:       2668383       64041192  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2667765      301798568  [C
   2:       2668373       64040952  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2666665      301491096  [C
   2:       2667285       64014840  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2666648      301490792  [C
   2:       2667266       64014384  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668169      301833032  [C
   2:       2668782       64050768  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
[info] - SPARK-26633: ExecutorClassLoader.getResourceAsStream find REPL classes (6 seconds, 396 milliseconds)
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   1:       2235495      253419952  [C
   2:       2235887       53661288  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2668379      301800768  [C
   2:       2668979       64055496  java.lang.String
   3:       1182534       56761632  scala.reflect.io.FileZipArchive$LeakyEntry
   1:       2236123      253522640  [C
   2:       2236514       53676336  java.lang.String
   3:        985445       47301360  scala.reflect.io.FileZipArchive$LeakyEntry
```

The sum of the last three numbers is about 354MB.

Closes #32748 from attilapiros/SPARK-35610.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 806edf8)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)