Changes

Summary

  1. [SPARK-33765][SQL] Migrate UNCACHE TABLE to use UnresolvedRelation to (commit: 62be248) (details)
  2. [SPARK-33786][SQL] The storage level for a cache should be respected (commit: ef7f690) (details)
  3. [MINOR][DOCS] Fix Jenkins job badge image and link in README.md (commit: 12f3715) (details)
  4. [SPARK-33802][INFRA] Override name and email address explicitly when (commit: 888a274) (details)
  5. [SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE (commit: 7845865) (details)
  6. [SPARK-33789][SQL][TESTS] Refactor unified V1 and V2 datasource tests (commit: 9d9d4a8) (details)
  7. [SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials (commit: 205d8e4) (details)
  8. [SPARK-33802][INFRA][FOLLOW-UP] Separate arguments properly for -c (commit: ddda32b) (details)
  9. [SPARK-33800][SQL] Remove command name in AnalysisException message when (commit: 8666d1c) (details)
  10. [SPARK-33810][TESTS] Reenable test cases disabled in SPARK-31732 (commit: 3d03234) (details)
  11. [SPARK-33806][SQL] limit partition num to 1 when distributing by (commit: 728a129) (details)
  12. [SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case (commit: e7e29fd) (details)
  13. [SPARK-33775][BUILD] Suppress sbt compilation warnings in Scala 2.13 (commit: 477046c) (details)
  14. [SPARK-33790][CORE] Reduce the rpc call of getFileStatus in (commit: 0c12900) (details)
  15. [SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE|SERDEPROPERTIES] (commit: 0c19497) (details)
  16. [SPARK-33697][SQL] RemoveRedundantProjects should require column (commit: 1e85707) (details)
  17. [SPARK-33821][BUILD] Upgrade SBT to 1.4.5 (commit: b1950cc) (details)
  18. [SPARK-33819][CORE] SingleFileEventLogFileReader/RollingEventLogFilesFileReader (commit: ed09673) (details)
  19. [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in (commit: 12b69cc) (details)
  20. [SPARK-33774][UI][CORE] "Back to Master" returns 500 error in Standalone (commit: 34e4d87) (details)
  21. [SPARK-22769] Do not log rpc post message error when sparkEnv is already (commit: 8c81cf7) (details)
  22. [SPARK-33173][CORE][TESTS][FOLLOWUP] Use `local[2]` and AtomicInteger (commit: 15616f4) (details)
  23. [SPARK-33822][SQL] Use the `CastSupport.cast` method in HashJoin (commit: 51ef443) (details)
  24. [SPARK-33824][PYTHON][DOCS] Restructure and improve Python package (commit: 6315118) (details)
  25. [SPARK-33797][SS][DOCS] Update SS doc about State Store and task (commit: 42e1831) (details)
  26. [SPARK-33831][UI] Update to jetty 9.4.34 (commit: 131a23d) (details)
  27. [SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query (commit: 0f1a183) (details)
  28. [SPARK-26341][WEBUI] Expose executor memory metrics at the stage level, (commit: 25c6cc2) (details)
  29. [MINOR][INFRA] Add -Pspark-ganglia-lgpl to the build definition with (commit: b0da2bc) (details)
  30. [SPARK-33593][SQL] Vector reader got incorrect data with binary (commit: 0603913) (details)
  31. [SPARK-33840][DOCS] Add spark.sql.files.minPartitionNum to performance (commit: bc46d27) (details)
  32. [SPARK-33798][SQL] Add new rule to push down the foldable expressions (commit: 06b1bbb) (details)
  33. [SPARK-33597][SQL] Support REGEXP_LIKE for consistency with mainstream (commit: f239128) (details)
  34. [SPARK-33599][SQL] Group exception messages in catalyst/analysis (commit: 6dca2e5) (details)
Commit 62be2483d7d78e61fd2f77929cf41c76eff17869 by wenchen
[SPARK-33765][SQL] Migrate UNCACHE TABLE to use UnresolvedRelation to resolve identifier

### What changes were proposed in this pull request?

This PR proposes to migrate `UNCACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in the Analyzer, as discussed in https://github.com/apache/spark/pull/30403/files#r532360022.
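As a rough sketch of the migration pattern (the command class below is a simplified stand-in, not Spark's actual `UncacheTable` node): the parser emits an `UnresolvedRelation` child and leaves identifier resolution to the Analyzer's usual rules (temp views first, then catalog tables).

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Simplified stand-in for the command node; the real UncacheTable differs.
case class UncacheTableSketch(table: LogicalPlan, ifExists: Boolean)

// The parser only wraps the raw identifier; resolution happens later.
val parsed = UncacheTableSketch(UnresolvedRelation(Seq("ns", "tbl")), ifExists = true)
```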

### Why are the changes needed?

To resolve the table/view in the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated existing tests

Closes #30743 from imback82/uncache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 62be248)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCommandsWithIfExists.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit ef7f6903b4fa28c554a1f0b58b9da194979b61ee by wenchen
[SPARK-33786][SQL] The storage level for a cache should be respected when a table name is altered

### What changes were proposed in this pull request?

This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`.

### Why are the changes needed?

Currently, when a table name is altered, the table's cache is refreshed (if it exists), but the storage level is not retained. For example:
```scala
def getStorageLevel(tableName: String): StorageLevel = {
  val table = spark.table(tableName)
  val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get
  cachedData.cachedRepresentation.cacheBuilder.storageLevel
}

Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
val oldStorageLevel = getStorageLevel("old")

sql("ALTER TABLE old RENAME TO new")
val newStorageLevel = getStorageLevel("new")
```
`oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level.
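Conceptually, the fix reuses the old cache's storage level when re-caching under the new name. A minimal sketch using the public `CacheManager` API (order of operations simplified, names taken from the example above):

```scala
import org.apache.spark.storage.StorageLevel

// Capture the old cache entry's storage level before the rename-triggered
// recache, falling back to the default level if no entry exists.
val oldLevel: StorageLevel = spark.sharedState.cacheManager
  .lookupCachedData(spark.table("old"))
  .map(_.cachedRepresentation.cacheBuilder.storageLevel)
  .getOrElse(StorageLevel.MEMORY_AND_DISK)

// Re-cache the renamed table with the retained level instead of the default.
spark.sharedState.cacheManager.cacheQuery(spark.table("new"), Some("new"), oldLevel)
```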

### Does this PR introduce _any_ user-facing change?

Yes, now the storage level for the cache will be retained.

### How was this patch tested?

Added a unit test.

Closes #30774 from imback82/alter_table_rename_cache_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: ef7f690)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
Commit 12f3715ed7e0cd06131272845c3d04f4ad1b441c by dongjoon
[MINOR][DOCS] Fix Jenkins job badge image and link in README.md

### What changes were proposed in this pull request?

This PR proposes to fix the Jenkins job badge:

Before:

![Screen Shot 2020-12-16 at 4 14 14 PM](https://user-images.githubusercontent.com/6477701/102316960-2c9ebe80-3fba-11eb-878d-07ae735fb3a6.png)

After:

![Screen Shot 2020-12-16 at 4 14 09 PM](https://user-images.githubusercontent.com/6477701/102316956-2a3c6480-3fba-11eb-9fa4-b8312edb8a1a.png)

### Why are the changes needed?

To let people easily check the status of builds.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested via using GitHub.

Closes #30797 from HyukjinKwon/minor-readme.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 12f3715)
The file was modified README.md (diff)
Commit 888a274a88560ebe3c43ff9f003c296751d0c207 by gurwls223
[SPARK-33802][INFRA] Override name and email address explicitly when updating PySpark coverage

### What changes were proposed in this pull request?

The current Jenkins job fails as below (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1726/console)

```
Generating HTML files for PySpark coverage under /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2/python/test_coverage/htmlcov
/home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2
Cloning into 'pyspark-coverage-site'...

*** Please tell me who you are.

Run

  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.
```

This PR proposes to set both when committing to the coverage site.

### Why are the changes needed?

To make the coverage site keep working.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in the console but it has to be merged to test in the Jenkins environment.

Closes #30796 from HyukjinKwon/SPARK-33802.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 888a274)
The file was modified dev/run-tests.py (diff)
Commit 7845865b8d5c03a4daf82588be0ff2ebb90152a7 by wenchen
[SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE command

### What changes were proposed in this pull request?

This PR proposes to sort table properties in DESCRIBE TABLE command. This is consistent with DSv2 command as well:
https://github.com/apache/spark/blob/e3058ba17cb4512537953eb4ded884e24ee93ba2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala#L63

This PR also fixes the test case in the Scala 2.13 build, where the table properties have a different order in the map.
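The fix is a one-liner in spirit: render the properties sorted by key so the output does not depend on the underlying map's iteration order. A minimal sketch (sample properties borrowed from the failing test):

```scala
val props = Map(
  "view.query.out.col.2" -> "c",
  "view.referredTempFunctionsNames" -> "[]",
  "view.catalogAndNamespace.part.1" -> "default")

// Sorting by key makes the rendered text deterministic across Scala versions.
val rendered = props.toSeq.sortBy(_._1)
  .map { case (k, v) => s"$k=$v" }
  .mkString("[", ", ", "]")
```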

### Why are the changes needed?

To keep the output deterministic and pretty, and to fix the tests in the Scala 2.13 build.
See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/

```
describe.sql
Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29
DESC FORMATTED v
```

### Does this PR introduce _any_ user-facing change?

Yes, it will change the text output of `DESCRIBE [EXTENDED|FORMATTED] table_name`.
Now the table properties are sorted by key.

### How was this patch tested?

Related unit tests were fixed accordingly.

Closes #30799 from HyukjinKwon/SPARK-33803.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7845865)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/describe.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/create_view.sql.out (diff)
Commit 9d9d4a8e122cf1137edeca857e925f7e76c1ace2 by wenchen
[SPARK-33789][SQL][TESTS] Refactor unified V1 and V2 datasource tests

### What changes were proposed in this pull request?
1. Move common utility functions such as `test()`, `withNsTable()` and `checkPartitions()` to `DDLCommandTestUtils`.
2. Place common settings such as `version`, `catalog`, `defaultUsing`, `sparkConf` to `CommandSuiteBase`.

### Why are the changes needed?
To improve code maintenance of the unified tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30779 from MaxGekk/refactor-unified-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9d9d4a8)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/CommandSuiteBase.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowPartitionsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala (diff)
The file was added sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/CommandSuiteBase.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableAddPartitionSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/CommandSuiteBase.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableAddPartitionSuiteBase.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLCommandTestUtils.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableAddPartitionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/AlterTableDropPartitionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/AlterTableDropPartitionSuiteBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowPartitionsSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowPartitionsSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/ShowTablesSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowTablesSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/command/AlterTableAddPartitionSuite.scala (diff)
Commit 205d8e40bc8446c5953c9a082ffaede3029d1d53 by wenchen
[SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first

### What changes were proposed in this pull request?

As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command here to respect the session initial configs first and then fall back to the `SharedState` conf, so that each session can maintain a different copy of initial configs for resetting.

### Why are the changes needed?

To make the RESET command saner.

### Does this PR introduce _any_ user-facing change?

Yes, RESET will respect the session initial configs first rather than always going to the system defaults.
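A behavior sketch based on the description above (config key and values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.shuffle.partitions", "7") // session initial config
  .getOrCreate()

spark.sql("SET spark.sql.shuffle.partitions=100")
spark.sql("RESET")
// After this change, RESET restores the session initial value (7),
// not the system default (200).
assert(spark.conf.get("spark.sql.shuffle.partitions") == "7")
```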

### How was this patch tested?

Added new tests.

Closes #30642 from yaooqinn/SPARK-32991-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 205d8e4)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala (diff)
Commit ddda32b156e4c2e2ba1d1ed37cf34fb2f26d769e by gurwls223
[SPARK-33802][INFRA][FOLLOW-UP] Separate arguments properly for -c option in git command for PySpark coverage

### What changes were proposed in this pull request?

This PR proposes to separate arguments properly for the `-c` option. Otherwise, the space is considered part of the argument:

```
Cloning into 'pyspark-coverage-site'...
unknown option: -c user.name='Apache Spark Test Account'
usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
[error] running git -c user.name='Apache Spark Test Account' -c user.email='sparktestacc@gmail.com' commit -am Coverage report at latest commit in Apache Spark ; received return code 129
```

### Why are the changes needed?

To make the build pass (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-3.2/1728/console).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

```python
>>> from sparktestsupport.shellutils import run_cmd
>>> run_cmd([
...             "git",
...             "-c",
...             "user.name='Apache Spark Test Account'",
...             "-c",
...             "user.email='sparktestaccgmail.com'",
...             "commit",
...             "-am",
...             "Coverage report at latest commit in Apache Spark"])
[SPARK-33802-followup 80d2565a511] Coverage report at latest commit in Apache Spark
1 file changed, 1 insertion(+), 1 deletion(-)
CompletedProcess(args=['git', '-c', "user.name='Apache Spark Test Account'", '-c', "user.email='sparktestacc@gmail.com'", 'commit', '-am', 'Coverage report at latest commit in Apache Spark'], returncode=0)
```

I cannot run an e2e test because it requires the environment to have the Jenkins secret.

Closes #30804 from HyukjinKwon/SPARK-33802-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: ddda32b)
The file was modified dev/run-tests.py (diff)
Commit 8666d1c39cb6d49e4aa3cd0b9342b82405541aed by wenchen
[SPARK-33800][SQL] Remove command name in AnalysisException message when a relation is not resolved

### What changes were proposed in this pull request?

Based on the discussion https://github.com/apache/spark/pull/30743#discussion_r543124594, this PR proposes to remove the command name in AnalysisException message when a relation is not resolved.

For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands.

### Why are the changes needed?

To make the exception message consistent.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed from
```
Table or view not found for 'SHOW TBLPROPERTIES': badtable
```
to
```
Table or view not found: badtable
```
for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier.

### How was this patch tested?

Updated existing tests.

Closes #30794 from imback82/remove_cmd_from_exception_msg.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8666d1c)
The file was modified sql/core/src/test/resources/sql-tests/results/show_columns.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
Commit 3d0323401f7a3e4369a3d3f4ff98f15d19e8a643 by dongjoon
[SPARK-33810][TESTS] Reenable test cases disabled in SPARK-31732

### What changes were proposed in this pull request?

The test failures were due to machines being slow in Jenkins. We switched to Ubuntu 20, if I am not wrong.
Unlike in the past, all machines now look to be functioning properly, and the tests pass without a problem.

This PR proposes to enable them back.

### Why are the changes needed?

To restore test coverage.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins jobs in this PR show the flakiness.

Closes #30798 from HyukjinKwon/do-not-merge-test.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 3d03234)
The file was modified external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
The file was modified streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala (diff)
Commit 728a1298afa78c6acd7cdc4c21ee441120c34716 by dongjoon
[SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions

### What changes were proposed in this pull request?

Using the DISTRIBUTE BY clause with a literal to coalesce partitions seems to be a very popular pattern in pure SQL data processing.

For example
```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```

Users may want the final output to be one single data file, but the reality is not always so. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to one partition (idx > 0), there will always be 2 files there and part-00000 is empty. On the other hand, a lot of empty tasks will be launched too, which is unnecessary.

When users repeat the insert statement daily, hourly, or minutely, it causes small file issues.

```
spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;

kentyaohulk  ~/spark   SPARK-33806  tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [          0]  _SUCCESS
├── [        298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [        426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
```

To avoid this, there are some options you can take.

1. use `distribute by null`, let the data go to the partition 0
2. set spark.sql.adaptive.enabled to true for Spark to automatically coalesce
3. using hints instead of `distribute by`
4. set spark.sql.shuffle.partitions to 1

In this PR, we set the partition number to 1 in this particular case.
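A small sketch of the expected behavior after this change (illustrative, assuming an active `spark` session):

```scala
spark.sql("SET spark.sql.shuffle.partitions=3")
val df = spark.sql("SELECT * FROM VALUES (1), (2), (3) t(a) DISTRIBUTE BY 1")
// With a foldable DISTRIBUTE BY expression, the shuffle now targets a
// single partition, so only one output file is written on insert.
assert(df.rdd.getNumPartitions == 1)
```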

### Why are the changes needed?

1. avoid small file issues
2. avoid unnecessary empty tasks when no adaptive execution

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test.

Closes #30800 from yaooqinn/SPARK-33806.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 728a129)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
Commit e7e29fd0affe81a24959ecc0286ec4c85f319722 by dongjoon
[SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case class

### What changes were proposed in this pull request?

This PR removes unused `TruncateTableStatement`: https://github.com/apache/spark/pull/30457#discussion_r544433820

### Why are the changes needed?

To remove unused `TruncateTableStatement` from #30457.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #30811 from imback82/remove_truncate_table_stmt.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: e7e29fd)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
Commit 477046c63fab281570d26a183be4b0b8b77ac41a by dongjoon
[SPARK-33775][BUILD] Suppress sbt compilation warnings in Scala 2.13

### What changes were proposed in this pull request?
There are too many compilation warnings in Scala 2.13. This PR adds some `-Wconf:msg=<regex>` rules to `SparkBuild.scala` to suppress compilation warnings so that the suppressed ones are not printed to the console (a configuration sketch follows the lists below).

The suppressed compilation warnings includes:

- All warnings related to `method\value\type\object\trait\inheritance` deprecated since 2.13

- All warnings related to `Widening conversion from XXX to YYY is deprecated because it loses precision`

- Auto-application to `()` is deprecated. Supply the empty argument list `()` explicitly to invoke method `methodName`, or remove the empty argument list from its definition (Java-defined methods are exempt). In Scala 3, an unapplied method like this will be eta-expanded into a function.

- method with a single empty parameter list overrides method without any parameter list

- method without a parameter list overrides a method with a single empty one

The compilation warnings not suppressed include:

- Unicode escapes in triple quoted strings are deprecated, use the literal character instead.

- view bounds are deprecated

- symbol literal is deprecated
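For reference, a configuration sketch of the `-Wconf` mechanism in an sbt build definition; the exact regexes used in `SparkBuild.scala` may differ:

```scala
import sbt._
import sbt.Keys._

object WarningSuppression {
  // The trailing ":s" silences any warning whose message matches the regex.
  lazy val settings: Seq[Setting[_]] = Seq(
    scalacOptions ++= Seq(
      "-Wconf:msg=Widening conversion from .* is deprecated:s",
      "-Wconf:msg=Auto-application to \\(\\) is deprecated:s",
      "-Wconf:msg=method with a single empty parameter list overrides:s"
    )
  )
}
```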

### Why are the changes needed?
Suppress unimportant compilation warnings in Scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30760 from LuciferYang/SPARK-33775.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 477046c)
The file was modified project/SparkBuild.scala (diff)
Commit 0c129001201ccb63ae96f576b6f354da84024fb3 by kabhwan.opensource
[SPARK-33790][CORE] Reduce the rpc call of getFileStatus in SingleFileEventLogFileReader

### What changes were proposed in this pull request?
`FsHistoryProvider#checkForLogs` already has the `FileStatus` when constructing `SingleFileEventLogFileReader`, so there is no need to get the `FileStatus` again when calling `SingleFileEventLogFileReader#fileSizeForLastIndex`.
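A minimal sketch of the idea, with a simplified constructor rather than the actual Spark signature: accept an optional pre-fetched `FileStatus` so the RPC is issued only when the caller did not already have one.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

class SingleFileReaderSketch(fs: FileSystem, path: Path, maybeStatus: Option[FileStatus]) {
  // getFileStatus (an RPC to the NameNode) only runs when no status was passed in.
  private lazy val status: FileStatus = maybeStatus.getOrElse(fs.getFileStatus(path))
  def fileSizeForLastIndex: Long = status.getLen
}
```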

### Why are the changes needed?
This can reduce a lot of RPC calls and improve the speed of the history server.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #30780 from cxzl25/SPARK-33790.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 0c12900)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala (diff)
The file was modified project/MimaExcludes.scala (diff)
Commit 0c19497222c26818ecdde527601c12c757acb4ad by wenchen
[SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE|SERDEPROPERTIES] to use UnresolvedTable to resolve the identifier

### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` is not supported for v2 tables.

### Why are the changes needed?

This PR makes the resolution behavior consistent. For example, before this PR:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t SET SERDE 'serdename'") // works fine
```
But after this PR:
```
sql("ALTER TABLE t SET SERDE 'serdename'")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table; line 1 pos 0
```
This is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `t` in the above example is resolved to a temp view first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30813 from imback82/alter_table_serde_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0c19497)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit 1e85707738a830d33598ca267a6740b3f06b1861 by wenchen
[SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default

### What changes were proposed in this pull request?
This PR changes the rule `RemoveRedundantProjects`: instead of passing column-ordering requirements from parent nodes down by default, it now always requires column ordering unless otherwise specified. More specifically, instead of excluding a few nodes such as GenerateExec and UnionExec that are known to require their children's columns to be ordered, the rule now includes a whitelist of nodes that are allowed to pass through the ordering requirements from their parents.
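Conceptually (a sketch, not the rule's actual code), the default flips from "pass the requirement through unless excluded" to "require ordering unless explicitly allowed to pass through":

```scala
import org.apache.spark.sql.execution.{FilterExec, SortExec, SparkPlan}

// Hypothetical allow-list: only nodes known to be safe forward the parent's
// ordering requirement; every other node requires column ordering.
def requireOrdering(parentRequiresOrdering: Boolean, node: SparkPlan): Boolean =
  node match {
    case _: SortExec | _: FilterExec => parentRequiresOrdering
    case _ => true
  }
```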

### Why are the changes needed?
Currently, this rule passes ordering requirements from parents directly to children, except for a few excluded nodes. This incorrectly removes necessary project nodes below a UnionExec, since UnionExec is not excluded. An earlier PR also fixed a similar issue for GenerateExec (SPARK-32861). In order to prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass through the requirements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #30659 from allisonwang-db/spark-33697-remove-project-union.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1e85707)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala (diff)
Commit b1950cc9162999c2200a0a988fa28aee640fb459 by gurwls223
[SPARK-33821][BUILD] Upgrade SBT to 1.4.5

### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.4.5 to support Apple Silicon.

### Why are the changes needed?

The following is the release note including `sbt 1.4.5 adds support for Apple silicon (AArch64 also called ARM64)`.
- https://github.com/sbt/sbt/releases/tag/v1.4.5

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #30817 from dongjoon-hyun/SPARK-33821.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b1950cc)
The file was modified project/build.properties (diff)
Commit ed09673fb941830c15e5e5ad748be9de4755935c by gurwls223
[SPARK-33819][CORE] SingleFileEventLogFileReader/RollingEventLogFilesFileReader should be `package private`

### What changes were proposed in this pull request?

This PR aims to convert `EventLogFileReader`'s derived classes into `package private`.
- SingleFileEventLogFileReader
- RollingEventLogFilesFileReader

`EventLogFileReader` itself is used in `scheduler` module during tests.
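A minimal sketch of the visibility change (class names suffixed and constructors omitted to keep it self-contained):

```scala
package org.apache.spark.deploy.history

abstract class EventLogFileReaderSketch // stays public: used by scheduler tests

// Scoped to the history package so they no longer leak into the public API.
private[history] class SingleFileReaderSketch extends EventLogFileReaderSketch
private[history] class RollingFilesReaderSketch extends EventLogFileReaderSketch
```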

### Why are the changes needed?

These classes were designed to be internal. This PR hides them explicitly to reduce the maintenance burden.

### Does this PR introduce _any_ user-facing change?

Yes, but these were exposed accidentally.

### How was this patch tested?

Pass CIs.

Closes #30814 from dongjoon-hyun/SPARK-33790.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: ed09673)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala (diff)
The file was modified project/MimaExcludes.scala (diff)
Commit 12b69cc27caa476a9a29844f8d096f08263ba6ef by gurwls223
[SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate

### What changes were proposed in this pull request?

Change the strategy for how the varargs are handled in the default `mutate` method

### Why are the changes needed?

Bugfix -- `deparse` + `sapply` not working as intended due to `width.cutoff`

### Does this PR introduce any user-facing change?

Yes, bugfix. Shouldn't change any working code.

### How was this patch tested?

None! yet.

Closes #28386 from MichaelChirico/r-mutate-deparse.

Lead-authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 12b69cc)
The file was modified R/pkg/R/DataFrame.R (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL.R (diff)
Commit 34e4d87023535c086a0aa43fe194f794b41e09b7 by srowen
[SPARK-33774][UI][CORE] "Back to Master" returns 500 error in Standalone cluster

### What changes were proposed in this pull request?

Initialize `masterWebUiUrl` with `webUi.webUrl` instead of the `masterPublicAddress`.

### Why are the changes needed?

Since [SPARK-21642](https://issues.apache.org/jira/browse/SPARK-21642), `WebUI` has changed from `localHostName` to `localCanonicalHostName` as the hostname to set up the web UI. However, the `masterPublicAddress` is from `RpcEnv`'s host address, which still uses `localHostName`. As a result, it returns the wrong Master web URL to the Worker.

### Does this PR introduce _any_ user-facing change?

Yes, when users click "Back to Master" in the Worker page:

Before this PR:

<img width="3258" alt="WeChat4acbfd163f51c76a5f9bc388c7479785" src="https://user-images.githubusercontent.com/16397174/102057951-b9664280-3e29-11eb-8749-5ee293902bdf.png">

After this PR:

![image](https://user-images.githubusercontent.com/16397174/102058016-d438b700-3e29-11eb-8641-a23a6b2f542e.png)

(Return to the Master page successfully.)

### How was this patch tested?

Tested manually.

Closes #30759 from Ngone51/fix-back-to-master.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 34e4d87)
The file was modified core/src/main/scala/org/apache/spark/deploy/master/Master.scala (diff)
Commit 8c81cf7d71baf34dfafe54835a90cc19e7293561 by srowen
[SPARK-22769] Do not log rpc post message error when sparkEnv is already stopped

### What changes were proposed in this pull request?

When the driver is stopping, pending RPC requests will cause errors like:

> 17/12/12 18:30:16 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)

Or like:

> 17/12/12 18:20:44 INFO MemoryStore: MemoryStore cleared
17/12/12 18:20:44 INFO BlockManager: BlockManager stopped
17/12/12 18:20:44 INFO BlockManagerMaster: BlockManagerMaster stopped
17/12/12 18:20:44 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:152)
at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)

These occur because CoarseGrainedScheduler and rpcEnv are already stopped; they're not errors.

The related issue SPARK-22769 was opened in 2017, but the author didn't finish the pull request, so this issue was reopened.
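A self-contained sketch of the intent (names simplified; not Spark's actual `Dispatcher` code): treat post-stop RPC failures as expected shutdown noise rather than errors.

```scala
class RpcEnvStoppedException extends RuntimeException("RpcEnv already stopped.")

def postMessage(deliver: () => Unit): Unit =
  try deliver()
  catch {
    // Expected while shutting down: log quietly instead of at ERROR level.
    case e: RpcEnvStoppedException => println(s"DEBUG: ${e.getMessage}")
    case e: Exception => Console.err.println(s"ERROR: ${e.getMessage}")
  }
```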

### How was this patch tested?
Existing tests

Closes #30658 from sqlwindspeaker/donot-log-rpc-error.

Authored-by: suqilong <suqilong@qiyi.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 8c81cf7)
The file was modified core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala (diff)
Commit 15616f499aca93c98a71732add2a80de863d3d5f by dongjoon
[SPARK-33173][CORE][TESTS][FOLLOWUP] Use `local[2]` and AtomicInteger

### What changes were proposed in this pull request?

Use `local[2]` to let tasks launch at the same time, and change the counters (`numOnTaskXXX`) to the `AtomicInteger` type to ensure thread safety.

### Why are the changes needed?

The test is still flaky after the fix https://github.com/apache/spark/pull/30072. See: https://github.com/apache/spark/pull/30728/checks?check_run_id=1557987642

And it's easy to reproduce if you test it multiple times (e.g. 100) locally.

The test sets up a stage with 2 tasks to run on an executor with 1 core. So these 2 tasks have to be launched one by one.
Task-2 will be launched after task-1 fails. However, since we don't retry failed tasks in local mode (MAX_LOCAL_TASK_FAILURES = 1), the stage aborts right away after task-1 fails and cancels the running task-2 at the same time. There's a chance that task-2 gets canceled before calling `PluginContainer.onTaskStart`, which leads to the test failure.
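The counter change is the standard pattern for callbacks that may fire from multiple task threads; a minimal sketch mirroring the test's `numOnTaskXXX` counters (names are illustrative):

```scala
import java.util.concurrent.atomic.AtomicInteger

// A plain `var count: Int` could lose updates under concurrent callbacks;
// AtomicInteger makes the increment atomic.
val numOnTaskStart = new AtomicInteger(0)
def onTaskStart(): Unit = numOnTaskStart.incrementAndGet()
```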

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested manually after the fix and the test is no longer flaky.

Closes #30823 from Ngone51/debug-flaky-spark-33088.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 15616f4)
The file was modified core/src/test/scala/org/apache/spark/internal/plugin/PluginContainerSuite.scala (diff)
Commit 51ef4430dcbc934d43315ee6bdc851c9be84a1f2 by dongjoon
[SPARK-33822][SQL] Use the `CastSupport.cast` method in HashJoin

### What changes were proposed in this pull request?

This PR intends to fix the bug that throws an `UnsupportedOperationException` when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](https://github.com/apache/spark/commit/031c5ef280e0cba8c4718a6457a44b6cccb17f46)):
```
java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  ...
```

I've checked the AQE code and found that `EnsureRequirements` wrongly puts `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` phase as follows:
```
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183]
  +- BroadcastQueryStage 2
    +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963]
```
A root cause is that a `Cast` class in a required child's distribution does not have a `timeZoneId` field (`timeZoneId=None`), and a `Cast` class in `child.outputPartitioning` has it. So, this difference can make the distribution requirement check fail in `EnsureRequirements`:
https://github.com/apache/spark/blob/1e85707738a830d33598ca267a6740b3f06b1861/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L47-L50

The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
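A sketch of the difference (simplified; `HashJoin`'s actual rewrite site differs): constructing `Cast` directly leaves `timeZoneId` empty, while going through the session conf attaches it, so two otherwise-identical casts compare equal.

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.LongType

// Cast(e, LongType) would carry timeZoneId = None and fail to match an
// otherwise-identical Cast that has the session time zone attached.
def castWithSessionTz(e: Expression, conf: SQLConf): Expression =
  Cast(e, LongType, Option(conf.sessionLocalTimeZone))
```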

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked that q5 passed.

Closes #30818 from maropu/BugfixInAQE.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 51ef443)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
Commit 6315118676c99ccef2566c50ab9873de8876e468 by gurwls223
[SPARK-33824][PYTHON][DOCS] Restructure and improve Python package management page

### What changes were proposed in this pull request?

This PR proposes to restructure and refine the Python dependency management page.
I recently wrote a blog post, which will be published soon, and decided to contribute some of the contents back to the PySpark documentation.
FWIW, it has been reviewed by some tech writers and engineers.

I built the site for making the review easier: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It's a doc change, but only in unreleased branches for now.

### How was this patch tested?

I manually built the docs as:

```bash
cd python/docs
make clean html
open
```

Closes #30822 from HyukjinKwon/SPARK-33824.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 6315118)
The file was modified python/docs/source/user_guide/python_packaging.rst (diff)
Commit 42e1831ebb19be15921a2ac612dfdac47639edeb by kabhwan.opensource
[SPARK-33797][SS][DOCS] Update SS doc about State Store and task locality

### What changes were proposed in this pull request?

This updates SS documentation to document about State Store and task locality.

### Why are the changes needed?

While running some tests for Structured Streaming, I found that state store locality sometimes becomes an issue, and it is not very straightforward for end users. It'd be great if we can document it.

### Does this PR introduce _any_ user-facing change?

No, only doc change.

### How was this patch tested?

No, only doc change.

Closes #30789 from viirya/ss-statestore-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 42e1831)
The file was modified docs/structured-streaming-programming-guide.md (diff)
Commit 131a23d88a56280d47584aed93bc8fb617550717 by dongjoon
[SPARK-33831][UI] Update to jetty 9.4.34

### What changes were proposed in this pull request?

Update Jetty to 9.4.34

### Why are the changes needed?

Picks up fixes and improvements, including a possible CVE fix.

https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.33.v20201020
https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.34.v20201102

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #30828 from srowen/SPARK-33831.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 131a23d)
The file was modified pom.xml (diff)
Commit 0f1a18370a1a95a2b7943519584af7a0dff42ae8 by wenchen
[SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query to avoid creating a dataframe

### What changes were proposed in this pull request?

This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190

For reference, `UNCACHE TABLE` also uses `LogicalPlan`: https://github.com/apache/spark/blob/0c129001201ccb63ae96f576b6f354da84024fb3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala#L91-L98

### Why are the changes needed?

To avoid creating an unnecessary DataFrame and make it consistent with `uncacheQuery` used in `UNCACHE TABLE`.

### Does this PR introduce _any_ user-facing change?

No, just internal changes.

### How was this patch tested?

Existing tests since this is an internal refactoring change.

Closes #30815 from imback82/cache_with_logical_plan.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0f1a183)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 25c6cc25f74e8a24aa424f6596a574f26ae80e1d by sarutak
[SPARK-26341][WEBUI] Expose executor memory metrics at the stage level, in the Stages tab

### What changes were proposed in this pull request?
Expose executor memory metrics at the stage level, in the Stages tab.
Currently it looks like below, and I am not sure which columns we will truly need.
![image](https://user-images.githubusercontent.com/46485123/101170248-2256f900-3679-11eb-8c34-794fcf8e94a8.png)

![image](https://user-images.githubusercontent.com/46485123/101170359-4dd9e380-3679-11eb-984b-b0430f236160.png)

![image](https://user-images.githubusercontent.com/46485123/101314915-86a1d480-3894-11eb-9b6f-8050d326e11f.png)

### Why are the changes needed?
Users can see executor JVM memory usage more directly in the Spark UI.

### Does this PR introduce any user-facing change?
Users can see executor JVM memory usage more directly in the Spark UI.

### How was this patch tested?
Manually tested.

Closes #30573 from AngersZhuuuu/SPARK-26341.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 25c6cc2)
The file was modified core/src/test/resources/HistoryServerExpectations/one_stage_attempt_json_expectation.json (diff)
The file was modified core/src/main/resources/org/apache/spark/ui/static/stagepage.js (diff)
The file was modified core/src/main/resources/org/apache/spark/ui/static/stagespage-template.html (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/excludeOnFailure_for_stage_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/one_stage_json_expectation.json (diff)
The file was modified core/src/main/scala/org/apache/spark/status/AppStatusListener.scala (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/stage_with_accumulable_json_expectation.json (diff)
The file was modified core/src/test/resources/HistoryServerExpectations/excludeOnFailure_node_for_stage_expectation.json (diff)
Commit b0da2bcd464b24d58e2ce56d4f93f1f9527839ff by gurwls223
[MINOR][INFRA] Add -Pspark-ganglia-lgpl to the build definition with Scala 2.13 on GitHub Actions

### What changes were proposed in this pull request?

This PR adds `-Pspark-ganglia-lgpl` to the build definition with Scala 2.13 on GitHub Actions.

### Why are the changes needed?

Keep the code buildable with Scala 2.13.
With this change, all the sub-modules seem to be buildable with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the Scala 2.13 build passes with the following commands.
```
$ ./dev/change-scala-version.sh 2.13
$ build/sbt -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile
```

Closes #30834 from sarutak/ganglia-scala-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b0da2bc)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 0603913c666bae1a9640f2f1469fe50bc59e461d by dongjoon
[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value

### What changes were proposed in this pull request?

Currently, when the Parquet vectorized reader is enabled, using the binary type as a partition column returns incorrect values, as shown in the UT below:
```scala
test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
```

### Why are the changes needed?
Fix a data correctness issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30824 from AngersZhuuuu/SPARK-33593.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 0603913)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala (diff)
Commit bc46d273e0ae0d13d0e31e30e39198ac19dcd27b by gurwls223
[SPARK-33840][DOCS] Add spark.sql.files.minPartitionNum to performance tuning doc

### What changes were proposed in this pull request?

Add `spark.sql.files.minPartitionNum` and its description to sql-performance-tuning.md.

### Why are the changes needed?

Help users find it.

### Does this PR introduce _any_ user-facing change?

Yes, it's the doc.

### How was this patch tested?

Pass CI.

Closes #30838 from ulysses-you/SPARK-33840.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: bc46d27)
The file was modified docs/sql-performance-tuning.md (diff)
Commit 06b1bbbbab8cab2ce77d255a3287a2aacdb2df78 by wenchen
[SPARK-33798][SQL] Add new rule to push down the foldable expressions through CaseWhen/If

### What changes were proposed in this pull request?

This PR adds a new rule (`PushFoldableIntoBranches`) to push down foldable expressions through `CaseWhen`/`If`. This is a real case from production:
```sql
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1
union all
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2

explain select * from v1 where event_type = 'a';
```

Before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet
```

After this PR:
```
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet
```
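The shape of the rewrite, sketched for the `EqualTo`-over-`CaseWhen` case (illustrative, not the rule's actual code): the comparison with the foldable operand is pushed into every branch, after which constant folding collapses the predicate.

```scala
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, EqualTo, Expression, Literal}

// (CASE WHEN c1 THEN v1 WHEN c2 THEN v2 END) = lit
//   becomes
// CASE WHEN c1 THEN v1 = lit WHEN c2 THEN v2 = lit END
def pushEqualToIntoBranches(cw: CaseWhen, lit: Literal): Expression =
  CaseWhen(
    cw.branches.map { case (cond, value) => (cond, EqualTo(value, lit)) },
    cw.elseValue.map(v => EqualTo(v, lit)))
```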

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30790 from wangyum/SPARK-33798.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 06b1bbb)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PushFoldableIntoBranchesSuite.scala
Commit f23912880269723f02eadc2af4b2816c957c2357 by wenchen
[SPARK-33597][SQL] Support REGEXP_LIKE for consistency with mainstream databases

### What changes were proposed in this pull request?
Many mainstream databases support the regex function `REGEXP_LIKE`.
Currently, Spark supports `RLike`, and we just need to add a new alias `REGEXP_LIKE` for it (a usage sketch follows the notes below).
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19
**Presto**
https://prestodb.io/docs/current/functions/regexp.html
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_____5
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html

**Additional modifications**

1. Because the test case named `check outputs of expression examples` in ExpressionInfoSuite executes the example SQL of built-in functions, the SQL below would be executed:
`SELECT '%SystemDrive%\Users\John' regexp_like '%SystemDrive%\\Users.*'`
But Spark SQL does not support this syntax yet.
2. Another reason: `SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';` is SQL syntax, not a use case for the function `RLike`.
For the above reasons, this PR changes the example SQL of `RLike`.
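A minimal usage sketch once the alias is in place:

```scala
val spark = org.apache.spark.sql.SparkSession.builder().master("local[1]").getOrCreate()

// REGEXP_LIKE behaves as an alias of RLIKE: true if the regex matches the input.
spark.sql("SELECT REGEXP_LIKE('Spark SQL', 'S.+L') AS matched").show()
// +-------+
// |matched|
// +-------+
// |   true|
// +-------+
```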

### Why are the changes needed?
Make the behavior of Spark SQL consistent with mainstream databases.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test

Closes #30543 from beliefer/SPARK-33597.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f239128)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out (diff)
Commit 6dca2e5d35c0b1604d0264250872b87bd0b832f6 by wenchen
[SPARK-33599][SQL] Group exception messages in catalyst/analysis

### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
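The grouping pattern, sketched with an illustrative method name (the object lives in the `org.apache.spark.sql` package so it can construct `AnalysisException`s): call sites throw via a central error object instead of assembling messages inline.

```scala
package org.apache.spark.sql

object QueryCompilationErrorsSketch {
  // Centralizing the message keeps wording consistent and easy to maintain.
  def unresolvedRelationError(name: String): Throwable =
    new AnalysisException(s"Table or view not found: $name")
}
```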

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests; all original tests pass to make sure it doesn't break any existing behavior.

Closes #30717 from beliefer/SPARK-33599.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 6dca2e5)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/QueryExecutionErrors.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/QueryCompilationErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)