Changes

Summary

  1. [SPARK-33517][SQL][DOCS] Fix the correct menu items and page links in (commit: 3d54774) (details)
  2. [SPARK-33589][SQL] Close opened session if the initialization fails (commit: f93d439) (details)
  3. [SPARK-33582][SQL] Hive Metastore support filter by not-equals (commit: a5e13ac) (details)
  4. [SPARK-33567][SQL] DSv2: Use callback instead of passing Spark session (commit: feda729) (details)
  5. [MINOR] Spelling bin core docs external mllib repl (commit: 4851453) (details)
  6. [SPARK-32976][SQL] Support column list in INSERT statement (commit: 2da7259) (details)
  7. [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables (commit: 0fd9f57) (details)
  8. [SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using (commit: 225c2e2) (details)
  9. [SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream (commit: b665d58) (details)
  10. [SPARK-33480][SQL] Support char/varchar type (commit: 5cfbddd) (details)
  11. [SPARK-33579][UI] Fix executor blank page behind proxy (commit: 6e5446e) (details)
  12. [SPARK-33452][SQL] Support v2 SHOW PARTITIONS (commit: 0a612b6) (details)
  13. [SPARK-33569][SQL] Remove getting partitions by an identifier prefix (commit: 6fd148f) (details)
  14. [SPARK-33569][SPARK-33452][SQL][FOLLOWUP] Fix a build error in (commit: 030b313) (details)
  15. [SPARK-33185][YARN][FOLLOW-ON] Leverage RM's RPC API instead of REST to (commit: f3c2583) (details)
  16. [SPARK-33545][CORE] Support Fallback Storage during Worker decommission (commit: c699435) (details)
  17. [SPARK-33440][CORE] Use current timestamp with warning log in (commit: f5d2165) (details)
  18. [SPARK-33556][ML] Add array_to_vector function for dataframe column (commit: 596fbc1) (details)
  19. [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests (commit: aeb3649) (details)
  20. [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may (commit: 8016123) (details)
  21. [SPARK-33607][SS][WEBUI] Input Rate timeline/histogram aren't rendered (commit: c50fcac) (details)
  22. [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log (commit: 2af2da5) (details)
  23. [SPARK-33530][CORE] Support --archives and spark.archives option (commit: 1a042cc) (details)
  24. [SPARK-27188][SS] FileStreamSink: provide a new option to have retention (commit: 52e5cc4) (details)
  25. [SPARK-33572][SQL] Datetime building should fail if the year, month, (commit: 1034815) (details)
  26. [SPARK-32032][SS] Avoid infinite wait in driver because of (commit: e5bb293) (details)
  27. [SPARK-32405][SQL][FOLLOWUP] Throw Exception if provider is specified in (commit: d38883c) (details)
  28. [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix (commit: 9273d42) (details)
  29. [SPARK-33503][SQL] Refactor SortOrder class to allow multiple childrens (commit: cf4ad21) (details)
  30. [SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in (commit: 478fb7f) (details)
  31. [SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer (commit: c24f2b2) (details)
  32. [SPARK-33611][UI] Avoid encoding twice on the query parameter of (commit: 5d0045e) (details)
  33. [SPARK-33622][R][ML] Add array_to_vector to SparkR (commit: 5a1c5ac) (details)
  34. [SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size (commit: f71f345) (details)
  35. [SPARK-32863][SS] Full outer stream-stream join (commit: 51ebcd9) (details)
  36. [MINOR][SS] Rename auxiliary protected methods in StreamingJoinSuite (commit: a4788ee) (details)
  37. [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to (commit: 290aa02) (details)
  38. [SPARK-33557][CORE][MESOS][TEST] Ensure the relationship between (commit: 084d38b) (details)
  39. [SPARK-33504][CORE] The application log in the Spark history server (commit: 28dad1b) (details)
  40. [SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify (commit: df8d3f1) (details)
  41. [SPARK-33619][SQL] Fix GetMapValueUtil code generation error (commit: 58583f7) (details)
  42. [SPARK-33626][K8S][TEST] Allow k8s integration tests to assert both (commit: 91182d6) (details)
  43. [SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan (commit: a082f46) (details)
  44. [SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and (commit: b76c6b7) (details)
  45. [SPARK-33631][DOCS][TEST] Clean up (commit: 92bfbcb) (details)
Commit 3d54774fb9cbf674580851aa2323991c7e462a1e by gurwls223
[SPARK-33517][SQL][DOCS] Fix the correct menu items and page links in PySpark Usage Guide for Pandas with Apache Arrow

### What changes were proposed in this pull request?

Change "Apache Arrow in Spark" to "Apache Arrow in PySpark"
and the link to “/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-pyspark”

### Why are the changes needed?
When I click on the menu item it doesn't point to the correct page, and from the parent menu I can infer that the correct menu item name and link should be "Apache Arrow in PySpark".
like this:
image
![image](https://user-images.githubusercontent.com/28332082/99954725-2b64e200-2dbe-11eb-9576-cf6a3d758980.png)

### Does this PR introduce any user-facing change?
Yes, clicking on the menu item will take you to the correct guide page

### How was this patch tested?
Manually build the doc. This can be verified as below:

cd docs
SKIP_API=1 jekyll build
open _site/sql-pyspark-pandas-with-arrow.html

Closes #30466 from liucht-inspur/master.

Authored-by: liucht <liucht@inspur.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 3d54774)
The file was modifieddocs/_data/menu-sql.yaml (diff)
Commit f93d4395b25ea546cebb1ff16879dea696a217b5 by gurwls223
[SPARK-33589][SQL] Close opened session if the initialization fails

### What changes were proposed in this pull request?

This pr add try catch when opening session.

### Why are the changes needed?

Close opened session if the initialization fails.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Before this pr:

```
[rootspark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Connecting to jdbc:hive2://localhost:10000/db_not_exist
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Database 'db_not_exist' not found; (state=08S01,code=0)
Beeline version 2.3.7 by Apache Hive
beeline>
```
![image](https://user-images.githubusercontent.com/5399861/100560975-73ba5d80-32f2-11eb-8f92-b2509e7a121f.png)

After this pr:
```
[rootspark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connecting to jdbc:hive2://localhost:10000/db_not_exist
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Failed to open new session: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db_not_exist' not found; (state=08S01,code=0)
Beeline version 2.3.7 by Apache Hive
beeline>
```
![image](https://user-images.githubusercontent.com/5399861/100560917-479edc80-32f2-11eb-986f-7a997f1163fc.png)

Closes #30536 from wangyum/SPARK-33589.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: f93d439)
The file was modifiedsql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala (diff)
Commit a5e13acd19871831a93a5bdcbc99a9eb9f1aba07 by gurwls223
[SPARK-33582][SQL] Hive Metastore support filter by not-equals

### What changes were proposed in this pull request?

This pr make partition predicate pushdown into Hive metastore support not-equals operator.

Hive related changes:
https://github.com/apache/hive/blob/b8bd4594bef718b1eeac9fceb437d7df7b480ed1/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L2194-L2207
https://issues.apache.org/jira/browse/HIVE-2702

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30534 from wangyum/SPARK-33582.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a5e13ac)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala (diff)
Commit feda7299e3d8ebe665b8fae0328f22a4927c66da by wenchen
[SPARK-33567][SQL] DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache

### What changes were proposed in this pull request?

This replaces Spark session and `DataSourceV2Relation` in V2 write plans by replacing them with a callback `afterWrite`.

### Why are the changes needed?

Per discussion in #30429, it's better to not pass Spark session and `DataSourceV2Relation` through Spark plans. Instead we can use a callback which makes the interface cleaner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30491 from sunchao/SPARK-33492-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: feda729)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V1FallbackWriters.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropTableExec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RefreshTableExec.scala (diff)
Commit 485145326a9c97ede260b0e267ee116f182cfd56 by yamamuro
[MINOR] Spelling bin core docs external mllib repl

### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at https://github.com/jsoref/spark/commit/706a726f87a0bbf5e31467fae9015218773db85b#commitcomment-44064356

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 4851453)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modifieddocs/monitoring.md (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/history/HybridStore.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala (diff)
The file was modifieddocs/sql-ref-syntax-ddl-create-table-hiveformat.md (diff)
The file was modifieddocs/css/main.css (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/util/io/ChunkedByteBuffer.scala (diff)
The file was modifieddocs/sql-data-sources-jdbc.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/r/AFTSurvivalRegressionWrapper.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala (diff)
The file was modifieddocs/sql-migration-guide.md (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala (diff)
The file was modifieddocs/running-on-yarn.md (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/metrics/sink/PrometheusServlet.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala (diff)
The file was modifieddocs/sql-ref-syntax-qry-select-orderby.md (diff)
The file was modifiedmllib/src/test/java/org/apache/spark/ml/feature/JavaStopWordsRemoverSuite.java (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modifieddocs/running-on-kubernetes.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/rdd/RDD.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/DCTSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
The file was modifieddocs/building-spark.md (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/network/netty/NettyBlockTransferServiceSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala (diff)
The file was modifiedpom.xml (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/evaluation/RegressionEvaluatorSuite.scala (diff)
The file was modifieddocs/sql-ref-syntax-aux-conf-mgmt-set-timezone.md (diff)
The file was modifieddocs/sql-ref-syntax-qry-select-lateral-view.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala (diff)
The file was modifiedcore/src/test/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriterSuite.java (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala (diff)
The file was modifieddocs/sql-ref-syntax-qry-select-groupby.md (diff)
The file was modifieddocs/graphx-programming-guide.md (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/JsonProtocol.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/CheckpointSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/ANOVASelectorSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/util/SizeEstimatorSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/resource/TaskResourceRequest.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/security/CryptoStreamUtils.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/internal/io/FileCommitProtocolInstantiationSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/feature/Selector.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala (diff)
The file was modifieddocs/sparkr.md (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/metrics/InputOutputMetricsSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/rdd/RDDSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/FileSuite.scala (diff)
The file was modifieddocs/running-on-mesos.md (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/resource/ResourceUtilsSuite.scala (diff)
The file was modifiedrepl/src/test/scala/org/apache/spark/repl/ExecutorClassLoaderSuite.scala (diff)
The file was modifieddocs/mllib-clustering.md (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/api/python/PythonRDD.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/ContextCleanerSuite.scala (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/feature/VarianceThresholdSelectorSuite.scala (diff)
The file was modifieddocs/sql-ref-syntax-dml-insert-into.md (diff)
The file was modifieddocs/ml-migration-guide.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/regression/FMRegressor.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (diff)
The file was modifiedbin/docker-image-tool.sh (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/clustering/DistanceMeasure.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/classification/SVM.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/HealthTracker.scala (diff)
The file was modifiedcore/src/test/java/test/org/apache/spark/JavaAPISuite.java (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/BarrierJobAllocationFailed.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modifieddocs/mllib-data-types.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala (diff)
The file was modifieddocs/_plugins/include_example.rb (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/util/ClosureCleaner.scala (diff)
The file was modifiedcore/src/main/resources/org/apache/spark/ui/static/spark-dag-viz.js (diff)
The file was modifieddocs/sql-ref-syntax-dml-insert-overwrite-table.md (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala (diff)
The file was modifieddocs/configuration.md (diff)
The file was modifiedcore/src/main/resources/org/apache/spark/ui/static/utils.js (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala (diff)
Commit 2da72593c1cf63fc6f815416b8d553f0a53f3e65 by wenchen
[SPARK-32976][SQL] Support column list in INSERT statement

### What changes were proposed in this pull request?

#### JIRA expectations
```
   INSERT currently does not support named column lists.

   INSERT INTO <table> (col1, col2,…) VALUES( 'val1', 'val2', … )
   Note, we assume the column list contains all the column names. Issue an exception if the list is not complete. The column order could be different from the column order defined in the table definition.
```
#### implemetations
In this PR, we add a column list  as an optional part to the `INSERT OVERWRITE/INTO` statements:
```
  /**
   * {{{
   *   INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? [identifierList] ...
   *   INSERT INTO [TABLE] tableIdentifier [partitionSpec]  [identifierList] ...
   * }}}
   */
```
The column list represents all expected columns with an explicit order that you want to insert to the target table. **Particularly**,  we assume the column list contains all the column names in the current implementation, it will fail when the list is incomplete.

In **Analyzer**, we add a code path to resolve the column list in the `ResolveOutputRelation` rule before it is transformed to v1 or v2 command. It will fail here if the list has any field that not belongs to the target table.

Then, for v2 command, e.g. `AppendData`, we use the resolved column list and output of the target table to resolve the output of the source query `ResolveOutputRelation` rule. If the list has duplicated columns, we fail. If the list is not empty but the list size does not match the target table, we fail. If no other exceptions occur, we use the column list to map the output of the source query to the output of the target table.  The column list will be set to Nil and it will not hit the rule again after it is resolved.

for v1 command, those all happen in the `PreprocessTableInsertion` rule

### Why are the changes needed?
new feature support

### Does this PR introduce _any_ user-facing change?

yes, insert into/overwrite table support specify column list
### How was this patch tested?

new tests

Closes #29893 from yaooqinn/SPARK-32976.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2da7259)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FallBackFileSourceV2.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modifiedsql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/SQLInsertTestSuite.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was addedsql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSQLInsertTestSuite.scala
Commit 0fd9f57dd4cee32b4d0a16345f98e628a9d5f0fe by wenchen
[SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables

### What changes were proposed in this pull request?

This PR proposes to support `CHACHE/UNCACHE TABLE` commands for v2 tables.

In addtion, this PR proposes to migrate `CACHE/UNCACHE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To support `CACHE/UNCACHE TABLE` commands for v2 tables.

Note that `CACHE/UNCACHE TABLE` for v1 tables/views go through `SparkSession.table` to resolve identifier, which resolves temp views first, so there is no change in the behavior by moving to the new framework.

### Does this PR introduce _any_ user-facing change?

Yes. Now the user can run `CACHE/UNCACHE TABLE` commands on v2 tables.

### How was this patch tested?

Added/updated existing tests.

Closes #30403 from imback82/cache_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0fd9f57)
The file was modifiedsql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/command/cache.scala (diff)
Commit 225c2e2815988ebf3e0926a4ca2af9a933b48467 by gurwls223
[SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using checkCastWithParseError

### What changes were proposed in this pull request?

Dup code removed in SPARK-33498 as follow-up.

### Why are the changes needed?

Nit.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30540 from leanken/leanken-SPARK-33498.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 225c2e2)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
Commit b665d5881915f042930f502bcc3c6ee3cb00c50d by gurwls223
[SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream databases

### What changes were proposed in this pull request?
Currently, Spark allows calls to `count` even for non parameterless aggregate function. For example, the following query actually works:
`SELECT count() FROM tenk1;`
On the other hand, mainstream databases will throw an error.
**Oracle**
`> ORA-00909: invalid number of arguments`
**PgSQL**
`ERROR:  count(*) must be used to call a parameterless aggregate function`
**MySQL**
`> 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')`

### Why are the changes needed?
Fix a bug so that consistent with mainstream databases.
There is an example query output with/without this fix.
`SELECT count() FROM testData;`
The output before this fix:
`0`
The output after this fix:
```
org.apache.spark.sql.AnalysisException
cannot resolve 'count()' due to data type mismatch: count requires at least one argument.; line 1 pos 7
```

### Does this PR introduce _any_ user-facing change?
Yes.
If not specify parameter for `count`, will throw an error.

### How was this patch tested?
Jenkins test.

Closes #30541 from beliefer/SPARK-28646.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b665d58)
The file was modifiedsql/core/src/test/resources/sql-tests/results/count.sql.out (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/count.sql (diff)
Commit 5cfbdddefe0753c3aff03f326b31c0ba8882b3a9 by wenchen
[SPARK-33480][SQL] Support char/varchar type

### What changes were proposed in this pull request?

This PR adds the char/varchar type which is kind of a variant of string type:
1. Char type is fixed-length string. When comparing char type values, we need to pad the shorter one to the longer length.
2. Varchar type is string with a length limitation.

To implement the char/varchar semantic, this PR:
1. Do string length check when writing to char/varchar type columns.
2. Do string padding when reading char type columns. We don't do it at the writing side to save storage space.
3. Do string padding when comparing char type column with string literal or another char type column. (string literal is fixed length so should be treated as char type as well)

To simplify the implementation, this PR doesn't propagate char/varchar type info through functions/operators(e.g. `substring`). That said, a column can only be char/varchar type if it's a table column, not a derived column like `SELECT substring(col)`.

To be safe, this PR doesn't add char/varchar type to the query engine(expression input check, internal row framework, codegen framework, etc.). We will replace char/varchar type by string type with metadata (`Attribute.metadata` or `StructField.metadata`) that includes the original type string before it goes into the query engine. That said, the existing code will not see char/varchar type but only string type.

char/varchar type may come from several places:
1. v1 table from hive catalog.
2. v2 table from v2 catalog.
3. user-specified schema in `spark.read.schema` and `spark.readStream.schema`
4. `Column.cast`
5. schema string in places like `from_json`, pandas UDF, etc. These places use SQL parser which replaces char/varchar with string already, even before this PR.

This PR covers all the above cases, implements the length check and padding feature by looking at string type with special metadata.

### Why are the changes needed?

char and varchar are standard SQL types. varchar is widely used in other databases instead of string type.

### Does this PR introduce _any_ user-facing change?

For hive tables: now the table insertion fails if the value exceeds char/varchar length. Previously we truncate the value silently.

For other tables:
1. now char type is allowed.
2. now we have length check when inserting to varchar columns. Previously we write the value as it is.

### How was this patch tested?

new tests

Closes #30412 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 5cfbddd)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableSchemaParserSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ApplyCharTypePadding.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala (diff)
The file was addedsql/catalyst/src/main/scala/org/apache/spark/sql/types/VarcharType.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Column.scala (diff)
The file was addedsql/catalyst/src/main/scala/org/apache/spark/sql/types/CharType.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was addedsql/hive/src/test/scala/org/apache/spark/sql/HiveCharVarcharTestSuite.scala
The file was removedsql/catalyst/src/main/scala/org/apache/spark/sql/types/HiveStringType.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/types/package.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/CatalogV2UtilSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifieddocs/sql-ref-datatypes.md (diff)
The file was addedsql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala
The file was addedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CharVarcharUtils.scala
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
Commit 6e5446e61f278e9afac342e8f33905f5630aa7d5 by sarutak
[SPARK-33579][UI] Fix executor blank page behind proxy

### What changes were proposed in this pull request?

Fix some "hardcoded" API urls in Web UI.
More specifically, we avoid the use of `location.origin` when constructing URLs for internal API calls within the JavaScript.
Instead, we use `apiRoot` global variable.

### Why are the changes needed?

On one hand, it allows us to build relative URLs. On the other hand, `apiRoot` reflects the Spark property `spark.ui.proxyBase` which can be set to change the root path of the Web UI.

If `spark.ui.proxyBase` is actually set, original URLs become incorrect, and we end up with an executors blank page.
I encounter this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).

See the following link for more context:
https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115

### Does this PR introduce _any_ user-facing change?

Yes, as all the changes introduced are in the JavaScript for the Web UI.

### How the changes have been tested ?
I modified/debugged the JavaScript as in the commit with the help of the developer tools in Google Chrome, while accessing the Web UI of my Spark app behind my k8s ingress.

Closes #30523 from pgillet/fix-executors-blank-page-behind-proxy.

Authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 6e5446e)
The file was modifiedcore/src/main/resources/org/apache/spark/ui/static/stagepage.js (diff)
The file was modifiedcore/src/main/resources/org/apache/spark/ui/static/utils.js (diff)
Commit 0a612b6a40696ed8ce00997ebb4e76d05adbbd82 by wenchen
[SPARK-33452][SQL] Support v2 SHOW PARTITIONS

### What changes were proposed in this pull request?
1. Remove V2 logical node `ShowPartitionsStatement `, and replace it by V2 `ShowPartitions`.
2. Implement V2 execution node `ShowPartitionsExec` similar to V1 `ShowPartitionsCommand`.

### Why are the changes needed?
To have feature parity with Datasource V1.

### Does this PR introduce _any_ user-facing change?
Yes.

Before the change, `SHOW PARTITIONS` fails in V2 table catalogs with the exception:
```
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is only supported with v1 tables.
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$parseV1Table(ResolveSessionCatalog.scala:628)
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:466)
```

### How was this patch tested?
By running the following test suites:
1. Modified `ShowPartitionsParserSuite` where `ShowPartitionsStatement` is replaced by V2 `ShowPartitions`.
2. `v2.ShowPartitionsSuite`

Closes #30398 from MaxGekk/show-partitions-exec-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0a612b6)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsParserSuite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowPartitionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AlterTableDropPartitionExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modifiedsql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowPartitionsSuite.scala (diff)
The file was addedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowPartitionsExec.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AlterTableAddPartitionExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
Commit 6fd148fea890391941f876e0a14446d875fe72e1 by wenchen
[SPARK-33569][SQL] Remove getting partitions by an identifier prefix

### What changes were proposed in this pull request?
1. Remove the method `listPartitionIdentifiers()` from the `SupportsPartitionManagement` interface. The method lists partitions by ident prefix.
2. Rename `listPartitionByNames()` to `listPartitionIdentifiers()`.
3. Re-implement the default method `partitionExists()` using new method.

### Why are the changes needed?
Getting partitions by ident prefix only is not used, and it can be removed to improve code maintenance. Also this makes the `SupportsPartitionManagement` interface cleaner.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly org.apache.spark.sql.connector.catalog.*"
```

Closes #30514 from MaxGekk/remove-listPartitionIdentifiers.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 6fd148f)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryPartitionTable.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/SupportsPartitionManagementSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/connector/AlterTablePartitionV2SQLSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/SupportsAtomicPartitionManagementSuite.scala (diff)
Commit 030b3139dadc342e82d71f3fb241c320a7577131 by wenchen
[SPARK-33569][SPARK-33452][SQL][FOLLOWUP] Fix a build error in `ShowPartitionsExec`

### What changes were proposed in this pull request?
Use `listPartitionIdentifiers ` instead of `listPartitionByNames` in `ShowPartitionsExec`. The `listPartitionByNames` was renamed by https://github.com/apache/spark/pull/30514.

### Why are the changes needed?
To fix build error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running tests for the `SHOW PARTITIONS` command:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"
```

Closes #30553 from MaxGekk/fix-build-show-partitions-exec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 030b313)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowPartitionsExec.scala (diff)
Commit f3c2583cc3ad6a2a24bfb09e2ee7af4e63e5bf66 by mridulatgmail.com
[SPARK-33185][YARN][FOLLOW-ON] Leverage RM's RPC API instead of REST to fetch driver log links in yarn.Client

### What changes were proposed in this pull request?
This is a follow-on to PR #30096 which initially added support for printing direct links to the driver stdout/stderr logs from the application report output in `yarn.Client` using the `spark.yarn.includeDriverLogsLink` configuration. That PR made use of the ResourceManager's REST APIs to fetch the necessary information to construct the links. This PR proposes removing the dependency on the REST API, since the new logic is the only place in `yarn.Client` which makes use of this API, and instead leverages the RPC API via `YarnClient`, which brings the code in line with the rest of `yarn.Client`.

### Why are the changes needed?

While the old logic worked okay when running a Spark application in a "standard" environment with full access to Kerberos credentials, it can fail when run in an environment with restricted Kerberos credentials. In our case, this environment is represented by [Azkaban](https://azkaban.github.io/), but it likely affects other job scheduling systems as well. In such an environment, the application has delegation tokens which enabled it to communicate with services such as YARN, but the RM REST API is not typically covered by such delegation tokens (note that although YARN does actually support accessing the RM REST API via a delegation token as documented [here](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Delegation_Tokens_API), it is a new feature in alpha phase, and most deployments are likely not retrieving this token today).

Besides this enhancement, leveraging the `YarnClient` APIs greatly simplifies the processing logic, such as removing all JSON parsing.

### Does this PR introduce _any_ user-facing change?

Very minimal user-facing changes on top of PR #30096. Basically expands the scope of environments in which that feature will operate correctly.

### How was this patch tested?

In addition to redoing the `spark-submit` testing as mentioned in PR #30096, I also tested this logic in a restricted-credentials environment (Azkaban). It succeeds where the previous logic would fail with a 401 error.

Closes #30450 from xkrogen/xkrogen-SPARK-33185-driverlogs-followon.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: f3c2583)
The file was modifiedresource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (diff)
The file was modifiedresource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala (diff)
The file was modifiedresource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (diff)
Commit c6994354f70061b2a15445dbd298a2db926b548c by dongjoon
[SPARK-33545][CORE] Support Fallback Storage during Worker decommission

### What changes were proposed in this pull request?

This PR aims to support storage migration to the fallback storage like cloud storage (`S3`) during worker decommission for the corner cases where the exceptions occur or there is no live peer left.

Although this PR focuses on cloud storage like `S3` which has a TTL feature in order to simplify Spark's logic, we can use alternative fallback storages like HDFS/NFS(EFS) if the user provides a clean-up mechanism.

### Why are the changes needed?

Currently, storage migration is not possible when there is no available executor. For example, when there is one executor, the executor cannot perform storage migration because it has no peer.

### Does this PR introduce _any_ user-facing change?

Yes. This is a new feature.

### How was this patch tested?

Pass the CIs with newly added test cases.

Closes #30492 from dongjoon-hyun/SPARK-33545.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c699435)
The file was modifiedcore/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
The file was addedcore/src/test/scala/org/apache/spark/storage/FallbackStorageSuite.scala
The file was modifiedcore/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modifiedcore/pom.xml (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff)
The file was addedcore/src/main/scala/org/apache/spark/storage/FallbackStorage.scala
The file was modifiedcore/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
Commit f5d2165c95fe83f24be9841807613950c1d5d6d0 by kabhwan.opensource
[SPARK-33440][CORE] Use current timestamp with warning log in HadoopFSDelegationTokenProvider when the issue date for token is not set up properly

### What changes were proposed in this pull request?

This PR proposes to use current timestamp with warning log when the issue date for token is not set up properly. The next section will explain the rationalization with details.

### Why are the changes needed?

Unfortunately not every implementations respect the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on while calculating. The default value of issue date is 0L, which is far from actual issue date, breaking logic on calculating next renewal date under some circumstance, leading to 0 interval (immediate) on rescheduling token renewal.

In HadoopFSDelegationTokenProvider, Spark calculates token renewal interval as below:

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L123-L134

The interval is calculated as `token.renew() - identifier.getIssueDate`, which is providing correct interval assuming both `token.renew()` and `identifier.getIssueDate` produce correct value, but it's going to be weird when `identifier.getIssueDate` provides 0L (default value), like below:

```
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
```

Hopefully we pick the minimum value as safety guard (so in this case, `86400048` is being picked up), but the safety guard leads unintentional bad impact on this case.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala#L58-L71

Spark leverages the interval being calculated in above, "minimum" value of intervals, and blindly adds the value to token's issue date to calculates the next renewal date for the token, and picks "minimum" value again. In problematic case, the value would be `86400048` (86400048 + 0) which is quite smaller than current timestamp.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L228-L234

The next renewal date is subtracted with current timestamp again to get the interval, and multiplexed by configured ratio to produce the final schedule interval. In problematic case, this value goes to negative.

https://github.com/apache/spark/blob/2c64b731ae6a976b0d75a95901db849b4a0e2393/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala#L180-L188

There's a safety guard to not allow negative value, but that's simply 0 meaning schedule immediately. This triggers next calculation of next renewal date to calculate the schedule interval, lead to the same behavior, hence updating delegation token immediately and continuously.

As we fetch token just before the calculation happens, the actual issue date is likely slightly before, hence it's not that dangerous to use current timestamp as issue date for the token the issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log warn message to let end users consult with token implementer.

### Does this PR introduce _any_ user-facing change?

Yes. End users won't encounter the tight loop of schedule of token renewal after the PR. In end users' perspective of reflection, there's nothing end users need to change.

### How was this patch tested?

Manually tested with problematic environment.

Closes #30366 from HeartSaVioR/SPARK-33440.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: f5d2165)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala (diff)
Commit 596fbc1d292259c8850f026e2d7267056abee3bc by gurwls223
[SPARK-33556][ML] Add array_to_vector function for dataframe column

### What changes were proposed in this pull request?

Add array_to_vector function for dataframe column

### Why are the changes needed?
Utility function for array to vector conversion.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
scala unit test & doctest.

Closes #30498 from WeichenXu123/array_to_vec.

Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 596fbc1)
The file was modifiedpython/docs/source/reference/pyspark.ml.rst (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/functions.scala (diff)
The file was modifiedpython/pyspark/ml/functions.pyi (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/FunctionsSuite.scala (diff)
The file was modifiedpython/pyspark/ml/functions.py (diff)
Commit aeb3649fb9103a7541ef54f451c60fcd5a091934 by gurwls223
[SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests

### What changes were proposed in this pull request?

This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.

- https://docs.python.org/3/library/unittest.html#deprecated-aliases

### Why are the changes needed?

For consistency and eventual removal of deprecated APIs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: aeb3649)
The file was modifiedpython/pyspark/sql/tests/test_conf.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_utils.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_arrow.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_tuning.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_map.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_wrapper.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_udf_grouped_agg.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_column.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_feature.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_udf_window.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_image.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_dataframe.py (diff)
The file was modifiedpython/pyspark/tests/test_worker.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_persistence.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_param.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_datasources.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_types.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_udf_typehints.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_udf.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_grouped_map.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_functions.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_udf.py (diff)
The file was modifiedpython/pyspark/tests/test_rdd.py (diff)
The file was modifiedpython/pyspark/tests/test_profiler.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_udf_scalar.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_catalog.py (diff)
The file was modifiedpython/pyspark/sql/tests/test_pandas_cogrouped_map.py (diff)
Commit 80161238fe9393aabd5fcd56752ff1e43f6989b1 by ruifengz
[SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving validator estimatorParamMaps, will check all nested stages in tuned estimator to get correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(commit: 8016123)
The file was modifiedpython/pyspark/ml/classification.py (diff)
The file was modifiedpython/pyspark/ml/param/__init__.py (diff)
The file was modifiedpython/pyspark/ml/tuning.py (diff)
The file was modifiedpython/pyspark/ml/pipeline.py (diff)
The file was modifiedpython/pyspark/ml/tests/test_tuning.py (diff)
The file was addedpython/pyspark/ml/tests/test_util.py
The file was modifiedpython/pyspark/ml/util.pyi (diff)
The file was modifiedpython/pyspark/ml/util.py (diff)
The file was modifieddev/sparktestsupport/modules.py (diff)
Commit c50fcac00ea9b86aa6f6edb738e53ba476261027 by kabhwan.opensource
[SPARK-33607][SS][WEBUI] Input Rate timeline/histogram aren't rendered if built with Scala 2.13

### What changes were proposed in this pull request?

This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13.

![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png)
![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png)

The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13.

The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`.
Actually, the comparison logic for `Double` type was changed in Scala 2.13.
https://github.com/scala/bug/issues/12107
https://github.com/scala/scala/pull/6410

So this issue happens as of Scala 2.13.

The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164).
This `NaN` seems to be an initial value of `inputTimeSec` so I think `Double.PositiveInfinity` is suitable rather than `NaN` and this change can resolve this issue.

### Why are the changes needed?

To make sure we can use the histogram/timeline with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

First, I built with the following commands.
```
$ /dev/change-scala-version.sh 2.13
$ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package
```

Then, ran the following query (this is brought from #30427 ).
```
import org.apache.spark.sql.streaming.Trigger
val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
  .selectExpr("tsMod", "mod(value, 100) as mod", "value")
  .withWatermark("tsMod", "10 seconds")
  .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()
```

Finally, I confirmed that the timeline and histogram are rendered.
![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png)

```

Closes #30546 from sarutak/ss-nan.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: c50fcac)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala (diff)
Commit 2af2da5a4b1f5dbf0b55afd0b2514a52f03ffa94 by kabhwan.opensource
[SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch

### What changes were proposed in this pull request?

This patch addresses the case where compact metadata file is read twice in FileStreamSource during restarting query.

When restarting the query, there is a case which the query starts from compaction batch, and the batch has source metadata file to read. One case is that the previous query succeeded to read from inputs, but not finalized the batch for various reasons.

The patch finds the latest compaction batch when restoring from metadata log, and put entries for the batch into the file entry cache which would avoid reading compact batch file twice.

FileStreamSourceLog doesn't know about offset / commit metadata in checkpoint so doesn't know which exactly batch to start from, but in practice, only couple of latest batches are candidates to
be started from when restarting query. This patch leverages the fact to skip calculation if possible.

### Why are the changes needed?

Spark incurs unnecessary cost on reading the compact metadata file twice on some case, which may not be ignorable when the query has been processed huge number of files so far.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27649 from HeartSaVioR/SPARK-30900.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 2af2da5)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (diff)
Commit 1a042cc414c0c720535798b9a1197fe8885d6f6e by gurwls223
[SPARK-33530][CORE] Support --archives and spark.archives option natively

### What changes were proposed in this pull request?

TL;DR:
- This PR completes the support of archives in Spark itself instead of Yarn-only
  - It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration.
-  After this PR, PySpark users can leverage Conda to ship Python packages together as below:
    ```python
    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
    conda activate pyspark_env
    conda pack -f -o pyspark_env.tar.gz
    PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
   ```
- Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`.

This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:

```bash
./bin/spark-submit --help
```

```
Options:
...
Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
```

This `archives` feature is useful often when you have to ship a directory and unpack into executors. One example is native libraries to use e.g. JNI. Another example is to ship Python packages together by Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).

The neatest way is arguably to use Conda environment by shipping zipped Conda environment but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour that untars `tar.gz` but I don't think we should document such ways and promote people to more rely on it.

Also, note that this PR does not target to add the feature parity of `spark.files.overwrite`, `spark.files.useFetchCache`, etc. yet. I documented that this is an experimental feature as well.

### Why are the changes needed?

To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.

### Does this PR introduce _any_ user-facing change?

Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.

### How was this patch tested?

I added unittests. Also, manually tested in standalone cluster, local-cluster, and local modes.

Closes #30486 from HyukjinKwon/native-archive.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 1a042cc)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/rest/SubmitRestProtocolSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/SparkEnv.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
The file was modifieddocs/configuration.md (diff)
The file was modifiedresource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosFineGrainedSchedulerBackendSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/TaskDescriptionSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/CoarseGrainedSchedulerBackendSuite.scala (diff)
The file was modifiedproject/MimaExcludes.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/SparkContextSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/executor/CoarseGrainedExecutorBackendSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala (diff)
The file was modifiedpython/docs/source/user_guide/python_packaging.rst (diff)
Commit 52e5cc46bc184bf582f9bc9ebcc5c8180222c421 by kabhwan.opensource
[SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

### What changes were proposed in this pull request?

This patch proposes to provide a new option to specify time-to-live (TTL) for output file entries in FileStreamSink. TTL is defined via current timestamp - the last modified time for the file.

This patch will filter out outdated output files in metadata while compacting batches (other batches don't have functionality to clean entries), which helps metadata to not grow linearly, as well as filtered out files will be "eventually" no longer seen in reader queries which leverage File(Stream)Source.

### Why are the changes needed?

The metadata log greatly helps to easily achieve exactly-once but given the output path is open to arbitrary readers, there's no way to compact the metadata log, which ends up growing the metadata file as query runs for long time, especially for compacted batch.

Lots of end users have been reporting the issue: see comments in [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295) and [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995), and [SPARK-30462](https://issues.apache.org/jira/browse/SPARK-30462).
(There're some reports from end users which include their workarounds: SPARK-24295)

### Does this PR introduce any user-facing change?

No, as the configuration is new and by default it is not applied.

### How was this patch tested?

New UT.

Closes #28363 from HeartSaVioR/SPARK-27188-v2.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 52e5cc4)
The file was modifieddocs/structured-streaming-programming-guide.md (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLogSuite.scala (diff)
Commit 103481551979297729123aaa56896d182d74847f by wenchen
[SPARK-33572][SQL] Datetime building should fail if the year, month, ..., second combination is invalid

### What changes were proposed in this pull request?
Datetime building should fail if the year, month, ..., second combination is invalid, when ANSI mode is enabled. This patch should update MakeDate, MakeTimestamp and MakeInterval.

### Why are the changes needed?
For ANSI mode.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT and Existing UT.

Closes #30516 from waitinfuture/SPARK-33498.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: waitinfuture <waitinfuture@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1034815)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/IntervalExpressionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out (diff)
Commit e5bb2937f6682239e83605b65214dfca3bdd50e5 by kabhwan.opensource
[SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API

### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information what migration consideration must be done. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.

The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach  (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed Kafka auto topic creation feature but only in `AdminClient` based approach (please see doc for rationale). In short, it's super hidden, not sure anybody ever used in production + error prone.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`

### Why are the changes needed?
Driver may hang forever.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
Cluster test with simple Kafka topic to another topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #29729 from gaborgsomogyi/SPARK-32032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: e5bb293)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/ConsumerStrategy.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala (diff)
The file was addedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderAdmin.scala
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReader.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaRelation.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaBatch.scala (diff)
The file was addedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/ConsumerStrategySuite.scala
The file was modifieddocs/structured-streaming-kafka-integration.md (diff)
The file was addedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderConsumer.scala
The file was modifieddocs/ss-migration-guide.md (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderSuite.scala (diff)
Commit d38883c1d811f57e5b9f07b29730b7ac6a6731ca by wenchen
[SPARK-32405][SQL][FOLLOWUP] Throw Exception if provider is specified in JDBCTableCatalog create table

### What changes were proposed in this pull request?
Throw Exception if JDBC Table Catalog has provider in create table.

### Why are the changes needed?
JDBC Table Catalog doesn't support provider and we should throw Exception. Previously CREATE TABLE syntax forces people to specify a provider so we have to add a `USING_`. Now the problem was fix and we will throw Exception for provider.

### Does this PR introduce _any_ user-facing change?
Yes. We throw Exception if a provider is specified in CREATE TABLE for JDBC Table catalog.

### How was this patch tested?
Existing tests (remove `USING _`)

Closes #30544 from huaxingao/followup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d38883c)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala (diff)
Commit 9273d4250ddd5e011487a5a942c1b4d0f0412f78 by wenchen
[SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue

### What changes were proposed in this pull request?
Spark already support `LIKE ANY` syntax, but it will throw `StackOverflowError` if there are many elements(more than 14378 elements). We should implement built-in function for LIKE ANY to fix this issue.

Why the stack overflow can happen in the current approach ?
The current approach uses reduceLeft to connect each `Like(e, p)`, this will lead the the call depth of the thread is too large, causing `StackOverflowError` problems.

Why the fix in this PR can avoid the error?
This PR support built-in function for `LIKE ANY` and avoid this issue.

### Why are the changes needed?
1.Fix the `StackOverflowError` issue.
2.Support built-in function `like_any`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #30465 from beliefer/SPARK-33045-like_any-bak.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 9273d42)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/like-all.sql (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/like-any.sql (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala (diff)
Commit cf4ad212b100901b7065f2db8c1688c83423141d by yamamuro
[SPARK-33503][SQL] Refactor SortOrder class to allow multiple childrens

### What changes were proposed in this pull request?
This is a followup of #30302 . As part of this PR, sameOrderExpressions set is made part of children of SortOrder node - so that they don't need any special handling as done in #30302 .

### Why are the changes needed?
sameOrderExpressions should get same treatment as child. So making them part of children helps in transforming them easily.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: cf4ad21)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/AliasAwareOutputExpression.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Column.scala (diff)
Commit 478fb7f5280d8da2c68b858114eda358708e681b by wenchen
[SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates

### What changes were proposed in this pull request?

This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`.
### Why are the changes needed?

Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The PR adds 3 new test cases.

Closes #30555 from aokolnychyi/spark-33608.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 478fb7f)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PullupCorrelatedPredicatesSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
Commit c24f2b2d6afb411fbfffb90fa87150f3b6912343 by dongjoon
[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer

### What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

### Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

Closes #30558 from aokolnychyi/spark-33612.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: c24f2b2)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
Commit 5d0045eedf4b138c031accac2b1fa1e8d6f3f7c6 by gengliang.wang
[SPARK-33611][UI] Avoid encoding twice on the query parameter of rewritten proxy URL

### What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(),
      rewrittenURI.getPath(),
      query,
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
```

In Spark's stage page, the URL of "/taskTable" contains query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other than that, the other parameter may not work as expected after encoded twice.

This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0
```

### Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to https://github.com/apache/spark/pull/29271.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test.
Also, Manual UI testing for master, worker and app UI with an nginx proxy

Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #30552 from gengliangwang/decodeProxyRedirect.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(commit: 5d0045e)
The file was modifiedcore/src/main/scala/org/apache/spark/ui/JettyUtils.scala (diff)
The file was modifiedcore/src/test/scala/org/apache/spark/ui/UISuite.scala (diff)
Commit 5a1c5ac8073ab46c145146485c71cc6aceb8c5b8 by dongjoon
[SPARK-33622][R][ML] Add array_to_vector to SparkR

### What changes were proposed in this pull request?

This PR adds `array_to_vector` to R API.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New function exposed in the public API.

### How was this patch tested?

New unit test.
Manual verification of the documentation examples.

Closes #30561 from zero323/SPARK-33622.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 5a1c5ac)
The file was modifiedR/pkg/R/generics.R (diff)
The file was modifiedR/pkg/tests/fulltests/test_sparkSQL.R (diff)
The file was modifiedR/pkg/R/functions.R (diff)
The file was modifiedR/pkg/NAMESPACE (diff)
Commit f71f34572d5510e50953ccd0191c833962b63a32 by gurwls223
[SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size of its children

### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-32295 added in an optimization to insert a filter for not null and size > 0 when using inner explode/inline. This is fine in most cases but the extra filter is not needed if the explode is with a create array and not using Literals (it already handles LIterals).  When this happens you know that the values aren't null and it has a size.  It already handles the empty array.

The not null check is already optimized out because Createarray and createMap are not nullable, that leaves the size > 0 check. To handle that this PR makes it so that the size > 0 check gets optimized in ConstantFolding to be the size of the children in the array or map.  That makes it a literal and then makes it ultimately be optimized out.

### Why are the changes needed?
remove unneeded filter

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Unit tests added and manually tested various cases

Closes #30504 from tgravescs/SPARK-33544.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: f71f345)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromGenerateSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ConstantFoldingSuite.scala (diff)
Commit 51ebcd95a5f7e377245f302a91e90f9b3db9953e by kabhwan.opensource
[SPARK-32863][SS] Full outer stream-stream join

### What changes were proposed in this pull request?

This PR is to add full outer stream-stream join, and the implementation of full outer join is:
* For left side input row, check if there's a match on right side state store.
  * if there's a match, output the joined row, o.w. output nothing. Put the row in left side state store.
* For right side input row, check if there's a match on left side state store.
  * if there's a match, output the joined row, o.w. output nothing. Put the row in right side state store.
* State store eviction: evict rows from left/right side state store below watermark, and output rows never matched before (a combination of left outer and right outer join).

### Why are the changes needed?

Enable more use cases for spark stream-stream join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30395 from c21/stream-foj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 51ebcd9)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (diff)
Commit a4788ee8c61e1373e6eded41bb57d84c68149968 by kabhwan.opensource
[MINOR][SS] Rename auxiliary protected methods in StreamingJoinSuite

### What changes were proposed in this pull request?

Per request from https://github.com/apache/spark/pull/30395#issuecomment-735028698, here we remove `Windowed` from methods names `setupWindowedJoinWithRangeCondition` and `setupWindowedSelfJoin` as they don't join on time window.

### Why are the changes needed?

There's no such official name for `windowed join`, so this is to help avoid confusion for future developers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #30563 from c21/stream-minor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: a4788ee)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala (diff)
Commit 290aa021796139e503454d315e5cd350f836ab42 by gurwls223
[SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work

### What changes were proposed in this pull request?

This reverts commit SPARK-33212 (cb3fa6c9368e64184a5f7b19688181d11de9511c) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.

### Why are the changes needed?

According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.

**1. Spark distribution with `-Phadoop-cloud`**

```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```

**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 290aa02)
The file was modifiedexternal/kinesis-asl-assembly/pom.xml (diff)
The file was modifiedlauncher/pom.xml (diff)
The file was modifiedcommon/network-yarn/pom.xml (diff)
The file was modifieddev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modifiedexternal/kafka-0-10-sql/pom.xml (diff)
The file was modifiedhadoop-cloud/pom.xml (diff)
The file was modifiedpom.xml (diff)
The file was modifiedresource-managers/kubernetes/core/pom.xml (diff)
The file was modifiedresource-managers/yarn/pom.xml (diff)
The file was modifiedresource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (diff)
The file was modifiedsql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (diff)
The file was modifiedsql/hive/pom.xml (diff)
The file was modifieddev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modifiedcore/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-assembly/pom.xml (diff)
The file was modifiedsql/catalyst/pom.xml (diff)
The file was modifiedexternal/kafka-0-10-token-provider/pom.xml (diff)
The file was modifiedresource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala (diff)
Commit 084d38b64ecbcaa9fac47ffca5604cf2a72936fc by gurwls223
[SPARK-33557][CORE][MESOS][TEST] Ensure the relationship between STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT and NETWORK_TIMEOUT

### What changes were proposed in this pull request?
As described in SPARK-33557, `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend` will always use `Network.NETWORK_TIMEOUT.defaultValueString` as value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when we configure `NETWORK_TIMEOUT` without configure `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`, this is different from the relationship described in `configuration.md`.

To fix this problem,the main change of this pr as follow:

- Remove the explicitly default value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`

- Use actual value of `NETWORK_TIMEOUT` as `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` not configured in `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend`

### Why are the changes needed?
To ensure the relationship between `NETWORK_TIMEOUT` and  `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` as we described in `configuration.md`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manual test configure `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` locally

Closes #30547 from LuciferYang/SPARK-33557.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 084d38b)
The file was modifiedrepl/src/test/scala/org/apache/spark/repl/ExecutorClassLoaderSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/HeartbeatReceiver.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modifiedresource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala (diff)
Commit 28dad1ba770e5b7f7cf542da1ae3f05975a969c6 by tgraves
[SPARK-33504][CORE] The application log in the Spark history server contains sensitive attributes should be redacted

### What changes were proposed in this pull request?
To make sure the sensitive attributes to be redacted in the history server log.

### Why are the changes needed?
We found the secure attributes like password  in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly.
The screenshot can be viewed in the attachment of JIRA spark-33504
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
muntual test works well, I have also added unit testcase.

Closes #30446 from akiyamaneko/eventlog_unredact.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(commit: 28dad1b)
The file was modifiedcore/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala (diff)
The file was modifiedcore/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala (diff)
Commit df8d3f1bf779ce1a9f3520939ab85814f09b48b7 by wenchen
[SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify the documentation more

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30504. It proposes:

- Rename `NoSideEffect` to `NoThrow`, and use `Expression.deterministic` together where it is used.
- Clarify, in the docs in the expressions, that it means they don't throw exceptions

### Why are the changes needed?

`NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic.
It's best to be explicit so `NoThrow` was proposed -  I looked if there's a similar name to represent this concept and borrowed the name of [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow).
For determinism, we already have a way to note it under `Expression.deterministic`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the existing unittests written.

Closes #30570 from HyukjinKwon/SPARK-33544.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: df8d3f1)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ConstantFoldingSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferFiltersFromGenerateSuite.scala (diff)
Commit 58583f7c3fdcac1232607a7ab4b0d052320ac3ea by wenchen
[SPARK-33619][SQL] Fix GetMapValueUtil code generation error

### What changes were proposed in this pull request?

Code Gen bug fix that introduced by SPARK-33460

```
GetMapValueUtil

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

SHOULD BE

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""
```

And the reason why SPARK-33460 failed to detect this bug via UT,  it was because that `checkExceptionInExpression ` did not work as expect like `checkEvaluation` which will try eval expression with BOTH `CODEGEN_ONLY` and `NO_CODEGEN` mode, and in this PR, will also fix this Test bug, too.

### Why are the changes needed?

Bug Fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add UT and Existing UT.

Closes #30560 from leanken/leanken-SPARK-33619.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 58583f7)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathExpressionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/IntervalExpressionsSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelperSuite.scala (diff)
Commit 91182d6cce0a56a50801d530aff0c8e3aba59e27 by dongjoon
[SPARK-33626][K8S][TEST] Allow k8s integration tests to assert both driver and executor logs for expected log(s)

### What changes were proposed in this pull request?

Allow k8s integration tests to assert both driver and executor logs for expected log(s)

### Why are the changes needed?

Some of the tests will be able to provide full coverage of the use case, by asserting both driver and executor logs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

TBD

Closes #30568 from ScrapCodes/expectedDriverLogChanges.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 91182d6)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala (diff)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/SparkConfPropagateSuite.scala (diff)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/RTestsSuite.scala (diff)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala (diff)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala (diff)
The file was modifiedresource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PythonTestsSuite.scala (diff)
Commit a082f4600b1cb814442beed1b578bc3430a257a7 by wenchen
[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin

### What changes were proposed in this pull request?

Currently, `join()` uses `withPlan(logicalPlan)` for convenient to call some Dataset functions. But it leads to the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset`(because `withPlan(logicalPlan)` will create a new Dataset with the new id and reset the `dataset_id` with the new id of the `logicalPlan`). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` but use the `logicalPlan` directly so its `dataset_id` doesn't change.

Besides, this PR also removes related metadata (`DATASET_ID_KEY`,  `COL_POS_KEY`) when an `Alias` tries to construct its own metadata. Because the `Alias` is no longer a reference column after converting to an `Attribute`.  To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.

### Why are the changes needed?

For the query below, it returns the wrong result while it should throws ambiguous self join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

### Does this PR introduce _any_ user-facing change?

Yes, users hit the exception instead of the wrong result after this PR.

### How was this patch tested?

Added a new unit test.

Closes #30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: a082f46)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/DataFrameSelfJoinSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/Column.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/SparkSessionExtensionSuite.scala (diff)
Commit b76c6b759c8dd549290aa174b62b8d34ea34aa3f by dongjoon
[SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS

### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/28534 adds functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamp, this PR is to add functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to numbers.

### Why are the changes needed?

1. Symmetry of the conversion functions
2. Casting timestamp type to numeric types is disallowed in ANSI mode, we should provide functions for users to complete the conversion.

### Does this PR introduce _any_ user-facing change?

3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type.

### How was this patch tested?

Unit tests.

Closes #30566 from gengliangwang/timestampLong.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: b76c6b7)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/datetime.sql (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/DateExpressionsSuite.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
Commit 92bfbcb2e372e8fecfe65bc582c779d9df4036bb by dongjoon
[SPARK-33631][DOCS][TEST] Clean up spark.core.connection.ack.wait.timeout from configuration.md

### What changes were proposed in this pull request?
SPARK-9767  remove `ConnectionManager` and related files, the configuration `spark.core.connection.ack.wait.timeout` previously used by `ConnectionManager` is no longer used by other Spark code, but it still exists in the `configuration.md`.

So this pr cleans up the useless configuration item spark.core.connection.ack.wait.timeout` from `configuration.md`.

### Why are the changes needed?
Clean up useless configuration from `configuration.md`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30569 from LuciferYang/SPARK-33631.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 92bfbcb)
The file was modifiedcore/src/test/scala/org/apache/spark/storage/BlockManagerReplicationSuite.scala (diff)
The file was modifieddocs/configuration.md (diff)