Changes

Summary

  1. [SPARK-37496][SQL] Migrate ReplaceTableAsSelectStatement to v2 command (commit: f97de30) (details)
  2. [SPARK-37514][PYTHON] Remove workarounds due to older pandas (commit: ffe3fc9) (details)
  3. [SPARK-37494][SQL] Unify v1 and v2 options output of `SHOW CREATE TABLE` (commit: b9b5562) (details)
  4. [SPARK-37461][YARN][FOLLOWUP] Refactor YARN Client code to avoid add (commit: 0b42cd4) (details)
  5. [SPARK-37326][SQL][FOLLOW-UP] Update code and tests for TimestampNTZ (commit: f7dabd8) (details)
  6. [SPARK-36396][PYTHON][FOLLOWUP] Fix test with extensions dtype when (commit: b7a5543) (details)
  7. [SPARK-37511][DOCS][FOLLOW-UP] Fix documentation build warning from (commit: 893feb9) (details)
  8. [SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast (commit: c37b726) (details)
  9. [SPARK-37450][SQL] Prune unnecessary fields from Generate (commit: c758b44) (details)
  10. [SPARK-37522][PYTHON][TESTS] Fix (commit: eec9fec) (details)
  11. [SPARK-37504][PYTHON] Pyspark create SparkSession with existed session (commit: 8952fbc) (details)
  12. [SPARK-37520][SQL] Add the `startswith()` and `endswith()` string (commit: 2479796) (details)
  13. [SPARK-37512][PYTHON] Support TimedeltaIndex creation (from (commit: 15d1122) (details)
  14. [SPARK-37524][SQL] We should drop all tables after testing dynamic (commit: 2433c94) (details)
  15. [SPARK-37526][INFRA][PYTHON][TESTS] Add Java17 PySpark daily test (commit: f99e2e6) (details)
  16. [SPARK-37531][INFRA][PYTHON][TESTS] Use PyArrow 6.0.0 in Python 3.9 (commit: eba4f5c) (details)
  17. [SPARK-37534][BUILD] Bump dev.ludovic.netlib to 2.2.1 (commit: ae9aeba) (details)
  18. [SPARK-37530][CORE] Spark reads many paths very slow though (commit: e0d41e8) (details)
  19. [SPARK-35162][TESTS][FOLLOWUP] Test try_arithmetic.sql under ANSI mode (commit: b1b05fb) (details)
  20. [SPARK-36902][SQL] Migrate CreateTableAsSelectStatement to v2 command (commit: 688fa23) (details)
  21. [SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect (commit: 16f6295) (details)
  22. [SPARK-37455][SQL] Replace hash with sort aggregate if child is already (commit: 544865d) (details)
  23. [SPARK-37471][SQL] spark-sql  support `;` in nested bracketed comment (commit: 6e19125) (details)
  24. [SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc (commit: f570d01) (details)
  25. [SPARK-37495][PYTHON] Skip identical index checking of Series.compare (commit: b2a4e8f) (details)
  26. [SPARK-37508][SQL][DOCS][FOLLOW-UP] Update expression desc of (commit: b5fc6da) (details)
  27. [SPARK-37510][PYTHON] Support basic operations of timedelta Series/Index (commit: 1f3eb73) (details)
  28. [SPARK-37330][SQL] Migrate ReplaceTableStatement to v2 command (commit: c411d26) (details)
Commit f97de309792f382ae823894e978f7e54f34f1a29 by huaxin_gao
[SPARK-37496][SQL] Migrate ReplaceTableAsSelectStatement to v2 command

### What changes were proposed in this pull request?
This PR migrates `ReplaceTableAsSelectStatement` to the v2 command

### Why are the changes needed?
Migrate to the standard V2 framework

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #34754 from huaxingao/replace_table.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(commit: f97de30)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
Commit ffe3fc9d23967e41092cf67539aa7f0d77b9eb75 by gurwls223
[SPARK-37514][PYTHON] Remove workarounds due to older pandas

### What changes were proposed in this pull request?

Removes workarounds due to older pandas.

### Why are the changes needed?

Now that we have upgraded the minimum version of pandas to `1.0.5`, we can remove the workarounds that let the pandas API on Spark run with older pandas.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified existing tests to remove workarounds for older pandas.

Closes #34772 from ueshin/issues/SPARK-37514/older_pandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: ffe3fc9)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_plotly.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modified python/pyspark/pandas/tests/test_reshape.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/tests/test_indexing.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_stats.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/tests/test_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_expanding.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_series_conversion.py (diff)
The file was modified python/pyspark/pandas/plot/matplotlib.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe_conversion.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe_spark_io.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_plotly.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/tests/test_numpy_compat.py (diff)
Commit b9b5562f6b458c59a706fd6ab0d5b2813d0bdb96 by wenchen
[SPARK-37494][SQL] Unify v1 and v2 options output of `SHOW CREATE TABLE` command

### What changes were proposed in this pull request?
1. Change the v1 `SHOW CREATE TABLE` command so that its options output matches v2, e.g.
   `'key' = 'value'` (see the sketch after this list).
2. Sort the options output.
3. Sort the properties output.
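
For illustration, a sketch of the new format (the table and option names here are made up, and the printed output is approximate):
```py
# Hypothetical table; arbitrary option keys are used only to show the ordering.
spark.sql("CREATE TABLE t1 (id INT) USING parquet OPTIONS (option_b '2', option_a '1')")
print(spark.sql("SHOW CREATE TABLE t1").head()[0])
# The OPTIONS clause is now rendered sorted and quoted, roughly:
#   OPTIONS (
#   'option_a' = '1',
#   'option_b' = '2')
```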

### Why are the changes needed?
Match the v2 behavior, as discussed at [#comments](https://github.com/apache/spark/pull/34719#discussion_r758156350).

### Does this PR introduce _any_ user-facing change?
Yes. In the output of `SHOW CREATE TABLE`, properties and options are now sorted, and options are rendered as `'key' = 'value'`.

### How was this patch tested?
Add test case.

Closes #34753 from Peng-Lei/v1-show-create-table.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b9b5562)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/show-create-table.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ShowCreateTableSuite.scala (diff)
Commit 0b42cd44f2d6e7b55b7203f7bf354b8f9c9a36b6 by mridulatgmail.com
[SPARK-37461][YARN][FOLLOWUP] Refactor YARN Client code to avoid add unnecessary parameter of `appId`

### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/34710, we assign the ApplicationId to `appId` in client mode too. After this change we can refactor more code:

1. Add a `getApplicationId` method to obtain the `appId` from `Client`, and prevent it from being changed outside of `Client`.
2. `submitApplication()` no longer returns the `appId`; call `getApplicationId` instead.
3. Remove the `appId` argument from `monitorApplication()` and `getApplicationReport()`.

### Why are the changes needed?
Refactor code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34767 from AngersZhuuuu/SPARK-37461-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 0b42cd4)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (diff)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala (diff)
Commit f7dabd8e57fc4edf057b9219d6d9382bc6adf749 by wenchen
[SPARK-37326][SQL][FOLLOW-UP] Update code and tests for TimestampNTZ support in CSV data source

### What changes were proposed in this pull request?

This is a follow-up PR to https://github.com/apache/spark/pull/34596. There were a few comments and suggestions raised after the PR was merged, so I addressed them in this follow-up:
- Instead of using `failOnError`, which was confusing as no error is thrown in the method, we use `allowTimeZone`, which has the opposite meaning of `failOnError` and is far more descriptive.
- I updated a few test names to resolve ambiguity.
- I changed the tests to use `withTempPath` as was suggested in the original PR.

### Why are the changes needed?

Code cleanup and clarifications.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit and integration tests.

Closes #34777 from sadikovi/timestamp-ntz-csv-follow-up.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f7dabd8)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala (diff)
Commit b7a55435c9f69f3d1e7a4f1967a44a54c4d5d050 by gurwls223
[SPARK-36396][PYTHON][FOLLOWUP] Fix test with extensions dtype when pandas version < 1.2

### What changes were proposed in this pull request?
Fix the test of `pd.Dataframe.cov` with extension dtypes when the pandas version is < 1.2

### Why are the changes needed?
Pass the test of `ps.Dataframe.cov` with pandas versions < 1.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests and Manual test

Closes #34778 from dchvn/SPARK-36396-FU.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: b7a5543)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
Commit 893feb90a73c0c7827ebb6892980dc6227e64ae4 by gurwls223
[SPARK-37511][DOCS][FOLLOW-UP] Fix documentation build warning from TimedeltaIndex

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/34657 that extends the underline to match the title length.

### Why are the changes needed?

To fix the PySpark documentation build warning:

```
/.../spark/python/docs/source/reference/pyspark.pandas/indexing.rst:340: WARNING: Title underline too short.

TimedeltaIndex
-------------
```

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manual build of the PySpark documentation.

Closes #34775 from HyukjinKwon/SPARK-37511.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 893feb9)
The file was modified python/docs/source/reference/pyspark.pandas/indexing.rst (diff)
Commit c37b726bd09d34e1115a8af1969485e60dc02592 by wenchen
[SPARK-37442][SQL] InMemoryRelation statistics bug causing broadcast join failures with AQE enabled

### What changes were proposed in this pull request?
Immediately materialize underlying rdd cache (using .count) for an InMemoryRelation when `buildBuffers` is called.

### Why are the changes needed?

Currently, when `CachedRDDBuilder.buildBuffers` is called, `InMemoryRelation.computeStats` will try to read the accumulators to determine the relation's size. However, the accumulators are not accurate until the cached RDD has been executed and finished; before that, they can report anything from 0 bytes up to the final value. In AQE, join planning can happen during this window and, if it reads the size as 0 bytes, will likely plan a broadcast join, mistakenly believing the build side is very small. If the InMemoryRelation is actually very large, this causes many issues during execution, such as job failures from broadcasting more than 8GB.
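
A rough reproduction sketch of the failure mode (query shape and sizes are illustrative, not taken from the PR):
```py
# Illustrative only: with AQE on, join planning could previously read the cached
# relation's size as 0 bytes before the cache materialized and wrongly pick a
# broadcast join even when the cached side is huge.
spark.conf.set("spark.sql.adaptive.enabled", "true")
big = spark.range(0, 100_000_000).selectExpr("id", "repeat('x', 100) AS payload").cache()
small = spark.range(0, 10)
small.join(big, "id").collect()  # before the fix, `big` could end up on the broadcast side
```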

### Does this PR introduce _any_ user-facing change?

Yes. Before, cache materialization did not happen until the job started to run. Now, it happens when the RDD representing an InMemoryRelation is requested.

### How was this patch tested?

Tests added

Closes #34684 from ChenMichael/SPARK-37442-InMemoryRelation-statistics-inaccurate-during-join-planning.

Authored-by: Michael Chen <mike.chen@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c37b726)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit c758b44d2d1ef9dc87378ff866b7d3a93c552683 by viirya
[SPARK-37450][SQL] Prune unnecessary fields from Generate

### What changes were proposed in this pull request?

This patch proposes an optimization rule to prune unnecessary fields from `Generate` for some cases, e.g. under count-only `Aggregate`.

### Why are the changes needed?

As shown in the JIRA, if a query counts nested elements (structs) of an array by exploding the array in `Generate`, and no particular nested field is specified, Spark currently reads the full nested struct without any pruning, e.g.,

```
== Optimized Logical Plan ==
Aggregate [count(1) AS count(true)#20299L]
+- Project
   +- Generate explode(items#20293), [0], false, [item#20296]
      +- Filter ((size(items#20293, true) > 0) AND isnotnull(items#20293))
         +- Relation default.table[items#20293] parquet
```

An optimization can be made to pick an arbitrary nested field from the struct, so we can prune unnecessary field accesses and still count the number of array elements.
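
For reference, a minimal query shape that hits this pattern (the path and schema are assumed for illustration):
```py
# Assumed layout: /tmp/table has a column items: array<struct<a: int, b: string, ...>>.
# A count-only aggregate over the exploded array needs no particular nested field,
# so the rule can prune the struct down to a single arbitrary field.
df = spark.read.parquet("/tmp/table")
df.selectExpr("explode(items) AS item").count()
```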

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added test.

Closes #34701 from viirya/SPARK-37450.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: c758b44)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/GenerateOptimizationSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala (diff)
Commit eec9fecf5e3c1d0631caed427d4d468f44f98de9 by dongjoon
[SPARK-37522][PYTHON][TESTS] Fix MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction

### What changes were proposed in this pull request?

This PR aims to update a PySpark unit test case by increasing the tolerance by `10%` from `0.1` to `0.11`.

### Why are the changes needed?

```
$ java -version
openjdk version "17.0.1" 2021-10-19 LTS
OpenJDK Runtime Environment Zulu17.30+15-CA (build 17.0.1+12-LTS)
OpenJDK 64-Bit Server VM Zulu17.30+15-CA (build 17.0.1+12-LTS, mixed mode, sharing)

$ build/sbt test:package

$ python/run-tests --testname 'pyspark.ml.tests.test_algorithms MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction' --python-executables=python3
...
======================================================================
FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/ml/tests/test_algorithms.py", line 104, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, rtol=0.102))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 7.385s

FAILED (failures=1)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually on native AppleSilicon Java 17.

Closes #34784 from dongjoon-hyun/SPARK-37522.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: eec9fec)
The file was modified python/pyspark/ml/tests/test_algorithms.py (diff)
Commit 8952fbc0359ee163067b98f5e8fbbfc522c6871d by gurwls223
[SPARK-37504][PYTHON] Pyspark create SparkSession with existed session should not pass static conf

### What changes were proposed in this pull request?
In current PySpark, we have the code below:
```
for key, value in self._options.items():
      session._jsparkSession.sessionState().conf().setConfString(key, value)
return session
```

This passes all options to the created/existing SparkSession, whereas the Scala code path only passes non-static SQL confs:
```
    private def applyModifiableSettings(session: SparkSession): Unit = {
      val (staticConfs, otherConfs) =
        options.partition(kv => SQLConf.isStaticConfigKey(kv._1))

      otherConfs.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }

      if (staticConfs.nonEmpty) {
        logWarning("Using an existing SparkSession; the static sql configurations will not take" +
          " effect.")
      }
      if (otherConfs.nonEmpty) {
        logWarning("Using an existing SparkSession; some spark core configurations may not take" +
          " effect.")
      }
    }
```

In this PR, we make the behavior consistent (see the sketch below).
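
A minimal sketch of the resulting behavior (the config keys are chosen only for illustration):
```py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With an existing session, static SQL confs are now ignored (with a warning),
# while runtime SQL confs are still applied -- matching the Scala behavior.
spark2 = (SparkSession.builder
          .config("spark.sql.warehouse.dir", "/tmp/other-warehouse")  # static conf: not applied
          .config("spark.sql.shuffle.partitions", "10")               # runtime conf: applied
          .getOrCreate())
assert spark is spark2
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '10'
```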

### Why are the changes needed?
Keep the behavior consistent between PySpark and the Scala code path: when initializing a SparkSession and an existing session is reused, only non-static SQL confs are overwritten.

### Does this PR introduce _any_ user-facing change?
Users can no longer overwrite static SQL confs when PySpark reuses an existing SparkSession.

### How was this patch tested?
Modified UT

Closes #34757 from AngersZhuuuu/SPARK-37504.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 8952fbc)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified python/pyspark/sql/session.py (diff)
The file was modified python/pyspark/sql/tests/test_session.py (diff)
Commit 2479796d7ba333ae93b697759d535344959ff2b6 by gurwls223
[SPARK-37520][SQL] Add the `startswith()` and `endswith()` string functions

### What changes were proposed in this pull request?
In the PR, I propose to expose the existing string expressions `StartsWith` and `EndsWith` as functions in Spark SQL.
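
A small usage sketch (assuming the new functions behave like the underlying `StartsWith`/`EndsWith` expressions):
```py
# Both values evaluate to True; NULL inputs yield NULL.
row = spark.sql(
    "SELECT startswith('Spark SQL', 'Spark') AS s, endswith('Spark SQL', 'SQL') AS e"
).head()
print(row.s, row.e)  # True True
```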

### Why are the changes needed?
To make migration from other systems to Spark SQL easier. The functions are popular and broadly used in other systems, for example:

Snowflake:
https://docs.snowflake.com/en/sql-reference/functions/startswith.html
https://docs.snowflake.com/en/sql-reference/functions/endswith.html

Cosmos DB:
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-startswith
https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-endswith

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new tests:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z string-functions.sql"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```

Closes #34782 from MaxGekk/begin-end-with-func.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 2479796)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/string-functions.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/string-functions.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out (diff)
Commit 15d112262064f98312ddfd8421f01d7a5c8b783c by gurwls223
[SPARK-37512][PYTHON] Support TimedeltaIndex creation (from Series/Index) and TimedeltaIndex.astype

### What changes were proposed in this pull request?
Support TimedeltaIndex creation given a timedelta Series/Index.

`astype` is also supported naturally in this PR.

### Why are the changes needed?
Follow pandas' behavior.

### Does this PR introduce _any_ user-facing change?
Yes.

#### TimedeltaIndex creation (from Series/Index)
```py
>>> idx = ps.TimedeltaIndex([timedelta(1), timedelta(microseconds=2)])
>>> s = ps.Series([timedelta(1), timedelta(microseconds=2)], index=[10, 20])

## FROM ##
>>> ps.TimedeltaIndex(idx)
Traceback (most recent call last):
...
NotImplementedError: Create a TimedeltaIndex from Index/Series is not supported

>>> ps.TimedeltaIndex(s)
Traceback (most recent call last):
...
NotImplementedError: Create a TimedeltaIndex from Index/Series is not supported

## TO ##
>>> ps.TimedeltaIndex(idx)
TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.000002'], dtype='timedelta64[ns]', freq=None)

>>> ps.TimedeltaIndex(s)
TimedeltaIndex(['1 days 00:00:00', '0 days 00:00:00.000002'], dtype='timedelta64[ns]', freq=None)

```
#### TimedeltaIndex.astype
```py
>>> psidx = ps.TimedeltaIndex([timedelta(1)])
>>> psidx.astype(str)
Index(["INTERVAL '1 00:00:00' DAY TO SECOND"], dtype='object')
>>> psidx.astype(int)
Int64Index([86400], dtype='int64')
>>> psidx.astype('category')
CategoricalIndex(['1 days'], categories=[1 days 00:00:00], ordered=False, dtype='category')
```

### How was this patch tested?
Unit tests.

Closes #34776 from xinrong-databricks/timedeltaCreate.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 15d1122)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_base.py (diff)
The file was modified python/pyspark/pandas/indexes/timedelta.py (diff)
The file was added python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py
The file was modified python/pyspark/pandas/data_type_ops/timedelta_ops.py (diff)
Commit 2433c942ca39b948efe804aeab0185a3f37f3eea by gurwls223
[SPARK-37524][SQL] We should drop all tables after testing dynamic partition pruning

### What changes were proposed in this pull request?

Drop all tables after testing dynamic partition pruning.

### Why are the changes needed?
We should drop all tables after testing dynamic partition pruning.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests

Closes #34768 from weixiuli/SPARK-11150-fix.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 2433c94)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
Commit f99e2e61b2d2b067f0ee9ce2e6886f1218ccda0e by gurwls223
[SPARK-37526][INFRA][PYTHON][TESTS] Add Java17 PySpark daily test coverage

### What changes were proposed in this pull request?

This PR aims to add Java 17 PySpark daily test coverage.

### Why are the changes needed?

To support Java 17 at Apache Spark 3.3.

After SPARK-37522, I verified the following with Python 3.9.7 on Linux (though not with every Python library).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Pass the CIs to verify this doesn't break anything.
- After a manual review, this should be verified after merging.

Closes #34788 from dongjoon-hyun/SPARK-37526.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: f99e2e6)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit eba4f5c6b605829565f213fdcf8444d16d672504 by dongjoon
[SPARK-37531][INFRA][PYTHON][TESTS] Use PyArrow 6.0.0 in Python 3.9 tests at GitHub Action job

### What changes were proposed in this pull request?

This PR aims to use `PyArrow 6.0.0` in `Python 3.9` unit tests at GitHub Action jobs.

Although the main change is removing the `<5.0.0` limitation, there are other minor changes because the image was built more recently, too.
- https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/4f7408f4a95ef9784fdaf490be56bcfd7ff309bb
```
- RUN python3.9 -m pip install numpy 'pyarrow<5.0.0' pandas scipy xmlrunner plotly>=4.8 sklearn 'mlflow>=1.0'
+ RUN python3.9 -m pip install numpy pyarrow pandas scipy xmlrunner plotly>=4.8 sklearn 'mlflow>=1.0'
```

```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20211116 pip3.9 list > 20211116
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210930 pip3.9 list > 20210930
$ diff 20210930 20211116
# The following is manually formatted for simplicity.
...
Jinja2                    3.0.1         3.0.3
mlflow                    1.20.2        1.21.0
numpy                     1.21.2        1.21.4
pandas                    1.3.3         1.3.4
plotly                    5.3.1         5.4.0
pyarrow                   4.0.1         6.0.0
scikit-learn              1.0           1.0.1
scipy                     1.7.1         1.7.2
```

### Why are the changes needed?

SPARK-37342 upgraded Apache Arrow to 6.0.0 in Java/Scala.
This is the corresponding upgrade for PySpark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #34793 from dongjoon-hyun/SPARK-37531.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: eba4f5c)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit ae9aebab940b0e5683c4b7a14302d3aedb149275 by dongjoon
[SPARK-37534][BUILD] Bump dev.ludovic.netlib to 2.2.1

### What changes were proposed in this pull request?

Bump the version of dev.ludovic.netlib from 2.2.0 to 2.2.1. This fixes a computation bug in sgemm; see [1]. The diff is [2].

[1] https://github.com/luhenry/netlib/issues/7
[2] https://github.com/luhenry/netlib/compare/v2.2.0...v2.2.1

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #34783 from luhenry/patch-1.

Authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: ae9aeba)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
Commit e0d41e887ea18ff3b82f0451db89075777c510d1 by yao
[SPARK-37530][CORE] Spark reads many paths very slow though newAPIHadoopFile

### What changes were proposed in this pull request?

As with https://github.com/apache/spark/pull/18441, we parallelize FileInputFormat.listStatus for newAPIHadoopFile.

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when accessing external storage from the driver side; parallelizing the listing improves performance.
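
For reference, the affected call path (the paths and input format below are placeholders):
```py
# Placeholder paths and classes; the point is that listing the comma-separated
# paths on the driver is now parallelized, as it already was for hadoopFile.
paths = ",".join("/data/events/day=%02d" % d for d in range(1, 31))
rdd = sc.newAPIHadoopFile(
    paths,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
)
```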

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

passing GA

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: e0d41e8)
The file was modified core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala (diff)
Commit b1b05fb366da4785973c124215f49c14f9bd7b08 by gengliang
[SPARK-35162][TESTS][FOLLOWUP] Test try_arithmetic.sql under ANSI mode as well

### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/32292, `try_arithmetic.sql` is only tested with ANSI mode off. We should test it under ANSI mode as well.
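
For context, a sketch of what `try_arithmetic.sql` exercises (values chosen only for illustration): the `try_*` functions should keep returning NULL instead of raising errors even with ANSI mode on.
```py
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql("SELECT try_add(2147483647, 1) AS a, try_divide(1, 0) AS d").show()
# Both columns are NULL instead of raising an ANSI arithmetic error.
```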

### Why are the changes needed?

Improve test coverage

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #34795 from gengliangwang/try_arithmetic.sql.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: b1b05fb)
The file was added sql/core/src/test/resources/sql-tests/results/ansi/try_arithmetic.sql.out
The file was added sql/core/src/test/resources/sql-tests/inputs/ansi/try_arithmetic.sql
Commit 688fa239265b981dc3acc69b79443905afe6a8cf by wenchen
[SPARK-36902][SQL] Migrate CreateTableAsSelectStatement to v2 command

### What changes were proposed in this pull request?
Migrate CreateTableAsSelectStatement to v2 command

### Why are the changes needed?
Migrate CreateTableAsSelectStatement to v2 command

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #34667 from dchvn/migrate-CTAS.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 688fa23)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/CreateTablePartitioningValidationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
Commit 16f6295a3fc7fd1fcc77c6084a60d00fd79d202b by wenchen
[SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect

### What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`, but that is not a reasonable home for it: the JDBC source knows best how to compile aggregate expressions into its own dialect.

### Why are the changes needed?
The JDBC source knows how to compile aggregate expressions into its own dialect.
After this PR, we can extend pushdown (e.g. of aggregates) per dialect across different JDBC databases.

There are two situations:
First, databases A and B may implement a different number of the aggregate functions defined in the SQL standard.
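
For context, a hedged user-level sketch of where aggregate pushdown is used (connection details are placeholders; this PR only moves where the SQL is compiled, not the behavior):
```py
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://example:5432/db")  # placeholder connection
      .option("dbtable", "sales")
      .option("pushDownAggregate", "true")                 # opt in to MAX/MIN/SUM/COUNT/AVG pushdown
      .load())
df.groupBy("region").agg({"amount": "sum"}).explain()      # pushed aggregates may appear in the scan
```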

### Does this PR introduce _any_ user-facing change?
'No'. Just change the inner implementation.

### How was this patch tested?
Jenkins tests.

Closes #34554 from beliefer/SPARK-37286.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 16f6295)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala (diff)
Commit 544865db77d942fbbeabde96e644c98a892d5045 by wenchen
[SPARK-37455][SQL] Replace hash with sort aggregate if child is already sorted

### What changes were proposed in this pull request?

In the query plan, if the child of a hash aggregate is already sorted on the group-by columns, we can replace the hash aggregate with a sort aggregate for better performance, as the sort aggregate does not incur the hashing overhead of the hash aggregate. This PR adds a physical plan rule `ReplaceHashWithSortAgg`, which can be disabled with the config `spark.sql.execution.replaceHashWithSortAgg`.
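
A sketch of a plan shape the rule targets (the config name is from this PR; the query itself is illustrative):
```py
spark.conf.set("spark.sql.execution.replaceHashWithSortAgg", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force a sort-merge join

left = spark.range(0, 1000000).withColumnRenamed("id", "key")
right = spark.range(0, 1000000).withColumnRenamed("id", "key")

# After the sort-merge join, the aggregate's child is already sorted on `key`,
# so the new rule may swap HashAggregate for SortAggregate.
left.join(right, "key").groupBy("key").count().explain()
```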

In addition, to help review (as this PR changes several TPCDS plan files), the files below contain the real code changes:

* `SQLConf.scala`
* `QueryExecution.scala`
* `ReplaceHashWithSortAgg.scala`
* `AdaptiveSparkPlanExec.scala`
* `HashAggregateExec.scala`
* `ReplaceHashWithSortAggSuite.scala`
* `SQLMetricsSuite.scala`

### Why are the changes needed?

To get better query performance by leveraging sort ordering in query plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `ReplaceHashWithSortAggSuite.scala`.

Closes #34702 from c21/agg-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 544865d)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/ReplaceHashWithSortAgg.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/ReplaceHashWithSortAggSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit 6e1912590dcdad56d82b4fe1a5ae0b62560a1a08 by wenchen
[SPARK-37471][SQL] spark-sql  support `;` in nested bracketed comment

### What changes were proposed in this pull request?
Currently, when spark-sql is used with -e or -f, it cannot handle nested bracketed comments such as:
```
/* SELECT /*+ BROADCAST(b) */ 4;
*/
SELECT  1
;
```
Running `spark-sql -f` with `--verbose` gives the error below:
```
Spark master: yarn, Application Id: application_1632999510150_6968442
/* sielect /* BROADCAST(b) */ 4
Error in query:
mismatched input '4' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 30)

== SQL ==
/* sielect /* BROADCAST(b) */ 4
------------------------------^^^
```

In the current code:
```
else if (line.charAt(index) == '/' && !insideSimpleComment) {
        val hasNext = index + 1 < line.length
        if (insideSingleQuote || insideDoubleQuote) {
          // Ignores '/' in any case of quotes
        } else if (insideBracketedComment && line.charAt(index - 1) == '*' ) {
          // Decrements `bracketedCommentLevel` at the beginning of the next loop
          leavingBracketedComment = true
        } else if (hasNext && !insideBracketedComment &&  line.charAt(index + 1) == '*') {
          bracketedCommentLevel += 1
        }
      }
```

If it meets a `*/`, it marks `leavingBracketedComment` as true; then, when the next char is processed, the bracketed comment level is decremented:
```
      if (leavingBracketedComment) {
        bracketedCommentLevel -= 1
        leavingBracketedComment = false
      }

```

But when it meets a `/*`, it requires `!insideBracketedComment`, which means that if we have a case like
```
/* aaa /* bbb */  ; ccc */ select 1;
```

when the second `/*` is met, `insideBracketedComment` is already true, so this `/*` is not treated as the start of a bracketed comment.
Then, when the first `*/` is met, the bracketed comment ends and the query is split as
```
/* aaa /* bbb */;    =>  comment
ccc */ select 1;   => query
```

Then the query fails.

So here we remove the `!insideBracketedComment` condition; then `bracketedCommentLevel` can exceed 1, and since
```
def insideBracketedComment: Boolean = bracketedCommentLevel > 0
```
So characters inside any level of brackets are treated as comments.

### Why are the changes needed?
In Spark #37389 we added support for nested bracketed comments in SQL; spark-sql should support them here too.

### Does this PR introduce _any_ user-facing change?
Users can use nested bracketed comments in spark-sql.

### How was this patch tested?

Since the spark-sql console mode has special logic for handling `;`:
```
    while (line != null) {
      if (!line.startsWith("--")) {
        if (prefix.nonEmpty) {
          prefix += '\n'
        }

        if (line.trim().endsWith(";") && !line.trim().endsWith("\\;")) {
          line = prefix + line
          ret = cli.processLine(line, true)
          prefix = ""
          currentPrompt = promptWithCurrentDB
        } else {
          prefix = prefix + line
          currentPrompt = continuedPromptWithDBSpaces
        }
      }
      line = reader.readLine(currentPrompt + "> ")
    }
```

If we write SQL as below:
```
/* SELECT /*+ BROADCAST(b) */ 4\\;
*/
SELECT  1
;
```
the `\\;` is escaped.

Manual test with spark-sql -f:
```
(spark.submit.pyFiles,)
(spark.submit.deployMode,client)
(spark.master,local[*])
Classpath elements:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/26 16:32:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/11/26 16:32:10 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/11/26 16:32:10 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/11/26 16:32:13 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/11/26 16:32:13 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore yi.zhu10.12.189.175
Spark master: local[*], Application Id: local-1637915529831
/* select /* BROADCAST(b) */ 4;
*/
select  1

1
Time taken: 3.851 seconds, Fetched 1 row(s)
C02D45VVMD6T:spark yi.zhu$
```

With the current PR, an uncompleted bracketed comment is no longer executed. For the SQL file
```
/* select /* BROADCAST(b) */ 4;
*/
select  1
;

/* select /* braoad */ ;
select 1;
```

It only executes
```
/* select /* BROADCAST(b) */ 4;
*/
select  1
;
```

The next part
```
/* select /* braoad */ ;
select 1;
```
is still treated as in-progress SQL.

Closes #34721 from AngersZhuuuu/SPARK-37471.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 6e19125)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala (diff)
Commit f570d01c0d009bb035d3c89d77661a5432f982cb by srowen
[SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc

### What changes were proposed in this pull request?

This PR fixes an issue where Javadoc generation fails due to a wrong parameter name in a method added in SPARK-37286 (#34554).
https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #34801 from sarutak/followup-SPARK-37286.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: f570d01)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala (diff)
Commit b2a4e8f3bab8219aa92f7392f2e8e10423d662bf by gurwls223
[SPARK-37495][PYTHON] Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

### What changes were proposed in this pull request?
Skip identical index checking of Series.compare when config 'compute.eager_check' is disabled

### Why are the changes needed?
Identical index checking is expensive, so we should use the config 'compute.eager_check' to allow skipping it

### Does this PR introduce _any_ user-facing change?
Yes

Before this PR
```python
>>> psser1 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 3, 4, 5]))
>>> psser2 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 4, 3, 6]))
>>> psser1.compare(psser2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u02/spark/python/pyspark/pandas/series.py", line 5851, in compare
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
```
After this PR, when the config 'compute.eager_check' is False, pandas-on-Spark just proceeds and performs the comparison, ignoring the identical index check.
```python
>>> with ps.option_context("compute.eager_check", False):
...     psser1.compare(psser2)
...
   self  other
3   3.0    4.0
4   4.0    3.0
5   5.0    NaN
6   NaN    5.0
```
### How was this patch tested?
Unit tests

Closes #34750 from dchvn/SPARK-37495.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: b2a4e8f)
The file was modified python/pyspark/pandas/config.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was modified python/docs/source/user_guide/pandas_on_spark/options.rst (diff)
The file was modified python/pyspark/pandas/series.py (diff)
Commit b5fc6dade261fe9917ff2c835911bc54da121bbd by gurwls223
[SPARK-37508][SQL][DOCS][FOLLOW-UP] Update expression desc of `CONTAINS()` string function

### What changes were proposed in this pull request?
Update the usage doc of the `CONTAINS()` string function to:
```
_FUNC_(left, right) - Returns a boolean. The value is True if right is found inside left.
    Returns NULL if either input expression is NULL. Otherwise, returns False.
```
This clarifies that when the left and right expressions are both non-`NULL` and right is not found inside left, the function returns `False`.
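
A small sketch of the documented behavior:
```py
>>> spark.sql(
...     "SELECT contains('Spark SQL', 'Spark'), contains('Spark SQL', NULL), contains('Spark SQL', 'ANSI')"
... ).show()
# Returns true, NULL, and false respectively, matching the updated description.
```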

### Why are the changes needed?
Make function description more clear

### Does this PR introduce _any_ user-facing change?
The usage section of `DESCRIBE FUNCTION EXTENDED contains` now returns:

```
contains(left, right) - Returns a boolean. The value is True if right is found inside left.
    Returns NULL if either input expression is NULL. Otherwise, returns False.
```

### How was this patch tested?
Verified after building the documentation and via SQL commands such as `DESCRIBE FUNCTION EXTENDED contains`.

Closes #34786 from AngersZhuuuu/SPARK-37508-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: b5fc6da)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
Commit 1f3eb737f15cad101f961e0f153905c76a5dd12d by gurwls223
[SPARK-37510][PYTHON] Support basic operations of timedelta Series/Index

### What changes were proposed in this pull request?
Support basic operations of timedelta Series/Index

### Why are the changes needed?
To be consistent with pandas

### Does this PR introduce _any_ user-facing change?
Yes.
```py
>>> psdf = ps.DataFrame(
... {'this': [timedelta(1), timedelta(microseconds=2), timedelta(weeks=3)],
...  'that': [timedelta(0), timedelta(microseconds=1), timedelta(seconds=2)]}
... )
>>> psdf
                    this                   that
0        1 days 00:00:00        0 days 00:00:00
1 0 days 00:00:00.000002 0 days 00:00:00.000001
2       21 days 00:00:00        0 days 00:00:02

# __sub__
>>> psdf.this - psdf.that
0          1 days 00:00:00
1   0 days 00:00:00.000001
2         20 days 23:59:58
dtype: timedelta64[ns]
>>> psdf.this - timedelta(1)
0            0 days 00:00:00
1   -1 days +00:00:00.000002
2           20 days 00:00:00
Name: this, dtype: timedelta64[ns]

#  __rsub__
>>> timedelta(1) - psdf.this
0          0 days 00:00:00
1   0 days 23:59:59.999998
2       -20 days +00:00:00
Name: this, dtype: timedelta64[ns]
```

### How was this patch tested?
Unit tests.

Closes #34787 from xinrong-databricks/timedeltaBasicOps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 1f3eb73)
The file was modified python/pyspark/pandas/data_type_ops/timedelta_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/testing_utils.py (diff)
Commit c411d2681b2e16e28346e1c55df8d71013d746cb by huaxin_gao
[SPARK-37330][SQL] Migrate ReplaceTableStatement to v2 command

### What changes were proposed in this pull request?
This PR migrates ReplaceTableStatement to the v2 command

### Why are the changes needed?
Migrate to the standard V2 framework

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing tests

Closes #34764 from dchvn/migrate-replacetable.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
(commit: c411d26)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ReplaceTableExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)