Changes

Summary

  1. [SPARK-34951][INFRA][PYTHON][TESTS] Set the system encoding as UTF-8 to recover the Sphinx build in GitHub Actions (commit: 82ad2f9) (details)
  2. [SPARK-34479][FOLLOWUP][DOC] Add zstandard codec to AvroOptions.compression comment (commit: ba32b20) (details)
  3. [SPARK-34954][SQL] Use zstd codec name in ORC file names (commit: 748f05f) (details)
  4. [SPARK-34492][DOCS] Add "CSV Files" page for Data Source documents (commit: a72f0d7) (details)
  5. [SPARK-34934] Fix race condition while adding/removing sources in MetricsSystem (commit: aff6c0f) (details)
  6. [SPARK-34959][BUILD] Upgrade SBT to 1.5.0 (commit: ac4334b) (details)
  7. [SPARK-34949][CORE] Prevent BlockManager reregister when Executor is shutting down (commit: a9ca197) (details)
  8. [SPARK-34932][SQL] deprecate GROUP BY ... GROUPING SETS (...) and promote GROUP BY GROUPING SETS (...) (commit: 39d5677) (details)
  9. [SPARK-34935][SQL] CREATE TABLE LIKE should respect the reserved table (commit: 7cfface) (details)
  10. [SPARK-34890][PYTHON] Port/integrate Koalas main codes into PySpark (commit: caf04f9) (details)
  11. [SPARK-34923][SQL] Metadata output should be empty for more plans (commit: 3b634f6) (details)
  12. Revert "[SPARK-34884][SQL] Improve DPP evaluation to make filtering side must can broadcast by size or broadcast by hint" (commit: 19c7d2f) (details)
  13. [SPARK-34667][SQL] Support casting of year-month intervals to strings (commit: 4b5fc1d) (details)
Commit 82ad2f9dff9b8d9cb8b6b166f14311f6066681c2 by gurwls223
[SPARK-34951][INFRA][PYTHON][TESTS] Set the system encoding as UTF-8 to recover the Sphinx build in GitHub Actions

### What changes were proposed in this pull request?

This PR proposes to set the system encoding to UTF-8. For some reason, GitHub Actions machines appear to have changed their default encoding to ASCII. This makes Python's default encoding/decoding use ASCII, e.g. `"a".encode()`, and Sphinx appears to depend on that.

### Why are the changes needed?

To recover the GitHub Actions build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Tested in https://github.com/apache/spark/pull/32046

Closes #32047 from HyukjinKwon/SPARK-34951.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 82ad2f9)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit ba32b200e48cd3c979d7010cd492c606f904f707 by yumwang
[SPARK-34479][FOLLOWUP][DOC] Add zstandard codec to AvroOptions.compression comment

### What changes were proposed in this pull request?

This PR aims to add the zstandard codec to the `AvroOptions.compression` comment.
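
For reference, a hedged sketch of the write path this comment documents, assuming the `spark-avro` package is on the classpath, that `zstandard` is the codec name added by SPARK-34479, and that the output path is arbitrary:

```scala
// Sketch only: write Avro output with the zstandard codec mentioned in the
// AvroOptions.compression comment (the codec name and path are assumptions).
spark.range(10).write
  .format("avro")
  .option("compression", "zstandard")
  .save("/tmp/avro_zstd")
```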

### Why are the changes needed?

SPARK-34479 added zstandard codec.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

Closes #32050 from williamhyun/avro.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: ba32b20)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala (diff)
Commit 748f05fca939d460976e46a4ed74c37034d934b6 by dhyun
[SPARK-34954][SQL] Use zstd codec name in ORC file names

### What changes were proposed in this pull request?

This PR aims to add `zstd` codec names in the Spark generated ORC file names for consistency.

### Why are the changes needed?

Like the other supported ORC codecs, we should have `zstd` in the Spark-generated ORC file names. Please note that there is currently no problem reading or writing ORC zstd files; this PR only aims to revise the file name format for consistency.

**SNAPPY**
```
scala> spark.range(10).repartition(1).write.option("compression", "snappy").orc("/tmp/snappy")

$ ls -al /tmp/snappy
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:17 .
drwxrwxrwt  14 root      wheel  448 Apr  4 12:17 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:17 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:17 .part-00000-833bb7ad-d1e1-48cc-9719-07b2d594aa4c-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:17 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:17 part-00000-833bb7ad-d1e1-48cc-9719-07b2d594aa4c-c000.snappy.orc
```

**ZSTD (AS-IS)**
```
scala> spark.range(10).repartition(1).write.option("compression", "zstd").orc("/tmp/zstd")

$ ls -al /tmp/zstd
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:17 .
drwxrwxrwt  14 root      wheel  448 Apr  4 12:17 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:17 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:17 .part-00000-2f403ce9-7314-4db5-bca3-b1c1dd83335f-c000.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:17 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:17 part-00000-2f403ce9-7314-4db5-bca3-b1c1dd83335f-c000.orc
```

**ZSTD (After this PR)**
```
scala> spark.range(10).repartition(1).write.option("compression", "zstd").orc("/tmp/zstd_new")

$ ls -al /tmp/zstd_new
total 24
drwxr-xr-x   6 dongjoon  wheel  192 Apr  4 12:28 .
drwxrwxrwt  15 root      wheel  480 Apr  4 12:28 ..
-rw-r--r--   1 dongjoon  wheel    8 Apr  4 12:28 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Apr  4 12:28 .part-00000-49d57329-7196-4caf-839c-4251c876e26b-c000.zstd.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Apr  4 12:28 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  231 Apr  4 12:28 part-00000-49d57329-7196-4caf-839c-4251c876e26b-c000.zstd.orc
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the updated UT.

Closes #32051 from dongjoon-hyun/SPARK-34954.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 748f05f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
Commit a72f0d7c90583b05648872eccf8ee1a72f780609 by gurwls223
[SPARK-34492][DOCS] Add "CSV Files" page for Data Source documents

### What changes were proposed in this pull request?

Fix [SPARK-34492]: add Scala examples to read/write CSV files.
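
A minimal Scala sketch of the kind of example the new page covers (the input path points at the existing `people.csv` sample under the Spark examples; the output path is arbitrary):

```scala
// Sketch only: basic CSV read/write using the options documented on the new page.
val df = spark.read
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // infer column types from the data
  .csv("examples/src/main/resources/people.csv")

df.write
  .option("header", "true")
  .csv("/tmp/people_csv_out")     // arbitrary output directory
```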

### Why are the changes needed?

Fix [SPARK-34492].

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Built the documentation with `SKIP_API=1 bundle exec jekyll build`; everything looks fine.

Closes #31827 from twoentartian/master.

Authored-by: twoentartian <twoentartian@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a72f0d7)
The file was added docs/sql-data-sources-csv.md
The file was modified examples/src/main/python/sql/datasource.py (diff)
The file was modified docs/_data/menu-sql.yaml (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala (diff)
The file was modified docs/sql-data-sources.md (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java (diff)
Commit aff6c0febb40d9713895ba00d8c77ba00f04bd16 by srowen
[SPARK-34934] Fix race condition while adding/removing sources in MetricsSystem

### What changes were proposed in this pull request?

Synchronise access to the `registerSource` and `removeSource` methods since the underlying `ArrayBuffer` is not thread safe.
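
A simplified, self-contained sketch of that pattern (the `Source` trait and `SimpleRegistry` class below are placeholders for illustration, not Spark's actual `MetricsSystem` code):

```scala
import scala.collection.mutable.ArrayBuffer

// Placeholder trait standing in for a metrics source.
trait Source { def sourceName: String }

// Guard every mutation and read of the shared ArrayBuffer with `synchronized`
// so that concurrent register/remove calls cannot corrupt it.
class SimpleRegistry {
  private val sources = new ArrayBuffer[Source]

  def registerSource(source: Source): Unit = synchronized {
    sources += source
  }

  def removeSource(source: Source): Unit = synchronized {
    sources -= source
  }

  def snapshot: Seq[Source] = synchronized { sources.toSeq }
}
```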

### Why are the changes needed?

Unexpected behaviour is possible due to the lack of thread safety; for example, we got an `ArrayIndexOutOfBoundsException` while adding a new source.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Closes #32024 from BOOTMGR/SPARK-34934.

Lead-authored-by: Harsh Panchal <BOOTMGR@users.noreply.github.com>
Co-authored-by: BOOTMGR <panchal.harsh18@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: aff6c0f)
The file was modified core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala (diff)
Commit ac4334bcca18c98b0a1df18c9e0e8f3b6ed8b420 by dhyun
[SPARK-34959][BUILD] Upgrade SBT to 1.5.0

### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.5.0.

### Why are the changes needed?

SBT 1.5.0 was released yesterday with built-in Scala 3 support.
- https://github.com/sbt/sbt/releases/tag/v1.5.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the SBT CIs (Build/Test/Docs/Plugins).

Closes #32055 from dongjoon-hyun/SPARK-34959.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: ac4334b)
The file was modified project/build.properties (diff)
Commit a9ca1978ae8ecc53e2ef9e14b4be70dc8f5d9341 by mridulatgmail.com
[SPARK-34949][CORE] Prevent BlockManager reregister when Executor is shutting down

### What changes were proposed in this pull request?

This PR prevents re-registering the BlockManager when an Executor is shutting down. This is achieved by checking `executorShutdown` before calling `env.blockManager.reregister()`.
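
A simplified sketch of that guard (the `HeartbeatReporter` class and `markShutdown` method below are illustrative, not the actual `Executor` code):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Skip BlockManager re-registration once shutdown has started.
class HeartbeatReporter(reregister: () => Unit) {
  private val executorShutdown = new AtomicBoolean(false)

  def markShutdown(): Unit = executorShutdown.set(true)

  def onReregisterRequest(): Unit = {
    if (!executorShutdown.get()) {
      reregister() // in Spark this would be env.blockManager.reregister()
    }
    // else: the executor is shutting down, so ignore the re-register request
  }
}
```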

### Why are the changes needed?

This change is required since Spark reports executors as active even after they have been removed.
I was testing Dynamic Allocation on K8s with about 300 executors. When the executors were torn down due to `spark.dynamicAllocation.executorIdleTimeout`, I noticed all the executor pods being removed from K8s; however, under the "Executors" tab in the Spark UI, some executors were still listed as alive. [spark.sparkContext.statusTracker.getExecutorInfos.length](https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L105) also returned a value greater than 1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test.

## Logs
Following are the logs of the executor (id: 303) which re-registers the `BlockManager`:
```
21/04/02 21:33:28 INFO CoarseGrainedExecutorBackend: Got assigned task 1076
21/04/02 21:33:28 INFO Executor: Running task 4.0 in stage 3.0 (TID 1076)
21/04/02 21:33:28 INFO MapOutputTrackerWorker: Updating epoch to 302 and clearing cache
21/04/02 21:33:28 INFO TorrentBroadcast: Started reading broadcast variable 3
21/04/02 21:33:28 INFO TransportClientFactory: Successfully created connection to /100.100.195.227:33703 after 76 ms (62 ms spent in bootstraps)
21/04/02 21:33:28 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.4 KB, free 168.0 MB)
21/04/02 21:33:28 INFO TorrentBroadcast: Reading broadcast variable 3 took 168 ms
21/04/02 21:33:28 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 168.0 MB)
21/04/02 21:33:29 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
21/04/02 21:33:29 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTrackerda-lite-test-4-7a57e478947d206d-driver-svc.dex-app-n5ttnbmg.svc:7078)
21/04/02 21:33:29 INFO MapOutputTrackerWorker: Got the output locations
21/04/02 21:33:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local blocks and 1 remote blocks
21/04/02 21:33:30 INFO TransportClientFactory: Successfully created connection to /100.100.80.103:40971 after 660 ms (528 ms spent in bootstraps)
21/04/02 21:33:30 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 1042 ms
21/04/02 21:33:31 INFO Executor: Finished task 4.0 in stage 3.0 (TID 1076). 1276 bytes result sent to driver
.
.
.
21/04/02 21:34:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
21/04/02 21:34:16 INFO Executor: Told to re-register on heartbeat
21/04/02 21:34:16 INFO BlockManager: BlockManager BlockManagerId(303, 100.100.122.34, 41265, None) re-registering with master
21/04/02 21:34:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(303, 100.100.122.34, 41265, None)
21/04/02 21:34:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(303, 100.100.122.34, 41265, None)
21/04/02 21:34:16 INFO BlockManager: Reporting 0 blocks to the master.
21/04/02 21:34:16 INFO MemoryStore: MemoryStore cleared
21/04/02 21:34:16 INFO BlockManager: BlockManager stopped
21/04/02 21:34:16 INFO FileDataSink: Closing sink with output file = /tmp/safari-events/.des_analysis/safari-events/hdp_spark_monitoring_random-container-037caf27-6c77-433f-820f-03cd9c7d9b6e-spark-8a492407d60b401bbf4309a14ea02ca2_events.tsv
21/04/02 21:34:16 INFO HonestProfilerBasedThreadSnapshotProvider: Stopping agent
21/04/02 21:34:16 INFO HonestProfilerHandler: Stopping honest profiler agent
21/04/02 21:34:17 INFO ShutdownHookManager: Shutdown hook called
21/04/02 21:34:17 INFO ShutdownHookManager: Deleting directory /var/data/spark-d886588c-2a7e-491d-bbcb-4f58b3e31001/spark-4aa337a0-60c0-45da-9562-8c50eaff3cea

```

Closes #32043 from sumeetgajjar/SPARK-34949.

Authored-by: Sumeet Gajjar <sumeetgajjar93@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: a9ca197)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
Commit 39d5677ee3261bb0fea0ddb61356741655b926a9 by yamamuro
[SPARK-34932][SQL] deprecate GROUP BY ... GROUPING SETS (...) and promote GROUP BY GROUPING SETS (...)

### What changes were proposed in this pull request?

GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream database. This syntax requires users to repeat the expressions inside `GROUPING SETS (...)` after `GROUP BY`, and has a weird null semantic if `GROUP BY` contains more expressions than `GROUPING SETS (...)`.

This PR deprecates this syntax:
1. Do not promote it in the documentation and only mention it as a Hive-compatible syntax.
2. Simplify the code to only keep it for Hive compatibility.

### Why are the changes needed?

Deprecate a weird grammar.

### Does this PR introduce _any_ user-facing change?

No breaking change, but it removes a check to simplify the code: `GROUP BY a GROUPING SETS(a, b)` used to fail and force users to also put `b` after `GROUP BY`. Now it works just like `GROUP BY GROUPING SETS(a, b)`.
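
A short spark-shell sketch of the two forms (the temp view `t` and its columns are made up for illustration):

```scala
// Sketch only: compare the promoted form with the Hive-compatible form.
spark.range(4).selectExpr("id % 2 AS a", "id AS b").createOrReplaceTempView("t")

// Promoted, standard form:
spark.sql("SELECT a, b, count(*) FROM t GROUP BY GROUPING SETS ((a), (a, b))").show()

// Hive-compatible form kept for compatibility (expressions repeated after GROUP BY):
spark.sql("SELECT a, b, count(*) FROM t GROUP BY a, b GROUPING SETS ((a), (a, b))").show()
```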

### How was this patch tested?

existing tests

Closes #32022 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 39d5677)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modified docs/sql-ref-syntax-qry-select-groupby.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupingAnalyticsSuite.scala (diff)
Commit 7cffacef18e6d7555f2688fa09991b62514b3cc8 by yamamuro
[SPARK-34935][SQL] CREATE TABLE LIKE should respect the reserved table properties

### What changes were proposed in this pull request?

CREATE TABLE LIKE should respect the reserved table properties and fail if they are specified; the legacy behavior can be restored via `spark.sql.legacy.notReserveProperties`.
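
A hedged illustration of the new behavior (table names are made up, and `owner` is assumed to be one of the reserved properties):

```scala
// Sketch only: with this change, specifying a reserved property in CREATE TABLE LIKE
// is expected to fail unless spark.sql.legacy.notReserveProperties is enabled.
spark.sql("CREATE TABLE src_tbl (id INT) USING parquet")
spark.sql("CREATE TABLE tgt_tbl LIKE src_tbl TBLPROPERTIES ('owner' = 'alice')") // expected to fail
```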

### Why are the changes needed?

Make DDLs treat reserved properties consistently.

### Does this PR introduce _any_ user-facing change?

Yes, this is a breaking change: using `CREATE TABLE LIKE` with reserved properties will fail.

### How was this patch tested?

new test

Closes #32025 from yaooqinn/SPARK-34935.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 7cfface)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
Commit caf04f9b77b3b215963936231fb11027dee57d6c by gurwls223
[SPARK-34890][PYTHON] Port/integrate Koalas main codes into PySpark

### What changes were proposed in this pull request?

As a first step of [SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR proposes porting the Koalas main code into PySpark.

This PR contains minimal changes to the existing Koalas code as follows:
1. `databricks.koalas` -> `pyspark.pandas`
2. `from databricks import koalas as ks` -> `from pyspark import pandas as pp`
3. `ks.xxx -> pp.xxx`

Other than them:
1. Added a line to `python/mypy.ini` in order to ignore the mypy test. See related issue at [SPARK-34941](https://issues.apache.org/jira/browse/SPARK-34941).
2. Added a comment to several lines in several files to ignore the flake8 F401. See related issue at [SPARK-34943](https://issues.apache.org/jira/browse/SPARK-34943).

When this PR is merged, all the features that were previously used in [Koalas](https://github.com/databricks/koalas) will be available in PySpark as well.

Users can access the pandas API in PySpark as below:

```python
>>> from pyspark import pandas as pp
>>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]})
>>> ppdf
   A   B
0  1  15
1  2  20
2  3  25
```

The existing "options and settings" in Koalas are also available in the same way:

```python
>>> from pyspark.pandas.config import set_option, reset_option, get_option
>>> ppser1 = pp.Series([1, 2, 3])
>>> ppser2 = pp.Series([3, 4, 5])
>>> ppser1 + ppser2
Traceback (most recent call last):
...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.

>>> set_option("compute.ops_on_diff_frames", True)
>>> ppser1 + ppser2
0    4
1    6
2    8
dtype: int64
```

Please also refer to the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and [Options and Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for more detail.

**NOTE** that this PR intentionally ports the main Koalas code first, almost as-is, with minimal changes, because:
- The Koalas project is fairly large. Making further changes for PySpark at the same time would make the individual changes difficult to review.
    The Koalas developers include multiple Spark committers who will review. By porting first, the committers will be able to review and drive the development more easily and effectively.
- Koalas tests and documentation require major changes to fit well into PySpark, whereas the main code does not.
- We recently froze the Koalas codebase and plan to work together on the initial porting. By porting the main code first as-is, the Koalas developers are unblocked to work on other items in parallel.

I promise and will make sure of the following:
- Renaming Koalas to the PySpark pandas APIs and/or pandas-on-Spark accordingly in the documentation, and in the docstrings and comments of the main code.
- Triaging and removing APIs that don't make sense once Koalas is in PySpark.

The documentation changes will be tracked in [SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885), the test code changes will be tracked in [SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886).

### Why are the changes needed?

Please refer to:
- [[DISCUSS] Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html)
- [[VOTE] SPIP: Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html)

### Does this PR introduce _any_ user-facing change?

Yes, users can now use the pandas APIs on Spark.

### How was this patch tested?

Manually tested for exposed major APIs and options as described above.

### Koalas contributors

Koalas would not have been possible without the following contributors:

ueshin
HyukjinKwon
rxin
xinrong-databricks
RainFung
charlesdong1991
harupy
floscha
beobest2
thunterdb
garawalid
LucasG0
shril
deepyaman
gioa
fwani
90jam
thoo
AbdealiJK
abishekganesh72
gliptak
DumbMachine
dvgodoy
stbof
nitlev
hjoo
gatorsmile
tomspur
icexelloss
awdavidson
guyao
akhilputhiry
scook12
patryk-oleniuk
tracek
dennyglee
athena15
gstaubli
WeichenXu123
hsubbaraj
lfdversluis
ktksq
shengjh
margaret-databricks
LSturtew
sllynn
manuzhang
jijosg
sadikovi

Closes #32036 from itholic/SPARK-34890.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: caf04f9)
The file was added python/pyspark/pandas/indexes/__init__.py
The file was added python/pyspark/pandas/missing/__init__.py
The file was added python/pyspark/pandas/missing/groupby.py
The file was added python/pyspark/pandas/indexing.py
The file was added python/pyspark/pandas/series.py
The file was added python/pyspark/pandas/window.py
The file was added python/pyspark/pandas/spark/accessors.py
The file was added python/pyspark/pandas/indexes/base.py
The file was added python/pyspark/pandas/missing/series.py
The file was added python/pyspark/pandas/indexes/multi.py
The file was added python/pyspark/pandas/utils.py
The file was added python/pyspark/pandas/typedef/__init__.py
The file was added python/pyspark/pandas/plot/plotly.py
The file was added python/pyspark/pandas/indexes/category.py
The file was added python/pyspark/pandas/generic.py
The file was modified python/mypy.ini (diff)
The file was added python/pyspark/pandas/typedef/string_typehints.py
The file was added python/pyspark/pandas/accessors.py
The file was added python/pyspark/pandas/version.py
The file was added python/pyspark/pandas/spark/__init__.py
The file was added python/pyspark/pandas/usage_logging/usage_logger.py
The file was added python/pyspark/pandas/spark/utils.py
The file was added python/pyspark/pandas/__init__.py
The file was added python/pyspark/pandas/categorical.py
The file was added python/pyspark/pandas/mlflow.py
The file was added python/pyspark/pandas/frame.py
The file was added python/pyspark/pandas/groupby.py
The file was added python/pyspark/pandas/spark/functions.py
The file was added python/pyspark/pandas/numpy_compat.py
The file was added python/pyspark/pandas/plot/matplotlib.py
The file was added python/pyspark/pandas/missing/common.py
The file was added python/pyspark/pandas/indexes/datetimes.py
The file was added python/pyspark/pandas/missing/indexes.py
The file was added python/pyspark/pandas/config.py
The file was added python/pyspark/pandas/typedef/typehints.py
The file was added python/pyspark/pandas/usage_logging/__init__.py
The file was added python/pyspark/pandas/missing/frame.py
The file was added python/pyspark/pandas/base.py
The file was added python/pyspark/pandas/exceptions.py
The file was added python/pyspark/pandas/namespace.py
The file was added python/pyspark/pandas/extensions.py
The file was added python/pyspark/pandas/internal.py
The file was added python/pyspark/pandas/plot/core.py
The file was added python/pyspark/pandas/ml.py
The file was added python/pyspark/pandas/datetimes.py
The file was added python/pyspark/pandas/plot/__init__.py
The file was added python/pyspark/pandas/sql.py
The file was added python/pyspark/pandas/strings.py
The file was modified dev/lint-python (diff)
The file was added python/pyspark/pandas/indexes/numeric.py
The file was added python/pyspark/pandas/missing/window.py
Commit 3b634f66c3e4a942178a1e322ae65ce82779625d by wenchen
[SPARK-34923][SQL] Metadata output should be empty for more plans

### What changes were proposed in this pull request?

Changes the metadata propagation framework.

Previously, most `LogicalPlan`s propagated their `children`'s `metadataOutput`. This did not make sense in cases where a `LogicalPlan` did not even propagate its `children`'s `output`.

I set the metadata output for plans that do not propagate their `children`'s `output` to be `Nil`. Notably, `Project` and `View` no longer have metadata output.

### Why are the changes needed?

Previously, `SELECT m from (SELECT a from tb)` would output `m` if it were metadata. This did not make sense.

### Does this PR introduce _any_ user-facing change?

Yes. Now, `SELECT m from (SELECT a from tb)` will encounter an `AnalysisException`.

### How was this patch tested?

Added unit tests. I did not cover all cases, as they are fairly extensive. However, the new tests cover major cases (and an existing test already covers Join).

Closes #32017 from karenfeng/spark-34923.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3b634f6)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (diff)
Commit 19c7d2f3d8cda8d9bc5dfc1a0bf5d46845b1bc2f by wenchen
Revert "[SPARK-34884][SQL] Improve DPP evaluation to make filtering side must can broadcast by size or broadcast by hint"

This reverts commit de66fa63f988069435fd5564d216bb0f438b3aed.
(commit: 19c7d2f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q72.sf100/explain.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q72.sf100/explain.txt (diff)
Commit 4b5fc1da752ec008468ef80a5717c8beab468387 by max.gekk
[SPARK-34667][SQL] Support casting of year-month intervals to strings

### What changes were proposed in this pull request?
1. Added a new method `toYearMonthIntervalString()` to `IntervalUtils` which converts a year-month interval (stored as a number of months) to a string of the form **"INTERVAL '[sign]yearField-monthField' YEAR TO MONTH"** (see the sketch after this list).
2. Extended the `Cast` expression to support casting of `YearMonthIntervalType` to `StringType`.
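
A hedged spark-shell sketch of the intended behaviour, assuming the ANSI interval literal below is already parsed as `YearMonthIntervalType` on this branch:

```scala
// Sketch only: cast a year-month interval to a string in the format described above.
spark.sql("SELECT CAST(INTERVAL '1-2' YEAR TO MONTH AS STRING)").show(false)
// Expected output (per the format above): INTERVAL '1-2' YEAR TO MONTH
```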

### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.

### Does this PR introduce _any_ user-facing change?
It should not, because the new year-month interval type has not been released yet.

### How was this patch tested?
Added new tests for casting:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32056 from MaxGekk/cast-ym-interval-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 4b5fc1d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala (diff)