Changes

Summary

  1. [SPARK-35918][AVRO] Unify schema mismatch handling for read/write and (commit: be06e41) (details)
  2. [SPARK-36206][CORE] Support shuffle data corruption diagnosis via (commit: a98d919) (details)
  3. [SPARK-36224][SQL] Use Void as the type name of NullType (commit: 2f70077) (details)
  4. [SPARK-36382][WEBUI] Remove noisy footer from the summary table for (commit: 366f7fe) (details)
  5. [SPARK-36086][SQL] CollapseProject project replace alias should use (commit: f317395) (details)
  6. [SPARK-35430][K8S] Switch on "PVs with local storage" integration test (commit: 7b90fd2) (details)
  7. [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ (commit: 0bbcbc6) (details)
  8. [SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even (commit: 7a27f8a) (details)
  9. [SPARK-36373][SQL] DecimalPrecision only add necessary cast (commit: c20af53) (details)
  10. [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines (commit: 63517eb) (details)
  11. [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the (commit: 8cb9cf3) (details)
  12. [SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative (commit: 2712343) (details)
  13. [SPARK-36383][CORE] Avoid NullPointerException during executor shutdown (commit: 0b0f4dd) (details)
  14. [SPARK-36192][PYTHON] Better error messages for DataTypeOps against (commit: 8ca11fe) (details)
  15. [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... (commit: 7cb9c1c) (details)
  16. [SPARK-36175][SQL][FOLLOWUP] Improve the comments for (commit: 1deb386) (details)
  17. [SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it (commit: dd80457) (details)
  18. [SPARK-35815][SQL][FOLLOWUP] Add test considering the case (commit: 92cdb17) (details)
  19. [SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources (commit: 67cbc93) (details)
  20. [SPARK-36280][SQL] Remove redundant aliases after (commit: 4a6afb4) (details)
  21. [SPARK-36381][SQL] Add case sensitive and case insensitive compare for (commit: 87d49cb) (details)
  22. [MINOR][DOC] Remove obsolete `contributing-to-spark.md` (commit: c31b653) (details)
  23. [SPARK-36404][SQL] Support ORC nested column vectorized reader for data (commit: de62b5a) (details)
  24. [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io (commit: 3d72c20) (details)
  25. [SPARK-32923][FOLLOW-UP] Clean up older shuffleMergeId shuffle files (commit: d816949) (details)
  26. [SPARK-36354][CORE] EventLogFileReader should skip rolling event log (commit: 28a2a22) (details)
  27. [SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache (commit: 01cf6f4) (details)
Commit be06e4156e9a7600a66822768b39af2f30b3a49e by gengliang
[SPARK-35918][AVRO] Unify schema mismatch handling for read/write and enhance error messages

### What changes were proposed in this pull request?
This unifies the struct schema mismatch-handling logic between `AvroSerializer` and `AvroDeserializer`, pushing it into `AvroUtils`, which is used by both. The newly unified exception-handling logic is updated to provide more contextual information in error messages. Previously, when a schema mismatch was found, we would only report the first missing field, even though there may be others, which can make it less clear what exactly is going wrong. Now we report all missing fields.
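
As a rough illustration of the new behavior, the check collects every missing field before failing, along the following lines (a minimal sketch with hypothetical names, not the actual `AvroUtils` code):

```scala
// Collect *all* Catalyst fields that cannot be matched in the Avro record schema
// and report them together, instead of failing on the first one.
def checkFields(catalystFields: Seq[String], avroFields: Set[String]): Unit = {
  val missing = catalystFields.filterNot(avroFields.contains)
  if (missing.nonEmpty) {
    throw new IllegalArgumentException(
      s"Cannot find the following field(s) in the Avro schema: ${missing.mkString(", ")}")
  }
}
```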

### Why are the changes needed?
While working on #31490, we discussed that there is room for improvement in how schema mismatch errors are reported ([comment1](https://github.com/apache/spark/pull/31490#discussion_r659970793), [comment2](https://github.com/apache/spark/pull/31490#issuecomment-869866848)). Additionally, the logic between `AvroSerializer` and `AvroDeserializer` was quite similar for handling these issues, but didn't share common code, causing duplication and making it harder to see exactly what differences existed between the two.

### Does this PR introduce _any_ user-facing change?
Some error messages when matching Catalyst struct schemas against Avro record schemas now include more information.

### How was this patch tested?
New unit tests added.

Closes #33308 from xkrogen/xkrogen-SPARK-35918-avroserde-unify-better-error-messages.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: be06e41)
The file was modified: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSchemaHelperSuite.scala (diff)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala (diff)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (diff)
The file was modified: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSerdeSuite.scala (diff)
The file was modified: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala (diff)
Commit a98d919da470abaf2e99060f99007a5373032fe1 by mridulatgmail.com
[SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum

### What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this:
The shuffle reader calculates the checksum (c1) for the corrupted shuffle block and sends it to the server where the block is stored. At the server, we read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by a disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by a network issue. Otherwise, the checksum check passes. If any error occurs during the diagnosis itself, the cause is reported as unknown.

After the shuffle reader receives the diagnosis response, it takes action based on the cause. Only in the case of a network issue do we retry the fetch; otherwise, we throw the fetch failure directly. Also note that if the corruption happens inside `BufferReleasingInputStream`, the reducer throws the fetch failure immediately no matter what the cause is, since the data has already been partially consumed by downstream RDDs. If corruption happens again after a retry, the reducer throws the fetch failure directly, this time without running the diagnosis.
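
The decision logic above boils down to something like the following (a minimal sketch with hypothetical names, not the actual checksum helper):

```scala
// c1 = checksum computed by the shuffle reader over the corrupted block,
// c2 = checksum stored in the checksum file on the server,
// c3 = checksum recalculated from the stored shuffle block on the server.
sealed trait CorruptionCause
case object DiskIssue extends CorruptionCause
case object NetworkIssue extends CorruptionCause
case object ChecksumVerifyPassed extends CorruptionCause

def diagnose(c1: Long, c2: Long, c3: Long): CorruptionCause =
  if (c2 != c3) DiskIssue          // the stored block no longer matches its own checksum
  else if (c1 != c3) NetworkIssue  // the block changed between the server and the reader
  else ChecksumVerifyPassed        // checksums agree; the actual cause stays unknown
```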

Please check out https://github.com/apache/spark/pull/32385 to see the complete proposal of the shuffle checksum project.

### Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually report corruption issues. However, data corruption is difficult to reproduce in most cases, and it is even harder to tell the root cause; we often don't know whether it's a Spark issue or not. With diagnosis support for shuffle corruption, Spark itself can at least distinguish between disk and network causes, which is very important for users.

### Does this PR introduce _any_ user-facing change?

Yes, users may know the cause of the shuffle corruption after this change.

### How was this patch tested?

Added tests.

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: a98d919)
The file was modified: core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java (diff)
The file was added: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/checksum/Cause.java
The file was modified: core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was added: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/DiagnoseCorruption.java
The file was modified: core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/shuffle/BlockStoreShuffleReader.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/network/BlockDataManager.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/storage/BlockId.scala (diff)
The file was modified: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/BlockTransferMessage.java (diff)
The file was modified: core/src/test/scala/org/apache/spark/ShuffleSuite.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
The file was removed: core/src/main/java/org/apache/spark/shuffle/checksum/ShuffleChecksumHelper.java
The file was modified: core/src/test/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriterSuite.java (diff)
The file was modified: core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java (diff)
The file was modified: core/src/test/scala/org/apache/spark/shuffle/ShuffleChecksumTestHelper.scala (diff)
The file was modified: core/src/test/scala/org/apache/spark/shuffle/sort/SortShuffleWriterSuite.scala (diff)
The file was modified: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/BlockStoreClient.java (diff)
The file was added: core/src/main/java/org/apache/spark/shuffle/checksum/ShuffleChecksumSupport.java
The file was added: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/checksum/ShuffleChecksumHelper.java
The file was added: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/CorruptionCause.java
The file was modified: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalBlockHandlerSuite.java (diff)
The file was modified: core/src/main/scala/org/apache/spark/network/netty/NettyBlockRpcServer.scala (diff)
The file was modified: core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
The file was modified: core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java (diff)
The file was modified: core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala (diff)
The file was modified: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java (diff)
The file was modified: core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala (diff)
The file was modified: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java (diff)
The file was modified: core/src/test/scala/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriterSuite.scala (diff)
The file was modified: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java (diff)
Commit 2f700773c2e8fac26661d0aa8024253556a921ba by wenchen
[SPARK-36224][SQL] Use Void as the type name of NullType

### What changes were proposed in this pull request?
Change `NullType.simpleString` to "void" so that "void" becomes the formal type name of `NullType`.

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. in schema strings, error messages, and documentation. Hence it's not possible to hide it from users, and we have to choose a proper name.
2. "void" is widely used as the type name for "NULL", e.g. in Hive and pgSQL.
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e., it makes `from_json(col, schema.toDDL)` work), as sketched below.
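
A small illustration of the round trip mentioned in point 3 (a sketch; the exact DDL rendering of the void type is an assumption):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a", NullType),
  StructField("b", IntegerType)))

// With "void" as the formal type name, the generated DDL string (e.g. "`a` VOID,`b` INT")
// can be parsed back into the original schema.
val roundTripped = StructType.fromDDL(schema.toDDL)
assert(roundTripped == schema)
```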

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". For example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2f70077)
The file was modified: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/ansi/string-functions.sql.out (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/literals.sql.out (diff)
The file was modified: python/pyspark/sql/types.py (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/table-valued-functions.sql.out (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/ansi/literals.sql.out (diff)
The file was modified: python/pyspark/sql/tests/test_types.py (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/inline-table.sql.out (diff)
The file was modified: sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/postgreSQL/text.sql.out (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/types/NullType.scala (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/misc-functions.sql.out (diff)
The file was modified: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/postgreSQL/select.sql.out (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/udf/udf-inline-table.sql.out (diff)
The file was modified: sql/core/src/test/resources/sql-tests/results/sql-compatibility-functions.sql.out (diff)
The file was modified: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala (diff)
Commit 366f7febaf940ab10901026fac57d9f10f8fce47 by gengliang
[SPARK-36382][WEBUI] Remove noisy footer from the summary table for metrics

### What changes were proposed in this pull request?

This PR changed `StagePage` to remove a noisy footer from the summary table for metrics.

### Why are the changes needed?

In the WebUI, some tables are implemented using DataTables (https://datatables.net/).
By default, tables created using DataTables show a footer that says `Showing x to y of z entries`, which is helpful for tables whose entries can grow.
But the summary table for metrics in StagePage cannot grow, so the footer is just noise.
![summary_metrics_before](https://user-images.githubusercontent.com/4736016/127866960-d2fa23fc-7260-4b99-86cc-b31f85249632.png)

Actually, ExecutorPage has a similar summary table, and the footer is already absent from that table.
![executors-no-footer](https://user-images.githubusercontent.com/4736016/127867104-23581e79-de70-49fa-aaef-ab241a8bfa0a.png)

### Does this PR introduce _any_ user-facing change?

Yes, appearance will be slightly changed but I don't think this change affects users.

### How was this patch tested?

I confirmed that the footer is removed from the table.
![summary_metrics_after](https://user-images.githubusercontent.com/4736016/127867320-097a6f52-7aa8-4fec-9d50-b982165baef7.png)

Closes #33611 from sarutak/remove-unnecessary-table-footer.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 366f7fe)
The file was modified: core/src/main/resources/org/apache/spark/ui/static/stagepage.js (diff)
Commit f3173956cbd64c056424b743aff8d17dd7c61fd7 by wenchen
[SPARK-36086][SQL] CollapseProject project replace alias should use origin column name

### What changes were proposed in this pull request?
Without this patch, the added UT fails as below:
```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces an alias, it should use the original column name.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33576 from AngersZhuuuu/SPARK-36086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f317395)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala (diff)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseProjectSuite.scala (diff)
The file was modified: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt (diff)
The file was modified: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/explain.txt (diff)
The file was modified: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/simplified.txt (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala (diff)
The file was modified: sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt (diff)
Commit 7b90fd2ca79b9a1fec5fca0bdcc169c7962ad880 by incomplete
[SPARK-35430][K8S] Switch on "PVs with local storage" integration test on Docker driver

### What changes were proposed in this pull request?

Switching the "PVs with local storage" integration test back on for the Docker driver.

I have analyzed why this test was failing on my machine (I hope the root cause of the problem is OS agnostic).
It failed because of the mounting of the host directory into the Minikube node using `--uid=185` (the Spark user's user id):

```
$ minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L --gid=0 --uid=185 &; MOUNT_PID=$!
```

This refers to a nonexistent user. See the number of occurrences of 185 in `/etc/passwd`:

```
$ minikube ssh "grep -c 185 /etc/passwd"
0
```

This leads to a permission denied error. Skipping `--uid=185` won't help, although the path will be listable before the test execution:

```
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
╰─$ 📁  Mounting host path /var/folders/t_/fr_vqcyx23vftk81ftz1k5hw0000gn/T/tmp.k9X4Gecv into VM as /var/folders/t_/fr_vqcyx23vftk81ftz1k5hw0000gn/T/tmp.k9X4Gecv ...
    ▪ Mount type:
    ▪ User ID:      docker
    ▪ Group ID:     0
    ▪ Version:      9p2000.L
    ▪ Message Size: 262144
    ▪ Permissions:  755 (-rwxr-xr-x)
    ▪ Options:      map[]
    ▪ Bind Address: 127.0.0.1:51740
🚀  Userspace file server: ufs starting

╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
╰─$ minikube ssh "ls /var/folders/t_/fr_vqcyx23vftk81ftz1k5hw0000gn/T/tmp.k9X4Gecv"
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-35430*›
╰─$
```

But the test will fail, and after its execution `dmesg` shows the following error:
```
[13670.493359] bpfilter: Loaded bpfilter_umh pid 66153
[13670.493363] bpfilter: write fail -32
[13670.530737] bpfilter: Loaded bpfilter_umh pid 66155
...
```

`bpfilter` is a firewall module, and we are back to a permission denied error when we try to list the mounted directory.

The solution is to add a spark user with uid 185 when Minikube is started.

**So this must be added to Jenkins job (and the mount should use --gid=0 --uid=185)**:

```
$ minikube ssh "sudo useradd spark -u 185 -g 0 -m -s /bin/bash"
```

### Why are the changes needed?

This integration test is needed to validate the PVs feature.

### Does this PR introduce _any_ user-facing change?

No. It is just testing.

### How was this patch tested?

Running the test locally:
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
```

The "PVs with local storage" was successful but the next test `Launcher client dependencies` the minio stops the test executions on Mac (only on Mac):
```
21/06/29 04:33:32.449 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 🏃  Starting tunnel for service minio-s3.
21/06/29 04:33:33.425 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |----------------------------------|----------|-------------|------------------------|
21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |            NAMESPACE             |   NAME   | TARGET PORT |          URL           |
21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |----------------------------------|----------|-------------|------------------------|
21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: | 7855c37ca34340c49a98aa8439f4935c | minio-s3 |             | http://127.0.0.1:62138 |
21/06/29 04:33:33.426 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: |----------------------------------|----------|-------------|------------------------|
21/06/29 04:33:33.449 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: http://127.0.0.1:62138
21/06/29 04:33:33.449 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: ❗  Because you are using a Docker driver on darwin, the terminal needs to be open to run it.
```
This is a different problem, a Docker Desktop limitation (https://docs.docker.com/docker-for-mac/networking/#per-container-ip-addressing-is-not-possible).

Of course, with the default driver on Mac (hyperkit), all the tests pass:
```
[INFO] --- scalatest-maven-plugin:2.0.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 498 milliseconds.
Run starting. Expected test count is: 26
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
...
[INFO] BUILD SUCCESS
```

Closes #32793 from attilapiros/SPARK-35430.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
(commit: 7b90fd2)
The file was modified: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala (diff)
The file was modified: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala (diff)
Commit 0bbcbc65080cd67a9997f49906d9d48fdf21db10 by dongjoon
[SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode

### What changes were proposed in this pull request?

This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode.
Previously, we passed `null`s through as-is, so the root `InternalRow`s became `null`s, and that caused the query to fail even with the permissive mode on.
Now, we fail explicitly when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### Why are the changes needed?

To make the permissive mode to proceed and parse without throwing an exception.

### Does this PR introduce _any_ user-facing change?

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

NOTE that this behaviour is consistent with the case where a JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

### How was this patch tested?

Manually tested, and unit test was added.

Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 0bbcbc6)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
Commit 7a27f8a07fc848d668ae75222fa69c7b7b229f84 by viirya
[SPARK-36137][SQL] HiveShim should fallback to getAllPartitionsOf even if directSQL is enabled in remote HMS

### What changes were proposed in this pull request?

Change `HiveShim.getPartitionsByFilter` to always fall back to `getAllPartitionsMethod`, even if `hive.metastore.try.direct.sql` is set to true in the remote HMS.
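
A rough sketch of the fallback pattern this describes (hypothetical helper names, not the actual `HiveShim` code):

```scala
import scala.util.control.NonFatal

// Try server-side partition pruning first; if the remote HMS throws (regardless of
// whether hive.metastore.try.direct.sql is enabled), list all partitions and prune
// them on the client side. The three function parameters are placeholders.
def getPartitionsWithFallback[P](
    fetchByFilter: () => Seq[P],
    fetchAll: () => Seq[P],
    pruneLocally: Seq[P] => Seq[P]): Seq[P] = {
  try {
    fetchByFilter()
  } catch {
    case NonFatal(_) => pruneLocally(fetchAll())
  }
}
```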

### Why are the changes needed?

At the moment, `getPartitionsByFilter` in `HiveShim` only falls back to `getAllPartitionsMethod` when `hive.metastore.try.direct.sql` is disabled in the remote HMS, and will fail the query otherwise. However, in certain cases the remote HMS will fall back to using ORM (which only supports string types for partition columns) to query the underlying RDBMS **even if this config is set to true**. In this scenario, Spark currently cannot recover from the exception and will just fail the query.

For instance, we encountered this bug [HIVE-21497](https://issues.apache.org/jira/browse/HIVE-21497) in an HMS running Hive 3.1.2, and Spark was not able to push down a filter for a date column.

### Does this PR introduce _any_ user-facing change?

Yes. Now, if Spark queries partitions from a remote HMS that throws an exception even though `hive.metastore.try.direct.sql` is set to true, Spark will fall back to listing all partitions and do the pruning on the client side, instead of failing the query.

### How was this patch tested?

Tested locally with an HMS instance running Hive 3.1.2. It's pretty hard to add a unit test for this since we don't have a mock HMS.

Closes #33382 from sunchao/SPARK-36137-direct-sql.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 7a27f8a)
The file was modified: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
Commit c20af535803a7250fef047c2bf0fe30be242369d by yumwang
[SPARK-36373][SQL] DecimalPrecision only add necessary cast

### What changes were proposed in this pull request?

This PR makes `DecimalPrecision` only add necessary casts, similar to [`ImplicitTypeCasts`](https://github.com/apache/spark/blob/96c2919988ddf78d104103876d8d8221e8145baa/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L675-L678). For example:
```
EqualTo(AttributeReference("d1", DecimalType(5, 2))(), AttributeReference("d2", DecimalType(2, 1))())
```
It will add a useless cast to _d1_:
```
(cast(d1#6 as decimal(5,2)) = cast(d2#7 as decimal(5,2)))
```
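
For comparison, with this change only the narrower side would presumably need a cast, along the lines of (illustrative; the expression IDs follow the example above):

```
(d1#6 = cast(d2#7 as decimal(5,2)))
```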

### Why are the changes needed?

1. Avoid adding unnecessary casts, even though they would be removed later by `SimplifyCasts`.
2. I'm trying to add an extended rule similar to `PullOutGroupingExpressions`. The current behavior would introduce additional aliases, for example `cast(d1 as decimal(5,2)) as cast_d1`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33602 from wangyum/SPARK-36373.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
(commit: c20af53)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecisionSuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (diff)
Commit 63517eb430a55176bdf5ad9192d72e80f25e61e8 by gurwls223
[SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines

### What changes were proposed in this pull request?

Adds ANSI/ISO SQLSTATE standards to the error guidelines.

### Why are the changes needed?

Provides visibility and consistency to the SQLSTATEs assigned to error classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed; docs only

Closes #33560 from karenfeng/sqlstate-manual.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 63517eb)
The file was modified: core/src/main/resources/error/README.md (diff)
Commit 8cb9cf39b6a1899175aeaefb2a85480f5a514aac by gurwls223
[SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3

### What changes were proposed in this pull request?

Disable tests failed by the incompatible behavior of pandas 1.3.

### Why are the changes needed?

Pandas 1.3 has been released.
There are some behavior changes that we should follow, but we are not ready for them yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 8cb9cf3)
The file was modified: python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py (diff)
The file was modified: python/pyspark/pandas/series.py (diff)
The file was modified: python/pyspark/pandas/tests/test_expanding.py (diff)
The file was modified: python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was modified: python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified: python/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modified: .github/workflows/build_and_test.yml (diff)
The file was modified: python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py (diff)
The file was modified: python/pyspark/pandas/tests/test_rolling.py (diff)
The file was modified: python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified: python/pyspark/pandas/groupby.py (diff)
The file was modified: python/pyspark/pandas/tests/test_series.py (diff)
Commit 2712343a276a11b46f0771fe6a6d26ee1834a34f by dongjoon
[SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative mapId in ShuffleBlockId

### What changes were proposed in this pull request?
With SPARK-32922, we added a change so that ShuffleBlockId can have a negative mapId. This was to support push-based shuffle, where a mapId of -1 indicated a push-merged block. However, SPARK-32923 introduced a different type of BlockId, `ShuffleMergedId`, but reverting the change to ShuffleBlockId was missed.

### Why are the changes needed?
This reverts the changes to `ShuffleBlockId` which will never have a negative mapId.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified the unit test to verify the newly added ShuffleMergedBlockId.

Closes #33616 from otterc/SPARK-36389.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 2712343)
The file was modified: core/src/test/scala/org/apache/spark/storage/BlockIdSuite.scala (diff)
The file was modified: core/src/main/scala/org/apache/spark/storage/BlockId.scala (diff)
Commit 0b0f4dd1862db08636d5fa1913d42be5e9f0ba14 by dongjoon
[SPARK-36383][CORE] Avoid NullPointerException during executor shutdown

### What changes were proposed in this pull request?

Fix `NullPointerException` in `Executor.stop()`.

### Why are the changes needed?

Some initialization steps can fail before `metricsPoller`, `heartbeater`, and `threadPool` are initialized, which leaves them `null`. For example, I encountered a failure at:

https://github.com/apache/spark/blob/c20af535803a7250fef047c2bf0fe30be242369d/core/src/main/scala/org/apache/spark/executor/Executor.scala#L137

where the executor itself failed to register at the driver.

This PR eliminates the error messages when the issue happens, so as not to confuse users:
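
A minimal sketch of the kind of guard that avoids these errors (an assumption about the shape of the fix, not the exact code in `Executor.stop()`):

```scala
// Hypothetical components standing in for metricsPoller, heartbeater, threadPool, etc.
class Component { def stop(): Unit = () }

var metricsPoller: Component = null          // never initialized if startup failed early
var heartbeater: Component = new Component

def stopAll(): Unit = {
  // Guard each reference so a partially initialized executor can still shut down cleanly.
  if (metricsPoller != null) metricsPoller.stop()
  if (heartbeater != null) heartbeater.stop()
}

stopAll()  // no NullPointerException even though metricsPoller is null
```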

Detailed error message:
```
21/07/23 16:04:10 WARN Executor: Unable to stop executor metrics poller
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:318)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 WARN Executor: Unable to stop heartbeater
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:324)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:334)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231)
        at org.apache.spark.executor.Executor.stop(Executor.scala:334)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
```

### Does this PR introduce _any_ user-facing change?

Yes, users won't see error messages of `NullPointerException` after this fix.

### How was this patch tested?

Pass existing tests.

Closes #33612 from Ngone51/avoid-npe-during-executor-shutdown.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 0b0f4dd)
The file was modified: core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
Commit 8ca11fe39f6828bb08f123d05c2a4b44da5231b7 by gurwls223
[SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists

### What changes were proposed in this pull request?
Better error messages for DataTypeOps against lists.

### Why are the changes needed?
Currently, DataTypeOps against lists throws a Py4JJavaError; we should throw a TypeError with a proper message instead.

### Does this PR introduce _any_ user-facing change?
Yes. A TypeError message will be shown rather than a Py4JJavaError.

From:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1]
...
```

To:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
```

### How was this patch tested?
Unit tests.

Closes #33581 from xinrong-databricks/data_type_ops_list.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 8ca11fe)
The file was modified: python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/complex_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/binary_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/null_ops.py (diff)
The file was modified: python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/date_ops.py (diff)
The file was modified: python/pyspark/pandas/data_type_ops/string_ops.py (diff)
Commit 7cb9c1c2415a0984515e4d4733f816673e4ae3c8 by max.gekk
[SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN

### What changes were proposed in this pull request?

This is a followup of recent work such as https://github.com/apache/spark/pull/33200.

For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in their names and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule for `ALTER TABLE ... COLUMN` commands.

This PR also moves these AlterTable commands to an individual file and gives them a base trait.

### Why are the changes needed?

name simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33609 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 7cb9c1c)
The file was added: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
Commit 1deb386727802978011d4d102b4bbe4632e1bcf8 by gengliang
[SPARK-36175][SQL][FOLLOWUP] Improve the comments for AvroDeserializer/AvroSerializer

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/33413 and just improves the comments for `AvroDeserializer`/`AvroSerializer`.

### Why are the changes needed?
Make the comments more correct.

### Does this PR introduce _any_ user-facing change?
No.
It just changes the comments.

### How was this patch tested?
No need.

Closes #33607 from beliefer/SPARK-36175-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(commit: 1deb386)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala (diff)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala (diff)
Commit dd80457ffb1c129a1ca3c53bcf3ea5feed7ebc57 by wenchen
[SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it breaks the distribution requirement

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30494

This PR proposes a new way to optimize the final query stage in AQE. We first collect the effective user-specified repartition (semantically, a user-specified repartition is only effective if it is the root node or sits under a few simple nodes) and get the required distribution for the final plan. When we optimize the final query stage, we skip certain `AQEShuffleReadRule`s if they would break the required distribution.
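
Conceptually, the final-stage check works roughly as follows (a sketch with a hypothetical helper; the actual logic lives in the AQE framework):

```scala
import org.apache.spark.sql.catalyst.plans.physical.Distribution
import org.apache.spark.sql.execution.SparkPlan

// Keep the result of a shuffle-read optimization only if the optimized final-stage plan
// still satisfies the distribution required by the effective user-specified repartition.
def applyIfDistributionPreserved(
    optimize: SparkPlan => SparkPlan,
    plan: SparkPlan,
    required: Option[Distribution]): SparkPlan = {
  val optimized = optimize(plan)
  if (required.forall(d => optimized.outputPartitioning.satisfies(d))) optimized else plan
}
```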

### Why are the changes needed?

The current solution for optimizing the final query stage is pretty hacky and overkill. As an example, the newly added rule `OptimizeSkewInRebalancePartitions` can hardly ever apply, as it's very common for the query plan to have shuffles with the origin `ENSURE_REQUIREMENTS`, which is not supported by `OptimizeSkewInRebalancePartitions`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #33541 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: dd80457)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeShuffleWithLocalRead.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEShuffleReadRule.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEShuffleReadExec.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ValidateRequirements.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was added: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEUtils.scala
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/exchange/EnsureRequirementsSuite.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala (diff)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
Commit 92cdb17d1a005d4f22647c1f6ec0b0761ac8b7cb by max.gekk
[SPARK-35815][SQL][FOLLOWUP] Add test considering the case spark.sql.legacy.interval.enabled is true

### What changes were proposed in this pull request?

This PR adds test considering the case `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815.

### Why are the changes needed?

SPARK-35815 (#33456) changed `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold`, but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`.
We couldn't detect this issue because there was no test considering the legacy interval type at that time.
In SPARK-36323 (#33551), this issue was resolved, but it's better to add a test.
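
For reference, a hedged sketch of the scenario the new test presumably covers (the DataFrame, column name, and exact interval literal syntax are assumptions):

```scala
// With the legacy interval type enabled, an ANSI interval literal passed as the
// delay threshold should still be accepted after SPARK-36323.
spark.conf.set("spark.sql.legacy.interval.enabled", "true")

val withWm = streamingDf.withWatermark("eventTime", "INTERVAL '10' SECOND")
```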

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33606 from sarutak/test-watermark-with-legacy-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 92cdb17)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala (diff)
Commit 67cbc932638179925ebbeb76d6d6e6f25a3cb2e2 by max.gekk
[SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources

### What changes were proposed in this pull request?
In the PR, I propose to ban `YearMonthIntervalType` and `DayTimeIntervalType` at the analysis phase when creating a table using a built-in file-based datasource or writing a dataset to such a datasource. In particular, add the following case:
```scala
case _: DayTimeIntervalType | _: YearMonthIntervalType => false
```
to all methods that override either:
- V2 `FileTable.supportsDataType()`
- V1 `FileFormat.supportDataType()`

### Why are the changes needed?
To improve user experience with Spark SQL, and output a proper error message at the analysis phase.

### Does this PR introduce _any_ user-facing change?
Yes, but ANSI interval types haven't been released yet, so for users this is new behavior.
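
As an illustration of the new behavior (the path and the exact error wording are assumptions):

```scala
// Writing a column of an ANSI interval type to a built-in file-based datasource
// now fails at analysis time instead of at execution time.
spark.range(1)
  .selectExpr("INTERVAL '1' YEAR AS ym")
  .write.parquet("/tmp/ansi-interval-out")
// => org.apache.spark.sql.AnalysisException (wording approximate): the data source
//    does not support the interval data type.
```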

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite"
```

Closes #33580 from MaxGekk/interval-ban-in-ds.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 67cbc93)
The file was modified: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/CommonFileDataSourceSuite.scala (diff)
The file was modified: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
The file was modified: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcTable.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/csv/CSVTable.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetTable.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonTable.scala (diff)
Commit 4a6afb4875d86ef1f1985521b254ac00ffa85ab2 by dongjoon
[SPARK-36280][SQL] Remove redundant aliases after RewritePredicateSubquery

### What changes were proposed in this pull request?

Remove redundant aliases after `RewritePredicateSubquery`. For example:
```scala
sql("CREATE TABLE t1 USING parquet AS SELECT id AS a, id AS b, id AS c FROM range(10)")
sql("CREATE TABLE t2 USING parquet AS SELECT id AS x, id AS y FROM range(8)")
sql(
  """
    |SELECT *
    |FROM  t1
    |WHERE  a IN (SELECT x
    |  FROM  (SELECT x AS x,
    |           Rank() OVER (partition BY x ORDER BY Sum(y) DESC) AS ranking
    |    FROM   t2
    |    GROUP  BY x) tmp1
    |  WHERE  ranking <= 5)
    |""".stripMargin).explain
```
Before this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#7L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#68]
      +- Project [x#7L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#62]
                     +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                        +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                           +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                              +- FileScan parquet default.t2[x#15L,y#16L]
```

After this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [a#10L], [x#15L], LeftSemi, BuildRight, false
   :- FileScan parquet default.t1[a#10L,b#11L,c#12L]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]),false), [id=#67]
      +- Project [x#15L]
         +- Filter (ranking#8 <= 5)
            +- Window [rank(_w2#25L) windowspecdefinition(x#15L, _w2#25L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS ranking#8], [x#15L], [_w2#25L DESC NULLS LAST]
               +- Sort [x#15L ASC NULLS FIRST, _w2#25L DESC NULLS LAST], false, 0
                  +- HashAggregate(keys=[x#15L], functions=[sum(y#16L)])
                     +- Exchange hashpartitioning(x#15L, 5), ENSURE_REQUIREMENTS, [id=#59]
                        +- HashAggregate(keys=[x#15L], functions=[partial_sum(y#16L)])
                           +- FileScan parquet default.t2[x#15L,y#16L]
```

### Why are the changes needed?

Removing the redundant aliases avoids an unnecessary extra shuffle and improves query performance. This change benefits TPC-DS q70.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33509 from wangyum/SPARK-36280.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 4a6afb4)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q11/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q11.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q74/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q11.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4/simplified.txt (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q74.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q4/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q74/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q74.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q11/simplified.txt (diff)
Commit 87d49cbcb1b9003763148dec6a3b067cf86f6ab3 by gurwls223
[SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table

### What changes were proposed in this pull request?
Pass a `Resolver` to `checkColumnNotExists` so that the column-name existence check respects the configured case sensitivity.

### Why are the changes needed?
Currently, the resolver used by `findNestedField` (called from `checkColumnNotExists`) is `_ == _`, which is always case sensitive. This change passes `alter.conf.resolver` instead, so the comparison follows the session's case-sensitivity setting.
[SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381)
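
As a minimal, self-contained sketch (hypothetical object and helper names, not Spark's actual `CheckAnalysis` code), this shows why the choice of resolver matters: with the old `_ == _` comparison, an existing column `ID` is not detected as a duplicate of a newly added `id`, even when the session is case insensitive.
```scala
// Hypothetical standalone illustration; `Resolver`, `columnExists`, and the sample
// columns are assumptions of this sketch, not Spark internals.
object ResolverSketch {
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = _ == _                           // old behavior: always an exact match
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b) // what a case-insensitive session expects

  def columnExists(existing: Seq[String], candidate: String, resolver: Resolver): Boolean =
    existing.exists(resolver(_, candidate))

  def main(args: Array[String]): Unit = {
    val columns = Seq("ID", "name")
    println(columnExists(columns, "id", caseSensitive))   // false: duplicate not detected
    println(columnExists(columns, "id", caseInsensitive)) // true: duplicate detected
  }
}
```
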
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes #33618 from Peng-Lei/sensitive-cloumn-name.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 87d49cb)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
Commit c31b653806f19ffce018651b6953bf47d019d7e8 by gurwls223
[MINOR][DOC] Remove obsolete `contributing-to-spark.md`

### What changes were proposed in this pull request?

This PR removes the obsolete `contributing-to-spark.md`, which is not referenced from anywhere.

### Why are the changes needed?

Just a cleanup.

### Does this PR introduce _any_ user-facing change?

No. Users can't access `contributing-to-spark.html` unless they navigate directly to its URL.

### How was this patch tested?

Built the document and confirmed that this change doesn't affect the result.

Closes #33619 from sarutak/remove-obsolete-contribution-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: c31b653)
The file was removed docs/contributing-to-spark.md
Commit de62b5ae325741e6617e9400f78c6d0fc8cea5de by dongjoon
[SPARK-36404][SQL] Support ORC nested column vectorized reader for data source v2

### What changes were proposed in this pull request?

We already added support for nested columns in the ORC vectorized reader for data source v1. Data source v1 and v2 share the same underlying vectorized reader implementation (`OrcColumnVector`), so we can support data source v2 as well.
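
As a usage illustration, here is a minimal sketch of reading nested columns through the ORC v2 path. The two ORC configs, the empty `spark.sql.sources.useV1SourceList` (to route ORC through data source v2), and the temp path are assumptions of this sketch rather than a statement of this PR's API surface.
```scala
// Minimal sketch: write a struct-typed column and read it back via the v2 ORC reader.
import org.apache.spark.sql.SparkSession

object OrcNestedV2ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.orc.enableVectorizedReader", "true")
      .config("spark.sql.orc.enableNestedColumnVectorizedReader", "true") // assumed gating config
      .config("spark.sql.sources.useV1SourceList", "")                    // force the v2 ORC source
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/orc_nested_sketch" // hypothetical output location
    Seq((1, ("a", Seq(1, 2))), (2, ("b", Seq(3))))
      .toDF("id", "nested")
      .write.mode("overwrite").orc(path)

    // Reading individual nested fields is the case the vectorized path speeds up.
    spark.read.orc(path).select("id", "nested._1", "nested._2").show()
    spark.stop()
  }
}
```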

### Why are the changes needed?

Improve query performance for ORC data source v2 when reading nested columns.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added test in `OrcQuerySuite.scala`.

Closes #33626 from c21/orc-v2.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: de62b5a)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala (diff)
Commit 3d72c20e64c18e6e10dc862ab19f07342fcdb2d6 by gurwls223
[SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io

### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/32964 to improve the warning message.

### Why are the changes needed?

To improve the warning message.

### Does this PR introduce _any_ user-facing change?

The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."

### How was this patch tested?

Manually ran `dev/lint-python`.

Closes #33631 from itholic/SPARK-35811-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 3d72c20)
The file was modified python/pyspark/pandas/frame.py (diff)
Commit d8169493b662acac31d3fc5e6c5051917428c974 by mridulatgmail.com
[SPARK-32923][FOLLOW-UP] Clean up older shuffleMergeId shuffle files when finalize request for higher shuffleMergeId is received

### What changes were proposed in this pull request?

Clean up shuffle files belonging to an older shuffleMergeId when a finalize request for a higher shuffleMergeId is received and no blocks were pushed for the corresponding shuffleMergeId. This was identified as part of https://github.com/apache/spark/pull/33034#discussion_r680610872.
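
For intuition, here is a self-contained sketch of the cleanup rule (hypothetical names and file patterns, not the actual `RemoteBlockPushResolver` logic): when a finalize request carries a higher shuffleMergeId than the one currently tracked for a shuffle, any files kept for the older shuffleMergeId are removed, even if no blocks were ever pushed for the newer one.
```scala
import scala.collection.mutable

object ShuffleMergeCleanupSketch {
  // shuffleId -> (shuffleMergeId currently tracked, merged files kept for that merge attempt)
  private val merges = mutable.Map[Int, (Int, mutable.Buffer[String])]()

  def onFinalize(shuffleId: Int, shuffleMergeId: Int): Unit = merges.get(shuffleId) match {
    case Some((oldMergeId, files)) if shuffleMergeId > oldMergeId =>
      // A newer merge attempt is being finalized: drop everything kept for the older attempt,
      // even though nothing was pushed for the newer shuffleMergeId yet.
      files.foreach(f => println(s"deleting stale merged file $f (shuffleMergeId=$oldMergeId)"))
      merges(shuffleId) = (shuffleMergeId, mutable.Buffer.empty[String])
    case None =>
      merges(shuffleId) = (shuffleMergeId, mutable.Buffer.empty[String])
    case _ => // same or older shuffleMergeId: nothing to clean up
  }

  def main(args: Array[String]): Unit = {
    merges(0) = (1, mutable.Buffer("merged_shuffle_0_1_0.data", "merged_shuffle_0_1_0.index"))
    onFinalize(shuffleId = 0, shuffleMergeId = 2) // removes the shuffleMergeId=1 files
  }
}
```
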

### Why are the changes needed?

Without this change, older shuffleMergeId files won't be cleaned up properly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Extended an existing unit test to cover this case.

Closes #33605 from venkata91/SPARK-32923-follow-on.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: d816949)
The file was modified common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/RemoteBlockPushResolverSuite.java (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java (diff)
Commit 28a2a2238fbaf4fad3c98cfef2b3049c1f4616c8 by kabhwan.opensource
[SPARK-36354][CORE] EventLogFileReader should skip rolling event log directories with no logs

### What changes were proposed in this pull request?

This PR aims to skip rolling event log directories that contain only an `appstatus` file.

### Why are the changes needed?

Currently, the Spark History Server logs an `IllegalArgumentException` warning even though the event log might still arrive later. The situation can also happen when a job is killed before uploading its first log to remote storage such as S3.
```
21/07/30 07:38:26 WARN FsHistoryProvider:
Error while reading new log s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
java.lang.IllegalArgumentException: requirement failed:
Log directory must contain at least one event log file!
...
at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
```
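
A minimal, self-contained sketch of the skip rule described above (not Spark's `EventLogFileReaders` code); the `events_` file prefix for rolling event log files is an assumption of this example.
```scala
// Hypothetical illustration: a rolling event log directory that holds only the
// appstatus marker has no event log files yet and should be skipped rather than
// rejected with IllegalArgumentException.
import java.io.File
import java.nio.file.Files

object RollingEventLogSkipSketch {
  // Assumption: event log files in a rolling directory start with "events_".
  def hasEventLogFiles(dir: File): Boolean =
    Option(dir.listFiles()).getOrElse(Array.empty[File])
      .exists(_.getName.startsWith("events_"))

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("eventlog_v2_app-sketch").toFile
    new File(dir, "appstatus_app-sketch").createNewFile() // only the status marker exists
    if (!hasEventLogFiles(dir)) println(s"Skipping $dir: no event log files yet")
  }
}
```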

### Does this PR introduce _any_ user-facing change?

Yes. Users will not see `IllegalArgumentException` warnings.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #33586 from dongjoon-hyun/SPARK-36354.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 28a2a22)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/EventLogFileReaders.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/history/EventLogFileReadersSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff)
Commit 01cf6f4c6b2a593a2a8717fd2cda13725424120e by hkarau
[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache

### What changes were proposed in this pull request?
There are three ways Guava Cache is used in Spark code:

1. `LoadingCache` is the main way Guava Cache is used in Spark code. The key usages are as follows:
  a. `LoadingCache` with a `maximumSize` eviction policy, such as `appCache` in `ApplicationCache` and `cache` in `CodeGenerator`
  b. `LoadingCache` with a `maximumWeight` eviction policy, such as `shuffleIndexCache` in `ExternalShuffleBlockResolver`
  c. `LoadingCache` with an `expireAfterWrite` eviction policy, such as `tableRelationCache` in `SessionCatalog`
2. A manually populated `Cache` is another way Guava Cache is used; the key usage is `cache` in `SharedInMemoryCache`, which caches partition file statuses in memory.

3. The last usage is `hadoopJobMetadata` in `SparkEnv`, which uses Guava Cache to build a soft-reference map.

The goal of this PR is to use Caffeine instead of Guava Cache, because Caffeine is faster than Guava Cache in benchmarks. The main changes are as follows (a minimal before/after sketch follows this list):

1. Add the Caffeine dependency to the Maven `pom.xml`

2. Use Caffeine instead of Guava `LoadingCache`, the manually populated cache, and the soft-reference map in `SparkEnv`

3. Add `LocalCacheBenchmark` to compare the `LoadingCache` performance of Guava Cache and Caffeine
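
As referenced above, here is a minimal before/after sketch of the `maximumSize` case, using the public Guava and Caffeine builder APIs; the object name, keys, and values are illustrative only and are not Spark's actual caches.
```scala
import com.google.common.cache.{CacheBuilder, CacheLoader => GuavaCacheLoader, LoadingCache => GuavaLoadingCache}
import com.github.benmanes.caffeine.cache.{Caffeine, CacheLoader => CaffeineCacheLoader, LoadingCache => CaffeineLoadingCache}

object CacheMigrationSketch {
  // Before: a Guava LoadingCache with a maximumSize eviction policy.
  val guavaCache: GuavaLoadingCache[String, Integer] = CacheBuilder.newBuilder()
    .maximumSize(1000)
    .build(new GuavaCacheLoader[String, Integer] {
      override def load(key: String): Integer = Integer.valueOf(key.length)
    })

  // After: the Caffeine equivalent with the same eviction policy and loader.
  val caffeineCache: CaffeineLoadingCache[String, Integer] = Caffeine.newBuilder()
    .maximumSize(1000)
    .build(new CaffeineCacheLoader[String, Integer] {
      override def load(key: String): Integer = Integer.valueOf(key.length)
    })

  def main(args: Array[String]): Unit = {
    println(guavaCache.get("spark"))    // 5
    println(caffeineCache.get("spark")) // 5
  }
}
```
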

### Why are the changes needed?
`Caffeine` is faster than `Guava Cache` in benchmarks.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Actions CI
- Added `LocalCacheBenchmark` to compare the `LoadingCache` performance of Guava Cache and Caffeine

Closes #31517 from LuciferYang/guava-cache-to-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Holden Karau <hkarau@netflix.com>
(commit: 01cf6f4)
The file was added core/benchmarks/LocalCacheBenchmark-jdk11-results.txt
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntimeSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java (diff)
The file was added core/benchmarks/LocalCacheBenchmark-results.txt
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified pom.xml (diff)
The file was modified sql/catalyst/pom.xml (diff)
The file was modified sql/core/pom.xml (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileStatusCache.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerId.scala (diff)
The file was modified common/network-shuffle/pom.xml (diff)
The file was added core/src/test/scala/org/apache/spark/LocalCacheBenchmark.scala
The file was modified core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
The file was modified resource-managers/kubernetes/core/pom.xml (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/ApplicationCache.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkEnv.scala (diff)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java (diff)
The file was modified core/pom.xml (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallbackSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntime.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala (diff)