Changes

Summary

  1. [SPARK-35434][BUILD] Upgrade scalatestplus artifacts to 3.2.9.0 (commit: 8c70c17) (details)
  2. [SPARK-35263][TEST] Refactor ShuffleBlockFetcherIteratorSuite to reduce (commit: 186477c) (details)
  3. [SPARK-35398][SQL] Simplify the way to get classes from (commit: b1493d8) (details)
  4. [SPARK-35421][SS] Remove redundant ProjectExec from streaming queries (commit: 0b3758e) (details)
  5. [SPARK-35106][CORE][SQL] Avoid failing rename caused by destination (commit: a72d05c) (details)
  6. [SPARK-35368][SQL] Update histogram statistics for RANGE operator for (commit: 46f7d78) (details)
  7. [SPARK-35418][SQL] Add sentences function to functions.{scala,py} (commit: 9283beb) (details)
  8. [SPARK-35362][SQL] Update null count in the column stats for UNION (commit: 1214213) (details)
  9. [SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up (commit: 52e3cf9) (details)
  10. [SPARK-35338][PYTHON] Separate arithmetic operations into data type (commit: d1b24d8) (details)
  11. [SPARK-35438][SQL][DOCS] Minor documentation fix for window physical (commit: 586caae) (details)
  12. Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data (commit: d44e6c7) (details)
  13. [SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit (commit: c064805) (details)
  14. [SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format (commit: 7eaabf4) (details)
  15. [SPARK-35338][PYTHON] Separate arithmetic operations into data type (commit: a970f85) (details)
  16. [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as (commit: de59e01) (details)
  17. [SPARK-27991][CORE] Defer the fetch request on Netty OOM (commit: 00b63c8) (details)
  18. [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty (commit: bdd8e1d) (details)
  19. [SPARK-35457][BUILD] Bump ANTLR runtime version to 4.8 (commit: e170e63) (details)
  20. [SPARK-35424][SHUFFLE] Remove some useless code in the (commit: 4869e43) (details)
  21. [SPARK-35459][SQL][TESTS] Move `AvroRowReaderSuite` to a separate file (commit: 2bd3254) (details)
  22. [SPARK-35373][BUILD][FOLLOWUP] Fix "binary operator expected" error on (commit: 3c3533d) (details)
  23. [SPARK-35458][BUILD] Use ` > /dev/null` to replace `-q` in shasum (commit: 38fbc0b) (details)
  24. [SPARK-35462][BUILD][K8S] Upgrade Kubernetes-client to 5.4.0 to support (commit: 3757c18) (details)
  25. [SPARK-35463][BUILD] Skip checking checksum on a system without `shasum` (commit: 8e13b8c) (details)
  26. [SPARK-35364][PYTHON] Renaming the existing Koalas related codes (commit: 6b912e4) (details)
Commit 8c70c175455219096622653d37fe8642a722e9d1 by dhyun
[SPARK-35434][BUILD] Upgrade scalatestplus artifacts to 3.2.9.0

### What changes were proposed in this pull request?

This PR upgrades the scalatestplus artifacts and scalacheck.

### Why are the changes needed?

The scalatestplus artifacts Spark uses are two years old, and these artifacts have since been renamed, so let's follow up.
Also, the latest releases seem to support Scala 3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed on my repository.

Closes #32581 from sarutak/upgrade-scalatestplus.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 8c70c17)
The file was modified pom.xml (diff)
Commit 186477c60e9cad71434b15fd9e08789740425d59 by mridulatgmail.com
[SPARK-35263][TEST] Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

### What changes were proposed in this pull request?
Introduce new shared methods to `ShuffleBlockFetcherIteratorSuite` to replace copy-pasted code. Use modern, Scala-like Mockito `Answer` syntax.

### Why are the changes needed?
`ShuffleFetcherBlockIteratorSuite` has tons of duplicate code, like https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala#L172-L185 . It's challenging to tell what the interesting parts are vs. what is just being set to some default/unused value.

Similarly but not as bad, there are many calls like the following
```
verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), any())
when(transfer.fetchBlocks(any(), any(), any(), any(), any(), any())).thenAnswer ...
```

These changes result in about 10% reduction in both lines and characters in the file:
```bash
# Before
> wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
    1063    3950   43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala

# After
> wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
     928    3609   39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
```

It also helps readability, e.g.:
```
    val iterator = createShuffleBlockIteratorWithDefaults(
      transfer,
      blocksByAddress,
      maxBytesInFlight = 1000L
    )
```
Now I can clearly tell that `maxBytesInFlight` is the main parameter we're interested in here.
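For readers unfamiliar with the pattern, here is a self-contained sketch of the defaults-plus-named-arguments approach (illustrative signature only, not the suite's actual helper):
```scala
// Illustrative sketch: push every "don't care" argument into a default value so
// that each test names only the knob it exercises. Names here are hypothetical.
def createShuffleBlockIteratorWithDefaults(
    blockIds: Seq[String],
    maxBytesInFlight: Long = Long.MaxValue,
    maxReqsInFlight: Int = Int.MaxValue,
    detectCorrupt: Boolean = true): Iterator[String] = {
  // The real helper wires up mocks and builds a ShuffleBlockFetcherIterator;
  // this stand-in just returns the ids so the sketch runs on its own.
  blockIds.iterator
}

// Call sites then read like the example above: only the interesting parameter appears.
val iterator = createShuffleBlockIteratorWithDefaults(
  Seq("shuffle_0_0_0"), maxBytesInFlight = 1000L)
```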

### Does this PR introduce _any_ user-facing change?
No, test only. There aren't even any behavior changes, just refactoring.

### How was this patch tested?
Unit tests pass.

Closes #32389 from xkrogen/xkrogen-spark-35263-refactor-shuffleblockfetcheriteratorsuite.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 186477c)
The file was modified core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
Commit b1493d82dd9dfd3ffa3022e37e6c5ea592ab7546 by yamamuro
[SPARK-35398][SQL] Simplify the way to get classes from ClassBodyEvaluator in `CodeGenerator.updateAndGetCompilationStats` method

### What changes were proposed in this pull request?
SPARK-35253 upgraded janino from 3.0.16 to 3.1.4. In this version, `ClassBodyEvaluator` provides a `getBytecodes` method that returns
the mapping from `ClassFile#getThisClassName` to `ClassFile#toByteArray` directly, so we no longer need to obtain this variable via the reflection API.

So the main purpose of this PR is to simplify the way `bytecodes` are obtained from `ClassBodyEvaluator` in the `CodeGenerator#updateAndGetCompilationStats` method.
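As a hedged sketch (not the exact `CodeGenerator` code), the simplified path looks roughly like this, assuming the `getBytecodes` signature described above:
```scala
import scala.collection.JavaConverters._
import org.codehaus.janino.ClassBodyEvaluator

// getBytecodes maps each generated class name to its bytecode array, so metrics
// such as generated-class size can be computed without any reflection.
def compiledClassSizes(evaluator: ClassBodyEvaluator): Seq[(String, Int)] =
  evaluator.getBytecodes.asScala.toSeq.map { case (className, bytes) =>
    className -> bytes.length
  }
```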

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manual test:

1. Define a code fragment to be tested, for example:
```
    val codeBody = s"""
        public java.lang.Object generate(Object[] references) {
          return new TestMetricCode(references);
        }

        class TestMetricCode {

          public TestMetricCode(Object[] references) {
          }

          public long sumOfSquares(long left, long right) {
            return left * left + right * right;
          }
        }
      """
```
2. Create a `ClassBodyEvaluator` and `cook` the `codeBody` as above; the process of creating the `ClassBodyEvaluator` can be extracted from the `CodeGenerator#doCompile` method.

3. Get `bytecodes` using the `ClassBodyEvaluator#getBytecodes` API (after this PR) and the reflection API (before this PR) respectively, then assert that they are the same. If the `bytecodes` have not changed, we can be sure that the metrics state will not change either. An example of the test code follows:
```
    import scala.collection.JavaConverters._
    val bytecodesFromApi = evaluator.getBytecodes.asScala
    val bytecodesFromReflectionApi = {
      val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc")
      scField.setAccessible(true)
      val compiler = scField.get(evaluator).asInstanceOf[SimpleCompiler]
      val loader = compiler.getClassLoader.asInstanceOf[ByteArrayClassLoader]
      val classesField = loader.getClass.getDeclaredField("classes")
      classesField.setAccessible(true)
      classesField.get(loader).asInstanceOf[java.util.Map[String, Array[Byte]]].asScala
    }

    assert(bytecodesFromApi == bytecodesFromReflectionApi)
```

Closes #32536 from LuciferYang/SPARK-35253-FOLLOWUP.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: b1493d8)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
Commit 0b3758e8cdb3eaa9d55ce3b41ecad5fa01567343 by wenchen
[SPARK-35421][SS] Remove redundant ProjectExec from streaming queries with V2Relation

### What changes were proposed in this pull request?

This PR fixes an issue where streaming queries with a V2Relation can have a redundant `ProjectExec` in their physical plan.
You can easily reproduce this issue with the following code.
```
import org.apache.spark.sql.streaming.Trigger

val query = spark.
  readStream.
  format("rate").
  option("rowsPerSecond", 1000).
  option("rampUpTime", "10s").
  load().
  selectExpr("timestamp", "100",  "value").
  writeStream.
  format("console").
  trigger(Trigger.ProcessingTime("5 seconds")).
  // trigger(Trigger.Continuous("5 seconds")). // You can reproduce with continuous processing too.
  outputMode("append").
  start()
```
The plan tree is here.
![ss-before](https://user-images.githubusercontent.com/4736016/118454996-ec439800-b733-11eb-8cd8-ed8af73a91b8.png)

### Why are the changes needed?

For better performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I ran the same code as above and got the following plan tree.
![ss-after](https://user-images.githubusercontent.com/4736016/118455755-1bf2a000-b734-11eb-999e-4b8c19ad34d7.png)

Closes #32570 from sarutak/fix-redundant-projectexec.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 0b3758e)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit a72d05c7e632fbb0d8a6082c3cacdf61f36518b4 by wenchen
[SPARK-35106][CORE][SQL] Avoid failing rename caused by destination directory not exist

### What changes were proposed in this pull request?

1. In HadoopMapReduceCommitProtocol, create parent directory before renaming custom partition path staging files
2. In InMemoryCatalog and HiveExternalCatalog, create new partition directory before renaming old partition path
3. Check the return value of FileSystem#rename; if it is false, throw an exception to avoid silent data loss caused by a rename failure
4. Change DebugFilesystem#rename behavior to match HDFS's behavior (return false without renaming when the dst parent directory does not exist)

### Why are the changes needed?

Depending on the FileSystem#rename implementation, when the destination directory does not exist, the file system may
1. return false without renaming the file or throwing an exception (e.g. HDFS), or
2. create the destination directory, rename the files, and return true (e.g. LocalFileSystem)

In the first case above, renames in HadoopMapReduceCommitProtocol for custom partition paths will fail silently if the destination partition path does not exist. Failed renames can happen when
1. dynamicPartitionOverwrite == true and the custom partition path directories are deleted by the job before the rename; or
2. the custom partition path directories do not exist before the job; or
3. something else goes wrong when the file system handles `rename`

The renames in InMemoryCatalog and HiveExternalCatalog for partition renaming have a similar issue.
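A minimal sketch of the defensive rename pattern described above (a hypothetical helper, not the exact Spark code):
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

def renameOrFail(fs: FileSystem, src: Path, dst: Path): Unit = {
  // HDFS returns false (without renaming) when the destination parent is missing,
  // so create it first ...
  val parent = dst.getParent
  if (parent != null && !fs.exists(parent)) {
    fs.mkdirs(parent)
  }
  // ... and surface any remaining failure instead of losing data silently.
  if (!fs.rename(src, dst)) {
    throw new IllegalStateException(s"Failed to rename $src to $dst")
  }
}
```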

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified DebugFilesystem#rename, and added new unit tests.

Without the fix in src code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed:
InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>                   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: insert overwrite with custom partition path
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>                   struct<>
![2,1,1]
```

InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path
```
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 1 ==
!struct<>                   struct<i:int,part1:int,part2:int>
[1,1,1]                    [1,1,1]
![1,1,2]
```

InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```

InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```

AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 ==   == Spark Answer - 0 ==
struct<>                   struct<>
![3,123,3]
```

Closes #32530 from YuzhouSun/SPARK-35106.

Authored-by: Yuzhou Sun <yuzhosun@amazon.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: a72d05c)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/DebugFilesystem.scala (diff)
Commit 46f7d780d3a50ab2212030d7e16f0b5238d3fc25 by yamamuro
[SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

### What changes were proposed in this pull request?
Update histogram statistics for RANGE operator stats estimation.

### Why are the changes needed?
If histogram optimization is enabled, these statistics can be used in various cost-based optimizations.
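One hedged way to observe the estimated statistics of a RANGE operator from a spark-shell session (the config keys are existing Spark SQL options; whether a histogram shows up depends on this change):
```scala
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")

// Range is a leaf operator, so its stats are computed directly from start/end/step.
val plan = spark.range(1, 1000, 1).queryExecution.optimizedPlan
plan.stats.attributeStats.foreach { case (attr, colStat) =>
  println(s"${attr.name}: histogram = ${colStat.histogram}")
}
```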

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs. Manual test.

Closes #32498 from shahidki31/shahid/histogram.

Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 46f7d78)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala (diff)
Commit 9283bebbbd5d8fdf2ed03d886773cc851fdd6094 by sarutak
[SPARK-35418][SQL] Add sentences function to functions.{scala,py}

### What changes were proposed in this pull request?

This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.

### Why are the changes needed?

This function can be only used from SQL for now.
It's good if we can use this function from Scala/Python code as well as SQL.

### Does this PR introduce _any_ user-facing change?

Yes. Users can use this function from Scala and Python.
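A small usage sketch from a spark-shell session, assuming the new Scala signature mirrors the SQL `sentences(str, language, country)` function:
```scala
import org.apache.spark.sql.functions.{lit, sentences}
import spark.implicits._

val df = Seq("Hi there! Good morning.").toDF("str")
// Splits the text into sentences, each sentence being an array of words.
df.select(sentences($"str", lit("en"), lit("US"))).show(false)
// expected result: [[Hi, there], [Good, morning]]
```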

### How was this patch tested?

New test.

Closes #32566 from sarutak/sentences-function.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 9283beb)
The file was modified python/pyspark/sql/functions.pyi (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/sql/functions.py (diff)
Commit 12142130cd551bd832a26190a0a3f608e66b3a8d by yamamuro
[SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation

### What changes were proposed in this pull request?
Update the column stats for UNION operator stats estimation.

### Why are the changes needed?
This is a follow-up PR to also update the null count in the UNION operator stats estimation: https://github.com/apache/spark/pull/30334

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated UTs, manual testing

Closes #32494 from shahidki31/shahid/updateNullCountForUnion.

Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 1214213)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/UnionEstimationSuite.scala (diff)
Commit 52e3cf9ff50b4209e29cb06df09b1ef3a18bc83b by tgraves
[SPARK-35093][SQL] AQE now uses newQueryStage plan as key for looking up cached exchanges for re-use

### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.

This PR simply changes the key used to look up cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than the canonicalized form of the original exchange.
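Conceptually (hypothetical names only, not the actual `AdaptiveSparkPlanExec` code), the reuse lookup moves from the original exchange's plan to the plan of the stage that was actually created:
```scala
import scala.collection.mutable
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.QueryStageExec

// A plugin may have replaced the original exchange with a columnar variant, so the
// cache must be keyed on what the new stage actually wraps, not on the old exchange.
def reuseOrRegister(
    stageCache: mutable.Map[SparkPlan, QueryStageExec],
    newStage: QueryStageExec): QueryStageExec =
  stageCache.getOrElseUpdate(newStage.plan.canonicalized, newStage)
```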

### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange and then Spark replaces it with a cached columnar exchange, which is not compatible, and this causes queries to fail.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this but it would involve implementing a mock columnar exchange in the tests so would be quite a bit of work. If anyone has ideas on other ways to test this I am happy to hear them.

Closes #32195 from andygrove/SPARK-35093.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(commit: 52e3cf9)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit d1b24d8aba8317c62542a81ca55c12700a07cb80 by ueshin
[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change the behavior based on the data type.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32469 from xinrong-databricks/datatypeop_arith.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: d1b24d8)
The file was added python/pyspark/pandas/data_type_ops/num_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/__init__.py
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was added python/pyspark/pandas/data_type_ops/date_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
The file was modified python/setup.py (diff)
The file was added python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
The file was added python/pyspark/pandas/data_type_ops/string_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
The file was added python/pyspark/pandas/data_type_ops/categorical_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/testing_utils.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was added python/pyspark/pandas/data_type_ops/base.py
The file was added python/pyspark/pandas/data_type_ops/__init__.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was added python/pyspark/pandas/data_type_ops/boolean_ops.py
The file was modified python/pyspark/testing/pandasutils.py (diff)
The file was added python/pyspark/pandas/data_type_ops/datetime_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
The file was modified python/pyspark/pandas/tests/test_series_datetime.py (diff)
Commit 586caae3cc0f43d0100335a25ada503c023c45f5 by yamamuro
[SPARK-35438][SQL][DOCS] Minor documentation fix for window physical operator

### What changes were proposed in this pull request?

As per the title. Fixed two places where the documentation for the window operator had errors.

### Why are the changes needed?

Helps people read the window operator code more easily in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32585 from c21/minor-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 586caae)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala (diff)
Commit d44e6c7f10528556dc5a64527e6f67e2ae7947fc by ueshin
Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures"

This reverts commit d1b24d8aba8317c62542a81ca55c12700a07cb80.
(commit: d44e6c7)
The file was modified python/pyspark/pandas/tests/test_series_datetime.py (diff)
The file was removed python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
The file was removed python/pyspark/pandas/data_type_ops/string_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
The file was removed python/pyspark/pandas/data_type_ops/datetime_ops.py
The file was modified python/pyspark/pandas/base.py (diff)
The file was removed python/pyspark/pandas/data_type_ops/__init__.py
The file was modified python/setup.py (diff)
The file was removed python/pyspark/pandas/data_type_ops/categorical_ops.py
The file was modified python/pyspark/testing/pandasutils.py (diff)
The file was removed python/pyspark/pandas/tests/data_type_ops/__init__.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was removed python/pyspark/pandas/data_type_ops/boolean_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/testing_utils.py
The file was removed python/pyspark/pandas/data_type_ops/date_ops.py
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was removed python/pyspark/pandas/data_type_ops/base.py
The file was removed python/pyspark/pandas/data_type_ops/num_ops.py
The file was removed python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
Commit c06480519eb59197b433087c293870e467caa2a0 by gurwls223
[SPARK-35450][INFRA] Follow checkout-merge way to use the latest commit for linter, or other workflows

### What changes were proposed in this pull request?

Follows the checkout-merge way to use the latest commit for the linter and other workflows.

### Why are the changes needed?

For the linter and other workflows besides build-and-tests, we should follow the checkout-merge way to use the latest commit; otherwise, they could run against old settings.

### Does this PR introduce _any_ user-facing change?

No, this is a dev-only change.

### How was this patch tested?

Existing tests.

Closes #32597 from ueshin/issues/SPARK-35450/infra.

Lead-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: c064805)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 7eaabf4df5aa8dd6ee410b4e6cd77d3df8eacc4b by gurwls223
[SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format

### What changes were proposed in this pull request?

This PR avoids using an f-string format, which is a feature new in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, it breaks the unofficial support of Python 3.5.

This specific f-string looks unnecessary, and it doesn't seem worth dropping that unofficial support just because of one string format in an error message.

**NOTE** that this PR doesn't mean we're maintaining Python 3.5, since we already dropped it. It just seems like too much to remove that unofficial support only because of one string format in an error message.

### Why are the changes needed?

To keep unofficial Python 3.5 support

### Does this PR introduce _any_ user-facing change?

Officially nope.

### How was this patch tested?

Ran the linters.

Closes #32598 from HyukjinKwon/SPARK-35408=followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 7eaabf4)
The file was modified python/pyspark/sql/dataframe.py (diff)
Commit a970f8505dabe739ab0c71053e40d205ccb6edb6 by ueshin
[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures

### What changes were proposed in this pull request?

The PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`

DataTypeOps and subclasses are introduced.

The existing behaviors of each arithmetic operation should be preserved.

### Why are the changes needed?

Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change the behavior based on the data type.

Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.

Closes #32596 from xinrong-databricks/datatypeop_arith_fix.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: a970f85)
The file was added python/pyspark/pandas/data_type_ops/__init__.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_string_ops.py
The file was added python/pyspark/pandas/data_type_ops/num_ops.py
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was added python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_date_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
The file was added python/pyspark/pandas/data_type_ops/base.py
The file was added python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was modified python/pyspark/pandas/tests/test_series_datetime.py (diff)
The file was modified python/pyspark/testing/pandasutils.py (diff)
The file was added python/pyspark/pandas/data_type_ops/categorical_ops.py
The file was added python/pyspark/pandas/data_type_ops/datetime_ops.py
The file was added python/pyspark/pandas/data_type_ops/boolean_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/testing_utils.py
The file was modified python/pyspark/pandas/base.py (diff)
The file was added python/pyspark/pandas/data_type_ops/date_ops.py
The file was added python/pyspark/pandas/tests/data_type_ops/__init__.py
The file was added python/pyspark/pandas/data_type_ops/string_ops.py
The file was modified python/setup.py (diff)
Commit de59e01aa4853ef951da080c0d1908d53d133ebe by dhyun
[SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable

Kubernetes supports marking secrets and config maps as immutable to improve performance.

https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable
https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable

For K8s clusters that run many thousands of Spark applications, this can yield significant reduction in load on the kube-apiserver.

From the K8s docs:

> For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages:
> - protects you from accidental (or unwanted) updates that could cause applications outages
> - improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable.

For any secrets and config maps we create in Spark that are immutable, we could mark them as immutable by including the following when building the secret/config map
```
.withImmutable(true)
```
This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21
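As a hedged sketch, marking a config map immutable with the fabric8 builder API looks roughly like this (illustrative metadata and data; not the exact Spark builder code):
```scala
import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder}

val configMap: ConfigMap = new ConfigMapBuilder()
  .withNewMetadata()
    .withName("spark-conf-map")      // illustrative name
  .endMetadata()
  .addToData("spark.properties", "spark.app.name=example")
  .withImmutable(true)               // lets the kube-apiserver close its watches
  .build()
```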

### What changes were proposed in this pull request?
All K8s secrets and config maps created by Spark are marked "immutable".

### Why are the changes needed?
See description above.

### Does this PR introduce _any_ user-facing change?
Don't think so

### How was this patch tested?
Augmented existing unit tests.

Closes #32588 from ashrayjain/patch-1.

Authored-by: Ashray Jain <ashrayjain@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: de59e01)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/ClientSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/KerberosConfDriverFeatureStep.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStep.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/PodTemplateConfigMapStepSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverKubernetesCredentialsFeatureStep.scala (diff)
Commit 00b63c8dc2170cfbc3f3ea7882dfd417d7fde744 by wenchen
[SPARK-27991][CORE] Defer the fetch request on Netty OOM

### What changes were proposed in this pull request?

This PR proposes a workaround to address the Netty OOM issue (SPARK-24989, SPARK-27991):

Basically, `ShuffleBlockFetcherIterator` would catch the `OutOfDirectMemoryError` from Netty and then set a global flag for the shuffle module. Any pending fetch requests would be deferred while there are in-flight requests, until the flag is unset. The flag is unset once a fetch request succeeds.

Note that catching the Netty OOM rather than aborting the application is feasible because Netty manages its own memory region (off-heap by default) separately, so a Netty OOM doesn't mean Spark itself is short of memory.
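A simplified sketch of the defer-on-OOM idea (hypothetical names; the real logic lives inside `ShuffleBlockFetcherIterator`):
```scala
import java.util.concurrent.atomic.AtomicBoolean
import io.netty.util.internal.OutOfDirectMemoryError

object NettyMemoryGate {
  // Global flag for the shuffle module: true while Netty is under direct-memory pressure.
  val nettyOutOfMemory = new AtomicBoolean(false)
}

def onFetchFailure(t: Throwable, deferRequest: () => Unit, fail: Throwable => Unit): Unit =
  t match {
    case _: OutOfDirectMemoryError =>
      // Don't fail the task: remember the pressure and replay the request later,
      // once an in-flight request completes and frees Netty's direct memory.
      NettyMemoryGate.nettyOutOfMemory.set(true)
      deferRequest()
    case other =>
      fail(other)
  }

def onFetchSuccess(): Unit =
  // A successful fetch means Netty has memory again; unset the flag.
  NettyMemoryGate.nettyOutOfMemory.set(false)
```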

### Why are the changes needed?

The Netty OOM issue is a corner case. It usually happens in large-scale clusters, where a reduce task can fetch shuffle blocks from hundreds of nodes concurrently in a short time. Internally, we found a cluster that had created 260+ clients within 6s before throwing a Netty OOM.

Although Spark has configurations, e.g., `spark.reducer.maxReqsInFlight`, to tune the number of concurrent requests, it's usually not an easy decision for the user to pick a reasonable value given the workloads, machine resources, etc. With this fix, Spark heals the Netty memory issue itself without any specific configuration.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #32287 from Ngone51/SPARK-27991.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 00b63c8)
The file was modified core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/util/NettyUtils.java (diff)
The file was modified core/src/main/scala/org/apache/spark/shuffle/BlockStoreShuffleReader.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit bdd8e1dbb1c526e5ce08e294b5b11ace752b2e2e by wenchen
[SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory

### What changes were proposed in this pull request?

CTAS with a LOCATION clause acts as an insert overwrite. This can cause problems when there are subdirectories within the location directory,
and it has caused some users to accidentally wipe out directories with very important data. We should not allow CTAS with LOCATION to point to a non-empty directory.

### Why are the changes needed?

Hive already handled this scenario: HIVE-11319

Steps to reproduce:

```scala
sql("""create external table  `demo_CTAS`( `comment` string) PARTITIONED BY (`col1` string, `col2` string) STORED AS parquet location '/tmp/u1/demo_CTAS'""")
sql("""INSERT OVERWRITE TABLE demo_CTAS partition (col1='1',col2='1') VALUES ('abc')""")
sql("select* from demo_CTAS").show
sql("""create table ctas1 location '/tmp/u2/ctas1' as select * from demo_CTAS""")
sql("select* from ctas1").show
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```

Before the fix: both create table operations succeed, but the values in table ctas1 are accidentally replaced by ctas2.

After the fix: `create table ctas2...` will throw `AnalysisException`:

```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```

### Does this PR introduce _any_ user-facing change?
Yes, if the location directory is not empty, CTAS with location will throw AnalysisException

```
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```

`CREATE TABLE AS SELECT` with a non-empty `LOCATION` will throw an `AnalysisException`. To restore the behavior before Spark 3.2, set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `true`; the default value is `false`.
Updated SQL migration guide.
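If the old behavior is really wanted, the legacy flag named in the error message can be set before running the CTAS (sketch; the flag defaults to `false`):
```scala
sql("set spark.sql.legacy.allowNonEmptyLocationInCTAS=true")
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```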

### How was this patch tested?
Test case added in SQLQuerySuite.scala

Closes #32411 from vinodkc/br_fixCTAS_nonempty_dir.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: bdd8e1d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
Commit e170e63955128629bd73d60285e67bfc33c39cb3 by yamamuro
[SPARK-35457][BUILD] Bump ANTLR runtime version to 4.8

### What changes were proposed in this pull request?
This PR changes the antlr4-runtime version from 4.8-1 to 4.8.

### Why are the changes needed?
Version 4.8 is the official release version, with a proper release note (see https://github.com/antlr/antlr4/releases) and artifacts listed in https://www.antlr.org/download/index.html.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Will rely on tests in the PR.

Closes #32603 from bozhang2820/antlr-4.8.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: e170e63)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 4869e437da0d742756eb83329e9fe8f51a825446 by gurwls223
[SPARK-35424][SHUFFLE] Remove some useless code in the ExternalBlockHandler

### What changes were proposed in this pull request?

Remove some useless code in the ExternalBlockHandler.

### Why are the changes needed?
There is some useless code in the ExternalBlockHandler, so we may remove it.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Existing unit tests.

Closes #32571 from weixiuli/SPARK-35424.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 4869e43)
The file was modified common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java (diff)
Commit 2bd32548f520ad3fc119e54ee3b4b4f27d33e046 by gurwls223
[SPARK-35459][SQL][TESTS] Move `AvroRowReaderSuite` to a separate file

### What changes were proposed in this pull request?
Move `AvroRowReaderSuite` out from `AvroSuite.scala` and place it to `AvroRowReaderSuite.scala`.

### Why are the changes needed?
To improve code maintenance. Usually, independent test suites are placed in separate files.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *AvroRowReaderSuite"
$ build/sbt "test:testOnly *AvroV1Suite"
$ build/sbt "test:testOnly *AvroV2Suite"
```

Closes #32607 from MaxGekk/move-AvroRowReaderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 2bd3254)
The file was added external/avro/src/test/scala/org/apache/spark/sql/avro/AvroRowReaderSuite.scala
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
Commit 3c3533d845bd121a9e094ac6c17a0fb3684e269c by srowen
[SPARK-35373][BUILD][FOLLOWUP] Fix "binary operator expected" error on build/mvn

### What changes were proposed in this pull request?
Change `$(command -v curl)` to `"$(command -v curl)"`.

### Why are the changes needed?
We need to change `$(command -v curl)` to `"$(command -v curl)"` to make sure it works when `curl` or `wget` is not installed; otherwise the following is raised:
`build/mvn: line 56: [: /root/spark/build/apache-maven-3.6.3-bin.tar.gz: binary operator expected`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```
apt remove curl
rm -f build/apache-maven-3.6.3-bin.tar.gz
rm -r build/apache-maven-3.6.3-bin
mvn -v
```

Closes #32608 from Yikun/patch-6.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 3c3533d)
The file was modified build/mvn (diff)
Commit 38fbc0b4f77dbdd1ed9a69ca7ab170c7ccee04bc by srowen
[SPARK-35458][BUILD] Use ` > /dev/null` to replace `-q` in shasum

### What changes were proposed in this pull request?
Use ` > /dev/null` to replace `-q` in shasum validation.

### Why are the changes needed?
PR https://github.com/apache/spark/pull/32505 added the shasum check on maven. `shasum -a 512 -q -c xxx.sha` is used to validate the checksum; the `-q` arg means "don't print OK for each successfully verified file", but the `-q` arg was only introduced in shasum 6.x.

So we got `Unknown option: q`.

```
➜  ~ uname -a
Darwin MacBook.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Apr 12 20:57:45 PDT 2021; root:xnu-6153.141.28.1~1/RELEASE_X86_64 x86_64
➜  ~ shasum -v
5.84
➜  ~ shasum -q
Unknown option: q
Type shasum -h for help
```

This makes the ARM CI fail:
[1] https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`shasum -a 512 -c wrong.sha > /dev/null` returns exit code 1 without printing.
`shasum -a 512 -c right.sha > /dev/null` returns exit code 0 without printing.
e2e test:
```
rm -f build/apache-maven-3.6.3-bin.tar.gz
rm -r build/apache-maven-3.6.3-bin
mvn -v
```

Closes #32604 from Yikun/patch-5.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 38fbc0b)
The file was modified build/mvn (diff)
Commit 3757c1803d02dcc957a4012977ff80f2dcc28eb0 by dhyun
[SPARK-35462][BUILD][K8S] Upgrade Kubernetes-client to 5.4.0 to support K8s 1.21 models

### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` from 5.3.1 to 5.4.0 to support K8s 1.21 models officially.

### Why are the changes needed?

`kubernetes-client` 5.4.0 has `Kubernetes Model v1.21.0`
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.0

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the CIs including Jenkins K8s IT.
- https://github.com/apache/spark/pull/32612#issuecomment-845456039

I tested K8s IT with the following versions.
- minikube version: v1.20.0
- K8s Client Version: v1.21.0
- Server Version: v1.21.0

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 18 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32612 from dongjoon-hyun/SPARK-35462.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 3757c18)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 8e13b8c3d233b910a6bebbb89fb58fcc6b299e9f by dhyun
[SPARK-35463][BUILD] Skip checking checksum on a system without `shasum`

### What changes were proposed in this pull request?

Not every build system has `shasum`. This PR aims to skip checksum checks on a system without `shasum`.

### Why are the changes needed?

**PREPARE**
```
$ docker run -it --rm -v $PWD:/spark openjdk:11-slim /bin/bash
root@a0e001a6e50f:/# cd /spark/
root@a0e001a6e50f:/spark# apt-get update
root@a0e001a6e50f:/spark# apt-get install curl
root@a0e001a6e50f:/spark# build/mvn clean
```

**BEFORE (Failure due to `command not found`)**
```
root@a0e001a6e50f:/spark# build/mvn clean
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download
exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
Veryfing checksum from /spark/build/apache-maven-3.6.3-bin.tar.gz.sha512
build/mvn: line 81: shasum: command not found
Bad checksum from https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
```

**AFTER**
```
root@a0e001a6e50f:/spark# build/mvn clean
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.10/scala-2.12.10.tgz
Skipping checksum because shasum is not installed.
exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz?action=download
exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz.sha512
Skipping checksum because shasum is not installed.
Using `mvn` from path: /spark/build/apache-maven-3.6.3/bin/mvn
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the build.

### How was this patch tested?

Manually with the above process.

Closes #32613 from dongjoon-hyun/SPARK-35463.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 8e13b8c)
The file was modified build/mvn (diff)
Commit 6b912e4179dfaf4f9868bf22235a086709e23fd2 by ueshin
[SPARK-35364][PYTHON] Renaming the existing Koalas related codes

### What changes were proposed in this pull request?

There is still naming related to Koalas in test and function names. This PR addresses them to fit pandas-on-Spark.
- kdf -> psdf
- kser -> psser
- kidx -> psidx
- kmidx -> psmidx
- to_koalas() -> to_pandas_on_spark()

### Why are the changes needed?

This is because the name Koalas is no longer used in PySpark.

### Does this PR introduce _any_ user-facing change?

The `to_koalas()` function is renamed to `to_pandas_on_spark()`.

### How was this patch tested?

Tested locally and manually.
After changing the related naming, I checked them one by one.

Closes #32516 from itholic/SPARK-35364.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 6b912e4)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_plotly.py (diff)
The file was modified python/pyspark/testing/pandasutils.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/testing_utils.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_indexops_spark.py (diff)
The file was modified python/pyspark/pandas/tests/test_utils.py (diff)
The file was modified python/pyspark/pandas/ml.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/spark/accessors.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_plotly.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/tests/test_repr.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe_conversion.py (diff)
The file was modified python/pyspark/pandas/tests/test_window.py (diff)
The file was modified python/pyspark/pandas/tests/test_config.py (diff)
The file was modified python/pyspark/pandas/tests/test_series_string.py (diff)
The file was modified python/pyspark/pandas/tests/test_csv.py (diff)
The file was modified python/pyspark/pandas/indexes/datetimes.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/strings.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_date_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot.py (diff)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/tests/test_stats.py (diff)
The file was modified python/pyspark/pandas/__init__.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/tests/test_expanding.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_datetime.py (diff)
The file was modified python/pyspark/pandas/tests/test_reshape.py (diff)
The file was modified python/pyspark/pandas/tests/test_default_index.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/plot/core.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/tests/test_internal.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot.py (diff)
The file was modified python/pyspark/pandas/tests/test_frame_spark.py (diff)
The file was modified python/pyspark/pandas/tests/test_typedef.py (diff)
The file was modified python/pyspark/pandas/plot/matplotlib.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_category.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py (diff)
The file was modified python/pyspark/pandas/extensions.py (diff)
The file was modified python/pyspark/pandas/tests/test_extension.py (diff)
The file was modified python/pyspark/pandas/tests/test_sql.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_string_ops.py (diff)
The file was modified python/pyspark/pandas/plot/plotly.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe_spark_io.py (diff)
The file was modified python/pyspark/pandas/tests/test_numpy_compat.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/tests/test_namespace.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was modified python/pyspark/pandas/tests/test_series_datetime.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/tests/test_indexing.py (diff)
The file was modified python/pyspark/pandas/tests/test_series_conversion.py (diff)
The file was modified python/pyspark/pandas/tests/test_rolling.py (diff)
The file was modified python/pyspark/pandas/tests/test_groupby.py (diff)