Changes

Summary

  1. [SPARK-35558] Optimizes for multi-quantile retrieval (commit: 6f8c620) (details)
  2. [SPARK-35656][BUILD] Upgrade SBT to 1.5.3 (commit: 5e30666) (details)
  3. [SPARK-35654][CORE] Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop (commit: d4e32c8) (details)
  4. [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes (commit: b8740a1) (details)
  5. [SPARK-35599][PYTHON] Adjust `check_exact` parameter for older pd.testing (commit: 50f7686) (details)
  6. [SPARK-35100][ML][FOLLOWUP] AFT cleanup (commit: d8e37a8) (details)
  7. [SPARK-35619][ML] Refactor LinearRegression - make huber support virtual centering (commit: ffc61c6) (details)
  8. [SPARK-35660][BUILD][K8S] Upgrade kubernetes-client to 5.4.1 (commit: 6f2ffcc) (details)
  9. [SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation (commit: 7ce7aa4) (details)
  10. [SPARK-35665][SQL] Resolve UnresolvedAlias in CollectMetrics (commit: a70e66e) (details)
  11. [SPARK-35543][CORE] Fix memory leak in BlockManagerMasterEndpoint removeRdd (commit: 4534c0c) (details)
  12. [SPARK-35663][SQL] Add Timestamp without time zone type (commit: 33f2627) (details)
  13. [SPARK-35074][CORE] hardcoded configs move to config package (commit: 6c3b7f9) (details)
  14. [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes (commit: 04a8d2c) (details)
  15. [SPARK-35341][PYTHON] Introduce BooleanExtensionOps (commit: dfd8a8d) (details)
  16. [SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields (commit: 04418e1) (details)
  17. [SPARK-35603][R][DOCS] Add data source options link for R API (commit: 745756c) (details)
  18. [SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow (commit: f3dc549) (details)
Commit 6f8c62047cea125d52af5dad7fb5ad3eadb7f7d0 by srowen
[SPARK-35558] Optimizes for multi-quantile retrieval

### What changes were proposed in this pull request?
Optimizes the retrieval of approximate quantiles for an array of percentiles.
* Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
* Modifies the ApproximatePercentiles operator to call into the new method.

All formatting changes are the result of running ./dev/scalafmt

### Why are the changes needed?
The existing implementation makes repeated calls per input percentile, resulting in redundant computation.
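
For illustration (this is not code from the PR, which changes internals only), the user-facing call path that benefits is a single `approxQuantile` call with several probabilities, which goes through `StatFunctions` and the `QuantileSummaries` sketch:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: one multi-percentile query instead of one query per percentile.
val spark = SparkSession.builder().master("local[*]").appName("quantiles").getOrCreate()
val df = spark.range(0, 1000000).toDF("value")

// With this change the underlying sketch is traversed once for all probabilities.
val quantiles = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75, 0.95), 0.001)
// quantiles: Array[Double], same length as the probabilities array
```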

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests for the new method.

Closes #32700 from alkispoly-db/spark_35558_approx_quants_array.

Authored-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 6f8c620)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala (diff)
Commit 5e3066601078df4fcbf742db04dceb47aceda250 by dhyun
[SPARK-35656][BUILD] Upgrade SBT to 1.5.3

### What changes were proposed in this pull request?

This PR proposes to upgrade SBT to 1.5.3.

### Why are the changes needed?

This release seems to include a bug fix for Scala 2.13.6+ and Scala 3.
https://github.com/sbt/sbt/releases/tag/v1.5.3

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #32792 from sarutak/upgrade-sbt-1.5.3.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 5e30666)
The file was modified project/build.properties (diff)
Commit d4e32c896a6d3c71f61f96f0b4c5d98fc8730a21 by dhyun
[SPARK-35654][CORE] Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop

### What changes were proposed in this pull request?

This PR aims to change `DiskBlockManager` like the following to allow `ShuffleDataIO` to decide the behavior of shuffle file deletion.
```scala
- private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean)
+ private[spark] class DiskBlockManager(conf: SparkConf, var deleteFilesOnStop: Boolean)
```

### Why are the changes needed?

`SparkContext`:
1. creates `SparkEnv` (with `BlockManager` and its `DiskBlockManager`)
2. loads `ShuffleDataIO`
3. initializes the block manager.
```scala
_env = createSparkEnv(_conf, isLocal, listenerBus)

...
_shuffleDriverComponents = ShuffleDataIOUtils.loadShuffleDataIO(config).driver()
    _shuffleDriverComponents.initializeApplication().asScala.foreach { case (k, v) =>
      _conf.set(ShuffleDataIOUtils.SHUFFLE_SPARK_CONF_PREFIX + k, v)
    }
...

_env.blockManager.initialize(_applicationId)
...
```

`DiskBlockManager` is created first, in the `BlockManager` constructor, so we cannot change `deleteFilesOnStop` later from `ShuffleDataIO`. By switching to `var`, we can implement an enhanced shuffle data management feature via `ShuffleDataIO`, like https://github.com/apache/spark/pull/32730.
```scala
  val diskBlockManager = {
    // Only perform cleanup if an external service is not serving our shuffle files.
    val deleteFilesOnStop =
      !externalShuffleServiceEnabled || executorId == SparkContext.DRIVER_IDENTIFIER
    new DiskBlockManager(conf, deleteFilesOnStop)
  }
```
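
A hypothetical, Spark-internal sketch of what the `var` enables (this is not the actual code of https://github.com/apache/spark/pull/32730; the predicate below is assumed for illustration):

```scala
import org.apache.spark.SparkEnv

// With deleteFilesOnStop now a var, code that loads a custom ShuffleDataIO can opt out
// of shuffle-file deletion after the BlockManager has already been constructed.
val diskBlockManager = SparkEnv.get.blockManager.diskBlockManager
if (pluginManagesShuffleFiles) {  // assumed predicate, for illustration only
  diskBlockManager.deleteFilesOnStop = false  // keep files at stop; the plugin owns cleanup
}
```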

### Does this PR introduce _any_ user-facing change?

No. This is a private class.

### How was this patch tested?

N/A

Closes #32784 from dongjoon-hyun/SPARK-35654.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: d4e32c8)
The file was modified core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala (diff)
Commit b8740a1d1ef6f7e08873a37c7ff1b4e3980679f0 by viirya
[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes

### What changes were proposed in this pull request?

This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.

By executing `./dev/reformat-python` in the Spark home directory, all pandas API on Spark code is reformatted according to the static analysis rules.

### Why are the changes needed?

This reduces the cost of static analysis during development.

It has been used continuously for about a year in the Koalas project and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: b8740a1)
The file was modified python/pyspark/pandas/data_type_ops/date_ops.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/testing_utils.py (diff)
The file was modified dev/requirements.txt (diff)
The file was modified python/pyspark/pandas/tests/test_categorical.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified dev/lint-python (diff)
The file was modified dev/tox.ini (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_matplotlib.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/num_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/complex_ops.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_date_ops.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_frame_plot_plotly.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames.py (diff)
The file was added dev/reformat-python
The file was modified python/pyspark/pandas/tests/data_type_ops/test_binary_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_expanding.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/binary_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/string_ops.py (diff)
The file was modified python/pyspark/pandas/usage_logging/usage_logger.py (diff)
The file was modified python/pyspark/pandas/tests/plot/test_series_plot_plotly.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/tests/test_series.py (diff)
The file was modified python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/test_namespace.py (diff)
The file was modified python/pyspark/pandas/plot/core.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_groupby.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_string_ops.py (diff)
The file was modified python/pyspark/pandas/typedef/typehints.py (diff)
The file was modified python/pyspark/pandas/config.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/spark/accessors.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py (diff)
The file was modified python/pyspark/pandas/extensions.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe_spark_io.py (diff)
The file was modified python/pyspark/pandas/plot/matplotlib.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
Commit 50f7686de9fdf013e93f9598a2c12087916ddf07 by gurwls223
[SPARK-35599][PYTHON] Adjust `check_exact` parameter for older pd.testing

### What changes were proposed in this pull request?

Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests pass with all pandas versions.

### Why are the changes needed?

`pd.testing` utils are utilized in pandas-on-Spark tests.
Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`.  We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #32772 from xinrong-databricks/test_util.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 50f7686)
The file was modified python/pyspark/testing/pandasutils.py (diff)
Commit d8e37a8e4091b99438d1fdbe38a6149b68836c73 by ruifengz
[SPARK-35100][ML][FOLLOWUP] AFT cleanup

### What changes were proposed in this pull request?
Remove an unreachable code path.

### Why are the changes needed?
for succinctness

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuite

Closes #32765 from zhengruifeng/spark_35100_followup.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(commit: d8e37a8)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTBlockAggregator.scala (diff)
Commit ffc61c6af0a69cc5a868ea068f6678d484f6c3af by ruifengz
[SPARK-35619][ML] Refactor LinearRegression - make huber support virtual centering

### What changes were proposed in this pull request?
1. For huber, make it support virtual centering.
2. `LeastSquares` was implemented by optimizing the linear part and then estimating the intercept, so its aggregator and suite are simply reorganized to keep in line with the other aggregators.

### Why are the changes needed?
for better convergence
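
As a hedged sketch (the user-facing API is unchanged by this refactor; the tiny dataset below is made up), this is the huber path that the refactored block aggregator serves:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// Assumes an active SparkSession `spark`; illustrative three-row training set.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (3.0, Vectors.dense(4.0, -1.5))
)).toDF("label", "features")

val lr = new LinearRegression()
  .setLoss("huber")       // huber objective instead of squared error
  .setEpsilon(1.35)       // threshold between quadratic and linear loss
  .setFitIntercept(true)  // intercept handling benefits from virtual centering internally
val model = lr.fit(training)
```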

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and newly added test suites.

Closes #32759 from zhengruifeng/refactor_huber_agg.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(commit: ffc61c6)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (diff)
The file was added mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberBlockAggregator.scala
The file was added mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberBlockAggregatorSuite.scala
The file was added mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresBlockAggregatorSuite.scala
The file was removed mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
The file was removed mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala
The file was removed mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregatorSuite.scala
The file was removed mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberAggregatorSuite.scala
The file was added mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresBlockAggregator.scala
Commit 6f2ffccb5e17b5ee92003c86b7ec03c5344105c3 by dhyun
[SPARK-35660][BUILD][K8S] Upgrade kubernetes-client to 5.4.1

### What changes were proposed in this pull request?

This PR aims to upgrade kubernetes-client to 5.4.1.

### Why are the changes needed?

This will bring a few bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.4.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32798 from dongjoon-hyun/SPARK-35660.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 6f2ffcc)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
Commit 7ce7aa47585f579a84dc6dd8f116a48174cba988 by gurwls223
[SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation

### What changes were proposed in this pull request?

This PR proposes to change from:

![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png)

to:

![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png)

### Why are the changes needed?

pandas API on Spark (pandas-on-Spark) is ultimately a package in PySpark, so it has to be documented at the same level as other packages (e.g., Spark SQL).

### Does this PR introduce _any_ user-facing change?

Yes, it changes the structure of the docs. For end users, no, as it's only in the development branch.

### How was this patch tested?

Manually tested as above.

Closes #32799 from HyukjinKwon/SPARK-35646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 7ce7aa4)
The file was added python/docs/source/reference/pyspark.pandas.rst
The file was modified python/docs/source/reference/index.rst (diff)
Commit a70e66ecfa638cacc99b4e9a7c464e41ec92ad30 by gurwls223
[SPARK-35665][SQL] Resolve UnresolvedAlias in CollectMetrics

### What changes were proposed in this pull request?

It's a long-standing bug that we forgot to resolve `UnresolvedAlias` in `CollectMetrics`. It's a bit hard to trigger this bug before 3.2, as people would most likely not create an `UnresolvedAlias` when calling `Dataset.observe`. However, things have changed after https://github.com/apache/spark/pull/30974

This PR proposes to handle `CollectMetrics` in the rule `ResolveAliases`.
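
As a hedged illustration (the metric names below are made up, not from this PR's tests), this is the kind of `observe` call that builds a `CollectMetrics` node; an expression without an explicit alias is wrapped in an `UnresolvedAlias`, which `ResolveAliases` now resolves:

```scala
import org.apache.spark.sql.functions._

// Assumes an active SparkSession `spark`. max(col("id")) has no explicit alias, so it is
// wrapped in an UnresolvedAlias inside the CollectMetrics node created by observe().
val observed = spark.range(10)
  .observe("my_metrics", count(lit(1)).as("rows"), max(col("id")))
observed.collect()  // metric values are reported to QueryExecutionListeners after the action
```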

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated test

Closes #32803 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: a70e66e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala (diff)
Commit 4534c0c4df7af4287a40b6743f1d838147e134fe by piros.attila.zsolt
[SPARK-35543][CORE] Fix memory leak in BlockManagerMasterEndpoint removeRdd

### What changes were proposed in this pull request?

In `BlockManagerMasterEndpoint`, for disk-persisted RDDs (when `spark.shuffle.service.fetch.rdd.enabled` is enabled) we keep track of block status entries per external shuffle service instance (so on YARN we basically keep them per node). This is the `blockStatusByShuffleService` member val. When all the RDD blocks are removed for one external shuffle service instance, the key and its now-empty map can be removed from `blockStatusByShuffleService`.
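
A generic sketch of the fix pattern (names and value types are simplified for illustration; this is not the actual `BlockManagerMasterEndpoint` code):

```scala
import scala.collection.mutable

// Outer key: external shuffle service instance; inner map: block id -> status (simplified).
val blockStatusByShuffleService = mutable.HashMap[String, mutable.HashMap[String, String]]()

def removeRddBlock(shuffleServiceHost: String, blockId: String): Unit = {
  blockStatusByShuffleService.get(shuffleServiceHost).foreach { blocks =>
    blocks.remove(blockId)
    if (blocks.isEmpty) {
      // Dropping the now-empty inner map (and its key) is what plugs the leak.
      blockStatusByShuffleService.remove(shuffleServiceHost)
    }
  }
}
```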

### Why are the changes needed?

It is a small leak and I was asked to take care of it in https://github.com/apache/spark/pull/32114#discussion_r640270377.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually, by adding a temporary log line to check the `blockStatusByShuffleService` value before and after `removeRdd`, and running the `SPARK-25888: using external shuffle service fetching disk persisted blocks` test in `ExternalShuffleServiceSuite`.

Closes #32790 from attilapiros/SPARK-35543.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
(commit: 4534c0c)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (diff)
Commit 33f26275f4d65f54e68f38ba0d795396a5a4d2f4 by wenchen
[SPARK-35663][SQL] Add Timestamp without time zone type

### What changes were proposed in this pull request?

Extend Catalyst's type system with a new type that conforms to the SQL standard (see SQL:2016, section 4.6.2): `TimestampWithoutTZType`, which represents the timestamp without time zone type.
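
A hedged sketch of how such a type could be referenced, assuming the new type follows the same singleton pattern as the existing Catalyst types (no SQL parser or data source support is implied by this starting PR):

```scala
import org.apache.spark.sql.types._

// Programmatic schema mixing the existing timestamp type with the new one.
val schema = StructType(Seq(
  StructField("event_time_ltz", TimestampType),          // existing: local time zone semantics
  StructField("event_time_ntz", TimestampWithoutTZType)  // new: no time zone adjustment
))
```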

### Why are the changes needed?

Spark SQL today supports the TIMESTAMP data type. However, the semantics it provides actually match TIMESTAMP WITH LOCAL TIME ZONE as defined by Oracle. Timestamps embedded in a SQL query or passed through JDBC are presumed to be in the session-local time zone and are cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with calendars.
In many other cases, such as when dealing with log files, it is desirable that the provided timestamps not be altered.
SQL users expect that they can model either behavior, and do so by using TIMESTAMP WITHOUT TIME ZONE for time-zone-insensitive data and TIMESTAMP WITH LOCAL TIME ZONE for time-zone-sensitive data.
Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, so their users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for the standard semantics.
Using these two types will provide clarity.
This is a starting PR. See more details in https://issues.apache.org/jira/browse/SPARK-35662

### Does this PR introduce _any_ user-facing change?

Yes, a new data type for Timestamp without time zone type. It is still in development.

### How was this patch tested?

Unit test

Closes #32802 from gengliangwang/TimestampNTZType.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 33f2627)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/types/TimestampWithoutTZType.scala
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/types/DataTypes.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala (diff)
Commit 6c3b7f92cfaf4d11c8c9c984082ea40bd1f86abd by tgraves
[SPARK-35074][CORE] hardcoded configs move to config package

### What changes were proposed in this pull request?
Currently, `spark.jars.*` property keys (e.g. `spark.jars.ivySettings` and `spark.jars.packages`) are hardcoded in multiple places within Spark code across multiple modules. We should define them in `config/package.scala` and reference them in all other places.
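
An illustrative sketch of the pattern (the entry name is real, but the doc text and version below are assumptions rather than the exact entries this PR adds):

```scala
// Inside core/src/main/scala/org/apache/spark/internal/config/package.scala
// (a Spark-internal package object), the key is defined once...
private[spark] val JAR_IVY_SETTING_PATH = ConfigBuilder("spark.jars.ivySettings")
  .doc("Path to an Ivy settings file used when resolving spark.jars.packages.")
  .version("2.2.0")
  .stringConf
  .createOptional

// ...and callers such as SparkSubmit or DependencyUtils read the constant
// instead of repeating the raw string, e.g. sparkConf.get(JAR_IVY_SETTING_PATH).
```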

### Why are the changes needed?
improvement

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
no

Closes #32746 from dgd-contributor/SPARK-35074_configs_should_be_moved_to_config_package.scala.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(commit: 6c3b7f9)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/DependencyUtils.scala (diff)
Commit 04a8d2cbcf453698170a9a9fbf1f85fe12a50d28 by ueshin
[SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes

### What changes were proposed in this pull request?

Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.

### Why are the changes needed?

The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: 04a8d2c)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py (diff)
The file was added python/pyspark/pandas/tests/data_type_ops/test_null_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
The file was added python/pyspark/pandas/tests/data_type_ops/test_udt_ops.py
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_binary_ops.py (diff)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was added python/pyspark/pandas/data_type_ops/udt_ops.py
The file was modified python/pyspark/pandas/tests/indexes/test_base.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/categorical_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_date_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/datetime_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/test_internal.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_string_ops.py (diff)
The file was added python/pyspark/pandas/data_type_ops/null_ops.py
The file was modified python/pyspark/pandas/internal.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
Commit dfd8a8dc676c388c0c1bb7e4cb8d55eab10504ab by ueshin
[SPARK-35341][PYTHON] Introduce BooleanExtensionOps

### What changes were proposed in this pull request?

- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.

### Why are the changes needed?

Boolean operators `__and__`, `__or__`, `__rand__`, and `__ror__` should be data-type-based.

BooleanExtensionDtypes processes these boolean operators differently from bool, so BooleanExtensionOps is introduced.

These boolean operators themselves are also bitwise operators, which should be able to apply to other data types classes later. However, this is not the goal of this PR.

### Does this PR introduce _any_ user-facing change?

Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
   +- LogicalRDD [__index_level_0__#8L, 0#9], false

```

After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```

### How was this patch tested?

Unit tests.

Closes #32698 from xinrong-databricks/datatypeops_extension.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(commit: dfd8a8d)
The file was modified python/pyspark/pandas/tests/data_type_ops/testing_utils.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/boolean_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_num_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_complex_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_datetime_ops.py (diff)
The file was modified python/pyspark/pandas/data_type_ops/base.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_binary_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_date_ops.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_string_ops.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py (diff)
Commit 04418e18d7714a059fbe33d357cf2db76907e159 by gurwls223
[SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields

### What changes were proposed in this pull request?

Introduces `InternalField` to manage dtypes and `StructField`s.

`InternalFrame` already manages dtypes, but when it checks Spark's data types, column names, and nullabilities, it runs the analysis phase each time it needs them, which causes a performance issue.

It will use the `InternalField` class, which stores the retrieved Spark data types, column names, and nullabilities, and reuses them. Also, when those are already known, it just updates and reuses them without asking Spark.

### Why are the changes needed?

Currently there are some performance issues in the pandas-on-Spark layer.

One of them is accessing the Java DataFrame and running the analysis phase too many times, especially just to retrieve the current column names or data types.

We should reduce the amount of unnecessary access.

### Does this PR introduce _any_ user-facing change?

Improves the performance of the pandas-on-Spark layer:

```py
df = ps.read_parquet("/path/to/test.parquet")  # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```

Before the PR, it took about **2.15 sec** and after **1.15 sec**.

### How was this patch tested?

Existing tests.

Closes #32775 from ueshin/issues/SPARK-35638/field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 04418e1)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/typedef/typehints.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/strings.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/mlflow.py (diff)
The file was modified python/pyspark/pandas/spark/accessors.py (diff)
Commit 745756ca4c1eef29af8dcc919c6d8d5cac06662b by gurwls223
[SPARK-35603][R][DOCS] Add data source options link for R API documentation

### What changes were proposed in this pull request?

The options for each data source are documented on the Data Source Options page.

For Python, Scala, and Java, a link to the Data Source Options page was added to each API documentation.

- Python
<img width="732" alt="Screen Shot 2021-06-07 at 12 25 45 PM" src="https://user-images.githubusercontent.com/44108233/120955187-cbe38800-c78b-11eb-9475-ccf89bbc3c95.png">

- Scala
<img width="677" alt="Screen Shot 2021-06-07 at 12 26 41 PM" src="https://user-images.githubusercontent.com/44108233/120955186-cab25b00-c78b-11eb-9fed-3f0d2024029b.png">

- JAVA
<img width="726" alt="Screen Shot 2021-06-07 at 12 27 49 PM" src="https://user-images.githubusercontent.com/44108233/120955182-c8e89780-c78b-11eb-9cf1-13e41ba35b3e.png">

However, the R documentation has no such link, so we should add it there as well.

### Why are the changes needed?

To provide users with the available options for each data source when they read/write it.

### Does this PR introduce _any_ user-facing change?

Yes, the link to the Data Source Options page is added to the R documentation as below.

<img width="855" alt="Screen Shot 2021-06-07 at 12 29 26 PM" src="https://user-images.githubusercontent.com/44108233/120955302-064d2500-c78c-11eb-8dc3-cb22dfd5fd14.png">

### How was this patch tested?

Manually built the docs and checked them one by one.

Closes #32797 from itholic/SPARK-35603.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: 745756c)
The file was modified R/pkg/R/DataFrame.R (diff)
The file was modified R/pkg/R/SQLContext.R (diff)
The file was modified R/pkg/R/functions.R (diff)
Commit f3dc549d9c4af90a4e01e9a3b8b6724aa4ceddca by gurwls223
[SPARK-35668][INFRA] Use "concurrency" syntax on Github Actions workflow

### What changes were proposed in this pull request?

This patch uses the "concurrency" syntax to replace the "cancel job" workflow:
- .github/workflows/benchmark.yml
- .github/workflows/labeler.yml
- .github/workflows/notify_test_workflow.yml
- .github/workflows/test_report.yml

Remove the .github/workflows/cancel_duplicate_workflow_runs.yml

Note that the push/schedule-based jobs are not changed, to keep the same config as in https://github.com/apache/spark/commit/a4b70758d3dfe2d59fb3800321ffb2450206c26f:
- .github/workflows/build_and_test.yml
- .github/workflows/publish_snapshot.yml
- .github/workflows/stale.yml
- .github/workflows/update_build_status.yml

### Why are the changes needed?
We are using the [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1) job to cancel previous jobs when a new job is queued. This is now supported natively by GitHub Actions via the ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency) syntax, which makes sure only a single job or workflow using the same concurrency group runs at a time.

Related: https://github.com/apache/arrow/pull/10416 and https://github.com/potiuk/cancel-workflow-runs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Triggered the PR workflow manually.

Closes #32806 from Yikun/SPARK-X.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(commit: f3dc549)
The file was modified .github/workflows/benchmark.yml (diff)
The file was modified .github/workflows/test_report.yml (diff)
The file was modified .github/workflows/labeler.yml (diff)
The file was removed .github/workflows/cancel_duplicate_workflow_runs.yml
The file was modified .github/workflows/notify_test_workflow.yml (diff)