Changes

Summary

  1. [SPARK-33310][PYTHON] Relax pyspark typing for sql str functions (commit: 56587f0) (details)
  2. [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming (commit: b8a440f) (details)
  3. [SPARK-20044][UI] Support Spark UI behind front-end reverse proxy using (commit: 2b6dfa5) (details)
  4. [SPARK-30663][SPARK-33313][TESTS][R] Drop testthat 1.x support and add (commit: d71b2fe) (details)
  5. [SPARK-33095] Follow up, support alter table column rename (commit: 6226ccc) (details)
  6. [SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE (commit: e52b858) (details)
Commit 56587f076d282ec96c4779faa63d7d9764cf0c3c by gurwls223
[SPARK-33310][PYTHON] Relax pyspark typing for sql str functions
### What changes were proposed in this pull request?
Relax pyspark typing for sql str functions. These functions all pass the
first argument through `_to_java_column`, such that a string or Column
object is acceptable.
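As a hedged illustration (the DataFrame and column names here are invented for the example), both of these calls are valid at runtime, and the relaxed stubs now accept both:
```py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",)], ["name"])

# A column name and a Column object are interchangeable here,
# since both go through `_to_java_column`.
df.select(upper("name"), upper(col("name"))).show()
```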
### Why are the changes needed?
Convenience & ensuring the typing reflects the functionality
### Does this PR introduce _any_ user-facing change?
Yes, a backwards-compatible increase in functionality. But I think
typing support is unreleased, so possibly no change to released
versions.
### How was this patch tested?
Not tested. I am newish to Python typing with stubs, so someone should
confirm this is the correct way to fix this.
Closes #30209 from dhimmel/patch-1.
Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 56587f0)
The file was modified python/pyspark/sql/functions.pyi (diff)
Commit b8a440f09880c596325dd9e6caae6b470be76a8f by gurwls223
[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming
after the task ends
### What changes were proposed in this pull request?
As the Python evaluation consumes the parent iterator in a separate
thread, it could consume more data from the parent even after the task
ends and the parent is closed. Thus, we should use
`ContextAwareIterator` to stop consuming after the task ends.
### Why are the changes needed?
Python/Pandas UDF right after off-heap vectorized reader could cause
executor crash.
For example:
```py
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

path = "/tmp/spark-33277-repro"  # any writable location
spark.range(0, 100000, 1, 1).write.parquet(path)
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
def f(x):
    return 0
fUdf = udf(f, LongType())
spark.read.parquet(path).select(fUdf('id')).head()
```
This happens because the Python evaluation consumes the parent iterator
in a separate thread, and it can keep consuming data from the parent
even after the task ends and the parent is closed. If an off-heap
column vector exists in the parent iterator, this can cause a
segmentation fault that crashes the executor.
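The actual fix is the Scala `ContextAwareIterator` wrapper used by `EvalPythonExec`/`MapInPandasExec`; the following is only a rough Python sketch of the idea, with an invented `task_completed` callable standing in for the task-completion check:
```py
class ContextAwareIterator:
    """Sketch only: stop yielding once the owning task has completed."""
    def __init__(self, inner, task_completed):
        self.inner = inner
        self.task_completed = task_completed  # zero-arg callable (assumed)

    def __iter__(self):
        return self

    def __next__(self):
        # Refuse to pull more rows from the (possibly closed) parent
        # after the task ends, rather than touching freed off-heap memory.
        if self.task_completed():
            raise StopIteration
        return next(self.inner)
```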
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests, and manually.
Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b8a440f)
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py (diff)
The file was modified python/pyspark/sql/tests/test_udf.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvalPythonExec.scala (diff)
The file was modified python/pyspark/sql/tests/test_pandas_map.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/python/MapInPandasExec.scala (diff)
Commit 2b6dfa5f7bdd2f2ae7b4d53bb811ccb8563377c5 by gengliang.wang
[SPARK-20044][UI] Support Spark UI behind front-end reverse proxy using
a path prefix
Revert proxy url
### What changes were proposed in this pull request?
Allow running the Spark web UI behind a reverse proxy, with URLs
prefixed by a context root such as www.mydomain.com/spark. In
particular, this makes it possible to access multiple Spark clusters
through the same virtual host, distinguishing them only by context
root, e.g. www.mydomain.com/cluster1 and www.mydomain.com/cluster2, and
to run the Spark UI in a common cookie domain (for SSO) with other
services.
### Why are the changes needed?
This PR takes over https://github.com/apache/spark/pull/17455.
After the changes, Spark can show a customized prefix URL in all the
`href` links of its HTML pages.
### Does this PR introduce _any_ user-facing change?
Yes, all links on the UI pages will contain the value of
`spark.ui.reverseProxyUrl` if it is configured.
### How was this patch tested?
New HTML unit tests in `MasterSuite`; manual UI testing for the master,
worker, and app UIs with an nginx proxy.
Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
   listen 9000;
   set $SPARK_MASTER http://127.0.0.1:8080;
   # split spark UI path into prefix and local path within master UI
   location ~ ^(/path/to/spark/) {
       # strip prefix when forwarding request
       rewrite /path/to/spark(/.*) $1  break;
       #rewrite /path/to/spark/ "/" ;
       # forward to spark master UI
       proxy_pass $SPARK_MASTER;
       proxy_intercept_errors on;
       error_page 301 302 307 = handle_redirects;
   }
   location handle_redirects {
       set $saved_redirect_location '$upstream_http_location';
       proxy_pass $saved_redirect_location;
   }
}
```
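The two `spark.ui.*` options above can also be set programmatically; a minimal sketch (the builder-based session is just one way to supply them):
```py
from pyspark.sql import SparkSession

# Sketch: enable the reverse proxy and the path prefix in code
# instead of in spark-defaults.conf.
spark = (
    SparkSession.builder
    .config("spark.ui.reverseProxy", "true")
    .config("spark.ui.reverseProxyUrl", "/path/to/spark/")
    .getOrCreate()
)
```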
Closes #29820 from gengliangwang/revertProxyURL.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Oliver Köth <okoeth@de.ibm.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(commit: 2b6dfa5)
The file was modified core/src/main/scala/org/apache/spark/deploy/master/Master.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/UIUtils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala (diff)
The file was modified docs/configuration.md (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala (diff)
Commit d71b2febaf536113ffe4ad0626d1d3b4098b98a5 by gurwls223
[SPARK-30663][SPARK-33313][TESTS][R] Drop testthat 1.x support and add
testthat 3.x support
### What changes were proposed in this pull request?
This PR modifies `R/pkg/tests/run-all.R` by:
- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x
with SPARK-30637 and this code is no longer relevant.
- Adding `testthat` 3.x support to avoid AppVeyor failures.
### Why are the changes needed?
The internal API this code currently uses has been removed in the
latest `testthat` release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`.
Closes #30219 from zero323/SPARK-33313.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: d71b2fe)
The file was modified R/pkg/tests/run-all.R (diff)
Commit 6226ccc092c0e24487ee80dc169eb15b32825bce by wenchen
[SPARK-33095] Follow up, support alter table column rename
### What changes were proposed in this pull request?
Support column rename for the MySQL dialect.
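As a hedged example of the statement this enables through a JDBC v2 catalog (the catalog, table, and column names are assumptions for illustration):
```py
# Maps to ALTER TABLE ... RENAME COLUMN on MySQL 8.0; on 5.x the
# dialect now raises a proper error instead of failing obscurely.
spark.sql("ALTER TABLE mysql_catalog.test_db.tbl RENAME COLUMN old_col TO new_col")
```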
### Why are the changes needed?
At the moment, column rename does not work for MySQL 5.x, so we should
throw a proper exception in that case.
### Does this PR introduce _any_ user-facing change?
Yes, `column rename` with the MySQL dialect should now work correctly.
### How was this patch tested?
Added tests for column rename, and ran the tests against both MySQL
versions:
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`
Closes #30142 from ScrapCodes/mysql-dialect-rename.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 6226ccc)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (diff)
Commit e52b858ef71fd2f05e3653e15e91252c04fcefd4 by wenchen
[SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE
### What changes were proposed in this pull request?
As a follow-up to the comment at
https://github.com/apache/spark/pull/29804#issuecomment-700650620, here
we add the physical plan rule DisableUnnecessaryBucketedScan into AQE's
`AdaptiveSparkPlanExec.queryStagePreparationRules`, to make auto
bucketed scan work with AQE.
The change is mostly in:
* `AdaptiveSparkPlanExec.scala`: add physical plan rule
`DisableUnnecessaryBucketedScan`
* `DisableUnnecessaryBucketedScan.scala`: propagate the logical plan
link for the file source scan exec operator; otherwise we lose the
logical plan link information when AQE is enabled and hit an exception
[here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176)
(for example, for the query `SELECT * FROM bucketed_table` with AQE
enabled)
* `DisableUnnecessaryBucketedScanSuite.scala`: add a new test suite for
the AQE-enabled case,
`DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and change
some of the tests to use `AdaptiveSparkPlanHelper.find/collect` so that
plan verification works when AQE is enabled.
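A hedged usage sketch with both features turned on (the table name is illustrative; the auto bucketed scan flag comes from the earlier PR):
```py
# With AQE and auto bucketed scan both enabled, the
# DisableUnnecessaryBucketedScan rule now also runs during
# AQE query-stage preparation.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
spark.sql("SELECT * FROM bucketed_table").collect()
```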
### Why are the changes needed?
It's reasonable to support disabling unnecessary bucketed scans when
AQE is enabled; this helps optimize queries under AQE.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `DisableUnnecessaryBucketedScanSuite`.
Closes #30200 from c21/auto-bucket-aqe.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e52b858)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala (diff)