Changes

Summary

  1. [SPARK-32748][SQL] Revert "Support local property propagation in (commit: 55d38a4) (details)
  2. [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators (commit: bd3dc2f) (details)
  3. [SPARK-32817][SQL] DPP throws error when broadcast side is empty (commit: e7d9a24) (details)
  4. [SPARK-32815][ML] Fix LibSVM data source loading error on file paths (commit: aa87b0a) (details)
  5. [SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test (commit: 96ff87d) (details)
  6. [SPARK-32824][CORE] Improve the error message when the user forgets the (commit: e8634d8) (details)
  7. [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV (commit: adc8d68) (details)
  8. [SPARK-32823][WEB UI] Fix the master ui resources reporting (commit: 514bf56) (details)
  9. [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader (commit: de0dc52) (details)
  10. [SPARK-32204][SPARK-32182][DOCS][FOLLOW-UP] Use IPython instead of (commit: 794b48c) (details)
  11. [SPARK-32808][SQL] Fix some test cases of `sql/core` module in scala (commit: 513d51a) (details)
  12. [SPARK-32755][SQL][FOLLOWUP] Ensure `--` method of AttributeSet have (commit: fc10511) (details)
  13. [SPARK-32794][SS] Fixed rare corner case error in micro-batch engine (commit: e4237bb) (details)
  14. Revert "[SPARK-32677][SQL] Load function resource before create" (commit: f7995c5) (details)
Commit 55d38a479b3d9d2607e1c564f07bf617fca8a6c2 by yamamuro
[SPARK-32748][SQL] Revert "Support local property propagation in
SubqueryBroadcastExec"
### What changes were proposed in this pull request?
This reverts commit 04f7f6dac0b9177e11482cca4e7ebf7b7564e45f due to the
discussion in
[comment](https://github.com/apache/spark/pull/29589#discussion_r484657207).
### Why are the changes needed?
Based on the discussion in
[comment](https://github.com/apache/spark/pull/29589#discussion_r484657207),
propagation for thread local properties in `SubqueryBroadcastExec` is
not necessary, since they will be propagated by broadcast exchange
threads anyway.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Also revert the added test.
Closes #29674 from wzhfy/revert_dpp_thread_local.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: 55d38a4)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/ExecutorSideSQLConfSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala (diff)
Commit bd3dc2f54d871d152331612c53f586181f4e87fc by wenchen
[SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators
thread-safe
### What changes were proposed in this pull request?
Before SPARK-31511 was fixed, `BytesToBytesMap`'s iterator() was not
thread-safe and could cause data inaccuracy. This PR adds a unit test for it.
### Why are the changes needed? Increase test coverage to ensure that
iterator() is thread-safe.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested? Added a unit test.
Closes #29669 from cxzl25/SPARK-31511-test.
Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: bd3dc2f)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala (diff)
Commit e7d9a245656655e7bb1df3e04df30eb3cc9e23ad by yamamuro
[SPARK-32817][SQL] DPP throws error when broadcast side is empty
### What changes were proposed in this pull request?
In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is
an `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw
`UnsupportedOperationException`.
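For illustration, a minimal self-contained sketch of the failure mode and one
possible shape of the guard (the trait and case objects below are stand-ins
for Spark's internal `HashedRelation` hierarchy, not the actual
implementation):
```scala
// Illustrative stand-ins for HashedRelation / EmptyHashedRelation.
sealed trait Relation { def keys(): Iterator[Int] }
case object EmptyRelation extends Relation {
  // Mirrors the bug described above: asking an empty relation for keys throws.
  def keys(): Iterator[Int] = throw new UnsupportedOperationException("empty relation")
}
final case class BuiltRelation(ks: Seq[Int]) extends Relation {
  def keys(): Iterator[Int] = ks.iterator
}

// Guarding the empty case yields no pruning keys instead of an exception.
def pruningKeys(rel: Relation): Iterator[Int] = rel match {
  case EmptyRelation => Iterator.empty
  case r             => r.keys()
}
```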
### Why are the changes needed?
To fix a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a new test.
Closes #29671 from wzhfy/dpp_empty_broadcast.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: e7d9a24)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
Commit aa87b0aba3304aff195e98e76dc036d7ebd4c7c4 by wenchen
[SPARK-32815][ML] Fix LibSVM data source loading error on file paths
with glob metacharacters
### What changes were proposed in this pull request? In the PR, I
propose to fix an issue with the LibSVM datasource when both of the
following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`,
`{`, `}`, `*` etc.
The fix is based on another bug fix for CSV/JSON datasources
https://github.com/apache/spark/pull/29659.
### Why are the changes needed? To fix the issue when the following query
tries to read from a path containing `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but ends up hitting an exception:
```
Path does not exist:
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist:
file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
  at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
  at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
  at scala.util.Success.$anonfun$map$1(Try.scala:255)
  at scala.util.Success.map(Try.scala:213)
```
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? Added UT to `LibSVMRelationSuite`.
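As a rough reproduction of the scenario, here is a hedged sketch assuming a
spark-shell session (`spark` in scope); the directory layout and file name are
hypothetical:
```scala
import java.nio.file.{Files, Paths}

// Create a directory whose name contains glob metacharacters and put a
// small LibSVM-formatted file in it.
val dir = Files.createDirectories(Paths.get("/tmp/demo/[abc]"))
Files.write(dir.resolve("data.libsvm"), "1 1:1.0 3:2.0\n".getBytes)

// Reading the path with the metacharacters escaped previously failed during
// schema inference; with the fix it loads as expected.
spark.read.format("libsvm").load("""/tmp/demo/\[abc\]""").show()
```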
Closes #29670 from MaxGekk/globbing-paths-when-inferring-schema-ml.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: aa87b0a)
The file was modified mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala (diff)
Commit 96ff87dce893868c839718798e08fbe15240ec1c by gurwls223
[SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test
### What changes were proposed in this pull request? Fix indentation and
clean up the view in the test added by
https://github.com/apache/spark/pull/29593.
### Why are the changes needed? Address review comments in
https://github.com/apache/spark/pull/29665.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested? Updated test.
Closes #29682 from manuzhang/spark-32753-followup.
Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 96ff87d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit e8634d8f6f8548852a284a32c1b7da24bedd8ff7 by gurwls223
[SPARK-32824][CORE] Improve the error message when the user forgets the
.amount in a resource config
### What changes were proposed in this pull request?
If the user forgets to specify `.amount` on a resource config like
`spark.executor.resource.gpu`, the error message thrown is very confusing:
```
ERROR SparkContext: Error initializing SparkContext.
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1967)
  at ...
```
This change makes Spark throw a readable error instead.
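For context, a minimal sketch of the misconfiguration (the property keys are
real Spark configs; the surrounding setup is illustrative):
```scala
import org.apache.spark.SparkConf

// Missing ".amount": this is the case that used to die with a
// StringIndexOutOfBoundsException and now gets a readable error.
val bad = new SparkConf().set("spark.executor.resource.gpu", "2")

// Correct form: the ".amount" suffix is required.
val good = new SparkConf().set("spark.executor.resource.gpu.amount", "2")
```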
### Why are the changes needed?
The current error is confusing for users.
### Does this PR introduce _any_ user-facing change?
No, just the error message.
### How was this patch tested?
Tested manually on standalone cluster
Closes #29685 from tgravescs/SPARK-32824.
Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: e8634d8)
The file was modified core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala (diff)
Commit adc8d687cee33316a4ec1e006efaea8e823491f6 by gurwls223
[SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV
datasources v1 and v2
### What changes were proposed in this pull request? In the PR, I
propose to move the test `SPARK-32810: CSV and JSON data sources should
be able to read files with escaped glob metacharacter in the paths` from
`DataFrameReaderWriterSuite` to `CSVSuite` and `JsonSuite`. This allows
running the same test in `CSVv1Suite`/`CSVv2Suite` and in
`JsonV1Suite`/`JsonV2Suite`.
### Why are the changes needed? To improve test coverage by checking
JSON/CSV datasources v1 and v2.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? By running affected test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv.*"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
```
Closes #29684 from MaxGekk/globbing-paths-when-inferring-schema-dsv2.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: adc8d68)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
Commit 514bf563a7fa1f470b2ab088c0838317500a9aab by gurwls223
[SPARK-32823][WEB UI] Fix the master ui resources reporting
### What changes were proposed in this pull request?
Fixes the master UI so that it properly sums resource totals across
multiple workers (the "Resources in use: 0 / 8 gpu" field).
The bug is that the code created a MutableResourceInfo per worker and then
reduced with the + operator. The + operator in MutableResourceInfo simply
adds the addresses of one to the addresses of the other, but since it uses
a HashSet, identical addresses collapse and the total is undercounted.
E.g., if worker1 has GPU addresses 0,1,2,3 and worker2 also has addresses
0,1,2,3, then you only see 4 total GPUs when there are 8.
In this case we don't really need to create the MutableResourceInfo at
all, because we just want the sums for used and total, so its use is
removed. The other uses of it are per worker, so those should be ok.
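A small sketch of the undercounting, using plain Scala sets as stand-ins for
the address sets inside MutableResourceInfo:
```scala
// Two workers, each exposing GPUs at the same local addresses.
val worker1 = Set("0", "1", "2", "3")
val worker2 = Set("0", "1", "2", "3")

// Reducing with set union collapses identical addresses across workers.
val merged = worker1 ++ worker2
assert(merged.size == 4) // the UI showed 4 GPUs...

// Summing per-worker counts, as the fix does, gives the real total.
assert(worker1.size + worker2.size == 8) // ...when the cluster has 8.
```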
### Why are the changes needed?
Fixes the resource totals reported in the master UI.
### Does this PR introduce _any_ user-facing change?
Only the UI display changes.
### How was this patch tested?
Tested manually on a standalone cluster with multiple workers and
multiple GPUs and FPGAs.
Closes #29683 from tgravescs/SPARK-32823.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com> Co-authored-by:
Thomas Graves <tgraves@apache.org> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 514bf56)
The file was modified core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala (diff)
Commit de0dc52a842bf4374c1ae4f9546dd95b3f35c4f1 by gurwls223
[SPARK-32813][SQL] Get default config of ParquetSource vectorized reader
if no active SparkSession
### What changes were proposed in this pull request?
If no active SparkSession is available, let
`FileSourceScanExec.needsUnsafeRowConversion` fall back to the default SQL
config for the Parquet vectorized reader instead of failing the query
execution.
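A hedged sketch of the fallback pattern (not the exact Spark code;
`SparkSession.getActiveSession` and `SQLConf.get` are real APIs):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Prefer the active session's conf; fall back to the default SQLConf when no
// session is active instead of failing the query.
def parquetVectorizedEnabled: Boolean =
  SparkSession.getActiveSession
    .map(_.sessionState.conf.parquetVectorizedReaderEnabled)
    .getOrElse(SQLConf.get.parquetVectorizedReaderEnabled)
```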
### Why are the changes needed?
Fix a bug where a file-based data source scan over Parquet throws an
exception if no active SparkSession is available.
### Does this PR introduce _any_ user-facing change?
Yes, this change fixes the bug.
### How was this patch tested?
Unit test.
Closes #29667 from viirya/SPARK-32813.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: de0dc52)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLExecutionSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (diff)
Commit 794b48c1728509bad9e5390646e5119789b79d2b by gengliang.wang
[SPARK-32204][SPARK-32182][DOCS][FOLLOW-UP] Use IPython instead of
ipython to check if installed in dev/lint-python
### What changes were proposed in this pull request?
It should check `IPython` as it's imported as a package. Currently, the
Sphinx build is being skipped in GitHub Actions as below:
https://github.com/apache/spark/runs/1084164546
```
starting python compilation test...
python compilation succeeded.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks passed.
python3 does not have ipython installed. Skipping Sphinx build for now.
all lint-python tests passed!
```
### Why are the changes needed?
To run the documentation builds in GitHub Actions.
### Does this PR introduce _any_ user-facing change?
No, dev-only
### How was this patch tested?
Manually tested as `dev/lint-python`.
Closes #29679 from HyukjinKwon/follow-ipython.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Gengliang
Wang <gengliang.wang@databricks.com>
(commit: 794b48c)
The file was modified dev/lint-python (diff)
Commit 513d51a2c5dd2c7ff2c2fadc26ec122883372be1 by srowen
[SPARK-32808][SQL] Fix some test cases of `sql/core` module in scala
2.13
### What changes were proposed in this pull request? The purpose of this
PR is to partially resolve
[SPARK-32808](https://issues.apache.org/jira/browse/SPARK-32808); a total
of 26 failed test cases were fixed, in the following suites:
- `StreamingAggregationSuite` related test cases (2 FAILED -> Pass)
- `GeneratorFunctionSuite` related test cases (2 FAILED -> Pass)
- `UDFSuite` related test cases (2 FAILED -> Pass)
- `SQLQueryTestSuite` related test cases (5 FAILED -> Pass)
- `WholeStageCodegenSuite` related test cases (1 FAILED -> Pass)
- `DataFrameSuite` related test cases (3 FAILED -> Pass)
- `OrcV1QuerySuite\OrcV2QuerySuite` related test cases (4 FAILED ->
Pass)
- `ExpressionsSchemaSuite` related test cases (1 FAILED -> Pass)
- `DataFrameStatSuite` related test cases (1 FAILED -> Pass)
- `JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite` related test cases
(6 FAILED -> Pass)
The main changes of this PR are as follows:
- Fix Scala 2.13 compilation problems in `ShuffleBlockFetcherIterator`
and `Analyzer`
- Change `Seq` to `scala.collection.Seq` in `objects.scala` and
`GenericArrayData`, because the internal value may be a
`mutable.ArraySeq` for which calling `.toSeq` is not straightforward
- Change `Seq` to `scala.collection.Seq` when calling `Row.getAs[Seq]`
and `Row.get(i).asInstanceOf[Seq]`, because the data may be a
`mutable.ArraySeq` while `Seq` means `immutable.Seq` in Scala 2.13
- Use a compatible implementation so that the `+` and `-` methods of
`Decimal` behave the same in Scala 2.12 and Scala 2.13
- Call `toList` in `RelationalGroupedDataset.toDF` when `groupingExprs`
is a `Stream`, because `Stream` can't be serialized in Scala 2.13
- Add a manual sort to `classFunsMap` in `ExpressionsSchemaSuite`,
because `Iterable.groupBy` in Scala 2.13 returns a different result than
`TraversableLike.groupBy` in Scala 2.12
### Why are the changes needed? We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
Yes. Callers should change `Seq` to `scala.collection.Seq` when calling
`Row.getAs[Seq]` or `Row.get(i).asInstanceOf[Seq]`, because the data may
be a `mutable.ArraySeq` while `Seq` is `immutable.Seq` in Scala 2.13, as
sketched below.
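A short sketch of the caller-side pattern (a toy `Row` for illustration; with
a real query the stored value can be a `mutable.ArraySeq` under Scala 2.13):
```scala
import org.apache.spark.sql.Row

import scala.collection.mutable

val row: Row = Row(mutable.ArraySeq(1, 2, 3))

// Asking for the 2.13 immutable Seq can fail with a ClassCastException at the
// point of use (the cast itself is unchecked due to erasure):
// val bad: Seq[Int] = row.getAs[Seq[Int]](0)

// Asking for scala.collection.Seq works under both Scala 2.12 and 2.13:
val ok: scala.collection.Seq[Int] = row.getAs[scala.collection.Seq[Int]](0)
```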
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
*** 319 TESTS FAILED ***
```
**After**
```
Tests: succeeded 8204, failed 286, canceled 1, ignored 52, pending 0
*** 286 TESTS FAILED ***
```
Closes #29660 from LuciferYang/SPARK-32808.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 513d51a)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit fc10511d15a17c091e977dc4759be96247a8433e by wenchen
[SPARK-32755][SQL][FOLLOWUP] Ensure `--` method of AttributeSet have
same behavior under Scala 2.12 and 2.13
### What changes were proposed in this pull request?
The `--` method of `AttributeSet` behaves differently under Scala 2.12
and 2.13 because the `--` method of `LinkedHashSet` in Scala 2.13 doesn't
maintain insertion order.
This PR uses Scala 2.12-style code to ensure the `--` method of
`AttributeSet` has the same behavior under Scala 2.12 and 2.13.
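In the same spirit, a hedged sketch of an insertion-order-preserving
difference (illustrative, not Spark's exact code):
```scala
import scala.collection.mutable

// Filter the left operand instead of relying on LinkedHashSet#--, whose
// iteration order is not guaranteed to be insertion order in Scala 2.13.
def orderedDiff[A](left: mutable.LinkedHashSet[A], right: Set[A]): mutable.LinkedHashSet[A] = {
  val result = new mutable.LinkedHashSet[A]
  left.foreach(a => if (!right.contains(a)) result += a)
  result
}

// orderedDiff(mutable.LinkedHashSet(3, 1, 2), Set(1)) iterates as 3, 2.
```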
### Why are the changes needed? The behavior of `AttributeSet` needs to
be compatible with Scala 2.12 and 2.13.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Scala 2.12: Pass the Jenkins or GitHub
Action
Scala 2.13: Manually tested sub-suites of `PlanStabilitySuite`
- **Before**: 293 TESTS FAILED
- **After**: 13 TESTS FAILED (the remaining failures are not associated
with the current issue)
Closes #29689 from LuciferYang/SPARK-32755-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: fc10511)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeSet.scala (diff)
Commit e4237bbda68c23a3367fab56fd8cdb521f8a1ae2 by tathagata.das1565
[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine
with some stateful queries + no-data-batches + V1 sources
### What changes were proposed in this pull request? Make
MicroBatchExecution explicitly call `getBatch` when the start and end
offsets are the same.
### Why are the changes needed?
The Structured Streaming micro-batch engine has a contract with V1 data
sources that, after a restart, it will call `source.getBatch()` on the
last batch attempted before the restart. However, a very rare
combination of events violates this contract. It occurs only when
- The streaming query has specific types of stateful operations with
watermarks (e.g., aggregation in append, mapGroupsWithState with
timeouts).
   - These queries can execute a batch even without new data when the
previous batch updates the watermark and the stateful ops are such that
the new watermark can cause new output/cleanup. Such batches are called
no-data-batches.
- The last batch before termination was an incomplete no-data-batch.
Upon restart, the micro-batch engine fails to call `source.getBatch`
when attempting to re-execute the incomplete no-data-batch.
This occurs because no-data-batches have the same start and end offsets,
and when a batch is executed, if the start and end offsets are the same,
the call to `source.getBatch` is skipped as the generated plan is assumed
to be empty. This only affects V1 data sources like Delta and Autoloader,
which rely on this invariant to detect in the source whether the query is
being started from scratch or restarted.
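A hedged sketch of the engine-side change (illustrative, not the exact
MicroBatchExecution code; `Source#getBatch(Option[Offset], Offset)` is the
real V1 interface):
```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Source}

// Always invoke getBatch so the source sees the last attempted batch after a
// restart, even for a no-data-batch where start == end; only the (empty)
// plan is skipped.
def planBatch(source: Source, start: Option[Offset], end: Offset): Option[DataFrame] = {
  val df = source.getBatch(start, end)
  if (start.contains(end)) None else Some(df)
}
```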
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested?
New unit test with a mock v1 source that fails without the fix.
Closes #29651 from tdas/SPARK-32794.
Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by:
Tathagata Das <tathagata.das1565@gmail.com>
(commit: e4237bb)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecutionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
Commit f7995c576aa16f04681ca8f43f8c0f71818a9f44 by wenchen
Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b7966781772338e1f6d53690ab52cc66f.
(commit: f7995c5)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveUDFDynamicLoadSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udaf.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out (diff)
The file was modified python/pyspark/sql/tests/test_catalog.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala (diff)