Changes

Summary

  1. [SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet (commit: 09bb9be) (details)
  2. [SPARK-33427][SQL] Add subexpression elimination for interpreted (commit: 9283484) (details)
  3. [SPARK-32222][K8S][TESTS] Add K8s IT for conf propagation (commit: 2a8e253) (details)
Commit 09bb9bedcd27e08b86d63a6aed90d42ca4c606be by wenchen
[SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet
predicate have many values
### What changes were proposed in this pull request?
We
[rewrite](https://github.com/apache/spark/blob/5197c5d2e7648d75def3e159e0d2aa3e20117105/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L722-L724)
`In`/`InSet` predicates to `or` expressions when pruning Hive partitions.
That can cause a Hive metastore stack overflow if there are a lot of
values.
This PR rewrites the `InSet` predicate to a `GreaterThanOrEqual` on the
min value and a `LessThanOrEqual` on the max value when pruning Hive
partitions, to avoid the Hive metastore stack overflow.
In our experience,
`spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less
than 10000.
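The rewrite described above can be sketched in a few lines of plain Scala. This is an illustrative simplification, not Spark's actual `HiveShim` code: the object and method names are hypothetical, and it builds Hive metastore filter strings for integer partition values only.

```scala
// Hypothetical sketch of the InSet-to-range rewrite (not Spark's actual
// HiveShim implementation). When the number of values in an IN set exceeds
// a threshold (cf. spark.sql.hive.metastorePartitionPruningInSetThreshold),
// push down only a min/max range instead of a long chain of OR-ed
// equalities, which can overflow the metastore's recursive filter parser.
object InSetRewriteSketch {
  def toMetastoreFilter(attr: String, values: Seq[Int], threshold: Int): String = {
    if (values.size > threshold) {
      // Range filter: may match extra partitions, which Spark must then
      // filter out on its side, but it keeps the metastore filter small.
      s"$attr >= ${values.min} and $attr <= ${values.max}"
    } else {
      // Small sets are still expanded into OR-ed equality predicates.
      values.map(v => s"$attr = $v").mkString(" or ")
    }
  }
}
```

Note the trade-off: the range filter is a superset of the exact `InSet`, so partitions inside the range but outside the set are fetched and discarded later, which is why the threshold is configurable.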
### Why are the changes needed?
Avoids a Hive metastore stack overflow when an `InSet` predicate has
many values. This is especially relevant for dynamic partition pruning
(DPP), which may generate many values.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #30325 from wangyum/SPARK-33416.
Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 09bb9be)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/FiltersSuite.scala (diff)
Commit 928348408eac59eaa0c8ad6412d36ea8e8bea83f by wenchen
[SPARK-33427][SQL] Add subexpression elimination for interpreted
expression evaluation
### What changes were proposed in this pull request?
This patch proposes to add subexpression elimination for interpreted
expression evaluation. Interpreted expression evaluation is used when
codegen is not able to work, for example with a complex schema.
### Why are the changes needed?
Currently we only do subexpression elimination for codegen. For several
reasons, we may need to run interpreted expression evaluation instead;
for example, when codegen fails to compile and falls back to interpreted
mode, or when expressions have a complex input/output schema. Such
complex schemas from expressions are commonly seen and can also be
caused by the query optimizer, e.g. SPARK-32945.
We should also support subexpression elimination for interpreted
evaluation. That could reduce the performance difference when Spark
falls back from codegen to interpreted expression evaluation, and
improve Spark usability.
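The idea can be sketched in self-contained Scala. This is an illustrative model, not Spark's actual `SubExprEvaluationRuntime` API: a shared subexpression is wrapped so it is evaluated at most once, and its result is reused wherever it appears.

```scala
// Hypothetical sketch of subexpression elimination for interpreted
// evaluation (names are illustrative, not Spark's real classes): a common
// subexpression is evaluated once and its result reused by every parent.
object SubExprEliminationSketch {
  sealed trait Expr { def eval(row: Map[String, Int]): Int }

  final case class Attr(name: String) extends Expr {
    def eval(row: Map[String, Int]): Int = row(name)
  }

  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(row: Map[String, Int]): Int = left.eval(row) + right.eval(row)
  }

  // Counts evaluations, so we can observe that elimination actually worked.
  final class Counted(child: Expr) extends Expr {
    var evals = 0
    def eval(row: Map[String, Int]): Int = { evals += 1; child.eval(row) }
  }

  // Memoizes the child's result: the interpreted runtime would wrap each
  // detected common subexpression like this before evaluating a row.
  final class CachedOnce(child: Expr) extends Expr {
    private var cached: Option[Int] = None
    def eval(row: Map[String, Int]): Int = {
      if (cached.isEmpty) cached = Some(child.eval(row))
      cached.get
    }
  }
}
```

In a real runtime the cache must be reset per input row; the sketch omits that bookkeeping to stay short.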
#### Benchmark
Update `SubExprEliminationBenchmark`:
Before:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903        0.0   247068775.9       1.0X
```
After:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87        0.0    23604320.7      11.2X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test. Benchmark manually.
Closes #30341 from viirya/SPARK-33427.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: 9283484)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/benchmarks/SubExprEliminationBenchmark-jdk11-results.txt (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntime.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubExprEvaluationRuntimeSuite.scala
The file was modified sql/core/benchmarks/SubExprEliminationBenchmark-results.txt (diff)
Commit 2a8e253cdbfdf11993e41b6e14664df890efffce by dongjoon
[SPARK-32222][K8S][TESTS] Add K8s IT for conf propagation
### What changes were proposed in this pull request?
Added an integration test which configures a log4j.properties file and
checks that it is the one picked up by the driver.
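A custom driver logging config of the kind such a test would propagate might look like the fragment below. This is an illustrative log4j 1.x properties file, not necessarily the exact file the test ships:

```properties
# Illustrative log4j.properties: the test propagates a file like this to
# the driver pod and verifies the driver logs reflect this configuration.
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```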
### Why are the changes needed?
Improved test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running integration tests.
Closes #30388 from ScrapCodes/SPARK-32222/k8s-it-spark-conf-propagate.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by:
Dongjoon Hyun <dongjoon@apache.org>
(commit: 2a8e253)
The file was added resource-managers/kubernetes/integration-tests/src/test/resources/log-config-test-log4j.properties
The file was added resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/SparkConfPropagateSuite.scala
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala (diff)