Changes

Summary

  1. [SPARK-33215][WEBUI] Speed up event log download by skipping UI rebuild (commit: 4b0e23e) (details)
  2. [SPARK-33140][SQL] remove SQLConf and SparkSession in all sub-class of Rule[QueryPlan] (commit: 537a49f) (details)
  3. [SPARK-33225][SQL] Extract AliasHelper trait (commit: 281f99c) (details)
  4. [SPARK-33137][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (Postgres dialect) (commit: f284218) (details)
  5. [SPARK-33231][SPARK-33262][CORE] Make pod allocation executor timeouts configurable & allow scheduling with pending pods (commit: 98f0a21) (details)
  6. [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream (commit: 3f2a2b5) (details)
  7. [SPARK-33246][SQL][DOCS] Correct documentation for null semantics of "NULL AND False" (commit: 7d11d97) (details)
  8. [SPARK-33258][R][SQL] Add asc_nulls_* and desc_nulls_* methods to SparkR (commit: ea709d6) (details)
  9. [SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL documents (commit: c2bea04) (details)
  10. [SPARK-33240][SQL] Fail fast when fails to instantiate configured v2 session catalog (commit: fcf8aa5) (details)
  11. [SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to resolve the identifier (commit: 528160f) (details)
  12. [SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts (commit: 9fb4536) (details)
  13. [SPARK-32934][SQL] Improve the performance for NTH_VALUE and refactor the OffsetWindowFunction (commit: 3c3ad5f) (details)
  14. [SPARK-33269][INFRA] Ignore ".bsp/" directory in Git (commit: 2b8fe6d) (details)
  15. [SPARK-33208][SQL] Update the document of SparkSession#sql (commit: b26ae98) (details)
  16. [SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType (commit: a6216e2) (details)
  17. [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values contains null (commit: a744fea) (details)
Commit 4b0e23e646b579b852056ffc87164b16adef5a09 by kabhwan.opensource
[SPARK-33215][WEBUI] Speed up event log download by skipping UI rebuild
### What changes were proposed in this pull request?
This patch separates the view permission checks from `getAppUI` in `FsHistoryProvider`, enabling the SHS to check a given user's view permissions for a given attempt without rebuilding the UI. This is achieved by adding a method `checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean` to several layers of history server components. Currently, this is useful for event log download.
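A simplified sketch of the new hook; the signature is taken verbatim from the description above, while the surrounding class is abbreviated (the real definition lives in ApplicationHistoryProvider.scala):
```scala
abstract class ApplicationHistoryProvider {
  /**
   * Check whether `user` may view the given attempt, without paying the cost
   * of rebuilding the attempt's UI.
   */
  def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean
}
```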
### Why are the changes needed?
Right now, when we want to download event logs from the Spark history server, the SHS needs to parse the entire event log to rebuild the UI, solely to perform view permission checks. UI rebuilding is a time-consuming and memory-intensive task, especially for large logs, yet it is unnecessary for event log download. With this patch, the UI rebuild can be skipped when downloading event logs from the history server. As a result, the time to download a GB-scale event log drops from several minutes to several seconds, and the memory consumption of UI rebuilding is avoided.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested?
Added test cases to confirm that the view permission checks work properly and that downloading event logs does not trigger UI loading. Also ran some manual tests to verify that the download speed is drastically improved and that authentication works properly.
Closes #30126 from baohe-zhang/bypass_ui_rebuild_for_log_download.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 4b0e23e)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/ApplicationHistoryProvider.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/SparkUI.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/ApiRootResource.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/OneApplicationResource.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala (diff)
Commit 537a49fc0966b0b289b67ac9c6ea20093165b0da by wenchen
[SPARK-33140][SQL] remove SQLConf and SparkSession in all sub-class of
Rule[QueryPlan]
### What changes were proposed in this pull request?
Since [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) is done, SQLConf.get and SparkSession.active are now more reliable, so this PR refines the existing code that passes SQLConf and SparkSession into sub-classes of Rule[QueryPlan].
In this PR (see the sketch below):
* Remove SQLConf from the constructor parameters of all sub-classes of Rule[QueryPlan], using SQLConf.get in place of the original SQLConf instance.
* Remove SparkSession from the constructor parameters of all sub-classes of Rule[QueryPlan], using SparkSession.active in place of the original SparkSession instance.
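A hedged illustration of the pattern; `MyRule` is a hypothetical rule invented for this sketch, not one touched by the PR:
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.internal.SQLConf

// Before: case class MyRule(conf: SQLConf) extends Rule[LogicalPlan] { ... }
// After: the rule reads the active session's conf directly, so no instance
// needs to be threaded through the constructor.
object MyRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Illustrative use only: consult a config flag and return the plan as-is.
    if (SQLConf.get.caseSensitiveAnalysis) plan else plan
  }
}
```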
### Why are the changes needed?
Code refinement.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested?
Existing tests.
Closes #30097 from leanken/leanken-SPARK-33140.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 537a49f)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveLambdaVariablesSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ColumnarRulesSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ReuseAdaptiveSubquery.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/DisableUnnecessaryBucketedScan.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/InsertAdaptiveSparkPlan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeLocalShuffleReader.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/AggregateOptimizeSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinalsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/Exchange.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTablesSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoinSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolvedUuidExpressionsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantProjects.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/timeZoneAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/DataSourceAnalysisSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/OuterJoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DemoteBroadcastHashJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/exchange/EnsureRequirementsSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DataSourceV2AnalysisSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FallBackFileSourceV2.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/V2CommandsCaseSensitivitySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/Rule.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveInlineTables.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/SubstituteUnresolvedOrdinals.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantProjectsSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveGroupingAnalyticsSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SelectedFieldSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala (diff)
Commit 281f99c70b2fab2839495638d07acc1e534e5ad6 by yamamuro
[SPARK-33225][SQL] Extract AliasHelper trait
### What changes were proposed in this pull request?
Extract methods related to handling Aliases to a trait.
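A hedged sketch of the extraction pattern; `trimAliases` is one of the shared helpers, with the body simplified here:
```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

// Shared alias-handling utilities, previously duplicated across Analyzer and
// Optimizer rules, now live in one trait that those rules can mix in.
trait AliasHelper {
  // Strip top-level Alias nodes so rules can compare the underlying expressions.
  protected def trimAliases(e: Expression): Expression = e match {
    case Alias(child, _) => trimAliases(child)
    case other => other
  }
}
```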
### Why are the changes needed?
Avoid code duplication
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs cover this
Closes #30134 from tanelk/SPARK-33225_aliasHelper.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: 281f99c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
Commit f284218dae23bf91e72e221943188cdb85e13dac by wenchen
[SPARK-33137][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update
type and nullability of columns (Postgres dialect)
### What changes were proposed in this pull request?
Override the default SQL strings in the Postgres dialect for:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Also add a new docker integration test suite, `jdbc/v2/PostgresIntegrationSuite.scala`.
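A hedged sketch of what the dialect overrides could look like; the method names and signatures are assumptions based on the description above, not the exact JDBC dialect API, while the generated SQL follows standard Postgres ALTER TABLE syntax:
```scala
// Sketch only: method names and signatures are assumed.
object PostgresDialectSketch {
  // Postgres uses "ALTER COLUMN ... TYPE ..." rather than the default syntax.
  def getUpdateColumnTypeQuery(tableName: String, columnName: String, newDataType: String): String =
    s"ALTER TABLE $tableName ALTER COLUMN $columnName TYPE $newDataType"

  // Postgres toggles nullability with SET NOT NULL / DROP NOT NULL.
  def getUpdateColumnNullabilityQuery(tableName: String, columnName: String, isNullable: Boolean): String = {
    val action = if (isNullable) "DROP NOT NULL" else "SET NOT NULL"
    s"ALTER TABLE $tableName ALTER COLUMN $columnName $action"
  }
}
```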
### Why are the changes needed?
Supports Postgres-specific ALTER TABLE syntax.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested?
Added the new test suite `PostgresIntegrationSuite`.
Closes #30089 from huaxingao/postgres_docker.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: f284218)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was added external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/PostgresIntegrationSuite.scala
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/DB2IntegrationSuite.scala (diff)
Commit 98f0a219915dc9ed696602b9bfad82d9cf6c4113 by dhyun
[SPARK-33231][SPARK-33262][CORE] Make pod allocation executor timeouts
configurable & allow scheduling with pending pods
### What changes were proposed in this pull request?
Make the pod allocation executor timeouts configurable. Keep all known pods in mind when allocating executors, to avoid over-allocating when the pending time is much higher than the allocation interval.
This PR increases the default wait time from the current 60s to 600s. Since nodes can now remain "pending" for long periods of time, we allow additional batches to be scheduled during pending allocation while taking the total number of pods into account.
### Why are the changes needed?
The current executor timeouts do not match those of all real-world clusters, especially under load. While this can be worked around by increasing the allocation batch delay, that decreases the speed at which the total number of executors can be requested.
The increase in the default timeout is needed to handle the real-world testing environments I've encountered on moderately busy clusters and on K8s clusters with their own underlying dynamic scale-up of hardware (e.g. GKE, EKS, etc.).
### Does this PR introduce _any_ user-facing change?
Yes, a new configuration property.
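A hedged sketch of what the new entry in the K8s Config.scala might look like; the exact key name and doc wording are assumptions, with the 600s default taken from the description above:
```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

object KubernetesConfigSketch {
  // Sketch only: the key name and wording are assumed.
  val KUBERNETES_ALLOCATION_EXECUTOR_TIMEOUT =
    ConfigBuilder("spark.kubernetes.allocation.executor.timeout")
      .doc("Time to wait before a pending executor pod request is considered timed out.")
      .timeConf(TimeUnit.MILLISECONDS)
      .checkValue(_ > 0, "Allocation executor timeout must be a positive time value.")
      .createWithDefaultString("600s")
}
```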
### How was this patch tested?
Updated an existing test to use the timeout from the new configuration property, and verified that the test failed without the update.
Closes #30155 from
holdenk/SPARK-33231-make-pod-creation-timeout-configurable.
Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 98f0a21)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff)
Commit 3f2a2b5fe6ada37ef86f00737387e6cf2496df74 by dhyun
[SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is
Stream
### What changes were proposed in this pull request?
The following query produces incorrect results. The query has two
essential features: (1) it contains a string aggregate, resulting in a
`SortExec` node, and (2) it contains a duplicate grouping key, causing
`RemoveRepetitionFromGroupExpressions` to produce a sort order stored as
a `Stream`.
```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```
When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects on `ctx`, because `genCode(ctx)` modifies `ctx`. When `ordering` is a `Stream`, the modifications do not happen immediately as intended, but instead occur lazily when the returned `Stream` is used later.
Similar bugs have occurred at least three times in the past:
https://issues.apache.org/jira/browse/SPARK-24500,
https://issues.apache.org/jira/browse/SPARK-25767,
https://issues.apache.org/jira/browse/SPARK-26680.
The fix is to check if `ordering` is a `Stream` and force the
modifications to happen immediately if so.
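A minimal, self-contained demonstration of the hazard (plain Scala 2.12, not Spark code):
```scala
var sideEffects = 0
val ordering: Seq[Int] = Stream(1, 2, 3)
val mapped = ordering.map { x => sideEffects += 1; x * 2 }
assert(sideEffects == 1) // only the Stream's head is mapped eagerly
mapped.foreach(_ => ())  // forcing the Stream...
assert(sideEffects == 3) // ...runs the remaining side effects much later
```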
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The
test previously failed and now passes.
Closes #30160 from ankurdave/SPARK-33260.
Authored-by: Ankur Dave <ankurdave@gmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 3f2a2b5)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SortSuite.scala (diff)
Commit 7d11d972c356140d21909c6a62cdb8d813bd015e by yamamuro
[SPARK-33246][SQL][DOCS] Correct documentation for null semantics of
"NULL AND False"
### What changes were proposed in this pull request?
The documentation of the Spark SQL null semantics states that "NULL AND False" yields NULL. This is incorrect: "NULL AND False" yields False.
```
Seq[(java.lang.Boolean, java.lang.Boolean)](
  (null, false)
)
  .toDF("left_operand", "right_operand")
  .withColumn("AND", 'left_operand && 'right_operand)
  .show(truncate = false)

+------------+-------------+-----+
|left_operand|right_operand|AND  |
+------------+-------------+-----+
|null        |false        |false|
+------------+-------------+-----+
```
I propose the documentation be updated to reflect that "NULL AND False"
yields False.
This contribution is my original work and I license it to the project
under the project’s open source license.
### Why are the changes needed?
This change improves the accuracy of the documentation.
### Does this PR introduce _any_ user-facing change?
Yes. This PR introduces a fix to the documentation.
### How was this patch tested?
Since this is only a documentation change, no tests were added.
Closes #30161 from stwhit/SPARK-33246.
Authored-by: Stuart White <stuart@spotright.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: 7d11d97)
The file was modified docs/sql-ref-null-semantics.md (diff)
Commit ea709d67486dd6329977df6c3ed7a443b835dd48 by gurwls223
[SPARK-33258][R][SQL] Add asc_nulls_* and desc_nulls_* methods to SparkR
### What changes were proposed in this pull request?
This PR adds the following `Column` methods to the R API:
- asc_nulls_first
- asc_nulls_last
- desc_nulls_first
- desc_nulls_last
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
No; it only adds new methods.
### How was this patch tested?
New unit tests.
Closes #30159 from zero323/SPARK-33258.
Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: ea709d6)
The file was modified R/pkg/NAMESPACE (diff)
The file was modified R/pkg/R/generics.R (diff)
The file was modified R/pkg/tests/fulltests/test_sparkSQL.R (diff)
The file was modified R/pkg/R/column.R (diff)
Commit c2bea045e3628081bca1ba752669a5bc009ebd00 by yamamuro
[SPARK-33264][SQL][DOCS] Add a dedicated page for SQL-on-file in SQL
documents
### What changes were proposed in this pull request?
This PR intends to add a dedicated page for SQL-on-file in SQL
documents. This comes from the comment:
https://github.com/apache/spark/pull/30095/files#r508965149
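For reference, SQL-on-file lets a query read a file directly by format and path, without registering a table first; the path below is illustrative:
```scala
// Query a Parquet file in place; "parquet" can be any supported data source
// format, and the backquoted path is a hypothetical example.
spark.sql("SELECT * FROM parquet.`/tmp/data/part-00000.parquet`").show()
```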
### Why are the changes needed?
For better documentation.
### Does this PR introduce _any_ user-facing change?
<img width="544" alt="Screen Shot 2020-10-28 at 9 56 59"
src="https://user-images.githubusercontent.com/692303/97378051-c1fbcb80-1904-11eb-86c0-a88c5269d41c.png">
### How was this patch tested?
N/A
Closes #30165 from maropu/DocForFile.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: c2bea04)
The file was modified docs/sql-ref-syntax.md (diff)
The file was modified docs/sql-ref-syntax-qry-select.md (diff)
The file was modified docs/sql-ref-syntax-qry.md (diff)
The file was added docs/sql-ref-syntax-qry-select-file.md
The file was modified docs/_data/menu-sql.yaml (diff)
Commit fcf8aa59b5025dde9b4af36953146894659967e2 by wenchen
[SPARK-33240][SQL] Fail fast when fails to instantiate configured v2
session catalog
### What changes were proposed in this pull request?
This patch proposes to change the behavior to fail fast when Spark cannot instantiate the configured v2 session catalog.
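A hedged sketch of the behavioral difference, as a hypothetical simplified shape rather than the actual code in CatalogManager.scala:
```scala
import org.apache.spark.internal.Logging

// Hypothetical simplified shape of the lookup; not the actual Spark code.
trait SessionCatalogLoader extends Logging {
  def instantiateConfigured(): AnyRef // throws if the configured class is bad

  // Before: swallow the error and silently fall back to the default catalog.
  def loadBefore(default: AnyRef): AnyRef =
    try instantiateConfigured() catch {
      case e: Exception =>
        logWarning("Failed to instantiate the configured v2 session catalog", e)
        default
    }

  // After: let instantiation errors propagate so misconfiguration fails fast.
  def loadAfter(): AnyRef = instantiateConfigured()
}
```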
### Why are the changes needed?
The current behavior works against the intention of end users: if an end user configures a session catalog that Spark fails to initialize, Spark swallows the error, only logs the error message, and silently uses the default catalog implementation.
This follows the voices on [discussion
thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E)
in dev mailing list.
### Does this PR introduce _any_ user-facing change?
Yes. After this PR, Spark fails immediately if it cannot instantiate the configured session catalog.
### How was this patch tested?
New UT added.
Closes #30147 from HeartSaVioR/SPARK-33240.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: fcf8aa5)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/SupportsCatalogOptionsSuite.scala (diff)
Commit 528160f0014206eaceb01ae0f3ad316bfbdc6885 by wenchen
[SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to
resolve the identifier
### What changes were proposed in this pull request?
This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView`
to resolve the table/view identifier. This allows consistent resolution
rules (temp view first, etc.) to be applied for both v1/v2 commands.
More info about the consistent resolution rule proposal can be found in
[JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal
doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
The current behavior is not consistent between v1 and v2 commands when
resolving a temp view. In v2, the `t` in the following example is
resolved to a table:
```scala sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
``` whereas in v1, the `t` is resolved to a temp view:
```scala sql("CREATE DATABASE test") sql("CREATE TABLE
spark_catalog.test.t (id bigint) USING csv") sql("CREATE TEMPORARY VIEW
t AS SELECT 2") sql("USE spark_catalog.test") sql("DROP TABLE t") // 't'
is resolved to a temp view
```
### Does this PR introduce _any_ user-facing change?
After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t`
instead of `testcat.ns.t`, consistent with v1 behavior.
### How was this patch tested?
Added a new test
Closes #30079 from imback82/drop_table_consistent.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 528160f)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveNoopDropTable.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalogSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/PlanResolutionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveCatalogs.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
Commit 9fb45361fd00b046e04748e1a1c8add3fa09f01c by wenchen
[SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical
rule to remove redundant sorts
### What changes were proposed in this pull request? This PR aims to fix
a correctness bug in the optimizer rule `EliminateSorts`. It also adds a
new physical rule to remove redundant sorts that cannot be eliminated in
the Optimizer rule after the bugfix.
### Why are the changes needed? A global sort should not be eliminated
even if its child is ordered since we don't know if its child ordering
is global or local. For example, in the following scenario, the first
sort shouldn't be removed because it has a stronger guarantee than the
second sort even if the sort orders are the same for both sorts.
```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```
Since there is no straightforward way to identify whether a node's
output ordering is local or global, we should not remove a global sort
even if its child is already ordered.
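As an illustrative simplification of that intent (not the actual rule; `SortOrder.orderingSatisfies` is an existing catalyst helper):
```scala
import org.apache.spark.sql.catalyst.expressions.SortOrder

// A sort is redundant only when it is local (non-global) and the child's
// known ordering already satisfies the requested one; a global sort must be
// kept because the child's ordering may hold only within each partition.
def isRedundantSort(order: Seq[SortOrder], global: Boolean, childOrdering: Seq[SortOrder]): Boolean =
  !global && SortOrder.orderingSatisfies(childOrdering, order)
```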
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? Unit tests
Closes #30093 from allisonwang-db/fix-sort.
Authored-by: allisonwang-db
<66282705+allisonwang-db@users.noreply.github.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 9fb4536)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/RemoveRedundantSorts.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/RemoveRedundantSortsSuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
Commit 3c3ad5f7c00f6f68bc659d4cf7020fa944b7bc69 by wenchen
[SPARK-32934][SQL] Improve the performance for NTH_VALUE and refactor the OffsetWindowFunction
### What changes were proposed in this pull request?
Spark SQL supports window functions like `NTH_VALUE`. If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations. For example, if we execute the SQL shown below:
```sql
SELECT NTH_VALUE(col, 2) OVER (ORDER BY rank ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tab;
```
the output is a fixed value for every row number greater than 1, and null otherwise. So we only need to calculate the value once and track whether the row number is less than 2. The `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` case is even simpler.
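A usage example of the same pattern through the DataFrame API; `df` and its columns `col` and `rank` are assumed to exist, and `nth_value` is the corresponding function in `org.apache.spark.sql.functions`:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, nth_value}

// The frame is fixed at UNBOUNDED PRECEDING .. CURRENT ROW, so once the row
// number reaches 2 the result of nth_value(col, 2) never changes within the
// partition and can be computed a single time.
val w = Window.orderBy("rank").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.select(nth_value(col("col"), 2).over(w)).show()
```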
### Why are the changes needed?
Improve the performance of `NTH_VALUE`, `FIRST_VALUE`, and `LAST_VALUE`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins tests.
Closes #29800 from beliefer/optimize-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer
<beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 3c3ad5f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/window.sql.out (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/window.sql (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
Commit 2b8fe6d9ae2fe31d1545da98003f931ee1aa11d5 by gurwls223
[SPARK-33269][INFRA] Ignore ".bsp/" directory in Git
### What changes were proposed in this pull request?
After the SBT upgrade to 1.4.0 and above, there is always a ".bsp" directory after sbt starts (https://github.com/sbt/sbt/releases/tag/v1.4.0). This PR puts the directory into `.gitignore`.
### Why are the changes needed?
The ".bsp" directory shows up as untracked in git during development, which is annoying.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual local test
Closes #30171 from gengliangwang/ignoreBSP.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 2b8fe6d)
The file was modified .gitignore (diff)
Commit b26ae98407c6c017a4061c0c420f48685ddd6163 by wenchen
[SPARK-33208][SQL] Update the document of SparkSession#sql
### What changes were proposed in this pull request?
Update the documentation of SparkSession#sql, mentioning that this API eagerly runs DDL/DML commands, but not SELECT queries.
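A short example of the documented behavior; the table name is illustrative:
```scala
// DDL/DML commands run eagerly as soon as sql() is called...
spark.sql("CREATE TABLE t(i INT) USING parquet") // executed immediately
// ...while SELECT queries stay lazy until an action triggers them.
val q = spark.sql("SELECT * FROM t") // nothing runs yet
q.show()                             // execution happens here
```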
### Why are the changes needed?
To clarify the behavior of SparkSession#sql.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested?
Not needed; this is a documentation-only change.
Closes #30168 from waitinfuture/master.
Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b26ae98)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala (diff)
Commit a6216e2446b6befc3f6d6b370e694421aadda9dd by dhyun
[SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to
PythonUserDefinedType
### What changes were proposed in this pull request?
This PR intends to fix bugs when casting data from/to PythonUserDefinedType. A sequence of queries to reproduce the issue is as follows:
```
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import *
>>> from pyspark.testing.sqlutils import *
>>>
>>> row = Row(point=ExamplePoint(1.0, 2.0))
>>> df = spark.createDataFrame([row])
>>> df.select(col("point").cast(PythonOnlyUDT()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/dataframe.py", line 1402, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/Users/maropu/Repositories/spark/spark-master/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/maropu/Repositories/spark/spark-master/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.select.
: java.lang.NullPointerException
  at org.apache.spark.sql.types.UserDefinedType.acceptsType(UserDefinedType.scala:84)
  at org.apache.spark.sql.catalyst.expressions.Cast$.canCast(Cast.scala:96)
  at org.apache.spark.sql.catalyst.expressions.CastBase.checkInputDataTypes(Cast.scala:267)
  at org.apache.spark.sql.catalyst.expressions.CastBase.resolved$lzycompute(Cast.scala:290)
  at org.apache.spark.sql.catalyst.expressions.CastBase.resolved(Cast.scala:290)
```
A root cause of this issue is that, since `PythonUserDefinedType#userClass` is always null, `isAssignableFrom` in `UserDefinedType#acceptsType` throws a NullPointerException. To fix it, this PR defines `acceptsType` in `PythonUserDefinedType` and filters out the null case in `UserDefinedType#acceptsType`.
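A hedged sketch of the described guard, written as a standalone helper rather than the actual private method in UserDefinedType.scala:
```scala
import org.apache.spark.sql.types.{DataType, UserDefinedType}

// Skip the userClass comparison when either side is null, which is exactly
// the PythonUserDefinedType case that used to throw the NullPointerException.
def acceptsTypeSafely(self: UserDefinedType[_], dataType: DataType): Boolean =
  dataType match {
    case other: UserDefinedType[_] if self.userClass != null && other.userClass != null =>
      self.userClass.isAssignableFrom(other.userClass)
    case _ => false
  }
```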
### Why are the changes needed?
Bug fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes #30169 from maropu/FixPythonUDTCast.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: a6216e2)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (diff)
The file was modified python/pyspark/sql/tests/test_types.py (diff)
Commit a744fea3be12f1a53ab553040b95da730210bc88 by dhyun
[SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values
contains null
### What changes were proposed in this pull request?
This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which leverages the hash code of the key, hence it can trigger the NPE issue.
### Why are the changes needed?
This is an obvious bug, as the `In` filter doesn't take null values into account when calculating its hash code.
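A minimal, self-contained illustration of a null-safe hash; `InFilter` is a stand-in invented for this sketch, not the real `org.apache.spark.sql.sources.In`:
```scala
// Stand-in for the real In filter; only the hashing logic matters here.
case class InFilter(attribute: String, values: Array[Any]) {
  override def hashCode(): Int = {
    var h = attribute.hashCode
    values.foreach { v =>
      // Treat null as 0 instead of dereferencing it, avoiding the NPE.
      h = h * 41 + (if (v != null) v.hashCode() else 0)
    }
    h
  }
}
```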
### Does this PR introduce _any_ user-facing change?
Yes. Previously, a query with `null` in an `in` condition against a data source V2 table supporting filter push-down failed with an NPE; after this PR, the query no longer fails.
### How was this patch tested?
UT added. The new UT fails without the PR and passes with the PR.
Closes #30170 from HeartSaVioR/SPARK-33267.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: a744fea)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala (diff)