Changes

Summary

  1. [SPARK-33362][SQL] skipSchemaResolution should still require query to be resolved (commit: 26ea417)
  2. [SPARK-33353][BUILD] Cache dependencies for Coursier with new sbt in GitHub Actions (commit: 208b94e)
  3. [SPARK-33290][SQL][DOCS][FOLLOW-UP] Update SQL migration guide (commit: 1a70479)
  4. [SPARK-33185][YARN] Set up yarn.Client to print direct links to driver stdout/stderr (commit: 324275a)
  5. [SPARK-33360][SQL] Simplify DS v2 write resolution (commit: cd4e3d3)
  6. [SPARK-33365][BUILD] Update SBT to 1.4.2 (commit: 4941b7a)
  7. [MINOR][SQL] Fix incorrect JIRA ID comments in Analyzer (commit: 90f35c6)
  8. [SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments (commit: d163110)
  9. [SPARK-33342][WEBUI] Fix the wrong URL and display name of blocking thread in threadDump page (commit: f6c0007)
  10. [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect) (commit: 733a468)
  11. [SPARK-33364][SQL] Introduce the "purge" option in TableCatalog.dropTable for v2 catalog (commit: 68c032c)
  12. [SPARK-23432][UI] Add executor peak JVM memory metrics in executors page (commit: 93ad26b)
  13. [SPARK-33291][SQL] Improve DataFrame.show for nulls in arrays and structs (commit: 09fa7ec)
  14. [SPARK-33347][CORE] Cleanup useless variables of MutableApplicationInfo (commit: fb9c873)
Commit 26ea417b1448d679fdc777705ee2f99f4e741ef3 by dhyun
[SPARK-33362][SQL] skipSchemaResolution should still require query to be
resolved
### What changes were proposed in this pull request?
Fix a small bug in `V2WriteCommand.resolved`. It should always require
the `table` and `query` to be resolved.
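For illustration, here is a minimal, self-contained sketch of the invariant being fixed. `Plan` and `V2Write` are hypothetical stand-ins, not the actual catalyst classes:
```scala
// Hypothetical stand-ins for LogicalPlan / V2WriteCommand, illustrating
// the invariant only: skipSchemaResolution may skip the output-schema
// compatibility check, but must never skip resolution of the table and
// the input query.
trait Plan { def resolved: Boolean }

trait V2Write {
  def table: Plan
  def query: Plan
  def skipSchemaResolution: Boolean
  def outputResolved: Boolean

  // The fix: table and query resolution is always required.
  def resolved: Boolean =
    table.resolved && query.resolved && (skipSchemaResolution || outputResolved)
}
```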
### Why are the changes needed?
To prevent potential bugs where we skip resolving the input query.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
a new test
Closes #30265 from cloud-fan/ds-minor-2.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 26ea417)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DataSourceV2AnalysisSuite.scala (diff)
Commit 208b94e4c1e5c500e76c54e8f7a2be6a07ef3f7a by dhyun
[SPARK-33353][BUILD] Cache dependencies for Coursier with new sbt in
GitHub Actions
### What changes were proposed in this pull request?
This PR changes the behavior of the GitHub Actions job that caches
dependencies. SPARK-33226 upgraded sbt to 1.4.1, and as of 1.3.0, sbt
uses Coursier as the dependency resolver/fetcher. So let's change the
dependency cache configuration for the GitHub Actions job accordingly.
### Why are the changes needed?
To make the build faster with Coursier for the GitHub Actions job.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be done by GitHub Actions itself.
Closes #30259 from sarutak/coursier-cache.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 208b94e)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 1a704793f4846610307d18a8bf5e23a3f97525d3 by dhyun
[SPARK-33290][SQL][DOCS][FOLLOW-UP] Update SQL migration guide
### What changes were proposed in this pull request?
Update SQL migration guide for SPARK-33290
### Why are the changes needed?
Make the change better documented.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30256 from sunchao/SPARK-33290-2.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: 1a70479)
The file was modified docs/sql-migration-guide.md (diff)
Commit 324275ae8350ec15844ce384f40f1ecc4acdc072 by mridulatgmail.com
[SPARK-33185][YARN] Set up yarn.Client to print direct links to driver
stdout/stderr
### What changes were proposed in this pull request?
Currently, when run in `cluster` mode on YARN, the Spark `yarn.Client`
will print out the application report into the logs, to be easily viewed
by users. For example:
```
INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
diagnostics: N/A
ApplicationMaster host: X.X.X.X
ApplicationMaster RPC port: 0
queue: default
start time: 1602782566027
final status: UNDEFINED
tracking URL: http://hostname:8888/proxy/application_<id>/
user: xkrogen
```
I propose adding, alongside the application report, some additional
lines like:
```
        Driver Logs (stdout):
http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stdout?start=-4096
        Driver Logs (stderr):
http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stderr?start=-4096
```
This information isn't contained in the `ApplicationReport`, so it's
necessary to query the ResourceManager REST API. For now I have added
this as an always-on feature, but if there is any concern about adding
this REST dependency, I think hiding this feature behind an
off-by-default flag is reasonable.
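As a rough illustration of that REST dependency (not Spark's actual implementation), the public YARN ResourceManager REST API exposes a per-attempt `logsLink` for the AM container under `/ws/v1/cluster/apps/{appId}/appattempts`; a hedged sketch, with error handling and authentication omitted:
```scala
import java.net.URL
import scala.io.Source
import scala.util.Try

object DriverLogLinks {
  // Fetch the app attempts from the RM REST API, pull out the AM
  // container's logsLink, and append the stdout/stderr suffixes shown
  // in the report above. Crude regex parsing, for illustration only.
  def apply(rmWebUrl: String, appId: String): Option[(String, String)] = Try {
    val json = Source.fromURL(
      new URL(s"$rmWebUrl/ws/v1/cluster/apps/$appId/appattempts")).mkString
    val logsLink = """"logsLink"\s*:\s*"([^"]+)"""".r
      .findFirstMatchIn(json).map(_.group(1)).get
    (s"$logsLink/stdout?start=-4096", s"$logsLink/stderr?start=-4096")
  }.toOption
}
```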
### Why are the changes needed?
Typically, the tracking URL can be used
to find the logs of the ApplicationMaster/driver while the application
is running. Later, the Spark History Server can be used to track this
information down, using the stdout/stderr links on the Executors page.
However, in the situation when the driver crashed _before_ writing out a
history file, the SHS may not be aware of this application, and thus
does not contain links to the driver logs. When this situation arises,
it can be difficult for users to debug further, since they can't easily
find their driver logs.
It is possible to reach the logs by using the `yarn logs` commands, but
the average Spark user isn't aware of this and shouldn't have to be.
With this information readily available in the logs, users can quickly
jump to their driver logs, even if it crashed before the SHS became
aware of the application. This has the additional benefit of providing a
quick way to access driver logs, which often contain useful information,
in a single click (instead of navigating through the Spark UI).
### Does this PR introduce _any_ user-facing change?
Yes, some additional print statements will be created in the application
report when using YARN in cluster mode.
### How was this patch tested?
Added unit tests for the parsing logic in `yarn.ClientSuite`. Also
tested against a live cluster. When the driver is running:
```
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
INFO Client:
        client token: Token { kind: YARN_CLIENT_TOKEN, service:  }
        diagnostics: N/A
        ApplicationMaster host: host.example.com
        ApplicationMaster RPC port: ######
        queue: queue_name
        start time: 1604529046091
        final status: UNDEFINED
        tracking URL:
http://host.example.com:8080/proxy/application_XXXXXXXXX_YYYYYY/
        user: xkrogen
        Driver Logs (stdout):
http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stdout?start=-4096
        Driver Logs (stderr):
http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stderr?start=-4096
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
```
I confirmed that when the driver has not yet launched, the report does
not include the two Driver Logs items. I will omit the output here for
brevity since it looks the same.
Closes #30096 from xkrogen/xkrogen-SPARK-33185-yarn-client-print.
Authored-by: Erik Krogen <xkrogen@apache.org> Signed-off-by: Mridul
Muralidharan <mridul<at>gmail.com>
(commit: 324275a)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala (diff)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala (diff)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/util/YarnContainerInfoHelper.scala (diff)
Commit cd4e3d3b0c7b1ec645ec9c3b2a1847ce29a65765 by dhyun
[SPARK-33360][SQL] Simplify DS v2 write resolution
### What changes were proposed in this pull request?
Remove duplicated code in `ResolveOutputRelation` by adding
`V2WriteCommand.withNewQuery`.
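A simplified sketch of the idea, with hypothetical stand-in types rather than the actual catalyst classes: each write command can rebuild itself with a rewritten input query, so the analyzer rule needs only one generic code path instead of one branch per command:
```scala
// Hypothetical types illustrating the withNewQuery pattern.
trait Plan

trait V2Write {
  def query: Plan
  def withNewQuery(newQuery: Plan): V2Write
}

case class Append(target: String, query: Plan) extends V2Write {
  override def withNewQuery(newQuery: Plan): Append = copy(query = newQuery)
}

case class Overwrite(target: String, query: Plan) extends V2Write {
  override def withNewQuery(newQuery: Plan): Overwrite = copy(query = newQuery)
}

object ResolveOutput {
  // One generic path: rewrite the query, rebuild the command.
  def apply(cmd: V2Write, resolve: Plan => Plan): V2Write =
    cmd.withNewQuery(resolve(cmd.query))
}
```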
### Why are the changes needed?
code cleanup
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #30264 from cloud-fan/ds-minor.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: cd4e3d3)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameWriterV2Suite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/DataSourceV2AnalysisSuite.scala (diff)
Commit 4941b7ae18d4081233953cc11328645d0b4cf208 by dhyun
[SPARK-33365][BUILD] Update SBT to 1.4.2
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.1 to 1.4.2.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.2
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30268 from williamhyun/sbt.
Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 4941b7a)
The file was modified project/build.properties (diff)
Commit 90f35c663e4118b7a716e614f37b8d888d0d6bd6 by gurwls223
[MINOR][SQL] Fix incorrect JIRA ID comments in Analyzer
### What changes were proposed in this pull request?
This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by
SPARK-31670 (https://github.com/apache/spark/pull/28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```
### Why are the changes needed?
Fix the wrong information.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a comment-only change. Manually reviewed.
Closes #30269 from dongjoon-hyun/SPARK-31670-MINOR.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 90f35c6)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit d16311051d4c67b65116ed182c87f96656b63333 by wenchen
[SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments
### What changes were proposed in this pull request?
1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset-based window functions.
2. Refine and add more comments.
3. Remove `isRelative`, as it's useless.
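For orientation, a hedged sketch of what the renamed base class abstracts over (simplified and hypothetical, not the actual catalyst source):
```scala
// Hypothetical, simplified shape of an offset-based window function
// (e.g. lead/lag): evaluate `input` at a row `offset` away from the
// current row, falling back to `default` when that row doesn't exist.
trait Expr

abstract class OffsetWindowFunctionSketch {
  def input: Expr    // the expression to evaluate at the offset row
  def offset: Expr   // how many rows ahead (lead) or behind (lag)
  def default: Expr  // value to return when the offset row is out of range
}
```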
### Why are the changes needed?
code refinement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #30261 from cloud-fan/window.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: d163110)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
Commit f6c00079709b6dcda72b08d3e9865ca6b49f8b74 by gengliang.wang
[SPARK-33342][WEBUI] Fix the wrong URL and display name of blocking
thread in threadDump page
### What changes were proposed in this pull request?
Fix the wrong URL and display name of the blocking thread in the
threadDump page. The `blockingThreadId` variable passed to the page
should be of type `String` instead of `Option`.
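The gist of the bug, as a hedged sketch with hypothetical names: an `Option` was interpolated directly into the page, so the UI rendered `Some(123)` rather than `123`; converting to a plain string first fixes both the display name and the generated link:
```scala
object ThreadDumpLabelSketch {
  val blockingThreadId: Option[Long] = Some(123L)

  // Before: the Option itself leaks into the markup.
  val buggy = s"blocked by thread $blockingThreadId" // "blocked by thread Some(123)"

  // After: render a plain string (empty when there is no blocker).
  val fixed = s"blocked by thread ${blockingThreadId.map(_.toString).getOrElse("")}"
}
```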
### Why are the changes needed?
The blocking thread ID is not displayed correctly on the UI page, and
the corresponding URL does not redirect properly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The PR only involves minor changes to the page and does not affect other
functions. The manual test results are as follows: the thread name
displayed on the page is correct, and clicking the URL redirects to the
corresponding page.
![shows_ok](https://user-images.githubusercontent.com/52202080/98108177-89488d00-1ed6-11eb-9488-8446c3f38bad.gif)
Closes #30249 from akiyamaneko/thread-dump-improve.
Authored-by: neko <echohlne@gmail.com> Signed-off-by: Gengliang Wang
<gengliang.wang@databricks.com>
(commit: f6c0007)
The file was modified core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala (diff)
Commit 733a468726849ba17ab27bd20895f253590fedcb by wenchen
[SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add,
update type and nullability of columns (MsSqlServer dialect)
### What changes were proposed in this pull request?
Override the default SQL strings in the MsSqlServer JDBC dialect,
according to the official documentation, for:
- ALTER TABLE RENAME COLUMN
- ALTER TABLE UPDATE COLUMN NULLABILITY
Also write MsSqlServer integration tests for JDBC.
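For reference, a hedged sketch of the kind of SQL involved (helper names hypothetical; see `MsSqlServerDialect.scala` in the diff for the real overrides). SQL Server renames columns via `sp_rename`, and its `ALTER COLUMN` must restate the data type when changing nullability:
```scala
object MsSqlServerSyntaxSketch {
  // RENAME COLUMN: SQL Server has no ANSI RENAME COLUMN; use sp_rename.
  def renameColumn(table: String, col: String, newCol: String): String =
    s"EXEC sp_rename '$table.$col', '$newCol', 'COLUMN'"

  // UPDATE COLUMN NULLABILITY: the data type must be repeated.
  def updateColumnNullability(
      table: String, col: String, dataType: String, nullable: Boolean): String = {
    val nullability = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $col $dataType $nullability"
  }
}
```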
### Why are the changes needed?
To add the support for alter table when interacting with MSSql Server.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
added tests
Closes #30038 from ScrapCodes/mssql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 733a468)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MsSqlServerDialect.scala (diff)
The file was added external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MsSqlServerIntegrationSuite.scala
Commit 68c032c246bb091b25d80e436b9288cca9245265 by dhyun
[SPARK-33364][SQL] Introduce the "purge" option in
TableCatalog.dropTable for v2 catalog
### What changes were proposed in this pull request?
This PR proposes to introduce the `purge` option in
`TableCatalog.dropTable` so that v2 catalogs can use the option if
needed.
Related discussion:
https://github.com/apache/spark/pull/30079#discussion_r510594110
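As a rough sketch of the shape of such an option (simplified and hypothetical; see `TableCatalog.java` in the diff for the actual API):
```scala
// Hypothetical, simplified catalog trait illustrating a purge-aware drop.
trait Ident

trait PurgeAwareCatalog {
  /** Drop the table; if purge is true, delete the data immediately
    * instead of moving it to a trash location. */
  def dropTable(ident: Ident, purge: Boolean): Boolean

  /** Existing behavior preserved: a plain drop does not purge. */
  def dropTable(ident: Ident): Boolean = dropTable(ident, purge = false)
}
```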
### Why are the changes needed?
Spark DDL supports passing the purge option to the `DROP TABLE` command.
However, the option is currently ignored for v2 catalogs.
### Does this PR introduce _any_ user-facing change?
This PR introduces a new API in `TableCatalog`.
### How was this patch tested?
Added a test.
Closes #30267 from imback82/purge_table.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: 68c032c)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropTableExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit 93ad26be01a47fb075310a26188e238d55110098 by kabhwan.opensource
[SPARK-23432][UI] Add executor peak jvm memory metrics in executors page
### What changes were proposed in this pull request?
Add executor peak JVM memory metrics to the Executors page:
![image](https://user-images.githubusercontent.com/1633312/97767765-9121bf00-1adb-11eb-93c7-7912d9fe7826.png)
### Why are the changes needed?
Users can see executor peak JVM memory metrics on the Executors page.
### Does this PR introduce _any_ user-facing change?
Yes, users can now see executor peak JVM memory metrics on the Executors page.
### How was this patch tested?
Manually tested.
Closes #30186 from warrenzhu25/23432.
Authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: Jungtaek
Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 93ad26b)
The file was modified core/src/main/resources/org/apache/spark/ui/static/executorspage.js (diff)
The file was modified core/src/main/resources/org/apache/spark/ui/static/executorspage-template.html (diff)
Commit 09fa7ecae146c0865fc535b4b17175ca5714cfa4 by viirya
[SPARK-33291][SQL] Improve DataFrame.show for nulls in arrays and
structs
### What changes were proposed in this pull request?
The changes in [SPARK-32501: Inconsistent NULL conversions to
strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced
some behavior that I'd like to clean up a bit.
Here's sample code to illustrate the behavior I'd like to clean up:
```scala
val rows = Seq[String](null)
  .toDF("value")
  .withColumn("struct1", struct('value as "value1"))
  .withColumn("struct2", struct('value as "value1", 'value as "value2"))
  .withColumn("array1", array('value))
  .withColumn("array2", array('value, 'value))

// Show the DataFrame using the "first" code path.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2      |array1|array2  |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[]    |[, null]|
+-----+-------+-------------+------+--------+

// Write the DataFrame to disk, then read it back and show it to trigger
// the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)
+-----+-------+-------------+-------+-------------+
|value|struct1|struct2      |array1 |array2       |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```
Notice:
1. If the first element of a struct is null, it is printed with a leading space (e.g. `{ null}`). I think it's preferable to print it without the leading space (e.g. `{null}`). This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space. I think both code paths should be consistent and print it without a leading space (e.g. `[null]`).
The desired result of this PR is to produce the following output via
both code paths:
```
+-----+-------+------------+------+------------+
|value|struct1|struct2     |array1|array2      |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```
This contribution is my original work and I license the work to the
project under the project’s open source license.
### Why are the changes needed?
To correct errors and inconsistencies in how DataFrame.show() displays
nulls inside arrays and structs.
### Does this PR introduce _any_ user-facing change?
Yes.  This PR changes what is printed out by DataFrame.show().
### How was this patch tested?
I added new test cases in CastSuite.scala to cover the cases addressed
by this PR.
Closes #30189 from stwhit/show_nulls.
Authored-by: Stuart White <stuart.white1@gmail.com> Signed-off-by:
Liang-Chi Hsieh <viirya@gmail.com>
(commit: 09fa7ec)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuite.scala (diff)
Commit fb9c873e7d5c81f312b26e46df32b1aadc6670b7 by kabhwan.opensource
[SPARK-33347][CORE] Cleanup useless variables of MutableApplicationInfo
### What changes were proposed in this pull request?
There are 4 fields in `MutableApplicationInfo` that seem useless:
- `coresGranted`
- `maxCores`
- `coresPerExecutor`
- `memoryPerExecutorMB`
They are always `None` and never reassigned.
So the main change of this PR is to clean up these useless fields in
`MutableApplicationInfo`.
### Why are the changes needed?
Clean up useless variables.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.
Closes #30251 from LuciferYang/SPARK-33347.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Jungtaek Lim
(HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: fb9c873)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala (diff)