Changes

Summary

  1. [SPARK-33228][SQL] Don't uncache data when replacing a view having the same logical plan (commit: 87b4984) (details)
  2. [SPARK-33234][INFRA] Generates SHA-512 using shasum (commit: ce0ebf5) (details)
  3. [SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive (commit: 56ab60f) (details)
  4. Revert "[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive" (commit: 369cc61) (details)
  5. [SPARK-32862][SS] Left semi stream-stream join (commit: d87a0bb) (details)
  6. [SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations take effect at runtime (commit: a21945c) (details)
  7. [SPARK-33239][INFRA] Use pre-built image at GitHub Action SparkR job (commit: 850adeb) (details)
  8. [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query) (commit: 1042d49) (details)
  9. [SPARK-33204][UI] The 'Event Timeline' area cannot be opened when a spark application has some failed jobs (commit: 11bbb13) (details)
  10. [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID" (commit: 02fa19f) (details)
Commit 87b498462b82fce02dd50286887092cf7858d2e8 by dhyun
[SPARK-33228][SQL] Don't uncache data when replacing a view having the
same logical plan
### What changes were proposed in this pull request?
SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the
cache when replacing an existing view. However, this change drops the
cache even when replacing a view having the same logical plan. A
sequence of queries to reproduce this follows:
```
// Spark v2.4.6+
scala> val df = spark.range(1).selectExpr("id a", "id b")
scala> df.cache()
scala> df.explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
      +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
               +- *(1) Range (0, 1, step=1, splits=4)

scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
      +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
               +- *(1) Range (0, 1, step=1, splits=4)

// If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's swept away
scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) Project [id#0L AS a#2L, id#0L AS b#3L]
+- *(1) Range (0, 1, step=1, splits=4)

// Until v2.4.6
scala> val df = spark.range(1).selectExpr("id a", "id b")
scala> df.cache()
scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
20/10/23 22:33:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
== Physical Plan ==
*(1) InMemoryTableScan [a#2L, b#3L]
   +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
            +- *(1) Range (0, 1, step=1, splits=4)

scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) InMemoryTableScan [a#2L, b#3L]
   +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
            +- *(1) Range (0, 1, step=1, splits=4)
```
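To make the intended behavior concrete, here is a minimal sketch (not the actual patch; the helper name is illustrative) of the guard this fix implies: only drop the cache when the replacement view's analyzed plan actually differs from the existing one.
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative helper, assuming `existingView` is the analyzed plan of the view
// being replaced (if any) and `newView` is the analyzed plan of the replacement.
def shouldUncache(existingView: Option[LogicalPlan], newView: LogicalPlan): Boolean =
  existingView match {
    case Some(oldPlan) => !oldPlan.sameResult(newView) // same plan => keep the cache
    case None          => false                        // nothing cached to drop
  }
```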
### Why are the changes needed?
bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes #30140 from maropu/FixBugInReplaceView.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 87b4984)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
Commit ce0ebf5f023b1d2230bbd4b9ffad294edef3bca7 by dhyun
[SPARK-33234][INFRA] Generates SHA-512 using shasum
### What changes were proposed in this pull request?
I am generating the SHA-512 using the standard `shasum` tool, which also
produces better output than GPG.
### Why are the changes needed?
This makes the hash much easier to verify for users that don't have GPG.
A user with GPG can check the keys, but a user without GPG will have a
hard time validating the SHA-512 based on the 'pretty printed' format.
Apache Spark is the only project where I've seen this format. Most other
Apache projects have a one-line hash file.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This patch assumes the build system has shasum (it should, but I can't
test this).
Closes #30123 from emilianbold/master.
Authored-by: Emi <emilian.bold@gmail.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: ce0ebf5)
The file was modified dev/create-release/release-build.sh (diff)
Commit 56ab60fb7ae37ca64d668bc4a1f18216cc7186fd by gurwls223
[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same
with hive
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, the
schema-less result differs from Hive. This PR keeps the result the same
as Hive's script transform with serde.
#### Hive Script Transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> DESCRIBE v;
key                 string
value               string

hive> SELECT * FROM v;
1 1 1
2 2 2

hive> SELECT key FROM v;
1
2

hive> SELECT value FROM v;
1 1
2 2
```
#### Spark script transform with hive serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> SELECT * FROM v;
1   1
2   2
```
**No serde mode in hive (ROW FORMAT DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)
### Why are the changes needed?
Keep the same behavior as Hive script transform.
### Does this PR introduce _any_ user-facing change?
Before this PR, with Hive serde script transform:
```
select transform(*) USING 'cat' from (
  select 1, 2, 3, 4
) tmp

key     value
1       2
```
After:
```
select transform(*) USING 'cat' from (
  select 1, 2, 3, 4
) tmp

key     value
1       2   3  4
```
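The same before/after can be reproduced from the Scala API; a small sketch (assuming a Hive-enabled `SparkSession` named `spark`, and noting that this commit is reverted in the next entry):
```scala
// Schema-less TRANSFORM: the output schema is the two string columns `key` and `value`.
val df = spark.sql(
  "SELECT TRANSFORM(*) USING 'cat' FROM (SELECT 1, 2, 3, 4) tmp")
df.printSchema() // key: string, value: string
// Before this change, `value` only held the second field; after it, `value`
// carries the remaining tab-separated fields ("2\t3\t4"), matching Hive.
df.show(truncate = false)
```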
### How was this patch tested? UT
Closes #29421 from AngersZhuuuu/SPARK-32388.
Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 56ab60f)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala (diff)
Commit 369cc614f369f9fd9be5b13a3f047a261c8e8d90 by gurwls223
Revert "[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep
the same with hive"
This reverts commit 56ab60fb7ae37ca64d668bc4a1f18216cc7186fd.
(commit: 369cc61)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala (diff)
Commit d87a0bb2caa6804d59130c41a4c005acb2e4aad2 by kabhwan.opensource
[SPARK-32862][SS] Left semi stream-stream join
### What changes were proposed in this pull request?
This is to support left semi join in stream-stream join. The
implementation of left semi join is (mostly in
`StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For a left side input row, check if there's a match in the right side state store.
  * If there's a match, output the left side row, but do not put the row in the left side state store (there is no need to keep it there).
  * If there's no match, output nothing, but put the row in the left side state store (with its "matched" field set to false).
* For a right side input row, check if there's a match in the left side state store.
  * For all matching left rows in the state store whose "matched" field is still false, output them and then set their "matched" field to true. Outputting left side rows only on their first match guarantees left semi join semantics.
* State store eviction: evict rows from the left/right side state store below the watermark, same as for inner join.

Note that a follow-up optimization could evict matched left side rows from
the state store earlier, even while those rows are still above the
watermark. However, that needs more changes in
`SymmetricHashJoinStateManager`, so it is left as a follow-up.
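For illustration, a minimal usage sketch of the new join type (the rate source, column names, and watermark values below are illustrative, not taken from this PR):
```scala
import org.apache.spark.sql.functions.expr

// Two illustrative streams: ad impressions (left) and clicks (right).
val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")
  .alias("imp")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 minutes")
  .alias("click")

// Left semi: keep each impression that has at least one click within an hour,
// emitted once, without duplicating it per click.
val impressionsWithClicks = impressions.join(
  clicks,
  expr("imp.adId = click.adId AND " +
    "clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"),
  "left_semi")
```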
### Why are the changes needed?
The current stream-stream join supports inner, left outer and right outer joins
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166).
Internally we see a lot of users doing left semi stream-stream joins
(outside Spark Structured Streaming), e.g. "I want to get the ad
impressions (left side of the join) which have a click (right side of the
join), but I don't care how many clicks each ad got" (left semi semantics).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests in `UnsupportedOperationChecker.scala` and
`StreamingJoinSuite.scala`.
Closes #30076 from c21/stream-join.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim
(HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: d87a0bb)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/JoinedRow.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala (diff)
Commit a21945ce6c725896d19647891d1f9fa9ef74bd87 by yamamuro
[SPARK-33197][SQL] Make changes to spark.sql.analyzer.maxIterations take
effect at runtime
### What changes were proposed in this pull request?
Make changes to `spark.sql.analyzer.maxIterations` take effect at
runtime.
### Why are the changes needed?
`spark.sql.analyzer.maxIterations` is not a static conf. However, before
this patch, changing `spark.sql.analyzer.maxIterations` at runtime does
not take effect.
### Does this PR introduce _any_ user-facing change?
Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at
runtime does not take effect.
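A usage sketch of what "takes effect at runtime" means here (assuming an active `SparkSession` named `spark`):
```scala
// Before this patch, updating the conf in a running session was ignored by
// the analyzer; after it, the new limit is honored on the next query.
spark.conf.set("spark.sql.analyzer.maxIterations", "50")
spark.sql("SELECT 1 AS a").collect()
```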
### How was this patch tested?
modified unit test
Closes #30108 from yuningzh-db/dynamic-analyzer-max-iterations.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: a21945c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
Commit 850adeb0fd188cc3cb6319758d58a12554cb6149 by dhyun
[SPARK-33239][INFRA] Use pre-built image at GitHub Action SparkR job
### What changes were proposed in this pull request?
This PR aims to use a pre-built image for the GitHub Action SparkR job.
### Why are the changes needed?
This will reduce the execution time and the flakiness.
**BEFORE (21 minutes 39 seconds)**
![Screen Shot 2020-10-16 at 1 24 43
PM](https://user-images.githubusercontent.com/9700541/96305593-fbeada80-0fb2-11eb-9b8e-86d8abaad9ef.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action `sparkr` job in this PR.
Closes #30066 from dongjoon-hyun/SPARKR.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 850adeb)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 1042d49bf9d7bb5162215e981e2f8e98164b2aff by yamamuro
[SPARK-33075][SQL] Enable auto bucketed scan by default (disable only
for cached query)
### What changes were proposed in this pull request?
This PR enables auto bucketed table scan by default, with the exception
that it is disabled for cached queries (similar to AQE). The reason for
disabling auto scan for cached queries is that the cached query's output
partitioning can be leveraged later to avoid shuffle and sort when doing
joins and aggregates.
### Why are the changes needed?
Enabling auto bucketed table scan by default is useful as it can optimize
queries automatically under the hood, without user interaction.
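A hedged sketch of the scenario this change targets (the conf name below is an assumption not stated in this changelog; table and column names are illustrative):
```scala
// A bucketed table whose bucketing is not useful for a simple filter query:
// with auto bucketed scan enabled by default, Spark can read it as a plain
// (non-bucketed) scan and use more parallelism.
spark.range(1000).write.bucketBy(8, "id").saveAsTable("bucketed_t")
spark.table("bucketed_t").filter("id > 10").explain()

// For a cached query, bucketed scan stays on, so the cached output
// partitioning can still be reused to avoid a shuffle in a later aggregate.
val cached = spark.table("bucketed_t").cache()
cached.groupBy("id").count().explain()

// Assumed conf name for opting out; verify against SQLConf in the Spark tree.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```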
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test for cached query in
`DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit
tests which should disable auto bucketed scan to make them work.
Closes #30138 from c21/enable-auto-bucket.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro
<yamamuro@apache.org>
(commit: 1042d49)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanHelper.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/DisableUnnecessaryBucketedScanSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala (diff)
Commit 11bbb130df7b083f42acf0207531efe3912d89eb by gengliang.wang
[SPARK-33204][UI] The 'Event Timeline' area cannot be opened when a
spark application has some failed jobs
### What changes were proposed in this pull request?
The page returned by `/jobs` in the Spark UI stores the detail
information of each job in JavaScript like this:
```javascript
{
  'className': 'executor added',
  'group': 'executors',
  'start': new Date(1602834008978),
  'content': '<div class="executor-event-content"' +
    'data-toggle="tooltip" data-placement="top"' +
    'data-title="Executor 3<br>' +
    'Added at 2020/10/16 15:40:08"' +
    'data-html="true">Executor 3 added</div>'
}
```
If an application has a failed job, the failure reason corresponding to
the job is stored in the `content` field of the JavaScript. If the
failure reason contains the character **'**, the JavaScript code throws
an exception and the event timeline URL gets no response. The following
is an example of the broken output:
```javascript
{
  'className': 'executor removed',
  'group': 'executors',
  'start': new Date(1602925908654),
  'content': '<div class="executor-event-content"' +
    'data-toggle="tooltip" data-placement="top"' +
    'data-title="Executor 2<br>' +
    'Removed at 2020/10/17 17:11:48' +
    '<br>Reason: Container from a bad node: ...   20/10/17 16:00:42 WARN ShutdownHookManager: ShutdownHook **'$anon$2'** timeout..."' +
    'data-html="true">Executor 2 removed</div>'
}
```
So we need to consider this special case: if the returned job info
contains the character **'**, just remove it.
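A minimal sketch of the sanitization this describes (the helper name is illustrative, not the actual method in `AllJobsPage.scala` / `JobPage.scala`):
```scala
// Strip single quotes from the failure reason so it cannot terminate the
// single-quoted JavaScript string literal it is embedded in.
def sanitizeForTimeline(failureReason: String): String =
  failureReason.replace("'", "")

// Example: this raw reason would otherwise break out of the JS string early.
val raw = "ShutdownHook '$anon$2' timeout"
val safe = sanitizeForTimeline(raw) // "ShutdownHook $anon$2 timeout"
```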
### Why are the changes needed?
Ensure that the UI page can function normally
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This PR only fixes an exception in a special case; the manual test result
is as follows:
![fixed](https://user-images.githubusercontent.com/52202080/96711638-74490580-13d0-11eb-93e0-b44d9ed5da5c.gif)
Closes #30119 from akiyamaneko/timeline_view_cannot_open.
Authored-by: neko <echohlne@gmail.com> Signed-off-by: Gengliang Wang
<gengliang.wang@databricks.com>
(commit: 11bbb13)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala (diff)
Commit 02fa19f102122f06e4358cf86c5e903fda28b289 by dhyun
[SPARK-33230][SQL] Hadoop committers to get unique job ID in
"spark.sql.sources.writeJobUUID"
### What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.write.jobUUID` to set
a unique job ID in the job conf so that Hadoop MR committers have a
unique ID which is (a) consistent across tasks and workers and (b) not
brittle compared to generated-timestamp job IDs. The latter matches what
`JobID` requires, but as they are generated per-thread, they may not
always be unique within a cluster.
### Why are the changes needed?
If a committer (e.g. the S3A staging committer) uses the job attempt ID
as a unique ID, then any two jobs started within the same second have the
same ID and so can clash.
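A hedged sketch of how a custom committer might consume this (the committer class and its use of the UUID are hypothetical; only the property name, taken from this change's title, is from the source):
```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical committer that prefers Spark's write UUID over the MR job ID.
class UuidAwareCommitter(outputPath: Path, context: TaskAttemptContext)
    extends FileOutputCommitter(outputPath, context) {

  override def setupJob(jobContext: JobContext): Unit = {
    // Falls back to the Hadoop job ID if Spark did not set the UUID.
    val uniqueId = jobContext.getConfiguration
      .get("spark.sql.sources.writeJobUUID", jobContext.getJobID.toString)
    // e.g. use `uniqueId` to build a per-job temporary directory, log it, etc.
    super.setupJob(jobContext)
  }
}
```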
### Does this PR introduce _any_ user-facing change?
Good Q. It is "developer-facing" in the context of anyone writing a
committer. But it reinstates a property which was in Spark 1.x and "went
away"
### How was this patch tested?
Testing: no test here. You'd have to create a new committer which
extracted the value in both job and task(s) and verified consistency.
That is possible (with a task output whose records contained the UUID),
but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to regenerate
the problem downstream and so verify a fix in a test run... I'll just
look at the logs to see what temporary dir is being used in the cluster
FS and verify it's a UUID.
Closes #30141 from steveloughran/SPARK-33230-jobId.
Authored-by: Steve Loughran <stevel@cloudera.com> Signed-off-by:
Dongjoon Hyun <dhyun@apple.com>
(commit: 02fa19f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (diff)