Changes

Summary

  1. [SPARK-32691][BUILD] Update commons-crypto to v1.1.0 (commit: 83a8079) (details)
  2. [SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML (commit: 090962c) (details)
  3. [SPARK-33397][YARN][DOC] Fix generating md to html for (commit: 036c11b) (details)
  4. [SPARK-33405][BUILD] Upgrade commons-compress to 1.20 (commit: 35ac314) (details)
  5. [SPARK-33363] Add prompt information related to the current task when (commit: 4360c6f) (details)
  6. [SPARK-33223][SS][UI] Structured Streaming Web UI state information (commit: 4ac8133) (details)
  7. [SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0 (commit: c2caf25) (details)
  8. [SPARK-33369][SQL] DSV2: Skip schema inference in write if table (commit: a1f84d8) (details)
  9. [SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to (commit: 90f6f39) (details)
  10. [SPARK-33244][SQL] Unify the code paths for spark.table and (commit: ad02ced) (details)
  11. [SPARK-33391][SQL] element_at with CreateArray not respect one based (commit: e3a768d) (details)
  12. [SPARK-33339][PYTHON] Pyspark application will hang due to non Exception (commit: 27bb40b) (details)
  13. [SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache (commit: 4934da5) (details)
  14. [SPARK-33302][SQL] Push down filters through Expand (commit: 34f5e7c) (details)
  15. [SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive (commit: 3165ca7) (details)
  16. [SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns (commit: 122c899) (details)
  17. [SPARK-33337][SQL] Support subexpression elimination in branches of (commit: 6fa80ed) (details)
  18. [SPARK-33404][SQL] Fix incorrect results in `date_trunc` expression (commit: 4634694) (details)
  19. [SPARK-33390][SQL] Make Literal support char array (commit: 5197c5d) (details)
  20. [SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests (commit: 1e2eeda) (details)
Commit 83a80796aa5b1ecfbe0ebc4c385932d93fd578ea by dongjoon
[SPARK-32691][BUILD] Update commons-crypto to v1.1.0
### What changes were proposed in this pull request? Update the commons-crypto package to v1.1.0 to support the aarch64 platform.
- https://issues.apache.org/jira/browse/CRYPTO-139
### Why are the changes needed?
The commons-crypto-1.0.0 package available in the Maven repository doesn't support the aarch64 platform. `CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv)` takes a long time when NettyBlockRpcServer receives block data from a client; if it exceeds the default timeout of 120s, an IOException is raised and the client retries replicating the block data to other executors. Since the replication has in fact already completed, this makes the replication count incorrect. With this upgrade, the DistributedSuite tests pass.
### Does this PR introduce any user-facing change? No
### How was this patch tested? Pass the CIs.
Closes #30275 from huangtianhua/SPARK-32691.
Authored-by: huangtianhua <huangtianhua223@gmail.com> Signed-off-by:
Dongjoon Hyun <dongjoon@apache.org>
(commit: 83a8079)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 090962cd4279cfee5b73a928180c3c19fec647c6 by gurwls223
[SPARK-33251][PYTHON][DOCS] Migration to NumPy documentation style in ML
(pyspark.ml.*)
### What changes were proposed in this pull request?
This PR proposes migration of `pyspark.ml` to NumPy documentation style.
### Why are the changes needed?
To improve documentation style.
### Does this PR introduce _any_ user-facing change?
Yes, this changes both rendered HTML docs and console representation
(SPARK-33243).
### How was this patch tested?
`dev/lint-python` and manual inspection.
Closes #30285 from zero323/SPARK-33251.
Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 090962c)
The file was modified python/pyspark/ml/evaluation.py (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/linalg/__init__.py (diff)
The file was modified python/pyspark/ml/fpm.py (diff)
The file was modified python/pyspark/ml/base.py (diff)
The file was modified python/pyspark/ml/util.py (diff)
The file was modified python/pyspark/ml/tuning.py (diff)
The file was modified python/pyspark/ml/wrapper.py (diff)
The file was modified python/pyspark/ml/functions.py (diff)
The file was modified python/pyspark/ml/image.py (diff)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified python/pyspark/ml/feature.py (diff)
The file was modified python/pyspark/ml/pipeline.py (diff)
The file was modified python/pyspark/ml/recommendation.py (diff)
The file was modified python/pyspark/ml/clustering.py (diff)
The file was modified python/pyspark/ml/stat.py (diff)
The file was modified python/pyspark/ml/base.pyi (diff)
The file was modified python/pyspark/ml/param/__init__.py (diff)
Commit 036c11b0d4ee88ce88d7869bbbbba6a589754786 by yamamuro
[SPARK-33397][YARN][DOC] Fix generating md to html for
available-patterns-for-shs-custom-executor-log-url
### What changes were proposed in this pull request?
1. Replace `{{}}` with `&#123;&#123;&#125;&#125;`.
2. Use `<code></code>` in the td tag.
### Why are the changes needed?
To fix the broken rendering shown below:
![image](https://user-images.githubusercontent.com/8326978/98544155-8c74bc00-22ce-11eb-8889-8dacb726b762.png)
### Does this PR introduce _any_ user-facing change?
Yes, the online doc renders correctly with this change:
![image](https://user-images.githubusercontent.com/8326978/98545256-2e48d880-22d0-11eb-9dd9-b8cae3df8659.png)
### How was this patch tested?
Verified via `jekyll serve`, as shown in the screenshots above.
Closes #30298 from yaooqinn/SPARK-33397.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: 036c11b)
The file was modified docs/running-on-yarn.md (diff)
Commit 35ac314181129374b02f8f8c07341b43a734e1c7 by gurwls223
[SPARK-33405][BUILD] Upgrade commons-compress to 1.20
### What changes were proposed in this pull request?
This PR aims to upgrade `commons-compress` from 1.8 to 1.20.
### Why are the changes needed?
- https://commons.apache.org/proper/commons-compress/security-reports.html
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30304 from dongjoon-hyun/SPARK-33405.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 35ac314)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 4360c6f12ae8f192fb65ae1c6ad6ee05e0217c7d by gurwls223
[SPARK-33363] Add prompt information related to the current task when
pyspark/sparkR starts
### What changes were proposed in this pull request? Add prompt information about the current applicationId, current URL, and master info when pyspark/sparkR starts.
### Why are the changes needed? The information printed when pyspark/sparkR starts does not include basic information about the current application, which is inconvenient when using pyspark/sparkR from a command prompt.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Manual test; results shown below:
![pyspark new
print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)
Closes #30266 from akiyamaneko/pyspark-hint-info.
Authored-by: neko <echohlne@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 4360c6f)
The file was modified python/pyspark/shell.py (diff)
The file was modified R/pkg/inst/profile/shell.R (diff)
Commit 4ac8133866e7b97e04ab75cad0e0bf54565b0ba5 by kabhwan.opensource
[SPARK-33223][SS][UI] Structured Streaming Web UI state information
### What changes were proposed in this pull request? The Structured Streaming UI does not contain state information. This PR adds it.
### Why are the changes needed? Missing state information.
### Does this PR introduce _any_ user-facing change? Additional UI
elements appear.
### How was this patch tested? Existing unit tests + manual test.
<img width="1044" alt="Screenshot 2020-10-30 at 15 14 21"
src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png">
Closes #30151 from gaborgsomogyi/SPARK-33223.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 4ac8133)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/ui/UISeleniumSuite.scala (diff)
Commit c2caf2522b2e65a93a797580f08ac36461000969 by dhyun
[SPARK-33213][BUILD] Upgrade Apache Arrow to 2.0.0
### What changes were proposed in this pull request?
This upgrades the Apache Arrow version from 1.0.1 to 2.0.0.
### Why are the changes needed?
Apache Arrow 2.0.0 was released with some improvements on the Java side, so it's better to upgrade Spark to the new version. Note that the format version in Arrow 2.0.0 is still 1.0.0, so the API should remain compatible between 1.0.1 and 2.0.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #30306 from sunchao/SPARK-33213.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: c2caf25)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit a1f84d8714cd1bd6cc6e2da6eb97fb9f58f3ee8f by wenchen
[SPARK-33369][SQL] DSV2: Skip schema inference in write if table
provider supports external metadata
### What changes were proposed in this pull request?
When `TableProvider.supportsExternalMetadata()` is true, Spark will use the input DataFrame's schema in `DataFrameWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.
### Why are the changes needed?
For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataFrameWriter.save()`/`DataStreamWriter.start()`. This inference can be expensive, yet there was no trait or flag indicating that a V2 source can use the input DataFrame's schema on `DataFrameWriter.save()`/`DataStreamWriter.start()`. We resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`.
### Does this PR introduce _any_ user-facing change?
Yes, a new behavior for the data source v2 API
`TableProvider.supportsExternalMetadata()` when it returns true.
### How was this patch tested?
Unit test
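As a rough illustration (not part of this PR; the class name and schema below are made up), a v2 source that opts in via `supportsExternalMetadata()` might look like this:
```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ExampleExternalMetadataSource extends TableProvider {
  // Opting in: Spark may now skip schema/partitioning inference on
  // DataFrameWriter.save()/DataStreamWriter.start() and pass the input
  // DataFrame's schema to getTable() instead.
  override def supportsExternalMetadata(): Boolean = true

  // Still used when no schema is supplied, e.g. on read without a user schema.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("value", "string")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    // Build a Table from the supplied schema; omitted in this sketch.
    ???
  }
}
```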
Closes #30273 from gengliangwang/supportsExternalMetadata.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: a1f84d8)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableProvider.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/core/src/test/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/sources/StreamingDataSourceV2Suite.scala (diff)
Commit 90f6f39e429e0db00e234bdcf679a70dfce3272e by wenchen
[SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to
resolve the identifier
### What changes were proposed in this pull request?
This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to
resolve the table identifier. This allows consistent resolution rules
(temp view first, etc.) to be applied for both v1/v2 commands. More info
about the consistent resolution rule proposal can be found in
[JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal
doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `LOAD DATA` is not supported for v2 tables.
### Why are the changes needed?
The changes allow consistent resolution behavior when resolving the
table identifier. For example, the following is the current behavior:
```scala sql("CREATE TEMPORARY VIEW t AS SELECT 1") sql("CREATE DATABASE
db") sql("CREATE TABLE t (key INT, value STRING) USING hive") sql("USE
db") sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt'
INTO TABLE t") // Succeeds
``` With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
   at scala.Option.foreach(Option.scala:407)
```
This is expected, since the temporary view is resolved first and `LOAD DATA` doesn't support a temporary view.
### Does this PR introduce _any_ user-facing change?
After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead
of table `db.t` in the above scenario.
### How was this patch tested?
Updated existing tests.
Closes #30270 from imback82/load_data_cmd.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 90f6f39)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
Commit ad02ceda29c60f9c6e0430caff0d174558c0c661 by wenchen
[SPARK-33244][SQL] Unify the code paths for spark.table and
spark.read.table
### What changes were proposed in this pull request?
- Call `spark.read.table` in `spark.table`.
- Add comments for `spark.table` to emphasize it also support streaming
temp view reading.
### Why are the changes needed? The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken in SPARK-32592, since we need to respect options in the `spark.read.table` API.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested? Existing UT.
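For reference, a small sketch of the two entry points that now share one resolution path (the table name and the reader option are placeholders, not part of this PR):
```scala
// Both calls resolve the table through the same code path;
// the reader variant can additionally carry per-read options.
val viaSession = spark.table("some_table")
val viaReader  = spark.read.option("mergeSchema", "true").table("some_table")
```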
Closes #30148 from xuanyuanking/SPARK-33244.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: ad02ced)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
Commit e3a768dd79558b04f6ae71380876bcde2354008c by wenchen
[SPARK-33391][SQL] element_at with CreateArray not respect one based
index
### What changes were proposed in this pull request?
`element_at` with `CreateArray` does not respect the one-based index.
Repro steps:
```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
```
The correct nullability should be: index 0 -> true (out of bounds, so the default is returned), index 1 -> false, index 2 -> false, index 3 -> false.
Expression evaluation respects the one-based index, but the nullability check computes it with a zero-based index via `computeNullabilityFromArray`.
### Why are the changes needed?
Correctness issue.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT and existing UT.
Closes #30296 from leanken/leanken-SPARK-33391.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: e3a768d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
Commit 27bb40b6297361985e3590687f0332a72b71bc85 by gurwls223
[SPARK-33339][PYTHON] Pyspark application will hang due to non Exception
error
### What changes were proposed in this pull request?
When a SystemExit exception occurs during processing, the Python worker exits abnormally, and the executor task then keeps waiting to read from the socket, causing it to hang. The SystemExit exception may be caused by the user's code, but Spark should at least throw an error to remind the user instead of getting stuck. We can run a simple test to reproduce this case:
```python
from pyspark.sql import SparkSession

def err(line):
    raise SystemExit

spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.parallelize(range(1, 2), 2).map(err).collect()
spark.stop()
```
### Why are the changes needed?
To make sure a PySpark application won't hang if there's a non-Exception error in the Python worker.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test and also manually tested the case above.
Closes #30248 from li36909/pyspark.
Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local> Co-authored-by:
Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 27bb40b)
The file was modified python/pyspark/worker.py (diff)
The file was modified python/pyspark/tests/test_worker.py (diff)
Commit 4934da56bcc13fc61afc8e8cc44fb5290b5e7b32 by wenchen
[SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache
### What changes were proposed in this pull request?
This changes `DropTableExec` to also invalidate caches referencing the
table to be dropped, in a cascading manner.
### Why are the changes needed?
In DSv1, the `DROP TABLE` command also invalidates caches, as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). However, in DSv2 the same command only drops the table and doesn't handle the caches, which could lead to correctness issues.
### Does this PR introduce _any_ user-facing change?
Yes. Now DSv2 `DROP TABLE` command also invalidates cache.
### How was this patch tested?
Added a new UT
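A hedged sketch of the new behavior (the catalog, namespace, and provider names below are made up and assume a v2 catalog registered as `testcat`):
```scala
spark.sql("CREATE TABLE testcat.ns.t (id INT, data STRING) USING foo")
spark.sql("CACHE TABLE testcat.ns.t")
// Previously the DSv2 DROP TABLE left the cached plan behind; now caches
// referencing testcat.ns.t, and caches depending on them, are invalidated.
spark.sql("DROP TABLE testcat.ns.t")
```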
Closes #30211 from sunchao/SPARK-33305.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 4934da5)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropTableExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala (diff)
Commit 34f5e7ce77647d3b5eb11700566e0bbce73960e2 by wenchen
[SPARK-33302][SQL] Push down filters through Expand
### What changes were proposed in this pull request? Push down filters through `Expand`. For the case below:
```
create table t1(pid int, uid int, sid int, dt date, suid int) using parquet;
create table t2(pid int, vs int, uid int, csid int) using parquet;

SELECT
      years,
      appversion,
      SUM(uusers) AS users FROM   (SELECT
              Date_trunc('year', dt)          AS years,
              CASE
                WHEN h.pid = 3 THEN 'iOS'
                WHEN h.pid = 4 THEN 'Android'
                ELSE 'Other'
              END                             AS viewport,
              h.vs                            AS appversion,
              Count(DISTINCT u.uid)           AS uusers
              ,Count(DISTINCT u.suid)         AS srcusers
       FROM   t1 u
              join t2 h
                ON h.uid = u.uid
       GROUP  BY 1,
                 2,
                 3) AS a WHERE  viewport = 'iOS' GROUP  BY 1,
         2
```
Plan before this PR:
```
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32],
functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true,
[id=#251]
  +- *(4) HashAggregate(keys=[years#30, appversion#32],
functions=[partial_sum(uusers#33L)])
     +- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4)
THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44
= 1)) u.`uid`#47 else null)])
        +- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4)
THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
           +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS
TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4)
THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if
((gid#44 = 1)) u.`uid`#47 else null)])
              +- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt`
AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4)
THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48,
gid#44], functions=[])
                 +- Exchange hashpartitioning(date_trunc('year',
CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN
(h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47,
u.`suid`#48, gid#44, 200), true, [id=#241]
                    +- *(2) HashAggregate(keys=[date_trunc('year',
CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN
(h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47,
u.`suid`#48, gid#44], functions=[])
                       +- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN
'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
                          +- *(2) Expand [ArrayBuffer(date_trunc(year,
cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN
iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null,
1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp),
Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN
Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year',
CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN
(h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47,
u.`suid`#48, gid#44]
                             +- *(2) Project [uid#7, dt#9, suid#10,
pid#11, vs#12]
                                +- *(2) BroadcastHashJoin [uid#7],
[uid#13], Inner, BuildRight
                                   :- *(2) Project [uid#7, dt#9,
suid#10]
                                   :  +- *(2) Filter isnotnull(uid#7)
                                   :     +- *(2) ColumnarToRow
                                   :        +- FileScan parquet
default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters:
[isnotnull(uid#7)], Format: Parquet, Location:
InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1],
PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema:
struct<uid:int,dt:date,suid:int>
                                   +- BroadcastExchange
HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))),
[id=#233]
                                      +- *(1) Project [pid#11, vs#12,
uid#13]
                                         +- *(1) Filter
isnotnull(uid#13)
                                            +- *(1) ColumnarToRow
                                               +- FileScan parquet
default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters:
[isnotnull(uid#13)], Format: Parquet, Location:
InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2],
PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema:
struct<pid:int,vs:int,uid:int>
```
Plan after this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[years#0, appversion#2],
functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L])
  +- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71]
     +- HashAggregate(keys=[years#0, appversion#2],
functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2,
sum#22L])
        +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp),
Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN
(pid#11 = 4) THEN Android ELSE Other END#24, vs#12],
functions=[count(distinct uid#7)], output=[years#0, appversion#2,
uusers#3L])
           +- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as
timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN
iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true,
[id=#67]
              +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as
timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN
iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12],
functions=[partial_count(distinct uid#7)], output=[date_trunc(year,
cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN
(pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24,
vs#12, count#27L])
                 +- HashAggregate(keys=[date_trunc(year, cast(dt#9 as
timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN
iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7],
functions=[], output=[date_trunc(year, cast(dt#9 as timestamp),
Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN
(pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                    +- Exchange hashpartitioning(date_trunc(year,
cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN
(pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24,
vs#12, uid#7, 5), true, [id=#63]
                       +- HashAggregate(keys=[date_trunc(year, cast(dt#9
as timestamp), Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9
as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3)
THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN
(pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24,
vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as
timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN
iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
                          +- Project [uid#7, dt#9, pid#11, vs#12]
                             +- BroadcastHashJoin [uid#7], [uid#13],
Inner, BuildRight, false
                                :- Filter isnotnull(uid#7)
                                :  +- FileScan parquet
default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)],
Format: Parquet, Location:
InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87...,
PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema:
struct<uid:int,dt:date>
                                +- BroadcastExchange
HashedRelationBroadcastMode(List(cast(input[2, int, false] as
bigint)),false), [id=#58]
                                   +- Filter ((CASE WHEN (pid#11 = 3)
THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND
isnotnull(uid#13))
                                      +- FileScan parquet
default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN
(pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END =
iOS), isnotnull..., Format: Parquet, Location:
InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87...,
PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema:
struct<pid:int,vs:int,uid:int>
```
### Why are the changes needed? Improves performance by filtering out more data.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Added UT
Closes #30278 from AngersZhuuuu/SPARK-33302.
Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 34f5e7c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LeftSemiAntiJoinPushDownSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala (diff)
Commit 3165ca742a7508dca35a1e40b303c337939df86f by wenchen
[SPARK-33376][SQL] Remove the option of "sharesHadoopClasses" in Hive
IsolatedClientLoader
### What changes were proposed in this pull request?
This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader`
in Hive module.
### Why are the changes needed?
Currently, when initializing `IsolatedClientLoader`, users can set the
`sharesHadoopClasses` flag to decide whether the `HiveClient` created
should share Hadoop classes with Spark itself or not. In the latter
case, the client will only load Hadoop classes from the Hive
dependencies.
There are two reasons to remove this: 1. this feature is currently used
in two cases: 1) unit tests, 2) when the Hadoop version defined in Maven
can not be found when `spark.sql.hive.metastore.jars` is equal to
"maven", which could be very rare. 2. when `sharesHadoopClasses` is
false, Spark doesn't really only use Hadoop classes from Hive jars: we
also download `hadoop-client` jar and put all the sub-module jars (e.g.,
`hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the
Hadoop version used by `hadoop-client` is the same version used by Spark
itself. As a result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the `IsolatedClientLoader` are still using Hadoop 2.x or even lower.
### Does this PR introduce _any_ user-facing change?
This affects Spark users in one scenario: when
`spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in the pom file cannot be downloaded. Currently the behavior is to
switch to _not_ share Hadoop classes, but with the PR it will share
Hadoop classes with Spark.
### How was this patch tested?
Existing UTs.
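For context, a minimal sketch of the configuration this change touches (the version values are only examples):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-maven-jars")
  .enableHiveSupport()
  // Hive metastore client version and where its jars are loaded from.
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
// With this change, even if the Hadoop artifacts referenced by the Hive client
// cannot be resolved, the isolated client shares Hadoop classes with Spark
// instead of falling back to a separate set of Hadoop jars.
```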
Closes #30284 from sunchao/SPARK-33376.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 3165ca7)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HivePartitionFilteringSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HadoopVersionInfoSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveClientBuilder.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/client/HiveVersionSuite.scala (diff)
Commit 122c8999cbf2a1f9484ae973864a843cfa32b6c6 by huaxing
[SPARK-33251][FOLLOWUP][PYTHON][DOCS][MINOR] Adjusts returns
PrefixSpan.findFrequentSequentialPatterns
### What changes were proposed in this pull request?
Changes
    pyspark.sql.dataframe.DataFrame
to
    :py:class:`pyspark.sql.DataFrame`
### Why are the changes needed?
Consistency (see
https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).
### Does this PR introduce _any_ user-facing change?
User will see shorter reference with a link.
### How was this patch tested?
`dev/lint-python` and manual check of the rendered docs.
Closes #30313 from zero323/SPARK-33251-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: 122c899)
The file was modified python/pyspark/ml/fpm.py (diff)
Commit 6fa80ed1dd43c2ecd092c10933330b501641c51b by viirya
[SPARK-33337][SQL] Support subexpression elimination in branches of
conditional expressions
### What changes were proposed in this pull request?
Currently we skip subexpression elimination in branches of conditional
expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can
do subexpression elimination for such branches if the subexpression is
common across all branches. This patch proposes to support subexpression
elimination in branches of conditional expressions.
### Why are the changes needed?
We may miss subexpression elimination chances in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from the query optimizer. For example, project collapsing could embed expressions between two `Project`s and produce a conditional expression like:
```
CASE WHEN jsonToStruct(json).a = '1' THEN 1.0
     WHEN jsonToStruct(json).a = '2' THEN 2.0
     ...
     ELSE 1.2 END
```
If `jsonToStruct(json)` is a time-expensive expression, we currently don't eliminate the duplication and waste time running it repeatedly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
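A short sketch of the kind of query that benefits (the DataFrame `df`, its `json` column, and the schema string are assumptions for illustration):
```scala
// The repeated from_json(json, ...) call can now be recognized as a common
// subexpression across the branches instead of being evaluated separately
// in each one.
val scored = df.selectExpr(
  """CASE WHEN from_json(json, 'a STRING').a = '1' THEN 1.0
          WHEN from_json(json, 'a STRING').a = '2' THEN 2.0
          ELSE 1.2 END AS score""")
```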
Closes #30245 from viirya/SPARK-33337.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi
Hsieh <viirya@gmail.com>
(commit: 6fa80ed)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
Commit 46346943bb6c312dc87ac3fcdfd1dbeac68c53b5 by yamamuro
[SPARK-33404][SQL] Fix incorrect results in `date_trunc` expression
### What changes were proposed in this pull request? The following query
produces incorrect results:
```
SELECT date_trunc('minute', '1769-10-17 17:10:02')
```
Spark currently incorrectly returns
```
1769-10-17 17:10:02
```
against the expected return value of
```
1769-10-17 17:10:00
```
**Steps to repro**: Run the following commands in spark-shell:
```
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
```
This happens as `truncTimestamp` in package
`org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes
that time zone offsets can never have the granularity of a second and
thus does not account for time zone adjustment when truncating the given
timestamp to `minute`. This assumption is currently used when truncating
the timestamps to `microsecond, millisecond, second, or minute`.
This PR fixes this issue and always uses time zone knowledge when
truncating timestamps regardless of the truncation unit.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Added new tests to `DateTimeUtilsSuite` which previously failed and now pass.
Closes #30303 from utkarsh39/trunc-timestamp-fix.
Authored-by: Utkarsh <utkarsh.agarwal@databricks.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: 4634694)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
Commit 5197c5d2e7648d75def3e159e0d2aa3e20117105 by gurwls223
[SPARK-33390][SQL] Make Literal support char array
### What changes were proposed in this pull request?
Make Literal support char array.
### Why are the changes needed?
We always use `Literal()` to create foldable values, and `char[]` is a common data type. This makes it easy to create a string `Literal` from a `char[]`.
### Does this PR introduce _any_ user-facing change?
Yes, user can call `Literal()` with `char[]`.
### How was this patch tested?
Add test.
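A minimal sketch of the new capability, assuming this change is in place:
```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.StringType

// A char[] is now accepted and converted into a string Literal.
val lit = Literal(Array('a', 'b', 'c'))
assert(lit.dataType == StringType) // the value is the UTF8String "abc"
```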
Closes #30295 from ulysses-you/SPARK-33390.
Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 5197c5d)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/CatalystTypeConvertersSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala (diff)
Commit 1e2eeda20e062a77dfd8f944abeaeeb609817ae3 by wenchen
[SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests
### What changes were proposed in this pull request? In the PR, I
propose to gather common `SHOW TABLES` tests into one trait
`org.apache.spark.sql.execution.command.ShowTablesSuite`, and put
datasource-specific tests into `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Tests for parsing `SHOW TABLES` are also extracted
to `ShowTablesParserSuite`.
### Why are the changes needed?
- The unification allows running common `SHOW TABLES` tests for both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2
implementations.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`
Closes #30287 from MaxGekk/unify-dsv1_v2-tests.
Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Maxim
Gekk <max.gekk@gmail.com> Co-authored-by: Wenchen Fan
<wenchen@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 1e2eeda)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/v2/ShowTablesSuite.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesParserSuite.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuite.scala