Changes

Summary

  1. [SPARK-32729][SQL][DOCS] Add missing since version for math functions (commit: 0626901) (details)
  2. [SPARK-32704][SQL] Logging plan changes for execution (commit: 0cb91b8) (details)
  3. [SPARK-32639][SQL] Support GroupType parquet mapkey field (commit: 58f87b3) (details)
  4. [SPARK-32092][ML][PYSPARK][FOLLOWUP] Fixed CrossValidatorModel.copy() to copy models instead of list (commit: a0bd273) (details)
  5. [SPARK-32718][SQL] Remove unnecessary keywords for interval units (commit: ccc0250) (details)
  6. [SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ (commit: cfe012a) (details)
Commit 0626901bcbeebceb6937001e1f32934c71876210 by gurwls223
[SPARK-32729][SQL][DOCS] Add missing since version for math functions
### What changes were proposed in this pull request?
Add missing since version for math functions, including:
- SPARK-8223 shiftright/shiftleft
- SPARK-8215 pi
- SPARK-8212 e
- SPARK-6829 sin/asin/sinh/cos/acos/cosh/tan/atan/tanh/ceil/floor/rint/cbrt/signum/isignum/Fsignum/Lsignum/degrees/radians/log/log10/log1p/exp/expm1/pow/hypot/atan2
- SPARK-8209 conv
- SPARK-8213 factorial
- SPARK-20751 cot
- SPARK-2813 sqrt
- SPARK-8227 unhex
- SPARK-8218 log(a,b)
- SPARK-8207 bin
- SPARK-8214 hex
- SPARK-8206 round
- SPARK-14614 bround
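Once annotated, the version is visible in the generated SQL docs and in `DESCRIBE FUNCTION EXTENDED`; a minimal spark-shell check (the exact "Since:" output layout is an assumption):
```
// spark-shell sketch: the added version shows up in the extended docs.
// (The "Since:" line shape shown below is an assumption.)
spark.sql("DESCRIBE FUNCTION EXTENDED atan2").show(truncate = false)
// among the printed rows, the extended usage ends with a line like:
//   Since: 1.4.0
```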
### Why are the changes needed?
Fix the SQL docs.
### Does this PR introduce _any_ user-facing change?
Yes, the docs are updated.
### How was this patch tested?
Passing doc generation.
Closes #29571 from yaooqinn/minor.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 0626901)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala (diff)
Commit 0cb91b8c184fddd7c489fcbdbb8e75078ec34e54 by wenchen
[SPARK-32704][SQL] Logging plan changes for execution
### What changes were proposed in this pull request?
Plan changes are currently logged only for the analyzer and optimizer, so this PR adds code to log plan changes in the preparation phase of `QueryExecution` as well.
```
scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
...
20/08/26 09:32:36 WARN PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])               *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])    +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                        +- *(1) Range (0, 10, step=1, splits=4)

20/08/26 09:32:36 WARN PlanChangeLogger:
=== Result of Batch Preparations ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])               *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])    +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                        +- *(1) Range (0, 10, step=1, splits=4)
```
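Since the preparation phase reuses the same `PlanChangeLogger`, the output can be narrowed further; a sketch, assuming the existing `spark.sql.optimizer.planChangeLog.rules` filter also applies to preparation rules:
```
// Log only one preparation rule instead of the whole batch (assumption:
// the rule-name filter config covers preparation rules as well).
spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
spark.sql("SET spark.sql.optimizer.planChangeLog.rules=" +
  "org.apache.spark.sql.execution.CollapseCodegenStages")
spark.range(10).groupBy("id").count().queryExecution.executedPlan
```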
### Why are the changes needed?
Easier debugging of executed plans.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes #29544 from maropu/PlanLoggingInPreparations.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 0cb91b8)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerLoggingSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
Commit 58f87b31789e64eeb5364cc2b8b86f1d29816f72 by wenchen
[SPARK-32639][SQL] Support GroupType parquet mapkey field
### What changes were proposed in this pull request?
Remove the assertion in ParquetSchemaConverter that the Parquet mapKey field must be a PrimitiveType.
### Why are the changes needed?
There is a Parquet file in the attachment of
[SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and
the MessageType recorded in the file is:
```
message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}
```
Reading the file with `spark.read.parquet("000.snappy.parquet")` makes Spark throw an exception while converting the Parquet MessageType to a Spark SQL StructType:
> AssertionError(Map key type is expected to be a primitive type, but found...)

Reading it with `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` instead returns the correct result.
According to the Parquet format specification
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps),
a map key in the Parquet format does not need to be a primitive type.
Note: this Parquet file was not written by Spark, as Spark stores an
additional sparkSchema string in the Parquet file metadata. When reading
such a file, Spark uses the embedded sparkSchema directly instead of
converting the Parquet MessageType to a Spark SQL StructType.
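With the assertion removed, the attached file should load without a user-supplied schema; a minimal sketch (the file path is the JIRA attachment name, and the printed schema is an assumption based on the MessageType above):
```
// Reading the attachment no longer trips the map-key assertion.
val df = spark.read.parquet("000.snappy.parquet")
df.printSchema()
// root
//  |-- value: map (nullable = true)
//  |    |-- key: struct (first, middle, last as strings)
//  |    |-- value: string (valueContainsNull = true)
```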
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a unit test case
Closes #29451 from izchen/SPARK-32639.
Authored-by: Chen Zhang <izchen@126.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 58f87b3)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala (diff)
Commit a0bd273bb04d9a5684e291ec44617972dcd4accd by huaxing
[SPARK-32092][ML][PYSPARK][FOLLOWUP] Fixed CrossValidatorModel.copy() to
copy models instead of list
### What changes were proposed in this pull request?
Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()`
on the models instead of lists of models.
### Why are the changes needed?
`copy()` was first changed in #29445. The issue was found in the CI of
#29524 and fixed. This PR introduces the exact same change so that
`CrossValidatorModel.copy()` and its related tests are aligned between
branch `master` and branch `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated `test_copy` to make sure `copy()` is called on models instead of
lists of models.
Closes #29553 from Louiszr/fix-cv-copy.
Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: a0bd273)
The file was modified python/pyspark/ml/tests/test_tuning.py (diff)
The file was modified python/pyspark/ml/tuning.py (diff)
Commit ccc0250a08fa9519a6dc3c1153e51c1e110f1d7d by dongjoon
[SPARK-32718][SQL] Remove unnecessary keywords for interval units
### What changes were proposed in this pull request?
Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not
useful in the parser: since we need to support plural forms like YEARS,
the parser has to accept general identifiers as interval units anyway.
### Why are the changes needed?
These keywords are reserved in ANSI SQL. If Spark keeps them as keywords,
they become reserved under ANSI mode, which leaves Spark unable to run
the TPCDS queries, as they use YEAR as an alias name.
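With the keywords gone, interval units are plain identifiers, so both singular and plural forms parse and YEAR stays usable as an alias under ANSI mode; a quick spark-shell sketch:
```
// Interval units now parse as general identifiers, singular or plural:
spark.sql("SELECT INTERVAL 2 YEARS 3 MONTHS AS i").show(truncate = false)
// YEAR is no longer a reserved keyword under ANSI mode:
spark.sql("SET spark.sql.ansi.enabled=true")
spark.sql("SELECT count(*) AS year FROM range(5)").show()
```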
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added `TPCDSQueryANSISuite` to make sure Spark can run the TPCDS queries
with ANSI mode enabled.
Closes #29560 from cloud-fan/keyword.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by:
Dongjoon Hyun <dongjoon@apache.org>
(commit: ccc0250)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modified docs/sql-ref-ansi-compliance.md (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/interval.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala (diff)
Commit cfe012a4311a8fb1fc3a82390c3e68f6afcb1da6 by yamamuro
[SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ
### What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/29342 that
does two things:
* Per https://github.com/apache/spark/pull/29342#discussion_r470153323,
switch from Java's `HashSet` to Spark's in-house `OpenHashSet` to track
matched rows for non-unique join keys. I checked the `OpenHashSet`
implementation, which is built from a key index (`OpenHashSet._bitset`
as a `BitSet`) and a key array (`OpenHashSet._data` as an `Array`). A
Java `HashSet` is built on `HashMap`, which stores values in `Node`
linked lists and in theory should take more memory than `OpenHashSet`.
Reran the same benchmark query used in
https://github.com/apache/spark/pull/29342 and verified that performance
is similar between `HashSet` and `OpenHashSet`.
* Track metrics of the extra `BitSet`/`OpenHashSet` data structure for
full outer SHJ. This depends on the change above, because there seems to
be no easy way to get the memory size of a Java `HashSet`.
### Why are the changes needed?
To surface the memory usage of full outer SHJ more accurately. This can
help users and developers debug and improve full outer SHJ.
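A rough way to exercise the new metric, assuming `spark.sql.join.preferSortMergeJoin=false` steers the planner toward shuffled hash join (whether SHJ is actually chosen still depends on size estimates):
```
// Sketch: run a full outer shuffled hash join, then inspect the
// ShuffledHashJoin node's metrics in the executed plan / SQL UI.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
val left  = spark.range(100).selectExpr("id % 10 AS k", "id AS v1")
val right = spark.range(80).selectExpr("id % 10 AS k", "id AS v2")
val joined = left.join(right, Seq("k"), "full_outer")
joined.collect()
println(joined.queryExecution.executedPlan) // metrics show in the SQL tab
```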
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test in `SQLMetricsSuite.scala`.
Closes #29566 from c21/add-metrics.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro
<yamamuro@apache.org>
(commit: cfe012a)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsTestUtils.scala (diff)