Changes

Summary

  1. Revert "[SPARK-34581][SQL] Don't optimize out grouping expressions from (commit: fdccd88) (details)
  2. [SPARK-35078][SQL] Add tree traversal pruning in expression rules (commit: 9af338c) (details)
  3. [SPARK-35201][SQL] Format empty grouping set exception in CUBE/ROLLUP (commit: e503b9c) (details)
  4. [SPARK-35204][SQL] CatalystTypeConverters of date/timestamp should (commit: a9345a0) (details)
  5. [SPARK-34297][SQL][SS] Add metrics for data loss and offset out range (commit: b2a2b5d) (details)
  6. [SPARK-35210][BUILD] Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET (commit: 44c1387) (details)
  7. [SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite (commit: 166cc62) (details)
  8. [SPARK-35200][CORE] Avoid to recompute the pending speculative tasks in (commit: bcac733) (details)
  9. [SPARK-35024][ML] Refactor LinearSVC - support virtual centering (commit: 1f150b9) (details)
  10. [SPARK-33913][SS] Upgrade Kafka to 2.8.0 (commit: b108e7f) (details)
Commit fdccd88c2a6dd18c9d446b63fccd5c6188ea125c by wenchen
Revert "[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function"

This reverts commit c8d78a70b43f8373ff3f2c170d6aa3eaece03b7c.
(commit: fdccd88)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantAggregatesSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (diff)
The file was modifiedsql/core/src/test/resources/sql-tests/results/group-by.sql.out (diff)
The file was addedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UpdateAttributeNullability.scala
The file was modifiedsql/core/src/test/resources/sql-tests/inputs/group-by.sql (diff)
The file was removedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UpdateNullability.scala
The file was removedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EnforceGroupingReferencesInAggregates.scala
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala (diff)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AliasHelper.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
Commit 9af338cd685bce26abbc2dd4d077bde5068157b1 by ltnwgl
[SPARK-35078][SQL] Add tree traversal pruning in expression rules

### What changes were proposed in this pull request?

Added the following TreePattern enums:
- AND_OR
- BINARY_ARITHMETIC
- BINARY_COMPARISON
- CASE_WHEN
- CAST
- CONCAT
- COUNT
- IF
- LIKE_FAMLIY
- NOT
- NULL_CHECK
- UNARY_POSITIVE
- UPPER_OR_LOWER

Used them in the following rules:
- ConstantPropagation
- ReorderAssociativeOperator
- BooleanSimplification
- SimplifyBinaryComparison
- SimplifyCaseConversionExpressions
- SimplifyConditionals
- PushFoldableIntoBranches
- LikeSimplification
- NullPropagation
- SimplifyCasts
- RemoveDispensableExpressions
- CombineConcats

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32280 from sigmod/expression.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: 9af338c)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Count.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala (diff)
Commit e503b9c462676c381ba51ad48edd968d6e184fc2 by yamamuro
[SPARK-35201][SQL] Format empty grouping set exception in CUBE/ROLLUP

### What changes were proposed in this pull request?
Format empty grouping set exception in CUBE/ROLLUP

### Why are the changes needed?
Format empty grouping set exception in CUBE/ROLLUP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not need

Closes #32307 from AngersZhuuuu/SPARK-35201.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: e503b9c)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
Commit a9345a05911b8620387b36023ddc6525ed1a508c by max.gekk
[SPARK-35204][SQL] CatalystTypeConverters of date/timestamp should accept both the old and new Java time classes

### What changes were proposed in this pull request?

`CatalystTypeConverters` is useful when the type of the input data classes are not known statically (otherwise we can use `ExpressionEncoder`). However, the current `CatalystTypeConverters` requires you to know the datetime data class statically, which makes it hard to use.

This PR improves the `CatalystTypeConverters` for date/timestamp, to support the old and new Java time classes at the same time.

### Why are the changes needed?

Make `CatalystTypeConverters` easier to use.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

new test

Closes #32312 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: a9345a0)
The file was modifiedsql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/CatalystTypeConvertersSuite.scala (diff)
The file was modifiedsql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala (diff)
Commit b2a2b5d8206b7c09b180b8b6363f73c6c3fdb1d8 by viirya
[SPARK-34297][SQL][SS] Add metrics for data loss and offset out range for KafkaMicroBatchStream

### What changes were proposed in this pull request?

This patch proposes to add a couple of metrics in scan node for Kafka batch streaming query.

### Why are the changes needed?

When testing SS, I found it is hard to track data loss of SS reading from Kafka. The micro batch scan node has only one metric, number of output rows. Users have no idea how many offsets to fetch are out of Kafka, how many times data loss happens. These metrics are important for users to know the quality of SS query running.

### Does this PR introduce _any_ user-facing change?

Yes, adding two metrics to micro batch scan node for Kafka batch streaming.

### How was this patch tested?

Currently I tested on internal cluster with Kafka:

<img width="1193" alt="Screen Shot 2021-04-22 at 7 16 29 PM" src="https://user-images.githubusercontent.com/68855/115808460-61bf8100-a39f-11eb-99a9-65d22c3f5fb0.png">

I was trying to add unit test. But as our batch streaming query disallows to specify ending offsets. If I only specify an out-of-range starting offset, when we get offset range in `getRanges`,  any negative size range will be filtered out. So it cannot actually test the case of fetched non-existing offset.

Closes #31398 from viirya/micro-batch-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: b2a2b5d)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala (diff)
The file was modifiedsql/core/src/test/scala/org/apache/spark/sql/execution/metric/CustomMetricsSuite.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
The file was modifiedsql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaBatchPartitionReader.scala (diff)
The file was modifiedexternal/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/consumer/KafkaDataConsumer.scala (diff)
Commit 44c13871d1f36a3fcf99bcd65aa17aa77e833eb6 by sarutak
[SPARK-35210][BUILD] Upgrade Jetty to 9.4.40 to fix ERR_CONNECTION_RESET issue

### What changes were proposed in this pull request?

This PR proposes to upgrade Jetty to 9.4.40.

### Why are the changes needed?

SPARK-34988 (#32091) upgraded Jetty to 9.4.39 for CVE-2021-28165.
But after the upgrade, Jetty 9.4.40 was released to fix the ERR_CONNECTION_RESET issue (https://github.com/eclipse/jetty.project/issues/6152).
This issue seems to affect Jetty 9.4.39 when POST method is used with SSL.
For Spark, job submission using REST and ThriftServer with HTTPS protocol can be affected.

### Does this PR introduce _any_ user-facing change?

No. No released version uses Jetty 9.3.39.

### How was this patch tested?

CI.

Closes #32318 from sarutak/upgrade-jetty-9.4.40.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 44c1387)
The file was modifiedpom.xml (diff)
Commit 166cc6204c96665e7b568cfcc8ba243e79dbf837 by dhyun
[SPARK-34990][SQL][TESTS] Add ParquetEncryptionSuite

### What changes were proposed in this pull request?

A simple test that writes and reads an encrypted parquet and verifies that it's encrypted by checking its magic string (in encrypted footer mode).

### Why are the changes needed?

To provide a test coverage for Parquet encryption.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] [SBT / Hadoop 3.2 / Java8 (the default)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137785/testReport)
- [ ] ~SBT / Hadoop 3.2 / Java11 by adding [test-java11] to the PR title.~ (Jenkins Java11 build is broken due to missing JDK11 installation)
- [x] [SBT / Hadoop 2.7 / Java8 by adding [test-hadoop2.7] to the PR title.](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137836/testReport)
- [x] Maven / Hadoop 3.2 / Java8 by adding [test-maven] to the PR title.
- [x] Maven / Hadoop 2.7 / Java8 by adding [test-maven][test-hadoop2.7] to the PR title.

Closes #32146 from andersonm-ibm/pme_testing.

Authored-by: Maya Anderson <mayaa@il.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 166cc62)
The file was addedsql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetEncryptionSuite.scala
The file was modifiedsql/hive/pom.xml (diff)
Commit bcac733bf1a302bd98593ed0f633aa38c8bd75d8 by dhyun
[SPARK-35200][CORE] Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove some unnecessary code

### What changes were proposed in this pull request?
Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager, and remove some unnecessary code.

### Why are the changes needed?

The number of the pending speculative tasks is recomputed in the ExecutorAllocationManager to calculate the maximum number of executors required.  While , it only needs to be computed once to improve performance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #32306 from weixiuli/SPARK-35200.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: bcac733)
The file was modifiedcore/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala (diff)
Commit 1f150b9392706293946278dd35e8f5a5016ed6df by ruifengz
[SPARK-35024][ML] Refactor LinearSVC - support virtual centering

### What changes were proposed in this pull request?
1, remove existing agg, and use a new agg supporting virtual centering
2, add related testsuites

### Why are the changes needed?
centering vectors should accelerate convergence, and generate solution more close to R

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated testsuites and added testsuites

Closes #32124 from zhengruifeng/svc_agg_refactor.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(commit: 1f150b9)
The file was modifiedR/pkg/tests/fulltests/test_mllib_classification.R (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
The file was addedmllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HingeBlockAggregator.scala
The file was addedmllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HingeBlockAggregatorSuite.scala
The file was removedmllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HingeAggregatorSuite.scala
The file was modifiedpython/pyspark/ml/classification.py (diff)
The file was modifiedmllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/optim/aggregator/MultinomialLogisticBlockAggregator.scala (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modifiedpython/pyspark/ml/tests/test_training_summary.py (diff)
The file was modifiedmllib/src/main/scala/org/apache/spark/ml/optim/aggregator/BinaryLogisticBlockAggregator.scala (diff)
The file was removedmllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HingeAggregator.scala
Commit b108e7fff3f9b6d3164cfa5cb4a9fbecadca7ad6 by hyukjinkwon
[SPARK-33913][SS] Upgrade Kafka to 2.8.0

### What changes were proposed in this pull request?

This PR aims to upgrade Kafka client to 2.8.0.
Note that Kafka 2.8.0 uses ZSTD JNI 1.4.9-1 like Apache Spark 3.2.0.

### Why are the changes needed?

This will bring the latest client-side improvement and bug fixes like the following examples.

- KAFKA-10631 ProducerFencedException is not Handled on Offest Commit
- KAFKA-10134 High CPU issue during rebalance in Kafka consumer after upgrading to 2.5
- KAFKA-12193 Re-resolve IPs when a client is disconnected
- KAFKA-10090 Misleading warnings: The configuration was supplied but isn't a known config
- KAFKA-9263 The new hw is added to incorrect log when  ReplicaAlterLogDirsThread is replacing log
- KAFKA-10607 Ensure the error counts contains the NONE
- KAFKA-10458 Need a way to update quota for TokenBucket registered with Sensor
- KAFKA-10503 MockProducer doesn't throw ClassCastException when no partition for topic

**RELEASE NOTE**
- https://downloads.apache.org/kafka/2.8.0/RELEASE_NOTES.html
- https://downloads.apache.org/kafka/2.7.0/RELEASE_NOTES.html

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the existing tests because this is a dependency change.

Closes #32325 from dongjoon-hyun/SPARK-33913.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: b108e7f)
The file was modifiedpom.xml (diff)
The file was modifiedexternal/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala (diff)