Changes

Summary

  1. [SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not (commit: 5b1353f) (details)
  2. [SPARK-35220][SQL] DayTimeIntervalType/YearMonthIntervalType show (commit: 6f782ef) (details)
  3. [SPARK-35087][UI] Some columns in table Aggregated Metrics by Executor (commit: 2d6467d) (details)
  4. [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based (commit: 38ef477) (details)
  5. [SPARK-35224][SQL][TESTS] Fix buffer overflow in (commit: d572a85) (details)
  6. [SPARK-35213][SQL] Keep the correct ordering of nested structs in (commit: 74afc68) (details)
  7. [SPARK-35088][SQL] Accept ANSI intervals by the Sequence expression (commit: c0a3c0c) (details)
  8. [SPARK-35223] Add IssueNavigationLink (commit: 84026d7) (details)
  9. [SPARK-35230][SQL] Move custom metric classes to proper package (commit: bdac191) (details)
  10. [SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType (commit: 1db031f) (details)
  11. [SPARK-34638][SQL] Single field nested column prune on generator output (commit: c59988a) (details)
  12. [SPARK-33985][SQL][TESTS] Add query test of combine usage of TRANSFORM (commit: f009046) (details)
  13. [SPARK-35060][SQL] Group exception messages in sql/types (commit: 1b609c7) (details)
  14. [SPARK-28247][SS][TEST] Fix flaky test "query without test harness" on (commit: 0df3b50) (details)
  15. [SPARK-35227][BUILD] Update the resolver for spark-packages in (commit: f738fe0) (details)
  16. [SPARK-35225][SQL] EXPLAIN command should handle empty output of (commit: 7779fce) (details)
  17. [SPARK-26164][SQL] Allow concurrent writers for writing dynamic (commit: 7f51106) (details)
  18. [SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors (commit: eb08b90) (details)
  19. [SPARK-35235][SQL][TEST] Add row-based hash map into aggregate benchmark (commit: c4ad86f) (details)
  20. [SPARK-35169][SQL] Fix wrong result of min ANSI interval division by -1 (commit: 2d2f467) (details)
  21. [SPARK-34837][SQL][FOLLOWUP] Fix division by zero in the avg function (commit: 55dea2d) (details)
  22. [SPARK-35239][SQL] Coalesce shuffle partition should handle empty input (commit: 4ff9f1f) (details)
  23. [SPARK-35091][SPARK-35090][SQL] Support extract from ANSI Intervals (commit: 16d223e) (details)
  24. [MINOR][DOCS][ML] Explicit return type of array_to_vector utility (commit: 592230e) (details)
  25. [SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents (commit: 26a8d2f) (details)
Commit 5b1353f690bf416fdb3a34c94741425b95f97308 by yao
[SPARK-35168][SQL] mapred.reduce.tasks should be shuffle.partitions not adaptive.coalescePartitions.initialPartitionNum

### What changes were proposed in this pull request?

```sql
spark-sql> set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1;
spark.sql.adaptive.coalescePartitions.initialPartitionNum 1
Time taken: 2.18 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:27:11 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 1
Time taken: 0.03 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions 200
Time taken: 0.024 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks=2;
21/04/21 14:31:52 WARN SetCommand: Property mapred.reduce.tasks is deprecated, automatically converted to spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 2
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks;
21/04/21 14:31:55 WARN SetCommand: Property mapred.reduce.tasks is deprecated, showing spark.sql.shuffle.partitions instead.
spark.sql.shuffle.partitions 1
Time taken: 0.017 seconds, Fetched 1 row(s)
spark-sql>
```

`mapred.reduce.tasks` is mapped to `spark.sql.shuffle.partitions` on the write side, but on the read side `spark.sql.adaptive.coalescePartitions.initialPartitionNum` may take precedence over `spark.sql.shuffle.partitions`.
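A minimal spark-shell sketch of the intended round trip after this fix (output omitted; values are illustrative):

```scala
// With the fix, SET/GET on mapred.reduce.tasks goes through spark.sql.shuffle.partitions,
// regardless of spark.sql.adaptive.coalescePartitions.initialPartitionNum.
spark.sql("SET spark.sql.adaptive.coalescePartitions.initialPartitionNum=1")
spark.sql("SET mapred.reduce.tasks=2").show(false)   // converted to spark.sql.shuffle.partitions=2
spark.sql("SET mapred.reduce.tasks").show(false)     // reports spark.sql.shuffle.partitions, i.e. 2
```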

### Why are the changes needed?

To make `mapred.reduce.tasks` round-trip correctly through `spark.sql.shuffle.partitions`.

### Does this PR introduce _any_ user-facing change?

Yes, `mapred.reduce.tasks` now always reports `spark.sql.shuffle.partitions`, whether or not `spark.sql.adaptive.coalescePartitions.initialPartitionNum` is set.

### How was this patch tested?

A new test.

Closes #32265 from yaooqinn/SPARK-35168.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 5b1353f)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala (diff)
Commit 6f782efb044403ab3ca79662fdeca7f1f906e1bf by hyukjinkwon
[SPARK-35220][SQL] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited

### What changes were proposed in this pull request?
DayTimeIntervalType/YearMonthIntervalType values are displayed differently between Hive SerDe and row format delimited.
This PR adds a test and opens the discussion.

For this problem there are two possible directions:

1. Leave the behavior as is and add an item to the migration guide docs explaining the difference.
2. Since we should not change Hive SerDe's behavior, change Spark's row format delimited behavior to cast DayTimeIntervalType/YearMonthIntervalType values using HIVE_STYLE.

### Why are the changes needed?
Add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Closes #32335 from AngersZhuuuu/SPARK-35220.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: 6f782ef)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkScriptTransformationSuite.scala (diff)
Commit 2d6467d6d1633a06b6416eacd4f8167973e88136 by sarutak
[SPARK-35087][UI] Some columns in table Aggregated Metrics by Executor of stage-detail page shows incorrectly.

### What changes were proposed in this pull request?

Columns like 'Shuffle Read Size / Records' and 'Output Size / Records' in the `Aggregated Metrics by Executor` table on the stage-detail page should be sorted in numerical order instead of lexicographical order.

### Why are the changes needed?
Bug fix: the sorting behavior should be consistent across columns.

The correspondence between columns and indexes is shown below (it is defined in stagespage-template.html):
| index | column name                            |
| ----- | -------------------------------------- |
| 0     | Executor ID                            |
| 1     | Logs                                   |
| 2     | Address                                |
| 3     | Task Time                              |
| 4     | Total Tasks                            |
| 5     | Failed Tasks                           |
| 6     | Killed Tasks                           |
| 7     | Succeeded Tasks                        |
| 8     | Excluded                               |
| 9     | Input Size / Records                   |
| 10    | Output Size / Records                  |
| 11    | Shuffle Read Size / Records            |
| 12    | Shuffle Write Size / Records           |
| 13    | Spill (Memory)                         |
| 14    | Spill (Disk)                           |
| 15    | Peak JVM Memory OnHeap / OffHeap       |
| 16    | Peak Execution Memory OnHeap / OffHeap |
| 17    | Peak Storage Memory OnHeap / OffHeap   |
| 18    | Peak Pool Memory Direct / Mapped       |

I constructed some data to simulate the sorting results of index columns 9 to 18.
As shown below, the sorting results of columns 9-12 are wrong:

![simulate-result](https://user-images.githubusercontent.com/52202080/115120775-c9fa1580-9fe1-11eb-8514-71f29db3a5eb.png)

The reason is that the real data behind columns 9-12 (note that this is not the data displayed on the page) are **all strings similar to `94685/131` (bytes/records), while the real data behind columns 13-18 are all numbers**,
so the sorting of columns 13-18 looks fine, but the results for columns 9-12 are incorrect because the strings are sorted in lexicographical order.
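The fix itself lives in stagepage.js; the following is only a hedged Scala sketch of the underlying issue, using made-up cell values:

```scala
// "size / records" cells are strings, so a plain sort is lexicographic and puts
// 200000 before 94685; sorting by a numeric key gives the expected order.
val cells = Seq("94685/131", "1024/5", "200000/7")

cells.sorted
// List(1024/5, 200000/7, 94685/131)   <- lexicographic (wrong for sizes)

cells.sortBy(_.split("/")(0).toLong)
// List(1024/5, 94685/131, 200000/7)   <- numeric order (what the page should show)
```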

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only JS was modified, and manual testing shows it works well.

**before modified:**
![looks-illegal](https://user-images.githubusercontent.com/52202080/115120812-06c60c80-9fe2-11eb-9ada-fa520fe43c4e.png)

**after modified:**
![sort-result-corrent](https://user-images.githubusercontent.com/52202080/114865187-7c847980-9e24-11eb-9fbc-39ee224726d6.png)

Closes #32190 from kyoty/aggregated-metrics-by-executor-sorted-incorrectly.

Authored-by: kyoty <echohlne@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(commit: 2d6467d)
The file was modified core/src/main/resources/org/apache/spark/ui/static/stagepage.js (diff)
Commit 38ef4771d447f6135382ee2767b3f32b96cb1b0e by mridulatgmail.com
[SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle

### What changes were proposed in this pull request?
This is one of the patches for SPIP SPARK-30602 for push-based shuffle.
Summary of changes:

- Introduce `MergeStatus` which tracks the partition level metadata for a merged shuffle partition in the Spark driver
- Unify `MergeStatus` and `MapStatus` under a single trait to allow code reusing inside `MapOutputTracker`
- Extend `MapOutputTracker` to support registering / unregistering `MergeStatus`, calculate preferred locations for a shuffle taking into consideration of merged shuffle partitions, and serving reducer requests for block fetching locations with merged shuffle partitions.

The added APIs in `MapOutputTracker` will be used by `DAGScheduler` in SPARK-32920 and by `ShuffleBlockFetcherIterator` in SPARK-32922

### Why are the changes needed?
Refer to SPARK-30602

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Venkata Sowrirajan <vsowrirajan@linkedin.com>

Closes #30480 from Victsm/SPARK-32921.

Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(commit: 38ef477)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/ShuffleSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/MapStatusesSerDeserBenchmark.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/MapOutputTracker.scala (diff)
The file was added core/src/main/scala/org/apache/spark/scheduler/MergeStatus.scala
Commit d572a859891547f73b57ae8c9a3b800c48029678 by max.gekk
[SPARK-35224][SQL][TESTS] Fix buffer overflow in `MutableProjectionSuite`

### What changes were proposed in this pull request?
In the test `"unsafe buffer with NO_CODEGEN"` of `MutableProjectionSuite`, fix the unsafe buffer size calculation so that all input fields plus the row metadata fit without overflowing the buffer.
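For context, a hedged sketch of the fixed-width portion of an UnsafeRow buffer that such a test has to account for (not the actual test code):

```scala
// An UnsafeRow needs a null-tracking bitset plus 8 bytes per field before any
// variable-length data; undersizing this is what causes the buffer overflow.
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

val numFields = 3
val fixedWidth = UnsafeRow.calculateBitSetWidthInBytes(numFields) + 8 * numFields
// variable-length fields (e.g. strings, decimals) need extra space on top of fixedWidth
```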

### Why are the changes needed?
To make the test suite `MutableProjectionSuite` more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt "test:testOnly *MutableProjectionSuite"
```

Closes #32339 from MaxGekk/fix-buffer-overflow-MutableProjectionSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: d572a85)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MutableProjectionSuite.scala (diff)
Commit 74afc68e2172cb0dc3567e12a8a2c304bb7ea138 by viirya
[SPARK-35213][SQL] Keep the correct ordering of nested structs in chained withField operations

### What changes were proposed in this pull request?

Modifies the UpdateFields optimizer to fix correctness issues with certain nested and chained withField operations. Examples for recreating the issue are in the new unit tests as well as the JIRA issue.

### Why are the changes needed?

Certain withField patterns can cause Exceptions or even incorrect results. It appears to be a result of the additional UpdateFields optimization added in https://github.com/apache/spark/pull/29812. It traverses fieldOps in reverse order to take the last one per field, but this can cause nested structs to change order which leads to mismatches between the schema and the actual data. This updates the optimization to maintain the initial ordering of nested structs to match the generated schema.
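A hedged Scala sketch of the kind of chained `withField` calls on a nested struct that this covers (hypothetical column names; see the JIRA for the exact reproductions):

```scala
import org.apache.spark.sql.functions._

// Chained withField updates on a nested struct: the optimizer must not reorder
// the nested fields relative to the schema it reports.
val df = spark.sql("SELECT named_struct('a', named_struct('x', 1, 'y', 2)) AS s")
df.select(
  col("s").withField("a.y", lit(20))
          .withField("a.z", lit(30))
          .as("s")
).show(false)
```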

### Does this PR introduce _any_ user-facing change?

It fixes exceptions and incorrect results for valid uses in the latest Spark release.

### How was this patch tested?

Added new unit tests for these edge cases.

Closes #32338 from Kimahriman/bug/optimize-with-fields.

Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 74afc68)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeWithFieldsSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala (diff)
Commit c0a3c0cbbe87757f7e86e3d6d3c2ed2815c647cb by max.gekk
[SPARK-35088][SQL] Accept ANSI intervals by the Sequence expression

### What changes were proposed in this pull request?
This PR makes the `Sequence` expression support ANSI intervals as the step expression.
If the start and stop expressions are of `TimestampType`, the step expression can be either a year-month or a day-time interval.
If the start and stop expressions are of `DateType`, the step expression must be a year-month interval.
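A hedged sketch of the new capability (Spark ≥ 3.2 SQL via the Scala REPL):

```scala
// Date start/stop: the step must be a year-month interval.
spark.sql(
  "SELECT sequence(DATE'2021-01-01', DATE'2021-05-01', INTERVAL '1' MONTH)").show(false)

// Timestamp start/stop: the step may also be a day-time interval.
spark.sql(
  "SELECT sequence(TIMESTAMP'2021-01-01 00:00:00', TIMESTAMP'2021-01-03 00:00:00', INTERVAL '12' HOUR)").show(false)
```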

### Why are the changes needed?
Extends the functionality of the `Sequence` expression.

### Does this PR introduce _any_ user-facing change?
Yes. Users can use ANSI intervals as the step expression of `Sequence`.

### How was this patch tested?
New tests.

Closes #32311 from beliefer/SPARK-35088.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: c0a3c0c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala (diff)
Commit 84026d794aa4c6b13a1493dffcd660b5837d6fd9 by yao
[SPARK-35223] Add IssueNavigationLink

### What changes were proposed in this pull request?

Add `IssueNavigationLink` so that IDEA's Git plugin renders hyperlinks for JIRA tickets and GitHub PRs.

![image](https://user-images.githubusercontent.com/26535726/115997353-5ecdc600-a615-11eb-99eb-6acbf15d8626.png)

### Why are the changes needed?

Make it more friendly for developers who use IDEA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32337 from pan3793/SPARK-35223.

Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 84026d7)
The file was modified .gitignore (diff)
The file was added .idea/vcs.xml
Commit bdac19184ac615355c16b5c097ac19970aca126a by dhyun
[SPARK-35230][SQL] Move custom metric classes to proper package

### What changes were proposed in this pull request?

This patch moves the DS v2 custom metric classes to the `org.apache.spark.sql.connector.metric` package. It also moves `CustomAvgMetric` and `CustomSumMetric` into that package and makes them public Java abstract classes.
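A hedged sketch of what a developer-defined metric might look like after the move (package and base class taken from the description above; the metric name is hypothetical):

```scala
import org.apache.spark.sql.connector.metric.CustomSumMetric

// CustomSumMetric supplies the sum aggregation; the connector only names and describes the metric.
class BytesWrittenMetric extends CustomSumMetric {
  override def name(): String = "bytesWritten"
  override def description(): String = "total bytes written by the data source"
}
```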

### Why are the changes needed?

`CustomAvgMetric` and `CustomSumMetric`  should be public APIs for developers to extend. As there are a few metric classes, we should put them together in one package.

### Does this PR introduce _any_ user-facing change?

No, this is dev-only and the classes have not been released yet.

### How was this patch tested?

Unit tests.

Closes #32348 from viirya/move-custom-metric-classes.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: bdac191)
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/metric/CustomAvgMetric.java
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/metric/CustomMetricsSuite.scala (diff)
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/metric/CustomSumMetric.java
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Scan.java (diff)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaBatchPartitionReader.scala (diff)
The file was removed sql/catalyst/src/main/java/org/apache/spark/sql/connector/CustomMetric.java
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/metric/CustomMetric.java
The file was removed sql/catalyst/src/main/java/org/apache/spark/sql/connector/CustomTaskMetric.java
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala (diff)
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/metric/CustomTaskMetric.java
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListenerSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala (diff)
The file was modified external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala (diff)
Commit 1db031f158b6ab94d8e789b278b2bd8f170c5bf2 by max.gekk
[SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited

### What changes were proposed in this pull request?
Add a note to the migration guide about DayTimeIntervalType/YearMonthIntervalType being displayed differently between Hive SerDe and row format delimited.

### Why are the changes needed?
Add note

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 1db031f)
The file was modified docs/sql-migration-guide.md (diff)
Commit c59988aa799a2aab1cd231ae2b1cf90f1a2b1af8 by viirya
[SPARK-34638][SQL] Single field nested column prune on generator output

### What changes were proposed in this pull request?

This patch proposes an improvement to nested column pruning when the pruning target is a generator's output. Previously we disallowed that case. This patch allows pruning when only a single nested column is accessed after `Generate`.

E.g., `df.select(explode($"items").as('item)).select($"item.itemId")`. As we only need `itemId` from `item`, we can prune other fields out and only keep `itemId`.

In this patch, we only address explode-like generators. We will address other generators in followups.

### Why are the changes needed?

This helps to extend the availability of nested column pruning.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31966 from viirya/SPARK-34638.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: c59988a)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ProjectionOverSchema.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala (diff)
Commit f0090463a8601d9137b668bc139b293c2b15fdc1 by wenchen
[SPARK-33985][SQL][TESTS] Add query test of combine usage of TRANSFORM and CLUSTER BY/ORDER BY

### What changes were proposed in this pull request?
Hive's documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform describes many usages of TRANSFORM combined with CLUSTER BY/ORDER BY; this PR adds tests for these cases.

### Why are the changes needed?
Add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32333 from AngersZhuuuu/SPARK-33985.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: f009046)
The file was modified sql/core/src/test/resources/sql-tests/inputs/transform.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/transform.sql.out (diff)
Commit 1b609c7dcfc3a30aefff12a71aac5c1d6273b2c0 by wenchen
[SPARK-35060][SQL] Group exception messages in sql/types

### What changes were proposed in this pull request?
This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/types`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #32244 from beliefer/SPARK-35060.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1b609c7)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala (diff)
Commit 0df3b501aeb2a88997e5a68a6a8f8e7f5196342c by kabhwan.opensource
[SPARK-28247][SS][TEST] Fix flaky test "query without test harness" on ContinuousSuite

### What changes were proposed in this pull request?

This is another attempt to fix the flaky test "query without test harness" on ContinuousSuite.

`query without test harness` is flaky because it starts a continuous query with two partitions but assumes they will run at the same speed.

In this test, 0 and 2 are written to partition 0, while 1 and 3 are written to partition 1. It assumes that by the time we see 3, 2 has already been written to the memory sink, but this is not guaranteed. We can add `if (currentValue == 2) Thread.sleep(5000)` at this line https://github.com/apache/spark/blob/b2a2b5d8206b7c09b180b8b6363f73c6c3fdb1d8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousRateStreamSource.scala#L135 to reproduce the failure: `Result set Set([0], [1], [3]) are not a superset of Set(0, 1, 2, 3)!`

The fix changes `waitForRateSourceCommittedValue` to wait until all partitions reach the desired values before stopping the query.

### Why are the changes needed?

Fix a flaky test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. Manually verified that the reproduction mentioned above no longer fails after this change.

Closes #32316 from zsxwing/SPARK-28247-fix.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(commit: 0df3b50)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/ContinuousSuite.scala (diff)
Commit f738fe07b6fc85c880b64a1cc2f6c7cc1cc1379b by hyukjinkwon
[SPARK-35227][BUILD] Update the resolver for spark-packages in SparkSubmit

### What changes were proposed in this pull request?
This change is to use repos.spark-packages.org instead of Bintray as the repository service for spark-packages.

### Why are the changes needed?
The change is needed because Bintray will no longer be available from May 1st.

### Does this PR introduce _any_ user-facing change?
This should be transparent for users who use SparkSubmit.

### How was this patch tested?
Tested running spark-shell with --packages manually.

Closes #32346 from bozhang2820/replace-bintray.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: f738fe0)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
Commit 7779fce79ab0c87db67240804ab0776b71ed9d05 by hyukjinkwon
[SPARK-35225][SQL] EXPLAIN command should handle empty output of analyzed plan

### What changes were proposed in this pull request?

The EXPLAIN command emits an empty line if the analyzed plan has no output. For example,

`sql("CREATE VIEW test AS SELECT 1").explain(true)` produces:
```
== Parsed Logical Plan ==
'CreateViewStatement [test], SELECT 1, false, false, PersistedView
+- 'Project [unresolvedalias(1, None)]
   +- OneRowRelation

== Analyzed Logical Plan ==

CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation

== Optimized Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation

== Physical Plan ==
Execute CreateViewCommand
   +- CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
         +- Project [1 AS 1#7]
            +- OneRowRelation
```

### Why are the changes needed?

To handle the empty output of an analyzed plan and remove the unneeded empty line.

### Does this PR introduce _any_ user-facing change?

Yes, now the EXPLAIN command for the analyzed plan produces the following without the empty line:
```
== Analyzed Logical Plan ==
CreateViewCommand `default`.`test`, SELECT 1, false, false, PersistedView, true
   +- Project [1 AS 1#7]
      +- OneRowRelation
```

### How was this patch tested?

Added a test.

Closes #32342 from imback82/analyzed_plan_blank_line.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
(commit: 7779fce)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (diff)
Commit 7f51106c0df190b3364848f8808e03105da63681 by wenchen
[SPARK-26164][SQL] Allow concurrent writers for writing dynamic partitions and bucket table

### What changes were proposed in this pull request?

This is a re-proposal of https://github.com/apache/spark/pull/23163. Currently Spark always requires a [local sort](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L188) before writing to an output table with dynamic partition/bucket columns. The sort can be unnecessary if the cardinality of partition/bucket values is small, and it can be avoided by keeping multiple output writers open concurrently.

This PR introduces a config, `spark.sql.maxConcurrentOutputFileWriters` (the feature is disabled by default), which lets the user tune the maximum number of concurrent writers. The config is needed because we cannot keep an arbitrary number of writers in task memory, which could cause OOM (especially for the Parquet/ORC vectorized writers).

The feature first writes rows with concurrent writers. If the number of writers exceeds the limit specified by the config above, the remaining rows are sorted and written one writer at a time (see `DynamicPartitionDataConcurrentWriter.writeWithIterator()`).

In addition, the interface `WriteTaskStatsTracker` and its implementation `BasicWriteTaskStatsTracker` are also changed because they previously relied on the assumption that only one writer is active when writing dynamic partitions and bucketed tables.
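A hedged usage sketch (config name taken from the description above; the output path is hypothetical):

```scala
// Allow up to 8 concurrent partition writers; 0 (the default) keeps the sort-based path.
spark.conf.set("spark.sql.maxConcurrentOutputFileWriters", "8")

spark.range(0, 1000)
  .selectExpr("id", "id % 10 AS part")   // low-cardinality partition column
  .write
  .partitionBy("part")
  .mode("overwrite")
  .parquet("/tmp/concurrent-writers-demo")
```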

### Why are the changes needed?

Avoid the sort before writing output for dynamically partitioned queries and bucketed tables.
This helps improve CPU and IO performance for these queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `DataFrameReaderWriterSuite.scala`.

Closes #32198 from c21/writer.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 7f51106)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonOutputWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CsvOutputWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOutputWriter.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOutputWriter.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileWriterFactory.scala (diff)
The file was modified external/avro/src/main/scala/org/apache/spark/sql/avro/AvroOutputWriter.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (diff)
Commit eb08b9010a0eff248e2afd8d35bf3cee8b1fe760 by wenchen
[SPARK-35139][SQL] Support ANSI intervals as Arrow Column vectors

### What changes were proposed in this pull request?
Extend ArrowColumnVector to support YearMonthIntervalType and DayTimeIntervalType.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35139

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. By checking coding style via:
    $ ./dev/scalastyle
    $ ./dev/lint-java
2. Run the test "ArrowWriterSuite"

Closes #32340 from Peng-Lei/SPARK-35139.

Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: eb08b90)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowWriterSuite.scala (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/util/ArrowUtilsSuite.scala (diff)
Commit c4ad86f3118d31575f587b5b899430627c8fdc71 by wenchen
[SPARK-35235][SQL][TEST] Add row-based hash map into aggregate benchmark

### What changes were proposed in this pull request?

`AggregateBenchmark` only tests the performance of the vectorized fast hash map, not the row-based hash map (which is used by default). We should add the row-based hash map to the benchmark.

java 8 benchmark run - https://github.com/c21/spark/actions/runs/787731549
java 11 benchmark run - https://github.com/c21/spark/actions/runs/787742858

### Why are the changes needed?

To have and track a basic benchmark of the different fast hash maps used in hash aggregate.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test, as this only touches benchmark code.

Closes #32357 from c21/agg-benchmark.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c4ad86f)
The file was modified sql/core/benchmarks/AggregateBenchmark-jdk11-results.txt (diff)
The file was modified sql/core/benchmarks/AggregateBenchmark-results.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala (diff)
Commit 2d2f467831d17df38fc6522da5066a8d848caaf5 by wenchen
[SPARK-35169][SQL] Fix wrong result of min ANSI interval division by -1

### What changes were proposed in this pull request?
Before this patch
```
scala> Seq(java.time.Period.ofMonths(Int.MinValue)).toDF("i").select($"i" / -1).show(false)
+-------------------------------------+
|(i / -1)                             |
+-------------------------------------+
|INTERVAL '-178956970-8' YEAR TO MONTH|
+-------------------------------------+
scala> Seq(java.time.Duration.of(Long.MinValue, java.time.temporal.ChronoUnit.MICROS)).toDF("i").select($"i" / -1).show(false)
+---------------------------------------------------+
|(i / -1)                                           |
+---------------------------------------------------+
|INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND|
+---------------------------------------------------+
```

Dividing the minimum ANSI interval by -1 produces a wrong result; this PR fixes it.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32314 from AngersZhuuuu/SPARK-35169.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2d2f467)
The file was modified sql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/interval.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
Commit 55dea2d937a375d9929937ee66aa9bfed158b883 by max.gekk
[SPARK-34837][SQL][FOLLOWUP] Fix division by zero in the avg function over ANSI intervals

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/32229 added support for ANSI SQL intervals in the aggregate function `avg`,
but it did not handle zero input rows, which leads to:
```
Caused by: java.lang.ArithmeticException: / by zero
at com.google.common.math.LongMath.divide(LongMath.java:367)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1864)
at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1253)
at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1253)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2248)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
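A hedged Scala sketch of the zero-row case behind the stack trace above (Spark ≥ 3.2 REPL; the post-fix result is assumed to be NULL, as for `avg` over other types):

```scala
import java.time.Duration
import spark.implicits._

// avg over an ANSI interval column with zero input rows previously hit "/ by zero".
val df = Seq(Duration.ofDays(1)).toDF("i").where("1 = 0")   // empty after the filter
df.selectExpr("avg(i)").show(false)
```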

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
No, this is just a new feature.

### How was this patch tested?
new tests.

Closes #32358 from beliefer/SPARK-34837-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 55dea2d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala (diff)
Commit 4ff9f1fe3bedb5422453838d904112454ddd5675 by wenchen
[SPARK-35239][SQL] Coalesce shuffle partition should handle empty input RDD

### What changes were proposed in this pull request?

Create an empty partition for the custom shuffle reader if the input RDD is empty.

### Why are the changes needed?

If an input RDD partition is empty, the map output statistics will be null. And if all of a shuffle stage's input RDD partitions are empty, we skip it and lose the chance to coalesce partitions.

We can simply create an empty partition for these custom shuffle readers to reduce the number of partitions.
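A hedged sketch of the empty-input scenario (the exact resulting partition count is an assumption, not taken from the PR):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// A shuffle whose input RDD has no rows: before the fix the coalescing step was skipped,
// leaving the default shuffle partition count; with the fix it can collapse to a single
// empty partition.
val empty = spark.range(0).selectExpr("id % 7 AS k", "id AS v")
val agg = empty.groupBy("k").count()
agg.collect()
agg.rdd.getNumPartitions
```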

### Does this PR introduce _any_ user-facing change?

Yes, the number of shuffle partitions might change under AQE.

### How was this patch tested?

Added a new test.

Closes #32362 from ulysses-you/SPARK-35239.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 4ff9f1f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit 16d223efee66b6299fa7ab29d7572e4717da7a11 by wenchen
[SPARK-35091][SPARK-35090][SQL] Support extract from ANSI Intervals

### What changes were proposed in this pull request?

In this PR, we add extract/date_part support for ANSI intervals.

`extract` is an ANSI expression, and `date_part` is non-ANSI but exists as an equivalent of `extract`.

#### expression

```
<extract expression> ::=
  EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
```

#### <extract field> for interval source

```

<primary datetime field> ::=
    <non-second primary datetime field>
| SECOND
<non-second primary datetime field> ::=
    YEAR
  | MONTH
  | DAY
  | HOUR
  | MINUTE
```

#### dataType

```
If <extract field> is a <primary datetime field> that does not specify SECOND or <extract field> is not a <primary datetime field>, then the declared type of the result is an implementation-defined exact numeric type with scale 0 (zero)

Otherwise, the declared type of the result is an implementation-defined exact numeric type with scale not less than the specified or implied <time fractional seconds precision> or <interval fractional seconds precision>, as appropriate, of the SECOND <primary datetime field> of the <extract source>.
```
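A hedged sketch of the syntax this enables (Spark ≥ 3.2 SQL via the Scala REPL):

```scala
// extract on a year-month interval
spark.sql("SELECT extract(YEAR FROM INTERVAL '10-2' YEAR TO MONTH)").show()

// date_part is the non-ANSI equivalent, here on a day-time interval
spark.sql("SELECT date_part('HOUR', INTERVAL '3 04:05:06' DAY TO SECOND)").show()
```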

### Why are the changes needed?

Subtask of ANSI Intervals Support

### Does this PR introduce _any_ user-facing change?

Yes
1. extract/date_part support ANSI intervals
2. For non-ANSI intervals, the return type changes from long to byte when extracting hours.

### How was this patch tested?

Newly added tests.

Closes #32351 from yaooqinn/SPARK-35091.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 16d223e)
The file was modified sql/core/src/test/resources/sql-tests/results/extract.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/extract.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/IntervalExpressionsSuite.scala (diff)
Commit 592230e47b1dd95d7e776bfc649081d7d64e2f07 by srowen
[MINOR][DOCS][ML] Explicit return type of array_to_vector utility function

There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector

In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
The documentation is ambiguous & can lead to the false conclusion that instances of
pyspark.mllib.linalg.DenseVector will be returned.
Conversion between the ml and mllib versions can easily be achieved with the
mlutils.convertVectorColumnsToML helper.

### What changes were proposed in this pull request?
Make documentation more explicit

### Why are the changes needed?
The documentation is a bit misleading; users can lose time investigating before realizing there are two DenseVector types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No tests were run as only the documentation was changed.

Closes #32255 from jlafaye/master.

Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 592230e)
The file was modified python/pyspark/ml/functions.py (diff)
Commit 26a8d2f908d4c9e2f276dcc36b4a5bbb49aa8259 by srowen
[SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents

### What changes were proposed in this pull request?
Add JindoFS SDK documents link in the cloud integration section of Spark's official document.

### Why are the changes needed?
If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested the URL manually.

Closes #32360 from adrian-wang/jindodoc.

Authored-by: Daoyuan Wang <daoyuan.wdy@alibaba-inc.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 26a8d2f)
The file was modified docs/cloud-integration.md (diff)