Changes

Summary

  1. [SPARK-35016][SQL] Format ANSI intervals in Hive style (commit: 0e761c7) (details)
  2. [SPARK-34983][PYTHON] Renaming the package alias from pp to ps (commit: fd8081c) (details)
  3. [SPARK-34986][SQL] Make an error msg clearer when ordinal numbers in group-by refer to agg funcs (commit: 03431d4) (details)
  4. [SPARK-35023][BUILD] Migrate from deprecated `in` to `slash` syntax in SBT build file (commit: d6df84e) (details)
  5. [SPARK-35017][SQL] Transfer ANSI intervals via Hive Thrift server (commit: 90820b3) (details)
  6. [SPARK-34916][SQL] Add condition lambda and rule id to the transform family for early stopping (commit: 3db8ec2) (details)
  7. [SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics (commit: 2123237) (details)
  8. [SPARK-34905][SQL][TESTS] Enable ANSI intervals in SQLQueryTestSuite/ThriftServerQueryTestSuite (commit: 8f8bac6) (details)
  9. [SPARK-35005][SQL] Improve error msg if UTF8String concatWs length overflow (commit: 1be1012) (details)
  10. [SPARK-34562][SQL] Add test and doc for Parquet Bloom filter push down (commit: e40fce9) (details)
  11. [SPARK-34593][SQL][FOLLOWUP] Fix BroadcastNestedLoopJoinExec.outputPartition with full outer join (commit: 8627cab) (details)
  12. [SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.* (commit: b43f7e6) (details)
  13. [SPARK-35031][PYTHON] Port Koalas operations on different frames tests into PySpark (commit: 9c1f807) (details)
  14. [SPARK-28227][SQL] Support projection, aggregate/window functions, and lateral view in the TRANSFORM clause (commit: 278203d) (details)
  15. [SPARK-35028][SQL] ANSI mode: disallow group by aliases (commit: 79e55b4) (details)
  16. [SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type (commit: 16e2faa) (details)
  17. [SPARK-35029][CORE] Extract `tryOrFetchFailedException` method to eliminate duplicate code in `BufferReleasingInputStream` (commit: aae4ab2) (details)
  18. [SPARK-34996][PYTHON] Port Koalas Series-related unit tests into PySpark (commit: a392633) (details)
  19. [SPARK-35037][SQL] Recognize sign before the interval string in literals (commit: 26f312e) (details)
  20. [SPARK-33411][SQL] Cardinality estimation of union, sort and range operator (commit: caf33be) (details)
  21. [SPARK-28379][SQL] Allow non-aggregated single row correlated scalar subquery (commit: 6b8405b) (details)
  22. [MINOR][TESTS] Enhance the test instruction of ThriftServerQueryTestSuite (commit: 5d12653) (details)
  23. [SPARK-34212][SQL][FOLLOWUP] Move the added test to ParquetQuerySuite (commit: b34a84e) (details)
Commit 0e761c7307e780bd1d6cfe40e3d17ad003768411 by max.gekk
[SPARK-35016][SQL] Format ANSI intervals in Hive style

### What changes were proposed in this pull request?
1. Extend the `IntervalUtils` methods `toYearMonthIntervalString` and `toDayTimeIntervalString` to support formatting of year-month/day-time intervals in Hive style. The methods take a new `style` parameter which can have two values: `HIVE_STYLE` and `ANSI_STYLE` (see the sketch below).
2. Invoke `toYearMonthIntervalString` and `toDayTimeIntervalString` from the `Cast` expression with the `style` parameter set to `ANSI_STYLE`.
3. Invoke `toYearMonthIntervalString` and `toDayTimeIntervalString` from `HiveResult` with `style` set to `HIVE_STYLE`.
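
For illustration, here is a minimal, self-contained sketch of the two styles; the enum and method shapes mirror the description above and are assumptions, not the exact Spark-internal API:
```scala
object IntervalFormatSketch {
  sealed trait IntervalStyle
  case object ANSI_STYLE extends IntervalStyle
  case object HIVE_STYLE extends IntervalStyle

  // Format a day-time interval given as microseconds.
  def toDayTimeIntervalString(micros: Long, style: IntervalStyle): String = {
    val days = micros / 86400000000L
    val rest = micros % 86400000000L
    val hours = rest / 3600000000L
    val minutes = rest % 3600000000L / 60000000L
    val seconds = rest % 60000000L / 1000000L
    val nanos = rest % 1000000L * 1000L
    style match {
      case HIVE_STYLE =>
        f"$days $hours%02d:$minutes%02d:$seconds%02d.$nanos%09d"
      case ANSI_STYLE =>
        f"INTERVAL '$days $hours%02d:$minutes%02d:$seconds%02d.${nanos / 1000}%06d' DAY TO SECOND"
    }
  }

  def main(args: Array[String]): Unit = {
    val micros = 90123000001L // 1 day 01:02:03.000001
    println(toDayTimeIntervalString(micros, HIVE_STYLE)) // 1 01:02:03.000001000
    println(toDayTimeIntervalString(micros, ANSI_STYLE)) // INTERVAL '1 01:02:03.000001' DAY TO SECOND
  }
}
```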

### Why are the changes needed?
The `spark-sql` shell formats its output in Hive style by using `HiveResult.hiveResultString()`. The changes are needed to match Hive behavior. For instance,

Hive:
```sql
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
+-----------------------+
|          _c0          |
+-----------------------+
| 1 01:02:03.000001000  |
+-----------------------+
```

Spark before the changes:
```sql
spark-sql> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
INTERVAL '1 01:02:03.000001' DAY TO SECOND
```

Also this should unblock #32099 which enables *.sql tests in `SQLQueryTestSuite`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
spark-sql> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
1 01:02:03.000001000
```

### How was this patch tested?
1. Added new tests to `IntervalUtilsSuite`:
```
$  build/sbt "test:testOnly *IntervalUtilsSuite"
```
2. Modified existing tests in `HiveResultSuite`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "testOnly *HiveResultSuite"
```
3. By running cast tests:
```
$ build/sbt "testOnly *CastSuite*"
```

Closes #32120 from MaxGekk/ansi-intervals-hive-thrift-server.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 0e761c7)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala (diff)
Commit fd8081cd2749b4f087c0c1bb36c023419f0ee11e by gurwls223
[SPARK-34983][PYTHON] Renaming the package alias from pp to ps

### What changes were proposed in this pull request?

This PR proposes to fix:

```python
import pyspark.pandas as pp
```

to

```python
import pyspark.pandas as ps
```

### Why are the changes needed?

`pp` might sound offensive in some contexts.

### Does this PR introduce _any_ user-facing change?

The change is in master only. We'll use `ps` as the short name instead of `pp`.

### How was this patch tested?

The CI in this PR will test it out.

Closes #32108 from LSturtew/renaming_pyspark.pandas.

Authored-by: Luka Sturtewagen <luka.sturtewagen@linkit.nl>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: fd8081c)
The file was modified python/pyspark/pandas/window.py (diff)
The file was modified python/pyspark/pandas/typedef/string_typehints.py (diff)
The file was modified python/pyspark/pandas/namespace.py (diff)
The file was modified python/pyspark/pandas/extensions.py (diff)
The file was modified python/pyspark/pandas/ml.py (diff)
The file was modified python/pyspark/pandas/indexes/datetimes.py (diff)
The file was modified python/pyspark/pandas/groupby.py (diff)
The file was modified python/pyspark/pandas/sql_processor.py (diff)
The file was modified python/pyspark/pandas/config.py (diff)
The file was modified python/pyspark/pandas/accessors.py (diff)
The file was modified python/pyspark/pandas/plot/plotly.py (diff)
The file was modified python/pyspark/pandas/testing/utils.py (diff)
The file was modified python/pyspark/pandas/indexing.py (diff)
The file was modified python/pyspark/pandas/numpy_compat.py (diff)
The file was modified python/pyspark/pandas/indexes/category.py (diff)
The file was modified python/pyspark/pandas/typedef/typehints.py (diff)
The file was modified python/pyspark/pandas/datetimes.py (diff)
The file was modified python/pyspark/pandas/base.py (diff)
The file was modified python/pyspark/pandas/generic.py (diff)
The file was modified python/pyspark/pandas/indexes/multi.py (diff)
The file was modified python/pyspark/pandas/plot/core.py (diff)
The file was modified python/pyspark/pandas/indexes/base.py (diff)
The file was modified python/pyspark/pandas/spark/accessors.py (diff)
The file was modified python/pyspark/pandas/strings.py (diff)
The file was modified python/pyspark/pandas/utils.py (diff)
The file was modified python/pyspark/pandas/exceptions.py (diff)
The file was modified python/pyspark/pandas/mlflow.py (diff)
The file was modified python/pyspark/pandas/tests/test_dataframe.py (diff)
The file was modified python/pyspark/pandas/series.py (diff)
The file was modified python/pyspark/pandas/categorical.py (diff)
The file was modified python/pyspark/pandas/frame.py (diff)
The file was modified python/pyspark/pandas/indexes/numeric.py (diff)
The file was modified python/pyspark/pandas/internal.py (diff)
Commit 03431d40ebc86f64e514602fb6ff6ed354221fb8 by gurwls223
[SPARK-34986][SQL] Make an error msg clearer when ordinal numbers in group-by refer to agg funcs

### What changes were proposed in this pull request?
Previously, when we used an ordinal in a GROUP BY expression and the referenced position was an aggregate function, the error shown was:
```
-- !query
select a, b, sum(b) from data group by 3
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
aggregate functions are not allowed in GROUP BY, but found sum(data.b)
```

The message is not clear enough, so this PR refactors it.

### Why are the changes needed?
Refactor the error message to be clearer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #32089 from AngersZhuuuu/SPARK-34986.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 03431d4)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/group-by-ordinal.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
Commit d6df84e7348268ef8cf2eeab3929ffc28574d6b7 by gurwls223
[SPARK-35023][BUILD] Migrate from deprecated `in` to `slash` syntax in SBT build file

### What changes were proposed in this pull request?

sbt 1.5.0 deprecates the `in` syntax from 0.13.x, so adjusting the build file is recommended.
See https://www.scala-sbt.org/1.x/docs/Migrating-from-sbt-013x.html#Migrating+to+slash+syntax
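
For example (illustrative lines, not the exact `SparkBuild.scala` changes):
```scala
// sbt 0.13.x `in` syntax, deprecated since sbt 1.5:
javaOptions in Test += "-Xmx4g"
scalacOptions in (Compile, doc) += "-groups"

// Equivalent sbt 1.x slash syntax:
Test / javaOptions += "-Xmx4g"
Compile / doc / scalacOptions += "-groups"
```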

### Why are the changes needed?

Removes a significant number of deprecation warnings and prepares for the syntax's removal in future versions of sbt.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build should pass on GH Actions.

Closes #32115 from gemelen/feature/sbt-1.5-fixes.

Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: d6df84e)
The file was modified project/SparkBuild.scala (diff)
Commit 90820b3ec3da9afc12acee21b53bcfa1733129a9 by gurwls223
[SPARK-35017][SQL] Transfer ANSI intervals via Hive Thrift server

### What changes were proposed in this pull request?
1. Map Catalyst's interval types to Hive's types:
    - YearMonthIntervalType -> `interval_year_month`
    - DayTimeIntervalType -> `interval_day_time`
2. Invoke `HiveResult.toHiveString()` to convert the external interval types `java.time.Period`/`java.time.Duration` to strings.
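
A minimal sketch of the type mapping in item 1, assuming a helper over Catalyst's `DataType` (the function name is hypothetical):
```scala
import org.apache.spark.sql.types._

// Hypothetical helper mirroring the mapping above.
def toHiveTypeName(dt: DataType): String = dt match {
  case _: YearMonthIntervalType => "interval_year_month"
  case _: DayTimeIntervalType   => "interval_day_time"
  case other                    => other.catalogString
}
```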

### Why are the changes needed?
1. To be able to retrieve ANSI intervals via Hive Thrift server.
2. This fixes the issue:
```sql
$ ./sbin/start-thriftserver.sh
$ ./bin/beeline
Beeline version 2.3.8 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000/default "" "" ""
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Spark SQL (version 3.2.0-SNAPSHOT)
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
Error: java.lang.IllegalArgumentException: Unrecognized type name: day-time interval (state=,code=0)
```
3. It should unblock https://github.com/apache/spark/pull/32099 which enables `*.sql` tests in `ThriftServerQueryTestSuite`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```sql
0: jdbc:hive2://localhost:10000/default> select timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31';
+----------------------------------------------------+
| subtracttimestamps(TIMESTAMP '2021-01-01 01:02:03.000001', DATE '2020-12-31') |
+----------------------------------------------------+
| 1 01:02:03.000001000                               |
+----------------------------------------------------+
1 row selected (1.637 seconds)
```

### How was this patch tested?
By running new test:
```
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkThriftServerProtocolVersionsSuite"
$ ./build/sbt -Phive -Phive-thriftserver "test:testOnly *SparkMetadataOperationSuite"
```
Also checked an array of an interval:
```sql
0: jdbc:hive2://localhost:10000/default> select array(timestamp'2021-01-01 01:02:03.000001' - date'2020-12-31');
+----------------------------------------------------+
| array(subtracttimestamps(TIMESTAMP '2021-01-01 01:02:03.000001', DATE '2020-12-31')) |
+----------------------------------------------------+
| [1 01:02:03.000001000]                             |
+----------------------------------------------------+
```

Closes #32121 from MaxGekk/ansi-intervals-thrift-protocol.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 90820b3)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkThriftServerProtocolVersionsSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/ColumnValue.java (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetTypeInfoOperation.scala (diff)
Commit 3db8ec258c4a8438bda73c26fc7b1eb6f9d51631 by ltnwgl
[SPARK-34916][SQL] Add condition lambda and rule id to the transform family for early stopping

### What changes were proposed in this pull request?

This PR contains:
- TreeNode, QueryPlan, AnalysisHelper changes to allow the transform function family to stop earlier without traversing the entire tree;
- Example changes in a few rules to support such pruning, e.g., ReorderJoin and OptimizeIn.

Here is a [design doc](https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk) that elaborates the ideas and benchmark numbers.
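
For a flavor of the resulting call-site shape, here is a minimal sketch modeled on the `OptimizeIn` example; the `IN` tree pattern and the `transformAllExpressionsWithPruning` entry point follow the PR description but should be read as assumptions about the exact merged API:
```scala
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.IN

object SimplifyEmptyIn extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan =
    // The condition lambda prunes any subtree whose pattern bits show it cannot
    // contain an IN expression; the rule id lets the framework skip subtrees on
    // which this rule has already reached a fixed point.
    plan.transformAllExpressionsWithPruning(_.containsPattern(IN), ruleId) {
      case In(value, Seq()) if !value.nullable => Literal.FalseLiteral
    }
}
```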

### Why are the changes needed?

It's a framework-level change for reducing the query compilation time.
In particular, if we update existing rules and transform call sites as per the examples in this PR, the analysis time and query optimization time can be reduced as described in this [doc](https://docs.google.com/document/d/1SEUhkbo8X-0cYAJFYFDQhxUnKJBz4lLn3u4xR2qfWqk).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It is tested by existing tests.

Closes #32060 from sigmod/bits.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: 3db8ec2)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushDownLeftSemiAntiJoin.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatternBits.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushExtraPredicateThroughJoin.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/collection/BitSet.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/Rule.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleIdCollection.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeInSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala (diff)
Commit 21232377ba654babd0dc929b5278835beefcc6a1 by wenchen
[SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics

### What changes were proposed in this pull request?
Support GROUP BY with separate columns and CUBE/ROLLUP.

PostgreSQL supports the following:
```
select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());
```
In this PR, we have done two things:

1. Support partial grouping analytics such as `group by a, cube(a, b)`
2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)`

*Partial Groupings*

    Partial Groupings means there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS`
    in the GROUP BY clause. For example:
    `GROUP BY warehouse, CUBE(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
    `GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
    `GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.

*Concatenated Groupings*

    Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
    with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
    operation enables even a small number of concatenated groupings to generate a large number of final groups.
    The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP`
    clauses and separating them with commas. For example:
    `GROUP BY GROUPING SETS((warehouse), (product)), GROUPING SETS((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
    `GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`,
    which is equivalent to
    `GROUP BY GROUPING SETS(
        (warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
        (warehouse, location, size), (warehouse, location), (warehouse),
        (product, location, size), (product, location), (product),
        (location, size), (location), ())`.
    `GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`,
    which is equivalent to
    `GROUP BY GROUPING SETS(
        (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
        (order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
        (order, product, location, size), (order, product, location), (order, product),
        (order, location, size), (order, location), (order))`.

### Why are the changes needed?
Support more flexible grouping analytics

### Does this PR introduce _any_ user-facing change?
Users can use SQL like
```
select a, b, c, agg_expr() from table group by a, cube(b, c)
```

### How was this patch tested?
Added UT

Closes #30144 from AngersZhuuuu/SPARK-33229.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 2123237)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/grouping.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified docs/sql-ref-syntax-qry-select-groupby.md (diff)
Commit 8f8bac64355aca3b20dee26a6ab8fa1537462512 by wenchen
[SPARK-34905][SQL][TESTS] Enable ANSI intervals in `SQLQueryTestSuite`/`ThriftServerQueryTestSuite`

### What changes were proposed in this pull request?
Remove the `spark.sql.legacy.interval.enabled` settings from `SQLQueryTestSuite`/`ThriftServerQueryTestSuite`, which enables the new ANSI intervals by default.

### Why are the changes needed?
To use the default settings for intervals, and to test the new ANSI intervals (year-month and day-time) introduced by SPARK-27793.

### Does this PR introduce _any_ user-facing change?
Should not, because this affects tests only.

### How was this patch tested?
By running the affected tests, for instance:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32099 from MaxGekk/enable-ansi-intervals-sql-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 8f8bac6)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/typeCoercion/native/promoteStrings.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/charvarchar.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/show-tblproperties.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/create_view.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/describe.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/timestamp.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/datetime.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/datetime-legacy.sql.out (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala (diff)
Commit 1be10124977bc32b580e37cf533328195ab6378c by max.gekk
[SPARK-35005][SQL] Improve error msg if UTF8String concatWs length overflow

### What changes were proposed in this pull request?

Add a check for whether the byte length overflows `int`.

### Why are the changes needed?

We encountered a very extreme case with the expression `concat_ws`, where the error msg was
```
Caused by: java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.concatWs
```
It seems `UTF8String.concat` already does a length check (added in [#21064](https://github.com/apache/spark/pull/21064)), so it's better to add the same check in `concatWs`.
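
A minimal sketch of the kind of guard this adds, assuming byte lengths are accumulated into a `long` so the total cannot silently wrap negative (names are illustrative, not the actual `UTF8String` internals):
```scala
// Illustrative guard: sum lengths in 64 bits and reject totals over Int.MaxValue
// with a descriptive message instead of a NegativeArraySizeException.
def checkedConcatWsLength(sepLen: Int, inputLens: Seq[Int]): Int = {
  val total = inputLens.foldLeft(0L)(_ + _) +
    sepLen.toLong * math.max(inputLens.size - 1, 0)
  if (total > Int.MaxValue) {
    throw new IllegalArgumentException(
      s"Cannot allocate a string of $total bytes: exceeds ${Int.MaxValue}")
  }
  total.toInt
}
```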

### Does this PR introduce _any_ user-facing change?

Yes

### How was this patch tested?

It is too heavyweight to add a test for this.

Closes #32106 from ulysses-you/SPARK-35005.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 1be1012)
The file was modified common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java (diff)
Commit e40fce919ab77f5faeb0bbd34dc86c56c04adbaa by max.gekk
[SPARK-34562][SQL] Add test and doc for Parquet Bloom filter push down

### What changes were proposed in this pull request?

This PR adds tests and documentation for Parquet Bloom filter push down.
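
For reference, the write-side usage documented by this change looks roughly like the following sketch (spark-shell style; the column name and output path are hypothetical, the option keys are the standard parquet-hadoop ones):
```scala
// Write Parquet with a bloom filter on column `s` so that equality filters on
// `s` can skip row groups at read time.
spark.range(100000L)
  .selectExpr("id", "cast(id as string) AS s")
  .write
  .option("parquet.bloom.filter.enabled#s", "true")
  .option("parquet.bloom.filter.expected.ndv#s", "100000")
  .parquet("/tmp/bloom_demo")
```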

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Generating docs:
![image](https://user-images.githubusercontent.com/5399861/114327472-c131bb80-9b6b-11eb-87a0-6f9a74eb1097.png)

Closes #32123 from wangyum/SPARK-34562.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: e40fce9)
The file was modified examples/src/main/python/sql/datasource.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (diff)
The file was modified docs/sql-data-sources-load-save-functions.md (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala (diff)
The file was modified examples/src/main/r/RSparkSQLExample.R (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java (diff)
Commit 8627cab39da0f91e38028450f0a6f1800e631c61 by viirya
[SPARK-34593][SQL][FOLLOWUP] Fix BroadcastNestedLoopJoinExec.outputPartition with full outer join

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/31708 . For full outer join, the final result RDD is created from
```
sparkContext.union(
  matchedStreamRows,
  sparkContext.makeRDD(notMatchedBroadcastRows)
)
```

It's incorrect to say that the final output partitioning is `UnknownPartitioning(left.outputPartitioning.numPartitions)`

### Why are the changes needed?

Fix a correctness bug

### Does this PR introduce _any_ user-facing change?

Yes, see the added test. Fortunately, this bug is not released yet.

### How was this patch tested?

new test

Closes #32132 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 8627cab)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala (diff)
Commit b43f7e6a974cac5aae401224c14b870c18fbded8 by gurwls223
[SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.*

### What changes were proposed in this pull request?
Fix type hint mismatches in pyspark.sql.*.

### Why are the changes needed?
There were some type hint mismatches in pyspark.sql.*.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
dev/lint-python passed.

Closes #32122 from Yikun/SPARK-35019.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b43f7e6)
The file was modified python/pyspark/sql/dataframe.py (diff)
The file was modified python/pyspark/sql/context.pyi (diff)
The file was modified python/pyspark/sql/catalog.pyi (diff)
The file was modified python/pyspark/sql/functions.pyi (diff)
The file was modified python/pyspark/sql/conf.pyi (diff)
The file was modified python/pyspark/sql/dataframe.pyi (diff)
Commit 9c1f807549b3c8cafcc6f16a42935b9c32a65855 by gurwls223
[SPARK-35031][PYTHON] Port Koalas operations on different frames tests into PySpark

### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable operations on different frames unit tests.

Closes #32133 from xinrong-databricks/port.test_ops_on_diff_frames.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 9c1f807)
The file was added python/pyspark/pandas/tests/test_ops_on_diff_frames.py
The file was added python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was added python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby.py
The file was added python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py
Commit 278203d969bcea2c3d3d42b5a06e89bb3d170837 by yamamuro
[SPARK-28227][SQL] Support projection, aggregate/window functions, and lateral view in the TRANSFORM clause

### What changes were proposed in this pull request?
Spark SQL can't support script transform SQL with an aggregation clause, window clause, or LATERAL VIEW.
In such cases we can't directly migrate Hive SQL to Spark SQL.

In this PR, we treat the script transform statement's query part (excluding the transform part) as a separate query block, make it the ScriptTransformation's child, and pass an UnresolvedStar as ScriptTransformation's input. Then, at the analyzer level, we pass the child's output as ScriptTransformation's input. This way we can support any kind of normal SELECT query combined with script transformation.

Such as transform with aggregation:
```
SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING max_d1 > 0
```
When we build AST, we treat it as
```
SELECT TRANSFORM (*)
USING 'cat' AS (a,b,c)
FROM (
     SELECT d2, max(d1) as max_d1, sum(d3)
     FROM script_trans
     WHERE d1 <= 100
     GROUP BY d2
     HAVING max_d1 > 0
) tmp
```
then, in the Analyzer's `ResolveReferences`, the `*` (`UnresolvedStar`) is resolved, and the SQL behaves like
```
SELECT TRANSFORM ( d2, max(d1) as max_d1, sum(d3))
USING 'cat' AS (a,b,c)
FROM script_trans
WHERE d1 <= 100
GROUP BY d2
HAVING max_d1 > 0
```

As for UTs, this PR adds many different SQL queries to check that we can support all such SQL and that each kind of expression works well, such as aliases, CASE WHEN, binary computation, etc.

### Why are the changes needed?
Support TRANSFORM with an aggregation clause, window clause, LATERAL VIEW, etc., to make SQL migration smoother.

### Does this PR introduce _any_ user-facing change?
Users can write TRANSFORM with an aggregation clause, window clause, or LATERAL VIEW.

### How was this patch tested?
Added UT

Closes #29087 from AngersZhuuuu/SPARK-28227-NEW.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 278203d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/transform.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/transform.sql.out (diff)
Commit 79e55b44f7ea25311e1c27a152876304263a7191 by ltnwgl
[SPARK-35028][SQL] ANSI mode: disallow group by aliases

### What changes were proposed in this pull request?

Disallow group by aliases under ANSI mode.

### Why are the changes needed?

As per the ANSI SQL standard, section 7.12 `<group by clause>`:

>Each `grouping column reference` shall unambiguously reference a column of the table resulting from the `from clause`. A column referenced in a `group by clause` is a grouping column.

By forbidding it, we can avoid ambiguous SQL queries like:
```
SELECT col + 1 as col FROM t GROUP BY col
```
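
Concretely, under ANSI mode the grouping column below resolves only against the FROM clause, so it refers to the table's own `col`, never to the `col + 1` alias (a spark-shell-style sketch; table and column names are hypothetical):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// GROUP BY col now unambiguously references t.col from the FROM clause:
spark.sql("SELECT col + 1 AS col FROM t GROUP BY col")
```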

### Does this PR introduce _any_ user-facing change?

Yes, group by aliases are not allowed under ANSI mode.

### How was this patch tested?

Unit tests

Closes #32129 from gengliangwang/disallowGroupByAlias.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: 79e55b4)
The file was added sql/core/src/test/resources/sql-tests/results/ansi/group-analytics.sql.out
The file was modified docs/sql-ref-ansi-compliance.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was added sql/core/src/test/resources/sql-tests/inputs/ansi/group-analytics.sql
Commit 16e2faadac547da1736308a5b69b4c7a6b7a627c by yao
[SPARK-34944][SQL][TESTS] Replace bigint with int for web_returns and store_returns in TPCDS tests to employ correct data type

### What changes were proposed in this pull request?

According to  http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf

```
2.2.2 Datatype
2.2.2.1 Each column employs one of the following datatypes:
a) Identifier means that the column shall be able to hold any key value generated for that column.
b) Integer means that the column shall be able to exactly represent integer values (i.e., values in increments of
1) in the range of at least (−2^(n−1)) to (2^(n−1) − 1), where n is 64.
c) Decimal(d, f) means that the column shall be able to represent decimal values up to and including d digits,
of which f shall occur to the right of the decimal place; the values can be either represented exactly or
interpreted to be in this range.
d) Char(N) means that the column shall be able to hold any string of characters of a fixed length of N.
Comment: If the string that a column of datatype char(N) holds is shorter than N characters, then trailing
spaces shall be stored in the database or the database shall automatically pad with spaces upon retrieval such
that a CHAR_LENGTH() function will return N.
e) Varchar(N) means that the column shall be able to hold any string of characters of a variable length with a
maximum length of N. Columns defined as "varchar(N)" may optionally be implemented as "char(N)".
f) Date means that the column shall be able to express any calendar day between January 1, 1900 and
December 31, 2199.
2.2.2.2 The datatypes do not correspond to any specific SQL-standard datatype. The definitions are provided to
highlight the properties that are required for a particular column. The benchmark implementer may employ any internal representation or SQL datatype that meets those requirements.
```

This PR proposes that we use int for identifiers instead of bigint to reach a compromise with the TPC-DS Standard Specification.

After this PR, the field schemas are now consistent with those DDLs in the `tpcds.sql` from tpc-ds tool kit, see https://gist.github.com/yaooqinn/b9978a77bbf4f871a95d6a9103019907

### Why are the changes needed?

To reach a compromise with the TPC-DS Standard Specification.

### Does this PR introduce _any_ user-facing change?

No, test only.

### How was this patch tested?

test only

Closes #32037 from yaooqinn/SPARK-34944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 16e2faa)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24b.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q50.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q93/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q17.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24a.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q80a.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q77/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q75/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q17.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q50.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q1/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q49/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q80a/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q64.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q77/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q50/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q75.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q77.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q78/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q84.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q49.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q77a.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q30.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q30.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q85.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q17/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q80.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q30/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q1/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q77a.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24a/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q24.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q78.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q5a/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q77a/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q85.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q84/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24b/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/explain.txt (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q49.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q49/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q93.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q50/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q77.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q80/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q78.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24a.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q24/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q29/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q64/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q17/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q24b.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q85/explain.txt (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q83/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q78.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q64.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q29.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q1.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q1.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q25/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q64/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q78.sf100/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q5/simplified.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q78/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q75.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q75/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q77a/explain.txt (diff)
Commit aae4ab278b5e4a67efc056c16ce0e76aa74eda06 by gurwls223
[SPARK-35029][CORE] Extract `tryOrFetchFailedException` method to eliminate duplicate code in `BufferReleasingInputStream`

### What changes were proposed in this pull request?
The main change of this PR is to extract a `private` method named `tryOrFetchFailedException` to eliminate duplicate code in `BufferReleasingInputStream`.

The duplicated code follows this pattern:

```
try {
  block
} catch {
  case e: IOException if detectCorruption =>
    IOUtils.closeQuietly(this)
    iterator.throwFetchFailedException(blockId, mapIndex, address, e)
}
```
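
A sketch of what the extracted helper can look like, following the pattern above (the exact signature is an assumption):
```scala
// Runs `block`, converting an IOException into a fetch-failed exception when
// corruption detection is enabled, exactly as the duplicated sites did.
private def tryOrFetchFailedException[T](block: => T): T = {
  try {
    block
  } catch {
    case e: IOException if detectCorruption =>
      IOUtils.closeQuietly(this)
      iterator.throwFetchFailedException(blockId, mapIndex, address, e)
  }
}
```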

### Why are the changes needed?
Eliminate duplicate code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32130 from LuciferYang/SPARK-35029.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: aae4ab2)
The file was modified core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala (diff)
Commit a3926335669d671f15e4f323e68f7bab4f55abc8 by gurwls223
[SPARK-34996][PYTHON] Port Koalas Series-related unit tests into PySpark

### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series-related unit tests to PySpark.

### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enable Series-related unit tests.

Closes #32117 from xinrong-databricks/port.test_series_related.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a392633)
The file was added python/pyspark/pandas/tests/test_series_conversion.py
The file was added python/pyspark/pandas/tests/test_series_string.py
The file was modified dev/sparktestsupport/modules.py (diff)
The file was added python/pyspark/pandas/tests/test_series.py
The file was added python/pyspark/pandas/tests/test_series_datetime.py
Commit 26f312e95f255a3ed6fc9c09c1eabe6794135286 by max.gekk
[SPARK-35037][SQL] Recognize sign before the interval string in literals

### What changes were proposed in this pull request?
1. Extend SQL syntax rules to support a sign before the interval strings of ANSI year-month and day-time intervals.
2. Recognize `-` in `AstBuilder` and negate parsed intervals.

### Why are the changes needed?
To conform to the SQL standard, which allows a sign before the interval string; see `5.3 <literal>`:
```
<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<sign> ::=
    <plus sign>
  | <minus sign>
```
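
After this change, for example, both of the following parse, the first with the newly recognized sign before the quoted string (illustrative, spark-shell style):
```scala
spark.sql("SELECT INTERVAL -'1-2' YEAR TO MONTH").show() // sign before the string (new)
spark.sql("SELECT INTERVAL '-1-2' YEAR TO MONTH").show() // sign inside the string (already supported)
```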

### Does this PR introduce _any_ user-facing change?
Should not, because it just extends the supported interval syntax.

### How was this patch tested?
By running new tests in `interval.sql`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32134 from MaxGekk/negative-parsed-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(commit: 26f312e)
The file was modified sql/core/src/test/resources/sql-tests/results/interval.sql.out (diff)
The file was modified sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/ansi/interval.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/interval.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
Commit caf33be2744efc43a0de9fa15dd541b5cc087617 by yamamuro
[SPARK-33411][SQL] Cardinality estimation of union, sort and range operator

### What changes were proposed in this pull request?
Supports cardinality estimation of the union, sort and range operators.

1. **Union**: the number of rows in the output is the sum of the numbers of rows of the union's children; the min and max for each column in the output are the min and max of that column across the children (see the sketch after this list).
Example:
Table 1
a   b
1   6
2   3
Table 2
a   b
1   3
4   1
Stats for table1 union table2 would be: number of rows = 4, columnStats = (a: {min: 1, max: 4}, b: {min: 1, max: 6})

2. **Sort**: row and column stats are the same as its child's.

3. **Range**: the number of output rows and the distinct count equal the number of elements; min and max are calculated from the start, end and step parameters.
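
A toy sketch of the union min/max merge described in item 1; the types and names are illustrative, not Spark's `ColumnStat` API:
```scala
case class MinMax(min: Long, max: Long)

// Merge per-column min/max stats of two union children: the output min/max is
// the min/max across both sides; row counts are simply summed by the caller.
// For simplicity this assumes both children expose the same column names.
def mergeUnionColumnStats(
    left: Map[String, MinMax],
    right: Map[String, MinMax]): Map[String, MinMax] =
  left.map { case (col, l) =>
    val r = right(col)
    col -> MinMax(math.min(l.min, r.min), math.max(l.max, r.max))
  }
```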

### Why are the changes needed?
The change will enhance the feature https://issues.apache.org/jira/browse/SPARK-16026 and will help in other stats-based optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New unit tests added.

Closes #30334 from ayushi-agarwal/SPARK-33411.

Lead-authored-by: ayushi agarwal <ayaga@microsoft.com>
Co-authored-by: ayushi-agarwal <36420535+ayushi-agarwal@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: caf33be)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/UnionEstimationSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala
Commit 6b8405b574084034d3610bf40ad5e7d481c65db1 by wenchen
[SPARK-28379][SQL] Allow non-aggregated single row correlated scalar subquery

### What changes were proposed in this pull request?
This PR allows a non-aggregated correlated scalar subquery if its max number of output rows is less than 2. Correlated scalar subqueries need to be aggregated because they are going to be decorrelated and rewritten as LEFT OUTER joins. If the correlated scalar subquery produces more than one output row, the rewrite will yield wrong results.

But this constraint can be relaxed when the subquery plan's max number of output rows is less than or equal to 1.

### Why are the changes needed?
To relax a constraint in CheckAnalysis for the correlated scalar subquery.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #32111 from allisonwang-db/spark-28379-aggregated.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 6b8405b)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/join.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part1.sql (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/udf/postgreSQL/udf-join.sql (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (diff)
Commit 5d126537d3931d01e42b69316476777d7f9b2e39 by ltnwgl
[MINOR][TESTS] Enhance the test instruction of ThriftServerQueryTestSuite

### What changes were proposed in this pull request?

Enhance the test instruction of ThriftServerQueryTestSuite:
1. how to run a single test case
2. how to regenerate the golden file for a single test

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, just enhance the comments.

Closes #32141 from gengliangwang/updateComment.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
(commit: 5d12653)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerQueryTestSuite.scala (diff)
Commit b34a84e21edcf55021feaf417a2f7276dc12b895 by wenchen
[SPARK-34212][SQL][FOLLOWUP] Move the added test to ParquetQuerySuite

### What changes were proposed in this pull request?

This PR moves the added test from `SQLQuerySuite` to `ParquetQuerySuite`.

### Why are the changes needed?
1. It can be tested by `ParquetV1QuerySuite` and `ParquetV2QuerySuite`.
2. Reduce the testing time of `SQLQuerySuite` (SQLQuerySuite ~ 3 min 17 sec, ParquetV1QuerySuite ~ 27 sec).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32090 from wangyum/SPARK-34212.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: b34a84e)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)