Changes

Summary

  1. [MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly (commit: eaaf783)
  2. [SPARK-32592][SQL] Make DataFrameReader.table take the specified options (commit: 806140d)
  3. [SPARK-32721][SQL] Simplify if clauses with null and boolean (commit: 1453a09)
  4. [MINOR][R] Fix a R style in try and finally at DataFrame.R (commit: 701e593)
  5. [SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark (commit: 86ca90c)
  6. [SPARK-32659][SQL][FOLLOWUP] Improve test for pruning DPP on non-atomic type (commit: a701bc7)
  7. [SPARK-32721][SQL][FOLLOWUP] Simplify if clauses with null and boolean (commit: 94d313b)
  8. [SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the main page in migration guide (commit: d80c85c)
  9. [SPARK-32624][SQL][FOLLOWUP] Fix regression in CodegenContext.addReferenceObj on nested Scala types (commit: 6e5bc39)
  10. [SPARK-32757][SQL] Physical InSubqueryExec should be consistent with logical InSubquery (commit: d2a5dad)
  11. [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder (commit: e1dbc85)
  12. [SPARK-32757][SQL][FOLLOW-UP] Use child's output for canonicalization in SubqueryBroadcastExec (commit: fea9360)
  13. [SPARK-32761][SQL] Allow aggregating multiple foldable distinct expressions (commit: a410658)
  14. [SPARK-32754][SQL][TEST] Unify to `assertEqualJoinPlans` for join reorder suites (commit: 2a88a20)
Commit eaaf783148547abc631b2a42227e5b88547c295d by gurwls223
[MINOR][DOCS] Fix the Binder link to point the quickstart notebook
correctly
### What changes were proposed in this pull request?
This PR fixes the Binder link in the quickstart notebook and
documentation.
From:
https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
To:
https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
This link is the same as the one in RST files:
https://github.com/apache/spark/blob/b54103016a01d77f074295db82f96087f45cef4f/python/docs/source/conf.py#L57
### Why are the changes needed?
The link was wrong and pointed to a non-existent file and repository.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the link so users can correctly try Binder.
### How was this patch tested?
Manually tested by building the documentation.
Closes #29597 from HyukjinKwon/minor-link-quickstart.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: eaaf783)
The file was modified python/docs/source/getting_started/quickstart.ipynb (diff)
Commit 806140de400b11eaf82510034d70a52e5fab3e8e by wenchen
[SPARK-32592][SQL] Make DataFrameReader.table take the specified options
### What changes were proposed in this pull request?
Pass the options specified in `DataFrameReader.table` to
`JDBCTableCatalog.loadTable`.
### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options; options
set like the following are lost:
```scala
val df = spark.read
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "3")
  .option("numPartitions", "2")
  .table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.
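As a hedged sanity check (my addition, not from the PR): once the fix forwards these options, their effect should be visible in the partitioning of the `df` from the example above.

```scala
// With "numPartitions" = "2" honored by the JDBC table load, the scan
// should produce two partitions (assumes the fixed code path).
assert(df.rdd.getNumPartitions == 2)
```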
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested for now. A test will be added after the V2 JDBC read is
implemented.
Closes #29535 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 806140d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2DataFrameSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/explain.sql.out (diff)
Commit 1453a09a635c66cee9ac55f87ab4cf87876f959f by d_tsai
[SPARK-32721][SQL] Simplify if clauses with null and boolean
### What changes were proposed in this pull request?
The following if clause:
```sql
if(p, null, false)
```
can be simplified to:
```sql
and(p, null)
```
Similarly, the clause:
```sql
if(p, null, true)
```
can be simplified to:
```sql
or(not(p), null)
```
iff the predicate `p` is non-nullable, i.e., it can be evaluated to either
true or false, but never null.
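A minimal standalone sketch of these two rewrites, assuming Catalyst's `If`, `And`, `Or`, `Not`, and `Literal` expression classes (the real rule lives in Catalyst's `optimizer/expressions.scala`, per the diff below, and is more general; this object is illustrative only):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.BooleanType

// Illustrative sketch, not the real optimizer rule.
object IfToBooleanSketch {
  def simplify(e: Expression): Expression = e match {
    // if(p, null, false) ==> and(p, null), valid only when p is non-nullable
    case If(p, Literal(null, BooleanType), Literal(false, BooleanType)) if !p.nullable =>
      And(p, Literal(null, BooleanType))
    // if(p, null, true) ==> or(not(p), null), same non-nullability caveat
    case If(p, Literal(null, BooleanType), Literal(true, BooleanType)) if !p.nullable =>
      Or(Not(p), Literal(null, BooleanType))
    case other => other
  }
}
```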
### Why are the changes needed?
Converting `if` to `or`/`and` clauses enables better filter pushdown.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29567 from sunchao/SPARK-32721.
Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai
<d_tsai@apple.com>
(commit: 1453a09)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/BooleanSimplificationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
Commit 701e5934140e477426cf2a68f2c5746be67418d6 by gurwls223
[MINOR][R] Fix a R style in try and finally at DataFrame.R
Fix an R style issue that is not caught by the R style checker. The error:
```
R/DataFrame.R:1244:17: style: Closing curly-braces should always be
on their own line, unless it's followed by an else.
}, finally = {
^
lintr checks failed.
```
Closes #29574 from lu-wang-dl/fix-r-style.
Lead-authored-by: Lu WANG <lu.wang@databricks.com> Co-authored-by: Lu
Wang <38018689+lu-wang-dl@users.noreply.github.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 701e593)
The file was modified R/pkg/R/DataFrame.R (diff)
Commit 86ca90ccd7ea28e82562ab9286e19bb9c7a0fa5e by gurwls223
[SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark
### What changes were proposed in this pull request?
This PR proposes to document PySpark-specific contribution guides in the
"Development" section.
Here is a demo for quicker review:
https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html
### Why are the changes needed?
To give PySpark users a single place to look, and to improve the documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new documentation. See the demo linked above.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes #29596 from HyukjinKwon/SPARK-32190.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 86ca90c)
The file was modified python/docs/source/development/index.rst (diff)
The file was added python/docs/source/development/contributing.rst
Commit a701bc79e3bb936e8fb4a4fbe11e9e0bd0ccd8ac by wenchen
[SPARK-32659][SQL][FOLLOWUP] Improve test for pruning DPP on non-atomic
type
### What changes were proposed in this pull request?
Improve the test for pruning DPP on non-atomic types:
- Avoid creating new partitioned tables, which may take 30 seconds.
- Add a test for the `array` type.
### Why are the changes needed?
Improve test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #29595 from wangyum/SPARK-32659-test.
Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: a701bc7)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
Commit 94d313b061b0aee7f335696a364c73adb149da8b by d_tsai
[SPARK-32721][SQL][FOLLOWUP] Simplify if clauses with null and boolean
### What changes were proposed in this pull request?
This is a follow-up on SPARK-32721 and PR #29567. In the previous PR we
missed two more cases that can be optimized:
```
if(p, false, null) ==> and(not(p), null)
if(p, true, null)  ==> or(p, null)
```
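In the same illustrative style as the sketch under the original SPARK-32721 entry above (assumed names, keeping that entry's non-nullability guard; not the real rule):

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.BooleanType

// Follow-up cases: here the null literal sits in the else branch.
object IfToBooleanFollowUpSketch {
  def simplify(e: Expression): Expression = e match {
    // if(p, false, null) ==> and(not(p), null), for non-nullable p
    case If(p, Literal(false, BooleanType), Literal(null, BooleanType)) if !p.nullable =>
      And(Not(p), Literal(null, BooleanType))
    // if(p, true, null) ==> or(p, null), for non-nullable p
    case If(p, Literal(true, BooleanType), Literal(null, BooleanType)) if !p.nullable =>
      Or(p, Literal(null, BooleanType))
    case other => other
  }
}
```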
### Why are the changes needed?
By transforming if to boolean conjunctions or disjunctions, we can
enable more filter pushdown to datasources.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests.
Closes #29603 from sunchao/SPARK-32721-2.
Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai
<d_tsai@apple.com>
(commit: 94d313b)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/SimplifyConditionalSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
Commit d80c85c2e3bc1105ddae9d2c2aeaf8aa92064560 by gurwls223
[SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the
main page in migration guide
### What changes were proposed in this pull request?
This PR is a minor follow-up to:
1. Slightly reword the main page.
2. Fix the indentation of the table in the migration guide;
    from
    ![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png)
    to
    ![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png)
### Why are the changes needed?
To make the migration guide render nicely.
### Does this PR introduce _any_ user-facing change?
Yes, this is a change to user-facing documentation.
### How was this patch tested?
Manually built the documentation.
Closes #29606 from HyukjinKwon/SPARK-32191.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: d80c85c)
The file was modified python/docs/source/migration_guide/index.rst (diff)
The file was modified python/docs/source/migration_guide/pyspark_2.4_to_3.0.rst (diff)
Commit 6e5bc39e17d4cf02806761170de6ddeb634aa343 by gurwls223
[SPARK-32624][SQL][FOLLOWUP] Fix regression in
CodegenContext.addReferenceObj on nested Scala types
### What changes were proposed in this pull request?
Use `CodeGenerator.typeName()` instead of `Class.getCanonicalName()` in
`CodegenContext.addReferenceObj()` for getting the runtime class name
for an object.
### Why are the changes needed?
https://github.com/apache/spark/pull/29439 fixed a bug in
`CodegenContext.addReferenceObj()` for `Array[Byte]` (i.e. Spark SQL's
`BinaryType`) objects, but unfortunately it introduced a regression for
some nested Scala types.
For example, for `implicitly[Ordering[UTF8String]]`, after that PR
`CodegenContext.addReferenceObj()` would return `((null) references[0]
/* ... */)`. The actual type for `implicitly[Ordering[UTF8String]]` is
`scala.math.LowPriorityOrderingImplicits$$anon$3` in Scala 2.12.10, and
`Class.getCanonicalName()` returns `null` for that class.
On the other hand, `Class.getName()` is safe to use for all non-array
types, and Janino will happily accept the type name returned from
`Class.getName()` for nested types. `CodeGenerator.typeName()` happens
to do the right thing by correctly handling arrays and otherwise use
`Class.getName()`. So it's a better alternative than
`Class.getCanonicalName()`.
Side note: rule of thumb for using Java reflection in Spark: it may be
tempting to use `Class.getCanonicalName()`, but for functions that may
need to handle Scala types, please avoid it due to potential issues with
nested Scala types. Instead, use `Class.getName()` or utility functions
in `org.apache.spark.util.Utils` (e.g. `Utils.getSimpleName()` or
`Utils.getFormattedClassName()` etc).
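A tiny JVM-level demonstration of the pitfall described above (standalone, not Spark code; the printed class name in the comment is indicative):

```scala
object CanonicalNameDemo {
  def main(args: Array[String]): Unit = {
    // Anonymous classes have no canonical name on the JVM, so
    // getCanonicalName returns null -- the same failure mode hit by
    // implicitly[Ordering[UTF8String]] above.
    val anon = new Runnable { def run(): Unit = () }
    println(anon.getClass.getCanonicalName) // prints: null
    println(anon.getClass.getName)          // e.g. CanonicalNameDemo$$anon$1
  }
}
```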
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new unit test case for the regression case in
`CodeGenerationSuite`.
Closes #29602 from rednaxelafx/spark-32624-followup.
Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 6e5bc39)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala (diff)
Commit d2a5dad97c0cd9c2b7eede72a0a8df268714155f by wenchen
[SPARK-32757][SQL] Physical InSubqueryExec should be consistent with
logical InSubquery
### What changes were proposed in this pull request?
`InSubquery` can be in either single-column mode or multi-column mode,
depending on the output length of the subquery. In multi-column mode, the
length of the input `values` must match the subquery output length.
However, `InSubqueryExec` doesn't follow this and is always executed in
single-column mode. That is OK as it's only used by DPP, which looks up
one key per `InSubqueryExec`, so multi-column mode is not needed there.
But it's better to make the physical and logical nodes consistent.
This PR updates `InSubqueryExec` to support multi-column mode, and also
fixes `SubqueryBroadcastExec` to report its output correctly.
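To make the two modes concrete, a runnable sketch (the temp views and data are my own, purely illustrative):

```scala
import org.apache.spark.sql.SparkSession

object InSubqueryModesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._
    Seq((1, "a"), (2, "b")).toDF("a", "b").createOrReplaceTempView("t")
    Seq((1, "a")).toDF("x", "y").createOrReplaceTempView("s")
    // Single-column mode: one value matched against a one-column subquery.
    spark.sql("SELECT * FROM t WHERE a IN (SELECT x FROM s)").show()
    // Multi-column mode: the value list's width must match the subquery output.
    spark.sql("SELECT * FROM t WHERE (a, b) IN (SELECT x, y FROM s)").show()
    spark.stop()
  }
}
```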
### Why are the changes needed?
Fix a potential bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29601 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: d2a5dad)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala (diff)
Commit e1dbc85c72d5d3f2fe7f2480e33b44e2f2e3b28a by wenchen
[SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
### What changes were proposed in this pull request?
Add `JDBCScan`, `JDBCScanBuilder`, and `JDBCWriteBuilder` to the Data
Source V2 JDBC implementation.
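A hedged sketch of how these classes get exercised; the catalog name `h2` and the connection settings are assumptions modeled on the new `JDBCV2Suite`, and `spark` is an active `SparkSession`:

```scala
// Register a v2 JDBC catalog backed by JDBCTableCatalog; reads through it
// are planned via the JDBCScanBuilder/JDBCScan added in this PR.
spark.conf.set("spark.sql.catalog.h2",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")
val people = spark.read.table("h2.test.people") // assumed table name
```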
### Why are the changes needed?
To complete the Data Source V2 JDBC implementation.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
New tests.
Closes #29396 from huaxingao/v2jdbc.
Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: e1dbc85)
The file was added sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCV2Suite.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCWriteBuilder.scala
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCScan.scala
Commit fea9360ae70a6715655e6673e13f32959e02368b by wenchen
[SPARK-32757][SQL][FOLLOW-UP] Use child's output for canonicalization in
SubqueryBroadcastExec
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/29601, to fix
a small mistake in `SubqueryBroadcastExec`.
`SubqueryBroadcastExec.doCanonicalize` should canonicalize the build
keys with the query output, not the `SubqueryBroadcastExec.output`.
### Why are the changes needed?
To fix the canonicalization mistake described above.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes #29610 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: fea9360)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala (diff)
Commit a410658c9bc244e325702dc926075bd835b669ff by wenchen
[SPARK-32761][SQL] Allow aggregating multiple foldable distinct
expressions
### What changes were proposed in this pull request?
For queries with multiple foldable distinct columns, since they will be
eliminated during execution, it is not mandatory to let
`RewriteDistinctAggregates` handle this case. In the current code,
`RewriteDistinctAggregates` *does* miss some "aggregating with multiple
foldable distinct expressions" cases. For example,
`select count(distinct 2), count(distinct 2, 3)` will be missed.
But in the planner, this triggers an error that "multiple distinct
expressions" are not allowed. As the foldable distinct columns are
eventually eliminated, we can allow this case in the aggregation planner
check.
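The failing shape as a runnable snippet (the session setup is mine; the query is the one from the description above):

```scala
import org.apache.spark.sql.SparkSession

object FoldableDistinctDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    // Multiple *foldable* distinct expressions: before this fix, the planner
    // rejected this with a "multiple distinct expressions" error even though
    // the foldable distincts are eliminated during execution.
    spark.sql("SELECT count(DISTINCT 2), count(DISTINCT 2, 3) FROM range(10)").show()
    spark.stop()
  }
}
```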
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a test case.
Closes #29607 from linhongliu-db/SPARK-32761.
Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: a410658)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (diff)
Commit 2a88a202719839f96838307fc9a8b4dc9bfd0c34 by dongjoon
[SPARK-32754][SQL][TEST] Unify to `assertEqualJoinPlans` for join
reorder suites
### What changes were proposed in this pull request?
Currently, the three join reorder suites (`JoinReorderSuite`,
`StarJoinReorderSuite`, and `StarJoinCostBasedReorderSuite`) all contain
an `assertEqualPlans` method with almost identical logic. We can extract
the method to a single place for code simplicity.
### Why are the changes needed?
To reduce code redundancy.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Covered by existing tests.
Closes #29594 from wzhfy/unify_assertEqualPlans_joinReorder.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 2a88a20)
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinReorderSuite.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala (diff)
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/JoinReorderSuite.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/StarJoinCostBasedReorderSuite.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/StarJoinReorderSuite.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/JoinReorderPlanTestBase.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/joinReorder/JoinReorderSuite.scala
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala