Changes

Summary

  1. [SPARK-32667][SQL] Script transform 'default-serde' mode should pad null (commit: c75a827) (details)
  2. [SPARK-32674][DOC] Add suggestion for parallel directory listing in (commit: bf221de) (details)
  3. [SPARK-32646][SQL] ORC predicate pushdown should work with (commit: e277ef1) (details)
  4. [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test (commit: 6dd37cb) (details)
  5. [SPARK-32669][SQL][TEST] Expression unit tests should explore all cases (commit: 3dca81e) (details)
  6. [MINOR][DOCS] fix typos for docs, log messages and comments (commit: 1450b5e) (details)
  7. [SPARK-32662][ML] CountVectorizerModel: Remove requirement for minimum (commit: 1fd54f4) (details)
  8. [SPARK-32672][SQL] Fix data corruption in boolean bit set compression (commit: 12f4331) (details)
  9. [SPARK-31792][SS][DOC][FOLLOW-UP] Rephrase the description for some (commit: 8b26c69) (details)
  10. [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13 (commit: 25c7d0f) (details)
  11. [SPARK-32092][ML][PYSPARK] Fix parameters not being copied in (commit: d9eb06e) (details)
Commit c75a82794fc6a0f35697f8e1258562d43e860f68 by wenchen
[SPARK-32667][SQL] Script transform 'default-serde' mode should pad null
value to filling column
### What changes were proposed in this pull request?
In Hive's no-serde mode, when the script output has fewer columns than the
output specification, Hive pads the missing columns with null values; Spark
should do the same.
```
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > USING 'cat' as (a string, b string, c string, d string)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > FROM (
    >   select 1 as a, 2 as b
    > ) tmp;
OK
1    2    NULL    NULL
Time taken: 24.626 seconds, Fetched: 1 row(s)
```
### Why are the changes needed?
To keep the same behavior as Hive.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #29500 from AngersZhuuuu/SPARK-32667.
Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: c75a827)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala (diff)
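The null-padding behavior described in SPARK-32667 above can be sketched outside Spark as follows; `pad_row` is a hypothetical helper for illustration only, not Spark's actual implementation.

```python
def pad_row(values, num_output_columns):
    """Pad a script-transform output row with None (null) when the script
    emitted fewer columns than the output schema specifies."""
    missing = num_output_columns - len(values)
    if missing <= 0:
        return values
    return values + [None] * missing

# The Hive example in the description: the script emits (1, 2) but the
# output schema declares four columns (a, b, c, d).
print(pad_row(["1", "2"], 4))  # ['1', '2', None, None]
```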
Commit bf221debd02b11003b092201d0326302196e4ba5 by gurwls223
[SPARK-32674][DOC] Add suggestion for parallel directory listing in
tuning doc
### What changes were proposed in this pull request?
This adds some tuning guide for increasing parallelism of directory
listing.
### Why are the changes needed?
Sometimes when a job's input has a large number of directories, the listing
can become a bottleneck. There are a few parameters to tune this. This adds
some info to the Spark tuning guide so the knowledge is better shared.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #29498 from sunchao/SPARK-32674.
Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: bf221de)
The file was modified docs/sql-performance-tuning.md (diff)
The file was modified docs/tuning.md (diff)
Commit e277ef1a83e37bc94e7817467ca882d660c83284 by wenchen
[SPARK-32646][SQL] ORC predicate pushdown should work with
case-insensitive analysis
### What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive
analysis. The field names in pushed-down predicates don't need to match the
physical field names in ORC files in exact letter case if we enable
case-insensitive analysis.
### Why are the changes needed?
Currently ORC predicate pushdown doesn't work with case-insensitive
analysis. A predicate "a < 0" cannot be pushed down to an ORC file with
field name "A" under case-insensitive analysis.
But Parquet predicate pushdown works with this case. We should make ORC
predicate pushdown work with case-insensitive analysis too.
### Does this PR introduce _any_ user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate
pushdown will work.
### How was this patch tested?
Unit tests.
Closes #29457 from viirya/fix-orc-pushdown.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen
Fan <wenchen@databricks.com>
(commit: e277ef1)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala (diff)
The file was modified sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala (diff)
The file was modified sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala (diff)
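The case-insensitive field resolution that SPARK-32646 needs can be illustrated with a small hypothetical sketch (the function name and shape are illustrative, not Spark's API): predicate field names are matched against physical field names ignoring case, and ambiguous matches are skipped so the predicate simply isn't pushed down.

```python
def resolve_fields(predicate_names, physical_names, case_sensitive):
    """Map field names from pushed-down predicates to the physical field
    names in the file. With case-insensitive analysis, 'a' resolves to a
    physical field 'A'; ambiguous matches (e.g. both 'a' and 'A' present
    in the file) are skipped rather than resolved arbitrarily."""
    if case_sensitive:
        physical = set(physical_names)
        return {n: n for n in predicate_names if n in physical}
    by_lower = {}
    for name in physical_names:
        by_lower.setdefault(name.lower(), []).append(name)
    resolved = {}
    for n in predicate_names:
        matches = by_lower.get(n.lower(), [])
        if len(matches) == 1:  # only push down unambiguous matches
            resolved[n] = matches[0]
    return resolved

print(resolve_fields(["a"], ["A"], case_sensitive=False))  # {'a': 'A'}
```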
Commit 6dd37cbaacbad78a4bc52684ef5a6f2654e987df by yamamuro
[SPARK-32682][INFRA] Use workflow_dispatch to enable manual test
triggers
### What changes were proposed in this pull request?
This PR proposes to add a `workflow_dispatch` entry in the GitHub Actions
script (`build_and_test.yml`). This update enables developers to run the
Spark tests for a specific branch on their own local repository, so I think
it might help to check if all the tests can pass before opening a new PR.
<img width="944" alt="Screen Shot 2020-08-21 at 16 28 41"
src="https://user-images.githubusercontent.com/692303/90866249-96250c80-e3ce-11ea-8496-3dd6683e92ea.png">
### Why are the changes needed?
To reduce the pressure of GitHub Actions on the Spark repository.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes #29504 from maropu/DispatchTest.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: 6dd37cb)
The file was modified dev/run-tests.py (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 3dca81e4f5d51c81d6c183ddf762de011e4b9093 by yamamuro
[SPARK-32669][SQL][TEST] Expression unit tests should explore all cases
that can lead to null result
### What changes were proposed in this pull request?
Add document to `ExpressionEvalHelper`, and ask people to explore all
the cases that can lead to null results (including null in struct
fields, array elements and map values).
This PR also fixes `ComplexTypeSuite.GetArrayStructFields` to explore
all the null cases.
### Why are the changes needed?
It happened several times that we hit correctness bugs caused by wrong
expression nullability. When writing unit tests, we usually don't test
the nullability flag directly, and it's too late to add such tests for
all expressions.
In https://github.com/apache/spark/pull/22375, we extended the
expression test framework, which checks the nullability flag when the
expected result/field/element is null.
This requires the test cases to explore all the cases that can lead to
null results.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
I reverted
https://github.com/apache/spark/commit/5d296ed39e3dd79ddb10c68657e773adba40a5e0
locally, and `ComplexTypeSuite` can catch the bug.
Closes #29493 from cloud-fan/small.
Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: 3dca81e)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala (diff)
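The nullability check that SPARK-32669 relies on (added to the expression test framework in apache/spark#22375) can be illustrated with a hypothetical Python analogue; `Expr` and `check_evaluation` are toy stand-ins, not the Scala `ExpressionEvalHelper` itself.

```python
class Expr:
    """Minimal stand-in for a Catalyst expression: an eval function plus
    a declared nullability flag."""
    def __init__(self, eval_fn, nullable):
        self.eval_fn = eval_fn
        self.nullable = nullable

def check_evaluation(expr, inputs, expected):
    """Besides comparing the result, verify that a null result is only
    produced by an expression whose nullable flag is true -- wrong
    nullability is only caught when tests exercise null-producing cases,
    hence the PR's guidance to explore all of them."""
    result = expr.eval_fn(*inputs)
    assert result == expected, f"expected {expected}, got {result}"
    if expected is None:
        assert expr.nullable, "returned null but claims nullable=False"

# A division expression that is null on division by zero must be nullable.
div = Expr(lambda a, b: None if b == 0 else a // b, nullable=True)
check_evaluation(div, (6, 3), 2)
check_evaluation(div, (6, 0), None)  # the null case must be covered too
```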
Commit 1450b5e095c4dde4eb38d6237e54d6bfa96955e2 by yamamuro
[MINOR][DOCS] fix typos for docs, log messages and comments
### What changes were proposed in this pull request?
Fix typos in docs, log messages and comments
### Why are the changes needed?
Typo fixes to increase readability
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual testing has been performed on the updated content
Closes #29443 from brandonJY/spell-fix-doc.
Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: 1450b5e)
The file was modified docs/sql-ref-syntax-qry-select-hints.md (diff)
The file was modified sbin/decommission-worker.sh (diff)
The file was modified docs/sql-ref-syntax-qry-select-groupby.md (diff)
The file was modified core/src/main/scala/org/apache/spark/resource/ResourceDiscoveryScriptPlugin.scala (diff)
The file was modified docs/sql-ref.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/QueryPlanningTracker.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablePropertiesExec.scala (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java (diff)
The file was modified launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCatalog.java (diff)
The file was modified core/src/main/java/org/apache/spark/api/plugin/DriverPlugin.java (diff)
The file was modified docs/job-scheduling.md (diff)
Commit 1fd54f4bf58342c067adfa28f0705a4efef5e60a by huaxing
[SPARK-32662][ML] CountVectorizerModel: Remove requirement for minimum
Vocab size
### What changes were proposed in this pull request?
The strict requirement for the vocabulary to remain non-empty has been
removed in this pull request.
Link to the discussion:
http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html
### Why are the changes needed?
This smooths running it across corner cases. Without this, the user has to
manipulate the data in what may be a perfectly valid use case.
Question: should we instead log a warning when an empty vocabulary is
found?
### Does this PR introduce _any_ user-facing change?
Maybe a slight change: if someone has put a try-catch in place to detect an
empty vocab, that behavior would no longer hold.
### How was this patch tested?
1. Added a test case for `fit` generating an empty vocabulary
2. Added a test case for `transform` with an empty vocabulary
Request to review: srowen hhbyyh
Closes #29482 from purijatin/spark_32662.
Authored-by: Jatin Puri <purijatin@gmail.com> Signed-off-by: Huaxin Gao
<huaxing@us.ibm.com>
(commit: 1fd54f4)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala (diff)
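The relaxed requirement from SPARK-32662 can be illustrated with a toy CountVectorizer-style fit in plain Python; `fit_vocabulary` and `transform` are hypothetical sketches, not the Spark ML implementation.

```python
from collections import Counter

def fit_vocabulary(docs, min_df=1):
    """Toy fit: keep terms that appear in at least min_df documents.
    The change described above means an empty result is returned as-is
    instead of failing a non-empty-vocabulary requirement."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency, not term frequency
    return sorted(t for t, c in df.items() if c >= min_df)

def transform(doc, vocab):
    """Count occurrences of vocabulary terms; an empty vocabulary simply
    yields an empty vector instead of raising."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]

print(fit_vocabulary([["a"], ["b"]], min_df=3))  # [] -- empty vocabulary
print(transform(["a", "a"], []))                 # [] -- no error
```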
Commit 12f4331b9eb563cb0cfbf6a241d1d085ca4f7676 by gurwls223
[SPARK-32672][SQL] Fix data corruption in boolean bit set compression
### What changes were proposed in this pull request?
This fixes SPARK-32672, a data corruption issue. Essentially, the
BooleanBitSet CompressionScheme would miss nulls at the end of a
CompressedBatch; the values would then default to false.
### Why are the changes needed? It fixes data corruption
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? I manually tested it against the original
issue that was producing errors for me. I also added a unit test.
Closes #29506 from revans2/SPARK-32672.
Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 12f4331)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/compression/BooleanBitSetSuite.scala (diff)
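The failure mode behind SPARK-32672 can be sketched in plain Python: booleans are packed into 64-bit words with nulls tracked in a separate mask, and the decoder must walk all rows, including the trailing partial word. This is a hypothetical illustration of the bit-set idea, not the actual Spark columnar code.

```python
def pack_bools(values):
    """Pack a list of Optional[bool] into 64-bit data words plus a null
    mask. None is stored as False in the data and flagged in the mask."""
    data, nulls = [], []
    for i, v in enumerate(values):
        word, bit = divmod(i, 64)
        while len(data) <= word:
            data.append(0)
            nulls.append(0)
        if v is None:
            nulls[word] |= 1 << bit
        elif v:
            data[word] |= 1 << bit
    return data, nulls, len(values)

def unpack_bools(data, nulls, count):
    """Decode all `count` entries. The corruption described above amounts
    to not consulting the null mask for the final partial word, turning
    trailing nulls into False."""
    out = []
    for i in range(count):  # iterate over count, not just full words
        word, bit = divmod(i, 64)
        if nulls[word] >> bit & 1:
            out.append(None)
        else:
            out.append(bool(data[word] >> bit & 1))
    return out

# Nulls past the last full 64-value word -- the case that was corrupted.
vals = [True, False] * 40 + [None, None, None]
assert unpack_bools(*pack_bools(vals)) == vals
```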
Commit 8b26c69ce7f9077775a3c7bbabb1c47ee6a51a23 by kabhwan.opensource
[SPARK-31792][SS][DOC][FOLLOW-UP] Rephrase the description for some
operations
### What changes were proposed in this pull request? Rephrase the
description for some operations to make it clearer.
### Why are the changes needed? Add more detail in the document.
### Does this PR introduce _any_ user-facing change? No, document only.
### How was this patch tested? Document only.
Closes #29269 from xuanyuanking/SPARK-31792-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 8b26c69)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (diff)
The file was modified docs/web-ui.md (diff)
Commit 25c7d0fe6ae20a4c1c42e0cd0b448c08ab03f3fb by srowen
[SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
### What changes were proposed in this pull request? The purpose of this
PR is to resolve
[SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526); all
remaining failed cases are fixed.
The main changes of this PR are as follows:
- Change `ExecutorAllocationManager.scala` for core module compilation in
Scala 2.13; it's a blocking problem
- Change `Seq[_]` to `scala.collection.Seq[_]` in the failed cases
- Added a different expected plan for `Test 4: Star with several branches`
of StarJoinCostBasedReorderSuite for Scala 2.13, because the candidate
plans:
```
Join Inner, (d1_pk#5 = f1_fk1#0)
:- Join Inner, (f1_fk2#1 = d2_pk#8)
:  :- Join Inner, (f1_fk3#2 = d3_pk#11)
```
and
```
Join Inner, (f1_fk2#1 = d2_pk#8)
:- Join Inner, (d1_pk#5 = f1_fk1#0)
:  :- Join Inner, (f1_fk3#2 = d3_pk#11)
```
have the same cost `Cost(200,9200)`, but `HashMap` was rewritten in Scala
2.13 and the iteration order leads to different results.
This PR fixes test cases as follows:
- LiteralExpressionSuite (1 FAILED -> PASS)
- StarJoinCostBasedReorderSuite (1 FAILED -> PASS)
- ObjectExpressionsSuite (2 FAILED -> PASS)
- ScalaReflectionSuite (1 FAILED -> PASS)
- RowEncoderSuite (10 FAILED -> PASS)
- ExpressionEncoderSuite (ABORTED -> PASS)
### Why are the changes needed? We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/catalyst -Pscala-2.13 -am
mvn test -pl sql/catalyst -Pscala-2.13
```
**Before**
```
Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0
*** 1 SUITE ABORTED ***
*** 15 TESTS FAILED ***
```
**After**
```
Tests: succeeded 4338, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
```
Closes #29434 from LuciferYang/sql-catalyst-tests.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: 25c7d0f)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/StarJoinCostBasedReorderSuite.scala (diff)
Commit d9eb06ea37cab185f1e49c641313be9707270252 by srowen
[SPARK-32092][ML][PYSPARK] Fix parameters not being copied in
CrossValidatorModel.copy(), read() and write()
### What changes were proposed in this pull request?
Changed the definitions of
`CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed
parameters (i.e. parameters with `get()` methods) are copied in these
methods.
### Why are the changes needed?
Parameters are copied in the respective Scala interface for
`CrossValidatorModel.copy()`. It fits the semantics to persist parameters
when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()`,
so that the user gets the same model back after saving and loading it. Not
copying `numFolds` also causes bugs like array index out of bounds and
losing sub-models, because this parameter will always default to 3 (as
described in the JIRA ticket).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated
so that they check parameters before and after function calls.
Closes #29445 from Louiszr/master.
Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Sean Owen
<srowen@gmail.com>
(commit: d9eb06e)
The file was modified python/pyspark/ml/tuning.py (diff)
The file was modified python/pyspark/ml/tests/test_tuning.py (diff)
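Why `copy()` must forward exposed params can be shown with a tiny hypothetical sketch; `Params` and `CVModelSketch` are toy stand-ins for the pyspark.ml classes, not the real API.

```python
class Params:
    """Tiny stand-in for a pyspark.ml Params holder."""
    def __init__(self, **params):
        self._params = dict(params)

    def get(self, name):
        return self._params[name]

class CVModelSketch(Params):
    """Hypothetical model whose copy() forwards its exposed params
    (e.g. numFolds) instead of letting them reset to defaults."""
    def __init__(self, sub_models, **params):
        super().__init__(**params)
        self.sub_models = sub_models

    def copy(self):
        # The bug described above: building the copy without **self._params
        # would drop numFolds, which then silently falls back to a default
        # (3 in pyspark.ml), truncating sub-model indexing later.
        return CVModelSketch(list(self.sub_models), **self._params)

m = CVModelSketch(sub_models=["m0", "m1", "m2", "m3", "m4"], numFolds=5)
assert m.copy().get("numFolds") == 5  # survives the copy
```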