Changes

Summary

  1. [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in FileFormatWriter (commit: 3010e90) (details)
  2. [MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example (commit: 7766a6f) (details)
  3. [MINOR][DOCS] Fix the link to the pickle module page in RDD Programming Guide (commit: d2f328a) (details)
  4. [SPARK-33176][K8S] Use 11-jre-slim as default in K8s Dockerfile (commit: 20b7b92) (details)
  5. [SPARK-33109][BUILD][FOLLOW-UP] Remove the obsolete comment about bringing sbt-dependency-graph back (commit: ad99f14) (details)
  6. [SPARK-33175][K8S] Detect duplicated mountPaths and fail at Spark side (commit: 97605cd) (details)
  7. [SPARK-33177][SQL] CollectList and CollectSet should not be nullable (commit: ce49894) (details)
  8. [SPARK-32069][CORE][SQL] Improve error message on reading unexpected directory (commit: f8277d3) (details)
  9. [SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build (commit: e6c53c2) (details)
  10. [SPARK-33179][TESTS] Switch default Hadoop profile in run-tests.py (commit: 53783e7) (details)
  11. [SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py (commit: 388e067) (details)
Commit 3010e9044e068216d7a7a9ec510453ecbb159f95 by dhyun
[SPARK-33170][SQL] Add SQL config to control fast-fail behavior in
FileFormatWriter
### What changes were proposed in this pull request?
This patch proposes to add a config that controls the fast-fail behavior in
FileFormatWriter, set to false by default.
### Why are the changes needed?
In SPARK-29649, we catch `FileAlreadyExistsException` in
`FileFormatWriter` and fail fast for the task set to prevent task retry.
Following the latest discussion, it is important to be able to keep the
original behavior, which is to retry tasks even when
`FileAlreadyExistsException` is thrown, because the exception could be
recoverable in some cases. This patch adds a config that controls this
behavior, with fast-fail disabled by default.
### Does this PR introduce _any_ user-facing change?
Yes. By default, a task in FileFormatWriter will now retry even if
`FileAlreadyExistsException` is thrown, which was the behavior before
Spark 3.0. Users can restore fast-fail behavior by enabling the new config.
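As a hedged sketch of the user-facing knob (the exact config key is defined in `SQLConf`; `spark.sql.execution.fastFailOnFileFormatOutput` is my reading of this patch and should be verified there), restoring fast-fail would look like:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed config key added by this patch (verify against SQLConf).
# The default "false" keeps the pre-3.0 behavior of retrying the task;
# "true" fails the task set fast on FileAlreadyExistsException.
spark.conf.set("spark.sql.execution.fastFailOnFileFormatOutput", "true")
```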
### How was this patch tested?
Unit test.
Closes #30073 from viirya/SPARK-33170.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 3010e90)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (diff)
Commit 7766a6fb5f66c6b339909ae25d7f01769f580b18 by gurwls223
[MINOR][DOCS][EXAMPLE] Fix the Python manual_load_options_csv example
### What changes were proposed in this pull request?
This pull request changes the `sep` parameter's value from `:` to `;` in
the example in `examples/src/main/python/sql/datasource.py`. This code
snippet is shown in the Spark SQL Guide documentation. The `sep`
parameter's value should be `;` since the data in
https://github.com/apache/spark/blob/master/examples/src/main/resources/people.csv
is separated by `;`.
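For reference, the corrected snippet reads the sample file roughly like this (a sketch mirroring `manual_load_options_csv`; option values follow the example file):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# people.csv is delimited by ";", so sep must be ";" rather than ":".
df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=";", inferSchema="true", header="true")
df.show()
```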
### Why are the changes needed?
To fix the example code so that it can be executed properly.
### Does this PR introduce _any_ user-facing change?
Yes. This code snippet is shown in the Spark SQL Guide documentation:
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options
### How was this patch tested?
By building the documentation and checking the Spark SQL Guide
documentation manually in the local environment.
Closes #30082 from kjmrknsn/fix-example-python-datasource.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 7766a6f)
The file was modified examples/src/main/python/sql/datasource.py (diff)
Commit d2f328aba6f1d218425fe5d41bdec66dcaa33c85 by gurwls223
[MINOR][DOCS] Fix the link to the pickle module page in RDD Programming
Guide
### What changes were proposed in this pull request?
This pull request changes the link to the pickle module page from
https://docs.python.org/2/library/pickle.html to
https://docs.python.org/3/library/pickle.html in the RDD Programming Guide.
### Why are the changes needed?
Python 2 is no longer supported, so it is preferable to refer to the
pickle module page for Python 3.
### Does this PR introduce _any_ user-facing change?
Yes. Before: the `Pickle` link's destination page was
https://docs.python.org/2/library/pickle.html
After: the `Pickle` link's destination page is
https://docs.python.org/3/library/pickle.html
### How was this patch tested?
By building the documentation site and checking in the local environment
that the link's destination page changed correctly.
Closes #30081 from kjmrknsn/docs-fix-pickle-link.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: d2f328a)
The file was modified docs/rdd-programming-guide.md (diff)
Commit 20b7b923abc2266cf280b8623d6b5b9b277177ec by dhyun
[SPARK-33176][K8S] Use 11-jre-slim as default in K8s Dockerfile
### What changes were proposed in this pull request?
This PR aims to use `openjdk:11-jre-slim` as default in K8s Dockerfile.
### Why are the changes needed?
Although Apache Spark supports both Java8/Java11, there is a difference.
1. A Java8-built distribution can run on both Java8 and Java11.
2. A Java11-built distribution can run on Java11, but not on Java8.
In short, we had better use Java11 in the Dockerfile to embrace both
cases without any issues.
### Does this PR introduce _any_ user-facing change?
Yes. This will remove the chance of user frustration when users build
with JDK11 and build the image without overriding the Java base image.
### How was this patch tested?
Pass the K8s IT.
Closes #30083 from dongjoon-hyun/SPARK-33176.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 20b7b92)
The file was modified resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile (diff)
Commit ad99f14b4277616b681c91778eba4d9184f8eecf by dhyun
[SPARK-33109][BUILD][FOLLOW-UP] Remove the obsolete comment about
bringing sbt-dependency-graph back
### What changes were proposed in this pull request?
This PR proposes to remove an obsolete comment about adding the
`sbt-dependency-graph` back in SBT plugins.
### Why are the changes needed?
`sbt-dependency-graph` is now built into SBT as of 1.4.0; see
https://github.com/sbt/sbt/releases/tag/v1.4.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested `./build/sbt dependencyTree`.
Closes #30085 from HyukjinKwon/SPARK-33109.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: ad99f14)
The file was modified project/plugins.sbt (diff)
Commit 97605cd1269987ed5ba3013a5f8497375ce8913e by dhyun
[SPARK-33175][K8S] Detect duplicated mountPaths and fail at Spark side
### What changes were proposed in this pull request?
This PR aims to detect duplicate `mountPath`s and stop the job.
### Why are the changes needed?
If there is a conflict on `mountPath`, the pod is created, repeatedly
logs the following error message, and keeps running. The Spark job should
not keep running and waste cluster resources; we had better fail on the
Spark side.
```
$ k get pod -l 'spark-role in (driver,executor)'
NAME    READY   STATUS    RESTARTS   AGE
tpcds   1/1     Running   0          33m
```
```
20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: ...
Message: Pod "tpcds-exec-1" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/data1": must be unique.
...
```
**AFTER THIS PR**
The job will stop with the following error message instead of continuing
to run.
```
20/10/18 06:58:45 ERROR ExecutorPodsSnapshotsStoreImpl: Going to stop due to IllegalArgumentException
java.lang.IllegalArgumentException: requirement failed: Found duplicated mountPath: `/data1`
```
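The Spark-side check amounts to rejecting any repeated container path before the pod is submitted. Below is a minimal Python sketch of that idea; the real implementation lives in the Scala `MountVolumesFeatureStep`, and the function name here is hypothetical:
```python
def check_mount_paths(mount_paths):
    """Fail fast when two volumes target the same container mountPath."""
    seen = set()
    for path in mount_paths:
        if path in seen:
            # K8s would reject the pod anyway; failing here stops the job
            # instead of letting it spin and waste cluster resources.
            raise ValueError(f"Found duplicated mountPath: `{path}`")
        seen.add(path)

check_mount_paths(["/data1", "/opt/spark"])  # OK
try:
    check_mount_paths(["/data1", "/data1"])
except ValueError as e:
    print(e)  # Found duplicated mountPath: `/data1`
```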
### Does this PR introduce _any_ user-facing change?
Yes, but this is a bug fix.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes #30084 from dongjoon-hyun/SPARK-33175-2.
Lead-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by:
Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: 97605cd)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStepSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStep.scala (diff)
Commit ce498943d23e1660ba2b724e8831739f3b8a0bbf by gurwls223
[SPARK-33177][SQL] CollectList and CollectSet should not be nullable
### What changes were proposed in this pull request?
Mark `CollectList` and `CollectSet` as non-nullable.
### Why are the changes needed?
The `CollectList` and `CollectSet` SQL expressions never return a null
value. Marking them as non-nullable can have performance benefits,
because some optimizer rules apply only to non-nullable expressions.
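A quick sanity check of that claim in PySpark (a sketch, not a test from this patch): even over empty input, `collect_list` produces an empty array rather than NULL.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Global aggregation over an empty relation yields [] (an empty array),
# never NULL, which is what justifies marking the expression non-nullable.
empty = spark.range(10).where("id < 0")
empty.agg(F.collect_list("id").alias("ids")).show()
# +---+
# |ids|
# +---+
# | []|
# +---+
```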
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No existing tests on the nullability of aggregate functions were found.
Closes #30087 from tanelk/SPARK-33177_collect.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: ce49894)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala (diff)
Commit f8277d3aa308d267ff0423f85ffd884480cedf59 by dhyun
[SPARK-32069][CORE][SQL] Improve error message on reading unexpected
directory
### What changes were proposed in this pull request?
Improve the error message on reading an unexpected directory.
### Why are the changes needed?
To improve the error message on reading an unexpected directory.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #30027 from AngersZhuuuu/SPARK-32069.
Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: f8277d3)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSourceSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveTableScanSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala (diff)
Commit e6c53c2c1b538d6272df4d1ca294d04f8b49bd6c by gurwls223
[SPARK-33123][INFRA] Ignore GitHub only changes in Amplab Jenkins build
### What changes were proposed in this pull request?
This PR aims to ignore GitHub only changes in Amplab Jenkins build.
### Why are the changes needed?
This will save server resources.
### Does this PR introduce _any_ user-facing change?
No, this is a dev-only change.
### How was this patch tested?
Manually. I used the following doctest during testing and removed it
during clean-up.
E2E tests:
```
cd dev
cat test.py
```
```python
import importlib

runtests = importlib.import_module("run-tests")
print([x.name for x in runtests.determine_modules_for_files([".github/workflows/build_and_test.yml"])])
```
```bash
$ GITHUB_ACTIONS=1 python test.py
['root']
$ python test.py
[]
```
Unittests:
```bash
$ GITHUB_ACTIONS=1 python3 -m doctest dev/run-tests.py
$ python3 -m doctest dev/run-tests.py
```
Closes #30020 from williamhyun/SPARK-33123.
Lead-authored-by: William Hyun <williamhyun3@gmail.com> Co-authored-by:
Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: e6c53c2)
The file was modified dev/run-tests.py (diff)
Commit 53783e706dde943adee978a8eeee95a6f60687bd by gurwls223
[SPARK-33179][TESTS] Switch default Hadoop profile in run-tests.py
### What changes were proposed in this pull request?
This PR aims to switch the default Hadoop profile from `hadoop2.7` to
`hadoop3.2` in `dev/run-tests.py` when it's running in local or GitHub
Action environments.
### Why are the changes needed?
The default Hadoop version is 3.2. We had better be consistent.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
**BEFORE**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop2.7 and Hive profile hive2.3 under environment local
```
**AFTER**
```
% dev/run-tests.py
Cannot install SparkR as R was not found in PATH
[info] Using build tool sbt with Hadoop profile hadoop3.2 and Hive profile hive2.3 under environment local
```
Closes #30090 from williamhyun/SPARK-33179.
Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 53783e7)
The file was modified dev/run-tests.py (diff)
Commit 388e067a909516a9a509399fe17d79ce1fb54d31 by gurwls223
[SPARK-33139][SQL][FOLLOW-UP] Avoid using reflect call on session.py
### What changes were proposed in this pull request?
In [SPARK-33139](https://github.com/apache/spark/pull/30042), a
reflective `Class.forName` call was used in the Python code to invoke a
method on `SparkSession`, which is not recommended. This follow-up uses
`getattr` to access `SparkSession$.MODULE$` instead.
### Why are the changes needed?
Code refine.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #30092 from leanken/leanken-SPARK-33139-followup.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 388e067)
The file was modified python/pyspark/sql/session.py (diff)