Changes

Summary

  1. [DOC][MINOR] pySpark usage - removed repeated keyword causing confusion (commit: 4a47b3e) (details)
  2. [SPARK-33096][K8S] Use LinkedHashMap instead of Map for (commit: 4987db8) (details)
  3. [SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS (commit: c5f6af9) (details)
  4. [SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString (commit: a907729) (details)
  5. [SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and (commit: 3beab8d) (details)
  6. [SPARK-33101][ML] Make LibSVM format propagate Hadoop config from DS (commit: 1234c66) (details)
  7. [SPARK-33099][K8S] Respect executor idle timeout conf in (commit: e1909c9) (details)
  8. [SPARK-32896][SS] Add DataStreamWriter.table API (commit: edb140e) (details)
  9. [SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 (commit: 2e07ed3) (details)
Commit 4a47b3e1103170eacf2fb910864c6db22a9a37e6 by srowen
[DOC][MINOR] pySpark usage - removed repeated keyword causing confusion
### What changes were proposed in this pull request? While explaining
pySpark usage, the use of repeated synonymous words was causing confusion.
Removed the redundant phrase "instead of a JAR" to keep the text more readable.
### Why are the changes needed? To keep the docs more readable and easy
to understand.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? No code changes, minor documentation
change only. No tests added.
Closes #29956 from manubatham20/patch-1.
Authored-by: manubatham20 <manubatham2006@gmail.com> Signed-off-by: Sean
Owen <srowen@gmail.com>
(commit: 4a47b3e)
The file was modified docs/submitting-applications.md (diff)
Commit 4987db8c88b49a0c0d8503b6291455e92e114efa by dhyun
[SPARK-33096][K8S] Use LinkedHashMap instead of Map for
newlyCreatedExecutors
### What changes were proposed in this pull request?
This PR aims to use `LinkedHashMap` instead of `Map` for
`newlyCreatedExecutors`.
### Why are the changes needed?
This makes log messages (INFO/DEBUG) more readable. This is helpful when
`spark.kubernetes.allocation.batch.size` is large and especially when
K8s dynamic allocation is used.
**BEFORE**
```
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:24:21 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (5,10,6,9,2,7,3,8,4).
```
**AFTER**
```
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 2 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 3 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 4 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 5 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 6 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 7 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 8 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 9 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 DEBUG ExecutorPodsAllocator: Executor with id 10 was not found in the Kubernetes cluster since it was created 0 milliseconds ago.
20/10/08 10:25:17 INFO ExecutorPodsAllocator: Deleting 9 excess pod requests (2,3,4,5,6,7,8,9,10).
```
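To illustrate the ordering difference, here is a minimal, hypothetical sketch (not the `ExecutorPodsAllocator` code itself): the default mutable `HashMap` iterates in an arbitrary order, while `LinkedHashMap` preserves insertion order, which is what makes the log lines above come out sorted by executor id.
```scala
import scala.collection.mutable

// Hypothetical illustration only: compare iteration order of the two map types.
val unordered = mutable.HashMap.empty[Long, String]       // arbitrary iteration order
val ordered   = mutable.LinkedHashMap.empty[Long, String] // insertion (creation) order
(1L to 10L).foreach { id =>
  unordered(id) = s"pod-$id"
  ordered(id) = s"pod-$id"
}
println(unordered.keys.mkString(","))  // e.g. 5,10,6,9,2,7,3,8,4 (order not guaranteed)
println(ordered.keys.mkString(","))    // 1,2,3,4,5,6,7,8,9,10
```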
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI or `build/sbt -Pkubernetes "kubernetes/test"`
Closes #29979 from dongjoon-hyun/SPARK-K8S-LOG.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 4987db8)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff)
Commit c5f6af9f17498bb0ec393c16616f2d99e5d3ee3d by dhyun
[SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS
options to underlying HDFS file system
### What changes were proposed in this pull request? Propagate ORC
options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC
datasource.
### Why are the changes needed? There is a bug that when running:
```scala
spark.read.format("orc").options(conf).load(path)
```
the underlying file system will not receive the `conf` options.
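As a hedged usage sketch (the keys below are illustrative and assume an existing `spark` session; they are not taken from this PR), datasource options can now carry Hadoop file-system settings down to the underlying file system:
```scala
// Hypothetical example: Hadoop FS settings passed as ORC datasource options
// should now reach the underlying file system (keys shown are illustrative).
val conf = Map(
  "fs.s3a.access.key" -> "<access-key>",
  "fs.s3a.secret.key" -> "<secret-key>")
val df = spark.read.format("orc").options(conf).load("s3a://bucket/path")
```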
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? Added UT to `OrcSourceSuite`.
Closes #29976 from MaxGekk/orc-option-propagation.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: c5f6af9)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
Commit a9077299d769bc9569a15f6500754661111fe9ab by yamamuro
[SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString
### What changes were proposed in this pull request?
Add distinct info at `UnresolvedFunction.toString`.
### Why are the changes needed?
Make `UnresolvedFunction` info complete.
```
create table test (c1 int, c2 int);
explain extended select sum(distinct c1) from test;

-- before this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum('c1), None)]
+- 'UnresolvedRelation [test]

-- after this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum(distinct 'c1), None)]
+- 'UnresolvedRelation [test]
```
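A minimal sketch of the idea (simplified and hypothetical, not the exact Spark source): when the parsed function call is marked distinct, include that flag in the string representation.
```scala
// Simplified, hypothetical sketch of how a distinct flag can surface in toString.
case class UnresolvedFunctionSketch(name: String, arguments: Seq[String], isDistinct: Boolean) {
  override def toString: String = {
    val distinct = if (isDistinct) "distinct " else ""
    s"'$name($distinct${arguments.mkString(", ")})"
  }
}

println(UnresolvedFunctionSketch("sum", Seq("'c1"), isDistinct = true))
// 'sum(distinct 'c1)
```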
### Does this PR introduce _any_ user-facing change?
Yes, the distinct info is now shown in the parsed plan.
### How was this patch tested?
Manual test.
Closes #29586 from ulysses-you/SPARK-32743.
Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: a907729)
The file was modified sql/core/src/test/resources/sql-tests/results/explain-aqe.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/explain-aqe.sql (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/explain.sql.out (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/explain.sql (diff)
Commit 3beab8d8a8e2ed5e46e063d5a44face40c5fac90 by gurwls223
[SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and
SparkR
### What changes were proposed in this pull request?
- Annotated return types of `assert_true` and `raise_error` as discussed
[here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Added `assert_true` and `raise_error` to the SparkR NAMESPACE.
- Validated the message vector size in SparkR as discussed
[here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).
### Why are the changes needed?
As discussed in review for #29947.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Existing tests.
- Validation of annotations using MyPy
Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 3beab8d)
The file was modified python/pyspark/sql/functions.pyi (diff)
The file was modified R/pkg/R/functions.R (diff)
The file was modified R/pkg/NAMESPACE (diff)
Commit 1234c66fa6b6d2c45edb40237788fa3bfdf96cf3 by dhyun
[SPARK-33101][ML] Make LibSVM format propagate Hadoop config from DS
options to underlying HDFS file system
### What changes were proposed in this pull request? Propagate LibSVM
options to Hadoop configs in the LibSVM datasource.
### Why are the changes needed? There is a bug that when running:
```scala
spark.read.format("libsvm").options(conf).load(path)
```
the underlying file system will not receive the `conf` options.
### Does this PR introduce _any_ user-facing change? Yes. After the
changes, for example, users should be able to read files from Azure Data Lake
successfully:
```scala
def hadoopConf1() = Map[String, String](
  s"fs.adl.oauth2.access.token.provider.type" -> "ClientCredential",
  s"fs.adl.oauth2.client.id" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.credential" -> dbutils.secrets.get(scope = "...", key = "..."),
  s"fs.adl.oauth2.refresh.url" -> s"https://login.microsoftonline.com/.../oauth2/token")

val df = spark.read.format("libsvm").options(hadoopConf1).load("adl://....azuredatalakestore.net/foldersp1/...")
```
and not get the following exception because the settings above are not propagated to the filesystem:
```java
java.lang.IllegalArgumentException: No value for fs.adl.oauth2.access.token.provider found in conf file.
  at ....adl.AdlFileSystem.getNonEmptyVal(AdlFileSystem.java:820)
  at ....adl.AdlFileSystem.getCustomAccessTokenProvider(AdlFileSystem.java:220)
  at ....adl.AdlFileSystem.getAccessTokenProvider(AdlFileSystem.java:257)
  at ....adl.AdlFileSystem.initialize(AdlFileSystem.java:164)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
```
### How was this patch tested? Added UT to `LibSVMRelationSuite`.
Closes #29984 from MaxGekk/ml-option-propagation.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun
<dhyun@apple.com>
(commit: 1234c66)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala (diff)
Commit e1909c96fbfc3d3f7808f6ddcadec88cc4d11fb9 by dhyun
[SPARK-33099][K8S] Respect executor idle timeout conf in
ExecutorPodsAllocator
### What changes were proposed in this pull request?
This PR aims to protect the executor pod request or pending pod during
executor idle timeout.
### Why are the changes needed?
In case of dynamic allocation, the Apache Spark K8s `ExecutorPodsAllocator`
cancels pod requests or pending pods too eagerly. In the following example,
`ExecutorPodsAllocator` receives new total-executor adjustment requests
rapidly within two minutes, sometimes three times in a single second, and it
repeatedly requests and deletes the corresponding pod request or pending pod.
This PR reuses `spark.dynamicAllocation.executorIdleTimeout` (default: 60s)
to keep the pod request or pending pod.
```
20/10/08 05:58:08 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:58:08 INFO ExecutorPodsAllocator: Going to request 3 executors from Kubernetes.
20/10/08 05:58:09 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:58:43 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 1
20/10/08 05:58:47 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0
20/10/08 05:59:26 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:59:30 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2
20/10/08 05:59:31 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:59:44 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2
20/10/08 05:59:44 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0
20/10/08 05:59:45 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 2
20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 1
20/10/08 05:59:50 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 0
20/10/08 05:59:54 INFO ExecutorPodsAllocator: Set totalExpectedExecutors to 3
20/10/08 05:59:54 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
```
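A hedged sketch of the approach (hypothetical names, not the actual `ExecutorPodsAllocator` code): record when each pod request was made and only cancel excess requests that have been outstanding longer than the idle timeout.
```scala
import scala.collection.mutable

// Hypothetical sketch: names and structure are illustrative, not the real allocator.
val executorIdleTimeoutMs = 60000L  // from spark.dynamicAllocation.executorIdleTimeout
val newlyCreatedExecutors = mutable.LinkedHashMap.empty[Long, Long]  // executor id -> creation time (ms)

def excessRequestsToDelete(nowMs: Long, excessIds: Seq[Long]): Seq[Long] =
  excessIds.filter { id =>
    // delete only requests that have been outstanding longer than the idle timeout
    newlyCreatedExecutors.get(id).exists(createdMs => nowMs - createdMs > executorIdleTimeoutMs)
  }
```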
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the newly added test case.
Closes #29981 from dongjoon-hyun/SPARK-K8S-INITIAL.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: e1909c9)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorLifecycleTestUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsWatchSnapshotSourceSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSourceSuite.scala (diff)
Commit edb140eb5cb7f20af3e2ee7d2f9fb72f3e20e796 by dhyun
[SPARK-32896][SS] Add DataStreamWriter.table API
### What changes were proposed in this pull request?
This PR proposes to add `DataStreamWriter.table` to specify the output
"table" to write from the streaming query.
### Why are the changes needed?
For now, there is no way to write to a table (especially a catalog table)
even when the table is capable of handling streaming writes, so even with
Spark 3, writing to a catalog table via Structured Streaming has to go
through `DataStreamWriter.format(provider)` and hope the provider handles it
the same way it handles a catalog table.
With the new API, we can directly point to a catalog table which supports
streaming writes. Some usages are covered with tests; simply put, end users
can do the following:
```scala
// assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog
spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo")

val query = inputDF
     .writeStream
     .table("testcat.ns.table1")
     .option(...)
     .start()
```
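As a hedged follow-up sketch (assuming the streaming query above is running and the catalog supports batch reads), the same table can be read back once data has been processed:
```scala
// Hypothetical follow-up: wait for available data to be processed, then read the
// catalog table back as a batch DataFrame to inspect the streamed output.
query.processAllAvailable()
spark.read.table("testcat.ns.table1").show()
```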
### Does this PR introduce _any_ user-facing change?
Yes, this adds a new public API in DataStreamWriter. It does not introduce
any backward-incompatible change.
### How was this patch tested?
New unit tests.
Closes #29767 from HeartSaVioR/SPARK-32896.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: edb140e)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamTableAPISuite.scala (diff)
Commit 2e07ed30418d45e89d108bc4bc020d2933c20a3a by dhyun
[SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2
workarounds and Hive 1.2 profile in Jenkins script
### What changes were proposed in this pull request?
This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2
profile in Jenkins script.
- The `test-hive1.2` title is not used anymore in Jenkins
- Remove some comments related to Hive 1.2
- Remove unused code in the Hive `OrcFilters.scala`
- Test the `spark.sql.hive.convertMetastoreOrc` disabled case for the tests
added in SPARK-19809 and SPARK-22267
### Why are the changes needed?
To remove unused code and improve test coverage.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually ran the unit tests. It will also be tested in CI in this PR.
Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 2e07ed3)
The file was modified dev/run-tests-jenkins.py (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala (diff)
The file was removed sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcFilterSuite.scala
The file was removed sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala (diff)
The file was removed dev/deps/spark-deps-hadoop-2.7-hive-1.2