Changes

Summary

  1. [SPARK-32659][SQL][FOLLOWUP] Broadcast Array instead of Set in InSubqueryExec (commit: 6145621)
  2. [SPARK-32964][DSTREAMS] Pass all `streaming` module UTs in Scala 2.13 (commit: dd80845)
  3. [SPARK-32757][SQL][FOLLOWUP] Preserve the attribute name as possible as we scan in SubqueryBroadcastExec (commit: fba5736)
  4. [SPARK-32306][SQL][DOCS] Clarify the result of `percentile_approx()` (commit: 7c14f17)
  5. [SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods (commit: 779f0a8)
  6. [SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available in PyPI (commit: 942f577)
  7. [MINOR][SQL] Improve examples for `percentile_approx()` (commit: b53da23)
  8. [SPARK-32870][DOCS][SQL] Make sure that all expressions have their ExpressionDescription filled (commit: acfee3c)
  9. [SPARK-32959][SQL][TEST] Fix an invalid test in DataSourceV2SQLSuite (commit: 21b7479)
  10. [SPARK-32907][ML] adaptively blockify instances - revert blockify gmm (commit: 432afac)
  11. [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms (commit: 383bb4a)
  12. [SPARK-32950][SQL] Remove unnecessary big-endian code paths (commit: faeb71b)
  13. [SPARK-32981][BUILD] Remove hive-1.2/hadoop-2.7 from Apache Spark 3.1 distribution (commit: 3c97665)
  14. [SPARK-32937][SPARK-32980][K8S] Fix decom & launcher tests and add some comments to reduce chance of breakage (commit: 27f6b5a)
  15. [SPARK-32971][K8S] Support dynamic PVC creation/deletion for K8s executors (commit: 527cd3f)
Commit 61456214957cd446c2338fc70a0872c4bc22f77d by dhyun
[SPARK-32659][SQL][FOLLOWUP] Broadcast Array instead of Set in
InSubqueryExec
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/29475.
This PR updates the code to broadcast the Array instead of the Set, which was the behavior before #29475.
### Why are the changes needed?
The size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the set on the executor side.
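As a rough sketch of the idea only (this is not the `InSubqueryExec` code; the session setup, data and RDD-based filter below are invented for illustration), the driver ships the compact Array and each executor materializes the Set where it is actually probed:
```scala
import org.apache.spark.sql.SparkSession

object BroadcastArraySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-array-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Broadcast the compact Array from the driver side...
    val keys: Array[Int] = (1 to 10000).toArray
    val bcKeys = sc.broadcast(keys)

    val filtered = sc.parallelize(1 to 1000000).mapPartitions { iter =>
      // ...and build the Set on the executor side, once per partition,
      // where it is actually used for lookups.
      val keySet = bcKeys.value.toSet
      iter.filter(keySet.contains)
    }
    println(filtered.count())
    spark.stop()
  }
}
```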
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #29838 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 6145621)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/subquery.scala (diff)
Commit dd80845735880de5943e97b08a487add3ed53139 by dhyun
[SPARK-32964][DSTREAMS] Pass all `streaming` module UTs in Scala 2.13
### What changes were proposed in this pull request?
There is only one failing test case in the `streaming` module on Scala 2.13: `start with non-serializable DStream checkpoint` in `StreamingContextSuite`.
A `StackOverflowError` is thrown when the `SerializationDebugger#visit` method is called.
I found that `inputStreams` and `outputStreams` in `DStreamGraph` cannot be matched in the `SerializationDebugger#visit` method because `ArrayBuffer` is not `Array` in Scala 2.13.
The main change of this PR is to use `mutable.ArraySeq` instead of `ArrayBuffer` to store `inputStreams` and `outputStreams` in `DStreamGraph`, so that they can be matched in the `SerializationDebugger#visit` method.
### Why are the changes needed? We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl streaming -Pscala-2.13 -am
mvn test -pl streaming -Pscala-2.13
mvn test -pl core -Pscala-2.13
```
streaming module:
```
Tests: succeeded 339, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```
core module:
```
Tests: succeeded 2648, failed 0, canceled 4, ignored 7, pending 0
All tests passed.
```
Closes #29836 from LuciferYang/fix-streaming-213.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: dd80845)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala (diff)
Commit fba5736c50f93d53f5fa0bf0edc97dc147b88804 by dhyun
[SPARK-32757][SQL][FOLLOWUP] Preserve the attribute name as possible as
we scan in SubqueryBroadcastExec
### What changes were proposed in this pull request?
This is a minor followup of https://github.com/apache/spark/pull/29601, to preserve the attribute name in `SubqueryBroadcastExec.output`.
### Why are the changes needed?
During explain, it's better to see the original column name instead of always "key".
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests.
Closes #29839 from cloud-fan/followup2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: fba5736)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SubqueryBroadcastExec.scala (diff)
Commit 7c14f177eb5b52d491f41b217926cc8ca5f0ce4c by viirya
[SPARK-32306][SQL][DOCS] Clarify the result of `percentile_approx()`
### What changes were proposed in this pull request? A more precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
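As an illustration of that clarification (a hypothetical `spark-shell` snippet, not taken from the PR), the approximate median of a column containing 0, 1, 2 and 10 is drawn from those values rather than interpolated:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("percentile-approx-example").getOrCreate()
import spark.implicits._

Seq(0, 1, 2, 10).toDF("col").createOrReplaceTempView("tab")

// Returns one of the values present in `col` (e.g. 1), never an
// interpolated value such as 1.5 that an exact median might suggest.
spark.sql("SELECT percentile_approx(col, 0.5) FROM tab").show()
```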
### Why are the changes needed? To improve Spark docs and avoid
misunderstanding of the function behavior.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested?
`./dev/scalastyle`
Closes #29835 from MaxGekk/doc-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(commit: 7c14f17)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (diff)
The file was modified R/pkg/R/functions.R (diff)
The file was modified python/pyspark/sql/functions.py (diff)
Commit 779f0a84eaa0bcee928576e406fdc37c94c5bbc1 by gurwls223
[SPARK-32933][PYTHON] Use keyword-only syntax for keyword_only methods
### What changes were proposed in this pull request?
This PR adjusts the signatures of methods decorated with `keyword_only` to mark them as keyword-only using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).
__Note__:
For the moment the goal is not to replace `keyword_only`. For
justification see
https://github.com/apache/spark/pull/29591#discussion_r489402579
### Why are the changes needed?
Right now it is not clear that `keyword_only` methods are indeed keyword
only. This proposal addresses that.
In practice we could probably capture `locals` and drop `keyword_only` completely, i.e.:
```python
@keyword_only
def __init__(self, *, featuresCol="features"):
    ...
    kwargs = self._input_kwargs
    self.setParams(**kwargs)
```
could be replaced with
```python
def __init__(self, *, featuresCol="features"):
    kwargs = locals()
    del kwargs["self"]
    ...
    self.setParams(**kwargs)
```
### Does this PR introduce _any_ user-facing change?
Docstrings and inspect tools will now indicate that `keyword_only`
methods expect only keyword arguments.
For example, `LinearSVC` will change from
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature: LinearSVC.__init__(
   self,
   featuresCol='features',
   labelCol='label',
   predictionCol='prediction',
   maxIter=100,
   regParam=0.0,
   tol=1e-06,
   rawPredictionCol='rawPrediction',
   fitIntercept=True,
   standardization=True,
   threshold=0.0,
   weightCol=None,
   aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction",
           maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",
           fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,
           aggregationDepth=2):
File:      /path/to/python/pyspark/ml/classification.py
Type:      function
```
to
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature: LinearSVC.__init__(
   self,
   *,
   featuresCol='features',
   labelCol='label',
   predictionCol='prediction',
   maxIter=100,
   regParam=0.0,
   tol=1e-06,
   rawPredictionCol='rawPrediction',
   fitIntercept=True,
   standardization=True,
   threshold=0.0,
   weightCol=None,
   aggregationDepth=2,
   blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction",
           maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction",
           fitIntercept=True, standardization=True, threshold=0.0, weightCol=None,
           aggregationDepth=2, blockSize=1):
File:      ~/Workspace/spark/python/pyspark/ml/classification.py
Type:      function
```
### How was this patch tested?
Existing tests.
Closes #29799 from zero323/SPARK-32933.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 779f0a8)
The file was modified python/pyspark/ml/evaluation.py (diff)
The file was modified python/pyspark/ml/clustering.py (diff)
The file was modified python/pyspark/ml/tuning.py (diff)
The file was modified python/pyspark/sql/streaming.py (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified python/pyspark/ml/pipeline.py (diff)
The file was modified python/pyspark/ml/feature.py (diff)
The file was modified python/pyspark/ml/recommendation.py (diff)
The file was modified python/pyspark/ml/fpm.py (diff)
Commit 942f577b6e34a4f24203dfcad53f5fb98232a6e4 by gurwls223
[SPARK-32017][PYTHON][BUILD] Make Pyspark Hadoop 3.2+ Variant available
in PyPI
### What changes were proposed in this pull request?
This PR proposes to add a way to select Hadoop and Hive versions in pip
installation. Users can select Hive or Hadoop versions as below:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
When the environment variables are set, internally it downloads the
corresponding Spark version and then sets the Spark home to it. Also
this PR exposes a mirror to set as an environment variable,
`PYSPARK_RELEASE_MIRROR`.
**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for
example:
    ```bash
   pip install pyspark --install-option="hadoop3.2"
   ```
    This is because of a limitation and bug in pip itself. Once they fix
this issue, we can switch from the environment variables to the proper
installation options, see SPARK-32837.
    It IS possible to work around, but the workaround is very ugly or hacky and requires a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.
- In pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging way_ in order to prevent any behaviour changes.
  Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts), and download the Spark distribution as this PR proposes.
- This way is sort of consistent with SparkR:
  SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library. For example, the `sparkr` script is not packed together.
  PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g. the `pyspark` shell.
  If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.
- There looks to be no way to publish releases containing a different Hadoop or Hive to PyPI due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so this is not an option.
  Given my investigation, the usual way is either `--install-option` above with hacks, or environment variables.
### Why are the changes needed?
To provide users the options to select Hadoop and Hive versions.
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to select the Hive and Hadoop versions as below when they install it from `pip`:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
### How was this patch tested?
Unit tests were added. I also manually tested in Mac and Windows (after
building Spark with `python/dist/pyspark-3.1.0.dev0.tar.gz`):
```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```
Mac:
```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```
Windows:
```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```
Closes #29703 from HyukjinKwon/SPARK-32017.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 942f577)
The file was modified dev/sparktestsupport/modules.py (diff)
The file was added python/pyspark/install.py
The file was added python/pyspark/tests/test_install_spark.py
The file was modified python/docs/source/getting_started/install.rst (diff)
The file was modified python/setup.py (diff)
The file was modified dev/create-release/release-build.sh (diff)
The file was modified python/pyspark/find_spark_home.py (diff)
Commit b53da23a28fe149cc75d593c5c36f7020a8a2752 by gurwls223
[MINOR][SQL] Improve examples for `percentile_approx()`
### What changes were proposed in this pull request? In the PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, by an example **with multiple values** in the input column.
### Why are the changes needed? The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`
Closes #29841 from MaxGekk/example-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: b53da23)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala (diff)
Commit acfee3c8b1fbe3daea141cb902dd6954873a1f8f by yamamuro
[SPARK-32870][DOCS][SQL] Make sure that all expressions have their
ExpressionDescription filled
### What changes were proposed in this pull request?
Made sure that all the expressions in the `FunctionRegistry` have the fields `usage`, `examples` and `since` filled in their `ExpressionDescription`. Added a UT to `ExpressionInfoSuite` to make sure that all new expressions will also fill those fields.
### Why are the changes needed?
Documentation improvement
### Does this PR introduce _any_ user-facing change?
Better generated documentation for the SQL built-in functions.
### How was this patch tested?
Checked the fix version in the following jiras:
- SPARK-1251 - UnaryMinus, Add, Subtract, Multiply, Divide, Remainder, Explode, Not, In, And, Or, Equals, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual, If, Cast
- SPARK-2053 - CaseWhen
- SPARK-2665 - EqualNullSafe
- SPARK-3176 - Abs
- SPARK-6542 - CreateStruct
- SPARK-7135 - MonotonicallyIncreasingID
- SPARK-7152 - SparkPartitionID
- SPARK-7295 - bitwiseAND, bitwiseOR, bitwiseXOR, bitwiseNOT
- SPARK-8005 - InputFileName
- SPARK-8203 - Greatest
- SPARK-8204 - Least
- SPARK-8220 - UnaryPositive
- SPARK-8221 - Pmod
- SPARK-8230 - Size
- SPARK-8231 - ArrayContains
- SPARK-8232 - SortArray
- SPARK-8234 - md5
- SPARK-8235 - sha1
- SPARK-8236 - crc32
- SPARK-8237 - sha2
- SPARK-8240 - Concat
- SPARK-8246 - GetJsonObject
- SPARK-8407 - CreateNamedStruct
- SPARK-9617 - JsonTuple
- SPARK-10810 - CurrentDatabase
- SPARK-12480 - Murmur3Hash
- SPARK-14061 - CreateMap
- SPARK-14160 - TimeWindow
- SPARK-14580 - AssertTrue
- SPARK-16274 - XPathBoolean
- SPARK-16278 - MapKeys
- SPARK-16279 - MapValues
- SPARK-16284 - CallMethodViaReflection
- SPARK-16286 - Stack
- SPARK-16288 - Inline
- SPARK-16289 - PosExplode
- SPARK-16318 - XPathShort, XPathInt, XPathLong, XPathFloat, XPathDouble, XPathString, XPathList
- SPARK-16730 - Cast aliases
- SPARK-17495 - HiveHash
- SPARK-18702 - InputFileBlockStart, InputFileBlockLength
- SPARK-20910 - UUID
Closes #29743 from tanelk/SPARK-32870.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: acfee3c)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CountMinSketchAgg.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/bitwiseAggregates.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/cast.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SparkPartitionID.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/bitwiseExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/inputFileBlock.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala (diff)
Commit 21b74797978e998504d795551dcc6b6a0e5801ac by wenchen
[SPARK-32959][SQL][TEST] Fix an invalid test in DataSourceV2SQLSuite
### What changes were proposed in this pull request?
This PR addresses two issues related to the `Relation: view text` test
in `DataSourceV2SQLSuite`.
1. The test has the following block:
```scala
withView("view1") { v1: String =>
  sql(...)
}
```
Since `withView`'s signature is `withView(v: String*)(f: => Unit): Unit`, the `f` that will be executed is `v1: String => sql(..)`, which just defines the anonymous function and does _not_ execute it (see the sketch after this list).
2. Once the test is fixed to run, it actually fails. The reason is that
the v2 session catalog implementation used in tests does not correctly
handle `V1Table` for views in `loadTable`. And this results in views
resolved to `ResolvedTable` instead of `ResolvedView`, causing the test
failure:
https://github.com/apache/spark/blob/f1dc479d39a6f05df7155008d8ec26dff42bb06c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1007-L1011
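For reference, here is a minimal sketch of the two call shapes, using toy stand-ins for the test helpers (not the real `DataSourceV2SQLSuite` code or statements):
```scala
// Toy stand-ins, just to show how the by-name body behaves.
def sql(q: String): Unit = println(s"executing: $q")
def withView(views: String*)(f: => Unit): Unit = {
  try f finally views.foreach(v => println(s"dropping view: $v"))
}

// Broken shape: the block is an anonymous function of type String => Unit;
// it is constructed, discarded by value discarding, and never invoked,
// so `sql` never runs.
withView("view1") { v1: String => sql("SELECT * FROM view1") }

// Corrected shape: the statements themselves form the by-name body and execute.
withView("view1") {
  sql("SELECT * FROM view1")
}
```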
### Why are the changes needed?
Fixing a bug in a test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes #29811 from imback82/fix_minor_test.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 21b7479)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
Commit 432afac07ea721e57b18e6108fe56fc878acef51 by ruifengz
[SPARK-32907][ML] adaptively blockify instances - revert blockify gmm
### What changes were proposed in this pull request? revert blockify gmm
### Why are the changes needed? WeichenXu123 and I thought we should use memory size instead of the number of rows to blockify instances; if a buffer's size is large and determined by the number of rows, we should discard it. In GMM, we found that the pre-allocated memory may be too large and should be discarded:
```scala
@transient private lazy val auxiliaryPDFMat =
  DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discussion and thought it is better to revert blockify GMM.
### Does this PR introduce _any_ user-facing change? The `blockSize` added in the master branch will be removed.
### How was this patch tested? Existing test suites.
Closes #29782 from zhengruifeng/unblockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
(commit: 432afac)
The file was modified python/pyspark/ml/clustering.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (diff)
The file was modified mllib-local/src/main/scala/org/apache/spark/ml/stat/distribution/MultivariateGaussian.scala (diff)
The file was modified mllib-local/src/test/scala/org/apache/spark/ml/stat/distribution/MultivariateGaussianSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala (diff)
Commit 383bb4af004253e1eb84d3f3e58347e0d7670f66 by srowen
[SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms
MurmurHash3 and xxHash64 interpret sequences of bytes as integers
encoded in little-endian byte order. This requires a byte reversal on big-endian platforms.
I've left the hashInt and hashLong functions as-is for now. My
interpretation of these functions is that they perform the hash on the
integer value as if it were serialized in little-endian byte order.
Therefore no byte reversal is necessary.
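To make the byte-order handling concrete (a standalone sketch, not the actual `Murmur3_x86_32`/`XXH64` code), reading four bytes as a little-endian int on any platform amounts to a native-order read plus a byte reversal when the native order is big-endian:
```scala
import java.nio.{ByteBuffer, ByteOrder}

// Interpret 4 bytes starting at `offset` as a little-endian encoded Int,
// regardless of the platform's native byte order.
def getIntLittleEndian(bytes: Array[Byte], offset: Int): Int = {
  // A native-order read is what a raw memory load would give us.
  val nativeValue =
    ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.nativeOrder()).getInt()
  if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
    // On big-endian platforms, reverse the bytes to get the little-endian
    // interpretation the hash algorithms expect.
    java.lang.Integer.reverseBytes(nativeValue)
  } else {
    nativeValue
  }
}

// Prints "4030201" (0x04030201) on both little- and big-endian JVMs.
println(getIntLittleEndian(Array[Byte](0x01, 0x02, 0x03, 0x04), 0).toHexString)
```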
### What changes were proposed in this pull request? Modify hash
functions to produce correct results on big-endian platforms.
### Why are the changes needed? Hash functions produce incorrect results
on big-endian platforms which, amongst other potential issues, causes
test failures.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Existing tests run on the IBM Z (s390x)
platform which uses a big-endian byte order.
Closes #29762 from mundaym/fix-hashes.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 383bb4a)
The file was modified common/sketch/src/main/java/org/apache/spark/util/sketch/Murmur3_x86_32.java (diff)
The file was modified common/unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java (diff)
The file was modified sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/XXH64Suite.java (diff)
The file was modified sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java (diff)
Commit faeb71b39d746afdc29f154e293e7c09871c1254 by srowen
[SPARK-32950][SQL] Remove unnecessary big-endian code paths
### What changes were proposed in this pull request? Remove unnecessary
code.
### Why are the changes needed?
General housekeeping. Might be a slight performance improvement,
especially on big-endian systems.
There is no need for separate code paths for big- and little-endian platforms in `putDoubles` and `putFloats` anymore (since PR #24861). On all platforms, values are encoded in native byte order and can just be copied directly.
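To make that concrete (an illustrative snippet, not the `OnHeapColumnVector`/`OffHeapColumnVector` code), floats written in the platform's native byte order survive a plain byte copy with no endian-specific branch:
```scala
import java.nio.{ByteBuffer, ByteOrder}

val values = Array(1.0f, 2.5f, -3.25f)

// Encode the floats in the platform's native byte order.
val src = ByteBuffer.allocate(values.length * 4).order(ByteOrder.nativeOrder())
values.foreach(v => src.putFloat(v))

// A straight byte copy, with no per-element byte swapping.
val copied = java.util.Arrays.copyOf(src.array(), src.array().length)

// Decoding with the same native order recovers the original values on
// both little-endian and big-endian JVMs.
val dst = ByteBuffer.wrap(copied).order(ByteOrder.nativeOrder())
println(values.indices.map(i => dst.getFloat(i * 4)))  // Vector(1.0, 2.5, -3.25)
```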
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested? Existing tests.
Closes #29815 from mundaym/clean-putfloats.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: faeb71b)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java (diff)
Commit 3c97665dad810bd9d0cc3ca8bd735914bb0d38d6 by dongjoon
[SPARK-32981][BUILD] Remove hive-1.2/hadoop-2.7 from Apache Spark 3.1
distribution
### What changes were proposed in this pull request?
Apache Spark 3.0 switches its Hive execution version from 1.2 to 2.3, but it still provides the unofficial forked Hive 1.2 version in our distribution, like the following. This PR aims to remove it from Apache Spark 3.1.0 officially while keeping the `hive-1.2` profile.
```
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.asc
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.sha512
```
### Why are the changes needed?
The unofficial Hive 1.2.1 fork has many bugs and has not been maintained for a long time. We had better not recommend it in the official Apache Spark distribution.
### Does this PR introduce _any_ user-facing change?
There is no user-facing change in the default distribution (Hadoop
3.2/Hive 2.3).
### How was this patch tested?
Manually, because this is a change in the release script.
Closes #29856 from dongjoon-hyun/SPARK-32981.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 3c97665)
The file was modified dev/create-release/release-build.sh (diff)
Commit 27f6b5a103137fa1dee2103c3a17594d2df49f1b by dhyun
[SPARK-32937][SPARK-32980][K8S] Fix decom & launcher tests and add some
comments to reduce chance of breakage
### What changes were proposed in this pull request?
Fixes the log strings the decom integration tests look for, and adds comments reminding people to run the K8s integration tests when changing those code paths.
### Why are the changes needed?
The strings it looks for have been changed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
WIP: Verify that the K8s jenkins job succeeds
Closes #29854 from holdenk/SPARK-32979-spark-k8s-decom-test-is-broken.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 27f6b5a)
The file was modified core/src/main/scala/org/apache/spark/scheduler/ExecutorLossReason.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/backend/minikube/Minikube.scala (diff)
Commit 527cd3fc3aac40f84ba8eee291e1a955e03f7665 by dhyun
[SPARK-32971][K8S] Support dynamic PVC creation/deletion for K8s
executors
### What changes were proposed in this pull request?
This PR aims to support dynamic PVC creation and deletion for K8s
executors. The PVCs are created with executor pods and deleted when the
executor pods are deleted.
**Configuration** Mostly, this PR reuses the existing PVC volume configs
and `storageClass` is added.
```
spark.executor.instances=2
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp2
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
```
**Executors**
```
$ kubectl get pod -l spark-role=executor
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-f4d80574b9bb0941-exec-1   1/1     Running   0          2m6s
spark-pi-f4d80574b9bb0941-exec-2   1/1     Running   0          2m6s
```
**PVCs**
```
$ kubectl get pvc
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
spark-pi-f4d80574b9bb0941-exec-1-pvc-0   Bound    pvc-7d20173f-278b-4c7b-b7e5-7f0ed414ee64   500Gi      RWO            gp2            48s
spark-pi-f4d80574b9bb0941-exec-2-pvc-0   Bound    pvc-1138f00d-87f1-47f4-9b58-ce5d13ea0c3a   500Gi      RWO            gp2            48s
```
**Executor Disk**
```
$ k exec -it spark-pi-f4d80574b9bb0941-exec-1 -- df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme3n1    493G   74M  492G   1% /data
```
```
$ k exec -it spark-pi-f4d80574b9bb0941-exec-1 -- ls /data
blockmgr-81dcebaf-11a7-4d7b-91d6-3c580187d914 lost+found
spark-6be42db8-2c58-4389-b52c-8aeeafe76bd5
```
### Why are the changes needed?
While SPARK-32655 supports mounting a pre-created PVC, this PR can create PVCs dynamically, which reduces a lot of manual effort.
### Does this PR introduce _any_ user-facing change?
Yes. This is a new feature.
### How was this patch tested?
Pass the newly added test cases.
Closes #29846 from dongjoon-hyun/SPARK-32971.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 527cd3f)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStep.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStepSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBuilderSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeSpec.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesExecutorBuilder.scala (diff)
The file was added resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesExecutorSpec.scala
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesVolumeUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/KubernetesTestConf.scala (diff)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocatorSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala (diff)