Changes

Summary

  1. [SPARK-32927][SQL] Bitwise OR, AND and XOR should have similar (commit: f41ba2a) (details)
  2. [SPARK-32187][PYTHON][DOCS] Doc on Python packaging (commit: a7f84a0) (details)
  3. [SPARK-33011][ML] Promote the stability annotation to Evolving for (commit: d15f504) (details)
  4. [SPARK-32996][WEB-UI] Handle empty ExecutorMetrics in (commit: 173da5b) (details)
  5. [SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value (commit: a53fc9b) (details)
  6. [SPARK-33021][PYTHON][TESTS] Move functions related test cases into (commit: 376ede1) (details)
  7. [SPARK-33015][SQL] Compute the current date only once (commit: 68cd567) (details)
  8. [SPARK-33020][PYTHON] Add nth_value as a PySpark function (commit: 6868b40) (details)
  9. [MINOR][DOCS] Document when `current_date` and `current_timestamp` are (commit: 1b60ff5) (details)
  10. [SPARK-32948][SQL] Optimize to_json and from_json expression chain (commit: 202115e) (details)
  11. [SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for (commit: 90e86f6) (details)
  12. [SPARK-32901][CORE] Do not allocate memory while spilling (commit: f167002) (details)
  13. [MINOR][DOCS] Fixing log message for better clarity (commit: 7766fd1) (details)
  14. [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes (commit: 711d8dd) (details)
  15. [SPARK-33019][CORE] Use (commit: cc06266) (details)
Commit f41ba2a2f3b86e485aa0ca1c10a2efe9a7163fb3 by gurwls223
[SPARK-32927][SQL] Bitwise OR, AND and XOR should have similar
canonicalization rules to boolean OR and AND
### What changes were proposed in this pull request?
Add canonicalization rules for commutative bitwise operations.
### Why are the changes needed?
Canonical form is used in many other optimization rules. This reduces the
number of cases where plans with identical results are considered distinct.
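A minimal sketch of the behavior this enables, using catalyst-style attribute references (the attribute names and the assertion are illustrative, not taken from the PR):
```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, BitwiseOr}
import org.apache.spark.sql.types.IntegerType

// Two commuted bitwise expressions over the same attributes; with the new
// canonicalization rules they should compare as semantically equal.
val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
assert(BitwiseOr(a, b).semanticEquals(BitwiseOr(b, a)))
```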
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes #29794 from tanelk/SPARK-32927.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: f41ba2a)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala (diff)
Commit a7f84a0b457ed3e1b854729f132e218a4ae48b21 by gurwls223
[SPARK-32187][PYTHON][DOCS] Doc on Python packaging
### What changes were proposed in this pull request?
This PR proposes to document PySpark specific packaging guidelines.
### Why are the changes needed?
To provide a single place for PySpark packaging guidance and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
```
cd python/docs
make clean html
```
Closes #29806 from fhoering/add_doc_python_packaging.
Lead-authored-by: Fabian Höring <f.horing@criteo.com> Co-authored-by:
Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: a7f84a0)
The file was modified python/docs/source/user_guide/index.rst (diff)
The file was added python/docs/source/user_guide/python_packaging.rst
Commit d15f504a5e8bd8acfb6dc1ee138f7d92ff211396 by gurwls223
[SPARK-33011][ML] Promote the stability annotation to Evolving for
MLEvent traits/classes
### What changes were proposed in this pull request?
This PR proposes to promote the stability annotation to `Evolving` for
MLEvent traits/classes.
### Why are the changes needed?
The feature is released in Spark 3.0.0 having SPARK-26818 as the last
change in Feb. 2020, and haven't changed in Spark 3.0.1. (There's no
change more than a half of year.)
While we should wait for a few more minor releases before considering the API
stable, it is worth promoting it to Evolving so that we clearly state that the
API is supported.
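For context, a minimal sketch of what the promotion looks like on a trait (the real trait extends other types, which are elided here; only the annotation is the point):
```scala
import org.apache.spark.annotation.Evolving

// The MLEvent hierarchy now carries the Evolving stability annotation,
// signalling that the API is supported but may still change between minor releases.
@Evolving
sealed trait MLEvent
```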
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Just changed the annotation, no tests required.
Closes #29887 from HeartSaVioR/SPARK-33011.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: d15f504)
The file was modified mllib/src/main/scala/org/apache/spark/ml/events.scala (diff)
Commit 173da5bf11daecbd428add1a5e0aedd58a66fadb by viirya
[SPARK-32996][WEB-UI] Handle empty ExecutorMetrics in
ExecutorMetricsJsonSerializer
### What changes were proposed in this pull request? When
`peakMemoryMetrics` in `ExecutorSummary` is `Option.empty`, the
`ExecutorMetricsJsonSerializer#serialize` method does not call the
`jsonGenerator.writeObject` method. The JSON is then generated with the
`peakMemoryMetrics` key added to the serialized string but no corresponding
value, and an error is thrown when the next key, `attributes`, is added to the
JSON: `com.fasterxml.jackson.core.JsonGenerationException: Can not write a
field name, expecting a value`
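A hedged sketch of the guarded serializer shape (the value type is simplified to a plain map rather than the actual `ExecutorMetrics` class): only emit a value when the Option is defined, so Jackson never sees a field name without a value.
```scala
import com.fasterxml.jackson.core.JsonGenerator
import com.fasterxml.jackson.databind.{JsonSerializer, SerializerProvider}

// Simplified stand-in for ExecutorMetricsJsonSerializer: when the metrics are
// absent, write an explicit null instead of skipping the value entirely.
class PeakMetricsSerializer extends JsonSerializer[Option[Map[String, Long]]] {
  override def serialize(
      metrics: Option[Map[String, Long]],
      jsonGenerator: JsonGenerator,
      serializerProvider: SerializerProvider): Unit = {
    metrics match {
      case Some(m) => jsonGenerator.writeObject(m)
      case None    => jsonGenerator.writeNull()
    }
  }
}
```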
### Why are the changes needed? At the start of the Spark job, if
`peakMemoryMetrics` is `Option.empty`, then it causes a
`com.fasterxml.jackson.core.JsonGenerationException` to be thrown when
we navigate to the Executors tab in Spark UI. Complete stacktrace:
> com.fasterxml.jackson.core.JsonGenerationException: Can not write a field name, expecting a value
> at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2080)
> at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:161)
> at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:725)
> at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:721)
> at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:166)
> at com.fasterxml.jackson.databind.ser.std.CollectionSerializer.serializeContents(CollectionSerializer.java:145)
> at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents(IterableSerializerModule.scala:26)
> at com.fasterxml.jackson.module.scala.ser.IterableSerializer.serializeContents$(IterableSerializerModule.scala:25)
> at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
> at com.fasterxml.jackson.module.scala.ser.UnresolvedIterableSerializer.serializeContents(IterableSerializerModule.scala:54)
> at com.fasterxml.jackson.databind.ser.std.AsArraySerializerBase.serialize(AsArraySerializerBase.java:250)
> at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
> at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
> at com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:4094)
> at com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:3404)
> at org.apache.spark.ui.exec.ExecutorsPage.allExecutorsDataScript$1(ExecutorsTab.scala:64)
> at org.apache.spark.ui.exec.ExecutorsPage.render(ExecutorsTab.scala:76)
> at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
> at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
> at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
> at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
> at org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
> at org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at org.sparkproject.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at org.sparkproject.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
> at org.sparkproject.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at org.sparkproject.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
> at org.sparkproject.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at org.sparkproject.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
> at org.sparkproject.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
> at org.sparkproject.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.sparkproject.jetty.server.Server.handle(Server.java:505)
> at org.sparkproject.jetty.server.HttpChannel.handle(HttpChannel.java:370)
> at org.sparkproject.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
> at org.sparkproject.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
> at org.sparkproject.jetty.io.FillInterest.fillable(FillInterest.java:103)
> at org.sparkproject.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at org.sparkproject.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at org.sparkproject.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at org.sparkproject.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
> at org.sparkproject.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
> at java.base/java.lang.Thread.run(Thread.java:834)
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Unit test
Closes #29872 from shrutig/SPARK-32996.
Authored-by: Shruti Gumma <shruti_gumma@apple.com> Signed-off-by:
Liang-Chi Hsieh <viirya@gmail.com>
(commit: 173da5b)
The file was modified core/src/main/scala/org/apache/spark/status/api/v1/api.scala (diff)
The file was added core/src/test/java/org/apache/spark/status/api/v1/ExecutorSummarySuite.scala
Commit a53fc9b7ae2b96b302d72170db6572b337ec9894 by gurwls223
[SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL
NTH_VALUE. We should override the `prettyName` and `sql`.
### Why are the changes needed? Makes the name of nth_value correct and
shows the ignoreNulls parameter correctly.
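An illustrative query shape where the function name appears (the `spark` session, table, and column names are hypothetical):
```scala
// nth_value over a window; the second argument picks the 2nd row of the window frame.
spark.sql("""
  SELECT name,
         nth_value(salary, 2) OVER (PARTITION BY dept ORDER BY salary) AS second_salary
  FROM employees
""").show()
```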
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested? Jenkins test.
Closes #29886 from beliefer/improve-nth_value.
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer
<beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: a53fc9b)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/postgreSQL/window_part3.sql.out (diff)
Commit 376ede130149e0fa2029da423f8d9c654b096921 by dhyun
[SPARK-33021][PYTHON][TESTS] Move functions related test cases into
test_functions.py
### What changes were proposed in this pull request?
Move functions related test cases from `test_context.py` to
`test_functions.py`.
### Why are the changes needed?
To group the similar test cases.
### Does this PR introduce _any_ user-facing change?
Nope, test-only.
### How was this patch tested?
Jenkins and GitHub Actions should test.
Closes #29898 from HyukjinKwon/SPARK-33021.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 376ede1)
The file was modified python/pyspark/sql/tests/test_context.py (diff)
The file was modified python/pyspark/sql/tests/test_functions.py (diff)
Commit 68cd5677ae0e3891e6bb4938a64ff98810656ba8 by wenchen
[SPARK-33015][SQL] Compute the current date only once
### What changes were proposed in this pull request? Compute the current
date at the specified time zone using a timestamp taken at the start of
query evaluation.
### Why are the changes needed? According to the doc for
[current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date),
the current date should be computed at the start of query evaluation, but
currently it can be computed multiple times. As a consequence, the function
can return different values if the query is executed across a date boundary.
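An illustrative check of the intended behavior (assuming any available `spark` session): every occurrence of `current_date()` within one query should resolve to the same value, even when evaluation crosses midnight.
```scala
// Both columns are replaced by the same literal date at the start of query
// evaluation, so d1 and d2 are always equal after this change.
spark.sql("SELECT current_date() AS d1, current_date() AS d2").show()
```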
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? By existing test suites
`ComputeCurrentTimeSuite` and `DateExpressionsSuite`.
Closes #29889 from MaxGekk/fix-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 68cd567)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (diff)
Commit 6868b405171bfaa8d013bd938dbef6636a8c9845 by dhyun
[SPARK-33020][PYTHON] Add nth_value as a PySpark function
### What changes were proposed in this pull request?
`nth_value` was added in SPARK-27951. This PR adds the corresponding
PySpark API.
### Why are the changes needed?
To keep the APIs consistent across languages.
### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new PySpark function API.
### How was this patch tested?
Unittest was added.
Closes #29899 from HyukjinKwon/SPARK-33020.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 6868b40)
The file was modified python/pyspark/sql/tests/test_functions.py (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified python/pyspark/sql/functions.pyi (diff)
Commit 1b60ff5afea0637f74c5f064642225b35b13b069 by wenchen
[MINOR][DOCS] Document when `current_date` and `current_timestamp` are
evaluated
### What changes were proposed in this pull request? Explicitly document
that `current_date` and `current_timestamp` are evaluated at the start of
query evaluation, and that all calls of `current_date`/`current_timestamp`
within the same query return the same value.
### Why are the changes needed? Users could expect that `current_date`
and `current_timestamp` return the current date/timestamp at the moment
of query execution but in fact the functions are folded by the optimizer
at the start of query evaluation:
https://github.com/apache/spark/blob/0df8dd60733066076967f0525210bbdb5e12415a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L71-L91
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? by running `./dev/scalastyle`.
Closes #29892 from MaxGekk/doc-current_date.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 1b60ff5)
The file was modified R/pkg/R/functions.R (diff)
The file was modified python/pyspark/sql/functions.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
Commit 202115e7cd0bc2b32c68274e625cded0d628a0c5 by dhyun
[SPARK-32948][SQL] Optimize to_json and from_json expression chain
### What changes were proposed in this pull request?
This patch proposes to optimize the from_json + to_json expression chain.
### Why are the changes needed?
To optimize JSON expression chains that could be written manually or
generated automatically during query optimization.
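A hedged sketch of the kind of chain the new rule targets (`df` and the column name are hypothetical, and the simplification only applies when the schemas and options allow it):
```scala
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parsing a JSON string column into a struct and immediately serializing it back;
// the optimizer rule can reduce this round trip to the original column.
val schema = StructType(Seq(StructField("a", StringType)))
val roundTripped = df.select(to_json(from_json(col("json_col"), schema)))
```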
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #29828 from viirya/SPARK-32948.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 202115e)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala
Commit 90e86f6fac8ac42cf61e523397dc1bcc01871744 by gurwls223
[SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for
### What changes were proposed in this pull request?
The UT for SPARK-32019 (#28853) tries to write about 16GB of data to disk. We
must change the value of `spark.sql.files.maxPartitionBytes` (by default
`128MB`) to a smaller value to check the correct behavior with less data. The
other parameters in this UT are also changed to smaller values to keep the
behavior the same.
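For reference, the knob in question can be lowered like this (the value here is an arbitrary illustration, not the one used in the test):
```scala
// Shrink the per-partition read size so the same code path is exercised with far less data.
spark.conf.set("spark.sql.files.maxPartitionBytes", "1MB")
```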
### Why are the changes needed?
The runtime of this one UT can be over 7 minutes on Jenkins. After the change
it takes only a few seconds.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #29842 from tanelk/SPARK-32970.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 90e86f6)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (diff)
Commit f167002522d50eefb261c8ba2d66a23b781a38c4 by herman
[SPARK-32901][CORE] Do not allocate memory while spilling
UnsafeExternalSorter
### What changes were proposed in this pull request?
This PR changes `UnsafeExternalSorter` to no longer allocate any memory
while spilling. In particular it removes the allocation of a new pointer
array in `UnsafeInMemorySorter`. Instead the new pointer array is
allocated whenever the next record is inserted into the sorter.
### Why are the changes needed?
Without this change the `UnsafeExternalSorter` could throw an OOM while
spilling. The following sequence of events would have triggered an OOM:
1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array.
5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array.
6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill.
7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail.
With the changes in the PR the following will happen instead:
1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`.
4. `UnsafeInMemorySorter` frees the old pointer array.
5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the new large array or by throwing a `SparkOutOfMemoryError`).
6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step.
7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal; a generic sketch of this pattern follows the list.
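A generic sketch of the "do not allocate while spilling" pattern described above (this is not Spark's actual sorter code; names and sizes are illustrative): spilling only frees the pointer array, and the replacement is allocated lazily on the next insert.
```scala
// Simplified, self-contained illustration of deferring allocation out of spill().
class PointerSorter(allocate: Int => Array[Long]) {
  private var pointers: Array[Long] = allocate(1024)
  private var used = 0

  def spill(): Unit = {
    // ... write the current run to disk and free record pages (omitted) ...
    pointers = null // free the array, but do NOT allocate a replacement here
    used = 0
  }

  def insert(pointer: Long): Unit = {
    if (pointers == null) {
      pointers = allocate(1024) // allocate lazily, on the next insert after a spill
    } else if (used == pointers.length) {
      val bigger = allocate(pointers.length * 2)
      System.arraycopy(pointers, 0, bigger, 0, used)
      pointers = bigger
    }
    pointers(used) = pointer
    used += 1
  }
}
```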
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests were added in `UnsafeExternalSorterSuite` and
`UnsafeInMemorySorterSuite`.
Closes #29785 from tomvanbussel/SPARK-32901.
Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: herman <herman@databricks.com>
(commit: f167002)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java (diff)
The file was modified core/src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java (diff)
The file was modified core/src/test/scala/org/apache/spark/memory/TestMemoryManager.scala (diff)
The file was modified core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java (diff)
The file was modified core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java (diff)
Commit 7766fd13c9e7cb72b97fdfee224d3958fbe882a0 by srowen
[MINOR][DOCS] Fixing log message for better clarity
Fixing log message for better clarity.
Closes #29870 from akshatb1/master.
Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com>
Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com> Signed-off-by:
Sean Owen <srowen@gmail.com>
(commit: 7766fd1)
The file was modified core/src/main/scala/org/apache/spark/SparkConf.scala (diff)
Commit 711d8dd28afd9af92b025f9908534e5f1d575042 by wenchen
[SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
### What changes were proposed in this pull request?
This PR fixes an estimate statistics issue when the child has 0 bytes.
### Why are the changes needed? The `sizeInBytes` can be `0` when AQE
and CBO are enabled (`spark.sql.adaptive.enabled`=true,
`spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true).
This can generate an incorrect BroadcastJoin, resulting in a driver OOM. For
example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)
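For reference, the configuration combination mentioned above can be set in a session like this (illustrative; these are the conf keys quoted in the description):
```scala
// With all three enabled, a zero-byte child estimate could previously lead to an
// unintended broadcast join.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.planStats.enabled", "true")
```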
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #29894 from wangyum/SPARK-33018.
Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 711d8dd)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (diff)
Commit cc06266ade5a4eb35089501a3b32736624208d4c by dhyun
[SPARK-33019][CORE] Use
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by
default
### What changes were proposed in this pull request?
Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of only
documenting a warning, this PR aims to use a consistent and safer version of
the Apache Hadoop file output committer algorithm, `v1`. This will prevent a
silent correctness regression during migration from Apache Spark 2.4/3.0 to
Apache Spark 3.1.0. Of course, if a user provides the configuration
`spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that will
still be used.
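For example, a user who still wants the `v2` algorithm can keep it by setting the configuration explicitly (a sketch; any of the usual configuration mechanisms, such as `spark-defaults.conf`, works as well):
```scala
import org.apache.spark.sql.SparkSession

// An explicit user-provided setting takes precedence over the new v1 default.
val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```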
### Why are the changes needed?
Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop
3.2. The default of
`spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the
Hadoop version: Apache Hadoop 3.0 switched the default algorithm from `v1` to
`v2`, and there is now a discussion to remove `v2`. We should provide a
consistent default behavior of `v1` across the various Spark distributions.
- [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282)
MR v2 commit algorithm should be deprecated and not the default
### Does this PR introduce _any_ user-facing change?
Yes. This changes the default behavior. Users can override this conf.
### How was this patch tested?
Manual.
**BEFORE (spark-3.0.1-bin-hadoop3.2)**
```scala
scala> sc.version
res0: String = 3.0.1

scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res1: String = 2
```
**AFTER**
```scala
scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
res0: String = 1
```
Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: cc06266)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala (diff)
The file was modified docs/configuration.md (diff)