Changes

Summary

  1. [SPARK-32352][SQL][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Exclude (commit: 11c6a23) (details)
  2. [SPARK-32649][SQL] Optimize BHJ/SHJ inner/semi join with empty hashed (commit: 08b951b) (details)
  3. [SPARK-32588][CORE][TEST] Fix SizeEstimator initialization in tests (commit: bc23bb7) (details)
  4. [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path (commit: e3a88a9) (details)
  5. [SPARK-32686][PYTHON] Un-deprecate inferring DataFrame schema from list (commit: 41cf1d0) (details)
  6. [SPARK-32550][SQL][FOLLOWUP] Eliminate negative impact on (commit: a30bb0c) (details)
  7. [MINOR][SQL] Add missing documentation for LongType mapping (commit: 3eee915) (details)
  8. [SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and (commit: 9151a58) (details)
  9. [SPARK-31000][PYTHON][SQL] Add ability to set table description via (commit: f540031) (details)
  10. [SPARK-32646][SQL][TEST-HADOOP2.7][TEST-HIVE1.2] ORC predicate pushdown (commit: cee48a9) (details)
  11. [SPARK-32641][SQL] withField + getField should return null if original (commit: 3f1e56d) (details)
  12. Revert "[SPARK-32412][SQL] Unify error handling for spark thrift serv… (commit: c26a976) (details)
  13. [SPARK-32466][SQL][FOLLOW-UP] Normalize Location info in explain plan (commit: b78b776) (details)
  14. [SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for (commit: 1f3bb51) (details)
  15. [SPARK-32664][CORE] Switch the log level from info to debug at (commit: a3179a7) (details)
  16. [SPARK-32287][TESTS][FOLLOWUP] Add debugging for flaky (commit: 2feab4e) (details)
  17. [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for (commit: a9d4e60) (details)
Commit 11c6a23c13745a61c1a1cfc82e4f1ac95eaaa04a by yamamuro
[SPARK-32352][SQL][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Exclude
partition columns from data columns
### What changes were proposed in this pull request?
This PR fixes a bug in #29406. #29406 partially pushes down a data filter
even if it is mixed with partition filters. But in some cases partition
columns might also appear among the data columns, so a predicate on a
partition column could be pushed down to the data source.
### Why are the changes needed?
The test
"org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.save()/load()
- partitioned table - simple queries - partition columns in data"
currently fails with the hive-1.2 profile in the master branch.
```
[info] - save()/load() - partitioned table - simple queries - partition columns in data *** FAILED *** (1 second, 457 milliseconds)
[info]   java.util.NoSuchElementException: key not found: p1
[info]   at scala.collection.immutable.Map$Map2.apply(Map.scala:138)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:250)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFiltersHelper$1(OrcFilters.scala:143)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.$anonfun$convertibleFilters$4(OrcFilters.scala:146)
[info]   at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
[info]   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
[info]   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
[info]   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFilters(OrcFilters.scala:145)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.createFilter(OrcFilters.scala:83)
[info]   at org.apache.spark.sql.hive.orc.OrcFileFormat.buildReader(OrcFileFormat.scala:142)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #29526 from viirya/SPARK-32352-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: 11c6a23)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (diff)
Commit 08b951b1cb58cea2c34703e43202fe7c84725c8a by wenchen
[SPARK-32649][SQL] Optimize BHJ/SHJ inner/semi join with empty hashed
relation
### What changes were proposed in this pull request?
For broadcast hash join and shuffled hash join, whenever the build side
hashed relation turns out to be empty, we don't need to execute the
stream side plan at all and can return an empty iterator (for inner join
and left semi join), because we know for sure that none of the stream
side rows can be output as there is no match.
### Why are the changes needed?
A very minor optimization for a rare use case, but when the build side
turns out to be empty, we can leverage it to short-cut the stream side
and save CPU and IO.
Example broadcast hash join query similar to `JoinBenchmark` with empty
hashed relation:
```
def broadcastHashJoinLongKey(): Unit = {
  val N = 20 << 20
  val M = 1 << 16
  val dim = broadcast(spark.range(0).selectExpr("id as k", "cast(id as string) as v"))
  codegenBenchmark("Join w long", N) {
    val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
    assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
    df.noop()
  }
}
```
Comparing wall-clock time with this PR enabled and disabled (for the
non-codegen code path), we see roughly an 8x improvement.
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Join w long:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Join PR disabled                                    637            646          12         32.9          30.4       1.0X
Join PR enabled                                      77             78           2        271.8           3.7       8.3X
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinSuite`.
Closes #29484 from c21/empty-relation.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 08b951b)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateNullAwareAntiJoin.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit bc23bb78823f4fa02385b7b2a0270cd1b98bce34 by srowen
[SPARK-32588][CORE][TEST] Fix SizeEstimator initialization in tests
In order to produce consistent results from SizeEstimator, the tests
override some system properties that are used during SizeEstimator
initialization. However, there were several places where either the
compressed references property wasn't set, or the system properties were
set but the SizeEstimator was not re-initialized.
This caused failures when running the tests with a large heap build of
OpenJ9 because it does not use compressed references unlike most
environments.
### What changes were proposed in this pull request?
Initialize the SizeEstimator class explicitly in the tests where required, to avoid relying on a particular environment.
### Why are the changes needed?
Test failures can be seen when compressed references are disabled (e.g. using an OpenJ9 large heap build or Hotspot with a large heap).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests run on a machine running an OpenJ9 large heap build.
Closes #29407 from mundaym/fix-sizeestimator.
Authored-by: Michael Munday <mike.munday@ibm.com> Signed-off-by: Sean
Owen <srowen@gmail.com>
(commit: bc23bb7)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/util/SizeEstimatorSuite.scala (diff)
Commit e3a88a9767c00afd8d8186e91372535e7ad45f30 by wenchen
[SPARK-32516][SQL] 'path' option cannot coexist with load()'s path
parameters
### What changes were proposed in this pull request?
This PR proposes to make the behavior consistent for the `path` option
when loading dataframes with a single path (e.g, `option("path",
path).format("parquet").load(path)` vs. `option("path",
path).parquet(path)`) by disallowing `path` option to coexist with
`load`'s path parameters.
### Why are the changes needed?
The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+
```
### Does this PR introduce _any_ user-facing change?
Yes, now if the `path` option is specified along with `load`'s path
parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided
```
The user can restore the previous behavior by setting
`spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
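For illustration, a minimal sketch of opting back into the old behavior (assuming a running SparkSession named `spark`):
```scala
// Restore the pre-change behavior in which the 'path' option may coexist
// with load()'s path parameters, as described above.
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
```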
### How was this patch tested?
Added a test
Closes #29328 from imback82/dfw_option.
Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: e3a88a9)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
Commit 41cf1d093fbe5e08266f34d55ca450e5ea27b1ed by cutlerb
[SPARK-32686][PYTHON] Un-deprecate inferring DataFrame schema from list
of dict
### What changes were proposed in this pull request?
As discussed in
https://github.com/apache/spark/pull/29491#discussion_r474451282 and in
SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame
schema from a list of dictionaries. The ability is Pythonic and matches
functionality offered by Pandas.
### Why are the changes needed?
This change clarifies to users that this behavior is supported and is
not going away in the near future.
### Does this PR introduce _any_ user-facing change?
Yes. There used to be a `UserWarning` for this, but now there isn't.
### How was this patch tested?
I tested this manually.
Before:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]

>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```
After:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```
Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
(commit: 41cf1d0)
The file was modified python/pyspark/sql/session.py (diff)
Commit a30bb0cfda4b9e6a6f2014ade0264656177c83f7 by gurwls223
[SPARK-32550][SQL][FOLLOWUP] Eliminate negative impact on
HyperLogLogSuite
### What changes were proposed in this pull request?
Change to use `dataTypes.foreach` instead of accessing elements by index in the `def this(dataTypes: Seq[DataType])` constructor of `SpecificInternalRow`, because random-access performance is unsatisfactory if the input argument is not an `IndexedSeq`.
This PR followed srowen's advice.
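To illustrate the performance point, a standalone sketch (not the actual `SpecificInternalRow` code) contrasting index-based access with `foreach` over a generic `Seq`:
```scala
// Illustration only: on a linked List, apply(i) walks from the head, so the
// index-based loop is O(n^2) overall, while foreach traverses the Seq once,
// O(n), regardless of its concrete type.
def buildByIndex(dataTypes: Seq[String]): Array[String] = {
  val n = dataTypes.length
  val out = new Array[String](n)
  var i = 0
  while (i < n) {
    out(i) = dataTypes(i)   // O(i) on a List: re-walks the list on every call
    i += 1
  }
  out
}

def buildByForeach(dataTypes: Seq[String]): Array[String] = {
  val out = new Array[String](dataTypes.length)
  var i = 0
  dataTypes.foreach { dt => // single pass, cheap for both List and IndexedSeq
    out(i) = dt
    i += 1
  }
  out
}
```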
### Why are the changes needed?
I found that SPARK-32550 had some negative impact on performance. A typical case is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001; the code that is significantly slower is line 41 in `HyperLogLogPlusPlusSuite`:
`new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType))`
https://github.com/apache/spark/blob/08b951b1cb58cea2c34703e43202fe7c84725c8a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala#L40-L44
The size of `hll.aggBufferAttributes` in this case is 209716. The results of the comparison before and after SPARK-32550 was merged are as follows (unit: ns):

|   | After SPARK-32550 createBuffer | After SPARK-32550 end to end | Before SPARK-32550 createBuffer | Before SPARK-32550 end to end |
| -- | -- | -- | -- | -- |
| rsd 0.001, n 1000 | 52715513243 | 53004810687 | 195807999 | 773977677 |
| rsd 0.001, n 5000 | 51881246165 | 52519358215 | 13689949 | 249974855 |
| rsd 0.001, n 10000 | 52234282788 | 52374639172 | 14199071 | 183452846 |
| rsd 0.001, n 50000 | 55503517122 | 55664035449 | 15219394 | 584477125 |
| rsd 0.001, n 100000 | 51862662845 | 52116774177 | 19662834 | 166483678 |
| rsd 0.001, n 500000 | 51619226715 | 52183189526 | 178048012 | 16681330 |
| rsd 0.001, n 1000000 | 54861366981 | 54976399142 | 226178708 | 18826340 |
| rsd 0.001, n 5000000 | 52023602143 | 52354615149 | 388173579 | 15446409 |
| rsd 0.001, n 10000000 | 53008591660 | 53601392304 | 533454460 | 16033032 |
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`
**Before**:
```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
**After**:
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
Closes #29529 from LuciferYang/revert-spark-32550.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: a30bb0c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SpecificInternalRow.scala (diff)
Commit 3eee915b474c58cff9ea108f67073ed9c0c86224 by gurwls223
[MINOR][SQL] Add missing documentation for LongType mapping
### What changes were proposed in this pull request?
Added Java docs for Long data types in the Row class.
### Why are the changes needed?
The Long datatype is somehow missing in Row.scala's `apply` and `get`
methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes #29534 from yeshengm/docs-fix.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 3eee915)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala (diff)
Commit 9151a589a7fa393829f5031044fcb9f9d14851ec by kabhwan.opensource
[SPARK-31608][CORE][WEBUI][TEST] Add test suites for HybridStore and
HistoryServerMemoryManager
### What changes were proposed in this pull request?
This pull request adds 2 test suites for the 2 new classes HybridStore and HistoryServerMemoryManager, which were created in https://github.com/apache/spark/pull/28412. This pull request also made some minor changes in these 2 classes to expose some variables for testing. Besides the 2 suites, this pull request adds a unit test in FsHistoryProviderSuite to test parsing logs with HybridStore.
### Why are the changes needed?
Unit tests are needed for new features.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29509 from baohe-zhang/SPARK-31608-UT.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by:
Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
(commit: 9151a58)
The file was added core/src/test/scala/org/apache/spark/deploy/history/HybridStoreSuite.scala
The file was modified core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala (diff)
The file was added core/src/test/scala/org/apache/spark/deploy/history/HistoryServerMemoryManagerSuite.scala
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HybridStore.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/history/HistoryServerMemoryManager.scala (diff)
Commit f540031419812208594c1ed9255a1bdd41eb26b1 by gurwls223
[SPARK-31000][PYTHON][SQL] Add ability to set table description via
Catalog.createTable()
### What changes were proposed in this pull request?
This PR enhances `Catalog.createTable()` to allow users to set the
table's description. This corresponds to the following SQL syntax:
```sql
CREATE TABLE ... COMMENT 'this is a fancy table';
```
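For illustration, a minimal Scala sketch of using the enhanced API (the exact overload and parameter list shown here are assumptions, and the table name and path are made up), assuming a running SparkSession named `spark`:
```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hedged sketch: create a table and attach a description, mirroring
// CREATE TABLE ... COMMENT '...'. Parameter order/overload is assumed.
val schema = new StructType().add("id", LongType).add("name", StringType)
spark.catalog.createTable(
  "people",                        // hypothetical table name
  "parquet",                       // data source
  schema,
  "this is a fancy table",         // the new table description / COMMENT
  Map("path" -> "/tmp/people"))    // hypothetical location
```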
### Why are the changes needed?
This brings the Scala/Python catalog APIs a bit closer to what's already
possible via SQL.
### Does this PR introduce any user-facing change?
Yes, it adds a new parameter to `Catalog.createTable()`.
### How was this patch tested?
Existing unit tests:
```sh
./python/run-tests \
  --python-executables python3.7 \
  --testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```
```
$ ./build/sbt testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```
Closes #27908 from nchammas/SPARK-31000-table-description.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: f540031)
The file was modified python/pyspark/sql/catalog.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/internal/CatalogSuite.scala (diff)
The file was modified python/pyspark/sql/tests/test_catalog.py (diff)
Commit cee48a966167b81e54069a6bb8d29d8fc8f452ed by gurwls223
[SPARK-32646][SQL][TEST-HADOOP2.7][TEST-HIVE1.2] ORC predicate pushdown
should work with case-insensitive analysis
### What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive
analysis case. The field names in pushed down predicates don't need to
match in exact letter case with physical field names in ORC files, if we
enable case-insensitive analysis.
This is a re-submission of #29457. Because #29457 had a hive-1.2 error
and some other tests were failing with the hive-1.2 profile at the same
time, #29457 was reverted to unblock others.
### Why are the changes needed?
Currently ORC predicate pushdown doesn't work with case-insensitive
analysis. A predicate "a < 0" cannot be pushed down to an ORC file with
field name "A" under case-insensitive analysis.
But Parquet predicate pushdown works in this case. We should make ORC
predicate pushdown work with case-insensitive analysis too.
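For illustration, a minimal sketch of the scenario (made-up path, assuming a spark-shell session named `spark` with the default `spark.sql.caseSensitive=false`):
```scala
// Write an ORC file whose physical field name is the upper-case "A".
spark.range(10).toDF("A").write.mode("overwrite").orc("/tmp/orc_case_test")

// Under case-insensitive analysis the lower-case reference "a" resolves to "A".
// Before this fix the predicate could not be converted into an ORC search
// argument because the pushed-down field name did not match the file's field
// name; with the fix it can be pushed down as well.
spark.read.orc("/tmp/orc_case_test").filter("a < 5").show()
```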
### Does this PR introduce _any_ user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate
pushdown will work.
### How was this patch tested?
Unit tests.
Closes #29530 from viirya/fix-orc-pushdown.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: cee48a9)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcPartitionReaderFactory.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala (diff)
The file was modified sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala (diff)
The file was modified sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/core/v1.2/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala (diff)
Commit 3f1e56d4ca8b64dfb29c26bc32072855c70b6529 by wenchen
[SPARK-32641][SQL] withField + getField should return null if original
struct was null
### What changes were proposed in this pull request?
There is a bug in the way the optimizer rule in
`SimplifyExtractValueOps` is currently written in master branch which
yields incorrect results in scenarios like the following:
``` sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
// currently returns this:
+---+
|d  |
+---+
|4  |
+---+
// when in fact it should return this:
+----+
|d   |
+----+
|null|
+----+
``` The changes in this PR will fix this bug.
### Why are the changes needed?
To fix the aforementioned bug. Optimizer rules should improve the
performance of the  query but yield exactly the same results.
### Does this PR introduce _any_ user-facing change?
Yes, this bug will no longer occur. That said, this isn't something to
be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1
has yet to be released.
### How was this patch tested?
Unit tests were added. Jenkins must pass them.
Closes #29522 from fqaiser94/SPARK-32641.
Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 3f1e56d)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala (diff)
Commit c26a97637fe97cdb090244407d25576d361bb4c3 by wenchen
Revert "[SPARK-32412][SQL] Unify error handling for spark thrift serv…
…er operations"
### What changes were proposed in this pull request?
This reverts commit 510a1656e650246a708d3866c8a400b7a1b9f962.
### Why are the changes needed?
see https://github.com/apache/spark/pull/29204#discussion_r475716547
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
pass ci tools
Closes #29531 from yaooqinn/revert.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: c26a976)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ThriftServerWithSparkContextSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
Commit b78b776c9ebc97bb2d384c823c313df8b81a0235 by gurwls223
[SPARK-32466][SQL][FOLLOW-UP] Normalize Location info in explain plan
### What changes were proposed in this pull request?
1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.
2. Reuse `replaceNotIncludedMsg` to normalize the explain plans generated in `PlanStabilitySuite`.
### Why are the changes needed?
This is a follow-up of https://github.com/apache/spark/pull/29270. It
eliminates personal information (e.g., local directories) from the
explain plans.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated test.
Closes #29537 from Ngone51/follow-up-plan-stablity.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: b78b776)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11.sf100/explain.txt (diff)
The file was modified sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q11/explain.txt (diff)
Commit 1f3bb5175749816be1f0bc793ed5239abf986000 by wenchen
[SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for
datetime pattern F
### What changes were proposed in this pull request?
This PR fixes the doc error and add a migration guide for datetime
pattern.
### Why are the changes needed?
This is a doc bug that we inherited from the JDK: https://bugs.openjdk.java.net/browse/JDK-8169482
The SimpleDateFormat (**F: Day of week in month**) we used in 2.x and the
DateTimeFormatter (**F: week-of-month**) we use now both have the
opposite meaning to what their Java docs declare. Unfortunately, this
also leads to a silent data change in Spark.
The `week-of-month` behavior actually corresponds to the pattern `W` in
DateTimeFormatter, which is banned in Spark 3.x.
If we want to keep pattern `F`, we need to accept the behavior change
with a proper migration guide and fix the doc in Spark.
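As a hedged illustration of where the behavior change can surface (no specific output is claimed here, since it depends on the Spark version; assuming a spark-shell session named `spark`):
```scala
// Format the same date with pattern 'F' — interpreted by SimpleDateFormat in
// Spark 2.x and by DateTimeFormatter in Spark 3.x — so the returned number can
// silently differ across versions. No particular value is asserted here.
spark.sql("SELECT date_format(DATE '2020-08-25', 'F')").show()
```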
### Does this PR introduce _any_ user-facing change?
Yes, doc changed
### How was this patch tested?
passing ci doc generating job
Closes #29538 from yaooqinn/SPARK-32683.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 1f3bb51)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified docs/sql-ref-datetime-pattern.md (diff)
Commit a3179a78b6619f27726bd7fb2d81d2fce5f00fd1 by gurwls223
[SPARK-32664][CORE] Switch the log level from info to debug at
BlockManager.getLocalBlockData
### What changes were proposed in this pull request?
Changing an info log to a debug log, based on SPARK-32664.
### Why are the changes needed?
It is outlined in SPARK-32664.
### Does this PR introduce _any_ user-facing change?
There are changes to the debug and info logs.
### How was this patch tested?
Tested by looking at the logs.
Closes #29527 from dmoore62/SPARK-32664.
Authored-by: Daniel Moore <moore@knights.ucf.edu> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: a3179a7)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManager.scala (diff)
Commit 2feab4ef4f949a4904a58949334269e07bf7d6f0 by gurwls223
[SPARK-32287][TESTS][FOLLOWUP] Add debugging for flaky
ExecutorAllocationManagerSuite
### What changes were proposed in this pull request?
Fixing a flaky test in ExecutorAllocationManagerSuite. The issue is a
timing problem around when numExecutorsToAddPerResourceProfileId gets
reset when we do a reset. The fix is to just always set those values
back to 1 when we call reset().
### Why are the changes needed?
fixing flaky test in ExecutorAllocationManagerSuite
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ran the unit test via this PR a bunch of times and the fix seems to be
working.
Closes #29508 from tgravescs/debugExecAllocTest.
Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 2feab4e)
The file was modified core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala (diff)
Commit a9d4e60a90d4d6765642e6bf7810da117af6437b by gurwls223
[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for
CSV
### What changes were proposed in this pull request?
Spark's CSV source can optionally ignore lines starting with a comment
char. Some code paths check to see if it's set before applying comment
logic (i.e. not set to default of `\0`), but many do not, including the
one that passes the option to Univocity. This means that rows beginning
with a null char were being treated as comments even when 'disabled'.
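For illustration, a minimal sketch of the option in question (made-up path, assuming a spark-shell session named `spark`):
```scala
// With the 'comment' option set, lines starting with '#' are skipped.
val withComments = spark.read
  .option("comment", "#")
  .csv("/tmp/data.csv")          // made-up path

// With the option unset, no line should be treated as a comment; the bug
// described above caused rows starting with a null char ('\u0000') to be
// dropped anyway.
val noComments = spark.read.csv("/tmp/data.csv")
```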
### Why are the changes needed?
To avoid dropping rows that start with a null char when this is not
requested or intended. See JIRA for an example.
### Does this PR introduce _any_ user-facing change?
Nothing beyond the effect of the bug fix.
### How was this patch tested?
Existing tests plus new test case.
Closes #29516 from srowen/SPARK-32614.
Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: a9d4e60)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-1.2 (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)