1. [SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path (commit: baaa756) (details)
  2. [SPARK-32705][SQL] Fix serialization issue for EmptyHashedRelation (commit: eb37976) (details)
  3. [SPARK-32696][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] Get columns operation (commit: f14f374) (details)
  4. [SPARK-30654] Bootstrap4 docs upgrade (commit: ed51a7f) (details)
  5. [SPARK-32701][CORE][DOCS] (commit: 8749b2b) (details)
  6. [SPARK-32713][K8S] Support execId placeholder in executor PVC conf (commit: 182727d) (details)
  7. [SPARK-32693][SQL] Compare two dataframes with same schema except (commit: d6c095c) (details)
  8. [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas (commit: c154629) (details)
  9. [SPARK-28612][SQL][FOLLOWUP] Correct method doc of (commit: 73bfed3) (details)
  10. [SPARK-32722][PYTHON][DOCS] Update document type conversion for Pandas (commit: 5775073) (details)
  11. [SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec (commit: c3b9404) (details)
Commit baaa756deee536a06956d38d92ce81764a1aca54 by wenchen
[SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path
parameter for, DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?
This is a follow-up PR to #29328 to apply the same constraint, that the
`path` option cannot coexist with a path parameter, to
``, `DataStreamReader.load()` and `DataStreamWriter.start()`.
### Why are the changes needed?
The current behavior silently overwrites the `path` option if a path
parameter is passed to ``,
`DataStreamReader.load()` or `DataStreamWriter.start()`.
For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.
### Does this PR introduce _any_ user-facing change?
Yes, if `path` option coexists with path parameter to any of the above
methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and
save() is called with a path parameter. Either remove the path option,
or call save() without the parameter. To ignore this check, set
'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```
The user can restore the previous behavior by setting
`spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
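The constraint described above can be sketched as a small helper. This is a simplified illustration of the rule only; the function name and signature are hypothetical, not Spark's internals:

```python
def resolve_save_path(path_param=None, options=None, legacy_enabled=False):
    """Mimic the described check: reject a 'path' option combined with an
    explicit path parameter, unless the legacy flag is set."""
    options = options or {}
    if path_param is not None and "path" in options and not legacy_enabled:
        raise ValueError(
            "There is a 'path' option set and save() is called with a path "
            "parameter. Either remove the path option, or call save() "
            "without the parameter.")
    # With the legacy flag (or no conflict), the path parameter wins if given.
    return path_param if path_param is not None else options.get("path")
```

With `legacy_enabled=True`, `resolve_save_path("/tmp/path2", {"path": "/tmp/path1"}, True)` returns `"/tmp/path2"`, matching the old silent-overwrite behavior.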
### How was this patch tested?
Added new tests.
Closes #29543 from imback82/path_option.
Authored-by: Terry Kim <> Signed-off-by: Wenchen Fan
(commit: baaa756)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified python/pyspark/sql/tests/ (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified docs/ (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
Commit eb379766f406fc1f91821f9109bacff7f3403fc3 by wenchen
[SPARK-32705][SQL] Fix serialization issue for EmptyHashedRelation
### What changes were proposed in this pull request? Currently,
EmptyHashedRelation and HashedRelationWithAllNullKeys are declared as
plain `object`s, which causes a Java deserialization exception like the
following:
```
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost
task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257,
executor 18):
org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid
```
This PR includes
* Using case object instead to fix serialization issue.
* Also change EmptyHashedRelation not to extend NullAwareHashedRelation
since it's already being used in other non-NAAJ joins.
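The singleton-serialization pitfall can be illustrated outside Spark. In Scala, a plain `object` does not round-trip through Java serialization back to the same instance, while a `case object` does. The Python sketch below shows the analogous behavior with `pickle`; the class names are invented for illustration and the mechanism (`__reduce__` vs. Scala's generated `readResolve`) is only an analogy:

```python
import pickle

class BrokenEmpty:
    """Plain singleton: deserialization creates a *new* object, so
    identity checks against the original instance fail afterwards."""

BROKEN = BrokenEmpty()

def _fixed_singleton():
    return FIXED

class FixedEmpty:
    """Like a Scala `case object`: round-trips back to the singleton."""
    def __reduce__(self):
        # Redirect unpickling to the module-level singleton.
        return (_fixed_singleton, ())

FIXED = FixedEmpty()

broken_copy = pickle.loads(pickle.dumps(BROKEN))
fixed_copy = pickle.loads(pickle.dumps(FIXED))
assert broken_copy is not BROKEN   # new instance after the round trip
assert fixed_copy is FIXED         # singleton identity preserved
```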
### Why are the changes needed? It causes BHJ to fail when the build
side is empty, and BHJ (NAAJ) to fail when the build side has null
partition keys.
### Does this PR introduce _any_ user-facing change? No.
### How was this patch tested?
* Existing UT.
* Run entire TPCDS for E2E coverage.
Closes #29547 from leanken/leanken-SPARK-32705.
Authored-by: xuewei.linxuewei <>
Signed-off-by: Wenchen Fan <>
(commit: eb37976)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala (diff)
Commit f14f3742e0c98dd306abf02e93d2f10d89bc423f by wenchen
[SPARK-32696][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] Get columns operation
should handle interval column properly
### What changes were proposed in this pull request?
This PR let JDBC clients identify spark interval columns properly.
### Why are the changes needed?
JDBC users can query interval values through the thrift server, or
create views with interval columns, e.g.
```sql
CREATE global temp view view1 as select interval 1 day as i;
```
but when they want to get the details of the columns of view1, the call
will fail with `Unrecognized type name: INTERVAL`:
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type
at org.apache.hadoop.hive.serde2.thrift.Type.getType(
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
at scala.Option.foreach(Option.scala:407)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
... 34 more
```
### Does this PR introduce _any_ user-facing change?
#### before
#### after
### How was this patch tested?
new tests
Closes #29539 from yaooqinn/SPARK-32696.
Authored-by: Kent Yao <> Signed-off-by: Wenchen Fan
(commit: f14f374)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkMetadataOperationSuite.scala (diff)
The file was modified sql/hive-thriftserver/v1.2/src/main/scala/org/apache/spark/sql/hive/thriftserver/ThriftserverShimUtils.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff)
The file was modified sql/hive-thriftserver/v2.3/src/main/scala/org/apache/spark/sql/hive/thriftserver/ThriftserverShimUtils.scala (diff)
Commit ed51a7f083936e9214f27837ba788c766e1e599c by srowen
[SPARK-30654] Bootstrap4 docs upgrade
### What changes were proposed in this pull request? We are using an
older version of Bootstrap (v2.1.0) for the online documentation site.
Bootstrap 2.x was moved to EOL in August 2013 and Bootstrap 3.x was
moved to EOL in July 2019. Older versions of Bootstrap are also getting
flagged in security scans for various CVEs.
I haven't validated each CVE, but it would probably be good practice to
resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between
Bootstrap 2 and Bootstrap 4.  I've tried updating the library,
refactoring/tweaking the CSS and JS to maintain a similar appearance and
functionality, and testing the documentation.  This is a fairly large
change so I'm sure additional testing and fixes will be needed.
### How was this patch tested? This has been manually tested, but as
there is a lot of documentation it is possible issues were missed.
Additional testing and feedback are welcome. If it appears a whole
section was missed, let me know and I'll take a pass at addressing it.
Closes #27369 from clarkead/bootstrap4-docs-upgrade.
Authored-by: Dale Clarke <> Signed-off-by: Sean
Owen <>
(commit: ed51a7f)
The file was removed docs/js/vendor/bootstrap.js
The file was added docs/js/vendor/bootstrap.bundle.min.js
The file was added docs/css/
The file was modified docs/js/main.js (diff)
The file was removed docs/js/vendor/bootstrap.min.js
The file was modified docs/css/bootstrap.min.css (diff)
The file was added docs/js/vendor/
The file was removed docs/css/bootstrap-responsive.css
The file was modified docs/_layouts/global.html (diff)
The file was removed docs/css/bootstrap-responsive.min.css
The file was removed docs/css/bootstrap.css
The file was modified docs/css/main.css (diff)
Commit 8749b2b6fae5ee0ce7b48aae6d859ed71e98491d by srowen
[SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value
The current documentation states that the default value of
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 which
is not entirely true since this configuration isn't set anywhere in
Spark but rather inherited from the Hadoop FileOutputCommitter class.
### What changes were proposed in this pull request?
I'm submitting this change to clarify that the default value depends
entirely on the Hadoop version of the runtime environment.
### Why are the changes needed?
An application would end up using algorithm version 1 in certain
environments, but without any changes the same exact application will
use version 2 in environments running Hadoop 3.0 and later. This can
have pretty bad consequences in certain scenarios; for example, two
tasks can partially overwrite their output if speculation is enabled.
Also, please refer to the related JIRA.
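Given the behavior described above, an application that wants deterministic behavior can pin the algorithm version explicitly instead of relying on the Hadoop default, e.g. in `spark-defaults.conf` (a sketch; choose the version appropriate for your workload):

```
# Pin the commit algorithm explicitly rather than inheriting Hadoop's default
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1
```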
### Does this PR introduce _any_ user-facing change?
Yes. The configuration page content was modified: where previously we
explicitly highlighted that the default version for the
FileOutputCommitter algorithm was v1, this has now changed to "Dependent
on environment", with additional information in the description column
to explain the behavior.
### How was this patch tested?
Checked changes locally in browser
Closes #29541 from waleedfateem/SPARK-32701.
Authored-by: waleedfateem <> Signed-off-by: Sean
Owen <>
(commit: 8749b2b)
The file was modified docs/ (diff)
Commit 182727d90fb849cce6b611997ae228b3d8cd5675 by dongjoon
[SPARK-32713][K8S] Support execId placeholder in executor PVC conf
### What changes were proposed in this pull request?
This PR aims to support an executor id placeholder
(`SPARK_EXECUTOR_ID`) in the executor PVC configuration.
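A sketch of what such a configuration might look like; the volume name (`data`), mount path, and claim-name pattern below are illustrative, with `SPARK_EXECUTOR_ID` as the placeholder replaced per executor:

```
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=pvc-SPARK_EXECUTOR_ID
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
```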
### Why are the changes needed?
This is a convenient way to mount corresponding PV to the executor.
### Does this PR introduce _any_ user-facing change?
Yes, but this is a new feature and there is no regression, because users
don't currently use `SPARK_EXECUTOR_ID` in PVC claim names.
### How was this patch tested?
Pass the newly added test case.
Closes #29557 from dongjoon-hyun/SPARK-PVC.
Authored-by: Dongjoon Hyun <> Signed-off-by: Dongjoon
Hyun <>
(commit: 182727d)
The file was modified resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStepSuite.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/MountVolumesFeatureStep.scala (diff)
Commit d6c095c92c739eadd7f336e2a812ffee324cb1ef by yamamuro
[SPARK-32693][SQL] Compare two dataframes with same schema except
nullable property
### What changes were proposed in this pull request?
This PR changes key data types check in `HashJoin` to use `sameType`.
### Why are the changes needed?
Looking at the resolution condition of `SetOperation`, it requires only
that each left data type be `sameType` with the corresponding right one.
Logically, the `EqualTo` expression in an equi-join also requires only
that the left data type be `sameType` with the right data type. Yet
`HashJoin` requires the left key data types to be exactly the same as
the right key data types, which looks unreasonable.
It leads to inconsistent results when doing `except` between two
dataframes.
If two dataframes don't have nested fields, `HashJoin` passes the key
type check even when their fields' nullability differs, because it
checks each field individually, so the nullable property is ignored.
If two dataframes have nested fields such as structs, `HashJoin` fails
the key type check, because it then compares the two struct types as a
whole and the nullable property matters.
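A rough sketch of the difference between exact type equality and a nullability-ignoring, `sameType`-style comparison; the `Field`/`Struct` classes below are simplified stand-ins, not Catalyst's types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    dtype: object
    nullable: bool = True

@dataclass(frozen=True)
class Struct:
    fields: tuple

def same_type(a, b):
    """Compare types recursively while ignoring the nullable flag."""
    if isinstance(a, Struct) and isinstance(b, Struct):
        return (len(a.fields) == len(b.fields) and
                all(f1.name == f2.name and same_type(f1.dtype, f2.dtype)
                    for f1, f2 in zip(a.fields, b.fields)))
    return a == b

left = Struct((Field("a", "int", nullable=True),))
right = Struct((Field("a", "int", nullable=False),))
assert left != right            # exact equality breaks on nullability
assert same_type(left, right)   # sameType-style comparison succeeds
```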
### Does this PR introduce _any_ user-facing change?
Yes. It makes the `except` operation between dataframes consistent.
### How was this patch tested?
Unit test.
Closes #29555 from viirya/SPARK-32693.
Authored-by: Liang-Chi Hsieh <> Signed-off-by: Takeshi
Yamamuro <>
(commit: d6c095c)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala (diff)
Commit c15462917114c9066001a8c003597fd609fc50e4 by gurwls223
[SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas
with Apache Arrow
### What changes were proposed in this pull request?
This PR proposes to move Arrow usage guide from Spark documentation site
to PySpark documentation site (at "User Guide").
Here is the demo for quicker review:
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it will move the page to our PySpark documentation.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```
```bash
cd python/docs
make clean html
```
Closes #29548 from HyukjinKwon/SPARK-32183.
Authored-by: HyukjinKwon <> Signed-off-by:
HyukjinKwon <>
(commit: c154629)
The file was modified docs/ (diff)
The file was added python/docs/source/user_guide/arrow_pandas.rst
The file was modified python/docs/source/user_guide/index.rst (diff)
The file was modified examples/src/main/python/sql/ (diff)
The file was modified python/pyspark/sql/pandas/ (diff)
The file was modified python/docs/source/reference/pyspark.sql.rst (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit 73bfed3633be4322ae2a56df1f382d4749cc87f2 by gurwls223
[SPARK-28612][SQL][FOLLOWUP] Correct method doc of DataFrameWriterV2.replace()
### What changes were proposed in this pull request?
This patch corrects the method doc of DataFrameWriterV2.replace(), whose
explanation of the thrown exception was described in the opposite way.
### Why are the changes needed?
The method doc is incorrect.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Only doc change.
Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit.
Authored-by: Jungtaek Lim (HeartSaVioR) <>
Signed-off-by: HyukjinKwon <>
(commit: 73bfed3)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriterV2.scala (diff)
Commit 5775073a01c0268db69dffd77dd90e855fb0803e by gurwls223
[SPARK-32722][PYTHON][DOCS] Update document type conversion for Pandas
UDFs (pyarrow 1.0.1, pandas 1.1.1, Python 3.7)
### What changes were proposed in this pull request?
This PR updates the chart generated at SPARK-25666. We bumped up the
minimal PyArrow version. It's better to use PyArrow 0.15.1+
### Why are the changes needed?
To track the changes in type coercion of PySpark <> PyArrow <> pandas.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Use this code to generate the chart:
```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float16', 'float16'),
    ('float32', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32),
                            np.array([1, 2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [[1], [0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2,
                                          tz='US/Eastern')
    return pdf

types = [  # several SQL types are elided in this log
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row =
                lambda _: create_dataframe()[column], t)("id")).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): [%s(%s)]\n  " \
            "Result Python Value: [%s]" % (
                t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(
    lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]),
    zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema) \
    ._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line,
                    strings.strip().split("\n"))))
```
Closes #29569 from HyukjinKwon/SPARK-32722.
Authored-by: HyukjinKwon <> Signed-off-by:
HyukjinKwon <>
(commit: 5775073)
The file was modified python/pyspark/sql/pandas/ (diff)
Commit c3b940425396ac61d9efa657c8999ef5f5ef2152 by gurwls223
[SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec
### What changes were proposed in this pull request?
This PR proposes to add a specific `AQEOptimizer` for the
`AdaptiveSparkPlanExec` instead of implementing an anonymous
`RuleExecutor`. At the same time, this PR also adds the configuration
`spark.sql.adaptive.optimizer.excludedRules`, which follows the same
pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for
users and developers.
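Following the pattern of `spark.sql.optimizer.excludedRules`, a user could then exclude an AQE optimizer rule by its fully qualified name, e.g. (the rule name below is only an example of a rule in the adaptive package; check the package for the rules actually available):

```
spark.sql.adaptive.optimizer.excludedRules=org.apache.spark.sql.execution.adaptive.DemoteBroadcastHashJoin
```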
### Why are the changes needed?
Currently, `AdaptiveSparkPlanExec` has implemented an anonymous
`RuleExecutor` to apply the AQE optimize rules on the plan. However, the
anonymous class usually could be inconvenient to maintain and extend for
the long term.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
It's a pure refactor, so passing the existing tests should be enough.
Closes #29559 from Ngone51/impro-aqe-optimizer.
Authored-by: yi.wu <> Signed-off-by: HyukjinKwon
(commit: c3b9404)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)