Changes

Summary

  1. Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML files are found" (commit: a7a8dae) (details)
  2. [SPARK-33169][SQL][TESTS] Check propagation of datasource options to underlying file system for built-in file-based datasources (commit: 26b13c7) (details)
  3. [SPARK-32941][SQL] Optimize UpdateFields expression chain and put the rule early in Analysis phase (commit: 66c5e01) (details)
  4. [SPARK-17333][PYSPARK] Enable mypy (commit: 6ad75cd) (details)
  5. [SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQL Select Reference (commit: f65a244) (details)
  6. [SPARK-32351][SQL] Show partially pushed down partition filters in explain() (commit: 3513390) (details)
  7. [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing (commit: a44e008) (details)
  8. [SPARK-32229][SQL] Fix PostgresConnectionProvider and MSSQLConnectionProvider by accessing wrapped driver (commit: fbb6843) (details)
  9. [SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in GitHub Actions (commit: eb9966b) (details)
  10. [SPARK-33191][YARN][TESTS] Fix PySpark test cases in YarnClusterSuite (commit: 2cfd215) (details)
  11. [MINOR][DOCS] Fix the description about to_avro and from_avro functions (commit: 46ad325) (details)
  12. [MINOR][CORE] Improve log message during storage decommission (commit: c824db2) (details)
  13. [SPARK-33198][CORE] getMigrationBlocks should not fail at missing files (commit: 385d5db) (details)
  14. [SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested timestamps in pyarrow (commit: 47a6568) (details)
Commit a7a8dae4836f455a26ba6cb3c7d733775b6af0f6 by gurwls223
Revert "[SPARK-33069][INFRA] Skip test result report if no JUnit XML
files are found"
This reverts commit a0aa8f33a9420feb9228b51a3dfad2e7e86d65a5.
(commit: a7a8dae)
The file was modified .github/workflows/test_report.yml (diff)
Commit 26b13c70c312147e42db27cd986e970115a55cdd by gurwls223
[SPARK-33169][SQL][TESTS] Check propagation of datasource options to
underlying file system for built-in file-based datasources
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test to `CommonFileDataSourceSuite` that checks that datasource options are propagated to the underlying file systems as Hadoop configs (the idea is sketched below).
3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`.
4. Remove the duplicated tests from `AvroSuite` and from `OrcSourceSuite`.
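A minimal, hypothetical sketch of the idea behind such a shared check (this is not the actual `CommonFileDataSourceSuite` code; the probe option name and the standalone-app framing are illustrative):
```
// Illustrative only: write with a probe option; built-in file-based datasources
// copy per-write options into the Hadoop configuration handed to the FileSystem.
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object OptionPropagationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    val out = Files.createTempDirectory("probe").toString + "/out"
    spark.range(10).write
      .option("fs.probe.option", "probe-value") // made-up option name
      .format("csv")
      .save(out)
    // The real suite registers a custom FileSystem for a fake scheme and asserts
    // that it observed the option in the Hadoop configuration it received.
    spark.stop()
  }
}
```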
### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By running the affected test suites.
Closes #30067 from MaxGekk/ds-options-common-test.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 26b13c7)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala (diff)
The file was modified external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormatSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala (diff)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/CommonFileDataSourceSuite.scala
Commit 66c5e0132209a5a94f9d7efb5e895f143b0ef53b by dhyun
[SPARK-32941][SQL] Optimize UpdateFields expression chain and put the
rule early in Analysis phase
### What changes were proposed in this pull request?
This patch proposes to add more optimization to `UpdateFields`
expression chain. And optimize `UpdateFields` early in analysis phase.
### Why are the changes needed?
`UpdateFields` can manipulate complex nested data, but using it can easily create an inefficient expression chain, so we should optimize it further.
Because the `UpdateFields` expression tree can become too complex to analyze when manipulating a deeply nested schema, this change optimizes `UpdateFields` early in the analysis phase.
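For context, a small sketch of how such a chain arises through the DataFrame API (the column and field names are made up; this is illustrative, not part of the change):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, struct}

object UpdateFieldsChainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // A struct column "s" with fields a and b.
    val df = Seq((1, 2)).toDF("a", "b").select(struct($"a", $"b").as("s"))

    // Each withField wraps the column in another UpdateFields node; chains like
    // this are what the rule collapses early in the analysis phase.
    val updated = df.select(
      $"s".withField("c", lit(3))
        .withField("d", lit(4))
        .withField("a", lit(0))
        .as("s"))

    updated.explain(true) // the analyzed/optimized plans show the collapsed expression
    spark.stop()
  }
}
```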
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #29812 from viirya/SPARK-32941.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 66c5e01)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was removed sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CombineUpdateFieldsSuite.scala
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala (diff)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeWithFieldsSuite.scala
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala (diff)
Commit 6ad75cda1eb9704ca1fd1539ea80454d66681965 by dhyun
[SPARK-17333][PYSPARK] Enable mypy
### What changes were proposed in this pull request?
Add MyPy to the CI. Once this is installed on the CI
(https://issues.apache.org/jira/browse/SPARK-32797?jql=project%20%3D%20SPARK%20AND%20text%20~%20mypy),
it will automatically check the types.
### Why are the changes needed?
We should check if the types are still correct on the CI.
```
MacBook-Pro-van-Fokko:spark fokkodriesprong$ ./dev/lint-python
starting python compilation test...
python compilation succeeded.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks passed.
starting mypy test...
mypy checks passed.
The sphinx-build command was not found. Skipping Sphinx build for now.
all lint-python tests passed!
```
### Does this PR introduce _any_ user-facing change?
No :)
### How was this patch tested?
By running `./dev/lint-python` locally.
Closes #30088 from Fokko/SPARK-17333.
Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 6ad75cd)
The file was modified .gitignore (diff)
The file was modified dev/lint-python (diff)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit f65a24412b6691ecdb4254e70d6e7abc846edb66 by gurwls223
[SPARK-33181][SQL][DOCS] Document Load Table Directly from File in SQL
Select Reference
### What changes were proposed in this pull request?
Add a link to the feature "Run SQL on files directly" to the SQL reference documentation page.
### Why are the changes needed?
To make the SQL Reference complete.
### Does this PR introduce _any_ user-facing change?
Yes. Previously, reading from a file directly in SQL was not included in the documentation at https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select.html and was not listed among the from_items. The new link is added to the SELECT statement documentation, as shown below:
![image](https://user-images.githubusercontent.com/16770242/96517999-c34f3900-121e-11eb-8d56-c4ba0432855e.png)
![image](https://user-images.githubusercontent.com/16770242/96518808-8126f700-1220-11eb-8c98-fb398eee0330.png)
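For context, a minimal sketch of the "Run SQL on files directly" feature the new link points to (the temporary path and sample data are made up):
```
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SqlOnFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Write a small Parquet file, then query it directly from SQL without
    // registering a table first.
    val path = Files.createTempDirectory("sql-on-files").toString + "/users"
    Seq((1, "alice"), (2, "bob")).toDF("id", "name").write.parquet(path)
    spark.sql(s"SELECT * FROM parquet.`$path`").show()
    spark.stop()
  }
}
```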
### How was this patch tested?
Manually built and tested
Closes #30095 from liaoaoyuan97/master.
Authored-by: liaoaoyuan97 <al3468@columbia.edu> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: f65a244)
The file was modified docs/sql-ref-syntax-qry-select.md (diff)
Commit 35133901f79209bd5e6e3e17531095d0ecae737d by gurwls223
[SPARK-32351][SQL] Show partially pushed down partition filters in
explain()
### What changes were proposed in this pull request?
Currently, actual non-dynamic partition pruning is executed in the
optimizer phase (PruneFileSourcePartitions) if an input relation has a
catalog file index. The current code assumes the same partition filters
are generated again in FileSourceStrategy and passed into
FileSourceScanExec. FileSourceScanExec uses the partition filters when
listing files, but these non-dynamic partition filters do nothing
because unnecessary partitions are already pruned in advance, so the
filters are mainly used for explain output in this case. If a WHERE
clause has DNF-ed predicates, FileSourceStrategy cannot extract the same
filters as PruneFileSourcePartitions does, so PartitionFilters is not shown in the explain output.
This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy again, using `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c, so that the partially pushed down partition filters are shown in explain().
### Why are the changes needed?
Without the patch, the explained plan is inconsistent with what is actually executed.
<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and Hive tables look like the following, respectively (the pushed down partition filters are missing):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
  +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters:
[], Format: Parquet, Location:
InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war...,
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```
```
  == Physical Plan ==
  *(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
  +- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32],
Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```
<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
  +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters:
[], Format: Parquet, Location:
InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war...,
PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [],
ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36],
Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1)
OR (p#37 = 2))]
```
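A hedged sketch of how one might reproduce plans like the above locally (the table layout mirrors the example query; the sample data and local setup are assumptions):
```
import org.apache.spark.sql.SparkSession

object PartitionFilterExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // A table partitioned by p, matching the shape of the example query above.
    Seq((1, "1"), (2, "2"), (3, "3")).toDF("i", "p")
      .write.partitionBy("p").mode("overwrite").saveAsTable("t")

    // With the change, PartitionFilters should show ((p = 1) OR (p = 2)).
    spark.sql("SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)").explain()
    spark.stop()
  }
}
```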
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #29831 from CodingCat/SPARK-32351.
Lead-authored-by: Nan Zhu <nanzhu@uber.com> Co-authored-by: Nan Zhu
<CodingCat@users.noreply.github.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 3513390)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PrunePartitionSuiteBase.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (diff)
Commit a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e by gurwls223
[SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control rebasing of timestamps when saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite`, but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` not set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` to control loading of INT96 timestamps when the parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about the parquet writer - whether the INT96 values were written by a Proleptic Gregorian system or by a Julian one.
4. Modify the Vectorized and Parquet-mr readers to support loading/saving INT96 timestamps w/o rebasing, depending on the SQL configs and the metadata tag:
   - **No rebasing** in testing, when the SQL config `spark.test.forceNoRebase` is set to `true`.
   - **No rebasing** if the parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when the parquet files were saved by Spark >= 3.1 with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
   - **With rebasing** if the parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
   - Rebasing depends on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.
New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` for these reasons:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.
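A hedged usage sketch based on the configs described above (the `spark.sql.parquet.outputTimestampType` setting and the sample timestamp are assumptions, not part of this change; with the default `LEGACY` mode the behavior stays as before):
```
import java.nio.file.Files
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object Int96RebaseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Write timestamps as INT96 (assumption for this sketch) without rebasing.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
    spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")

    val path = Files.createTempDirectory("int96").toString + "/ts"
    Seq(("x", Timestamp.valueOf("1001-01-01 01:02:03"))).toDF("k", "ts")
      .write.parquet(path)
    // The metadata tag described above lets readers skip rebasing on load.
    spark.read.parquet(path).show(false)
    spark.stop()
  }
}
```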
### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from the Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in another one. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write Proleptic Gregorian based INT96 timestamps.
### Does this PR introduce _any_ user-facing change?
It can when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a non-default value.
### How was this patch tested?
- Added a test to check the metadata key
`org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`
Closes #30056 from MaxGekk/parquet-rebase-int96.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: a44e008)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/package.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala (diff)
The file was modified sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRecordMaterializer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetPartitionReaderFactory.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
Commit fbb68436203627186e4070cac674707283c9dcc2 by yamamuro
[SPARK-32229][SQL] Fix PostgresConnectionProvider and
MSSQLConnectionProvider by accessing wrapped driver
### What changes were proposed in this pull request?
The Postgres and MSSQL connection providers are not able to get the custom `appEntry` because under some circumstances the driver is wrapped with `DriverWrapper`. Such a case is not handled in the mentioned providers. In this PR I've added handling for this edge case by passing the unwrapped `Driver` from `JdbcUtils` (sketched below).
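A purely illustrative sketch of the unwrapping idea; `WrappedDriver` below is a locally defined stand-in for Spark's internal `DriverWrapper`, so this is not the actual `JdbcUtils` change:
```
import java.sql.{Connection, Driver, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

// Stand-in for the internal wrapper class mentioned above, defined locally so
// the snippet is self-contained.
class WrappedDriver(val wrapped: Driver) extends Driver {
  def connect(url: String, info: Properties): Connection = wrapped.connect(url, info)
  def acceptsURL(url: String): Boolean = wrapped.acceptsURL(url)
  def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    wrapped.getPropertyInfo(url, info)
  def getMajorVersion(): Int = wrapped.getMajorVersion
  def getMinorVersion(): Int = wrapped.getMinorVersion
  def jdbcCompliant(): Boolean = wrapped.jdbcCompliant()
  def getParentLogger(): Logger = wrapped.getParentLogger
}

object UnwrapDriverSketch {
  // Connection providers should be handed the real driver, not the wrapper,
  // so driver-specific lookups still match the expected driver class.
  def unwrap(driver: Driver): Driver = driver match {
    case w: WrappedDriver => w.wrapped
    case other => other
  }
}
```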
### Why are the changes needed?
`DriverWrapper` is not considered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Closes #30024 from gaborgsomogyi/SPARK-32229.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: fbb6843)
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/TestDriver.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistrySuite.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/DriverRegistry.scala (diff)
Commit eb9966b70055a67dd02451c78ec205d913a38a42 by gurwls223
[SPARK-33190][INFRA][TESTS] Set upper bound of PyArrow version in GitHub
Actions
### What changes were proposed in this pull request?
PyArrow 2.0.0 was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:
```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key
(pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File
"/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py",
line 595, in test_grouped_over_window_with_key
   .select('id', 'result').collect()
File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in
collect
   sock_info = self._jdf.collectToPython()
File
"/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
line 1305, in __call__
   answer, self.gateway_client, self.target_id, self.name)
File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
   raise converted from None pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack
trace below. Traceback (most recent call last):
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line
601, in main
   process()
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line
593, in process
   serializer.dump_stream(out_iter, outfile)
File
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
line 255, in dump_stream
   return ArrowStreamSerializer.dump_stream(self,
init_stream_yield_batches(), stream)
File
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
line 81, in dump_stream
   for batch in iterator:
File
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py",
line 248, in init_stream_yield_batches
   for series in iterator:
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line
426, in mapper
   return f(keys, vals)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line
170, in <lambda>
   return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line
158, in wrapped
   result = f(key, pd.concat(value_series, axis=1))
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line
68, in wrapper
   return f(*args, **kwargs)
File
"/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py",
line 590, in f
   "{} != {}".format(expected_key[i][1], window_range) AssertionError:
{'start': datetime.datetime(2018, 3, 15, 0, 0), 'end':
datetime.datetime(2018, 3, 20, 0, 0)} != {'start':
datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>),
'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo
'Etc/UTC'>)}
```
https://github.com/apache/spark/runs/1278917457
This PR proposes to set an upper bound on the PyArrow version in the GitHub Actions build. This should be removed when we properly support PyArrow 2.0.0+ (SPARK-33189).
### Why are the changes needed?
To make build pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions in this build will test it out.
Closes #30098 from HyukjinKwon/hot-fix-test.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: eb9966b)
The file was modified .github/workflows/build_and_test.yml (diff)
Commit 2cfd215dc4fb1ff6865644fec8284ba93dcddd5c by gurwls223
[SPARK-33191][YARN][TESTS] Fix PySpark test cases in YarnClusterSuite
### What changes were proposed in this pull request?
This PR proposes to fix:
```
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-client mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```
it currently fails as below:
```
20/10/16 19:20:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0) (amp-jenkins-worker-03.amp executor 1):
org.apache.spark.SparkException: Error from python worker:
Traceback (most recent call last):
   File "/usr/lib64/python2.6/runpy.py", line 104, in
_run_module_as_main
     loader, code, fname = _get_module_details(mod_name)
   File "/usr/lib64/python2.6/runpy.py", line 79, in _get_module_details
     loader = get_loader(mod_name)
   File "/usr/lib64/python2.6/pkgutil.py", line 456, in get_loader
     return find_loader(fullname)
   File "/usr/lib64/python2.6/pkgutil.py", line 466, in find_loader
     for importer in iter_importers(fullname):
   File "/usr/lib64/python2.6/pkgutil.py", line 422, in iter_importers
     __import__(pkg)
   File
"/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/__init__.py",
line 53, in <module>
     from pyspark.rdd import RDD, RDDBarrier
   File
"/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/rdd.py",
line 34, in <module>
     from pyspark.java_gateway import local_connect_and_auth
   File
"/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/java_gateway.py",
line 29, in <module>
     from py4j.java_gateway import java_import, JavaGateway, JavaObject,
GatewayParameters
   File
"/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
line 60
     PY4J_TRUE = {"yes", "y", "t", "true"}
                       ^
SyntaxError: invalid syntax
```
I think this was broken when Python 2 was dropped but was not caught because this specific test does not run when there's no change in the YARN code. See also https://github.com/apache/spark/pull/29843#issuecomment-712540024.
The root cause seems to be that the paths are different; see https://github.com/apache/spark/pull/29843#pullrequestreview-502595199. I _think_ Jenkins uses a different Python executable via Anaconda and the executor side does not know where it is for some reason.
This PR proposes to fix it by explicitly specifying the absolute path of the Python executable so that the tests pass in any environment (see the sketch below).
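A hedged sketch of the underlying idea, pointing Spark at an absolute Python path via the existing `spark.pyspark.python` confs (the path and the standalone-app framing are assumptions; the real change wires this through the YARN test utilities):
```
import org.apache.spark.sql.SparkSession

object ExplicitPythonPathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sketch")
      // Absolute paths are assumptions; the point is to avoid relying on PATH
      // resolution differing between driver and executor environments.
      .config("spark.pyspark.python", "/usr/bin/python3")
      .config("spark.pyspark.driver.python", "/usr/bin/python3")
      .getOrCreate()
    println(spark.sparkContext.getConf.get("spark.pyspark.python"))
    spark.stop()
  }
}
```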
### Why are the changes needed?
To make tests pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This issue looks specific to Jenkins. It should run the tests on
Jenkins.
Closes #30099 from HyukjinKwon/SPARK-33191.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 2cfd215)
The file was modified core/src/main/scala/org/apache/spark/TestUtils.scala (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (diff)
Commit 46ad325e56abd95c0ffdbe64aad78582da8c725d by gurwls223
[MINOR][DOCS] Fix the description about to_avro and from_avro functions
### What changes were proposed in this pull request?
This pull request changes the description of the `to_avro` and `from_avro` functions to include Python as a supported language, as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].
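For context, a small Scala sketch of the functions whose description is being updated (it assumes the external spark-avro module is on the classpath):
```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}

object AvroFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    val avroSchema = """{"type": "long"}""" // Avro schema in JSON form
    val df = Seq(1L, 2L).toDF("id")

    val encoded = df.select(to_avro($"id").as("avro"))             // Catalyst -> Avro binary
    encoded.select(from_avro($"avro", avroSchema).as("id")).show() // Avro binary -> Catalyst
    spark.stop()
  }
}
```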
### Why are the changes needed?
Same as above.
### Does this PR introduce _any_ user-facing change?
Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.
### How was this patch tested?
Tested manually by building and checking the document in the local environment.
Closes #30105 from kjmrknsn/fix-docs-sql-data-sources-avro.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 46ad325)
The file was modified docs/sql-data-sources-avro.md (diff)
Commit c824db2d8b154acf51637844f5f268e988bd0081 by dhyun
[MINOR][CORE] Improve log message during storage decommission
### What changes were proposed in this pull request?
This PR aims to improve the log message for better analysis.
### Why are the changes needed?
Good logs are always crucial.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual review.
Closes #30109 from dongjoon-hyun/k8s_log.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: c824db2)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff)
Commit 385d5db9413a7f23c8a4c2d802541e88ce3a4633 by dhyun
[SPARK-33198][CORE] getMigrationBlocks should not fail at missing files
### What changes were proposed in this pull request?
This PR aims to fix `getMigrationBlocks` error handling and to add test coverage.
1. `getMigrationBlocks` should not fail in the index-file-only case.
2. `assert` causes `java.lang.AssertionError`, which is not an `Exception` (illustrated below).
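A tiny illustration of the second point (generic Scala, not the actual resolver code): `assert` throws `java.lang.AssertionError`, which extends `Error`, so a `catch { case e: Exception => ... }` handler does not catch it:
```
object AssertVsExceptionSketch {
  def main(args: Array[String]): Unit = {
    try {
      assert(false, "boom") // throws java.lang.AssertionError
    } catch {
      case e: Exception => println(s"caught Exception: $e") // never reached
      case e: Throwable => println(s"caught Throwable: $e") // AssertionError lands here
    }
  }
}
```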
### Why are the changes needed?
To handle the exception correctly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes #30110 from dongjoon-hyun/SPARK-33198.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 385d5db)
The file was modified core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/shuffle/sort/IndexShuffleBlockResolverSuite.scala (diff)
Commit 47a6568265525002021c1e5cfa4330f5b1a91469 by gurwls223
[SPARK-33189][PYTHON][TESTS] Add env var to tests for legacy nested
timestamps in pyarrow
### What changes were proposed in this pull request?
Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests
in run-tests.py to use legacy nested timestamp behavior. This means that
when converting arrow to pandas, nested timestamps with timezones will
have the timezone localized during conversion.
### Why are the changes needed?
The default behavior was changed in PyArrow 2.0.0 to propagate timezone
information. Using the environment variable enables testing with newer
versions of pyarrow until the issue can be fixed in SPARK-32285.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #30111 from
BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 47a6568)
The file was modified .github/workflows/build_and_test.yml (diff)
The file was modified python/run-tests.py (diff)