Changes

Summary

  1. [SPARK-33143][PYTHON] Add configurable timeout to python server and client (commit: 0bb911d) (details)
  2. [SPARK-33510][BUILD] Update SBT to 1.4.4 (commit: 84e7036) (details)
  3. Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash" (commit: c891e02) (details)
  4. [SPARK-33515][SQL] Improve exception messages while handling UnresolvedTable (commit: 60f3a73) (details)
  5. [SPARK-33511][SQL] Respect case sensitivity while resolving V2 partition specs (commit: 23e9920) (details)
Commit 0bb911d979955ac59adc39818667b616eb539103 by gurwls223
[SPARK-33143][PYTHON] Add configurable timeout to python server and
client
### What changes were proposed in this pull request?
Spark creates a local server to serialize several types of data for Python. The Python code tries to connect to the server immediately after it's created, but there are several system calls in between (these may change in each Spark version):
* getaddrinfo
* socket
* settimeout
* connect
Under some circumstances, in heavy user environments, these calls can be very slow (more than 15 seconds). Such issues must be analyzed one by one, but since they are system calls, the underlying OS and/or DNS servers must be debugged and fixed; that is not a trivial task, and in the meantime data processing must keep working somehow. This PR only intends to add a configuration option to increase the mentioned timeouts so as to provide a temporary workaround. Root-cause analysis is ongoing, but the cause can vary from case to case. Because the server side does not contain enough log entries with which one can measure elapsed time, some have been added.
### Why are the changes needed?
Provides a workaround when a connection timeout to the localhost Python server appears.
### Does this PR introduce _any_ user-facing change?
Yes, a new configuration is added.
### How was this patch tested?
Existing unit tests + a manual test:
```
# Compile Spark
echo "spark.io.encryption.enabled true" >> conf/spark-defaults.conf
echo "spark.python.authenticate.socketTimeout 10" >> conf/spark-defaults.conf

$ ./bin/pyspark
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
20/11/20 10:17:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/11/20 10:17:03 WARN SparkEnv: I/O encryption enabled without RPC encryption: keys will be visible on the wire.
Welcome to
     ____              __
    / __/__  ___ _____/ /__
   _\ \/ _ \/ _ `/ __/  '_/
  /__ / .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
     /_/

Using Python version 3.8.5 (default, Jul 21 2020 10:48:26)
Spark context Web UI available at http://192.168.0.189:4040
Spark context available as 'sc' (master = local[*], app id = local-1605863824276).
SparkSession available as 'spark'.
>>> sc.setLogLevel("TRACE")
>>> sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
20/11/20 10:17:09 TRACE PythonParallelizeServer: Creating listening socket
20/11/20 10:17:09 TRACE PythonParallelizeServer: Setting timeout to 10 sec
20/11/20 10:17:09 TRACE PythonParallelizeServer: Waiting for connection on port 59726
20/11/20 10:17:09 TRACE PythonParallelizeServer: Connection accepted from address /127.0.0.1:59727
20/11/20 10:17:09 TRACE PythonParallelizeServer: Client authenticated
20/11/20 10:17:09 TRACE PythonParallelizeServer: Closing server
...
20/11/20 10:17:10 TRACE SocketFuncServer: Creating listening socket
20/11/20 10:17:10 TRACE SocketFuncServer: Setting timeout to 10 sec
20/11/20 10:17:10 TRACE SocketFuncServer: Waiting for connection on port 59735
20/11/20 10:17:10 TRACE SocketFuncServer: Connection accepted from address /127.0.0.1:59736
20/11/20 10:17:10 TRACE SocketFuncServer: Client authenticated
20/11/20 10:17:10 TRACE SocketFuncServer: Closing server
[[0], [2], [3], [4], [6]]
>>>
```
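For illustration only, here is a minimal sketch (not part of the PR) of applying the same settings programmatically through `SparkConf` instead of `spark-defaults.conf`; the app name and master are arbitrary, and the config keys are the ones shown in the manual test above:
```python
# Illustrative sketch, not from the PR: set the new socket timeout via SparkConf.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("socket-timeout-demo")       # arbitrary app name
    .setMaster("local[*]")
    .set("spark.io.encryption.enabled", "true")
    .set("spark.python.authenticate.socketTimeout", "10")  # timeout in seconds
)

sc = SparkContext(conf=conf)
# Same small job as the manual test; expected output: [[0], [2], [3], [4], [6]]
print(sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect())
sc.stop()
```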
Closes #30389 from gaborgsomogyi/SPARK-33143.
Lead-authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 0bb911d)
The file was modified core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/Python.scala (diff)
The file was modified python/pyspark/java_gateway.py (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala (diff)
The file was modified python/pyspark/context.py (diff)
The file was modified core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/security/SocketAuthHelper.scala (diff)
Commit 84e70362dbf2bbebc7f1a1b734b99952d7e95e4d by dongjoon
[SPARK-33510][BUILD] Update SBT to 1.4.4
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.2 to 1.4.4.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.3
- https://github.com/sbt/sbt/releases/tag/v1.4.4
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30453 from williamhyun/sbt143.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 84e7036)
The file was modified project/build.properties (diff)
The file was modified dev/mima (diff)
Commit c891e025b8ed34392fbc81e988b75bdbdb268c11 by gurwls223
Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to
trash"
### What changes were proposed in this pull request?
This reverts commit 065f17386d1851d732b4c1badf1ce2e14d0de338, which is not part of any released version. That is, this is an unreleased feature.
### Why are the changes needed?
I like the concept of Trash, but I think this PR might just resolve a
very specific issue by introducing a mechanism without a proper design
doc. This could make the usage more complex.
I think we need to consider the big picture. The trash directory is an important concept. If we decide to introduce it, we should consider all the code paths of Spark SQL that could delete data, instead of TRUNCATE only. We also need to consider what the current behavior is if the underlying file system does not provide the API `Trash.moveToAppropriateTrash`. Is the exception good? How about performance when users use an object store instead of HDFS? Will it impact GDPR compliance?
In sum, I think we should not merge the PR https://github.com/apache/spark/pull/29552 without a design doc and implementation plan. That is why I reverted it before the code freeze of Spark 3.1.
### Does this PR introduce _any_ user-facing change?
Reverted the original commit.
### How was this patch tested?
The existing tests.
Closes #30463 from gatorsmile/revertSpark-32481.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: c891e02)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala (diff)
Commit 60f3a730e4e67c3b67d6e45fb18f589ad66b07e6 by wenchen
[SPARK-33515][SQL] Improve exception messages while handling
UnresolvedTable
### What changes were proposed in this pull request?
This PR proposes to improve the exception messages while
`UnresolvedTable` is handled based on this suggestion:
https://github.com/apache/spark/pull/30321#discussion_r521127001.
Currently, when an identifier resolves to a view while a table is expected, the following exception message is displayed (e.g., for `COMMENT ON TABLE`):
```
v is a temp view not table.
```
After this PR, the message will be:
```
v is a temp view. 'COMMENT ON TABLE' expects a table.
```
Also, if an identifier is not resolved, the following exception message is currently used:
```
Table not found: t
```
After this PR, the message will be:
```
Table not found for 'COMMENT ON TABLE': t
```
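As a hypothetical reproduction sketch (not from the PR; the temp view name `v` follows the example above and the session setup is illustrative), running `COMMENT ON TABLE` against a temp view should surface the improved message:
```python
# Illustrative PySpark sketch: trigger the "expects a table" error on a temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("comment-on-table-demo").getOrCreate()
spark.range(1).createOrReplaceTempView("v")

try:
    spark.sql("COMMENT ON TABLE v IS 'some comment'")
except Exception as err:
    # Expected after this PR: "v is a temp view. 'COMMENT ON TABLE' expects a table."
    print(err)

spark.stop()
```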
### Why are the changes needed?
To improve the exception message.
### Does this PR introduce _any_ user-facing change?
Yes, the exception message will be changed as described above.
### How was this patch tested?
Updated existing tests.
Closes #30461 from imback82/unresolved_table_message.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 60f3a73)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/v2ResolutionPlans.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (diff)
Commit 23e9920b3910e4f05269853429c7f18888cdc7b5 by wenchen
[SPARK-33511][SQL] Respect case sensitivity while resolving V2 partition
specs
### What changes were proposed in this pull request?
1. Pre-process partition specs in `ResolvePartitionSpec`, converting partition names according to the partition schema and the SQL config `spark.sql.caseSensitive`. The PR proposes to invoke `normalizePartitionSpec` for that. The function is already used in DSv1 commands, so the behavior will be similar to DSv1.
2. Move `normalizePartitionSpec()` from `sql/core/.../datasources/PartitioningUtils` to `sql/catalyst/.../util/PartitioningUtils` so it can be used in Catalyst's rule `ResolvePartitionSpec`.
### Why are the changes needed?
DSv1 commands like `ALTER TABLE .. ADD PARTITION` and `ALTER TABLE .. DROP PARTITION` respect the SQL config `spark.sql.caseSensitive` while resolving partition specs. For example:
```sql
spark-sql> CREATE TABLE tbl1 (id bigint, data string) USING parquet PARTITIONED BY (id);
spark-sql> ALTER TABLE tbl1 ADD PARTITION (ID=1);
spark-sql> SHOW PARTITIONS tbl1;
id=1
```
The same command fails on a V2 table catalog with the error:
```
AnalysisException: Partition key ID not exists
```
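For illustration (not from the PR), a PySpark sketch mirroring the DSv1 example above; after this change, the same case-insensitive normalization should carry over to V2 partition spec resolution. The table and column names are the illustrative ones used above:
```python
# Illustrative sketch: with the default case-insensitive resolution,
# the partition column referenced as ID should normalize to the declared column id.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-spec-demo").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "false")  # default value, shown explicitly

spark.sql("CREATE TABLE tbl1 (id BIGINT, data STRING) USING parquet PARTITIONED BY (id)")
spark.sql("ALTER TABLE tbl1 ADD PARTITION (ID=1)")
spark.sql("SHOW PARTITIONS tbl1").show()  # expected: id=1

spark.stop()
```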
### Does this PR introduce _any_ user-facing change?
Yes. After the changes, partition spec resolution works as for DSv1 (without the exception shown above).
### How was this patch tested?
By running `AlterTablePartitionV2SQLSuite`.
Closes #30454 from MaxGekk/partition-spec-case-sensitivity.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 23e9920)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTablePartitionV2SQLSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala (diff)