Changes

Summary

  1. [SPARK-32785][SQL][DOCS][FOLLOWUP] Update migration guide for (commit: dcb0820) (details)
  2. [SPARK-33111][ML][FOLLOW-UP] aft transform optimization - (commit: 618695b) (details)
  3. [SPARK-33205][BUILD] Bump snappy-java version to 1.1.8 (commit: 1b7367c) (details)
  4. [SPARK-33202][CORE] Fix BlockManagerDecommissioner to return the correct (commit: 7aed81d) (details)
  5. [SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of (commit: 66005a3) (details)
  6. [SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing (commit: bbf2d6f) (details)
  7. [SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors (commit: 4a33cd9) (details)
  8. [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to (commit: ba13b94) (details)
  9. [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile (commit: cb3fa6c) (details)
  10. [SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE (commit: eb33bcb) (details)
  11. [SPARK-33218][CORE] Update misleading log messages for removed shuffle (commit: a908b67) (details)
  12. [SPARK-26533][SQL] Support query auto timeout cancel on thriftserver (commit: d9ee33c) (details)
  13. [SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, (commit: 8cae7f8) (details)
  14. [SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location (commit: a1629b4) (details)
  15. [SPARK-32978][SQL] Make sure the number of dynamic part metric is (commit: b38f3a5) (details)
  16. [SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key (commit: a03d77d) (details)
Commit dcb08204339e2291727be8e1a206e272652f9ae4 by yamamuro
[SPARK-32785][SQL][DOCS][FOLLOWUP] Update migration guide for
incomplete interval literals
### What changes were proposed in this pull request?
Address comments
https://github.com/apache/spark/pull/29635#discussion_r507241899 to
improve migration guide
### Why are the changes needed?
Improve the migration guide.
### Does this PR introduce _any_ user-facing change?
No, only a documentation update.
### How was this patch tested?
Passed GitHub Actions.
Closes #30113 from yaooqinn/SPARK-32785-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Takeshi
Yamamuro <yamamuro@apache.org>
(commit: dcb0820)
The file was modified docs/sql-migration-guide.md (diff)
Commit 618695b78fe93ae6506650ecfbebe807a43c5f0c by srowen
[SPARK-33111][ML][FOLLOW-UP] aft transform optimization -
predictQuantiles
### What changes were proposed in this pull request?
1. Optimize `predictQuantiles` by pre-computing an auxiliary variable.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. We can also optimize `predictQuantiles` by pre-computing an auxiliary variable.
It is about 56% faster than the existing implementation.
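A rough sketch of the idea, with hypothetical values rather than the actual `AFTSurvivalRegressionModel` code: the factor that depends only on the quantile probabilities and the fitted scale is computed once per model instead of once per row.
```scala
object AftQuantilesSketch {
  // Hypothetical model parameters, for illustration only.
  val scale = 1.0
  val quantileProbabilities = Array(0.1, 0.5, 0.9)

  // Auxiliary values pre-computed once: they depend only on the requested
  // probabilities and the scale, so they are identical for every row.
  val quantileFactors: Array[Double] =
    quantileProbabilities.map(q => math.exp(math.log(-math.log1p(-q)) * scale))

  // The per-row prediction then only multiplies by the row-dependent value
  // (called lambda here), avoiding repeated log/exp work for each row.
  def predictQuantiles(lambda: Double): Array[Double] = quantileFactors.map(_ * lambda)

  def main(args: Array[String]): Unit =
    println(predictQuantiles(2.0).mkString(", "))
}
```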
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Existing test suites.
Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Sean
Owen <srowen@gmail.com>
(commit: 618695b)
The file was modified mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff)
Commit 1b7367ccd7cdcbfc9ff9a3893693a3261a5eb7c1 by viirya
[SPARK-33205][BUILD] Bump snappy-java version to 1.1.8
### What changes were proposed in this pull request?
This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.
### Why are the changes needed?
For performance improvements: the new `snappy-java` release bundles the latest `Snappy` v1.1.8 binaries, which include small performance improvements.
- snappy-java release note:
https://github.com/xerial/snappy-java/releases/tag/1.1.8
- snappy release note:
https://github.com/google/snappy/releases/tag/1.1.8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA tests.
Closes #30120 from maropu/Snappy1.1.8.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
Liang-Chi Hsieh <viirya@gmail.com>
(commit: 1b7367c)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit 7aed81d4926c8f13ffb38f7ff90162b15c876016 by dhyun
[SPARK-33202][CORE] Fix BlockManagerDecommissioner to return the correct
migration status
### What changes were proposed in this pull request?
This PR changes `<` into `>` in the following to fix data loss during
storage migrations.
```scala
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
```
### Why are the changes needed?
`refreshOffloadingShuffleBlocks` should return `true` when the migration
is still on-going.
Since `migratingShuffles` is defined like the following,
`migratingShuffles.size > numMigratedShuffles.get()` means the migration
is not finished.
```scala
// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles =
mutable.HashSet[ShuffleBlockInfo]()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the updated test cases.
Closes #30116 from dongjoon-hyun/SPARK-33202.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 7aed81d)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionUnitSuite.scala (diff)
Commit 66005a323625fc8c7346d28e9a8c52f91ae8d1a0 by cutlerb
[SPARK-31964][PYTHON][FOLLOW-UP] Use is_categorical_dtype instead of
deprecated is_categorical
### What changes were proposed in this pull request?
This PR is a small follow-up of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of the deprecated `is_categorical`.
`is_categorical_dtype` has existed since the minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated in pandas 1.1.0 (https://github.com/pandas-dev/pandas/commit/87a1cc21cab751c16fda4e6f0a95988a8d90462b).
### Why are the changes needed?
To avoid using deprecated APIs, and remove warnings.
### Does this PR introduce _any_ user-facing change?
Yes, it will remove the warning that says `is_categorical` is deprecated.
### How was this patch tested?
By running any pandas UDF with pandas 1.1.0+:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

def func(x: pd.Series) -> pd.Series:
    return x

spark.range(10).select(pandas_udf(func, "long")("id")).show()
```
Before:
```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
...
```
After:
```
...
```
Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Bryan
Cutler <cutlerb@gmail.com>
(commit: 66005a3)
The file was modified python/pyspark/sql/pandas/serializers.py (diff)
Commit bbf2d6f6df0011c3035d829a56b035a2b094295c by gurwls223
[SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite`, which was added by https://github.com/apache/spark/pull/30056, in `DateTimeRebaseBenchmark`. The parquet readers should infer the correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:
| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
### Why are the changes needed? To have up-to-date info about the performance of INT96, which is the default Parquet type used for Catalyst's timestamp type.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain
org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
Closes #30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: bbf2d6f)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DateTimeRebaseBenchmark.scala (diff)
The file was modified sql/core/benchmarks/DateTimeRebaseBenchmark-results.txt (diff)
The file was modified sql/core/benchmarks/DateTimeRebaseBenchmark-jdk11-results.txt (diff)
Commit 4a33cd928df4739e69ae9530aae23964e470d2f8 by dhyun
[SPARK-33203][PYTHON][TEST] Fix tests failing with rounding errors
### What changes were proposed in this pull request?
Increase the tolerance for two tests that fail in some environments and pass in others (flaky? Pass/fail is constant within the same environment).
### Why are the changes needed? The tests `pyspark.ml.recommendation`
and `pyspark.ml.tests.test_algorithms` fail with
``` File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py",
line 96, in test_raw_and_probability_prediction
   self.assertTrue(np.allclose(result.rawPrediction,
expected_rawPrediction, atol=1)) AssertionError: False is not true
```
``` File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256,
in _main_.ALS Failed example:
   predictions[0] Expected:
   Row(user=0, item=2, newPrediction=0.6929101347923279) Got:
   Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch changes test targets. Just executed the tests to verify they pass.
Closes #30104 from AlessandroPatti/apatti/rounding-errors.
Authored-by: Alessandro Patti <ale812@yahoo.it> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: 4a33cd9)
The file was modified python/pyspark/ml/tests/test_algorithms.py (diff)
The file was modified python/pyspark/ml/recommendation.py (diff)
Commit ba13b94f6b2b477a93c0849c1fc776ffd5f1a0e6 by wenchen
[SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to
`EXCEPTION` by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.
### Why are the changes needed? The current default value `LEGACY` may lead to shifted timestamps on read or write. We should leave the decision about rebasing to users.
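With `EXCEPTION` as the default, users who really need rebasing have to opt in explicitly. A minimal sketch of how that opt-in could look, using the configs named above (the chosen values are only examples, not a recommendation):
```scala
// Explicitly choose a rebasing behavior instead of failing with an exception:
// write INT96 timestamps rebased to the legacy hybrid (Julian + Gregorian) calendar ...
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
// ... or read files as-is, assuming they were written with the proleptic Gregorian calendar.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
```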
### Does this PR introduce _any_ user-facing change? Yes
### How was this patch tested? By existing test suites like
`ParquetIOSuite`.
Closes #30121 from MaxGekk/int96-exception-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: ba13b94)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified docs/sql-migration-guide.md (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (diff)
Commit cb3fa6c9368e64184a5f7b19688181d11de9511c by d_tsai
[SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely
hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop
2.7, we'll still use the same modules such as hadoop-client.
In order to keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this, I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.
Besides the above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive
dependencies from Hadoop jars, but are removed from the shaded client
jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster`
which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when
Hadoop version is 3.x. This change should only matter when we're not
sharing Hadoop classes with Spark (which is _mostly_ used in tests).
### Why are the changes needed?
This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. The latest Hadoop versions have upgraded to Guava 27+, and in order to adopt them in Spark we need to resolve the Guava conflicts. This PR takes the approach of switching to the shaded client jars provided by Hadoop.
- to avoid pulling 3rd-party dependencies from Hadoop and avoid potential future conflicts.
### Does this PR introduce _any_ user-facing change?
When people use Spark with the `hadoop-provided` option, they should make sure the classpath contains the `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before the other Hadoop jars in classpath order; otherwise, classes may be loaded from the other, non-shaded Hadoop jars and cause potential conflicts.
### How was this patch tested?
Relying on existing tests.
Closes #29843 from sunchao/SPARK-29250.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: DB Tsai
<d_tsai@apple.com>
(commit: cb3fa6c)
The file was modified resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala (diff)
The file was modified external/kafka-0-10-assembly/pom.xml (diff)
The file was modified external/kinesis-asl-assembly/pom.xml (diff)
The file was modified core/pom.xml (diff)
The file was modified sql/hive/pom.xml (diff)
The file was modified sql/catalyst/pom.xml (diff)
The file was modified hadoop-cloud/pom.xml (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modified resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala (diff)
The file was modified external/kafka-0-10-token-provider/pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified common/network-yarn/pom.xml (diff)
The file was modified external/kafka-0-10-sql/pom.xml (diff)
The file was modified launcher/pom.xml (diff)
The file was modified resource-managers/yarn/pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit eb33bcb4b2db2a13b3da783e58feb8852e04637b by wenchen
[SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE
### What changes were proposed in this pull request?
`REGEXP_REPLACE` replaces all substrings of a string that match a regexp with a replacement string, but it lacks some flexibility. For example, converting camel-case strings to lower-case words separated by underscores (AddressLine1 -> address_line_1) is awkward. If we support the position parameter, we can do it like this (e.g. in Oracle):
```
WITH strings as (
SELECT 'AddressLine1' s FROM dual union all
SELECT 'ZipCode' s FROM dual union all
SELECT 'Country' s FROM dual
)
SELECT s "STRING",
       lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
FROM strings;
```
The output:
```
STRING               MODIFIED_STRING
-------------------- --------------------
AddressLine1         address_line_1
ZipCode              zip_code
Country              country
```
Some mainstream databases support this syntax.
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html
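For comparison, a hedged sketch of the equivalent call in Spark SQL once the position parameter is supported (note that Spark uses Java regex, where group references are written as `$1` rather than `\1`):
```scala
// Start matching at position 2 (1-based) so the leading capital is not prefixed with '_'.
spark.sql(
  "SELECT lower(regexp_replace('AddressLine1', '([A-Z0-9])', '_$1', 2)) AS modified_string"
).show()
// Expected output: address_line_1
```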
### Why are the changes needed? The position parameter for `REGEXP_REPLACE` is very useful.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested? Jenkins test.
Closes #29891 from beliefer/add-position-for-regex_replace.
Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer
<beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: eb33bcb)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/regexp-functions.sql.out (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (diff)
The file was modified sql/core/src/test/resources/sql-functions/sql-expression-schema.md (diff)
The file was modified sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql (diff)
Commit a908b67502164d5b1409aca912dac7042e825586 by dhyun
[SPARK-33218][CORE] Update misleading log messages for removed shuffle
blocks
### What changes were proposed in this pull request?
This updates the misleading log messages for removed shuffle block
during migration.
### Why are the changes needed?
1. For deleted shuffle blocks, `IndexShuffleBlockResolver` shows users a WARN message saying `skipping migration`. However, `BlockManagerDecommissioner` inconsistently shows an INFO message including `Migrated ShuffleBlockInfo(...)`. Technically, we didn't migrate anything, so we should not show a `Migrated` message in this case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
WARN IndexShuffleBlockResolver: Failed to resolve shuffle block ShuffleBlockInfo(109,18924), skipping migration. This is expected to occur if a block is removed after decommissioning has started.
INFO BlockManagerDecommissioner: Got migration sub-blocks List()
...
INFO BlockManagerDecommissioner: Migrated ShuffleBlockInfo(109,18924) to BlockManagerId(...)
```
2. In addition, if the shuffle file is deleted while its information is still in the queue, the above messages are repeated up to `spark.storage.decommission.maxReplicationFailuresPerBlock` times. We had better use one line instead of the group of messages in that case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (0 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (1 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
```
3. Deciding whether to skip is the role of the `BlockManagerDecommissioner` class. `IndexShuffleBlockResolver.getMigrationBlocks` is used in two different ways, as follows; we had better inform users from `BlockManagerDecommissioner` once.
   - At the beginning, to get the sub-blocks.
   - In case of `IOException`, to determine whether to ignore it or re-throw. In addition, `BlockManagerDecommissioner` shows a WARN message (`Skipping block ...`) again.
### Does this PR introduce _any_ user-facing change?
No. This is an update for log message info to be consistent.
### How was this patch tested?
Manually.
Closes #30129 from dongjoon-hyun/SPARK-33218.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon
Hyun <dhyun@apple.com>
(commit: a908b67)
The file was modified core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala (diff)
Commit d9ee33cfb95e1f05878e498c93c5cc65ce449f0e by yamamuro
[SPARK-26533][SQL] Support query auto timeout cancel on thriftserver
### What changes were proposed in this pull request?
Support automatic cancellation of queries that run too long on the Thrift server.
This is a rework of #28991, and the credit should go to the original author, leoluan2009.
Closes #28991
### Why are the changes needed?
In some cases, we use the Thrift server as a long-running application. Sometimes we want no query to run longer than a given time. In these cases, we can enable auto-cancellation for long-running queries, which lets us release resources for other queries to run.
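A hedged usage sketch only: the exact configuration name is not stated above, so the property below is an assumption on my part (it should be checked against the PR), and the timeout value is arbitrary.
```scala
// Assumed config name; cancel statements running longer than 60 seconds.
// On the Thrift server this would typically be set in spark-defaults.conf or per session via SET.
spark.sql("SET spark.sql.thriftServer.queryTimeout=60s")
```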
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes #29933 from maropu/pr28991.
Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by:
Luan <luanxuedong2009@gmail.com> Signed-off-by: Takeshi Yamamuro
<yamamuro@apache.org>
(commit: d9ee33c)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2Listener.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2EventManager.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (diff)
The file was modified sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/SQLOperation.java (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala (diff)
The file was modified sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2AppStatusStore.scala (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/ui/HiveThriftServer2ListenerSuite.scala (diff)
The file was modified sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/OperationManager.java (diff)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperationSuite.scala (diff)
Commit 8cae7f88b011939473fc9a6373012e23398bbc07 by wenchen
[SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add,
update type and nullability of columns (MySQL dialect)
### What changes were proposed in this pull request?
Override the default SQL strings for: ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY in the following MySQL JDBC
dialect according to official documentation. Write MySQL integration
tests for JDBC.
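A hedged sketch of what this enables from the Spark SQL side (the catalog name `mysql`, database, and table are hypothetical and assume a JDBC v2 catalog has been configured for a MySQL instance):
```scala
// These v2 ALTER TABLE statements are translated by the MySQL dialect into MySQL DDL.
spark.sql("ALTER TABLE mysql.testdb.people ADD COLUMNS (age INT)")
spark.sql("ALTER TABLE mysql.testdb.people ALTER COLUMN age TYPE BIGINT")
spark.sql("ALTER TABLE mysql.testdb.people ALTER COLUMN age DROP NOT NULL")
```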
### Why are the changes needed? Improved code coverage and MySQL dialect support for JDBC.
### Does this PR introduce _any_ user-facing change?
Yes, ALTER TABLE in the JDBC v2 Table Catalog now supports adding columns and updating the type and nullability of columns (MySQL dialect).
### How was this patch tested?
Added tests.
Closes #30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 8cae7f8)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/jdbc/MySQLDialect.scala (diff)
The file was added external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLIntegrationSuite.scala
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
Commit a1629b4a5790dce1a57e2c2bad9e04c627b88d29 by wenchen
[SPARK-32852][SQL] spark.sql.hive.metastore.jars support HDFS location
### What changes were proposed in this pull request?
Support HDFS locations for `spark.sql.hive.metastore.jars`.
When users need to set the Hive metastore jars via a path, they should set `spark.sql.hive.metastore.jars=path` and put the real paths in `spark.sql.hive.metastore.jars.path`. We use `File.pathSeparator` to split the path, but `File.pathSeparator` is `:` on Unix, which would split an HDFS location such as `hdfs://nameservice/xx`. So this adds the new config `spark.sql.hive.metastore.jars.path` to accept comma-separated paths, keeping both ways supported.
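A minimal configuration sketch under these assumptions (the Hive version and HDFS paths below are illustrative only):
```scala
import org.apache.spark.sql.SparkSession

// "path" tells Spark to take the jar list from the new comma-separated config,
// which can safely hold HDFS URIs (a ':'-separated list could not).
val spark = SparkSession.builder()
  .appName("hive-metastore-jars-from-hdfs")
  .config("spark.sql.hive.metastore.version", "2.3.7")  // illustrative version
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path",
    "hdfs://nameservice/hive/lib/*.jar,hdfs://nameservice/hive/aux/*.jar")  // hypothetical locations
  .enableHiveSupport()
  .getOrCreate()
```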
### Why are the changes needed? All Spark applications can fetch an internal version of the Hive jars from an HDFS location, without needing to distribute them to every node.
### Does this PR introduce _any_ user-facing change? Yes, users can use an HDFS location to store Hive metastore jars.
### How was this patch tested? Manually tested.
Closes #29881 from AngersZhuuuu/SPARK-32852.
Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: a1629b4)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
Commit b38f3a5557b45503e0f8d67bc77c5d390a67a42f by wenchen
[SPARK-32978][SQL] Make sure the number of dynamic part metric is
correct
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978.
The main cause of the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions per task, which makes it impossible to remove duplicates on the driver side.
The main change in this PR is to report the partitionValues to the driver and remove duplicates on the driver side, to make sure the dynamic partition metric is correct.
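A tiny hedged sketch of the driver-side idea (not the actual `BasicWriteStatsTracker` code): each task reports the partition values it wrote, and the driver counts the distinct values.
```scala
// Hypothetical per-task reports of the dynamic partitions written by each task.
val partitionValuesPerTask: Seq[Seq[String]] = Seq(
  Seq("dt=2020-10-01", "dt=2020-10-02"),
  Seq("dt=2020-10-02", "dt=2020-10-03")
)

// Summing per-task counts would report 4 partitions; deduplicating on the driver
// yields the correct number of dynamic partitions, 3.
val numDynamicParts = partitionValuesPerTask.flatten.distinct.size
```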
### Why are the changes needed? The number shown by the dynamic partition metric on the UI should be correct.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Added a new test case for the scenario described in SPARK-32978.
Closes #30026 from LuciferYang/SPARK-32978.
Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: b38f3a5)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala (diff)
The file was added sql/core/benchmarks/InsertTableWithDynamicPartitionsBenchmark-jdk11-results.txt
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/InsertTableWithDynamicPartitionsBenchmark.scala
The file was added sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteJobStatsTrackerMetricSuite.scala
The file was added sql/core/benchmarks/InsertTableWithDynamicPartitionsBenchmark-results.txt
Commit a03d77d32696f5a33770e9bee654acde904da7d4 by wenchen
[SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key
`org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` with `org.apache.spark.legacyINT96`.
2. Change the condition under which the new key is saved to the parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change how the metadata key is handled on read:
   - If the key is not present in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
   - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
   - For files written by Spark >= 3.1.0, if `org.apache.spark.legacyINT96` is present in the metadata, perform rebasing; otherwise, don't.
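A hedged sketch of the read-side decision just described (the flags below stand in for information a parquet reader would derive from the file footer and the session config; they are not real API, and the real mode can also be `EXCEPTION` rather than a plain boolean):
```scala
// Placeholder inputs for illustration.
val legacyInt96KeyPresent = false        // is "org.apache.spark.legacyINT96" in the file metadata?
val writtenBySpark        = true         // does the footer carry a Spark version at all?
val writerOlderThan310    = true         // was that Spark version < 3.1.0?
val int96RebaseModeInRead = "EXCEPTION"  // spark.sql.legacy.parquet.int96RebaseModeInRead

val rebaseInt96 =
  if (legacyInt96KeyPresent) true                      // Spark >= 3.1.0 wrote legacy INT96: rebase
  else if (writtenBySpark && writerOlderThan310) true  // older Spark writers always used the legacy calendar
  else if (writtenBySpark) false                       // Spark >= 3.1.0 without the key: no rebase
  else int96RebaseModeInRead == "LEGACY"               // no key and not a Spark file: fall back to the SQL config
```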
### Why are the changes needed?
- To avoid increasing parquet file sizes by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To keep the implementation similar to `org.apache.spark.legacyDateTime`.
- To minimise the impact on other subsystems that depend on file sizes, such as gathering statistics.
### Does this PR introduce _any_ user-facing change? No
### How was this patch tested? Modified test in `ParquetIOSuite`
Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: a03d77d)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala (diff)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/package.scala (diff)