Changes

Summary

  1. [SPARK-32839][WINDOWS] Make Spark scripts working with the spaces in (commit: 742fcff)
  2. [SPARK-32873][BUILD] Fix code which causes error when build with sbt and (commit: b121f0d)
  3. [SPARK-32854][SS] Minor code and doc improvement for stream-stream join (commit: 978f531)
  4. [SPARK-32844][SQL] Make `DataFrameReader.table` take the specified (commit: 5e82548)
  5. [SPARK-32868][SQL] Add more order irrelevant aggregates to (commit: 7a17158)
  6. [SPARK-32876][SQL] Change default fallback versions to 3.0.1 and 2.4.7 (commit: 0696f04)
  7. [SPARK-32872][CORE] Prevent BytesToBytesMap at MAX_CAPACITY from (commit: 72550c3)
  8. [SPARK-32882][K8S] Remove python2 installation in K8s python image (commit: d58a4a3)
  9. [SPARK-32871][BUILD] Append toMap to Map#filterKeys if the result of (commit: 4fac6d5)
  10. [SPARK-32715][CORE] Fix memory leak when failed to store pieces of (commit: 7a9b066)
  11. [SPARK-32878][CORE] Avoid scheduling TaskSetManager which has no pending (commit: 0811666)
  12. [SPARK-32884][TESTS] Mark TPCDSQuery*Suite as ExtendedSQLTest (commit: d8a0d85)
  13. [SPARK-32879][SQL] Refactor SparkSession initial options (commit: c8baab1)
  14. [SPARK-32738][CORE] Should reduce the number of active threads if fatal (commit: 99384d1)
  15. [SPARK-32874][SQL][TEST] Enhance result set meta data check for execute (commit: 316242b)
Commit 742fcff3501e46722eeaeb9d1ac20e569f8f1c2c by gurwls223
[SPARK-32839][WINDOWS] Make Spark scripts working with the spaces in
paths on Windows
### What changes were proposed in this pull request?
If you install Spark under a path that contains whitespace, it does not
work on Windows, for example as below:
```
>>> SparkSession.builder.getOrCreate()
Presence of build for multiple Scala versions detected
(C:\...\assembly\target\scala-2.13 and C:\...\assembly\target\scala-2.12).
Remove one of them or, set SPARK_SCALA_VERSION=2.13 in spark-env.cmd. Visit
https://spark.apache.org/docs/latest/configuration.html#environment-variables
for more details about setting environment variables in spark-env.cmd.
Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd.
```
This PR fixes the whitespace handling to support any paths on Windows.
### Why are the changes needed?
To support Spark working with whitespaces in paths on Windows.
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to install and run Spark under paths that contain
whitespace.
### How was this patch tested?
Manually tested.
Closes #29706 from HyukjinKwon/window-space-path.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 742fcff)
The file was modified bin/find-spark-home.cmd (diff)
The file was modified bin/spark-class2.cmd (diff)
The file was modified bin/load-spark-env.cmd (diff)
Commit b121f0d4596969ded3db9d5d7b0cb8adac8ac00c by gurwls223
[SPARK-32873][BUILD] Fix code which causes error when build with sbt and
Scala 2.13
### What changes were proposed in this pull request?
This PR fixes code which causes errors when building with sbt and Scala
2.13, such as the following.
```
[error] [warn]
/home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251:
method with a single empty parameter list overrides method without any
parameter list
[error] [warn]   override def hasNext(): Boolean = requestOffset <
part.untilOffset
[error] [warn]
[error] [warn]
/home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294:
method with a single empty parameter list overrides method without any
parameter list
[error] [warn]   override def hasNext(): Boolean = okNext
```
More specifically, what this PR fixes are:
* Methods which have an empty parameter list and override a method which
has no parameter list (a minimal sketch of this rule follows the list).
```
override def hasNext(): Boolean = okNext
```
* Methods which have no parameter list and override a method which has
an empty parameter list.
```
override def next: (Int, Double) = {
```
* Infix operator expressions where the operator wraps onto the next line.
```
// before: the wrapped operator started the continuation lines
3L * math.min(k, numFeatures) * math.min(k, numFeatures)
  + math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
  * math.min(k, numFeatures) + 4L * math.min(k, numFeatures))

// after: the operator is kept at the end of each wrapped line
3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
  math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
  math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```
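For context, here is a minimal sketch of the Scala 2.13 rule behind the first
two bullets; the trait and class below are illustrative, not Spark code.
```scala
trait Source {
  def hasNext: Boolean // declared without a parameter list
}

class IntSource(it: Iterator[Int]) extends Source {
  // Writing `override def hasNext(): Boolean` here triggers the
  // "method with a single empty parameter list overrides method without any parameter list"
  // warning, which the Scala 2.13 build treats as an error; matching the trait fixes it.
  override def hasNext: Boolean = it.hasNext
}
```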
### Why are the changes needed?
For building Spark with sbt and Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
After this change and #29742 were applied, compilation passed with the
following command.
```
build/sbt -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```
Closes #29745 from sarutak/fix-code-for-sbt-and-spark-2.13.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: b121f0d)
The file was modified external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala (diff)
The file was modified streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceivedBlockHandler.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala (diff)
The file was modified mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2CommandExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala (diff)
Commit 978f531010adfc08110897450d49cb569e4805ab by wenchen
[SPARK-32854][SS] Minor code and doc improvement for stream-stream join
### What changes were proposed in this pull request?
Several minor code and documentation improvements for stream-stream join.
Specifically:
* Remove extending from `SparkPlan`, as extending from `BinaryExecNode`
is enough.
* Return `left/right.outputPartitioning` for `Left/RightOuter` in
`outputPartitioning`, as the `PartitioningCollection` wrapper is
unnecessary (similar to batch joins `ShuffledHashJoinExec`,
`SortMergeJoinExec`).
*  Avoid per-row check for join type
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492),
by creating the method before the loop of reading rows
(`generateFilteredJoinedRow` in `storeAndJoinWithOtherSide`). Similar
optimization (i.e. create auxiliary method/variable per different join
type before the iterator of input rows) has been done in the batch join
world (`SortMergeJoinExec`, `ShuffledHashJoinExec`); see the sketch after
this list.
* Minor fix for comment/indentation for better readability.
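The per-row check removal noted in the list above can be illustrated with a
generic sketch; the types and helper names below are illustrative and not the
actual streaming join code.
```scala
object JoinSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType

  // Resolve the join-type-specific function once, before the row loop.
  def makeJoiner(joinType: JoinType): (Int, Option[Int]) => Option[String] = joinType match {
    case Inner     => (l, r) => r.map(v => s"$l,$v")
    case LeftOuter => (l, r) => Some(s"$l,${r.getOrElse("null")}")
  }

  def joinBatch(joinType: JoinType, rows: Seq[(Int, Option[Int])]): Seq[String] = {
    val join = makeJoiner(joinType)            // join-type dispatch happens once per batch
    rows.flatMap { case (l, r) => join(l, r) } // the per-row path no longer checks the join type
  }
}
```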
### Why are the changes needed?
Minor optimization to avoid unnecessary per-row work (this can probably
be optimized away by the compiler, but we can do a better job by avoiding
it in the first place). The other comment/indentation fixes improve code
readability for future developers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests in `StreamingJoinSuite.scala` as no new logic is
introduced.
Closes #29724 from c21/streaming.
Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 978f531)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala (diff)
Commit 5e825482d70e13a8cb16f1fbdac8139710482d17 by wenchen
[SPARK-32844][SQL] Make `DataFrameReader.table` take the specified
options for datasource v1
### What changes were proposed in this pull request?
Make `DataFrameReader.table` take the specified options for datasource v1.
### Why are the changes needed?
To keep the behavior of the v1/v2 datasources consistent; the v2 fix was
done in SPARK-32592.
### Does this PR introduce _any_ user-facing change?
Yes. `DataFrameReader.table` will take the specified options. Also, if the
same key and value exist in both the specified options and the table
properties, an exception will be thrown.
### How was this patch tested?
New UT added.
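A usage sketch of the user-facing behavior (spark-shell style; the table name
and option below are illustrative and assume an existing v1 Parquet table):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.read
  .option("mergeSchema", "true") // with this change, the option reaches the v1 source as well
  .table("my_parquet_table")     // hypothetical table in the current catalog
```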
Closes #29712 from xuanyuanking/SPARK-32844.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by:
Wenchen Fan <wenchen@databricks.com>
(commit: 5e82548)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
Commit 7a17158a4d7fd6d22f9550eceab42d8af308aeb4 by yamamuro
[SPARK-32868][SQL] Add more order irrelevant aggregates to
EliminateSorts
### What changes were proposed in this pull request?
Mark `BitAggregate` as order irrelevant in `EliminateSorts`.
### Why are the changes needed?
Performance improvements in some queries
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Generalized an existing UT
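For illustration, a spark-shell style sketch of a query shape this covers
(table and column names are illustrative): the inner ORDER BY does not affect
the result of `bit_or`, so the optimizer may drop the sort.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.range(10).selectExpr("id AS v").createOrReplaceTempView("t")

// The optimized plan is expected not to need the Sort for this aggregate.
spark.sql("SELECT bit_or(v) FROM (SELECT v FROM t ORDER BY v) s").explain()
```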
Closes #29740 from tanelk/SPARK-32868.
Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by:
Takeshi Yamamuro <yamamuro@apache.org>
(commit: 7a17158)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSortsSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
Commit 0696f0467270969f40e9baa829533bdb55f4002a by dongjoon
[SPARK-32876][SQL] Change default fallback versions to 3.0.1 and 2.4.7
in HiveExternalCatalogVersionsSuite
### What changes were proposed in this pull request?
The Jenkins job fails to get the versions. This was fixed by adding
temporary fallbacks at https://github.com/apache/spark/pull/28536. This
still doesn't work without the temporary fallbacks. See
https://github.com/apache/spark/pull/29694
This PR adds new fallbacks since 2.3 is EOL and Spark 3.0.1 and 2.4.7
are released.
### Why are the changes needed?
To test correctly in Jenkins.
### Does this PR introduce _any_ user-facing change?
No, dev-only
### How was this patch tested?
Jenkins and GitHub Actions builds should test.
Closes #29748 from HyukjinKwon/SPARK-32876.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 0696f04)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala (diff)
Commit 72550c3be7120fcf2844d6914e883f1bec30d93f by dongjoon
[SPARK-32872][CORE] Prevent BytesToBytesMap at MAX_CAPACITY from
exceeding growth threshold
### What changes were proposed in this pull request?
When BytesToBytesMap is at `MAX_CAPACITY` and reaches its growth
threshold, `numKeys >= growthThreshold` is true but `longArray.size() /
2 < MAX_CAPACITY` is false. This correctly prevents the map from
growing, but `canGrowArray` incorrectly remains true. Therefore the map
keeps accepting new keys and exceeds its growth threshold. If we attempt
to spill the map in this state, the UnsafeKVExternalSorter will not be
able to reuse the long array for sorting. By this point the task has
typically consumed all available memory, so the allocation of the new
pointer array is likely to fail.
This PR fixes the issue by setting `canGrowArray` to false in this case.
This prevents the map from accepting new elements when it cannot grow to
accommodate them.
### Why are the changes needed?
Without this change, hash aggregations will fail when the number of
groups per task is greater than `MAX_CAPACITY / 2 = 2^28` (approximately
268 million), and when the grouping aggregation is the only
memory-consuming operator in its stage.
For example, the final aggregation in `SELECT COUNT(DISTINCT id) FROM
tbl` fails when `tbl` contains 1 billion distinct values and when
`spark.sql.shuffle.partitions=1`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Reproducing this issue requires building a very large BytesToBytesMap.
Because this is infeasible to do in a unit test, this PR was tested
manually by adding the following test to AbstractBytesToBytesMapSuite.
Before this PR, the test fails in 8.5 minutes. With this PR, the test
passes in 1.5 minutes.
```java
public abstract class AbstractBytesToBytesMapSuite {
  // ...
  @Test
  public void respectGrowthThresholdAtMaxCapacity() {
    TestMemoryManager memoryManager2 =
        new TestMemoryManager(
            new SparkConf()
                .set(package$.MODULE$.MEMORY_OFFHEAP_ENABLED(), true)
                .set(package$.MODULE$.MEMORY_OFFHEAP_SIZE(), 25600 * 1024 * 1024L)
                .set(package$.MODULE$.SHUFFLE_SPILL_COMPRESS(), false)
                .set(package$.MODULE$.SHUFFLE_COMPRESS(), false));
    TaskMemoryManager taskMemoryManager2 = new TaskMemoryManager(memoryManager2, 0);
    final long pageSizeBytes = 8000000 + 8; // 8 bytes for end-of-page marker
    final BytesToBytesMap map = new BytesToBytesMap(taskMemoryManager2, 1024, pageSizeBytes);

    try {
      // Insert keys into the map until it stops accepting new keys.
      for (long i = 0; i < BytesToBytesMap.MAX_CAPACITY; i++) {
        if (i % (1024 * 1024) == 0) System.out.println("Inserting element " + i);
        final long[] value = new long[]{i};
        BytesToBytesMap.Location loc = map.lookup(value, Platform.LONG_ARRAY_OFFSET, 8);
        Assert.assertFalse(loc.isDefined());
        boolean success =
            loc.append(value, Platform.LONG_ARRAY_OFFSET, 8, value, Platform.LONG_ARRAY_OFFSET, 8);
        if (!success) break;
      }

      // The map should grow to its max capacity.
      long capacity = map.getArray().size() / 2;
      Assert.assertTrue(capacity == BytesToBytesMap.MAX_CAPACITY);

      // The map should stop accepting new keys once it has reached its growth
      // threshold, which is half the max capacity.
      Assert.assertTrue(map.numKeys() == BytesToBytesMap.MAX_CAPACITY / 2);

      map.free();
    } finally {
      map.free();
    }
  }
}
```
Closes #29744 from ankurdave/SPARK-32872.
Authored-by: Ankur Dave <ankurdave@gmail.com> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 72550c3)
The file was modified core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java (diff)
Commit d58a4a310aecb9fa1bee1be0f5cb02b3be078667 by dongjoon
[SPARK-32882][K8S] Remove python2 installation in K8s python image
### What changes were proposed in this pull request?
This PR aims to remove the python2 installation in the K8s python image
because Spark 3.1 does not support python2.
### Why are the changes needed?
This will save disk space.
**BEFORE**
```
kubespark/spark-py ... 917MB
```
**AFTER**
```
kubespark/spark-py ... 823MB
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the K8s IT.
Closes #29751 from williamhyun/remove_py2.
Authored-by: William Hyun <williamhyun3@gmail.com> Signed-off-by:
Dongjoon Hyun <dongjoon@apache.org>
(commit: d58a4a3)
The file was modified resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile (diff)
Commit 4fac6d501a5d97530edb712ff3450890ac10e413 by gurwls223
[SPARK-32871][BUILD] Append toMap to Map#filterKeys if the result of
filter is concatenated with another Map for Scala 2.13
### What changes were proposed in this pull request?
This PR appends `toMap` to `Map` instances with `filterKeys` if such a map
is to be concatenated with another map.
### Why are the changes needed?
As of Scala 2.13, `Map#filterKeys` returns a `MapView`, not the original
`Map` type. This can cause compile errors.
```
/sql/DataFrameReader.scala:279: type mismatch;
[error]  found   : Iterable[(String, String)]
[error]  required: java.util.Map[String,String]
[error] Error occurred in an application involving default arguments.
[error]       val dsOptions = new
CaseInsensitiveStringMap(finalOptions.asJava)
```
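A minimal sketch of the Scala 2.13 behavior behind the change (the map
contents are illustrative):
```scala
val sessionOptions = Map("k1" -> "a", "k2" -> "b")
val extraOptions = Map("k3" -> "c")

// Under Scala 2.13, filterKeys returns a MapView, so the result of a plain
// concatenation may not be the strict Map type that downstream code (e.g. .asJava)
// expects; appending .toMap restores a Map before concatenation.
val merged: Map[String, String] = sessionOptions.filterKeys(_ != "k2").toMap ++ extraOptions
```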
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Compile passed with the following command.
`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes
-DskipTests test-compile`
Closes #29742 from sarutak/fix-filterKeys-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 4fac6d5)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (diff)
Commit 7a9b066c66d29e946b4f384292021123beb6fe57 by dongjoon
[SPARK-32715][CORE] Fix memory leak when failed to store pieces of
broadcast
### What changes were proposed in this pull request?
In TorrentBroadcast.scala:
```scala
L133: if (!blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false))
L137: TorrentBroadcast.blockifyObject(value, blockSize, SparkEnv.get.serializer, compressionCodec)
L147: if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true))
```
If the original value is stored successfully (TorrentBroadcast.scala: L133)
but the subsequent `blockifyObject()` (L137) or piece-storing (L147) steps
fail, there is no opportunity to release the broadcast from memory.
This patch removes all pieces of the broadcast when blockifying fails or
when storing some pieces of a broadcast fails.
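A generic sketch of the cleanup-on-failure idea, not the actual
TorrentBroadcast code (the helper names and exception are illustrative):
```scala
object BroadcastWriteSketch {
  def writePieces(pieces: Seq[Array[Byte]],
                  storePiece: (Int, Array[Byte]) => Boolean,
                  removePiece: Int => Unit): Int = {
    var stored = 0
    try {
      pieces.foreach { piece =>
        if (!storePiece(stored, piece)) {
          throw new java.io.IOException(s"Failed to store piece $stored")
        }
        stored += 1
      }
      stored
    } catch {
      case e: Throwable =>
        // Release every piece written so far instead of leaving it in memory.
        (0 until stored).foreach(removePiece)
        throw e
    }
  }
}
```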
### Why are the changes needed?
We use Spark thrift-server as a long-running service. A bad query submitted
a heavy BroadcastNestLoopJoin operation and drove the driver into full GC.
We killed the bad query, but the driver's memory usage remained high and
full GCs were still frequent. By investigating the GC dump and logs, we
found that broadcasts may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc): 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested. A UT is hard to write and the patch is straightforward.
Closes #29558 from LantaoJin/SPARK-32715.
Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 7a9b066)
The file was modified core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala (diff)
Commit 0811666ab104b41cf189233439f4158b18bc8282 by dongjoon
[SPARK-32878][CORE] Avoid scheduling TaskSetManager which has no pending
tasks
### What changes were proposed in this pull request?
This PR proposes to avoid scheduling the (non-zombie) TaskSetManager
which has no pending tasks.
### Why are the changes needed?
Currently, Spark always tries to schedule a (non-zombie) TaskSetManager
even if it has no pending tasks. This causes notable problems for the
barrier TaskSetManager: 1. `calculateAvailableSlots` can be called multiple
times for a launched barrier TaskSetManager; 2. the user would see the
"Skip current round of resource offers for barrier stage" log message for a
launched barrier TaskSetManager all the time until the barrier
TaskSetManager finishes, which is quite confusing.
Besides, scheduling a TaskSetManager always involves many function
invocations even if there are no pending tasks.
Therefore, I think we can skip those unschedulable TaskSetManagers to
avoid the potential overhead.
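A minimal sketch of the skip condition described above, with illustrative
types rather than Spark's actual scheduler classes:
```scala
object SchedulingSketch {
  final case class ManagerState(isZombie: Boolean, pendingTasks: Int)

  // A non-zombie manager with no pending tasks is not offered resources.
  def shouldOfferResources(m: ManagerState): Boolean =
    m.isZombie || m.pendingTasks > 0

  // Example: a launched barrier stage whose tasks are all running is skipped.
  assert(!shouldOfferResources(ManagerState(isZombie = false, pendingTasks = 0)))
  assert(shouldOfferResources(ManagerState(isZombie = false, pendingTasks = 3)))
}
```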
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #29750 from Ngone51/filter-out-unschedulable-stage.
Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun
<dongjoon@apache.org>
(commit: 0811666)
The file was modified core/src/main/scala/org/apache/spark/scheduler/Pool.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/Schedulable.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (diff)
Commit d8a0d8569243d29e7f091d545ee1e9eb780d3dc8 by gurwls223
[SPARK-32884][TESTS] Mark TPCDSQuery*Suite as ExtendedSQLTest
### What changes were proposed in this pull request?
This PR aims to mark the following suite as `ExtendedSQLTest` to reduce
GitHub Action test time.
- TPCDSQuerySuite
- TPCDSQueryANSISuite
- TPCDSQueryWithStatsSuite
### Why are the changes needed?
Currently, the longest GitHub Action task is `Build and test / Build
modules: sql - other tests` with `1h 57m 10s` while `Build and test /
Build modules: sql - slow tests` takes `42m 20s`. With this PR, we can
move the workload from the `other tests` task to the `slow tests` task and
reduce the total waiting time by about 7 to 8 minutes.
### Does this PR introduce _any_ user-facing change?
No. This is a test-only change.
### How was this patch tested?
Pass the GitHub Action with the reduced running time.
Closes #29755 from dongjoon-hyun/SPARK-SLOWTEST.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: d8a0d85)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala (diff)
Commit c8baab1a1f2ac03951946ff899d1c51a69c2c8b3 by wenchen
[SPARK-32879][SQL] Refactor SparkSession initial options
### What changes were proposed in this pull request?
This PR refactors the way we propagate the options from the
`SparkSession.Builder` to the `SessionState`. This is currently done via a
mutable map inside the SparkSession. These settings are then applied
**after** the Session is built. This is a bit confusing when you expect
something to be set when constructing the `SessionState`. This PR passes
the options as a constructor parameter to the `SessionStateBuilder`, which
sets the options when the configuration is created.
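A usage sketch of the expectation this enables (spark-shell style; the config
key below is hypothetical):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.myextension.enabled", "true") // hypothetical key read by a planner rule
  .getOrCreate()

// With the refactor, the option is expected to be present while the SessionState
// (and anything built during its construction, such as extension rules) is created.
assert(spark.conf.get("spark.myextension.enabled") == "true")
```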
### Why are the changes needed?
It makes it easier to reason about the configurations set in a SessionState
than before. We recently had an incident where someone was using
`SparkSessionExtensions` to create a planner rule that relied on a conf to
be set. While this is in itself probably incorrect usage, it still
illustrated this somewhat funky behavior.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29752 from hvanhovell/SPARK-32879.
Authored-by: herman <herman@databricks.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: c8baab1)
The file was modified sql/hive/src/test/scala/org/apache/spark/sql/hive/test/TestHive.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala (diff)
The file was modified project/MimaExcludes.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
Commit 99384d1e831b7fe82a3a80ade1da976971624ee7 by wenchen
[SPARK-32738][CORE] Should reduce the number of active threads if fatal
error happens in `Inbox.process`
### What changes were proposed in this pull request?
Processing for `ThreadSafeRpcEndpoint` is controlled by
`numActiveThreads` in `Inbox`. Now if any fatal error happens during
`Inbox.process`, `numActiveThreads` is not reduced. Then other threads
cannot process messages in that inbox, which causes the endpoint to
"hang". For other types of endpoints, we should also keep
`numActiveThreads` correct.
This problem is more serious in previous Spark 2.x versions since the
driver, executor and block manager endpoints are all thread safe
endpoints.
To fix this, we should reduce the number of active threads if fatal
error happens in `Inbox.process`.
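A minimal sketch of the idea, not the actual Inbox implementation: decrement
the active-thread counter on every exit path, including fatal errors.
```scala
class InboxSketch {
  private var numActiveThreads = 0

  def process(handleMessage: () => Unit): Unit = {
    synchronized { numActiveThreads += 1 }
    try {
      handleMessage() // may throw a fatal error such as OutOfMemoryError
    } finally {
      // Keep the counter correct so other threads can keep processing this inbox.
      synchronized { numActiveThreads -= 1 }
    }
  }

  def activeThreads: Int = synchronized { numActiveThreads }
}
```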
### Why are the changes needed?
`numActiveThreads` is not correct when a fatal error happens, which causes
the described problem.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add a new test.
Closes #29580 from wzhfy/deal_with_fatal_error.
Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 99384d1)
The file was modified core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/rpc/netty/InboxSuite.scala (diff)
Commit 316242b768a232ea541e854633374aebcd2ed194 by wenchen
[SPARK-32874][SQL][TEST] Enhance result set meta data check for execute
statement operation with thrift server
### What changes were proposed in this pull request?
This PR adds test cases for the result set metadata checking of Spark's
`ExecuteStatementOperation`, to make the JDBC API more future-proof,
because any server-side change may affect client compatibility.
### Why are the changes needed?
Add tests to prevent potential silent behavior changes for JDBC users.
### Does this PR introduce _any_ user-facing change?
No, test only.
### How was this patch tested?
Added new tests.
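A sketch of the kind of metadata assertion such tests make, using only the
standard JDBC API (assumes a `stmt` obtained from a Thrift server JDBC
connection; the STRING-to-VARCHAR mapping is an assumption about the driver):
```scala
import java.sql.Statement

def checkResultSetMetadata(stmt: Statement): Unit = {
  val rs = stmt.executeQuery("SELECT 1 AS id, 'a' AS name")
  val md = rs.getMetaData
  assert(md.getColumnCount == 2)
  assert(md.getColumnName(1) == "id")
  assert(md.getColumnType(2) == java.sql.Types.VARCHAR) // assumed mapping for STRING over JDBC
}
```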
Closes #29746 from yaooqinn/SPARK-32874.
Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan
<wenchen@databricks.com>
(commit: 316242b)
The file was modified sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/SparkThriftServerProtocolVersionsSuite.scala (diff)