Changes

Summary

  1. [SPARK-32949][FOLLOW-UP][R][SQL] Reindent lines in SparkR timestamp_seconds (commit: 9b21fdd)
  2. [SPARK-32958][SQL] Prune unnecessary columns from JsonToStructs (commit: 37c806a)
  3. [SPARK-33049][CORE] Decommission shuffle block test is flaky (commit: db420f7)
  4. [SPARK-33065][TESTS] Expand the stack size of a thread in a test in LocalityPlacementStrategySuite for Java 11 with sbt (commit: fab5321)
  5. [SPARK-33017][PYTHON] Add getCheckpointDir method to PySpark Context (commit: 4ab9aa0)
  6. [SPARK-33040][R][ML] Add SparkR wrapper for vector_to_array (commit: e83d03c)
  7. [SPARK-33040][FOLLOW-UP][R] Reorder argument choices and add examples (commit: 24f890e)
Commit 9b21fdd731489b529a52cd2074f79dc7293eed3b by dhyun
[SPARK-32949][FOLLOW-UP][R][SQL] Reindent lines in SparkR timestamp_seconds
### What changes were proposed in this pull request?
Re-indent lines of SparkR `timestamp_seconds`.
### Why are the changes needed?
Current indentation is not aligned with the opening line.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29940 from zero323/SPARK-32949-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 9b21fdd)
The file was modified: R/pkg/R/functions.R (diff)
Commit 37c806af2bd3fb4c1f25e02f4986226e5e8d994d by dhyun
[SPARK-32958][SQL] Prune unnecessary columns from JsonToStructs
### What changes were proposed in this pull request?
This patch proposes to do column pruning for `JsonToStructs` expression
if we only require some fields from it.
### Why are the changes needed?
`JsonToStructs` takes a schema parameter that tells `JacksonParser` which
fields need to be parsed. If `JsonToStructs` is followed by
`GetStructField`, we can prune the schema so that only the required field
is parsed.
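As an illustration (a hedged sketch, not code from the patch), the pattern this rule targets looks like the following in PySpark: the full schema names three fields, but only one is ever read, so the optimizer can narrow the schema handed to the JSON parser.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('{"a": 1, "b": 2, "c": 3}',)], ["json"])

# Only field `a` of the parsed struct is accessed, so the schema used
# for parsing can be pruned from `a INT, b INT, c INT` down to `a INT`.
pruned = df.select(
    from_json(col("json"), "a INT, b INT, c INT").getField("a").alias("a")
)
pruned.show()
```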
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #29900 from viirya/SPARK-32958.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 37c806a)
The file was modified: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala (diff)
The file was modified: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala (diff)
Commit db420f79cc588dc0f98b906accb34d63a1e4664c by dhyun
[SPARK-33049][CORE] Decommission shuffle block test is flaky
### What changes were proposed in this pull request?
Increase the listener bus event queue length and synchronize the addition
of modified blocks to the array list.
### Why are the changes needed?
This test appears flaky on Jenkins (I cannot reproduce it locally). Given
that the index file made it through, and the index file is only
transferred after the data file, the only two explanations I could come up
with for an intermittent failure here are the listener bus dropping a
message or the two block-change messages being received at the same time.
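As a rough sketch of the second half of the fix (the real change is in the Scala test suite; the names here are hypothetical), serializing appends to the shared list removes the race between two block-change events arriving at the same time:
```python
import threading

blocks_modified = []            # list shared by listener callbacks
blocks_lock = threading.Lock()  # guards appends from concurrent events

def on_block_updated(block_id):
    # Two block-change messages can be delivered at effectively the same
    # time, so take the lock before touching the shared list.
    with blocks_lock:
        blocks_modified.append(block_id)
```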
### Does this PR introduce _any_ user-facing change?
No (test only).
### How was this patch tested?
The tests still pass on my machine, but they did before as well. We'll
need to run this through Jenkins a few times first.
Closes #29929 from holdenk/fix-.BlockManagerDecommissionIntegrationSuite.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: db420f7)
The file was modified: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala (diff)
Commit fab53212cb110a81696cee8546c35095332f6e09 by dhyun
[SPARK-33065][TESTS] Expand the stack size of a thread in a test in LocalityPlacementStrategySuite for Java 11 with sbt
### What changes were proposed in this pull request?
This PR fixes an issue that a test in `LocalityPlacementStrategySuite`
fails with Java 11 due to `StackOverflowError`.
```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (170 milliseconds)
[info]   StackOverflowError should not be thrown; however, got:
[info]
[info]   java.lang.StackOverflowError
[info]          at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
[info]          at java.base/java.util.concurrent.ConcurrentHashMap.putIfAbsent(ConcurrentHashMap.java:1541)
[info]          at java.base/java.lang.ClassLoader.getClassLoadingLock(ClassLoader.java:668)
[info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:591)
[info]          at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
[info]          at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
[info]          at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
```
The solution is to expand the stack size of a thread in the test from
32KB to 256KB. Currently, the stack size is specified as 32KB, but the
actual stack size can be greater than 32KB: according to the HotSpot
code, the minimum stack size takes precedence over the specified size.
Java 8:
https://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/c92ba514724d/src/os/linux/vm/os_linux.cpp#l900
Java 11:
https://hg.openjdk.java.net/jdk-updates/jdk11u/file/73edf743a93a/src/hotspot/os/posix/os_posix.cpp#l1555
For Linux on x86_64, the minimum stack size seems to be 224KB for Java 8
and 136KB for Java 11. So the actual stack size should be 224KB rather
than 32KB for Java 8 on x86_64/Linux. As the test passes for Java 8 but
not for Java 11, 224KB is enough while 136KB is not. So I think
specifying 256KB is a reasonable new stack size.
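The patch itself sets the stack size on a JVM thread; as a hedged analogue, the same idea in Python looks like this (`threading.stack_size` applies to threads created after the call):
```python
import threading

# Request a 256 KiB stack for subsequently created threads. As with
# HotSpot above, the platform may silently round this up to its minimum.
threading.stack_size(256 * 1024)

def recurse(n):
    # A small recursive workload to exercise the thread's stack.
    return 0 if n == 0 else 1 + recurse(n - 1)

t = threading.Thread(target=recurse, args=(500,))
t.start()
t.join()
```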
### Why are the changes needed?
To pass the test for Java 11.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the following command with Java 11:
```
build/sbt -Pyarn clean package "testOnly org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite"
```
Closes #29943 from sarutak/fix-stack-size.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: fab5321)
The file was modified: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/LocalityPlacementStrategySuite.scala (diff)
Commit 4ab9aa03055d3ad90137efacb2e00eff4ac3fbf1 by gurwls223
[SPARK-33017][PYTHON] Add getCheckpointDir method to PySpark Context
### What changes were proposed in this pull request?
Adding a method to get the checkpoint directory from the PySpark context,
to match the Scala API.
### Why are the changes needed?
To make the Scala and Python APIs consistent and remove the need to use
the JavaObject.
### Does this PR introduce _any_ user-facing change?
Yes, there is a new method that makes it easier to get the checkpoint
directory directly rather than using the JavaObject.
#### Previous behaviour:
```python
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> sc._jsc.sc().getCheckpointDir().get()
'file:/tmp/spark/checkpoint/63f7b67c-e5dc-4d11-a70c-33554a71717a'
```
This method returns a confusing Scala error if it has not been set:
```python
>>> sc._jsc.sc().getCheckpointDir().get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/home/paul/Desktop/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/paul/Desktop/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o25.get.
: java.util.NoSuchElementException: None.get
       at scala.None$.get(Option.scala:529)
       at scala.None$.get(Option.scala:527)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
       at py4j.Gateway.invoke(Gateway.java:282)
       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
       at py4j.commands.CallCommand.execute(CallCommand.java:79)
       at py4j.GatewayConnection.run(GatewayConnection.java:238)
       at java.lang.Thread.run(Thread.java:748)
```
#### New method:
```python
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> spark.sparkContext.getCheckpointDir()
'file:/tmp/spark/checkpoint/b38aca2e-8ace-44fc-a4c4-f4e36c2da2a7'
```
``getCheckpointDir()`` returns ``None`` if it has not been set
```python
>>> print(spark.sparkContext.getCheckpointDir())
None
```
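For context, here is a minimal standalone sketch of what the new method does, unwrapping the Scala ``Option`` returned over py4j (illustrative only; the helper name below is hypothetical, and the real implementation lives in python/pyspark/context.py):
```python
def get_checkpoint_dir(sc):
    """Return sc's checkpoint directory, or None if it has not been set."""
    # The underlying Scala method returns an Option[String]; unwrap it
    # only when it is defined so the unset case maps to Python's None.
    opt = sc._jsc.sc().getCheckpointDir()
    return opt.get() if not opt.isEmpty() else None
```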
### How was this patch tested?
Added to existing unit tests. But I'm not sure how to add a test for the
case where ``getCheckpointDir()`` should return ``None``, since, as far
as I can tell, the existing checkpoint tests set the checkpoint directory
in the ``setUp`` method before any tests run.
Closes #29918 from reidy-p/SPARK-33017.
Authored-by: reidy-p <paul_reidy@outlook.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 4ab9aa0)
The file was modified: python/pyspark/context.py (diff)
The file was modified: python/pyspark/__init__.py (diff)
The file was modified: python/pyspark/context.pyi (diff)
The file was modified: python/pyspark/tests/test_context.py (diff)
Commit e83d03ca4861a69cd688beacc544b3f6dae32ae0 by gurwls223
[SPARK-33040][R][ML] Add SparkR wrapper for vector_to_array
### What changes were proposed in this pull request?
Add SparkR wrapper for `o.a.s.ml.functions.vector_to_array`
### Why are the changes needed?
- Currently ML vectors, including predictions, are almost inaccessible
to R users. That is a serious loss of functionality (see the sketch after
this list).
- Feature parity.
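For readers who have not used it, the PySpark counterpart of the wrapped function looks like this (shown as a sketch, since this PR adds the R wrapper, not the Python one):
```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(Vectors.dense([1.0, 2.0, 3.0]),)], ["features"])

# Convert the ML vector column into a plain array column, the
# representation that downstream SQL (and R) code can actually use.
df.select(vector_to_array("features").alias("array")).show()
```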
### Does this PR introduce _any_ user-facing change?
Yes, a new R function is added.
### How was this patch tested?
- New unit tests.
- Manual verification.
Closes #29917 from zero323/SPARK-33040.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: e83d03c)
The file was modified: R/pkg/tests/fulltests/test_sparkSQL.R (diff)
The file was modified: R/pkg/R/generics.R (diff)
The file was modified: R/pkg/R/functions.R (diff)
The file was modified: R/pkg/NAMESPACE (diff)
Commit 24f890e8e81ee03fe0d9ce4c8f232784e9fdaccd by gurwls223
[SPARK-33040][FOLLOW-UP][R] Reorder argument choices and add examples
### What changes were proposed in this pull request?
- Reorder choices of `dtype` to match Scala defaults.
- Add example to ml_functions.
### Why are the changes needed?
As requested:
- https://github.com/apache/spark/pull/29917#pullrequestreview-501715344
- https://github.com/apache/spark/pull/29917#pullrequestreview-501716521
### Does this PR introduce _any_ user-facing change?
No (changes to a newly added component).
### How was this patch tested?
Existing tests.
Closes #29944 from zero323/SPARK-33040-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 24f890e)
The file was modified: R/pkg/R/functions.R (diff)