Changes

Summary

  1. [SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__ (commit: 4b76a74)
  2. [SPARK-33404][SQL][FOLLOWUP] Update benchmark results for `date_trunc` (commit: 7e86729)
  3. [SPARK-33402][CORE] Jobs launched in same second have duplicate MapReduce JobIDs (commit: 318a173)
  4. [MINOR][GRAPHX] Correct typos in the sub-modules: graphx, external, and examples (commit: 9d58a2f)
  5. [WIP] Test (#30327) (commit: 61ee5d8)
  6. Revert "[WIP] Test (#30327)" (commit: 6244407)
  7. [SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods (commit: 9f983a6)
  8. [SPARK-33408][SPARK-32354][K8S][R] Use R 3.6.3 in K8s R image and re-enable RTestsSuite (commit: 22baf05)
Commit 4b76a74f1c0b5d9bd11794eefd739352764c88c4 by gurwls223
[SPARK-33415][PYTHON][SQL] Don't encode JVM response in Column.__repr__
### What changes were proposed in this pull request?
Removes encoding of the JVM response in
`pyspark.sql.column.Column.__repr__`.
### Why are the changes needed?
API consistency and improved readability of the expressions.
### Does this PR introduce _any_ user-facing change?
Before this change,
```
col("abc")
col("wąż")
```
result in
```
Column<b'abc'>
Column<b'w\xc4\x85\xc5\xbc'>
```
After this change we'll get
```
Column<'abc'>
Column<'wąż'>
```
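As a rough illustration of the repr change (a minimal sketch, assuming a Py4J-style `_jc` proxy whose `toString()` returns a Python `str`; this is not the exact PySpark source):
```python
class Column:
    def __init__(self, jc):
        self._jc = jc  # proxy for the JVM-side Column (e.g. a Py4J JavaObject)

    def __repr__(self):
        # Old behavior encoded the JVM string, yielding a bytes repr such as
        # Column<b'w\xc4\x85\xc5\xbc'>:
        #   return "Column<%s>" % self._jc.toString().encode("utf8")
        # New behavior uses the decoded string directly:
        return "Column<'%s'>" % self._jc.toString()
```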
### How was this patch tested?
Existing tests and manual inspection.
Closes #30322 from zero323/SPARK-33415.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 4b76a74)
The file was modified python/pyspark/sql/tests/test_column.py (diff)
The file was modified python/pyspark/sql/column.py (diff)
Commit 7e867298fed60db670e40013524ed41b1ab46215 by dongjoon
[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for `date_trunc`
### What changes were proposed in this pull request?
Updated results of `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK 8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
### Why are the changes needed?
The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates the benchmark results to reflect the actual performance of `date_trunc`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By regenerating the benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark"
```
Closes #30338 from MaxGekk/fix-trunc_date-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 7e86729)
The file was modified sql/core/benchmarks/DateTimeBenchmark-results.txt (diff)
The file was modified sql/core/benchmarks/DateTimeBenchmark-jdk11-results.txt (diff)
Commit 318a173fcee11902820593fe4ac992a90e6bb00e by dongjoon
[SPARK-33402][CORE] Jobs launched in same second have duplicate
MapReduce JobIDs
### What changes were proposed in this pull request?
1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes a unique job UUID in `spark.sql.sources.writeJobUUID`.
2. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp, to ensure the probability of a collision is near-zero (see the sketch after this list).
3. Adds tests of uniqueness, round trips, and negative jobID rejection.
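A minimal Python sketch of the two mechanisms above (the actual change is Scala code in `SparkHadoopWriter` and `SparkHadoopWriterUtils`; the helper names and the exact timestamp format below are illustrative assumptions):
```python
import random
import time
import uuid

def create_job_tracker_id(timestamp_s: float) -> str:
    # Append a random non-negative 63-bit long to the formatted timestamp,
    # so two jobs launched in the same second still get distinct IDs.
    base = time.strftime("%Y%m%d%H%M%S", time.localtime(timestamp_s))
    return base + str(random.getrandbits(63))

def with_write_job_uuid(conf: dict) -> dict:
    # Set a fresh UUID under spark.sql.sources.writeJobUUID so committers
    # that honor that option see a guaranteed-unique job identity.
    out = dict(conf)
    out["spark.sql.sources.writeJobUUID"] = str(uuid.uuid4())
    return out
```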
### Why are the changes needed?
Without this, a job started in the same second as another, where the committer expects application attempt IDs to be unique, is at risk of clashing with the other job.
With the fix:
* Committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
* Committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID. Assuming all clocks in a cluster are in sync, the probability of two jobs launched in the same second clashing drops from 1 to 1/(2^63).
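To make the priority order concrete, a hypothetical sketch of how such a committer might choose its job identity (the option name comes from the PR text; the helper itself is invented for illustration):
```python
def choose_job_uuid(conf: dict, hadoop_job_id: str) -> str:
    # Prefer the Spark-provided UUID when present; otherwise fall back to
    # the (now randomized) Hadoop JobID for unique paths and filenames.
    return conf.get("spark.sql.sources.writeJobUUID") or hadoop_job_id
```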
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests
There's a new test suite, SparkHadoopWriterUtilsSuite, which creates JobIDs and verifies that they are unique even for the same timestamp, and that they can be marshalled to a string and parsed back by the Hadoop code, which contains some (brittle) assumptions about the format of job IDs.
Functional Integration Tests
1. Hadoop trunk built with [HADOOP-17318], publishing to a local Maven repository.
2. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
3. Spark + object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) built against that local Spark version.
4. Executed against AWS London.
The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID passed down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option; it was again using the current Date value for an app attempt, which is not guaranteed to be unique.
With the change applied to Spark, the relevant tests pass, so the committers are getting unique job IDs.
Closes #30319 from steveloughran/BUG/SPARK-33402-jobuuid.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: 318a173)
The file was added core/src/test/scala/org/apache/spark/internal/io/SparkHadoopWriterUtilsSuite.scala
The file was modified core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriterUtils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala (diff)
Commit 9d58a2f0f0f308a03830bf183959a4743a77b78a by yamamuro
[MINOR][GRAPHX] Correct typos in the sub-modules: graphx, external, and
examples
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules: graphx, external, and examples. Split out per holdenk's request in https://github.com/apache/spark/pull/30323#issuecomment-725159710
NOTE: The misspellings have been reported at
https://github.com/jsoref/spark/commit/706a726f87a0bbf5e31467fae9015218773db85b#commitcomment-44064356
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No testing was performed
Closes #30326 from jsoref/spelling-graphx.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 9d58a2f)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSourceSuite.scala (diff)
The file was modified external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala (diff)
The file was modified examples/src/main/python/streaming/sql_network_wordcount.py (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaTestUtils.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/streaming/JavaCustomReceiver.java (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java (diff)
The file was modified examples/src/main/python/streaming/recoverable_network_wordcount.py (diff)
The file was modified external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtilsPythonHelper.scala (diff)
The file was modified external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala (diff)
The file was modified graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala (diff)
The file was modified external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala (diff)
The file was modified external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/DockerJDBCIntegrationSuite.scala (diff)
The file was modified examples/src/main/python/ml/train_validation_split.py (diff)
The file was modified examples/src/main/python/sql/arrow.py (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala (diff)
The file was modified examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala (diff)
The file was modified external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/streaming/JavaSqlNetworkWordCount.java (diff)
The file was modified external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/streaming/JavaNetworkWordCount.java (diff)
Commit 61ee5d8a4e3080e01abfdbd8277fa75868c257cd by noreply
[WIP] Test (#30327)
* resend

* address comments

* directly gen new Iter

* directly gen new Iter

* update blockify strategy

* address comments

* try to fix 2.13

* try to fix scala 2.13

* use 1.0 as the default value for gemv

* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
(commit: 61ee5d8)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala (diff)
The file was modified python/pyspark/ml/param/_shared_params_code_gen.py (diff)
The file was modified python/pyspark/ml/param/shared.py (diff)
The file was modified python/pyspark/ml/param/shared.pyi (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/classification.pyi (diff)
Commit 6244407ce60c33ec9a549011723195fe8e15f287 by gurwls223
Revert "[WIP] Test (#30327)"
This reverts commit 61ee5d8a4e3080e01abfdbd8277fa75868c257cd.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009, but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #30345 from
zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 6244407)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala (diff)
The file was modified python/pyspark/ml/classification.pyi (diff)
The file was modified python/pyspark/ml/param/shared.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala (diff)
The file was modified python/pyspark/ml/param/shared.pyi (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Instance.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/InstanceSuite.scala (diff)
The file was modified python/pyspark/ml/param/_shared_params_code_gen.py (diff)
Commit 9f983a68f1fdefcd033ea65999ab916b61cba8b3 by gurwls223
[SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods
### Why are the changes needed?
Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests and the Mima test.
Closes #30344 from xuanyuanking/SPARK-30294-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 9f983a6)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDD.scala (diff)
Commit 22baf05a9ec6fffe53bd34d35c122de776464dd0 by gurwls223
[SPARK-33408][SPARK-32354][K8S][R] Use R 3.6.3 in K8s R image and
re-enable RTestsSuite
### What changes were proposed in this pull request?
This PR aims to use R 3.6.3 in K8s R image and re-enable `RTestsSuite`.
### Why are the changes needed?
Jenkins Server is using `R 3.6.3`.
```
+ SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
+ /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
* using log directory
‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
* using R version 3.6.3 (2020-02-29)
```
The OpenJDK docker image is using `R 3.5.2 (2018-12-20)`, which is old, and `spark-3.0.1` currently fails to run SparkR with it.
```
$ cd spark-3.0.1-bin-hadoop3.2
$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
...
exit code: 1
termination reason: Error
...
$ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R
$ k logs dataframe-r-b1c14b75b0c09eeb-driver
...
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf
spark.driver.bindAddress=172.17.0.4 --deploy-mode client
--properties-file /opt/spark/conf/spark.properties --class
org.apache.spark.deploy.RRunner
local:///opt/spark/examples/src/main/r/dataframe.R 20/11/10 06:03:58
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable log4j:WARN No
appenders could be found for logger
(io.netty.util.internal.logging.InternalLoggerFactory). log4j:WARN
Please initialize the log4j system properly. log4j:WARN See
http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: package or namespace load failed for ‘SparkR’ in rbind(info,
getNamespaceInfo(env, "S3methods")):
number of columns of matrices must match (see arg 2) In addition:
Warning message: package ‘SparkR’ was built under R version 4.0.2
Execution halted
```
In addition, this PR aims to recover the test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass K8S IT Jenkins job.
Closes #30130 from dongjoon-hyun/SPARK-32354.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 22baf05)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala (diff)
The file was modified resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile (diff)