Changes

Summary

  1. [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc (commit: 7f3d99a) (details)
  2. [SPARK-31069][CORE] Avoid repeatedly computing `chunksBeingTransferred` (commit: dd32f45) (details)
  3. [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow (commit: 8e2a0bd) (details)
  4. [SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1 (commit: 74bd046) (details)
  5. [SPARK-32852][SQL][DOC][FOLLOWUP] Revise the documentation of spark.sql.hive.metastore.jars (commit: a180e02) (details)
  6. [SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR (commit: 689c294) (details)
  7. [SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes (commit: 594c7c6) (details)
  8. [SPARK-27936][K8S] Support python deps (commit: dcac78e) (details)
Commit 7f3d99a8a5b17c049010db46adf9bf65a63eb241 by gurwls223
[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc
### What changes were proposed in this pull request?
This minor PR updates the docs of `schema_of_csv` and `schema_of_json`:
they now accept a foldable string column, not only a string literal.
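For illustration, a minimal PySpark sketch (assuming an active session) of passing a foldable string column:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, schema_of_json

spark = SparkSession.builder.getOrCreate()

# `lit(...)` yields a foldable string column, now documented as accepted
# alongside a plain string literal.
df = spark.range(1).select(schema_of_json(lit('{"a": 1, "b": [1.5]}')))
df.show(truncate=False)  # prints the inferred DDL schema string
```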
### Why are the changes needed?
The function docs of `schema_of_csv` and `schema_of_json` were not
updated in line with the previous PRs that changed this behavior.
### Does this PR introduce _any_ user-facing change?
Yes, it updates user-facing docs.
### How was this patch tested?
Unit test.
Closes #30396 from viirya/minor-json-csv.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 7f3d99a)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/functions.scala (diff)
The file was modified python/pyspark/sql/functions.py (diff)
Commit dd32f45d2058d00293330c01d3d9f53ecdbc036c by mridulatgmail.com
[SPARK-31069][CORE] Avoid repeatedly computing `chunksBeingTransferred`,
which causes high CPU cost in the external shuffle service when
`maxChunksBeingTransferred` uses its default value
### What changes were proposed in this pull request?
Follow-up from #27831; original author chrysan.
Each request checks `chunksBeingTransferred`:
```
public long chunksBeingTransferred() {
  long sum = 0L;
  for (StreamState streamState : streams.values()) {
    sum += streamState.chunksBeingTransferred.get();
  }
  return sum;
}
```
such as:
```
long chunksBeingTransferred = streamManager.chunksBeingTransferred();
if (chunksBeingTransferred >= maxChunksBeingTransferred) {
  logger.warn("The number of chunks being transferred {} is above {}, close the connection.",
    chunksBeingTransferred, maxChunksBeingTransferred);
  channel.close();
  return;
}
```
This traverses `streams` repeatedly, and fetching a data chunk accesses
`streams` too, which causes two problems:
1. repeated traversal of `streams`: the longer the map, the longer each check takes;
2. lock contention on the `ConcurrentHashMap` `streams`.
In this PR, when `maxChunksBeingTransferred` uses its default value, we
avoid computing `chunksBeingTransferred` entirely, since the limit can
never be exceeded in that case. Users who set this configuration and run
into performance problems can also backport PR #27831.
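For illustration only, a Python sketch of the guard this PR effectively adds (the real change is in the Java request handlers; all names below are hypothetical):
```
import sys

# Hypothetical stand-in for the default limit, which effectively means "unlimited".
MAX_CHUNKS_DEFAULT = sys.maxsize

def should_close_connection(stream_manager, max_chunks_being_transferred):
    # When the limit is left at its default, skip the O(n) traversal of
    # `streams` entirely: the check can never trip, so computing the sum
    # on every request is wasted work and needless lock contention.
    if max_chunks_being_transferred == MAX_CHUNKS_DEFAULT:
        return False
    return stream_manager.chunks_being_transferred() >= max_chunks_being_transferred
```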
### Why are the changes needed?
Speed up the `chunksBeingTransferred` check and avoid lock contention on `streams`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #30139 from AngersZhuuuu/SPARK-31069.
Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by:
chrysan <chrysanxia@gmail.com> Signed-off-by: Mridul Muralidharan
<mridul<at>gmail.com>
(commit: dd32f45)
The file was modified common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java (diff)
The file was modified common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java (diff)
Commit 8e2a0bdce706128891ebc4c00345a25d7dd41371 by gurwls223
[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow
### What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow, if using
pyarrow >= 2.0.0.
### Why are the changes needed?
MapType was previously unsupported with Arrow.
### Does this PR introduce _any_ user-facing change?
Users can now use MapType with `createDataFrame()` and `toPandas()` under
Arrow optimization, and with Pandas UDFs.
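As a hedged usage sketch (assuming pyarrow >= 2.0.0 is installed; the data here is made up):
```
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, MapType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

schema = StructType([StructField("attrs", MapType(StringType(), IntegerType()))])
pdf = pd.DataFrame({"attrs": [{"a": 1}, {"b": 2, "c": 3}]})

df = spark.createDataFrame(pdf, schema=schema)  # Arrow-optimized conversion
result = df.toPandas()                          # map values round-trip as Python dicts
```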
### How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas() and Scalar
Pandas UDFs.
Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.
Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon
<gurwls223@apache.org>
(commit: 8e2a0bd)
The file was modified python/docs/source/user_guide/arrow_pandas.rst (diff)
The file was modified python/pyspark/sql/tests/test_arrow.py (diff)
The file was modified python/pyspark/sql/tests/test_pandas_udf_scalar.py (diff)
The file was modified python/pyspark/sql/tests/test_pandas_grouped_map.py (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (diff)
The file was modified python/pyspark/sql/pandas/types.py (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala (diff)
The file was modified python/pyspark/sql/tests/test_pandas_udf_grouped_agg.py (diff)
The file was modified python/pyspark/sql/pandas/functions.py (diff)
The file was modified python/pyspark/sql/tests/test_pandas_cogrouped_map.py (diff)
The file was modified python/pyspark/sql/pandas/conversion.py (diff)
The file was modified python/pyspark/sql/pandas/serializers.py (diff)
Commit 74bd046d17db69830b6c798a60d3eb3c28e08dec by gurwls223
[SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1
### What changes were proposed in this pull request?
This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1.
### Why are the changes needed?
Release notes of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug
fixes for Java targets):
- v4.8: https://github.com/antlr/antlr4/releases/tag/4.8
- v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA tests.
Closes #30404 from maropu/UpgradeAntlr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by:
HyukjinKwon <gurwls223@apache.org>
(commit: 74bd046)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
Commit a180e0284205ca002ba6fa5fb9e692136febca0b by gengliang.wang
[SPARK-32852][SQL][DOC][FOLLOWUP] Revise the documentation of
spark.sql.hive.metastore.jars
### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/29881. It
revises the documentation of the configuration
`spark.sql.hive.metastore.jars`.
### Why are the changes needed?
Fix a grammatical error in the doc. Also, make it clearer that the
configuration is effective only when `spark.sql.hive.metastore.jars` is
set to `path`.
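For illustration, a hedged sketch of how the two configs relate (the jar path below is hypothetical):
```
from pyspark.sql import SparkSession

# `spark.sql.hive.metastore.jars.path` takes effect only because
# `spark.sql.hive.metastore.jars` is set to "path".
spark = (SparkSession.builder
         .config("spark.sql.hive.metastore.jars", "path")
         .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")  # hypothetical location
         .enableHiveSupport()
         .getOrCreate())
```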
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Just doc changes.
Closes #30407 from gengliangwang/reviseJarPathDoc.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(commit: a180e02)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala (diff)
Commit 689c2941021563883fc6b9d25f1a2108b4f7ceff by weichen.xu
[SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR
### What changes were proposed in this pull request?
Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control
the stacking of vectors.
### Why are the changes needed?
The performance gain is mainly related to the number of non-zero values
(nnz) per block.
### Does this PR introduce _any_ user-facing change?
Yes, the param `blockSize` is replaced by `maxBlockSizeInMB` in master.
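For example, a minimal PySpark sketch of the new size-based param (name per this commit's title; 0.25 MB is an arbitrary illustrative value):
```
from pyspark.ml.classification import LogisticRegression

# Block size is now bounded by memory footprint in MB rather than row count;
# the default 0.0 lets Spark pick a size automatically.
lr = LogisticRegression(maxIter=10, maxBlockSizeInMB=0.25)
```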
### How was this patch tested?
Updated test suites.
Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com> Co-authored-by:
Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Weichen Xu
<weichen.xu@databricks.com>
(commit: 689c294)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HingeAggregator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala (diff)
The file was modified python/pyspark/ml/regression.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala (diff)
The file was modified python/pyspark/ml/classification.pyi (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LeastSquaresAggregator.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala (diff)
The file was modified python/pyspark/ml/classification.py (diff)
The file was modified python/pyspark/ml/regression.pyi (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala (diff)
The file was modified python/pyspark/ml/param/_shared_params_code_gen.py (diff)
The file was modified python/pyspark/ml/param/shared.py (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/regression/AFTSurvivalRegressionSuite.scala (diff)
Commit 594c7c613a8ef80ab6b3725832579f12d40b02c8 by dongjoon
[SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file
system schemes
### What changes were proposed in this pull request?
This PR aims to generalize executor metrics to support user-given file
system schemes instead of the fixed `file,hdfs` scheme.
### Why are the changes needed?
For users relying only on cloud storage like `S3A`, we need to be able
to expose `S3A` metrics. Also, we can skip unused `hdfs` metrics.
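For instance, a hedged PySpark sketch of the new configuration (conf name as used in the manual test below):
```
from pyspark.sql import SparkSession

# Register executor filesystem metrics only for the schemes in use;
# `hdfs` is omitted here, so its unused metrics are skipped.
spark = (SparkSession.builder
         .config("spark.executor.metrics.fileSystemSchemes", "file,s3a")
         .getOrCreate())
```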
### Does this PR introduce _any_ user-facing change?
Yes, but it is compatible for existing users, who use only the `hdfs` and
`file` filesystem schemes.
### How was this patch tested?
Manually, do the following:
```
$ build/sbt -Phadoop-cloud package
$ sbin/start-master.sh; sbin/start-slave.sh spark://$(hostname):7077
$ bin/spark-shell --master spark://$(hostname):7077 -c
spark.executor.metrics.fileSystemSchemes=file,s3a -c
spark.metrics.conf.executor.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
scala> spark.read.textFile("s3a://dongjoon/README.md").collect()
```
Separately, launch `jconsole` and check `*.executor.filesystem.s3a.*`.
Also, confirm that there is no `*.executor.filesystem.hdfs.*`.
```
$ jconsole
```
![Screen Shot 2020-11-17 at 9 26 03 PM](https://user-images.githubusercontent.com/9700541/99487609-94121180-291b-11eb-9ed2-964546146981.png)
Closes #30405 from dongjoon-hyun/SPARK-33476.
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: 594c7c6)
The file was modified core/src/main/scala/org/apache/spark/executor/Executor.scala (diff)
The file was modified docs/monitoring.md (diff)
The file was modified core/src/main/scala/org/apache/spark/internal/config/package.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/executor/ExecutorSource.scala (diff)
Commit dcac78e12b146189090f1f7725d63393dd154a26 by dongjoon
[SPARK-27936][K8S] Support python deps
Supports Python client deps from the launcher filesystem. This is a
feature that was added for Java deps; this PR adds support for Python as well.
Yes.
Manually running different scenarios and examining the driver and
executor logs; an integration test was also added. I verified that the
Python resources are added to the Spark file server and are named
properly so they don't fail the executors. Note that, as before, the
following will not work: a primary resource `A.py` that uses a closure
defined in a submitted pyfile `B.py`, because context.py only adds files
with certain extensions (e.g. zip, egg, jar) to the PYTHONPATH.
Closes #25870 from skonto/python-deps.
Lead-authored-by: Stavros Kontopoulos <skontopo@redhat.com>
Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(commit: dcac78e)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverCommandFeatureStep.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/SparkContext.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/Utils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala (diff)
The file was modified resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala (diff)
The file was modified resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala (diff)