Changes

Summary

  1. [SPARK-21481][ML][FOLLOWUP][TRIVIAL] HashingTF use (commit: 934a91f) (details)
  2. [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed (commit: 9a155d4) (details)
  3. [SPARK-32974][ML] FeatureHasher transform optimization (commit: 0c38765) (details)
  4. [SPARK-32714][FOLLOW-UP][PYTHON] Address pyspark.install typing errors (commit: c65b645) (details)
  5. [SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in (commit: bc77e5b) (details)
  6. [SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13 (commit: bb6d5e7) (details)
Commit 934a91fcb4de1e5c4b93b58e7452afa4bb4a9586 by srowen
[SPARK-21481][ML][FOLLOWUP][TRIVIAL] HashingTF use
util.collection.OpenHashMap instead of mutable.HashMap
### What changes were proposed in this pull request?
`HashingTF` now uses `util.collection.OpenHashMap` instead of
`mutable.HashMap`.
### Why are the changes needed?
According to the documentation of `util.collection.OpenHashMap`:
> This map is about 5X faster than java.util.HashMap, while using much
> less space overhead.

According to performance tests such as [Simple microbenchmarks comparing
Scala vs Java mutable map performance](https://gist.github.com/pchiusano/1423303),
`mutable.HashMap` may be less efficient than `java.util.HashMap`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test suites.
Closes #29852 from zhengruifeng/hashingtf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 934a91f)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala (diff)
Commit 9a155d42a3202fbafc48f8b722bbc27cce522e11 by dhyun
[SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed
class name in TreeNode
### What changes were proposed in this pull request?
Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error
in `TreeNode`.
### Why are the changes needed?
On older JDK versions (e.g. JDK8u), nested Scala classes may trigger
`java.lang.Class.getSimpleName` to throw a `java.lang.InternalError:
Malformed class name` error.
Similar to https://github.com/apache/spark/pull/29050, we should use
Spark's `Utils.getSimpleName` utility function in place of
`Class.getSimpleName` to avoid hitting the issue.
### Does this PR introduce _any_ user-facing change?
Fixes a bug that throws an error when invoking `TreeNode.nodeName`,
otherwise no changes.
### How was this patch tested?
Added new unit test case in `TreeNodeSuite`. Note that the test case
assumes the test code can trigger the expected error, otherwise it'll
skip the test safely, for compatibility with newer JDKs.
Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the
fix works;
- JDK11u: the test case does not trigger the "Malformed class name"
issue, and the test case is safely skipped.
Closes #29875 from rednaxelafx/spark-32999-getsimplename.
Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(commit: 9a155d4)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala (diff)
Commit 0c38765b297337c3d80496db09ae7f79d2acf778 by ruifengz
[SPARK-32974][ML] FeatureHasher transform optimization
### What changes were proposed in this pull request?
Pre-compute the output indices of numerical columns instead of computing
them on each row.
### Why are the changes needed?
For a numerical column, the output index is a hash of its `col_name`, so
it can be computed once up front instead of on every row.
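The idea can be sketched in Python as follows. This is an illustrative sketch only, not the Scala code in `FeatureHasher.scala`: `hash_index` and `hash_numeric_rows` are hypothetical names, and `zlib.crc32` is a stand-in for whatever hash function Spark actually uses.

```python
import zlib
from typing import Dict, Iterable, List, Mapping, Optional


def hash_index(col_name: str, num_features: int) -> int:
    # Stand-in hash (assumption): CRC32 of the column name, folded into range.
    return zlib.crc32(col_name.encode("utf-8")) % num_features


def hash_numeric_rows(
    rows: Iterable[Mapping[str, Optional[float]]],
    numeric_cols: List[str],
    num_features: int,
) -> List[Dict[int, float]]:
    # The index depends only on the column name, so compute it once here
    # rather than once per row inside the loop below.
    index_of = {c: hash_index(c, num_features) for c in numeric_cols}
    hashed = []
    for row in rows:
        vec: Dict[int, float] = {}
        for c in numeric_cols:
            value = row.get(c)
            if value is None:
                continue  # skip missing values
            idx = index_of[c]
            # Columns that collide on an index are summed into the same slot.
            vec[idx] = vec.get(idx, 0.0) + float(value)
        hashed.append(vec)
    return hashed
```

With many rows and few columns, the sketch performs one hash per column instead of one hash per column per row, which is the saving the description refers to.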
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test suites.
Closes #29850 from zhengruifeng/hash_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
(commit: 0c38765)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala (diff)
Commit c65b64552f947a7eaf4f379edbdce05daa923363 by gurwls223
[SPARK-32714][FOLLOW-UP][PYTHON] Address pyspark.install typing errors
### What changes were proposed in this pull request?
This PR adds two `type: ignore` comments, one in `pyspark.install` and one in
the related tests.
### Why are the changes needed?
To satisfy MyPy type checks. It seems that we originally missed some
changes that happened around the merge of
https://github.com/apache/spark/commit/31a16fbb405a19dc3eb732347e0e1f873b16971d
```
python/pyspark/install.py:30: error: Need type annotation for 'UNSUPPORTED_COMBINATIONS' (hint: "UNSUPPORTED_COMBINATIONS: List[<type>] = ...")  [var-annotated]
python/pyspark/tests/test_install_spark.py:105: error: Cannot find implementation or library stub for module named 'xmlrunner'  [import]
python/pyspark/tests/test_install_spark.py:105: note: See https://mypy.readthedocs.io/en/latest/running_mypy.html#missing-imports
```
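For context, mypy typically reports `var-annotated` when it cannot infer a variable's element type, for example from an empty container literal. A minimal sketch of the two usual ways to address it is shown here; the variable names and the helper `is_supported` are hypothetical and are not taken from `pyspark/install.py`.

```python
from typing import List, Tuple

# Option 1: annotate explicitly so mypy knows the element type.
unsupported_pairs: List[Tuple[str, str]] = []

# Option 2: suppress the check on that line with a `type: ignore` comment,
# which is the kind of change the PR description mentions.
unsupported_pairs_ignored = []  # type: ignore


def is_supported(hadoop: str, hive: str) -> bool:
    # A combination is supported unless it is listed as unsupported.
    return (hadoop, hive) not in unsupported_pairs
```

An explicit annotation keeps type checking active for later uses of the variable, while `type: ignore` only silences the report on that one line.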
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Existing tests.
- MyPy tests
   ```
   mypy --show-error-code --no-incremental --config python/mypy.ini python/pyspark
   ```
Closes #29878 from zero323/SPARK-32714-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: c65b645)
The file was modified python/pyspark/install.py (diff)
The file was modified python/pyspark/tests/test_install_spark.py (diff)
Commit bc77e5b840b2feb18a9c8a61dfe75f421e5b64ca by srowen
[SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in
inputCols
### What changes were proposed in this pull request?
1. Update the comment: `Note, the relevant columns must also be set in inputCols` ->
`Note, the relevant columns should also be set in inputCols`.
2. Add a check: if any `categoricalCols` are not set in `inputCols`,
log a warning.
### Why are the changes needed?
There is no check to make sure that all `categoricalCols` are set in
`inputCols`. To keep the existing behavior, the comment is updated to
match it.
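The kind of check described above can be sketched in Python. This is not the Scala code added to `FeatureHasher.scala`; the function name `missing_categorical_cols`, the logger name, and the message text are assumptions made for illustration.

```python
import logging
from typing import List, Sequence

logger = logging.getLogger("feature_hasher_sketch")


def missing_categorical_cols(
    input_cols: Sequence[str], categorical_cols: Sequence[str]
) -> List[str]:
    # Collect categorical columns that were never listed as input columns.
    known = set(input_cols)
    missing = [c for c in categorical_cols if c not in known]
    if missing:
        # Warn instead of raising, so existing pipelines keep working.
        logger.warning(
            "categoricalCols not set in inputCols: %s", ", ".join(missing)
        )
    return missing
```

For example, `missing_categorical_cols(["a", "b"], ["b", "c"])` logs a warning and returns `["c"]`. Warning rather than failing matches the stated goal of keeping existing behavior.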
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually, in the REPL.
Closes #29868 from zhengruifeng/feature_hash_cat_doc.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: bc77e5b)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala (diff)
Commit bb6d5e7a908dbd0918a9fe50147be7d16a4733f5 by srowen
[SPARK-32972][ML] Pass all UTs of `mllib` module in Scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32972. A total of 51 failed
Scala test cases and 3 failed Java test cases were fixed. The main
changes are as follows:
- Specify `Seq` as `scala.collection.Seq` when pattern matching on `Seq`
and when casting with `x.asInstanceOf[Seq[T]]`
- Use `Row.getSeq[T]` instead of `Row.getAs[Seq]`
- Manually call the `toMap` method to convert `MapView` to `Map` in Scala 2.13
- Change the tolerance in the last test to 0.75 so that
`RandomForestRegressorSuite#training with sample weights` passes in Scala 2.13
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass the GitHub 2.13 Build Action

Run the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl mllib -Pscala-2.13 -am
mvn test -pl mllib -Pscala-2.13 -fn
```
**Before**
```
[ERROR] Errors:
[ERROR]   JavaVectorIndexerSuite.vectorIndexerAPI:51 » ClassCast scala.collection.conver...
[ERROR]   JavaWord2VecSuite.testJavaWord2Vec:51 » Spark Job aborted due to stage failure...
[ERROR]   JavaPrefixSpanSuite.runPrefixSpanSaveLoad:79 » Spark Job aborted due to stage ...
Tests: succeeded 1567, failed 51, canceled 0, ignored 7, pending 0
*** 51 TESTS FAILED ***
```
**After**
```
[INFO] Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
Tests: succeeded 1617, failed 0, canceled 0, ignored 7, pending 0
All tests passed.
```
Closes #29857 from LuciferYang/fix-mllib-2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: bb6d5e7)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/regression/RandomForestRegressorSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/MinHashLSHSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/util/MLTestSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSHSuite.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/LSHTest.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/feature/NGramSuite.scala (diff)
The file was modified mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala (diff)
The file was modified mllib/src/test/scala/org/apache/spark/ml/fpm/FPGrowthSuite.scala (diff)