Changes

Summary

  1. [SPARK-37524][SQL] We should drop all tables after testing dynamic (details)
  2. [SPARK-37526][INFRA][PYTHON][TESTS] Add Java17 PySpark daily test (details)
  3. [SPARK-37531][INFRA][PYTHON][TESTS] Use PyArrow 6.0.0 in Python 3.9 (details)
  4. [SPARK-37534][BUILD] Bump dev.ludovic.netlib to 2.2.1 (details)
  5. [SPARK-37530][CORE] Spark reads many paths very slowly through (details)
Commit 2433c942ca39b948efe804aeab0185a3f37f3eea by gurwls223
[SPARK-37524][SQL] We should drop all tables after testing dynamic partition pruning

### What changes were proposed in this pull request?

Drop all tables after testing dynamic partition pruning.

### Why are the changes needed?
Tables created by the dynamic partition pruning suites were not cleaned up, so leftover state could leak into later tests; dropping all tables after the suite keeps tests isolated.
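The cleanup pattern the fix applies can be sketched as follows. This is a hedged illustration in Python with a hypothetical in-memory `FakeCatalog` standing in for a real Spark session catalog; the class and method names are invented for the example, not taken from the Spark codebase.

```python
# Sketch of the "drop all tables in teardown" pattern, using a
# hypothetical in-memory catalog instead of a real Spark session.
import unittest


class FakeCatalog:
    """Stand-in for a SQL catalog: tracks tables created by a test."""

    def __init__(self):
        self.tables = set()

    def create_table(self, name):
        self.tables.add(name)

    def drop_table(self, name):
        self.tables.discard(name)

    def list_tables(self):
        return sorted(self.tables)


class DynamicPruningSuite(unittest.TestCase):
    def setUp(self):
        self.catalog = FakeCatalog()
        self.catalog.create_table("fact_sk")
        self.catalog.create_table("dim_store")

    def tearDown(self):
        # The fix: drop every table the suite created so state does
        # not leak into later suites sharing the same session.
        for name in list(self.catalog.tables):
            self.catalog.drop_table(name)

    def test_tables_exist(self):
        self.assertEqual(self.catalog.list_tables(),
                         ["dim_store", "fact_sk"])
```

After `tearDown`, the catalog is empty, so a following suite sees no leftover tables.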

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #34768 from weixiuli/SPARK-11150-fix.

Authored-by: weixiuli <weixiuli@jd.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
Commit f99e2e61b2d2b067f0ee9ce2e6886f1218ccda0e by gurwls223
[SPARK-37526][INFRA][PYTHON][TESTS] Add Java17 PySpark daily test coverage

### What changes were proposed in this pull request?

This PR aims to add Java 17 PySpark daily test coverage.

### Why are the changes needed?

To support Java 17 at Apache Spark 3.3.

After SPARK-37522, I verified the following with Python 3.9.7 on Linux (though not with every Python library).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Pass the CIs to verify this doesn't break anything.
- After manual review, the daily job itself should be verified after merging.

Closes #34788 from dongjoon-hyun/SPARK-37526.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The file was modified .github/workflows/build_and_test.yml (diff)
Commit eba4f5c6b605829565f213fdcf8444d16d672504 by dongjoon
[SPARK-37531][INFRA][PYTHON][TESTS] Use PyArrow 6.0.0 in Python 3.9 tests at GitHub Action job

### What changes were proposed in this pull request?

This PR aims to use `PyArrow 6.0.0` in `Python 3.9` unit tests at GitHub Action jobs.

Although the main change is removing the `pyarrow<5.0.0` limitation, there are other minor version changes as well, because the image was rebuilt more recently:
- https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/4f7408f4a95ef9784fdaf490be56bcfd7ff309bb
```
- RUN python3.9 -m pip install numpy 'pyarrow<5.0.0' pandas scipy xmlrunner plotly>=4.8 sklearn 'mlflow>=1.0'
+ RUN python3.9 -m pip install numpy pyarrow pandas scipy xmlrunner plotly>=4.8 sklearn 'mlflow>=1.0'
```

```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20211116 pip3.9 list > 20211116
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210930 pip3.9 list > 20210930
$ diff 20210930 20211116
# The following is manually formatted for simplicity.
...
Jinja2                    3.0.1         3.0.3
mlflow                    1.20.2        1.21.0
numpy                     1.21.2        1.21.4
pandas                    1.3.3         1.3.4
plotly                    5.3.1         5.4.0
pyarrow                   4.0.1         6.0.0
scikit-learn              1.0           1.0.1
scipy                     1.7.1         1.7.2
```

### Why are the changes needed?

SPARK-37342 upgraded Apache Arrow to 6.0.0 on the Java/Scala side.
This is the corresponding upgrade for PySpark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action.

Closes #34793 from dongjoon-hyun/SPARK-37531.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The file was modified .github/workflows/build_and_test.yml (diff)
Commit ae9aebab940b0e5683c4b7a14302d3aedb149275 by dongjoon
[SPARK-37534][BUILD] Bump dev.ludovic.netlib to 2.2.1

### What changes were proposed in this pull request?

Bump the version of dev.ludovic.netlib from 2.2.0 to 2.2.1. This fixes a computation bug in sgemm; see [1]. The diff is [2].

[1] https://github.com/luhenry/netlib/issues/7
[2] https://github.com/luhenry/netlib/compare/v2.2.0...v2.2.1
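For context, BLAS `sgemm` computes `C = alpha*A*B + beta*C` in single precision. A minimal pure-Python reference sketch of that operation (lists of lists, no BLAS), only to illustrate what the netlib fix concerns:

```python
# Reference sketch of the sgemm operation: C = alpha * A @ B + beta * C.
# A is m x k, B is k x n, C is m x n; plain Python floats stand in for
# single-precision values.
def sgemm(alpha, a, b, beta, c):
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = alpha * acc + beta * c[i][j]
    return out
```

A tuned implementation like netlib's vectorized sgemm must produce the same result as this naive triple loop; the 2.2.1 release fixes a case where it did not.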

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #34783 from luhenry/patch-1.

Authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The file was modified dev/deps/spark-deps-hadoop-3.2-hive-2.3 (diff)
The file was modified pom.xml (diff)
The file was modified dev/deps/spark-deps-hadoop-2.7-hive-2.3 (diff)
Commit e0d41e887ea18ff3b82f0451db89075777c510d1 by yao
[SPARK-37530][CORE] Spark reads many paths very slowly through newAPIHadoopFile

### What changes were proposed in this pull request?

As in https://github.com/apache/spark/pull/18441, we parallelize `FileInputFormat.listStatus` for `newAPIHadoopFile`.

### Why are the changes needed?

![image](https://user-images.githubusercontent.com/8326978/144562490-d8005bf2-2052-4b50-9a5d-8b253ee598cc.png)

Spark can be slow when listing many paths from external storage on the driver side; parallelizing the listing improves performance.
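The idea can be sketched in Python with a thread pool: list each path concurrently instead of one after another. This is a hedged illustration using local directories and `os.listdir`; the helper name `list_paths_parallel` is invented here and stands in for the parallelized `FileInputFormat.listStatus` call, it is not Spark's actual API.

```python
# Sketch: listing many paths concurrently on the driver. Each listing
# is I/O-bound, so a thread pool overlaps the per-path latency.
import os
from concurrent.futures import ThreadPoolExecutor


def list_paths_parallel(paths, max_workers=8):
    """Return {path: sorted entry names}, listing all paths concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda p: (p, sorted(os.listdir(p))), paths))
```

With remote storage, each `listStatus`-style call can take tens or hundreds of milliseconds, so overlapping them across N paths cuts the total driver-side wait roughly by the pool width.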

### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?

Passing GitHub Actions.

Closes #34792 from yaooqinn/SPARK-37530.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
The file was modified core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala (diff)