Changes

Summary

  1. [SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils (commit: 27cd945) (details)
  2. [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 (commit: 1df69f7) (details)
  3. [SPARK-33464][INFRA] Add/remove (un)necessary cache and restructure (commit: fbfc0bf) (details)
Commit 27cd945c151dccb5ac863e6bc2c4f5b2c6a6d996 by hkarau
[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils
### What changes were proposed in this pull request?
This PR is a follow-up of #29471 and makes the following improvements to
`HadoopFSUtils`:
1. Removes the extra `filterFun` from the listing API and combines it with the `filter` predicate.
2. Removes `SerializableBlockLocation` and `SerializableFileStatus`, since `BlockLocation` and `FileStatus` are already serializable.
3. Hides the `isRootLevel` flag from the top-level API.
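The first change can be illustrated with a minimal sketch. This is not the actual `HadoopFSUtils` code; `FileEntry`, `listBefore`, and `listAfter` are hypothetical stand-ins showing how a separate `filterFun` hook can be folded into a single `filter` predicate that callers compose themselves:

```scala
// Hypothetical stand-in for Hadoop's FileStatus, for illustration only.
case class FileEntry(path: String, isDirectory: Boolean)

object ListingSketch {
  // Before: two separate filtering hooks passed to the listing API.
  def listBefore(entries: Seq[FileEntry],
                 filter: FileEntry => Boolean,
                 filterFun: Option[String => Boolean]): Seq[FileEntry] =
    entries.filter(e => filter(e) && filterFun.forall(f => f(e.path)))

  // After: one combined predicate; callers compose their filters up front.
  def listAfter(entries: Seq[FileEntry],
                filter: FileEntry => Boolean): Seq[FileEntry] =
    entries.filter(filter)

  def main(args: Array[String]): Unit = {
    val entries = Seq(FileEntry("/data/part-0", isDirectory = false),
                      FileEntry("/data/_temporary", isDirectory = true))
    // Callers now merge extra conditions into the single predicate.
    val visible = listAfter(entries, e => !e.path.contains("_temporary"))
    println(visible.map(_.path).mkString(","))
  }
}
```

Collapsing the two hooks into one predicate shrinks the API surface without losing expressiveness, since any `filterFun` can be conjoined into `filter` at the call site.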
### Why are the changes needed?
The main purpose is to simplify the logic within `HadoopFSUtils` and to
clean up the API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests (e.g., `FileIndexSuite`)
Closes #29959 from sunchao/hadoop-fs-utils-followup.
Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Holden Karau
<hkarau@apple.com>
(commit: 27cd945)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala (diff)
Commit 1df69f7e324aa799c05f6158e433371c5eeed8ce by brkyvz
[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2
### What changes were proposed in this pull request?
This adds support for metadata columns to DataSourceV2. If a source
implements `SupportsMetadataColumns`, it must also implement
`SupportsPushDownRequiredColumns` so that those columns can be projected.
The analyzer is updated to resolve metadata columns from
`LogicalPlan.metadataOutput`, and a new rule adds metadata columns to the
output of `DataSourceV2Relation` when they are referenced.
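The shape of the new connector API can be sketched as follows. These are simplified stand-ins, not the real interfaces: Spark's `MetadataColumn` and `SupportsMetadataColumns` live in `org.apache.spark.sql.connector.catalog` and use Spark `DataType`s, while here a plain `String` names the type; `InMemoryTableSketch` is a hypothetical source modeled loosely on the in-memory test table this PR updates:

```scala
// Simplified stand-in for org.apache.spark.sql.connector.catalog.MetadataColumn.
trait MetadataColumn {
  def name: String
  def dataType: String // the real API returns a Spark DataType
}

// Simplified stand-in for SupportsMetadataColumns: a table mixes this in
// to advertise extra columns that are not part of its regular schema.
trait SupportsMetadataColumns {
  def metadataColumns: Array[MetadataColumn]
}

// A hypothetical source exposing `_partition` and `index` metadata columns.
class InMemoryTableSketch extends SupportsMetadataColumns {
  override def metadataColumns: Array[MetadataColumn] = Array(
    new MetadataColumn { val name = "_partition"; val dataType = "string" },
    new MetadataColumn { val name = "index"; val dataType = "int" }
  )
}
```

Against such a source, a query like `SELECT id, _partition FROM t` would resolve `_partition` through the table's metadata columns rather than its regular schema.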
### Why are the changes needed?
This is the solution discussed for exposing additional data in the Kafka
source. It is also needed for a generic `MERGE INTO` plan.
### Does this PR introduce any user-facing change?
Yes. Users can project additional columns from sources that implement
the new API. This also updates `DescribeTableExec` to show metadata
columns.
### How was this patch tested?
Will include new unit tests.
Closes #28027 from rdblue/add-dsv2-metadata-columns.
Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz
<brkyvz@gmail.com>
(commit: 1df69f7)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala (diff)
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was added sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Implicits.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownUtils.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala (diff)
Commit fbfc0bf62879a7029e52d392618f623d24eedf1c by dongjoon
[SPARK-33464][INFRA] Add/remove (un)necessary cache and restructure
GitHub Actions yaml
### What changes were proposed in this pull request?
This PR proposes:
- Add the `~/.sbt` directory to the build cache; see also
https://github.com/sbt/sbt/issues/3681
- Move `hadoop-2` down to group it with `java-11` and `scala-213`; see
https://github.com/apache/spark/pull/30391#discussion_r524881430
- Remove the unnecessary `.m2` cache when only SBT tests are run.
- Remove `rm ~/.m2/repository/org/apache/spark`; it only matters after
`sbt publishLocal` or `mvn install`.
- Use Java 8 in the Scala 2.13 build. We can switch to Java 11, the
version used for releases, later.
- Add caches to the linters. The linter scripts use `sbt` (e.g.,
`./dev/lint-scala`) and `mvn` (e.g., `./dev/lint-java`). The Jekyll build
also requires `sbt package`; see
https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161.
We need full caches here for SBT, Maven, and the build tools.
- Use a consistent syntax for the Java version: 1.8 -> 8.
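The `~/.sbt` cache addition can be sketched as a GitHub Actions step. This is a hedged illustration, not the exact diff from the PR; the step name, cache key, and `hashFiles` inputs are assumptions:

```yaml
# Sketch of caching ~/.sbt in a GitHub Actions job, per sbt/sbt#3681.
# Key names and hashed files are illustrative, not the PR's exact values.
- name: Cache SBT
  uses: actions/cache@v2
  with:
    path: ~/.sbt
    key: sbt-${{ hashFiles('**/build.sbt', 'project/build.properties') }}
    restore-keys: sbt-
```

Keying on the build definition files means the cache is reused until the build configuration changes, with `restore-keys` providing a stale fallback.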
### Why are the changes needed?
- Remove unnecessary stuff
- Cache what we can in the build
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
It will be tested by the GitHub Actions build on this PR.
Closes #30391 from HyukjinKwon/SPARK-33464.
Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon
Hyun <dongjoon@apache.org>
(commit: fbfc0bf)
The file was modified .github/workflows/build_and_test.yml (diff)