Changes

Summary

  1. [SPARK-34504][SQL] Avoid unnecessary resolving of SQL temp views for DDL commands (commit: af55373) (details)
  2. [SPARK-34699][SQL] 'CREATE OR REPLACE TEMP VIEW USING' should uncache correctly (commit: 387d866) (details)
  3. [SPARK-34742][SQL] ANSI mode: Abs throws exception if input is out of range (commit: 1433031) (details)
  4. [SPARK-34575][SQL] Push down limit through window when partitionSpec is empty (commit: c234c5b) (details)
  5. [SPARK-21449][SQL][FOLLOWUP] Avoid log undesirable IllegalStateException when state close (commit: 115f777) (details)
  6. [SPARK-34770][SQL] InMemoryCatalog.tableExists should not fail if database doesn't exist (commit: 1a4971d) (details)
  7. [SPARK-34768][SQL] Respect the default input buffer size in Univocity (commit: 385f1e8) (details)
  8. [SPARK-34766][SQL] Do not capture maven config for views (commit: 48637a9) (details)
  9. [SPARK-34749][SQL] Simplify ResolveCreateNamedStruct (commit: bf4570b) (details)
  10. [SPARK-34758][SQL] Simplify Analyzer.resolveLiteralFunction (commit: 9f7b0a0) (details)
  11. [SPARK-33602][SQL] Group exception messages in execution/datasources (commit: 569fb13) (details)
  12. [SPARK-34762][BUILD] Fix the build failure with Scala 2.13 which is related to commons-cli (commit: c5cadfe) (details)
  13. [SPARK-34087][SQL] Fix memory leak of ExecutionListenerBus (commit: 4d90c5d) (details)
  14. [SPARK-34731][CORE] Avoid ConcurrentModificationException when redacting properties in EventLoggingListener (commit: f8a8b34) (details)
  15. [SPARK-34728][SQL] Remove all SQLConf.get if extends from SQLConfHelper (commit: 25e7d1c) (details)
  16. [SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit (commit: 86ea520) (details)
  17. [SPARK-34741][SQL] MergeIntoTable should avoid ambiguous reference in UpdateAction (commit: d99135b) (details)
  18. [SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly (commit: 2e836cd) (details)
  19. [SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample (commit: 5570f81) (details)
  20. [SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE (commit: 8207e2f) (details)
  21. [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document (commit: 07ee732) (details)
Commit af553735b10812d303de411034540e3d90199a26 by wenchen
[SPARK-34504][SQL] Avoid unnecessary resolving of SQL temp views for DDL commands

### What changes were proposed in this pull request?

DDL commands like DROP VIEW don't really need to resolve the view (parse and analyze the view SQL text); they only need the view metadata.

This PR fixes the rule `ResolveTempViews` to resolve temp views only for `UnresolvedRelation`. This also fixes a bug for DROP VIEW: previously it tried to resolve the view and therefore failed to drop invalid views.
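
As a hedged illustration of the fixed behavior (the table and view names below are made up, not from the PR):

```scala
// Create a temp view, then invalidate it by dropping its underlying table.
spark.sql("CREATE TABLE t(i INT) USING parquet")
spark.sql("CREATE TEMP VIEW v AS SELECT i FROM t")
spark.sql("DROP TABLE t")
// Before this fix, DROP VIEW re-resolved v's SQL text, hit the missing
// table, and failed; now it only looks up the view metadata and succeeds.
spark.sql("DROP VIEW v")
```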

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #31853 from cloud-fan/view-resolve.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: af55373)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewTestSuite.scala (diff)
Commit 387d8662449cf7a87de95f018f71b5c32eab1796 by wenchen
[SPARK-34699][SQL] 'CREATE OR REPLACE TEMP VIEW USING' should uncache correctly

### What changes were proposed in this pull request?

This PR proposes:
  1. `CREATE OR REPLACE TEMP VIEW USING` should use `TemporaryViewRelation` to store temp views.
  2. By doing #1, it fixes the issue where the temp view being replaced is not uncached.

### Why are the changes needed?

This is a part of an ongoing work to wrap all the temporary views with `TemporaryViewRelation`: [SPARK-34698](https://issues.apache.org/jira/browse/SPARK-34698).

This also fixes a bug where the temp view being replaced is not uncached.

### Does this PR introduce _any_ user-facing change?

Yes, the temp view being replaced with `CREATE OR REPLACE TEMP VIEW USING` is correctly uncached if the temp view is cached.
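
A sketch of the corrected behavior in spark-shell (the paths and the view name are illustrative):

```scala
spark.sql("CREATE TEMP VIEW v USING parquet OPTIONS (path '/tmp/data1')")
spark.sql("CACHE TABLE v")
assert(spark.catalog.isCached("v"))
// Replacing the view now uncaches the old definition instead of leaving
// a stale cache entry behind.
spark.sql("CREATE OR REPLACE TEMP VIEW v USING parquet OPTIONS (path '/tmp/data2')")
assert(!spark.catalog.isCached("v"))
```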

### How was this patch tested?

Added new tests.

Closes #31825 from imback82/create_temp_view_using.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 387d866)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala (diff)
The file was modified sql/core/src/test/resources/sql-tests/results/show-tables.sql.out (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala (diff)
Commit 143303147b64c0ff7f57c217f9d4b9de2cbff649 by wenchen
[SPARK-34742][SQL] ANSI mode: Abs throws exception if input is out of range

### What changes were proposed in this pull request?

For the following cases, ABS should throw exceptions since the results are out of the range of the result data types in ANSI mode.
```
SELECT abs(${Int.MinValue});
SELECT abs(${Long.MinValue});
```
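
Concretely, with `spark.sql.ansi.enabled=true`, these inputs produce results that do not fit the input type (`Int.MinValue` is -2147483648), so they now fail instead of silently returning the overflowed negative input; a hedged spark-shell sketch:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// abs(Int.MinValue) cannot be represented as an INT, so ANSI mode throws
// an arithmetic overflow error rather than returning -2147483648.
spark.sql(s"SELECT abs(CAST(${Int.MinValue} AS INT))").show()
```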
### Why are the changes needed?

Better ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, Abs throws an exception if the input is out of range in ANSI mode.

### How was this patch tested?

Unit test

Closes #31836 from gengliangwang/ansiAbs.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1433031)
The file was modified docs/sql-ref-ansi-compliance.md (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ArithmeticExpressionSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (diff)
Commit c234c5b5f1676fbb9a79dc865534fec566425326 by wenchen
[SPARK-34575][SQL] Push down limit through window when partitionSpec is empty

### What changes were proposed in this pull request?

Push down limit through `Window` when the partitionSpec of all window functions is empty and the same order is used. This is a real case from production:

![image](https://user-images.githubusercontent.com/5399861/109457143-3900c680-7a95-11eb-9078-806b041175c2.png)

This PR supports two cases:
1. All window functions have the same orderSpec:
   ```sql
   SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn, RANK() OVER(ORDER BY a) AS rk FROM t1 LIMIT 5;
   == Optimized Logical Plan ==
   Window [row_number() windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#4, rank(a#9L) windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#5], [a#9L ASC NULLS FIRST]
   +- GlobalLimit 5
      +- LocalLimit 5
         +- Sort [a#9L ASC NULLS FIRST], true
            +- Relation default.t1[A#9L,B#10L,C#11L] parquet
   ```
2. There is a window function with a different orderSpec:
   ```sql
   SELECT a, ROW_NUMBER() OVER(ORDER BY a) AS rn, RANK() OVER(ORDER BY b DESC) AS rk FROM t1 LIMIT 5;
   == Optimized Logical Plan ==
   Project [a#9L, rn#4, rk#5]
   +- Window [rank(b#10L) windowspecdefinition(b#10L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#5], [b#10L DESC NULLS LAST]
      +- GlobalLimit 5
         +- LocalLimit 5
            +- Sort [b#10L DESC NULLS LAST], true
               +- Window [row_number() windowspecdefinition(a#9L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#4], [a#9L ASC NULLS FIRST]
                  +- Project [a#9L, b#10L]
                     +- Relation default.t1[A#9L,B#10L,C#11L] parquet
   ```

### Why are the changes needed?

Improve query performance.

```scala
spark.range(500000000L).selectExpr("id AS a", "id AS b").write.saveAsTable("t1")
spark.sql("SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rowId FROM t1 LIMIT 5").show
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/109456919-c68fe680-7a94-11eb-89ca-67ec03267158.png) | ![image](https://user-images.githubusercontent.com/5399861/109456927-cd1e5e00-7a94-11eb-9866-d76b2665caea.png)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31691 from wangyum/SPARK-34575.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: c234c5b)
The file was added sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushdownThroughWindowSuite.scala
The file was added sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushDownThroughWindow.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
Commit 115f777cb0a9dff78497bad9b64daa5da1ee0e51 by yao
[SPARK-21449][SQL][FOLLOWUP] Avoid log undesirable IllegalStateException when state close

### What changes were proposed in this pull request?

`TmpOutputFile` and `TmpErrOutputFile` are registered in `o.a.h.u.ShutdownHookManager` during creation. `state.close()` deletes them if they are not null and tries to remove them from `o.a.h.u.ShutdownHookManager`, which causes an IllegalStateException when we call it in our own ShutdownHookManager too.
In this PR, we delete them ahead of time with a high-priority hook in Spark and set them to null to bypass the deletion and cancellation in `state.close()`.

### Why are the changes needed?

With or without this PR, the deletion of these files is unaffected; we just mute an undesirable error log here.

### Does this PR introduce _any_ user-facing change?

no, this is a follow-up

### How was this patch tested?

#### The undesirable error log is gone
```scala
spark-sql> 21/03/16 18:41:31 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.IllegalStateException: Shutdown in progress, cannot cancel a deleteOnExit
at org.apache.hive.common.util.ShutdownHookManager.cancelDeleteOnExit(ShutdownHookManager.java:106)
at org.apache.hadoop.hive.common.FileUtils.deleteTmpFile(FileUtils.java:861)
at org.apache.hadoop.hive.ql.session.SessionState.deleteTmpErrOutputFile(SessionState.java:325)
at org.apache.hadoop.hive.ql.session.SessionState.dropSessionPaths(SessionState.java:829)
at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1585)
at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:66)
at org.apache.spark.sql.hive.client.HiveClientImpl.closeState(HiveClientImpl.scala:172)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$new$1(HiveClientImpl.scala:175)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1994)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
(python)  ✘ kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  cd ..
(python)  kentyaohulk  ~/Downloads/spark  tar zxf spark-3.2.0-SNAPSHOT-bin-20210316.tgz
(python)  kentyaohulk  ~/Downloads/spark  cd -
~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316
(python)  kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  bin/spark-sql --conf spark.local.dir=./local --conf spark.hive.exec.local.scratchdir=./local
21/03/16 18:42:15 WARN Utils: Your hostname, hulk.local resolves to a loopback address: 127.0.0.1; using 10.242.189.214 instead (on interface en0)
21/03/16 18:42:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/03/16 18:42:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/03/16 18:42:16 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
21/03/16 18:42:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
21/03/16 18:42:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
21/03/16 18:42:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
21/03/16 18:42:19 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore kentyao127.0.0.1
Spark master: local[*], Application Id: local-1615891336877
spark-sql> %
```

#### And the deletion is still fine

```shell
kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316 
ls -al local
total 0
drwxr-xr-x   7 kentyao  staff  224  3 16 18:42 .
drwxr-xr-x  19 kentyao  staff  608  3 16 18:42 ..
drwx------   2 kentyao  staff   64  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e51
-rw-r--r--   1 kentyao  staff    0  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e51219959790473242539.pipeout
-rw-r--r--   1 kentyao  staff    0  3 16 18:42 16cc5238-e25e-4c0f-96ef-0c4bdecc7e518816377057377724129.pipeout
drwxr-xr-x   2 kentyao  staff   64  3 16 18:42 blockmgr-37a52ad2-eb56-43a5-8803-8f58d08fe9ad
drwx------   3 kentyao  staff   96  3 16 18:42 spark-101971df-f754-47c2-8764-58c45586be7e
kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316  ls -al local
total 0
drwxr-xr-x   2 kentyao  staff   64  3 16 19:22 .
drwxr-xr-x  19 kentyao  staff  608  3 16 18:42 ..
kentyaohulk  ~/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210316 
```

Closes #31850 from yaooqinn/followup.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 115f777)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (diff)
Commit 1a4971d8a16b5bc624cef584271243bf64a51941 by wenchen
[SPARK-34770][SQL] InMemoryCatalog.tableExists should not fail if database doesn't exist

### What changes were proposed in this pull request?

This PR updates `InMemoryCatalog.tableExists` to return false if the database doesn't exist, instead of failing. The new behavior is consistent with `HiveExternalCatalog`, which is used in production, so this bug mostly affects tests.
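
A minimal sketch of the fixed check, assuming a simplified in-memory layout (this is not the exact Spark source):

```scala
import scala.collection.mutable

// Simplified stand-in for InMemoryCatalog's internal state.
case class DatabaseDesc(tables: mutable.Set[String] = mutable.Set.empty)
val catalog = mutable.Map[String, DatabaseDesc]()

// The fix, in spirit: a missing database now yields false instead of an error.
def tableExists(db: String, table: String): Boolean =
  catalog.get(db).exists(_.tables.contains(table))
```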

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new test

Closes #31860 from cloud-fan/catalog.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 1a4971d)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (diff)
Commit 385f1e8f5de5dcad62554cd75446e98c9380b384 by gurwls223
[SPARK-34768][SQL] Respect the default input buffer size in Univocity

### What changes were proposed in this pull request?

This PR proposes to follow Univocity's default input buffer size.

### Why are the changes needed?

- Firstly, it's best to trust Univocity's judgement on the default values; also, 128 bytes is too low.
- The default values arguably have more test coverage in Univocity.
- It will also fix https://github.com/uniVocity/univocity-parsers/issues/449, which is a regression compared to Spark 2.4.
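
A sketch of the idea with simplified names (the real change lives in `CSVOptions`; the option plumbing below is illustrative, not the actual source):

```scala
import com.univocity.parsers.csv.CsvParserSettings

// Only override the input buffer when a size was explicitly configured;
// otherwise keep Univocity's much larger default instead of the old
// hard-coded 128 bytes. `userBufferSize` stands in for a parsed CSV option.
val userBufferSize: Option[Int] = None
val settings = new CsvParserSettings()
userBufferSize.foreach(settings.setInputBufferSize)
```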

### Does this PR introduce _any_ user-facing change?

No. In addition, it fixes a regression.

### How was this patch tested?

Manually tested, and added a unit test.

Closes #31858 from HyukjinKwon/SPARK-34768.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 385f1e8)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala (diff)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala (diff)
Commit 48637a9d437f4077243cb4a4607cf6be58490724 by yao
[SPARK-34766][SQL] Do not capture maven config for views

### What changes were proposed in this pull request?

Skip capturing the Maven repo config for views.

### Why are the changes needed?

Due to a bad network, we always use a third-party Maven repo to run tests, e.g.:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxxxx
```

It fails with an error message like:
```
[info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
[info] show-tblproperties.sql
[info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
[info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories xxxxx]" Result did not match for query #6
[info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
```

It's not necessary to capture the Maven config in the view, since it's a session-level config.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test passes:
```
build/sbt "test:testOnly *SQLQueryTestSuite" -Dspark.sql.maven.additionalRemoteRepositories=xxx
```

Closes #31856 from ulysses-you/skip-maven-config.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 48637a9)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala (diff)
Commit bf4570b43dc10a1c7b66802f1e067bae877b3355 by yamamuro
[SPARK-34749][SQL] Simplify ResolveCreateNamedStruct

### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/31808 and simplifies its fix to one line (excluding comments).

### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #31843 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: bf4570b)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala (diff)
Commit 9f7b0a035bbaf2c165839a9f6109841b812db37b by yamamuro
[SPARK-34758][SQL] Simplify Analyzer.resolveLiteralFunction

### What changes were proposed in this pull request?

This PR simplifies `Analyzer.resolveLiteralFunction` to always create the `Alias`. The caller side will remove the `Alias` if it's not necessary.

### Why are the changes needed?

code simplification.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #31844 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 9f7b0a0)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 569fb133d09e24e4ed56ed7efff641512d98b01b by wenchen
[SPARK-33602][SQL] Group exception messages in execution/datasources

### What changes were proposed in this pull request?
This PR groups exception messages in `/core/src/main/scala/org/apache/spark/sql/execution/datasources`.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.
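
Illustratively, the pattern looks like this (the method name below is hypothetical, not an actual Spark method):

```scala
// A central error "dictionary" per layer: call sites stop constructing
// exception messages inline and delegate to a shared object, so wording
// stays consistent and is easy to audit in one place.
object QueryExecutionErrors {
  def unsupportedFieldNameError(fieldName: String): Throwable =
    new RuntimeException(s"Unsupported field name: $fieldName")
}

// before: throw new RuntimeException(s"Unsupported field name: $name")
// after:  throw QueryExecutionErrors.unsupportedFieldNameError(name)
```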

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #31757 from beliefer/SPARK-33602.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: 569fb13)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/binaryfile/BinaryFileFormat.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/RecordReaderIterator.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala (diff)
Commit c5cadfefdf9b4c6135355b49366fd9e9d1e3fcd0 by gurwls223
[SPARK-34762][BUILD] Fix the build failure with Scala 2.13 which is related to commons-cli

### What changes were proposed in this pull request?

This PR fixes the build failure with Scala 2.13 which is related to `commons-cli`.
The last few days, builds with Scala 2.13 on GA have continued to fail with an error message like the following:
```
[error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:26:1:  error: package org.apache.commons.cli does not exist
[error] import org.apache.commons.cli.GnuParser;
```
The reason is that `mvn help` in `change-scala-version.sh` downloads the POM file of `commons-cli` but doesn't download the JAR file, leading to the build failure.

This PR also adds `commons-cli` to the dependencies explicitly because HiveThriftServer depends on it.

### Why are the changes needed?

Expect to fix the build failure with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that build successfully finishes with Scala 2.13 on my laptop.
```
find ~/.m2 -name commons-cli -exec rm -rf {} \;
find ~/.ivy2 -name commons-cli -exec rm -rf {} \;
find ~/.cache/ -name commons-cli -exec rm -rf {} \; // For Linux
find ~/Library/Caches -name commons-cli -exec rm -rf {} \; // For macOS

dev/change-scala-version 2.13
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 clean compile test:compile
```

Closes #31862 from sarutak/commons-cli.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: c5cadfe)
The file was modified sql/hive-thriftserver/pom.xml (diff)
The file was modified pom.xml (diff)
The file was modified dev/change-scala-version.sh (diff)
Commit 4d90c5dc0efcf77ef6735000ee7016428c57077b by gurwls223
[SPARK-34087][SQL] Fix memory leak of ExecutionListenerBus

### What changes were proposed in this pull request?

This PR proposes an alternative way to fix the memory leak of `ExecutionListenerBus`, which would automatically clean them up.

Basically, the idea is to add `registerSparkListenerForCleanup` to `ContextCleaner`, so we can remove the `ExecutionListenerBus` from `LiveListenerBus` when the `SparkSession` is GC'ed.

On the other hand, to make the `SparkSession` GC-able, we need to get rid of the reference to `SparkSession` in `ExecutionListenerBus`. Therefore, we introduced the `sessionUUID`, a unique identifier for a SparkSession, to replace the `SparkSession` object.

Note that, the proposal wouldn't take effect when `spark.cleaner.referenceTracking=false` since it depends on `ContextCleaner`.
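
A generic sketch of the mechanism (not Spark's code): the cleaner keeps only a weak reference to the owner (the SparkSession), so the owner stays GC-able, and once it has been collected the associated listener is unregistered.

```scala
import java.lang.ref.WeakReference

class ListenerCleaner[O <: AnyRef, L](unregister: L => Unit) {
  private var tracked = List.empty[(WeakReference[O], L)]

  def register(owner: O, listener: L): Unit = {
    val entry = (new WeakReference(owner), listener)
    tracked = entry :: tracked
  }

  // Invoked periodically: drop listeners whose owner has been collected.
  def sweep(): Unit = {
    val (dead, live) = tracked.partition { case (ref, _) => ref.get == null }
    dead.foreach { case (_, listener) => unregister(listener) }
    tracked = live
  }
}
```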

### Why are the changes needed?

Fix the memory leak caused by `ExecutionListenerBus` mentioned in SPARK-34087.

### Does this PR introduce _any_ user-facing change?

Yes, it saves memory for users.

### How was this patch tested?

Added unit test.

Closes #31839 from Ngone51/fix-mem-leak-of-ExecutionListenerBus.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 4d90c5d)
The file was modified sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/ContextCleaner.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala (diff)
Commit f8a8b340b3a69fd10514af03085d77027b84e617 by gurwls223
[SPARK-34731][CORE] Avoid ConcurrentModificationException when redacting properties in EventLoggingListener

### What changes were proposed in this pull request?

Change DAGScheduler to pass a clone of the Properties object, rather than the original object, to the SparkListenerJobStart event.

### Why are the changes needed?

DAGScheduler might modify the Properties object (e.g., in addPySparkConfigsToProperties) after firing off the SparkListenerJobStart event. Since the handler for that event (onJobStart in EventLoggingListener) iterates over the elements of the Properties object, this sometimes results in a ConcurrentModificationException.

This can be demonstrated using these steps:
```
$ bin/spark-shell --conf spark.ui.showConsoleProgress=false \
--conf spark.executor.cores=1 --driver-memory 4g --conf \
"spark.ui.showConsoleProgress=false" \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/tmp/spark-events
...
scala> (0 to 500).foreach { i =>
     |   val df = spark.range(0, 20000).toDF("a")
     |   df.filter("a > 12").count
     | }
21/03/12 18:16:44 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)
```

I've not actually seen a ConcurrentModificationException in onStageSubmitted, only in onJobStart. However, they both iterate over the Properties object, so for safety's sake I pass a clone to SparkListenerStageSubmitted as well.
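
The essence of the fix is a defensive copy, roughly like this (a sketch, not the exact DAGScheduler change):

```scala
import java.util.Properties

// Post a snapshot of the properties so later mutation by the scheduler
// cannot race listeners that iterate over the original object.
def cloneProperties(props: Properties): Properties = {
  val copy = new Properties()
  copy.putAll(props)
  copy
}
```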

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By repeatedly running the reproduction steps from above.

Closes #31826 from bersprockets/elconcurrent.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: f8a8b34)
The file was modified core/src/main/scala/org/apache/spark/util/Utils.scala (diff)
The file was modified core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (diff)
Commit 25e7d1ceee8c9f4ecb4ab796f51e9bcbc0500fae by gurwls223
[SPARK-34728][SQL] Remove all SQLConf.get if extends from SQLConfHelper

### What changes were proposed in this pull request?

Replace all `SQLConf.get` with `conf` in classes that extend `SQLConfHelper`.

### Why are the changes needed?

Clean up code.
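
For illustration, the mechanical change looks like this (`MyAnalysisHelper` is a made-up class; `SQLConfHelper` already exposes a `conf` member backed by `SQLConf.get`):

```scala
import org.apache.spark.sql.catalyst.SQLConfHelper

class MyAnalysisHelper extends SQLConfHelper {
  // before: SQLConf.get.caseSensitiveAnalysis
  def caseSensitive: Boolean = conf.caseSensitiveAnalysis
}
```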

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.

Closes #31822 from leoluan2009/SPARK-34728.

Authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 25e7d1c)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeCsvJsonExprs.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala (diff)
The file was modified sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/objects.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowPartitionsExec.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/streaming/StreamingRelationV2.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/EliminateResolvedHint.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveUnion.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PlanDynamicPruningFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/CleanupDynamicPruningFilters.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UpdateFields.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SchemaPruning.scala (diff)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala (diff)
Commit 86ea5203201a1b7611f0beaa9c7759d480fb21af by gurwls223
[SPARK-34757][CORE][DEPLOY] Ignore cache for SNAPSHOT dependencies in spark-submit

### What changes were proposed in this pull request?
This change is to ignore cache for SNAPSHOT dependencies in spark-submit.

### Why are the changes needed?
When spark-submit is executed with --packages, it will not download the dependency jars when they are available in the cache (e.g. the Ivy cache), even when the dependencies are SNAPSHOT versions.

This might block developers who work on external modules in Spark (e.g. spark-avro), since they need to remove the cache manually every time they update the code during development (which generates SNAPSHOT jars). Without knowing this, they could be left wondering why their code changes are not reflected in spark-submit executions.

### Does this PR introduce _any_ user-facing change?
Yes. With this change, developers/users who run spark-submit with SNAPSHOT dependencies do not need to remove the cache every time the SNAPSHOT dependencies are updated.

### How was this patch tested?
Added a unit test.

Closes #31849 from bozhang2820/spark-submit-cache-ignore.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 86ea520)
The file was modified core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (diff)
The file was modified core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala (diff)
Commit d99135b66ab4ebab69973800990e12845c4ffeaa by wenchen
[SPARK-34741][SQL] MergeIntoTable should avoid ambiguous reference in UpdateAction

### What changes were proposed in this pull request?

This PR proposes to deduplicate the source table when there're conflicting attributes between the target table and the source table.

### Why are the changes needed?

When resolving the `UpdateAction`, which can reference attributes from both the target and source tables, Spark should know clearly which table an attribute comes from when there are conflicting attributes, instead of picking one at random.
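
For example (the schemas are made up for illustration):

```scala
// Hypothetical schemas: target(id INT, value INT) and source(id INT, value INT).
// The unqualified `value` on the right-hand side could come from either
// table; deduplicating the source's conflicting attributes lets Spark
// resolve the reference unambiguously.
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.value = value + 1
""")
```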

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test and updated existing tests.

Closes #31835 from Ngone51/dedup-MergeIntoTable.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(commit: d99135b)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala (diff)
The file was modified sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala (diff)
The file was modified sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (diff)
Commit 2e836cdb598255d6b43b386e98fcb79b70338e69 by srowen
[SPARK-34774][BUILD] Ensure change-scala-version.sh update scala.version in parent POM correctly

### What changes were proposed in this pull request?
After SPARK-34507, executing the `change-scala-version.sh` script updates `scala.version` in the parent POM, but if we execute the following commands in order:

```
dev/change-scala-version.sh 2.13
dev/change-scala-version.sh 2.12
git status
```

git reports the following diff:

```
diff --git a/pom.xml b/pom.xml
index ddc4ce2f68..f43d8c8f78 100644
--- a/pom.xml
+++ b/pom.xml
@@ -162,7 +162,7 @@
     <commons.math3.version>3.4.1</commons.math3.version>

     <commons.collections.version>3.2.2</commons.collections.version>
-    <scala.version>2.12.10</scala.version>
+    <scala.version>2.13.5</scala.version>
     <scala.binary.version>2.12</scala.binary.version>
     <scalatest-maven-plugin.version>2.0.0</scalatest-maven-plugin.version>
     <scalafmt.parameters>--test</scalafmt.parameters>
```

It seems the 'scala.version' property was not updated correctly.

So this PR adds an extra `scala.version` property to the `scala-2.12` profile to ensure `change-scala-version.sh` can update the public `scala.version` property correctly.

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
**Manual test**

Execute the following commands in order:

```
dev/change-scala-version.sh 2.13
dev/change-scala-version.sh 2.12
git status
```

**Before**

```
diff --git a/pom.xml b/pom.xml
index ddc4ce2f68..f43d8c8f78 100644
--- a/pom.xml
+++ b/pom.xml
@@ -162,7 +162,7 @@
     <commons.math3.version>3.4.1</commons.math3.version>

     <commons.collections.version>3.2.2</commons.collections.version>
-    <scala.version>2.12.10</scala.version>
+    <scala.version>2.13.5</scala.version>
     <scala.binary.version>2.12</scala.binary.version>
     <scalatest-maven-plugin.version>2.0.0</scalatest-maven-plugin.version>
     <scalafmt.parameters>--test</scalafmt.parameters>
```

**After**

No git diff.

Closes #31865 from LuciferYang/SPARK-34774.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(commit: 2e836cd)
The file was modified pom.xml (diff)
Commit 5570f817b2862c2680546f35c412bb06779ae1c9 by yao
[SPARK-34760][EXAMPLES] Replace `favorite_color` with `age` in JavaSQLDataSourceExample

### What changes were proposed in this pull request?
In JavaSparkSQLExample, executing `peopleDF.write().partitionBy("favorite_color").bucketBy(42, "name").saveAsTable("people_partitioned_bucketed");` throws:
`Exception in thread "main" org.apache.spark.sql.AnalysisException: partition column favorite_color is not defined in table people_partitioned_bucketed, defined table columns are: age, name;`
This PR changes the column `favorite_color` to `age`.

### Why are the changes needed?
So that JavaSparkSQLExample runs successfully.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested manually in JavaSparkSQLExample.

Closes #31851 from zengruios/SPARK-34760.

Authored-by: zengruios <578395184@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(commit: 5570f81)
The file was modified examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java (diff)
The file was modified examples/src/main/python/sql/datasource.py (diff)
The file was modified examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java (diff)
Commit 8207e2f65cc2ce2d87ee60ee05a2c1ee896cf93e by yamamuro
[SPARK-34781][SQL] Eliminate LEFT SEMI/ANTI joins to its left child side in AQE

### What changes were proposed in this pull request?

In `EliminateJoinToEmptyRelation.scala`, we can extend it to cover more cases for LEFT SEMI and LEFT ANTI joins:

* The join is a LEFT SEMI join, the join's right side is non-empty, and the condition is empty: eliminate the join to its left side.
* The join is a LEFT ANTI join and the join's right side is empty: eliminate the join to its left side.

Given we eliminate the join to its left side here, the current optimization rule is renamed to `EliminateUnnecessaryJoin`.
In addition, the rule now uses `checkRowCount()` to check the runtime row count, instead of using `EmptyHashedRelation`, so it also covers `BroadcastNestedLoopJoin` (whose broadcast side is an `Array[InternalRow]`, not a `HashedRelation`). A condensed sketch of the two rewrites follows below.
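
The sketch uses a toy plan type (illustrative only; the real rule matches Catalyst plans and reads the runtime row count of the materialized stage):

```scala
sealed trait Plan
case class Scan(table: String) extends Plan
case class LeftSemiJoin(left: Plan, right: Plan, cond: Option[String]) extends Plan
case class LeftAntiJoin(left: Plan, right: Plan, cond: Option[String]) extends Plan

// rowCount returns the observed run-time row count of a (materialized) plan.
def eliminate(plan: Plan, rowCount: Plan => Option[Long]): Plan = plan match {
  // LEFT SEMI, no condition, non-empty right side: keep only the left child.
  case LeftSemiJoin(l, r, None) if rowCount(r).exists(_ > 0) => l
  // LEFT ANTI, empty right side: keep only the left child.
  case LeftAntiJoin(l, r, _) if rowCount(r).contains(0) => l
  case other => other
}
```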

### Why are the changes needed?

Cover more join cases, and improve query performance for affected queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `AdaptiveQueryExecSuite.scala`.

Closes #31873 from c21/aqe-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(commit: 8207e2f)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AQEOptimizer.scala (diff)
The file was added sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateUnnecessaryJoin.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (diff)
The file was removed sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/EliminateJoinToEmptyRelation.scala
The file was modified sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala (diff)
Commit 07ee73234f1d1ecd1e5edcce3bc510c59a59cb00 by gurwls223
[SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document

### What changes were proposed in this pull request?

This PR fixes an issue where the virtual operators (`||`, `!=`, `<>`, `between` and `case`) are absent from the Spark SQL built-in functions document.

### Why are the changes needed?

The document should cover all the supported built-in operators.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the document with `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 bundler exec jekyll build` and then confirmed the document.

![neq1](https://user-images.githubusercontent.com/4736016/111192859-e2e76380-85fc-11eb-89c9-75916a5e856a.png)
![neq2](https://user-images.githubusercontent.com/4736016/111192874-e7ac1780-85fc-11eb-9a9b-c504265b373f.png)
![between](https://user-images.githubusercontent.com/4736016/111192898-eda1f880-85fc-11eb-992d-cf80c544ec27.png)
![case](https://user-images.githubusercontent.com/4736016/111192918-f266ac80-85fc-11eb-9306-5dbc413a0cdb.png)
![double_pipe](https://user-images.githubusercontent.com/4736016/111192952-fb577e00-85fc-11eb-932e-385e5c2a5205.png)

Closes #31841 from sarutak/builtin-op-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(commit: 07ee732)
The file was modified sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala (diff)
The file was modified sql/gen-sql-api-docs.py (diff)