Skip to content

Conversation

@soumyakanti3578
Copy link
Contributor

@soumyakanti3578 soumyakanti3578 commented Nov 18, 2025

What changes were proposed in this pull request?

  1. In GenericJdbcDatabaseAccessor.getQualifiedTableName() quote schema and table names, as the qualified name is used to run queries against different DBs.
  2. In CalcitePlanner.genTableLogicalPlan() remove the double quotes (unescape identifier) from Constants.JDBC_SCHEMA and Constants.JDBC_TABLE
  3. In JdbcStorageHandler.getURIForAuth() unescape Constants.JDBC_TABLE, as otherwise we generate the jdbc string with "
  4. Store quoting characters for different databases in DatabaseType
  5. Helper methods to unescape identifiers
  6. Ignore SqlException from parameterMetaData.getParameterType - jdbc_table_with_schema_oracle.q was failing because of this
  7. Tests for different DBs with case sensitive schemas and tables.

Why are the changes needed?

We see Exceptions, as shown here:
https://issues.apache.org/jira/browse/HIVE-29308

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -pl itests/qtest -Pitests -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite=true -Dqfile="jdbc_case_sensitive_postgres.q,jdbc_case_sensitive_mssql.q,jdbc_case_sensitive_mariadb.q,jdbc_case_sensitive_oracle.q,jdbc_table_with_schema_oracle.q"

@sonarqubecloud
Copy link

@soumyakanti3578 soumyakanti3578 marked this pull request as ready for review December 30, 2025 17:29
Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left quite a few comments/questions but those that are the most important are:

  • the potential breaking changes
  • the conflicts/ambiguity between case sensitive and insensitive identifiers.

return null;
}
String schemaName = conf.get(Constants.JDBC_SCHEMA);
String schemaName = quoteIdentifier(unescapeHiveJdbcIdentifier(conf.get(Constants.JDBC_SCHEMA)), dbType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always quote the schema/table name then we are making all code relying on this method case-sensitive. In other words, if previously the user had defined "hive.sql.table" = "Country" then queries over this table may stop working due to case sensitivity. Quoting everything seems to be a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method getQualifiedTableName is used to build queries that are run in the backend db. We have to make sure to quote the table and schema names here, if we want to support case sensitivity. Currently, if the user defines "hive.sql.table" = "Country", the table name in the query that is run in the backend db is "essentially" country. And if we quote all the time we will be changing the table name to Country, which might break existing jobs.

I think it's easy to check if the table and schema names were "escaped" by the user before we "unescape" and quote. If they were not escaped, we could skip it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditional escaping is a good idea. I have the impression though that we may also need to make quoting conditional. For the conditional we have various options:

  • Based on the content/value of the hive.sql.table property as proposed above
  • Based on a new global config property (e.g., hive.jdbc.identifiers.casesensitive)
  • Based on table level property (e.g., hive.sql.identifiers.casesensitive)

Not sure whats the best option to move forward. Maybe we can take some inspiration by checking what other engines (Presto/Trino/Spark/etc) are doing.

* @return the unescaped identifier
*/
public static String unescapeHiveJdbcIdentifier(String identifier) {
return unescapeIdentifier(identifier, '"');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice of " as the "Hive" identifier seems a bit arbitrary. Who/Where do we define the values that are allowed inside hive.sql.table, hive.sql.schema, etc?

What prevents a MSSQL user to write "hive.sql.table"="[WorldData]". Was this working before? Is this working now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used " as a POC for now, but I think we just have to decide on a quoting character for Hive. I think either ' or " should be fine, but we definitely should support just one. An MSSQL user can create a table in the backend using [..], but when creating an external Hive table, they should use the quoting character that Hive supports. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this stage, I am trying to understand the impact on existing use-cases. Does "hive.sql.table"="[WorldData]" work without the changes in this PR? Was there any kind of quoting that was working even without the changes in this PR?

Comment on lines +3183 to +3184
final String schemaName = unescapeHiveJdbcIdentifier(tabMetaData.getProperty(Constants.JDBC_SCHEMA));
final String tableName = unescapeHiveJdbcIdentifier(tabMetaData.getProperty(Constants.JDBC_TABLE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user defines two tables with the following props:

  • "hive.sql.table"="[WorldData]"
  • "hive.sql.table"="WorldData"
    The underlying DBMS may also have two tables defined as such.

Aren't we risking having conflicts/ambiguity if we "unescape" the content of the properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed above in another comment, we should maybe decide on a quoting character for Hive.

@@ -137,7 +138,7 @@ public void setJobProperties(Map<String, String> jobProperties) {
this.jobProperties = jobProperties;
}

@Explain(displayName = "jobProperties", explainLevels = { Level.EXTENDED })
@Explain(displayName = ExplainTask.JOB_PROPERTIES, explainLevels = { Level.EXTENDED })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes seems unrelated to the PR. From a quick search it seems that whenever we use @Explain(displayName we always have a plain string so for consistency reasons I would leave it as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's related, but tangentially :)
I think I had to add CONFIG_USERNAME in ‎JdbcStorageConfigManager.java‎ to support case sensitivity for one of the backend DBs (maybe Oracle, but I have to check). And we don't want to print usernames and passwords in the explain plans. I removed the username from the explain plans in ExplainTask.java, where I am also using the variable ExplainTask.JOB_PROPERTIES.

It's better to use the variable ExplainTask.JOB_PROPERTIES here to avoid bugs in the future arising from changes in the displayName.

Without these changes we will see username in some plans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a variable creates dependencies (import) to classes that were not present before. Another point that is a bit worrisome is that we rely on the displayName (that should be relevant only for an end-user) to perform some actions in the code. I don't know if this pattern is already present in the code but overall seems shaky and we should better avoid it if possible.

Anyways, as mentioned elsewhere this change should land separately so we can continue the discussion on the respective PR.

"jdbc:metastore://" : tableProperties.get(Constants.JDBC_URL);
String table_name = tableProperties.get(Constants.JDBC_TABLE);
return new URI(host_url+"/"+table_name);
String tableName = unescapeHiveJdbcIdentifier(tableProperties.get(Constants.JDBC_TABLE));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we "unescape" some content of the property aren't we risking conflicts/ambiguity.
For instance:

  • "hive.sql.table"="\"Country\""
  • "hive.sql.table"="Country"

Moreover, it seems that the current unescaping is not gonna work for other quoting chars:

  • "hive.sql.table"="[Country]"
  • "hive.sql.table"="`Country`"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked that we do run into conflicts here. I added

CREATE TABLE "WorldData".Country
(
    id   int,
    name varchar(20)
);

INSERT INTO "WorldData".Country VALUES (5, 'France'), (6, 'Italy'), (7, 'Spain'), (8, 'Portugal');

to q_test_case_sensitive.postgres.sql

While creating the external table, if we pass "hive.sql.table" = "country", the select outputs the new data shown above, but if we pass "hive.sql.table" = "Country" or "hive.sql.table" = "\"Country\"", the select returns data from the case sensitive table.

I think we can check if the value of "hive.sql.table" is escaped or not. If not, then we can probably always just use lower case. Let me know what you think about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using lower case may not always be an option since as far as I recall there are DBMS that store unquoted identifers using different conventions (e.g., upper case).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from ambiguity quoted identifiers may contain arbitrary characters inside the quotes that are not URI friendly (e.g., ?%+-/&$). We may have to come up with a more robust normalization strategy but this depends on how getURIForAuth is used and if we care about supporting such use-cases.



-- Test Case-Sensitive Query Field Names
-- (Should fail in SerDe/Iterator with Column not found)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test should fail it should be in the negative directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comments are incorrect/from an earlier version. I will remove these too

Comment on lines +73 to +76
-- Cleanup
DROP TABLE country_test;
DROP TABLE cities_test;
DROP TABLE geography_test;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that the framework takes care of the cleanup so these DROP statements are redundant.

throw new IllegalArgumentException("Database type string cannot be null");
}
// METASTORE must be handled before valueOf
if (METASTORE.name().equalsIgnoreCase(dbType)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does METASTORE need special handling?

* @return The matching DatabaseType.
* @throws IllegalArgumentException if the dbType is null or not a valid type.
*/
public static DatabaseType from(String dbType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this method and not simply use valueOf?

"hive.sql.jdbc.url" = "${system:hive.test.database.qdb.jdbc.url}",
"hive.sql.dbcp.username" = "${system:hive.test.database.qdb.jdbc.username}",
"hive.sql.dbcp.password" = "${system:hive.test.database.qdb.jdbc.password}",
"hive.sql.schema" = "\"WorldData\"",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that we also support single quoted table properties so it may be a bit more readable to use those instead of escaping double quotes:

'hive.sql.schema'='"WorldData"'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants