-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29308: Exception when JDBC table names are case-sensitive #6197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
b44ea6f to
82c4ade
Compare
82c4ade to
c88fdfa
Compare
c88fdfa to
1415a39
Compare
|
zabetak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left quite a few comments/questions but those that are the most important are:
- the potential breaking changes
- the conflicts/ambiguity between case sensitive and insensitive identifiers.
| return null; | ||
| } | ||
| String schemaName = conf.get(Constants.JDBC_SCHEMA); | ||
| String schemaName = quoteIdentifier(unescapeHiveJdbcIdentifier(conf.get(Constants.JDBC_SCHEMA)), dbType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we always quote the schema/table name then we are making all code relying on this method case-sensitive. In other words, if previously the user had defined "hive.sql.table" = "Country" then queries over this table may stop working due to case sensitivity. Quoting everything seems to be a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method getQualifiedTableName is used to build queries that are run in the backend db. We have to make sure to quote the table and schema names here, if we want to support case sensitivity. Currently, if the user defines "hive.sql.table" = "Country", the table name in the query that is run in the backend db is "essentially" country. And if we quote all the time we will be changing the table name to Country, which might break existing jobs.
I think it's easy to check if the table and schema names were "escaped" by the user before we "unescape" and quote. If they were not escaped, we could skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conditional escaping is a good idea. I have the impression though that we may also need to make quoting conditional. For the conditional we have various options:
- Based on the content/value of the
hive.sql.tableproperty as proposed above - Based on a new global config property (e.g.,
hive.jdbc.identifiers.casesensitive) - Based on table level property (e.g.,
hive.sql.identifiers.casesensitive)
Not sure whats the best option to move forward. Maybe we can take some inspiration by checking what other engines (Presto/Trino/Spark/etc) are doing.
| * @return the unescaped identifier | ||
| */ | ||
| public static String unescapeHiveJdbcIdentifier(String identifier) { | ||
| return unescapeIdentifier(identifier, '"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice of " as the "Hive" identifier seems a bit arbitrary. Who/Where do we define the values that are allowed inside hive.sql.table, hive.sql.schema, etc?
What prevents a MSSQL user to write "hive.sql.table"="[WorldData]". Was this working before? Is this working now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used " as a POC for now, but I think we just have to decide on a quoting character for Hive. I think either ' or " should be fine, but we definitely should support just one. An MSSQL user can create a table in the backend using [..], but when creating an external Hive table, they should use the quoting character that Hive supports. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this stage, I am trying to understand the impact on existing use-cases. Does "hive.sql.table"="[WorldData]" work without the changes in this PR? Was there any kind of quoting that was working even without the changes in this PR?
| final String schemaName = unescapeHiveJdbcIdentifier(tabMetaData.getProperty(Constants.JDBC_SCHEMA)); | ||
| final String tableName = unescapeHiveJdbcIdentifier(tabMetaData.getProperty(Constants.JDBC_TABLE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user defines two tables with the following props:
"hive.sql.table"="[WorldData]""hive.sql.table"="WorldData"
The underlying DBMS may also have two tables defined as such.
Aren't we risking having conflicts/ambiguity if we "unescape" the content of the properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed above in another comment, we should maybe decide on a quoting character for Hive.
| @@ -137,7 +138,7 @@ public void setJobProperties(Map<String, String> jobProperties) { | |||
| this.jobProperties = jobProperties; | |||
| } | |||
|
|
|||
| @Explain(displayName = "jobProperties", explainLevels = { Level.EXTENDED }) | |||
| @Explain(displayName = ExplainTask.JOB_PROPERTIES, explainLevels = { Level.EXTENDED }) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes seems unrelated to the PR. From a quick search it seems that whenever we use @Explain(displayName we always have a plain string so for consistency reasons I would leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's related, but tangentially :)
I think I had to add CONFIG_USERNAME in JdbcStorageConfigManager.java to support case sensitivity for one of the backend DBs (maybe Oracle, but I have to check). And we don't want to print usernames and passwords in the explain plans. I removed the username from the explain plans in ExplainTask.java, where I am also using the variable ExplainTask.JOB_PROPERTIES.
It's better to use the variable ExplainTask.JOB_PROPERTIES here to avoid bugs in the future arising from changes in the displayName.
Without these changes we will see username in some plans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a variable creates dependencies (import) to classes that were not present before. Another point that is a bit worrisome is that we rely on the displayName (that should be relevant only for an end-user) to perform some actions in the code. I don't know if this pattern is already present in the code but overall seems shaky and we should better avoid it if possible.
Anyways, as mentioned elsewhere this change should land separately so we can continue the discussion on the respective PR.
| "jdbc:metastore://" : tableProperties.get(Constants.JDBC_URL); | ||
| String table_name = tableProperties.get(Constants.JDBC_TABLE); | ||
| return new URI(host_url+"/"+table_name); | ||
| String tableName = unescapeHiveJdbcIdentifier(tableProperties.get(Constants.JDBC_TABLE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we "unescape" some content of the property aren't we risking conflicts/ambiguity.
For instance:
"hive.sql.table"="\"Country\"""hive.sql.table"="Country"
Moreover, it seems that the current unescaping is not gonna work for other quoting chars:
"hive.sql.table"="[Country]"- "hive.sql.table"="`Country`"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I checked that we do run into conflicts here. I added
CREATE TABLE "WorldData".Country
(
id int,
name varchar(20)
);
INSERT INTO "WorldData".Country VALUES (5, 'France'), (6, 'Italy'), (7, 'Spain'), (8, 'Portugal');
to q_test_case_sensitive.postgres.sql
While creating the external table, if we pass "hive.sql.table" = "country", the select outputs the new data shown above, but if we pass "hive.sql.table" = "Country" or "hive.sql.table" = "\"Country\"", the select returns data from the case sensitive table.
I think we can check if the value of "hive.sql.table" is escaped or not. If not, then we can probably always just use lower case. Let me know what you think about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using lower case may not always be an option since as far as I recall there are DBMS that store unquoted identifers using different conventions (e.g., upper case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from ambiguity quoted identifiers may contain arbitrary characters inside the quotes that are not URI friendly (e.g., ?%+-/&$). We may have to come up with a more robust normalization strategy but this depends on how getURIForAuth is used and if we care about supporting such use-cases.
|
|
||
|
|
||
| -- Test Case-Sensitive Query Field Names | ||
| -- (Should fail in SerDe/Iterator with Column not found) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the test should fail it should be in the negative directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comments are incorrect/from an earlier version. I will remove these too
| -- Cleanup | ||
| DROP TABLE country_test; | ||
| DROP TABLE cities_test; | ||
| DROP TABLE geography_test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the impression that the framework takes care of the cleanup so these DROP statements are redundant.
| throw new IllegalArgumentException("Database type string cannot be null"); | ||
| } | ||
| // METASTORE must be handled before valueOf | ||
| if (METASTORE.name().equalsIgnoreCase(dbType)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does METASTORE need special handling?
| * @return The matching DatabaseType. | ||
| * @throws IllegalArgumentException if the dbType is null or not a valid type. | ||
| */ | ||
| public static DatabaseType from(String dbType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this method and not simply use valueOf?
| "hive.sql.jdbc.url" = "${system:hive.test.database.qdb.jdbc.url}", | ||
| "hive.sql.dbcp.username" = "${system:hive.test.database.qdb.jdbc.username}", | ||
| "hive.sql.dbcp.password" = "${system:hive.test.database.qdb.jdbc.password}", | ||
| "hive.sql.schema" = "\"WorldData\"", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the impression that we also support single quoted table properties so it may be a bit more readable to use those instead of escaping double quotes:
'hive.sql.schema'='"WorldData"'



What changes were proposed in this pull request?
GenericJdbcDatabaseAccessor.getQualifiedTableName()quote schema and table names, as the qualified name is used to run queries against different DBs.CalcitePlanner.genTableLogicalPlan()remove the double quotes (unescape identifier) fromConstants.JDBC_SCHEMAandConstants.JDBC_TABLEJdbcStorageHandler.getURIForAuth()unescapeConstants.JDBC_TABLE, as otherwise we generate the jdbc string with"DatabaseTypeWhy are the changes needed?
We see Exceptions, as shown here:
https://issues.apache.org/jira/browse/HIVE-29308
Does this PR introduce any user-facing change?
No
How was this patch tested?