Add an option to provide parameters by JSON file for hive metastore migration utility#175
Add an option to provide parameters by JSON file for hive metastore migration utility#175manabery wants to merge 6 commits intoaws-samples:masterfrom
Conversation
* Resolved the warning message with yaml.load() : https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
…on.py can be loaded from Glue 5.0 job, which doesn't have PyYAML by default
| - Set `--direction` to `to_metastore`. | ||
| - Set `--mode` to `to_metastore`. |
There was a problem hiding this comment.
Just to confirm, there was not the argument --direction and we are going to update README to match the implementation, correct?
There was a problem hiding this comment.
Yes, --direction option has not existed in the code and it corrected README.
| options = get_options(parser, args) | ||
|
|
||
| if options["mode"] == FROM_METASTORE: | ||
| if options.get("config_file") is not None: | ||
| # parse json config file if provided | ||
| config_file_path = options["config_file"] | ||
| logger.info(f"config_file provided. Parsing arguments from {config_file_path}") | ||
| with open(config_file_path, 'r') as json_file_stream: | ||
| config_options = json.load(json_file_stream) | ||
|
|
||
| # merge options. command line options are prioritized. | ||
| for key in config_options: | ||
| if not options.get(key): | ||
| options[key] = config_options[key] | ||
| elif options[key] is None: | ||
| options[key] = config_options[key] |
There was a problem hiding this comment.
If we do like this, for example, if both config_file and mode are provided, the mode is replaced by the mode specified in the config file. Is my understanding correct? Is this intended behavior? To me, if we explicitly set arguments, it's natural to prioritize the explicit argument over the config file.
In any of those decisions, we will need to explain how it is prioritized in README.
There was a problem hiding this comment.
No, if a same argument is provided from both on command line and via config_file, the value in the config file is ignored. options variable is created at line 1600, and command line arguments are read at the time. Arguments in the config files are added into options variable only when corresponding arguments are not provided from command line. This is a test case 3 scenario.
I will update README to explain this behavior.
…in the config file and command line
This is a PR to merge #13
Changes made in #13
from_metastoreandto_metastoreoptions. This prevents JDBC password exposure on spark-submit command.Changes added in this PR
Tests
Testing with a config file
Case 1: run with expected config file
{ "mode": "from-metastore", "jdbc_url": "jdbc:mysql://**:3306", "jdbc_username": "hive", "jdbc_password": "password", "database_prefix": "dbpre_", "table_prefix": "tablepre_", "output_path": "s3://path" }Result: Succeeded and exported metastore to S3.
Case 2: run with a config file and lack mandatory parameter
jdbc_password){ "mode": "from-metastore", "jdbc_url": "jdbc:mysql://**:3306", "jdbc_username": "hive", "output_path": "s3://path" }Result: failed the command with the following message as expected.
Case 3: run with a config file and command line parameter
{ "mode": "from-metastore", "jdbc_url": "jdbc:mysql://**:3306", "jdbc_username": "hive", "jdbc_password": "password", "database_prefix": "dbpre_", "table_prefix": "tablepre_", "output_path": "s3://path" }--database-prefixResult: successfully metastore was exported, and --database-prefix was overriden by command line config.
Testing without a config file (regression)
case 4: run with command line parameters
Result: the job succeeded.
case 5: run with from-s3 mode
Result: Succeeded (importing updated hive_metastore_migration.py didn't affect the behavior)
case 6: run with to-s3 mode
Result: Succeeded (importing updated hive_metastore_migration.py didn't affect the behavior)