|
1 | | -"""Deep Learning Model Training with LSTM |
2 | | -
|
3 | | -This Python script is used for training a deep learning model using |
4 | | -Long Short-Term Memory (LSTM) networks. |
5 | | -
|
6 | | -The script starts by importing necessary libraries. These include `sys` |
7 | | -for interacting with the system, `pandas` for data manipulation, `tensorflow` |
8 | | -for building and training the model, `sklearn` for splitting the dataset and |
9 | | -calculating metrics, and `numpy` for numerical operations. |
10 | | -
|
11 | | -The script expects two command-line arguments: the input file and the output directory. |
12 | | -If these are not provided, the script will exit with a usage message. |
13 | | -
|
14 | | -The input file is expected to be a CSV file, which is loaded into a pandas DataFrame. |
15 | | -The script assumes that this DataFrame has a column named "Query" containing the text |
16 | | -data to be processed, and a column named "Label" containing the target labels. |
17 | | -
|
18 | | -The text data is then tokenized using the `Tokenizer` class from |
19 | | -`tensorflow.keras.preprocessing.text` (TF/IDF). The tokenizer is fit on the text data |
20 | | -and then used to convert the text into sequences of integers. The sequences are then |
21 | | -padded to a maximum length of 100 using the `pad_sequences` function. |
22 | | -
|
23 | | -The data is split into a training set and a test set using the `train_test_split` function |
24 | | -from `sklearn.model_selection`. The split is stratified, meaning that the distribution of |
25 | | -labels in the training and test sets should be similar. |
26 | | -
|
27 | | -A Sequential model is created using the `Sequential` class from `tensorflow.keras.models`. |
28 | | -The model consists of an Embedding layer, an LSTM layer, and a Dense layer. The model is |
29 | | -compiled with the Adam optimizer and binary cross-entropy loss function, and it is trained |
30 | | -on the training data. |
31 | | -
|
32 | | -After training, the model is used to predict the labels of the test set. The predictions |
33 | | -are then compared with the true labels to calculate various performance metrics, including |
34 | | -accuracy, recall, precision, F1 score, specificity, and ROC. These metrics are printed to |
35 | | -the console. |
36 | | -
|
37 | | -Finally, the trained model is saved in the SavedModel format to the output directory |
38 | | -specified by the second command-line argument. |
39 | | -""" |
40 | | - |
41 | 1 | import sys |
42 | 2 | import pandas as pd |
43 | 3 | from tensorflow.keras.preprocessing.text import Tokenizer |
|
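The removed docstring lists accuracy, recall, precision, F1 score, and specificity among the reported metrics. As a minimal sketch of how those values relate to the confusion-matrix counts, assuming binary 0/1 labels and predictions already thresholded to 0/1: the function name `binary_metrics` and the sample data below are hypothetical and do not come from the script. ROC is left out because it needs the predicted probabilities rather than thresholded labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute threshold-based metrics from 0/1 labels and 0/1 predictions.

    Hypothetical helper for illustration; not taken from the script in this diff.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up sample data, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Specificity is computed by hand here because, to my knowledge, `sklearn.metrics` has no dedicated specificity function; with scikit-learn it is usually derived from `confusion_matrix` or as the recall of the negative class.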