app/Dockerfile (+31 −5)

```diff
@@ -5,6 +5,7 @@ FROM rabidsheep55/python-base-eval-layer
 WORKDIR /app
 
 RUN mkdir /usr/share/nltk_data
+RUN mkdir -p /usr/share/nltk_data/corpora /usr/share/nltk_data/models /usr/share/nltk_data/tokenizers
 
 ARG NLTK_DATA=/usr/share/nltk_data
 
@@ -14,12 +15,37 @@ COPY requirements.txt .
 COPY brown_length .
 COPY word_freqs .
 COPY w2v .
+RUN yum install -y wget unzip
 RUN pip3 install -r requirements.txt
-RUN python -m nltk.downloader wordnet
-RUN python -m nltk.downloader word2vec_sample
-RUN python -m nltk.downloader brown
-RUN python -m nltk.downloader stopwords
-RUN python -m nltk.downloader punkt
+
+# Download the NLTK data files
+RUN wget -O /usr/share/nltk_data/corpora/wordnet.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip
+RUN wget -O /usr/share/nltk_data/models/word2vec_sample.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/models/word2vec_sample.zip
+RUN wget -O /usr/share/nltk_data/corpora/brown.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip
+RUN wget -O /usr/share/nltk_data/corpora/stopwords.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
+RUN wget -O /usr/share/nltk_data/tokenizers/punkt.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
+RUN wget -O /usr/share/nltk_data/tokenizers/punkt_tab.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip
+
+# Unzip the downloaded files into the subfolders corresponding to NLTK's expected layout
+RUN unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/models/word2vec_sample.zip -d /usr/share/nltk_data/models/
+RUN unzip /usr/share/nltk_data/corpora/brown.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/corpora/stopwords.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/tokenizers/punkt.zip -d /usr/share/nltk_data/tokenizers/
+RUN unzip /usr/share/nltk_data/tokenizers/punkt_tab.zip -d /usr/share/nltk_data/tokenizers/
+
+# Clean up zip files to reduce image size
+RUN rm /usr/share/nltk_data/corpora/*.zip
+RUN rm /usr/share/nltk_data/models/*.zip
+RUN rm /usr/share/nltk_data/tokenizers/*.zip
+
+# Warning: the nltk.downloader commands sometimes download corrupted zips, so it is better to wget each package from the main site
```
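The wget and unzip pairs in the diff all follow one pattern: each NLTK package lives at a fixed URL under the `nltk_data` gh-pages tree, keyed by its category (`corpora`, `models`, or `tokenizers`) and name. A minimal sketch of that pattern, with `nltk_url` as a hypothetical helper name not present in the Dockerfile:

```shell
# Hypothetical helper mirroring the wget lines above: build the download
# URL for an NLTK package from its category and name.
nltk_url() {
    # $1 = category (corpora | models | tokenizers), $2 = package name
    echo "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/$1/$2.zip"
}

# Print the URL for each package the Dockerfile installs.
for pkg in corpora/wordnet models/word2vec_sample corpora/brown \
           corpora/stopwords tokenizers/punkt tokenizers/punkt_tab; do
    nltk_url "${pkg%/*}" "${pkg#*/}"
done
```

Deriving the URLs this way would also make it easy to collapse the six nearly identical `RUN wget` instructions into a single loop, at the cost of a less granular Docker build cache.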
app/docs/dev.md (+9 −0)

```diff
@@ -37,6 +37,15 @@ Otherwise, it will have the additional fields:
 
 If the method is w2v, it means the two texts were found to be similar. Otherwise, a BOW vector similarity check is performed in order to identify the most likely word that caused the texts to be found dissimilar.
 
+## Initial Setup
+Follow the Docker image instructions and run
+`docker build -t <image_name> .` in app/
+
+Otherwise, if setting up locally:
+1. create a venv
+2. in the venv, run `pip install -r app/requirements.txt`
+3. if errors are encountered with NLTK packages, follow the `testing_nltk.py` instructions
+
 ## Examples
 *List of example inputs and outputs for this function, each under a different sub-heading*
```
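The local-setup steps added to dev.md can be sketched as a small script (a sketch only, assuming a POSIX shell and a Python 3 with the `venv` module; `.venv` is an arbitrary directory name, and the install step is guarded so the script degrades gracefully outside the repo):

```shell
#!/bin/sh
set -e

VENV_DIR="${VENV_DIR:-.venv}"

# Step 1: create a virtual environment
python3 -m venv "$VENV_DIR"

# Step 2: install dependencies into it (skipped when the requirements
# file is absent, e.g. when run outside the repository root)
if [ -f app/requirements.txt ]; then
    "$VENV_DIR/bin/pip" install -r app/requirements.txt
fi

echo "venv ready at $VENV_DIR"
```

For step 3, the diff defers to `testing_nltk.py` in the repo; its contents are not shown here, so no sketch is attempted for it.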