Commit f48c75f

Merge pull request #5 from lambda-feedback/slm

tr154-Updated Dockerfile(for nltk corruption error)

2 parents cb3deb8 + d8c233f

File tree

6 files changed: +65 −7 lines

.gitignore

Lines changed: 3 additions & 0 deletions

@@ -127,3 +127,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# MacOS files
+.DS_Store
app/Dockerfile

Lines changed: 31 additions & 5 deletions

@@ -5,6 +5,7 @@ FROM rabidsheep55/python-base-eval-layer
 WORKDIR /app
 
 RUN mkdir /usr/share/nltk_data
+RUN mkdir -p /usr/share/nltk_data/corpora /usr/share/nltk_data/models /usr/share/nltk_data/tokenizers
 
 ARG NLTK_DATA=/usr/share/nltk_data
 
@@ -14,12 +15,37 @@ COPY requirements.txt .
 COPY brown_length .
 COPY word_freqs .
 COPY w2v .
+RUN yum install -y wget unzip
 RUN pip3 install -r requirements.txt
-RUN python -m nltk.downloader wordnet
-RUN python -m nltk.downloader word2vec_sample
-RUN python -m nltk.downloader brown
-RUN python -m nltk.downloader stopwords
-RUN python -m nltk.downloader punkt
+
+# Download NLTK data files
+RUN wget -O /usr/share/nltk_data/corpora/wordnet.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip
+RUN wget -O /usr/share/nltk_data/models/word2vec_sample.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/models/word2vec_sample.zip
+RUN wget -O /usr/share/nltk_data/corpora/brown.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip
+RUN wget -O /usr/share/nltk_data/corpora/stopwords.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip
+RUN wget -O /usr/share/nltk_data/tokenizers/punkt.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
+RUN wget -O /usr/share/nltk_data/tokenizers/punkt_tab.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip
+
+# Unzip the downloaded files into the correct subfolders corresponding to NLTK requirements
+RUN unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/models/word2vec_sample.zip -d /usr/share/nltk_data/models/
+RUN unzip /usr/share/nltk_data/corpora/brown.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/corpora/stopwords.zip -d /usr/share/nltk_data/corpora/
+RUN unzip /usr/share/nltk_data/tokenizers/punkt.zip -d /usr/share/nltk_data/tokenizers/
+RUN unzip /usr/share/nltk_data/tokenizers/punkt_tab.zip -d /usr/share/nltk_data/tokenizers/
+
+# Clean up zip files to reduce image size
+RUN rm /usr/share/nltk_data/corpora/*.zip
+RUN rm /usr/share/nltk_data/models/*.zip
+RUN rm /usr/share/nltk_data/tokenizers/*.zip
+
+# Warning: these commands sometimes download corrupted zips, so it is better to wget each package from the main site
+# RUN python -m nltk.downloader wordnet
+# RUN python -m nltk.downloader word2vec_sample
+# RUN python -m nltk.downloader brown
+# RUN python -m nltk.downloader stopwords
+# RUN python -m nltk.downloader punkt
+# RUN python -m nltk.downloader punkt_tab
 
 # Copy the evaluation and testing scripts
 COPY brown_length ./app/
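The wget/unzip/rm sequence in the Dockerfile can be mirrored in Python to reproduce the same nltk_data layout locally. This is a sketch, not part of the repo: the helper names (`package_url`, `fetch_package`) are hypothetical; the base URL and directory layout are taken from the diff above.

```python
import os
import urllib.request
import zipfile

# Base URL and data root as used in the Dockerfile above
BASE_URL = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages"
NLTK_DATA = "/usr/share/nltk_data"


def package_url(kind: str, name: str) -> str:
    """URL of a packaged NLTK dataset, e.g. corpora/wordnet."""
    return f"{BASE_URL}/{kind}/{name}.zip"


def fetch_package(kind: str, name: str, root: str = NLTK_DATA) -> None:
    """Download <kind>/<name>.zip, extract it next to itself, remove the zip.

    Mirrors the RUN wget / RUN unzip / RUN rm steps in the Dockerfile.
    zipfile raises BadZipFile on a corrupted download, which is the
    failure mode this commit works around.
    """
    target_dir = os.path.join(root, kind)
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, f"{name}.zip")
    urllib.request.urlretrieve(package_url(kind, name), zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    os.remove(zip_path)


# Usage (network access required):
# fetch_package("corpora", "stopwords", root=os.path.expanduser("~/nltk_data"))
```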

app/docs/dev.md

Lines changed: 9 additions & 0 deletions

@@ -37,6 +37,15 @@ Otherwise, it will have the additional fields:
 
 If the method is w2v, it means the two texts were found to be similar. Otherwise, a BOW vector similarity check is performed in order to identify the most likely word that caused the texts to be found dissimilar.
 
+## Initial Setup
+Follow the Docker Image instructions and run
+`docker build -t <image_name> .` in app/
+
+Otherwise, to set up locally:
+1. create a venv
+2. in the venv, run `pip install -r app/requirements.txt`
+3. if errors are encountered with nltk packages, follow the `testing_nltk.py` instructions
+
 ## Examples
 *List of example inputs and outputs for this function, each under a different sub-heading*
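The BOW fallback that dev.md describes can be illustrated generically. The sketch below is a plain bag-of-words cosine similarity, not the repo's actual implementation; the function name and tokenization are assumptions for illustration only.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between whitespace-token count vectors (BOW)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Counter returns 0 for missing tokens, so iterating over one side suffices
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Per-word comparisons of such vectors are one way to locate the token most responsible for two texts being judged dissimilar.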

app/evaluation_tests.py

Lines changed: 11 additions & 0 deletions

@@ -133,5 +133,16 @@ def test_navier_stokes_equation(self):
         result = evaluation_function(response, answer, params)
         self.assertEqual(result.get("is_correct"), True, msg=f'Response: {response}')
 
+    def test_negation(self):
+        answer, params = 'not light blue', dict()
+        correct_responses = [
+            'bright blue',
+            'light blue'
+        ]
+
+        for response in correct_responses:
+            result = evaluation_function(response, answer, params)
+            self.assertEqual(result.get("is_correct"), True, msg=f'Response: {response}')
+
 if __name__ == "__main__":
     unittest.main()

app/requirements.txt

Lines changed: 6 additions & 2 deletions

@@ -1,4 +1,8 @@
 numpy
-nltk
+nltk==3.8.1
 gensim
-matplotlib
+matplotlib
+
+# To run on cli: /Applications/Python\ 3.11/Install\ Certificates.command
+# If SSL certs fail on Mac -> the command above runs pip install --upgrade certifi -> then calling nltk.download works
+certifi
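The certifi comment can be made concrete in code. A minimal sketch, assuming certifi is installed (it is pinned in requirements.txt above): point Python's default HTTPS context at certifi's CA bundle before calling nltk.download(). The helper name is hypothetical.

```python
import ssl


def https_context_with_cafile(cafile):
    """Return a factory usable as ssl._create_default_https_context that
    verifies certificates against the given CA bundle (e.g. certifi.where())."""

    def _factory(*args, **kwargs):
        return ssl.create_default_context(cafile=cafile)

    return _factory


# Usage (assumption: run this before nltk.download() on a Mac with SSL errors):
# import certifi, nltk
# ssl._create_default_https_context = https_context_with_cafile(certifi.where())
# nltk.download("punkt")
```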

app/testing_nltk.py

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+import nltk
+print(nltk.data.path)
+nltk.download()
+# If zip packages cannot be unzipped, or the downloader above errors, download the packages manually from https://www.nltk.org/nltk_data/
+# Use the print above to check where the zip packages should go, in folder /Users/<username>/nltk_data/...
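To verify that manually downloaded packages ended up in the right place, a small helper can scan an nltk_data root for everything this commit requires. This is a sketch, not part of the repo: the function name is hypothetical, and the package list is taken from the Dockerfile above.

```python
import os

# Packages this commit installs, keyed by nltk_data subfolder
REQUIRED = {
    "corpora": ["wordnet", "brown", "stopwords"],
    "models": ["word2vec_sample"],
    "tokenizers": ["punkt", "punkt_tab"],
}


def missing_packages(root):
    """Return 'kind/name' entries not yet unzipped under the given root."""
    missing = []
    for kind, names in REQUIRED.items():
        for name in names:
            # An unzipped package is a directory, e.g. <root>/corpora/wordnet/
            if not os.path.isdir(os.path.join(root, kind, name)):
                missing.append(f"{kind}/{name}")
    return missing


# Usage: run against each entry of nltk.data.path until one comes back empty.
```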
