Merge orange response with existing data #52
|
Hello, can you describe in detail how you triggered this problem and how you use this API? |
|
Hey @Yansera The details of how the API is used are in annotation.py. From the behaviour I've observed, the API seems to work only intermittently and with a very limited number of requests (e.g. 1). For example, when attempting to preprocess 10 tables, we get only one valid response, while the rest report the following error:
{
  "code": 1,
  "message": "Internal error",
  "description": "apigee - CatchCommonErrors exception: TypeError: Can't use 'in' on a non-object."
}
I actually observed the same behaviour with just two concurrent requests. The details of how the pipeline works:
Hope this helps |
|
Hello @EelisK, |
|
@Yansera This entire repository is the pipeline, but specifically:
Honestly, it looks to be the case with all the tables that can be crawled. Since the pipeline is quite big, we can demonstrate this behaviour with our test fixtures:

import os
import time
from dataclasses import asdict

from postprocessing.annotation import TableAnnotationAPIClient
from core.spiders.fixtures.wiki_nhl import wiki_nhl

os.environ['ORANGE_CLIENT_ID'] = "<client id>"
os.environ['ORANGE_CLIENT_SECRET'] = "<client secret>"

# Use a test fixture as the sample table
_, _, data = wiki_nhl()
data = data[0]

NUMBER_OF_QUERIES = 2
client = TableAnnotationAPIClient()

# Submit the same table several times without waiting in between
task_ids = [client.preprocess(asdict(data)) for _ in range(NUMBER_OF_QUERIES)]
print("Submitted tasks")

# Wait for processing to finish
still_processing = True
while still_processing:
    time.sleep(2)
    task_statuses = [
        client.get_preprocess_task_status(task['task_id']) for task in task_ids
    ]
    if all(status['task_status'] == 'SUCCESS' for status in task_statuses):
        still_processing = False
    else:
        print("Sleeping...")

print("Getting results...")
for task in task_ids:
    response = client.get_preprocess_task_result(task['task_id'])
    # Error payloads carry a top-level 'code' key
    if 'code' in response:
        print(response)
    else:
        print("Success")

Output: |
|
The problem has been identified by Orange. In short, the parallel handling of requests is buggy. The workaround, for now, is to submit just one request at a time and wait for the full results before submitting another table to analyze. |
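For reference, the one-request-at-a-time workaround could look roughly like the sketch below. It is not code from this PR: the function name `annotate_sequentially` and the `poll_interval` and `timeout` parameters are hypothetical. The client methods and the `task_id` / `task_status` / `code` keys are taken from the reproduction script earlier in this thread, and treating every status other than `SUCCESS` as "still running" is an assumption carried over from that script.

```python
import time


def annotate_sequentially(client, tables, poll_interval=2.0, timeout=300.0):
    """Submit tables one at a time, waiting for each result before the next.

    `client` is assumed to expose the three methods used in the
    reproduction script: preprocess, get_preprocess_task_status and
    get_preprocess_task_result.
    """
    results = []
    for table in tables:
        # Only one task is ever outstanding, so the API never sees
        # concurrent requests from this caller.
        task_id = client.preprocess(table)["task_id"]
        deadline = time.monotonic() + timeout

        # Poll until this single task reports SUCCESS or the timeout expires.
        while client.get_preprocess_task_status(task_id)["task_status"] != "SUCCESS":
            if time.monotonic() >= deadline:
                raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
            time.sleep(poll_interval)

        response = client.get_preprocess_task_result(task_id)
        # Error payloads seen in this thread carry a top-level "code" key.
        if "code" in response:
            raise RuntimeError(f"annotation failed: {response}")
        results.append(response)
    return results
```

With the real client this would be called as `annotate_sequentially(TableAnnotationAPIClient(), [asdict(t) for t in tables])`. The timeout is there because the polling loop would otherwise never exit if a task ends in a non-`SUCCESS` state.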
|
NOTE: We have decided to leave this PR open to avoid introducing a brittle/inefficient workaround for the problem. Once the API is fixed, it should be fine to merge this. |
tableOrientation
headers
relation
NOTE: API is acting up, and it's difficult to decipher what it can/should return: