Merge orange response with existing data#52

Open
EelisK wants to merge 1 commit into main from eelisk/postprocess-backfill-update

@EelisK
Collaborator

@EelisK EelisK commented Jan 18, 2021

  • Update the following items with orange preprocessing response
    • tableOrientation
    • headers
    • relation
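
A minimal sketch of the kind of merge this PR describes, assuming both the stored item and the preprocessing response are plain dictionaries and that the response uses the same three key names listed above. The function name `merge_preprocess_response` and the sample values are hypothetical and not taken from the repository.

```python
from typing import Any, Dict

# Keys named in the PR description; everything else here is an assumption.
BACKFILL_KEYS = ("tableOrientation", "headers", "relation")


def merge_preprocess_response(existing: Dict[str, Any],
                              response: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `existing` with the backfill keys taken from `response`.

    Keys missing from the response are left untouched, so a partial
    response never erases stored data.
    """
    merged = dict(existing)
    for key in BACKFILL_KEYS:
        if key in response:
            merged[key] = response[key]
    return merged


# Hypothetical sample values, for illustration only.
stored = {"title": "NHL seasons", "headers": [], "relation": []}
response = {"tableOrientation": "HORIZONTAL", "headers": [0], "unrelated": 1}
merged = merge_preprocess_response(stored, response)
```

Copying only the listed keys keeps unrelated response fields out of the stored item, and working on a copy leaves the original untouched if the update later fails.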

NOTE: API is acting up, and it's difficult to decipher what it can/should return:

ERROR: Failed to update document {'code': 1, 'message': 'Internal error', 'description': "apigee - CatchCommonErrors exception: TypeError: Can't use 'in' on a non-object."}...

Comment thread postprocessing/models.py
Comment thread postprocessing/postprocess.py
@EelisK EelisK marked this pull request as ready for review January 20, 2021 12:28
@Yansera
Collaborator

Yansera commented Jan 20, 2021

Hello, can you describe in detail how you triggered this problem and how you use this API?
I will pass this information on accurately. Thank you.

@EelisK
Collaborator Author

EelisK commented Jan 20, 2021

Hey @Yansera

The details of how the API is used are in annotation.py. From the behaviour I've observed, the API works only intermittently, and only with a very small number of requests (e.g. 1). For example, when attempting to preprocess 10 tables, we get only one valid response, while the rest report the following error:

{
  "code": 1,
  "message": "Internal error",
  "description": "apigee - CatchCommonErrors exception: TypeError: Can't use 'in' on a non-object."
}

I actually observed the same behaviour with just two concurrent requests.

The details of how the pipeline works:

  1. Crawl, parse, and send results to kafka
  2. Consume message and send request to /preprocess
  3. (wait for an undefined amount of time)
  4. Execute post processing, which includes fetching results from /preprocess/:task_id/result for task_ids that have SUCCESS status → mostly internal error responses
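
Step 4 can be sketched as follows, assuming (as the error payload above suggests) that a failed result is a dictionary containing a `code` key while a successful result is not. `split_results` is a hypothetical helper, not part of the repository, and the sample successful result is invented for illustration.

```python
from typing import Any, Dict, List, Tuple

Result = Dict[str, Any]


def split_results(results: List[Result]) -> Tuple[List[Result], List[Result]]:
    """Separate successful results from error payloads."""
    ok = [r for r in results if "code" not in r]
    failed = [r for r in results if "code" in r]
    return ok, failed


sample = [
    {"headers": [0]},  # hypothetical successful result
    {"code": 1, "message": "Internal error"},
]
ok, failed = split_results(sample)
print(f"{len(ok)} succeeded, {len(failed)} failed")
```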

Hope this helps

@Yansera
Collaborator

Yansera commented Jan 21, 2021

Hello @EelisK,
Thanks for your response.
Could you please tell me which code implements the pipeline (the address of the pipeline) and which ten tables you processed? :P

@EelisK
Collaborator Author

EelisK commented Jan 21, 2021

@Yansera This entire repository is the pipeline, but specifically:

Honestly, this seems to be the case with every table that can be crawled. Since the pipeline is quite big, we can demonstrate this behaviour with our test fixtures:

import os
import time
from dataclasses import asdict
from postprocessing.annotation import TableAnnotationAPIClient
from core.spiders.fixtures.wiki_nhl import wiki_nhl

os.environ['ORANGE_CLIENT_ID'] = "<client id>"
os.environ['ORANGE_CLIENT_SECRET'] = "<client secret>"

# use test fixture as sample table
_, _, data = wiki_nhl()
data = data[0]

NUMBER_OF_QUERIES = 2

client = TableAnnotationAPIClient()
task_ids = [client.preprocess(asdict(data)) for _ in range(NUMBER_OF_QUERIES)]

print("Submitted tasks")

# Wait for processing to finish: poll until every task reports SUCCESS
while True:
    time.sleep(2)
    task_statuses = [
        client.get_preprocess_task_status(task['task_id']) for task in task_ids]
    if all(status['task_status'] == 'SUCCESS' for status in task_statuses):
        break
    print("Sleeping...")


print("Getting results...")

for task in task_ids:
    response = client.get_preprocess_task_result(task['task_id'])
    if 'code' in response:
        print(response)
    else:
        print("Success")

output:

Submitted tasks
Sleeping...
Sleeping...
Getting results...
Success
{'code': 1, 'message': 'Internal error', 'description': "apigee - CatchCommonErrors exception: TypeError: Can't use 'in' on a non-object."}

@rtroncy
Member

rtroncy commented Jan 21, 2021

The problem has been identified by Orange. In short, the parallelization of requests is buggy. The workaround, for now, is to submit just one request at a time and wait for the full results before submitting another table to analyze.
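
A minimal sketch of that workaround, assuming a client that exposes the same three methods used in the reproduction script earlier in this thread (`preprocess`, `get_preprocess_task_status`, `get_preprocess_task_result`) and the `SUCCESS` status seen there. The function name, the timeout, and the stub client are hypothetical. Other status values the API may return are not known here, so the loop simply gives up after a deadline.

```python
import time
from typing import Any, Dict, Iterable, List


def preprocess_sequentially(client: Any,
                            tables: Iterable[Dict[str, Any]],
                            poll_interval: float = 2.0,
                            timeout: float = 300.0) -> List[Dict[str, Any]]:
    """Submit one table at a time and fetch its result before the next."""
    results = []
    for table in tables:
        task_id = client.preprocess(table)["task_id"]
        deadline = time.monotonic() + timeout
        # Poll until this task succeeds; never start another one meanwhile.
        while client.get_preprocess_task_status(task_id)["task_status"] != "SUCCESS":
            if time.monotonic() > deadline:
                raise TimeoutError(f"task {task_id} did not finish in time")
            time.sleep(poll_interval)
        results.append(client.get_preprocess_task_result(task_id))
    return results


class StubClient:
    """Hypothetical in-memory stand-in for the real API client."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def preprocess(self, table):
        self.calls.append("submit")
        return {"task_id": str(len(self.calls))}

    def get_preprocess_task_status(self, task_id):
        return {"task_status": "SUCCESS"}

    def get_preprocess_task_result(self, task_id):
        self.calls.append("result")
        return {"task_id": task_id}


stub = StubClient()
results = preprocess_sequentially(stub, [{"a": 1}, {"b": 2}], poll_interval=0)
```

With the stub, each submission is followed by its result fetch before the next submission starts. Against the real API this serialization trades throughput for reliability.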

@EelisK
Collaborator Author

EelisK commented Feb 5, 2021

NOTE: We have decided to leave this PR open to avoid introducing a brittle/inefficient workaround for the problem. Once the API is fixed, it should be fine to merge this.

4 participants