Skip to content

Query on nested struct field with PyIceberg? #953

@cfrancois7

Description

@cfrancois7

Question

I'm looking for a tutorial to make a query on one subfield of a struct field.
I scrolled all internet but failed to find a way to do it simply with pyiceberg.

To make it concret, for instance how to get the row with "employment.status = 'Employed'" :

[{'id': 1,
  'name': 'Alice',
  'age': 28,
  'address': {'street': '123 Maple St',
   'city': 'Springfield',
   'postal_code': '12345'},
  'contact': {'email': 'alice@example.com', 'phone': '555-1234'},
  'employment': {'status': 'Employed',
   'position': 'Software Engineer',
   'company': {'name': 'Tech Corp', 'location': 'Silicon Valley'}},
  'preferences': {'newsletter': True,
   'notifications': {'email': True, 'sms': False}}},
 {'id': 2,
  'name': 'Bob',
  'age': 35,
  'address': {'street': '456 Oak St',
   'city': 'Metropolis',
   'postal_code': '67890'},
  'contact': {'email': 'bob@example.com', 'phone': '555-5678'},
  'employment': {'status': 'Self-employed',
   'position': 'Consultant',
   'company': {'name': 'Freelance', 'location': 'Remote'}},
  'preferences': {'newsletter': False,
   'notifications': {'email': True, 'sms': True}}}]

With the following schema:

 import pyarrow as pa
 
 schema = pa.schema([
  ('id', pa.int32()),
  ('name', pa.string()),
  ('age', pa.int32()),
  ('address', pa.struct([
      ('street', pa.string()),
      ('city', pa.string()),
      ('postal_code', pa.string())
  ])),
  ('contact', pa.struct([
      ('email', pa.string()),
      ('phone', pa.string())
  ])),
  ('employment', pa.struct([
      pa.field('status', pa.string(), nullable=True),
      pa.field('position', pa.string(), nullable=True),
      pa.field('company', pa.struct([
          ('name', pa.string()),
          ('location', pa.string())
      ]), nullable=True)
  ])),
  ('preferences', pa.struct([
      ('newsletter', pa.bool_()),
      ('notifications', pa.struct([
          ('email', pa.bool_()),
          ('sms', pa.bool_())
      ]))
  ]))
])

I tried this kind of query, but without success:

row_filter = "employment.status = 'Employed'"

table.scan(
    row_filter=row_filter,
    selected_fields=["age", "employment", 'contact.email']
).to_pandas()

The command raises the error:

ValueError: Could not find field with name status, case_sensitive=True

The backend is supported by SQLite.

versions:

$ pip list | grep 'iceberg\|arrow\|sqlite'
arrow                     1.3.0
pyarrow                   15.0.2
pyiceberg                 0.6.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions