Tables and Schemas

This section describes how BioDM is leveraging your Tables and Schemas in order to set up resources. It contains some useful information for developers to design their own.

It is recommended to visit the documentation of both SQLAlchemy and Marshmallow beforehand. Moreover, our example project also provides plenty of inspiration for this task.

Tables

In principle any valid SQLALchemy table is accepted by BioDM. Yet, depending how you configure it, it shall adopt varying behaviors.

SQLAlchemy 2.0 ORM API is a really nice and convenient piece of technology. However, it does not natively support trees of entities (nested dictionaries).

To palliate this problem, BioDM does a pass of parsing on schema validation results under the hood, using Tables as litterals in order to build the hirarchical tree of statements. In a second pass, it inserts the whole tree in order, ensuring integrity.

Relationship specifics

Statement building leverages tables relationships definitions, taking orientation into account.

In particular, if a relationship is one-armed (pointing in one direction only), it will not be possible to create a nested resource in the other direction.

Special columns

Some special column will yield built-in behavior.

Tracking resource submitter: submitter_username

Setting up the following foreign key, in a table will automatically populate the field with requesting user’s username creating the resource.

class MyTable(Base):
    id:          Mapped[int]   = mapped_column(Integer(), primary_key=True)
    ...
    submitter_username:  Mapped[str] = mapped_column(ForeignKey("USER.username"), nullable=False)

Note

This effectively has a similar effect as @login_required to create that resource.

Schemas

Schemas are the “i/o surface” of your app. This is where you decide what gets loaded and dumped for a specific resource.

For most cases, you may simply set fields identical to matching Table, using Marshmallow syntax, making sure to have matching names between columns.

Marshmallow also comes with some limitations, such as not being able to infer foreign key population in respect to nested entities while de-serializing.

E.g. Given the following matching table and schema:

class Dataset(Base):
    id:          Mapped[int]   = mapped_column(Integer(), primary_key=True)
    ...
    project_id:  Mapped[int]   = mapped_column(ForeignKey("PROJECT.id"), nullable=False)
    project: Mapped["Project"] = relationship(back_populates="datasets")

class DatasetSchema(Schema):
    id = Integer()
    ...
    project_id = Integer()
    project = Nested('ProjectSchema')

If you POST a new /datasets resource definition with a nested project. Upon strict validation, project_id will not be populated, which ultimately is your NOT NULL FOREIGN KEY field. Hence SQL insert statement shall raise integrity errors.

Marshmallow is offering built-ins in the form of decorators that let you tag functions attached to the Schema such as @pre_load which is a hook called before validation, that lets you manually get data if you detect it present in the dict.

This technique has two major disadvantages:

It is quite cumbersome and error prone for the developer, as for each relationship you may have to set foreign keys on either side and is as many conditions checking what is present in your input dict and whatnot.
This cannot take into account generated keys. In our example, we may be creating the project as well. Hence it will not have an id yet, thus raise a ValidationError for the dataset if we set required=True flag for project_id.

To bypass those limitations, BioDM validates incoming data using Marshmallow’s partial=True flag. Meaning that required keywords on fields are ignored during this step.

Note

While required flags are ignored at this validation step those flags are still relevant for client side validation

This yields a (List of) dictionary (of nested Dictionaries) that is sent down to a Service for statement building and insertion. The Core will use knowledge of Table relationships to infer this foreign key population and raise appropriate errors in case of truely incomplete input data or emit the right statements in order when the structure complies.

This ultimately allows for more flexibily on input such as sending a mixin of create/update of new resources via POST and limit the total number of requests to achieve the same result.

Custom Schema Component

BioDM provides a custom Schema component that may be used by importing from biodm.utils.biodm import Schema.

The use of this component is discouraged but may provides some performance improvements by turning SQLALchemy objects to transient state effectively disabling further lazy loading of nested attributes upon serializing.

This can be an alternative to circular dumping policies described below. However this approach can be quite risky and result in missing data down the line.

This component remains provided as it may be of interest for testing purposes.

Nested flags policy

Serialization is following down Nested fields. In particular that means it is important to limit the depth of data that is fetched, as it is easy to end up in infinite loops in case of circular or self referencial dependencies.

E.g.

user.py

class UserSchema(Schema):
    """Schema for Keycloak Users. id field is purposefully out as we manage it internally."""
    username = String()
    password = String(load_only=True)
    email = String()
    firstName = String()
    lastName = String()

    def dump_group(): # Delay import using a function.
        from .group import GroupSchema
        return GroupSchema(load_only=['users', 'children', 'parent'])

    groups = List(Nested(dump_group))

group.py

class GroupSchema(Schema):
    """Schema for Keycloak Groups."""
    path = String(metadata={"description": "Group name chain separated by '__'"})
    # If import order allows it: you may pass lambdas.
    users = List(Nested(lambda: UserSchema(load_only=['groups'])))
    children = List(Nested(lambda: GroupSchema(load_only=['users', 'children', 'parent'])))
    # Make sense to not create parents from children.
    parent = Nested('GroupSchema', dump_only=True)

The example above is demonstrating how to allow loading sensible relationships while limiting dumping depth to one. In other words, to have a resource output its attached related resources, with their own fields but not their subsequent related resources.

This is the highly recommended approach, both to avoid critical errors while using BioDM and follow RESTful principles.

Warning

Marshmallow provides other primitives such as only and exclude that can be used to do this restriction.

However, be careful with your dumping configuration in order not to impede a Schema’s loading capabilites of essential fields for creating a new resource.

Although you may allow more depth at places depending on your use case, always make sure that the resulting tree do not have cycles.

Alternatively using the custom schema component will naturally occur in a 2 level pruned tree.

Note

Setting “metadata.description” like for path in our example example above, is used for automatic apispec docstrings generation.

Duplicate Schemas

While setting your nested flags policy, you may get a notice from apispec in the form or a UserWarning that multiple schemas got resolved to the same name. In that case it will generate other schemas with incrementing trailing numbers in the name (Group1, User1 …).

It is a normal behavior as per the OpenAPI specification, partials schemas are considered others. However, it may not help your client to automatically discover your app, with careful configuration it is possible to avoid the case.

If that proves to be necessary, a future version of BioDM may implement a name resolver.