The TSPM-DB library takes a deliberate approach to integrating AI assistance in research software development. AI tools were used to accelerate the implementation of routine data operations, while the core algorithm and API design were developed and validated by domain experts to ensure scientific accuracy and usability.
The Transitive Sequential Pattern Mining (TSPM) algorithm implementation was designed and hand-built by Nick Benik (Neomancy Inc / Harvard Medical School). He also designed the library's public API, with careful attention to usability for bioinformaticians and data scientists working with electronic health records (EHR). The algorithm implementation was hand-optimized for both correctness and performance when processing large-scale EHR datasets.
AI was used to implement the CRUD (Create, Read, Update, Delete) and data retrieval operations that consume the results generated by the hand-built TSPM algorithm, along with supporting material. These include:
- Patient and observation code lookup and translation
- Subpopulation management (creation, membership, querying)
- Frequency filtering and aggregation from pre-computed results
- DataFrame and iterator return format conversions
- Documentation and example notebooks
This division of labor allowed the team to maintain scientific rigor where it matters most, in the algorithm itself, while using automation for the supporting infrastructure.
The implementation was independently reviewed for accuracy by:
- J.H., PhD (Visiting Researcher at InstitutionXYZ)
- H.E., PhD (InstitutionXYZ)
This external review checked that the algorithm implementation faithfully reproduces the TSPM methodology as described in the original research and that the library behaves correctly across a range of datasets and use cases.
TSPM-DB demonstrates that AI can be a valuable tool in accelerating research software development when used thoughtfully: automating routine tasks while preserving human expertise and oversight for the components that directly impact scientific validity.