Designing a Database to Support a Chatbot Interview App – Part 2

by DL Keeshin


September 2024


[ERD diagram: images/kds_discovery_erd_20240903.png]


In my previous blog post, I laid out the logical data design for the Chatbot interview data discovery app. Since then, I have made several important changes and have continued building out the design. Here is a summary of what was done:

  1. Two more schemas were added:
    • load (blue in the above diagram)
    • stage (orange in the above diagram)
  2. The schema “discovery” was renamed to “interview.” While the entire process is data “discovery,” these tables store interview respondents and their answers, so “interview” more accurately reflects their purpose.

So, right now, the design contains the following schemas:

Schema      Color    Purpose
admin       red      Administrative data
analysis    purple   Analysis results
client      green    Client-related data
interview   rose     Interview respondents and answer data
load        blue     Initial data loads from external sources
model       yellow   Model questions and related metadata
reference   grey     Reference data such as “location”
stage       orange   Data processed from the initial “load” schema

You can now see the preliminary “physical” database structure on GitHub.
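
At the schema level, the DDL is just one statement per schema. As a minimal sketch (assuming a PostgreSQL target; the actual scripts live in the GitHub repository noted above):

CREATE SCHEMA admin;      -- administrative data
CREATE SCHEMA analysis;   -- analysis results
CREATE SCHEMA client;     -- client-related data
CREATE SCHEMA interview;  -- interview respondents and answers
CREATE SCHEMA load;       -- raw loads from external sources
CREATE SCHEMA model;      -- model questions and related metadata
CREATE SCHEMA reference;  -- reference data like "location"
CREATE SCHEMA stage;      -- typed, validated data from the load schema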

Keep in Mind

  1. The “load” schema receives client or model data from external sources, such as client data gathered through a web front end or model data generated by a large language model (LLM) like ChatGPT.

    • Columns in the “load” schema tables are almost entirely character data types; the only exceptions are a few system columns such as “create date,” which are populated automatically with default values. Loading everything as character data simplifies the process and eliminates most type-conversion errors, and as long as the columns are defined wide enough, loads stay fast. No primary or foreign keys are defined in this schema. The idea is to easily “just get the data in.”
    • Besides simplicity, this design is scalable. Because the load data carries no relational constraints, it could also be stored outside a relational database entirely. A NoSQL store like AWS DynamoDB or Azure Cosmos DB, or a serverless analytics service like Google BigQuery, could work. Those options would become a real consideration, and likely a requirement, if the volume of incoming data were large, say thousands of rows per second.
  2. In the “stage” schema, data is assigned specific data types such as date, date-time, or numeric values, and any conversion errors are written to stage error tables. It is also a good place to generate the natural keys that carry through to the next step in the data transformation. A minimal sketch of the load-to-stage pattern follows this list.
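
To make the load-to-stage pattern concrete, here is a small sketch, again assuming PostgreSQL. The table and column names (load.client_raw, stage.client, and so on) are illustrative placeholders, not the actual DDL on GitHub.

-- Load table: all character columns, no keys, system columns defaulted.
-- The goal is simply to get the data in.
CREATE TABLE load.client_raw (
    client_name    text,
    contact_email  text,
    signup_date    text,                               -- still text at this point
    employee_count text,                               -- still text at this point
    source_system  text,
    create_date    timestamptz NOT NULL DEFAULT now()
);

-- Stage table: proper data types plus a natural key derived from the data itself.
CREATE TABLE stage.client (
    client_natural_key text NOT NULL,                  -- e.g. lower(contact_email)
    client_name        text NOT NULL,
    contact_email      text NOT NULL,
    signup_date        date,
    employee_count     integer,
    create_date        timestamptz NOT NULL DEFAULT now()
);

-- Rows that fail type conversion can be written here instead of stopping the load.
CREATE TABLE stage.client_error (
    client_name    text,
    contact_email  text,
    signup_date    text,
    employee_count text,
    error_message  text,
    create_date    timestamptz NOT NULL DEFAULT now()
);

-- Move clean rows from load to stage, casting as we go; rows that fail the
-- format checks would be routed to stage.client_error in a second pass.
INSERT INTO stage.client (client_natural_key, client_name, contact_email,
                          signup_date, employee_count)
SELECT lower(contact_email),
       client_name,
       contact_email,
       signup_date::date,
       employee_count::integer
FROM   load.client_raw
WHERE  signup_date ~ '^\d{4}-\d{2}-\d{2}$'
  AND  employee_count ~ '^\d+$';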

Natural Keys vs. Surrogate Keys

This raises the age-old question of whether to use natural or surrogate keys. Why not use both? Early in the data transformation, it makes sense to use a natural key. Here’s why:

Natural Keys

Definition: A natural key is a key that is derived from the data itself, typically a field or a combination of fields that naturally and uniquely identify a record in a table. Examples of natural keys include Social Security Numbers (SSNs), email addresses, or VINs (Vehicle Identification Numbers).

Surrogate Keys

Definition: A surrogate key is an artificially created key that serves as a unique identifier for each record in a table, independent of the data in the record. Surrogate keys are usually sequential integers or UUIDs (Universally Unique Identifiers).

Why Use Natural Keys During Staging and Assign Surrogate Keys Later?

Natural Keys in Staging

During staging, natural keys carry business meaning. An email address or client name can be validated against its source, used to spot duplicates, and matched to data arriving from other systems, which makes traceability and integration much easier.

Surrogate Keys in Final Tables

In the final tables, surrogate keys take over as primary keys. Because they are independent of the underlying data, they remain stable when business values change, keep joins and indexes compact for better performance, and leave the design flexible as new sources are added. The natural key is kept as an ordinary, typically uniquely constrained, column for lookups.
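
To illustrate the hand-off, here is a hedged sketch of what a final interview-schema table and its load from stage might look like, again assuming PostgreSQL. The names interview.respondent, respondent_id, and stage.respondent are assumptions for the example, not the published schema.

-- Final table: surrogate primary key, natural key retained with a unique constraint.
CREATE TABLE interview.respondent (
    respondent_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
    email_address  text NOT NULL UNIQUE,                             -- natural key from staging
    first_name     text,
    last_name      text,
    create_date    timestamptz NOT NULL DEFAULT now()
);

-- Downstream tables reference the small, stable surrogate key rather than
-- the longer, potentially changeable natural key.
CREATE TABLE interview.answer (
    answer_id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    respondent_id bigint NOT NULL REFERENCES interview.respondent (respondent_id),
    question_code text NOT NULL,
    answer_text   text,
    create_date   timestamptz NOT NULL DEFAULT now()
);

-- Promote staged rows: insert new natural keys, quietly skip ones already present.
INSERT INTO interview.respondent (email_address, first_name, last_name)
SELECT DISTINCT contact_email, first_name, last_name
FROM   stage.respondent
ON CONFLICT (email_address) DO NOTHING;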

To sum up, this post outlines the ongoing development of the database behind the chatbot interview data discovery app: two new schemas, “load” and “stage,” were added, and “discovery” was renamed “interview” for clarity. Natural keys are used during staging, where their business meaning supports validation, traceability, and integration; surrogate keys are then assigned in the final tables for performance, stability, and flexibility. This hybrid approach balances data integrity with system performance while keeping the design scalable and efficient.

As always, thanks for stopping by.

kDS
