This is interesting! A field being nullable because it's legitimately optional in the domain model is one thing, but for new fields which shouldn't be nullable in the domain model, unless you can pick a reasonable identity value, you need a concept of absence that's different from null. Luckily the intersection of "non-nullable fields" and "fields with no reasonable identity value" and "fields which didn't exist in v1 of the domain model" is normally pretty small, but it's painful when it happens.
This reminds me of frozen/nonfrozen enums in Swift. You can do exhaustive case analysis on frozen enums, but case analysis on nonfrozen enums requires adding an `@unknown default` case.

https://docs.swift.org/swift-book/documentation/the-swift-pr...
What they are saying is that the field is always present in the domain model but we don't have the information to backfill it. For example, say you have a customers table. Originally, it just stored their name and internal ID. But now you are adding in their government ID as well. Except that you already have thousands of customers and you don't have their government ID. So you either make the column nullable and slowly backfill it over time. Or you find some default value which isn't null but the code understands it to still be empty. And again, you slowly backfill over time.
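A rough sketch of the two options (Postgres-flavored; table, column, and values are made up for illustration):

    -- Option 1: add the column as nullable and backfill over time.
    ALTER TABLE customer ADD COLUMN government_id text;
    -- ...backfill in batches as the IDs are collected...
    UPDATE customer SET government_id = '123-45-6789' WHERE id = 42;

    -- Option 2 (instead of option 1): a non-null sentinel the code treats as "not yet known".
    ALTER TABLE customer ADD COLUMN government_id text NOT NULL DEFAULT 'UNKNOWN';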
> So you either make the column nullable and slowly backfill it over time. Or you find some default value which isn't null but the code understands it to still be empty. And again, you slowly backfill over time.
What would be the justification for using some other “default value” in this case? That’s just null with extra steps. Null _is_ the default value
There’s nothing gross or unholy about null values. Very smart people many years ago envisioned them as part of relational databases specifically for the use cases like these.
How would the database know whether the other app layers depend on that value or not? You could absolutely have an app that does not require data in a specific field to function, yet all records happen to have data. This is actually fairly common in single-tenant apps, where some tenants populate a field and others do not. You need to look at how the data is used across the entire stack to know whether or not it should be nullable, not whatever the current data happens to be.
The script example in TFA is just a starting point. I believe you would still manually go through all the columns it finds and decide which ones are actually supposed to be nullable and which ones don't need to be. As the article said, nullable fields that don't contain null could be a sign of incomplete migrations and such.
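As a rough starting point of my own (Postgres-flavored, not the article's actual script), something like this generates one count query per nullable column, which you can then run and eyeball:

    -- Emit a count query for every nullable column in the public schema; running
    -- the generated statements shows which nullable columns never hold NULL.
    SELECT format(
             'SELECT %L AS tbl, %L AS col, count(*) FILTER (WHERE %I IS NULL) AS null_rows, count(*) AS total_rows FROM %I.%I;',
             table_name, column_name, column_name, table_schema, table_name
           ) AS check_query
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND is_nullable = 'YES';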
It doesn't. That's why it's the responsibility of the application layer to correctly implement the data model of the database and not the other way around
From my experience, any new field that gets added to existing tables (after a non-trivial system has been deployed to a production environment with actual clients) must always be nullable.
Firstly, it’s more practical and brings fewer surprises to all people involved (especially if there are many people involved).
Secondly, if we’re being pedantic, it is a mere acknowledgement of the fact that the field was not there before and things worked without it; now that the field is here, in all senses it _is actually_ optional, so all the code touching it should know how to handle that.
I worked on an enterprise project where they used a BPMN engine for all business logic (designed by non-technical people).
Each process can have either predefined fields (which take time to implement) or a key-string value store (yes, they chose this).
Either the BPMN logic, or some JavaScript (some, as not all people can code) in BPMN nodes, or some Java hooks (a lot, written at those other people's request) edit those values.
So when something went wrong, we swam in a sea of key-value pairs, searching all the code for whoever updates this field, and why its value is neither JSON nor null nor boolean but the string "Not used, waiting for policy 1234".
After that project, I'm happy every time I can use a DB to store data, more so if it can run locally, much more so if it's relational.
What if almost everything is NULLABLE, including the supposed primary key and foreign keys of the table?
I've had firsthand experience building a consumer for that kind of DB, and it's hell to get anything running correctly without literally writing a dedicated layer to sanity check everything.
> without literally writing a dedicated layer to sanity check everything
It's not attractive to developers for a variety of reasons, but encapsulation is generally cheaper (in every sense) than the alternatives by orders of magnitude. If a system is hard to alter and other people rely on it to do their jobs, thinking about it as set in stone during my planning process will save me grief and heartache.
Starting with an independent client in mind makes it easier to plan around things like reconstructing tables from composed exports or dealing with authoritatively incorrect data. It's a good collection point for hard-coded edge case handling, and a great location for all of your discovery notes.
That's just the JSON equivalent of "we have data, and it's null" vs "data is missing", and consistency could be enforced using a constraint or by making it non-NULL in the first place.
It's more common in string fields, which in many cases just get rendered on a web form that doesn't differentiate blank (empty string) from the null state, so in the database we should in most cases set it up as follows:
- Value required: field should be non-NULL + at least length 1 (via check constraint)
- Value optional: either the field is non-NULL (empty string means absent), or the field is nullable + at least length 1 (NULL means absent)
I'm curious if you prefer to store optional strings as non-NULL and rely on the length, or as nullable and have a length constraint.
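For concreteness, a sketch of both conventions on a made-up table:

    CREATE TABLE person (
        id          bigint PRIMARY KEY,
        -- required: must be present and non-empty
        last_name   text NOT NULL CHECK (length(last_name) > 0),
        -- optional, variant A: non-NULL, empty string means "absent"
        nickname    text NOT NULL DEFAULT '',
        -- optional, variant B: nullable, and never the empty string
        middle_name text CHECK (middle_name IS NULL OR length(middle_name) > 0)
    );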
I completely agree. Unfortunately, this is one of those things that’s hard to convince people of. You often hear: ‘It doesn’t matter,’ ‘It works fine without it,’ or ‘Why overcomplicate things?’—but that mindset can be limiting.
A column that is nullable but never null might indicate that it should be non-nullable but does not necessarily imply so. Say you have an optional comment field, it might just happen by accident that a comment was recorded for each row, but that of course becomes increasingly unlikely with each additional row in the table. There is probably no harm in checking your database for such columns, especially in tables with many rows, but in the end it is up to the semantics of the data model whether a column should be nullable or not. The absence of NULLs is an indicator but not a determiner.
There is also another scenario: a field might obviously look like it should have a value, and a check might even show it always has one, but it might be a "lazy" value.
E.g. you might have some big CSV uploaded and store the number of rows in it; your app could insert the record without this number and an async process would fill it in later.
There might even be some corner case where a null value is possible.
I believe the solution here isn't to check which fields never contain null, but to improve the process of creating such migrations: either create a second ticket for the next release to update the DB, or commit the new migration to some kind of "next" branch.
With that I agree: once you start to treat the output of any linter as the truth, you are actively lowering the quality of your codebase. At least if it has reasonably good quality to begin with; if it is in bad shape, blindly following linters might still increase code quality, you will just never reach the peak.
For a tooling solution for this problem, and many others, pgroll (https://github.com/xataio/pgroll) automates the steps from the blog post in a single higher-level operation. It can do things like add a hidden column, backfill it with data, then add the constraint, and only then expose it in the new schema.
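For reference, roughly the manual Postgres steps such a tool wraps, sketched against a hypothetical customer.email column (exact locking behaviour depends on the Postgres version):

    -- 1. Add the column as nullable: a quick, metadata-only change.
    ALTER TABLE customer ADD COLUMN email text;

    -- 2. Backfill in small batches to keep row locks short.
    UPDATE customer SET email = '...'   -- placeholder; real backfill pulls from the source of truth
    WHERE email IS NULL AND id BETWEEN 1 AND 10000;

    -- 3. State the NOT NULL intent as a check constraint, skipping validation of
    --    existing rows so only a brief lock is needed...
    ALTER TABLE customer
      ADD CONSTRAINT customer_email_not_null CHECK (email IS NOT NULL) NOT VALID;

    -- 4. ...then validate separately; this scans the table but does not block
    --    normal reads and writes.
    ALTER TABLE customer VALIDATE CONSTRAINT customer_email_not_null;

    -- 5. On Postgres 12+, SET NOT NULL can rely on the validated check constraint
    --    and skip its own full-table scan; the redundant check can then be dropped.
    ALTER TABLE customer ALTER COLUMN email SET NOT NULL;
    ALTER TABLE customer DROP CONSTRAINT customer_email_not_null;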
Am I missing something here in my (MS) SQL world? If a new field is added as null, I do that to the (now) 20-year-old system so we don't break 100's of stored procs - any (new) code that needs that field has to check for it being null...
Shouldn't we look for the database to be able to do online, efficient, non-locking addition of a column with any default value, not just NULL, rather than having the application carry complicated and fragile logic?
I believe PostgreSQL has done this since v11, which was released in 2018 (current is v17):
> Many other useful performance improvements, including the ability to avoid a table rewrite for ALTER TABLE ... ADD COLUMN with a non-null column default

https://www.postgresql.org/docs/release/11.0/
I think there is some restriction there, like the default can't be "volatile" - I can't remember the precise definition here but I think current_timestamp would be volatile, but any static value would not.
That is correct, for non-volatile default values Postgres is quick, which means that it is generally a safe operation.
Also interesting, `now()` is non-volatile because it's defined as "start of the transaction". So if you add a column with `DEFAULT now()` all rows will get the same value. But `timeofday()` is not volatile, so `DEFAULT timeofday()` is going to lock the table for a long time. A bit of a subtle gotcha.
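A quick illustration of the difference, assuming a hypothetical orders table:

    -- Fast on Postgres 11+: a constant or stable default (now() is stable) is
    -- stored once in the catalog, so no table rewrite is needed.
    ALTER TABLE orders ADD COLUMN created_at timestamptz DEFAULT now();

    -- Rewrites the whole table under an exclusive lock: clock_timestamp() is
    -- volatile, so every existing row has to be materialized with its own value.
    ALTER TABLE orders ADD COLUMN touched_at timestamptz DEFAULT clock_timestamp();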
> But a field that is nullable in the schema and never null in practice is a silent lie.
This seems to be the central claim. But just as lies require intent, so does database design to some degree.
A column that is nullable but never null does not conclusively say anything, really. That's like saying a birthday column that is never before 1970 in the current data should be restricted to years after that date.
A nullable column signals that the data may be left empty, which is entirely different from that column actually having to be empty with the current data. Is-ought distinction: the potentiality of null ("nullability") is not to be confused with the actuality of current vacancy ("null values"). The programmer extracting data from the database must check for null if the data is nullable, not just if it is null now, as a principle of robust, fault-tolerant code.
I don't think the author is talking generally about fields that could be NULL but just happen to never be so in the production DB. The piece is specifically in the context of a new database field that is fully intended by its designer to be NOT NULL, which was left nullable only for migration purposes, and which was never updated to be NOT NULL once the migration was complete. The point was not meant to be extended beyond that.
One could write a separate piece about maybe using that as a clue that the field could be NOT NULL'd in the future, but that's not what this post is about.
In that case we read it entirely differently — the last paragraph explicitly mentions inferring "Nullable fields that never hold nulls are often the result of incomplete migrations or forgotten cleanup steps". The author also proposes a script "to identify cases where the field could safely be changed to non-nullable". But alas, we cannot do that with full generality, due to there being a big difference in intent that cannot be captured purely by counting nulls, and surely not by only calculating percentages.
> That's like saying a birthday column that is never before 1970 in the current data should be restricted to years after that date.
No it's not, because they specifically started with the premise that the field was initially intentionally non-null and was only temporarily made nullable for migration purposes. That is obviously not the situation you are describing, right?
I think their point is that for all intents and purposes, the column IS not nullable. It's nullable as an artifact of making live schema migration easier, with no blocking/downtime. But as far as the data model is concerned, it should not be nullable.
Sure, if one just leaves a column nullable due to negligence, one should check the actual requirements and enforce them to make invalid states unrepresentable. The author still makes too strong of a claim that becomes detached from the migration aspect, insinuating that one can just empirically test the database to check whether this is the case, to which I disagree.
Problem is, you end up with other places assuming that it's never null. So in the future, when you actually set it to null somewhere, it will blow up.
People that assume that a (nullable) value is never null because the sample at hand does not contain a null value _will_ learn their lesson the hard way, the one you describe.
But even non-nullable does not always stand the test of time; I'd argue that use cases where the field _has_ to be null eventually emerge and somehow have to be mitigated. There is no easy solution to safely work around that without either tons of work that duplicates lots of things or taking risks by adapting the code base.
I'm glad to see people discussing [zero,infinitesimals,false,emptystring,isnull,unpopulated,missing] as if each one is a thing. They've always been things! We've just been buried in compromises and shortcuts all these years.
There should also be a [the-asteroid-has-hit-y'all-are-so-stupid] and global data systems should just pass that around after impact until the power goes out for good.
Could you give an example? Null _is_ absence, the way I see it
I envy your team whose only mistake is forgetting to set NULLABLE. Rainbows and unicorns ;)
The harm is the same as any other unreliable linter rule.
Each such rule is almost harmless on its own. And yet in most places that use that kind of rule, they end up extremely harmful.
For example:

    create table foo(id int)
    insert foo values (1), (2), (3)
    insert foo values (null)

    select * from foo

    id
    1
    2
    3
    NULL

    ALTER TABLE foo with nocheck
    ADD CONSTRAINT CheckNotnull check (id IS NOT NULL)

    insert foo values (null)

    Msg 547, Level 16, State 0, Line 13
    The INSERT statement conflicted with the CHECK constraint "CheckNotnull".
    The conflict occurred in database "tempdb", table "dbo.foo", column 'id'.
    The statement has been terminated.
However be aware that if you update an existing value to NULL, you will still get the error
    update foo set id = null where id = 2

    Msg 547, Level 16, State 0, Line 20
    The UPDATE statement conflicted with the CHECK constraint "CheckNotnull".
    The conflict occurred in database "tempdb", table "dbo.foo", column 'id'.
    ALTER TABLE foo WITH NOCHECK
    ADD CONSTRAINT CheckNotnull CHECK (id IS NOT NULL)
Any new values coming in cannot be null, but the values already in the table with null are fine... then you can update them to not null over time.
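And a sketch of the eventual tightening step once the legacy rows are cleaned up (the backfill value is just a placeholder):

    -- Backfill whatever NULLs remain with a real (or agreed-upon) value.
    update foo set id = 0 where id is null

    -- Now the column itself can carry the constraint, and the stopgap check
    -- constraint is no longer needed.
    ALTER TABLE foo ALTER COLUMN id int NOT NULL
    ALTER TABLE foo DROP CONSTRAINT CheckNotnull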
> But `timeofday()` is not volatile, so `DEFAULT timeofday()` is going to lock the table for a long time.
Perhaps the “not” was a typo?