Table of contents

1 Why PostgreSQL matters—and why talking about mistakes does too

1.1 Why learning about PostgreSQL matters

1.2 Why talking about PostgreSQL mistakes matters

1.3 What you will learn

1.4 Typical kinds of PostgreSQL mistakes

1.4.1 Coming with expectations from other databases

1.4.2 Misunderstanding PostgreSQL

1.4.3 Misunderstanding the documentation

1.4.4 Using relics from the SQL Standard

1.4.5 Not following best practices

1.5 How this book works

1.5.1 Mental models

1.5.2 Example mistake

1.6 Sample database: Frogge Emporium

2 Bad SQL usage

2.1 Using NOT IN to exclude

2.1.1 Performance implications

2.1.2 Alternative

2.2 Selecting ranges with BETWEEN

2.3 Not using CTEs

2.4 Using uppercase identifiers

2.5 Dividing INTEGERs

2.6 COUNTing NULL values

2.7 Querying indexed columns with expressions

2.8 Upserting NULLs in a composite unique key

2.9 Selecting and fetching all the data

2.10 Not taking advantage of checkers/linters or large language models

2.10.1 Code checkers/linters

2.10.2 Large language models

3 Improper data type usage

3.1 TIMESTAMP (WITHOUT TIME ZONE)

3.2 TIME WITH TIME ZONE

3.3 CURRENT_TIME

3.4 CHAR(n)

3.5 VARCHAR(n)

3.6 MONEY

3.7 SERIAL data type

3.8 XML

4 Table and index mistakes

4.1 Table inheritance

4.2 Neglecting table partitioning

4.3 Partitioning by multiple keys

4.4 Using the wrong index type

5 Improper feature usage

5.1 Selecting SQL_ASCII as the encoding

5.2 CREATE RULE

5.3 Relational JSON

5.4 Putting UUIDs everywhere

5.5 Homemade multi-master replication

5.6 Homemade distributed systems

6 Performance bad practices

6.1 Default configuration in production

6.2 Improper memory allocation

6.3 Having too many connections

6.4 Having idle connections

6.4.1 What is MVCC?

6.4.2 The problem with idle connections

6.5 Allowing long-running transactions

6.5.1 Idle in transaction

6.5.2 Long-running queries in general

6.6 High transaction rate

6.6.1 XID wraparound

6.6.2 Burning through lots of XIDs

6.7 Turning off autovacuum/autoanalyze

6.8 Not using EXPLAIN (ANALYZE)

6.9 Locking explicitly

6.10 Having no indexes

6.11 Having unused indexes

6.12 Removing indexes used elsewhere

7 Administration bad practices

7.1 Not tracking disk usage

7.1.1 Deleting the Write-Ahead Log

7.1.2 What can eat up your disk space?

7.1.3 What can you do?

7.2 Logging to PGDATA

7.3 Ignoring the logs

7.3.1 Bad configuration

7.3.2 Performance issues

7.3.3 Locks

7.3.4 Corruption

7.3.5 Security

7.4 Not monitoring the database

7.5 No tracking of statistics over time

7.6 Not upgrading Postgres

7.7 Not upgrading your system

8 Security bad practices

8.1 Specifying psql -W or --password

8.2 Setting listen_addresses = '*'

8.3 trust-ing in pg_hba.conf

8.4 Database owned by a superuser

8.5 Setting SECURITY DEFINER carelessly

8.6 Choosing an insecure search path

9 High availability bad practices

9.1 Not taking backups

9.2 No Point-in-Time Recovery

9.3 Backing up manually

9.4 Not testing backups

9.5 Not having redundancy

9.6 Using no HA tool

10 Upgrade/migration bad practices

10.1 Not reading all release notes

10.2 Performing inadequate testing

10.3 Succumbing to encoding chaos

10.4 Not using proper BOOLEANs

10.5 Mishandling differences in data types

11 PostgreSQL, best practices, and you: Final insights

11.1 What type of user are you?

11.1.1 The dabbler

11.1.2 The cautious steward

11.1.3 The oblivious coder

11.1.4 The freefaller

11.2 Be proactive: Act early

11.3 All right, so you inherited a bad database

11.3.1 “Historical reasons”

11.3.2 What now?

11.3.3 First things first

11.4 Treat Postgres well, and it will treat you well

Appendix

Appendix A: Frogge Emporium database

A.1 Frogge Emporium database schema

A.2 Frogge Emporium database data

Appendix B: Cheat sheet

Overview

9 High availability bad practices

PostgreSQL’s reputation for resilience only holds when high availability is treated as a first-class design goal. The chapter underscores that HA means keeping the database accessible through failures while minimizing downtime and data loss, with acceptable targets defined by each organization’s RPO and RTO. Complacency, cost-cutting, or misunderstanding of PostgreSQL’s internals commonly lead to dangerous gaps: hardware redundancy is mistaken for data safety, replication is assumed to be a backup, and ad‑hoc operations are trusted over automation and testing. The antidote is proactive planning, explicit backups, redundancy, and the right automation and tooling.

Several backup antipatterns recur. Relying on RAID or filesystem snapshots does not protect against logical corruption or user mistakes, and replication can faithfully propagate both accidental drops and corrupted WAL; even delayed replicas are not a guarantee. Proper backups must be PostgreSQL-aware: base backups with archived WAL enable Point-in-Time Recovery, allowing restoration to the precise moment before the damage occurred. Using pg_dump as a primary safety net is inadequate for HA because it produces logical dumps that are slow to restore and cannot guarantee cluster-wide consistency across interdependent databases; instead use pg_basebackup (including incremental backups in PostgreSQL 17 and later) and continuous archiving. Doing backups by hand and storing them on the database host are recipes for loss; automate schedules, keep off-host and offsite copies, and regularly test restores. Verification should be routine: rehearse recovery, validate base backups with pg_verifybackup, inspect WAL with pg_waldump, and, if needed, use page-level checks. Then automate those checks and alert on failures.

Redundancy is essential to meet tight RTOs: maintain one or more standbys via streaming replication (asynchronous or synchronous), use cascading to offload the primary, and understand features like replication slots and timelines. Just as with backups, avoid homegrown failover scripts; they often miss edge cases such as promoting a lagging replica or split-brain during network partitions. Prefer mature HA tooling that coordinates the cluster, enforces quorum/witness and fencing, manages promotions, keeps replicas in sync (including pg_rewind when timelines diverge), and integrates with monitoring and connection management. In combination, tested backups with PITR, automated verification, multiple replicas, and a proven HA orchestrator are the foundation for avoiding downtime and data loss.

A PostgreSQL installation layout, demonstrating the double physical redundancy of having a standby server but also RAID1 mirrored disks in each of the servers.

A sample PostgreSQL backup setup with a dedicated Barman server and geographical redundancy, showing the possible transfer paths.

A sample PostgreSQL HA setup with a cascaded replication setup for redundancy and backup.

Summary

RAID and filesystem snapshots can’t help you reliably recover from corruption, human error or malicious activity. The best way to guarantee your data is safe is to take backups using appropriate tools like pg_basebackup.
Taking only full backups leaves you vulnerable to losing whatever was written between backups. Use Point-in-Time Recovery with continuous archiving to be able to restore your database to the point just before it was damaged.
Taking backups manually is neither robust nor reliable. Instead, schedule automated backups, preferably using dedicated PostgreSQL-aware software (such as Barman or pgBackRest), and keep a redundant copy of the backups in a second location.
Untested backups can fail when you need them the most. To make sure they work, regularly attempt a full restore. Do not rely solely on automation but verify every step. Avoid using homegrown scripts and prefer tried-and-tested solutions.
Having a single database server with no provision for failover inevitably leads to downtime. Ensure redundancy by setting up standby nodes via replication.
Manual failover or custom scripts are risky because of the potential for extended downtime, data divergence, or data loss. Prefer proven high availability tools such as repmgr, Patroni, or CloudNativePG (for Kubernetes) to ensure reliable and effective management of your HA cluster.

FAQ

Does RAID and a streaming replica mean I don’t need backups?

No. RAID protects only against disk hardware failure: anything that goes wrong above the hardware layer, such as filesystem or data corruption, is written to every disk in the mirror. Streaming replication also mirrors mistakes and corruption: a DROP TABLE, bad writes, or corrupt WAL on the primary will be replayed on the standby. A delayed replica helps only if you notice the problem before the delay elapses. Only proper backups let you recover to a clean point independent of what happened to the primary or its replica.

Are filesystem snapshots a safe way to back up PostgreSQL?

Only if they are coordinated and truly atomic across all tablespaces. PostgreSQL requires WAL and data files to be in sync. Safe options are to take the snapshot after a clean shutdown, or to take it between calls to pg_backup_start() and pg_backup_stop() so that PostgreSQL knows a backup is in progress. Never restore a snapshot over a running cluster. Even with atomic snapshots, a restored copy has to go through crash recovery when it starts, and snapshots are not a substitute for PostgreSQL-aware backups.

Why isn’t pg_dump/pg_dumpall enough for HA-grade backups?

They create logical backups, not physical copies. That means:

No point-in-time recovery (PITR) between full dumps.
Cluster-wide consistency isn’t guaranteed across multiple databases while they are active.
Restores can be slow because indexes and on-disk structures are rebuilt, and planner stats are lost.

Use physical base backups plus archived WAL for HA and PITR.

What is Point-in-Time Recovery (PITR) and how do I set it up?

PITR lets you restore the database to an exact moment before a failure or mistake.

Enable WAL archiving by setting archive_mode and archive_command in postgresql.conf.
Take a base backup with pg_basebackup and keep all WAL generated from backup start to completion.
To restore, place the base backup, provide the archived WAL, and set a recovery target (for example recovery_target_time) to the desired point.
Timelines allow you to branch and try multiple recovery points without losing prior states.
PostgreSQL 17 adds incremental backups with pg_basebackup to reduce full-backup cost.
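
The setup steps above map onto a handful of configuration parameters. The following Python sketch only renders those settings as text so they are easy to compare side by side. The archive directory /mnt/wal_archive and the target timestamp are hypothetical, and the cp-based commands are the simplistic form often used for illustration, not a production-grade archiver; a tool such as Barman or pgBackRest would supply its own commands.

```python
def render_conf(settings):
    """Render a dict of settings as postgresql.conf lines."""
    return "\n".join(f"{name} = '{value}'" for name, value in settings.items())


def archiving_settings(archive_dir):
    """Settings on the running primary that enable continuous WAL archiving."""
    return {
        "wal_level": "replica",   # minimum level that supports archiving
        "archive_mode": "on",     # changing this requires a server restart
        # %p = path of the WAL file to archive, %f = its file name.
        # The test refuses to overwrite a segment already in the archive.
        "archive_command": f"test ! -f {archive_dir}/%f && cp %p {archive_dir}/%f",
    }


def recovery_settings(archive_dir, target_time):
    """Settings for the restored copy (PostgreSQL 12 and later).

    They take effect when an empty recovery.signal file exists in the
    restored data directory at startup.
    """
    return {
        "restore_command": f"cp {archive_dir}/%f %p",
        "recovery_target_time": target_time,
        # Pause at the target so the result can be inspected before promotion.
        "recovery_target_action": "pause",
    }


# Hypothetical values, for illustration only.
print(render_conf(archiving_settings("/mnt/wal_archive")))
print(render_conf(recovery_settings("/mnt/wal_archive", "2025-01-31 09:59:00+00")))
```

Pausing at the recovery target is a deliberate choice in this sketch: it leaves room to check that the damage is really absent before the restored server is promoted and a new timeline begins.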

Can I use pg_dump together with WAL files for PITR?

No. You can’t mix logical backups (pg_dump/pg_dumpall) with WAL for PITR. PITR requires physical base backups plus the corresponding archived WAL. Use tools like pg_basebackup and a WAL archive, often orchestrated by Barman or pgBackRest.

Why must backups be automated and stored off the database server?

Manual processes are error-prone and inconsistent, and people are unavailable or forget. Keeping backups on the same server or storage risks total loss if that host fails. Automate schedules and monitoring, and store copies on independent and preferably offsite media. Tools like Barman and pgBackRest handle base backups, WAL archiving, retention, parallel transfer, and multi-destination storage.
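
As a rough illustration of what "automate schedules and monitoring" can mean in practice, the sketch below checks two of the conditions named above: that at least one backup copy lives off the database server, and that the newest off-host copy is recent enough. The host names, the one-day threshold, and the shape of the input list are all assumptions made for the example; Barman and pgBackRest ship their own status and check commands, which should be preferred.

```python
from datetime import datetime, timedelta, timezone


def backup_problems(copies, now, max_age, db_host):
    """Return a list of problems; an empty list means the checks passed.

    copies: list of (host, finished_at) tuples, one per stored backup copy.
    """
    problems = []
    # Copies that sit on the database server itself do not count:
    # they are lost together with the host.
    off_host = [finished for host, finished in copies if host != db_host]
    if not off_host:
        problems.append("no backup copy stored off the database server")
    newest = max(off_host, default=None)
    if newest is None or now - newest > max_age:
        problems.append("newest off-host backup is missing or too old")
    return problems


# Hypothetical inventory: one copy on the database host, one on a backup host.
now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
inventory = [
    ("db01", now - timedelta(hours=1)),
    ("backup01", now - timedelta(hours=3)),
]
print(backup_problems(inventory, now, timedelta(days=1), db_host="db01"))
```

A check like this is only useful if something runs it on a schedule and raises an alert when the returned list is not empty.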

How do I verify that my backups and WAL archives actually restore?

Regularly perform test restores in a separate environment. At minimum:

Copy the base backup to a fresh PGDATA.
Verify it with pg_verifybackup.
Provide all required WAL since the backup; sanity-check with pg_waldump.
Start PostgreSQL and confirm it reaches a consistent state and that data is present.

For maximum assurance, use pageinspect to read and validate pages, and automate the entire verification with alerts on failure.
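
The verification step can be wrapped so that a scheduler runs it and raises an alert on failure. This is a minimal sketch, assuming pg_verifybackup is on the PATH and that the backup was taken with a manifest (the default for pg_basebackup since PostgreSQL 13). The function names and the alert callback are hypothetical, and the command runner is injectable so the logic can be exercised without a PostgreSQL installation.

```python
import subprocess


def verify_backup(backup_dir, wal_dir=None, run=subprocess.run):
    """Run pg_verifybackup on a base backup; return True if it exits cleanly."""
    cmd = ["pg_verifybackup"]
    if wal_dir is not None:
        # Use this when the WAL the backup needs is kept outside its pg_wal.
        cmd.append(f"--wal-directory={wal_dir}")
    cmd.append(backup_dir)
    try:
        result = run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The binary is missing: treat it as a failed verification.
        return False
    return result.returncode == 0


def check_and_alert(backup_dir, alert, wal_dir=None, run=subprocess.run):
    """Verify a backup and call alert(message) if verification fails."""
    ok = verify_backup(backup_dir, wal_dir=wal_dir, run=run)
    if not ok:
        alert(f"backup verification failed for {backup_dir}")
    return ok
```

A call such as check_and_alert("/restore/base", alert=print) would report a failure on standard output; in practice the callback would page someone or post to a monitoring system. A clean result only shows that the files match the backup manifest, so it complements a full test restore and does not replace it.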

What do RPO and RTO mean, and how do backups and replicas affect them?

RPO (Recovery Point Objective) is the acceptable data loss. PITR with archived WAL can shrink the RPO to roughly the work done since the last successfully archived WAL. RTO (Recovery Time Objective) is the acceptable downtime. Redundancy with standbys reduces RTO by enabling rapid failover without waiting to provision hardware and restore full backups. Synchronous replication can push RPO toward zero at the cost of write latency.
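
A back-of-the-envelope calculation makes the difference concrete. The sketch below is an estimate under simplifying assumptions, not a measurement: it ignores transfer time and WAL replay time, and the backup interval, archive_timeout value, database size, and throughput are made-up numbers.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval, archive_timeout=None):
    """Rough upper bound on lost work if the primary is destroyed.

    Without WAL archiving, everything since the last full backup is lost.
    With archiving, archive_timeout forces a segment switch after that
    long, so roughly that much WAL can still be unarchived.
    """
    if archive_timeout is None:
        return backup_interval
    return min(backup_interval, archive_timeout)


def estimated_restore_time(size_gib, throughput_mib_s):
    """Time to copy a base backup back into place, ignoring WAL replay."""
    return timedelta(seconds=size_gib * 1024 / throughput_mib_s)


nightly = timedelta(hours=24)
print(worst_case_rpo(nightly))                          # full backups only
print(worst_case_rpo(nightly, timedelta(seconds=60)))   # with WAL archiving
print(estimated_restore_time(500, 200))                 # 500 GiB at 200 MiB/s
```

With these assumed numbers, archiving turns a worst case of a day of lost work into about a minute, while the restore alone still takes the better part of an hour. That remaining restore time is the part of the RTO that a ready standby avoids.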

Why should I use a proven HA tool instead of custom failover scripts?

Edge cases are hard: a script can promote a replica that is lagging behind; network partitions can cause split-brain; diverging timelines require careful reconciliation. HA tools coordinate cluster state, WAL positions, and promotion, using mechanisms like witness/quorum/fencing, and utilities such as pg_rewind to realign replicas. Use Patroni, repmgr, or CloudNativePG (for Kubernetes) instead of reinventing the wheel.
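
To show why these edge cases trip up homegrown scripts, here is a deliberately small sketch of just two of the decisions an HA tool has to make: refusing to promote without a majority, and picking the replica that has replayed the most WAL. The function names and node names are hypothetical, and the sketch leaves out fencing, timelines, and everything else a real orchestrator handles, so it illustrates the problem and is not a substitute for Patroni, repmgr, or CloudNativePG.

```python
def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '0/3000060' to an integer position.

    An LSN is two hexadecimal numbers: the high and low 32 bits. Comparing
    the text form directly would sort '0/A000000' after '0/10000000'.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def has_quorum(reachable_nodes, cluster_size):
    """True when this side of a partition sees a strict majority of nodes."""
    return reachable_nodes > cluster_size // 2


def pick_promotion_candidate(replay_lsns, reachable_nodes, cluster_size):
    """Return the most advanced replica, or None if promotion is unsafe.

    replay_lsns: dict of replica name -> last replayed LSN as text.
    """
    if not replay_lsns or not has_quorum(reachable_nodes, cluster_size):
        # A minority side must never promote: that is how split-brain starts.
        return None
    return max(replay_lsns, key=lambda name: lsn_to_int(replay_lsns[name]))


replicas = {"standby1": "0/A000000", "standby2": "0/10000000"}
print(pick_promotion_candidate(replicas, reachable_nodes=2, cluster_size=3))
print(pick_promotion_candidate(replicas, reachable_nodes=1, cluster_size=3))
```

Even this toy version has to get hexadecimal comparison and majority arithmetic right, which hints at how much can go wrong in a script that also has to fence the old primary and rewind diverged replicas.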

How many replicas should I run, and when should I use cascading replication?

To maintain redundancy during maintenance or a failure, run at least two standbys. Cascading replication lets a standby stream to other standbys and to a backup server, reducing load on the primary. This topology improves availability, supports fast failover, offloads read traffic to hot standbys, and keeps backups current without overburdening the primary.
