PDSQI-9: The Clinical Framework Behind AI Note Verification

When a physician uses an AI scribe, the output is a clinical note that will become part of the permanent medical record. The question of whether that note is good — accurate, complete, organized, and clinically sound — is not a matter of opinion.

There is a peer-reviewed framework for evaluating clinical documentation quality. It is called the PDQI-9 (Physician Documentation Quality Instrument), and its enhanced form for AI-generated notes is PDSQI-9 (Provider Documentation Summarization Quality Instrument).

VerifyChart is built on this framework — not a proprietary black box, not a custom LLM prompt, but a structured, evidence-based evaluation instrument developed and validated in clinical research.

This article explains what PDSQI-9 is, how it works, and why it matters for physicians using AI scribes.

What Is PDQI-9?

PDQI-9 is a validated clinical documentation quality instrument developed to assess the quality of physician-generated clinical notes. It evaluates documentation across nine dimensions, each scored on a 1–5 scale.

The framework was designed to move clinical documentation quality assessment from subjective (“this note seems good”) to objective (“this note scores 3.8 across 9 validated dimensions”).

PDSQI-9 extends PDQI-9 specifically for AI-generated and AI-summarized clinical documentation — adding hallucination detection, source groundedness verification, and AI-specific error pattern recognition to the original nine dimensions.

The Nine Dimensions

Each dimension is scored 1–5, where 1 is poor and 5 is excellent.

1. Up-to-Date (1–5)

What it measures: Does the note reflect current medical guidelines, current terminology, and current clinical standards?

Why it matters for AI notes: AI scribes are trained on historical clinical text. They sometimes recommend outdated treatments, use superseded drug names, or reference guidelines that have since been updated.

What a low score looks like: Recommending a medication that has been withdrawn, using outdated staging criteria, or referencing clinical guidelines that have been superseded.

2. Accurate (1–5)

What it measures: Is the clinical information factually correct with no internal contradictions?

Why it matters for AI notes: This dimension captures contradiction hallucinations: cases where the AI documents two statements in the same note that cannot both be true, such as a diagnosis that contradicts the objective findings or a vital sign that contradicts the clinical status assessment.

What a low score looks like: "Patient is hemodynamically stable" with BP 58/40 documented in the same note. HFrEF diagnosis with EF 58% on echo.
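The HFrEF example above can be turned into a simple rule. The following is a minimal Python sketch, not VerifyChart's implementation: the function names and regular expressions are hypothetical, and the thresholds assume the commonly used classification in which HFrEF means an ejection fraction of 40% or lower and HFpEF means 50% or higher. A mismatch is only a candidate for physician review, since a patient whose EF has improved on therapy may legitimately carry an earlier label.

```python
import re

def hf_type_for_ef(ef_percent: float) -> str:
    """Map an ejection fraction to a heart failure type (assumed thresholds)."""
    if ef_percent <= 40:
        return "HFrEF"
    if ef_percent >= 50:
        return "HFpEF"
    return "HFmrEF"

# Hypothetical patterns: "EF 58%" or "LVEF of 58%", and the documented HF type.
EF_PATTERN = re.compile(r"\b(?:LV)?EF\b\D{0,20}?(\d{1,2})\s*%", re.IGNORECASE)
HF_PATTERN = re.compile(r"\bHF(?:r|p|mr)EF\b")

def find_hf_contradictions(note: str) -> list[str]:
    """Return one message per documented HF type that disagrees with the first EF found."""
    ef_match = EF_PATTERN.search(note)
    if ef_match is None:
        return []  # no EF documented, so nothing to compare against
    ef = int(ef_match.group(1))
    expected = hf_type_for_ef(ef)
    documented_types = sorted({m.group(0) for m in HF_PATTERN.finditer(note)})
    return [
        f"{documented} documented, but EF {ef}% corresponds to {expected}"
        for documented in documented_types
        if documented != expected
    ]

print(find_hf_contradictions("HFrEF diagnosis with EF 58% on echo."))
```

A real note would need more robust extraction (multiple EF values, ranges such as "55-60%", spelled-out "ejection fraction"); the sketch only shows the shape of the check.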

3. Thorough (1–5)

What it measures: Is everything clinically important documented? Are sections complete?

Why it matters for AI notes: AI scribes occasionally omit critical elements: an EKG result that was ordered but never documented, an ejection fraction missing from a heart failure note, a differential diagnosis absent from an acute presentation.

What a low score looks like: Chest pain note without EKG findings. Heart failure note without ejection fraction. Acute presentation without differential diagnosis.

4. Useful (1–5)

What it measures: Would this note help the next provider caring for this patient? Does it contain actionable information?

Why it matters for AI notes: AI scribes sometimes produce notes that document the encounter but provide no clear direction: vague plans, absent follow-up instructions, no actionable next steps.

What a low score looks like: "Will follow up as needed" without specific instructions, timelines, or clinical decision rationale.

5. Organized (1–5)

What it measures: Is the note structured correctly? Is information in the right sections?

Why it matters for AI notes: AI scribes occasionally misplace information, such as putting vital signs in the HPI, placing plan items in the assessment, or mixing subjective and objective content. SOAP structure violations are common in AI-generated notes.

What a low score looks like: Medication list appearing in the subjective section. Lab results documented in the plan. Assessment and plan items appearing in reverse order.

6. Comprehensible (1–5)

What it measures: Is the note clearly written and understandable to another clinician?

Why it matters for AI notes: AI scribes sometimes produce notes with undefined abbreviations, ambiguous phrasing, or contradictory statements that obscure the clinical picture for another provider reading the note.

What a low score looks like: Heavy use of unexplained abbreviations, contradictory statements in adjacent sentences, clinical reasoning that cannot be followed.

7. Succinct (1–5)

What it measures: Is the note appropriately concise without unnecessary repetition?

Why it matters for AI notes: AI scribes are prone to copy-paste template bloat: repeating information across multiple sections, including boilerplate that adds length without adding clinical value, and padding notes with excessive negative findings.

What a low score looks like: An 800-word note for a routine blood pressure check. The same information documented three times in different sections.

Succinct scoring is context-aware. Complex multi-system notes, ICU notes, and admission notes are scored relative to their complexity — not penalized for appropriate length.

8. Synthesized (1–5)

What it measures: Does the note show clinical reasoning? Are findings connected to diagnoses and plans in a way that demonstrates medical decision-making?

Why it matters for AI notes: AI scribes can produce notes that list problems and findings without connecting them, for example an assessment that is just a bulleted list of diagnoses with no reasoning about how the objective findings support them.

What a low score looks like: Assessment that lists five diagnoses with no explanation of how the history and physical support them. Plan that lists interventions without connecting them to specific problems.

9. Internally Consistent (1–5)

What it measures: Does the note contradict itself anywhere?

Why it matters for AI notes: This dimension specifically targets contradiction hallucinations, where different sections of the same note contain conflicting information.

What a low score looks like: "Patient denies chest pain" in the HPI and "Acute chest pain" in the assessment. Age documented as 67 in one section and 72 in another. Medication listed for a condition that was explicitly ruled out in the same note.
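Taken together, the nine ratings form a small, fixed-shape record. As a rough illustration only, here is one way to hold and validate them in Python; the identifiers `DIMENSIONS`, `validate_scores`, and `mean_score` are hypothetical and are not taken from the PDQI-9 publication or from VerifyChart.

```python
# The nine dimensions as named in this article (identifier spelling is an assumption).
DIMENSIONS = (
    "up_to_date", "accurate", "thorough", "useful", "organized",
    "comprehensible", "succinct", "synthesized", "internally_consistent",
)

def validate_scores(scores: dict[str, int]) -> None:
    """Raise ValueError unless exactly the nine dimensions are scored as integers 1 to 5."""
    missing = [d for d in DIMENSIONS if d not in scores]
    extra = [d for d in scores if d not in DIMENSIONS]
    if missing or extra:
        raise ValueError(f"missing: {missing}, unexpected: {extra}")
    for name in DIMENSIONS:
        value = scores[name]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

def mean_score(scores: dict[str, int]) -> float:
    """Average the nine ratings, rounded to one decimal place."""
    validate_scores(scores)
    return round(sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS), 1)

# Example: a well-organized note that contradicts itself.
example = {name: 4 for name in DIMENSIONS}
example["accurate"] = 2
example["internally_consistent"] = 2
print(mean_score(example))  # 3.6
```

Keeping validation separate from averaging means a malformed rating fails loudly instead of silently skewing the mean.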

How PDSQI-9 Extends PDQI-9 for AI Notes

The original PDQI-9 was designed for human-generated clinical documentation. PDSQI-9 adds three AI-specific evaluation layers:

Hallucination detection: Identifies internal note inconsistencies — contradictions, physiologically impossible values, and temporal inconsistencies that suggest copied-forward data — that are specific to AI-generated content. A common example is HFrEF vs HFpEF documentation errors, where the documented ejection fraction value contradicts the documented heart failure type within the same note.

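ISMP/TJC medication safety: Checks for abbreviations and dose notations that the Institute for Safe Medication Practices (ISMP) and The Joint Commission (TJC) list as error-prone or prohibited (QD, U for units, trailing zeros, naked decimals, MS/MSO4). AI scribes can produce these because they were trained on clinical text written before the prohibitions took effect.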

Billing risk assessment: Evaluates whether the Medical Decision Making complexity documented in the note supports the CPT code being billed — catching both underbilling (lost revenue) and overbilling (audit risk).
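The medication safety and billing layers described above are both rule-like, so they are easy to illustrate. The Python sketch below is a hedged example and not VerifyChart's implementation. The rule labels, regular expressions, and function names are hypothetical; the CPT mapping assumes office-visit E/M codes selected by Medical Decision Making level (codes can also be selected by time, which this sketch ignores); and how the documented MDM level is derived from the note is left out entirely.

```python
import re

DOSE_UNIT = r"(?:mg|mcg|g|mL|units)"

# Hypothetical patterns for the notations named above.
ABBREVIATION_RULES = [
    ("QD", re.compile(r"\bQD\b", re.IGNORECASE)),
    ("U for units", re.compile(r"\b\d+\s*U\b")),
    ("trailing zero", re.compile(rf"\b\d+\.0+\s*{DOSE_UNIT}\b", re.IGNORECASE)),
    ("naked decimal", re.compile(rf"(?<![\d.])\.\d+\s*{DOSE_UNIT}\b", re.IGNORECASE)),
    ("MS/MSO4", re.compile(r"\bMSO4\b|\bMS\b")),  # case-sensitive on purpose
]

def find_abbreviation_flags(note: str) -> list[tuple[str, str]]:
    """Return (rule label, matched text) pairs, grouped in rule order."""
    return [
        (label, match.group(0))
        for label, pattern in ABBREVIATION_RULES
        for match in pattern.finditer(note)
    ]

# MDM levels in ascending order, and the level each office-visit code requires.
MDM_LEVELS = ("straightforward", "low", "moderate", "high")
CPT_REQUIRED_MDM = {
    "99202": "straightforward", "99203": "low", "99204": "moderate", "99205": "high",
    "99212": "straightforward", "99213": "low", "99214": "moderate", "99215": "high",
}

def billing_risk(cpt_code: str, documented_mdm: str) -> str:
    """Compare the MDM level the note supports with the level the code requires."""
    required = MDM_LEVELS.index(CPT_REQUIRED_MDM[cpt_code])
    documented = MDM_LEVELS.index(documented_mdm)
    if documented < required:
        return "overbilling risk"
    if documented > required:
        return "possible underbilling"
    return "supported"

note = ("Lisinopril 10 mg QD. Insulin glargine 10U at bedtime. "
        "Warfarin 5.0 mg daily. Digoxin .25 mg daily. MSO4 2 mg IV.")
print(find_abbreviation_flags(note))
print(billing_risk("99215", "low"))  # overbilling risk
```

The trailing-zero and naked-decimal rules only fire when a dose unit follows, so that lab values such as "Hgb 12.0" are not flagged; that restriction is a design choice of this sketch.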

How VerifyChart Uses PDSQI-9

VerifyChart runs every AI-generated note through the full PDSQI-9 framework automatically.

Each of the nine dimensions is scored 1–5. An overall score is calculated starting from 100, with deductions for critical flags and low PDSQI dimension averages. The score floor is 20 — reflecting that even significantly problematic notes have some documentation value.

Score	Interpretation
90–100	Excellent — No issues found, physician review complete
75–89	Good — Review warnings before signing
60–74	Fair — Address issues before signing
Below 60	Poor — Do not sign until gaps are resolved
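The scoring rule and the interpretation bands above can be sketched in a few lines of Python. The start value (100), the floor (20), and the band boundaries come from this article. The size of each deduction is not stated here, so `CRITICAL_FLAG_PENALTY`, `LOW_AVERAGE_THRESHOLD`, and `LOW_AVERAGE_PENALTY` are placeholder values invented purely for illustration.

```python
SCORE_START = 100
SCORE_FLOOR = 20

# Placeholder weights: assumptions for this sketch, not VerifyChart's actual values.
CRITICAL_FLAG_PENALTY = 15
LOW_AVERAGE_THRESHOLD = 3.0
LOW_AVERAGE_PENALTY = 10

def overall_score(critical_flags: int, dimension_average: float) -> int:
    """Start from 100, subtract deductions, and never go below the floor of 20."""
    score = SCORE_START - CRITICAL_FLAG_PENALTY * critical_flags
    if dimension_average < LOW_AVERAGE_THRESHOLD:
        score -= LOW_AVERAGE_PENALTY
    return max(SCORE_FLOOR, score)

def interpret(score: int) -> str:
    """Map an overall score to the bands in the table above."""
    if score >= 90:
        return "Excellent"
    if score >= 75:
        return "Good"
    if score >= 60:
        return "Fair"
    return "Poor"

for flags, average in [(0, 4.5), (1, 4.0), (2, 2.5), (10, 1.0)]:
    score = overall_score(flags, average)
    print(flags, average, score, interpret(score))
```

With these placeholder weights, a note with ten critical flags still scores 20 rather than a negative number, which is the effect of the floor.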

Each flag includes the specific text that triggered it, the category (hallucination, ISMP safety, billing risk, missing critical element), and the clinical rationale — giving the physician the information needed to make an informed decision about whether to correct the note or confirm the finding is accurate.
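A flag with those three parts could be modeled as a small immutable record. The following is a sketch under stated assumptions: the class and field names are hypothetical, and only the four category labels are taken from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum

class FlagCategory(Enum):
    HALLUCINATION = "hallucination"
    ISMP_SAFETY = "ISMP safety"
    BILLING_RISK = "billing risk"
    MISSING_CRITICAL_ELEMENT = "missing critical element"

@dataclass(frozen=True)
class Flag:
    triggering_text: str    # the exact text in the note that raised the flag
    category: FlagCategory  # one of the four categories listed above
    rationale: str          # why this matters clinically

    def summary(self) -> str:
        """One-line rendering for a review screen."""
        return f'[{self.category.value}] "{self.triggering_text}": {self.rationale}'

flag = Flag(
    triggering_text="HFrEF diagnosis with EF 58% on echo",
    category=FlagCategory.HALLUCINATION,
    rationale="Documented EF is inconsistent with reduced ejection fraction",
)
print(flag.summary())
```

Carrying the triggering text alongside the rationale lets the physician locate the passage and decide whether to correct the note or confirm it.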

Why a Peer-Reviewed Framework Matters

Most AI-powered tools in healthcare are proprietary black boxes. They produce outputs without explaining the evaluation criteria, the scoring methodology, or the clinical basis for their findings.

PDSQI-9 is different. It is a validated, published framework with documented methodology. When VerifyChart flags an issue, it is not because a language model decided the note seemed problematic. It is because the note fails a specific, named, clinically validated criterion.

For physicians, this matters in two ways:

Clinical trust: You understand what is being checked and why. The framework is not a mystery.

Legal defensibility: If a physician's documentation practices are ever questioned, having used a peer-reviewed framework for independent verification is a stronger position than having used no verification at all.

The Bottom Line

PDSQI-9 gives AI note verification a clinical foundation that proprietary tools lack: nine validated dimensions, evidence-based scoring, and AI-specific extensions for hallucination detection and medication safety.

VerifyChart applies this framework automatically in under 60 seconds — giving physicians a structured, evidence-based second opinion before they sign.

Protect your license. Prove you verified.
