FlareWise
Smart pre-visit intake for doctors
Self-evaluation demo
Gemini invents its own synthetic notes (clean / typo / vague / negated / contradictory / urgent / long) and grades itself on each. This is a sanity-check demo, NOT an independent benchmark: a real evaluation needs held-out labelled data.
| Test Type | Extraction F1 | Hallucination Rate | Negation Accuracy | Temporal Accuracy | Note |
|---|---|---|---|---|---|
| Run the stress test to generate evaluation metrics. | | | | | |
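
The page does not define how the table's metrics are computed. As a rough sketch only, Extraction F1 and Hallucination Rate could be computed as below, assuming extracted items are compared as lowercase sets and that a "hallucination" is any extracted item not found verbatim in the note. The function names `extraction_f1` and `hallucination_rate` and both definitions are assumptions for illustration, not FlareWise's actual scoring.

```python
def extraction_f1(predicted, expected):
    """F1 between two collections of extracted items, compared as lowercase sets."""
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in expected}
    if not pred and not gold:
        return 1.0  # nothing to find, nothing found
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(predicted, note_text):
    """Share of extracted items that never appear verbatim in the source note.

    Assumed definition: substring match on lowercased text.
    """
    items = [p.strip().lower() for p in predicted if p.strip()]
    if not items:
        return 0.0
    text = note_text.lower()
    unsupported = [item for item in items if item not in text]
    return len(unsupported) / len(items)


# Made-up example note and labels, for illustration only.
note = "Joint pain and fatigue for two weeks. No fever."
predicted = ["joint pain", "fatigue", "rash"]
expected = ["joint pain", "fatigue"]

f1 = extraction_f1(predicted, expected)
rate = hallucination_rate(predicted, note)
print(f"F1={f1:.2f} hallucination_rate={rate:.2f}")  # F1=0.80 hallucination_rate=0.33
```

A verbatim substring check is deliberately strict: it would count a correct paraphrase as a hallucination, so a real evaluator would likely need normalization or labelled spans.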
How evaluation improves model behavior
Negation stress
If negated findings such as "no fever" are missed, the schema and evaluator force negated symptoms to be tracked separately from affirmed ones.
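
One way to picture tracking negated symptoms separately is a record with two fields and a score over the negated one. This is a minimal sketch under assumptions: the `SymptomRecord` class, its field names, and the `negation_accuracy` definition are hypothetical and not taken from the FlareWise schema.

```python
from dataclasses import dataclass, field


@dataclass
class SymptomRecord:
    """Hypothetical schema: affirmed and negated symptoms live in separate fields."""
    present: list = field(default_factory=list)
    negated: list = field(default_factory=list)


def negation_accuracy(predicted, expected):
    """Fraction of expected negated symptoms that were placed in `negated`
    and not also reported as present (assumed definition)."""
    if not expected.negated:
        return 1.0
    pred_neg = {s.lower() for s in predicted.negated}
    pred_pos = {s.lower() for s in predicted.present}
    correct = sum(
        1 for s in expected.negated
        if s.lower() in pred_neg and s.lower() not in pred_pos
    )
    return correct / len(expected.negated)


# Made-up labels for a note containing "no fever, no rash".
expected = SymptomRecord(present=["joint pain"], negated=["fever", "rash"])
predicted = SymptomRecord(present=["joint pain", "rash"], negated=["fever"])
score = negation_accuracy(predicted, expected)
print(score)  # 0.5
```

Here "rash" was extracted as present even though the note negated it, so only one of the two negated findings counts as correct.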
Temporal stress
If a symptom started before a medication change, the evaluator flags any output that blames the change for the symptom (unsupported causation) or reports the events in the wrong time order.
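
The time-order check can be sketched as a comparison of event dates: a claimed cause that comes after its claimed effect is flagged. The function `check_causal_claim`, the event names, and the dates below are illustrative assumptions, not the evaluator's real interface.

```python
from datetime import date


def check_causal_claim(events, cause, effect):
    """Return a list of problems with the claim "`cause` led to `effect`".

    `events` maps an event name to the date it was first reported in the note.
    """
    problems = []
    for name in (cause, effect):
        if name not in events:
            problems.append(f"no date for '{name}'")
    if problems:
        return problems
    if events[effect] < events[cause]:
        problems.append(
            f"'{effect}' ({events[effect]}) predates '{cause}' ({events[cause]}): "
            "unsupported causation"
        )
    return problems


# Made-up timeline: the symptom started before the medication change.
events = {
    "joint pain": date(2024, 3, 1),
    "dose change": date(2024, 3, 10),
}
flags = check_causal_claim(events, cause="dose change", effect="joint pain")
print(flags)
```

Correct ordering alone does not establish causation; this sketch only rules out claims that the timeline contradicts.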
Safety stress
If urgent terms appear, a rule-based safety layer flags them before any generated summary is trusted.
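
A rule-based layer of this kind could be as small as a regular-expression scan over the raw note that runs before the summarizer is called. The term list, `urgent_flags`, and `safe_summary` below are hypothetical placeholders; a real deployment would need a clinically reviewed list.

```python
import re

# Illustrative term list only (an assumption, not FlareWise's actual rules).
URGENT_PATTERNS = [
    r"chest pain",
    r"shortness of breath",
    r"suicidal",
    r"vision loss",
    r"blood in (?:stool|urine)",
]
URGENT_RE = re.compile("|".join(f"(?:{p})" for p in URGENT_PATTERNS), re.IGNORECASE)


def urgent_flags(note_text):
    """Return the urgent terms found in the raw note, in order of appearance."""
    return [m.group(0).lower() for m in URGENT_RE.finditer(note_text)]


def safe_summary(note_text, generate_summary):
    """Run the rule check first, then call the (hypothetical) summarizer."""
    flags = urgent_flags(note_text)
    summary = generate_summary(note_text)
    return {"urgent": bool(flags), "flags": flags, "summary": summary}


result = safe_summary(
    "Flare since Monday, now with Chest Pain and shortness of breath.",
    generate_summary=lambda text: "stub summary",  # stand-in for the model call
)
print(result["flags"])  # ['chest pain', 'shortness of breath']
```

Because the scan works on raw text, a phrase like "no chest pain" would also be flagged; in this sketch over-flagging is accepted as the conservative choice.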