FlareWise

Smart pre-visit intake for doctors

Self-evaluation demo

Gemini generates its own synthetic notes (clean / typo / vague / negated / contradictory / urgent / long) and grades its own output on each. This is a sanity-check demo, NOT an independent benchmark; real evaluation needs held-out labelled data.

Test Type | Extraction F1 | Hallucination Rate | Negation Accuracy | Temporal Accuracy | Note
Run the stress test to generate evaluation metrics.
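The page does not say how the table's metrics are calculated. As a rough sketch, Extraction F1 and Hallucination Rate could be computed from sets of extracted and expected items as below. The function names, the exact-match comparison, and the definition of hallucination rate (predicted items with no support in the labels) are assumptions for illustration, not FlareWise's actual scoring code.

```python
def extraction_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 over extracted items, treating each item as an exact-match label."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing claimed
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def hallucination_rate(predicted: set[str], gold: set[str]) -> float:
    """Share of predicted items that do not appear in the gold labels (assumed definition)."""
    if not predicted:
        return 0.0
    return len(predicted - gold) / len(predicted)


# Hypothetical example: one missed item ("nausea"), one invented item ("rash").
predicted_items = {"cough", "fever", "rash"}
gold_items = {"cough", "fever", "nausea"}
f1_score = extraction_f1(predicted_items, gold_items)
halluc = hallucination_rate(predicted_items, gold_items)
```

Because the gold labels here are also model-generated, scores from this kind of calculation only show internal consistency, which is why held-out labelled data is still needed.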

How evaluation improves model behavior

Negation stress

If negated findings such as "no fever" are missed, the schema and evaluator force negated symptoms to be tracked separately from reported ones.
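One way to picture this separation is a record with two distinct fields and a check that scores only the negated side. The `IntakeRecord` class, its field names, and `negation_accuracy` are hypothetical names chosen for this sketch; the real schema is not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class IntakeRecord:
    # Hypothetical schema: negated findings live in their own field
    # so they cannot be silently merged with reported symptoms.
    present_symptoms: set[str] = field(default_factory=set)
    negated_symptoms: set[str] = field(default_factory=set)


def negation_accuracy(record: IntakeRecord, gold_negated: set[str]) -> float:
    """Fraction of gold negated symptoms recorded as negated and not as present."""
    if not gold_negated:
        return 1.0
    correct = sum(
        1
        for symptom in gold_negated
        if symptom in record.negated_symptoms
        and symptom not in record.present_symptoms
    )
    return correct / len(gold_negated)


# Note: "cough for 3 days, no fever, no chills"
# The extraction below wrongly records fever as present.
record = IntakeRecord(
    present_symptoms={"cough", "fever"},
    negated_symptoms={"chills"},
)
score = negation_accuracy(record, gold_negated={"fever", "chills"})
```

Here the score is 0.5, since "chills" was handled correctly but "fever" was treated as a reported symptom.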

Temporal stress

If a symptom began before a medication change, the evaluator flags any output that blames the change for the symptom (unsupported causation) or reports the events in the wrong time order.
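A minimal sketch of such a temporal check follows, assuming each event has already been resolved to a calendar date. The `Event` type, the flag strings, and the `temporal_flags` signature are illustrative assumptions rather than the evaluator's real interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    name: str
    on: date


def temporal_flags(
    symptom: Event,
    medication_change: Event,
    summary_order: tuple[str, str],
    claims_causation: bool,
) -> list[str]:
    """Compare what a summary says against the dated events from the note."""
    flags = []
    # Chronological order of the two events (ties keep the symptom first).
    true_order = tuple(
        e.name for e in sorted([symptom, medication_change], key=lambda e: e.on)
    )
    if summary_order != true_order:
        flags.append("wrong_time_order")
    # A change made after onset cannot have caused the symptom.
    if claims_causation and symptom.on < medication_change.on:
        flags.append("unsupported_causation")
    return flags


headache = Event("headache", date(2024, 3, 1))
dose_increase = Event("dose_increase", date(2024, 3, 10))
flags = temporal_flags(
    headache,
    dose_increase,
    summary_order=("dose_increase", "headache"),
    claims_causation=True,
)
```

In this example the summary both reverses the order and attributes the headache to the later dose increase, so both flags are raised.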

Safety stress

If urgent terms appear, a rule-based safety layer flags them before any generated summary is trusted.
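A rule-based layer of this kind can be as simple as a term list checked before the model's summary is shown. The sketch below assumes a small hard-coded `URGENT_TERMS` tuple and a `gated_summary` helper; both are hypothetical, and the terms are placeholders for illustration, not a clinical list.

```python
import re

# Placeholder terms for illustration only; a real list needs clinical review.
URGENT_TERMS = ("chest pain", "shortness of breath", "suicidal")


def urgent_flags(note: str) -> list[str]:
    """Return urgent terms found in the raw note, matched on word boundaries."""
    lowered = note.lower()
    return [
        term
        for term in URGENT_TERMS
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]


def gated_summary(note: str, generated_summary: str) -> str:
    """Run the rule check on the raw note before passing the summary through."""
    flags = urgent_flags(note)
    if flags:
        return "URGENT (" + ", ".join(flags) + "): review the original note."
    return generated_summary


found = urgent_flags("Pt reports Chest Pain since this morning")
clear = urgent_flags("mild rash on left arm")
```

This matcher deliberately ignores negation, so "no chest pain" would still be flagged. For a safety layer that bias toward over-flagging may be acceptable, but it is a design choice of this sketch, not something the page specifies.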

Developed by aher.dev