Then, rather than relying on humans for the reinforcement learning phase, Anthropic uses that AI evaluation dataset to train a preference model that helps fine-tune Claude to consistently output ...