Rating the Rater: Training Tasteful Generative Models


During my time in the San Francisco Bay Area I’ve met many researchers from frontier AI labs. After talking for a bit, music tends to come up on its own, and I like to ask for people’s favorite artists or genres. Out of a few dozen conversations, Kanye is the clear winner with ten-plus mentions. In popular discourse people like to ask a dumb, cliché question like “Drake or Kanye?” (they have similar listener numbers on mainstream streaming services). A more recent one is “Barbie or Oppenheimer?”. So for the researchers who don’t name Kanye unprompted, I then ask who they think the better artist is. So far it’s 13:0 for Kanye. These folks diverge in their opinions on visual art, politics, TV, film and literary tastes, but for that specific comparison within hiphop/rap it’s clean-cut.

Yes, music can be loaded with culture, subtext and symbolism, but it doesn’t need any of that to work. It’s the lowest-context art form, the one least confounded by cultural or intellectual scaffolding, and I use conversations about music preferences as a stealth aesthetic probe to read off how well a person’s baseline nervous system and aesthetic discrimination function.

Music has a more direct line to the nervous system, and even the most illiterate imbecile should feel in their bones that Kanye vs. Drake is a category error and that the question is an insult to art itself. “Frontier lab researchers” just serve as a rough proxy for “person with minimum baseline cognitive function”. They had to do at least some things right to get there (and yes, I try to ask what their parents did so I can factor out nepobabies). You can try this with top physicists, leading architects or any other non-fake profession. I’d bet the results are the same.

How does this map to taste in generative models?

When I query a generative AI model I want to be sure that none of the Drake-raters has any weight in the inference. This is the crux of making models seem tasteful: align them with well-calibrated nervous systems (a non-trivial, non-random competence), not with an unfiltered average. I want to project out the subspace of raters whose latent taste vector is anti-correlated with mine. They are misleading the query and should get ~0 or negative weight for it.
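A minimal sketch of what I mean, assuming you already have some latent taste vector per rater (the function name and shapes here are made up for illustration): weight each rater by how well their vector aligns with mine, so orthogonal raters contribute roughly nothing and anti-correlated ones count against.

```python
import numpy as np

def rater_weights(my_taste: np.ndarray, rater_tastes: np.ndarray) -> np.ndarray:
    """Weight raters by how well their latent taste vector aligns with mine.

    my_taste:     (d,)   my taste vector (however you estimate it)
    rater_tastes: (n, d) one latent taste vector per rater
    Returns (n,) weights: ~1 for aligned raters, ~0 for orthogonal ones,
    negative for anti-correlated ones.
    """
    my = my_taste / np.linalg.norm(my_taste)
    rt = rater_tastes / np.linalg.norm(rater_tastes, axis=1, keepdims=True)
    return rt @ my  # cosine similarity in [-1, 1]

# Aggregate labels with these weights instead of a plain average,
# e.g. score = (weights * labels).sum() / np.abs(weights).sum()
```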

I think the labs are coming around to this type of thinking. Previously, to get to “not-terrible” outputs you’d throw a giant crowd (i.e., sweatshop labelling companies like scale.ai or AMT) at the problem and average their labels. Crowds don’t work for new art: the avant-garde by definition violates mainstream expectations and expert consensus. So instead of asking “does this rater agree with the rest of the group?”, the training signal asks “is this person self-consistent on similar cases?”. If your votes are indistinguishable from a coin flip, your effective influence on the training signal goes to ~0.
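A toy version of that reweighting, assuming you slip a handful of repeated or near-duplicate comparisons into each rater’s queue (the helper name is hypothetical): self-agreement at chance level means zero influence, perfect self-agreement means full influence.

```python
import numpy as np

def self_consistency_weight(repeat_votes: list[tuple[int, int]]) -> float:
    """Weight a rater by self-consistency, not by agreement with the crowd.

    repeat_votes: pairs of votes the same rater gave on (near-)identical
    comparisons, each vote in {0, 1}. A coin-flipper agrees with
    themselves ~50% of the time.
    Returns a weight in [0, 1]: ~0 at chance level, 1 for perfect consistency.
    """
    agreement = float(np.mean([a == b for a, b in repeat_votes]))
    return max(0.0, 2.0 * (agreement - 0.5))  # chance -> 0, perfect -> 1

# A rater who agrees with themselves on 9 of 10 repeats gets weight 0.8;
# one at 5 of 10 (a coin flip) gets ~0 and stops shaping the reward model.
```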

Pragmatically, that means you tag every label with a rater ID and learn a rater embedding (a vector that captures that person’s latent taste and style). The reward model conditions on that embedding. You learn a global model by pooling everyone, but can switch to a personalized or group-specific one by subsetting the space.
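Here’s roughly what that could look like, sketched in PyTorch; the class name, dimensions and feature inputs are all placeholders, not any lab’s actual setup.

```python
import torch
import torch.nn as nn

class RaterConditionedRewardModel(nn.Module):
    """Reward model that conditions on who provided the label."""

    def __init__(self, n_raters: int, item_dim: int, rater_dim: int = 32):
        super().__init__()
        # One learned taste/style vector per rater ID.
        self.rater_embedding = nn.Embedding(n_raters, rater_dim)
        self.head = nn.Sequential(
            nn.Linear(item_dim + rater_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, item_features: torch.Tensor, rater_ids: torch.Tensor) -> torch.Tensor:
        # item_features: (batch, item_dim) representation of the generated sample.
        taste = self.rater_embedding(rater_ids)      # (batch, rater_dim)
        x = torch.cat([item_features, taste], dim=-1)
        return self.head(x).squeeze(-1)              # (batch,) scalar rewards

# Train on everyone's labels pooled together; at inference time, condition on a
# single rater ID (or the mean embedding of a hand-picked subset) to switch
# between a global and a personalized/group-specific reward.
```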

It’s worth the effort and might alleviate the droning homogeneity (the dull gravity of averages) and meaninglessness we feel when working with the models of today.