Another interesting observation is that running SGD on the cross-entropy loss over your text corpus is equivalent to REINFORCE, i.e. on-policy policy gradient, with the binary reward "did my model generate text from the corpus".

Aug 12, 00:20
Why is cross-entropy a good loss for language pretraining?
caveat: this is all known btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they all can be arrived at from a single starting point.
Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :)
"I want my model to sound intelligent"
but we can't measure that, so we ask
"I want my model to sound like a human"
Although we have access to all texts ever written, we can't quite measure that either, so we instead ask
"I want my model to be as likely as possible to generate one of the texts ever written"
Or more bluntly:
"I want my model to memorize the training data."
Consider this thought experiment:
Given a dataset S of all text ever written by humans, we perform independent trials for each "text" in S:
Sample: draw "sample text" from our model Pr(·; W)
Check: did "sample text" exactly match the original "text"? Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?
Define success as the event
E = "all per-sample checks succeed"
The probability of E is the product of the probabilities your model W assigns to the ground-truth texts:
Pr(E) = Π_{text in S} Pr(text; W)
Maximizing log Pr(E) over W gives you the cross-entropy objective.
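Expanded with the autoregressive chain rule (a standard step, not spelled out above), this becomes the familiar token-level sum:

```latex
\log \Pr(E)
  = \sum_{\text{text} \in S} \log \Pr(\text{text};\, W)
  = \sum_{\text{text} \in S} \sum_{t} \log \Pr(\text{token}_t \mid \text{token}_{<t};\, W)
```

Maximizing this over W is exactly minimizing the summed per-token cross-entropy loss.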
How do you optimize this with SGD?
sample text from corpus
compute grad log Pr(token|prefix) for every prefix of text
update model
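Those three steps can be sketched on a toy bigram model (everything below, the vocabulary, model shape, and learning rate, is an illustrative assumption, not from the post):

```python
# Minimal sketch of the SGD loop above on a toy character-level
# bigram model; all names here are illustrative.
import numpy as np

VOCAB = "ab$"            # '$' marks end of text (and the start state)
V = len(VOCAB)
IDX = {c: i for i, c in enumerate(VOCAB)}

def softmax(z):
    z = z - z.max()      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_prob(text, W):
    """log Pr(text; W): W[i] are next-token logits given current token i."""
    total, prev = 0.0, IDX["$"]          # start from the '$' row
    for ch in text + "$":
        total += np.log(softmax(W[prev])[IDX[ch]])
        prev = IDX[ch]
    return total

def sgd_step(text, W, lr=0.5):
    """One step: accumulate grad log Pr(token | prefix) for every
    prefix of text, then update the model."""
    grad, prev = np.zeros_like(W), IDX["$"]
    for ch in text + "$":
        p = softmax(W[prev])
        grad[prev] += np.eye(V)[IDX[ch]] - p   # grad of log-softmax
        prev = IDX[ch]
    return W + lr * grad

corpus = ["ab", "aab"]
W = np.zeros((V, V))
before = sum(log_prob(t, W) for t in corpus)
for _ in range(100):
    for t in corpus:                     # "sample text from corpus"
        W = sgd_step(t, W)               # "update model"
after = sum(log_prob(t, W) for t in corpus)
assert after > before                    # likelihood of the data went up
```

The gradient `onehot - p` is the classic log-softmax gradient, so each step pushes probability mass toward the observed next token, i.e. token-level cross-entropy descent.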
What's elegant is that this same objective simultaneously:
1) Minimizes the description length of the data under model P( ;W) (compression view)
2) Minimizes KL divergence to the true distribution—if one exists (though we never assumed one)
3) Implements maximum likelihood estimation
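The KL view (2) can be made explicit. If a true distribution p* exists (an assumption made only for this step), the expected loss splits via a standard identity:

```latex
\mathbb{E}_{x \sim p^*}\!\left[ -\log \Pr(x;\, W) \right]
  = H(p^*) + \mathrm{KL}\!\left( p^* \,\Vert\, \Pr(\cdot\,;\, W) \right)
```

Since H(p*) does not depend on W, minimizing cross-entropy minimizes the KL term; and since −log₂ Pr(x; W) is the code length of x under an optimal code for the model, the same objective is the compression view (1).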
The derivation is straightforward and well-known, but it highlights something important:
cross-entropy emerges naturally from wanting exact reproduction of the training data.
P.S. you could have instead asked to maximize
Pr(text generated by the model is in ground truth)
interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like x-entropy but with a multiplying factor
i.e., Pr(text;W) grad log Pr(text;W)
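That multiplying factor is just the log-derivative trick applied to each term of the objective:

```latex
\nabla_W \Pr(\text{text};\, W)
  = \Pr(\text{text};\, W)\, \nabla_W \log \Pr(\text{text};\, W)
```

So texts the model already favors receive the largest updates, which is exactly the pressure toward collapsing onto a single high-probability text.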