openboundlabs.com
# To Be or Not To <span class="german">Unterscheidung</span> ⛈️

GOAL: Is <span class="german">"Unterscheidung"</span> or "To Differentiate" really at the top of a Large Language Model's mind? Look into an LLM's parameters to get a sense of how its knowledge is encoded.

- We are experiencing the rise of Large Language Models, most famously OpenAI's ChatGPT.
- In strong support of open science, Meta publicly released a number of LLMs named Llama.
- Let's dive into the llama-2-7b model to see how this LLM works.
- We quickly see some interesting things...
- It can recite Shakespeare: "To be or ..."
- With no prompt, it wants to start talking about climate change ⛈️, beginning with the German word <span class="german">"Unterscheidung"</span> or "To Differentiate" (token number 19838).
- Its linear output mapping from internal vector space to external tokens is nearly uniform: each output basis vector, on its own, softmaxes to a nearly uniform distribution over tokens.
- There are several outlier output-mapping basis vectors. These basis vectors defy simple characterization.
- LLMs can output highly confident sequences of words. They do so by linearly combining a large number of diffuse basis vectors in a highly distributed manner, rather than by activating a sparse number of highly weighted components.
- There is an opportunity to force sparseness into these output basis vectors during training. This might improve interpretability and information-storage efficiency, but perhaps at the cost of generalization.

<br><br><br>

================================

# Llama2 Models

- Llama2 is an LLM made by Meta and publicly released <span><img width=256px src="meta.png"></span>
- Llama2 model code is available: https://github.com/meta-llama/llama
- Importantly, the model's optimized weights are also available (simply fill out a form to get quick access).
- These publicly available models allow researchers to dig in and do real science.

<br><br><br>

================================

# Llama2 Internals

- Dive into the model by simply running a Python debugger.
- The model itself is shown by "generator.model":

```
Transformer(
  (tok_embeddings): ParallelEmbedding()   # (32000, 4096)
  (layers): ModuleList(
    (0-31): 32 x TransformerBlock(
      (attention): Attention(
        (wq): ColumnParallelLinear()      # (4096, 4096)
        (wk): ColumnParallelLinear()      # (4096, 4096)
        (wv): ColumnParallelLinear()      # (4096, 4096)
        (wo): RowParallelLinear()
      )
      (feed_forward): FeedForward(
        (w1): ColumnParallelLinear()      # (4096, 11008)
        (w2): RowParallelLinear()         # (11008, 4096)
        (w3): ColumnParallelLinear()      # (4096, 11008)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): ColumnParallelLinear()
)
```

- ColumnParallelLinear and RowParallelLinear are simply parallel-compute linear multiply functions from fairscale: https://github.com/facebookresearch/fairscale
- The "output" matrix maps from the internal 4096-dimensional representation to the 32,000-token vocabulary. The final logits output is a simple linear combination of the output basis vectors, weighted by the internal representation.
- $Token = W h$, i.e. $Token_i = \sum_j W_{i,j} h_j$, where $W$ is the output matrix and $h$ is the 4096-dimensional vector produced by all the transformer blocks and the final normalization.

<br><br><br>

================================

# To Be or ...

- Prompt the LLM with "To be or", i.e. the token stream [1, 1763, 367, 470] (1 is the "Begin Of Sentence" token); a minimal invocation sketch follows below.
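For concreteness, here is a minimal sketch of running that prompt with the meta-llama/llama repo (mirroring the repo's example_text_completion.py; the checkpoint and tokenizer paths are placeholders, and the script is meant to be launched with torchrun):

```
# Minimal sketch: placeholder paths, launched with torchrun as in the llama repo examples.
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-7b/",
    tokenizer_path="tokenizer.model",
    max_seq_len=128,
    max_batch_size=1,
)

# "To be or" tokenizes to [1, 1763, 367, 470]; 1 is the Begin Of Sentence token.
print(generator.tokenizer.encode("To be or", bos=True, eos=False))

# temperature=0.0 makes generation greedy (argmax token at every step).
out = generator.text_completion(["To be or"], max_gen_len=64, temperature=0.0)
print(out[0]["generation"])
```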
Llama remembers Shakespeare:

> <b>Prompt:</b>
> To be or
>
> <b>LLM Response:</b>
> not to be, that is the question
>
> Whether 'tis nobler in the mind to suffer
> The slings and arrows of outrageous fortune,
> Or to take arms against a sea of troubles,
> And by opposing end them.
>
> To die, to sleep—
> ...

- For each of those input tokens, record the top three predictions along with their logits:

| Input Token | 1st | 2nd | 3rd |
| -- | -- | -- | -- |
| < begin > | 'Unterscheidung' (7.0748) | 'nobody' (6.8119) | 'everybody' (6.4588) |
| To | 'the' (7.8237) | 'create' (7.3098) | ':' (7.0593) |
| be | 'honest' (15.0077) | 'a' (13.8370) | 'el' (13.7916) |
| or | 'not' (22.0583) | 'to' (17.2366) | 'Not' (16.0178) |

- After processing "< begin > To be or", the LLM predicts the next word is "not", as in Shakespeare. It estimates "not" is exp(22.0583 - 17.2366) = 124.176 times more likely than the next most likely token "to". The softmax probabilities: 'not' = 98.8%, 'to' = 0.795%, 'Not' = 0.235%, ...
- It is interesting that the model, given no context, wants to start with the German word <span class="german">Unterscheidung</span> (token 19838), which means "differentiation". It is exp(7.0748 - 6.8119) ≈ 1.3 times more likely than "nobody".
- With an empty prompt the model outputs:

> Unterscheidung von Klimaschutz und Klimaänderung
>
> Die Begriffe Klimaschutz und Klimaänderung werden oft synonym verwendet, aber es gibt wichtige Unterschiede zwischen ihnen. Klimaschutz bezieht sich auf die Maßnahmen, die ergriff ...

In English:

> Distinction between climate protection and climate change
>
> The terms climate protection and climate change are often used interchangeably, but there are important differences between them. Climate protection refers to the measures taken ...

- <span class="german">Unterscheidung</span> is at the top of the LLM's mind when it is "thinking" without any context.
- This is unexpected. One might think the most likely word with zero context would be a common starting word in the training corpus, such as "It" or "The". This is evidence that LLMs are NOT simple Markov chains. Is this a useful feature or an unwanted artifact?

<br><br><br>

================================

# The Shakespearean Not

- How does the LLM predict "not" in Shakespeare's "To be or _not_ to be, that is the question"?
- Run the model prompted with "To be or" and record:
  - the final hidden state distribution: the 4096-vector that is linearly multiplied into the output matrix.
  - the output token logits distribution: the 32k logits that softmax to token probabilities.

| "To be or" | Hidden 4k | Token Logits 32k |
| -- | -- | -- |
| Unsorted | <img src="toBeOr.h.3.nosort.png"> | <img src="toBeOr.output.3.nosort.png"> |
| Sorted | <img src="toBeOr.h.3.png"> | <img src="toBeOr.output.3.png"> |
| Density | <img src="dath.density.png"> | <img src="datoutput.density.png"> |
| Min | -33.09 | -5.325 (softmax: 0) |
| Mean | 0.02587 | 2.7 (softmax: 0.0000312) |
| Max | 26.7 | 22.058 (softmax: 0.9876 = "not" token) |
| sd | 1.88 | 1.85 (softmax entropy: 0.0908 nats) |

<br><br><br>

================================

# Platonic Ideal Grandmother Cell ... "not"

- The hidden state has a couple of large components that clearly rise above the noise (19 components at a 5% Bonferroni level).
- Perhaps the single largest component corresponds to the Platonic-ideal "not"-vector. This would suggest a simple <a href="https://en.wikipedia.org/wiki/Grandmother_cell">"grandmother cell"</a> encoding by the network.
- The largest positive weight is 26.7 at position 3241. Does this vector represent the concept "not"?
- Sort the values and look at the tokens with the highest and lowest weights (a sketch of this inspection appears at the end of this section):

| top 10 tokens | token+ | weight+ | token- | weight- |
| -- | -- | -- | -- | -- |
| 1 | datei | 0.1728515625 | <0x0A> | -0.05908203125 |
| 2 | typen | 0.166015625 | ▁c | -0.048828125 |
| 3 | jourd | 0.1650390625 | s | -0.048095703125 |
| 4 | textt | 0.15625 | - | -0.04736328125 |
| 5 | ]{' | 0.1494140625 | ▁ | -0.046630859375 |
| 6 | quelle | 0.130859375 | ▁p | -0.04541015625 |
| 7 | AccessorImpl | 0.1298828125 | ▁a | -0.045166015625 |
| 8 | csol | 0.1298828125 | C | -0.044677734375 |
| 9 | ViewById | 0.1240234375 | S | -0.044677734375 |
| 10 | daten | 0.12109375 | 1 | -0.044677734375 |

- Very unclear. Those top-weighted tokens appear to have nothing to do with the concept of "not".
- In fact, the weight associated with the token "not" is down-weighted at -0.027. In a simple "grandmother cell" world, the "not" entry in this vector would have a large positive value: to express a "not" concept, the network would put a high value on this component, which in turn would put a high value on the token "not", increasing its probability in the softmax.
- The representation is HIGHLY distributed. Subsetting to only the most highly weighted components shows that nearly half of the components are needed to get the probability of "not" close to the full representation. Those high weights standing out in the hidden layer do not seem to be doing very much on their own! Only with a large number of components does the probability become peaked. (A sketch of this subsetting analysis also appears at the end of this section.)

<div style="display:inline-flex; align-items: center;">
<img src="probnotComponents.png">
<div style="max-width: 300px; border-style: solid; border-width:1px; padding: 10px;">
Prob("not") as a function of the number of highest-weighted component vectors. 1681 components are required to get to 90% of the full-representation probability of 98.8%. The encoding is HIGHLY distributed. There are no simple "grandmother cells" for the "not" concept. Figure[numComp]
</div>
</div>

- Pushing each individual output component through a softmax shows its entropy ranges tightly from 10.37294 to 10.37344 nats. For reference, the maximum-entropy uniform distribution is 10.37349 nats (ln 32000). On their own, the output components give nearly uniform token distributions; it is the fine-scale, highly distributed weighting by the network, not the output components alone, that is interesting.
- (Note: the above analysis simply adds subsets of the weights sorted by magnitude. However, the entropy of a softmax Boltzmann distribution is controlled by temperature (a constant factor changing the magnitude). Perhaps the vector direction of the subsets is correct, but the lower magnitude caused by the smaller number of terms produces more uniform distributions. Redoing the analysis with the magnitude of the weighted-subset hidden state rescaled to match the full hidden state leads to similar curves. The direction of the hidden state is highly distributed.)
- Is this a feature or a bug for LLMs? Being highly distributed likely makes the model robust. However, it makes interpretation of model internals almost impossible. It could be useful to look into the LLM and see semantically cohesive concepts locally modeled in single vectors; one could then interpret how these clear concepts influence how the text is generated. An L1 penalty on the output vectors during training could make the interpretation semantically cleaner.
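A minimal sketch of the inspection above, assuming a debugger session (or forward hook) has captured `model = generator.model` and the final-position hidden state `h` (the 4096-vector after the last RMSNorm); variable names are illustrative, not the post's actual code:

```
import torch

# Assumed available from the debugger: model = generator.model, and h = the
# final-position hidden state (shape [4096]) after model.norm.
W = model.output.weight.detach().float()   # output matrix, (32000, 4096) on one rank
logits = W @ h.float()                     # Token_i = sum_j W_ij h_j  -> (32000,)
probs = torch.softmax(logits, dim=0)

# Probability of the "not" continuation (the tokenizer may map it to '▁not').
not_id = generator.tokenizer.encode("not", bos=False, eos=False)[0]
print("P('not') =", probs[not_id].item())

# The largest positive hidden component, and the tokens its output basis vector
# weights most positively and most negatively (compare with the table above).
pos = int(torch.argmax(h))                 # e.g. component 3241, value 26.7
col = W[:, pos]
top10 = torch.topk(col, 10).indices
bot10 = torch.topk(-col, 10).indices
print([generator.tokenizer.decode([int(i)]) for i in top10])
print([generator.tokenizer.decode([int(i)]) for i in bot10])
```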
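And a sketch of the subsetting analysis, reusing `W`, `h`, and `not_id` from the inspection sketch (the rescaled variant from the note is included as an option):

```
import torch

# Rank hidden components by magnitude, then measure P("not") when only the
# top-k components are allowed to contribute to the logits.
order = torch.argsort(h.abs(), descending=True)

def prob_not(k, rescale=False):
    h_subset = torch.zeros_like(h)
    idx = order[:k]
    h_subset[idx] = h[idx]                 # keep the top-k components, zero the rest
    if rescale:
        # Match the full hidden-state norm so the softmax "temperature" is
        # comparable and only the direction of the subset changes.
        h_subset = h_subset * (h.norm() / h_subset.norm())
    logits = W @ h_subset.float()
    return torch.softmax(logits, dim=0)[not_id].item()

for k in [1, 10, 100, 500, 1000, 1681, 2048, 4096]:
    print(k, prob_not(k), prob_not(k, rescale=True))
```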
<br><br><br>

================================

# Weird Outlier Basis Vectors

- Look for weird basis vectors:
  - Compute all pairwise Euclidean distances ($4096^2$ of them) over the 32k token dimensions, and run PCA / SVD on the normalized [ (x - mean)/sd ] data (a sketch of this appears at the end of the page).
  - Euclidean distance makes sense because the 32k-dimensional vectors are simply real-valued; they are added together and softmaxed to get the distribution over tokens.

<img src="svd.svg">

- There are 44 outliers with abs(PC2) > 50.
- For each outlier vector, sort the weights and report the 10 largest positive (pos) and 10 largest negative (neg) weights to summarize the 32k values:
- For PC2 < -50:

<iframe width=800 height=400 src="wls.tokens.html" seamless></iframe>

- For PC2 > 50:

<iframe width=800 height=400 src="whs.tokens.html" seamless></iframe>

- A large number of highly weighted non-English tokens appear repeatedly in these outlier vectors. Perhaps these outlier vectors help encode English vs. non-English text.

<br><br><br>

================================

2024March25
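For reference, a minimal sketch of that outlier search, treating the 4096 columns of the output matrix `W` (from the earlier sketches) as the samples; the normalization axis is a guess at the post's (x - mean)/sd, and the |PC2| > 50 cut is taken directly from the text:

```
import numpy as np
from scipy.spatial.distance import pdist

# One row per output basis vector: (4096, 32000).
X = W.cpu().numpy().T

# Normalize each token dimension: (x - mean) / sd  (the axis choice is an assumption).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# All pairwise Euclidean distances between the 4096 basis vectors
# (4096^2 / 2 pairs; expensive at 32k dimensions).
D = pdist(Z, metric="euclidean")

# PCA via SVD on the normalized (hence centered) data; the principal-component
# scores are U * S.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * S

pc2 = scores[:, 1]
outliers = np.where(np.abs(pc2) > 50)[0]   # the post reports 44 such vectors
print(len(outliers), outliers)
```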