LLMs

Dec 31, 2024
#cs


LLM stands for large language model: a machine learning model that is able to converse in human language. GPT stands for Generative Pre-trained Transformer, a type of LLM. Generative means it is able to generate text. Pre-trained means the model's weights have already been trained on a large dataset; it can then be fine-tuned for specific domains. Transformer refers to a specific neural-network architecture designed for sequential data.

In general, transformers consist of encoder and decoder blocks. However, the current state-of-the-art LLMs do not have encoder blocks. A decoder block consists of an attention layer followed by a feed-forward layer. Attention is a concept proposed in the 2017 paper "Attention Is All You Need". TODO The attention layer in modern LLMs is auto-regressive, meaning that future tokens cannot be used to predict previous tokens, and uses self-attention, meaning the same input matrix is multiplied by the query, key, and value weight matrices. The attention layer adjusts the embedding of each token based on the other tokens in the sequence. Attention is the crucial innovation that has made GPTs so successful today (see the code sketches at the end of this post).

The feed-forward layer is a relatively simple multi-layer perceptron, i.e. a fully connected network with non-linear activation functions. The feed-forward layer acts on each token independently.

The first stage of the transformer architecture is the embedding matrix, which is multiplied by the one-hot encoding of each input token to create an embedding for each token. This embedding matrix is learned, but can be reused across models. Next, there's the positional encoding transformation, which adds information about each token's position in the sequence (attention by itself has no notion of order). Next, there's a series of decoder blocks. Finally, there's a matrix multiplication to convert back into the space of all tokens, followed by a softmax. The argmax of the last position's vector corresponds to the prediction for the next token (greedy decoding; in practice the next token is often sampled from this distribution).

What is fine-tuning? Fine-tuning means continuing to train a pre-trained model's weights on a smaller, domain- or task-specific dataset, specializing the model without training it from scratch.
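
To make the attention computation concrete, here is a minimal NumPy sketch of single-head causal self-attention. It is an illustration under assumptions, not any particular model's implementation: the weight matrices `W_q`, `W_k`, `W_v` and the `softmax` helper are hypothetical names, and real models use multiple heads, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head causal self-attention (a sketch, not an optimized implementation).

    X:             (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) learned projection matrices
    """
    # "Self"-attention: the same input X is multiplied by all three weight matrices.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product similarity between tokens
    # Auto-regressive masking: token i may only attend to tokens 0..i, never to future tokens.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = softmax(scores, axis=-1)
    return weights @ V                        # each token's output is a weighted mix of value vectors
```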
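
The feed-forward layer can be sketched the same way. The parameter names are again hypothetical; the key point is that the same two-layer network is applied to each token's vector independently. ReLU is used here as the non-linearity, though real models often use GELU or similar.

```python
def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward layer: the same MLP applied to every token independently.

    X:  (seq_len, d_model)
    W1: (d_model, d_hidden), W2: (d_hidden, d_model); d_hidden is typically ~4x d_model.
    """
    hidden = np.maximum(0, X @ W1 + b1)   # fully connected layer + non-linear activation (ReLU)
    return hidden @ W2 + b2               # project back down to d_model
```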
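
Putting the pieces together, a decoder-only forward pass for next-token prediction might look like the following sketch, reusing the `softmax` helper above. All names (`embed`, `pos_enc`, `blocks`, `W_out`) are hypothetical placeholders, and details such as residual connections, layer norm, and sampling strategies are omitted. Note that multiplying a one-hot vector by the embedding matrix is equivalent to looking up a row of it, which is what the indexing here does.

```python
def predict_next_token(token_ids, embed, pos_enc, blocks, W_out):
    """Sketch of a decoder-only transformer forward pass (hypothetical parameter names).

    token_ids: list of vocabulary indices for the input sequence
    embed:     (vocab_size, d_model) learned embedding matrix
    pos_enc:   (max_len, d_model) positional encodings
    blocks:    list of decoder blocks, each mapping (seq_len, d_model) -> (seq_len, d_model)
    W_out:     (d_model, vocab_size) projection back into the space of all tokens
    """
    # One-hot times the embedding matrix is just a row lookup; add positional information.
    X = embed[token_ids] + pos_enc[:len(token_ids)]
    for block in blocks:              # each block: causal self-attention, then feed-forward
        X = block(X)
    logits = X @ W_out                # back into the space of all tokens
    probs = softmax(logits, axis=-1)  # distribution over the vocabulary at each position
    return int(np.argmax(probs[-1]))  # greedy choice of the next token from the last position
```

A single decoder block here could be as simple as `lambda X: feed_forward(causal_self_attention(X, W_q, W_k, W_v), W1, b1, W2, b2)`, matching the attention-then-feed-forward structure described above (with residual connections omitted).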


