### Foundation Models: Transformer

Core ideas

1. have a generalized language model
1. train on a very large corpus
1. zero- or one-shot learning (see the sketch after this list)
1. self-attention for encoding long-range dependencies
1. self-supervision for leveraging large unlabeled datasets (aka unsupervised pre-training)
1. additional supervised training for downstream tasks, e.g.
   - translation (lang1 & lang2 pairs)
   - question answering (Q&A pairs)
   - sentiment analysis (text & mood pairs)
   - etc.
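A minimal sketch (not part of the original slides) of what zero-shot use of a pretrained language model looks like in practice, using the Hugging Face `transformers` zero-shot-classification pipeline; the default model it downloads and the candidate labels are illustrative assumptions.

```python
# Hypothetical illustration: zero-shot classification with a pretrained model,
# i.e. no task-specific fine-tuning and no labelled examples are provided.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")  # downloads a default NLI-based model

result = classifier(
    "The new graphics card renders scenes twice as fast as its predecessor.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # best-matching label
```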
### Transformer Zoo

* the original transformer was designed for translation tasks
* usage has broadened ever since
* spawning a whole zoo of transformers
* some use the encoder only
* some use the decoder only
* some use a combination of encoder and decoder, just like the original transformer
### Encoder only (BERT-like)

_also called auto-encoding Transformer models_

* some problems only need the encoder part of the original transformer
* "understanding" texts and their semantics is sufficient to, e.g.,
  * answer questions about a text with individual passages from the original text,
  * classify the mood of a text as "positive" or "negative", or
  * fill individual word gaps in texts
* in all three cases, no "complex" answer is required for which a decoder would be needed
* BERT and derived models are famous off-the-shelf representatives of encoder-only models
### What can BERT do?

* token classification
* sentence classification (see the pipeline sketch below)
* multiple choice classification
* question answering
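As a hedged illustration of two of these tasks (sentence classification and question answering), the sketch below uses the Hugging Face `transformers` pipeline API; the checkpoints it downloads are library defaults, not models prescribed by these slides.

```python
# Illustrative only: sentence classification (sentiment) and extractive question
# answering with encoder-style models via the high-level pipeline API.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("This lecture on transformers is surprisingly entertaining."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

qa = pipeline("question-answering")
print(qa(
    question="When was BERT introduced?",
    context="BERT was introduced by researchers at Google in 2018.",
))
# e.g. {'answer': '2018', ...}
```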
### Training BERT

* first objective:
  * the input is corrupted by random masking
  * the model must predict the original sentence
  * only the masked words are predicted, rather than reconstructing the entire input, which BERT cannot do (see the fill-mask sketch below)
* second objective:
  * inputs are two sentences A and B
  * the model has to predict whether the sentences are consecutive or not

https://huggingface.co/transformers/model_summary.html#bert
https://arxiv.org/abs/1810.04805
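A small sketch of the first objective at inference time: the fill-mask pipeline asks a masked-language-model checkpoint to predict only the masked token. Using `bert-base-uncased` here is an assumption; any masked-LM checkpoint would do.

```python
# Illustration of masked-language-model prediction (BERT's first pretraining objective).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    # each candidate carries the predicted token and its probability
    print(candidate["token_str"], round(candidate["score"], 3))
```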
### Decoder only (GPT-like)

_also called auto-regressive Transformer models_

* the decoder part can transform given inputs into complete sentences
* useful in itself, e.g. to complete started sentences
* GPT would be an example of this kind of application
* unidirectional: trained to predict the next word
* developed by OpenAI
### Training GPT

* self-supervised training
* predict the next word, given all of the previous words within some text (see the sketch below)
* has a limited context window

https://huggingface.co/transformers/model_summary.html#original-gpt
https://huggingface.co/transformers/model_doc/gpt2.html
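A minimal sketch of this next-word objective at inference time, assuming the small public `gpt2` checkpoint and an arbitrary generation length.

```python
# Illustration of autoregressive generation: the model repeatedly predicts the next
# token given all previous tokens, within its limited context window.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture was introduced",
                max_new_tokens=20)[0]["generated_text"])
```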
### Evolution of GPT

GPT: Generative Pre-Trained Transformer

* GPT-1: 2018, 110 million parameters (https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), https://www.youtube.com/watch?v=LOCzBgSV4tQ
* GPT-2: 2019, 1.5 billion parameters (https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), https://www.youtube.com/watch?v=BXv1m9Asl7I
* GPT-3: 2020, 175 billion parameters (https://arxiv.org/abs/2005.14165), https://www.youtube.com/watch?v=wYdKn-X4MhY
* GPT-4: announced for 2022, probably not much larger, but trained on more data and with a longer context (4096 instead of 2048 tokens) (https://analyticsindiamag.com/gpt-4-sam-altman-confirms-the-rumours/)
### Complete transformer (BART/T5-like)

_also called sequence-to-sequence Transformer models_

* combined use of encoder and decoder, as in the original transformer approach
* allows summarization of texts in addition to translation
* the name is probably most appropriate here, as texts really get transformed
* models like T5 and BART are the most common representatives
* to be able to operate on all NLP tasks, T5 turns them into text-to-text problems by using task-specific prefixes (see the sketch below):
  * summarize:
  * question:
  * translate English to German:
  * etc.

https://arxiv.org/abs/1910.10683
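A short sketch of this text-to-text interface, assuming the small public `t5-small` checkpoint: the task is selected purely by the prefix in the input string.

```python
# Illustration: the same T5 model handles translation and summarization,
# distinguished only by the task prefix in the input text.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")

print(t5("translate English to German: The house is wonderful.")[0]["generated_text"])
print(t5("summarize: The transformer architecture relies on self-attention to model "
         "long-range dependencies and has replaced recurrent networks in most "
         "NLP applications.")[0]["generated_text"])
```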
### Training T5

pretraining includes both self-supervised and supervised learning

* self-supervised training randomly removes a fraction of the tokens and replaces them with individual sentinel tokens
  * the input of the encoder is the corrupted sentence
  * the input of the decoder is the original sentence
  * the target is then the dropped-out tokens delimited by their sentinel tokens (see the toy example below)
* supervised training on downstream tasks provided by the GLUE and SuperGLUE benchmarks
  * converting them into text-to-text tasks as explained on the previous slide

https://huggingface.co/transformers/model_summary.html#t5
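A toy, hand-constructed example (not the actual preprocessing code) of what this span-corruption format looks like, using T5's sentinel tokens `<extra_id_0>`, `<extra_id_1>`, ...:

```python
# Hand-made illustration of T5 span corruption: dropped spans in the encoder input
# are replaced by sentinel tokens, and the target lists the dropped spans delimited
# by those same sentinels (the spans here were chosen by hand, not sampled).
original       = "Thank you for inviting me to your party last week."
encoder_input  = "Thank you <extra_id_0> me to your party <extra_id_1> week."
decoder_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

print(encoder_input)
print(decoder_target)
```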
### Don't forget: Transformers are language models

* No abstract reasoning takes place, as it does in our brains
* The basis is the expression of thoughts in texts, code, etc.
* That is the way the system is trained
* Whether this is also intelligent is a pointed question
* Turing would probably say it doesn't matter
* One can argue that this system passes his test

https://twitter.com/glouppe/status/1438496208343949318
### On the Opportunities and Risks of Foundation Models

* foundation models: trained on broad data at scale and adaptable to a wide range of downstream tasks
* ML is undergoing a paradigm shift with the rise of these models
* their scale results in new emergent capabilities
* defects of the foundation model are inherited by all the adapted models downstream
* there is a lack of clear understanding of how they work, when they fail, and what they are even capable of

https://arxiv.org/abs/2108.07258
### Why it might make sense to study transformers even when you are not into NLP

> So even though I'm technically in vision, papers, people and ideas across all of AI are suddenly extremely relevant. Everyone is working with essentially the same model, so most improvements and ideas can "copy paste" rapidly across all of AI.

https://twitter.com/karpathy/status/1468370611797852161
### More cool GPT-based stuff

* GPT-3: still in beta, but no longer a private beta
  * https://beta.openai.com/examples
  * https://beta.openai.com/codex-javascript-sandbox
* large language models (like GPT-3) can solve grade school math problems much more effectively: https://openai.com/blog/grade-school-math/#samples
### Examples of applications in the corporate context

* Are nasty things circulating on social media about your company?
* Summarization of (scientific) articles
* Classification of incoming mail (email)
* Summarization: condensing long texts into short ones (e.g. product descriptions)
* What is your example?
### Wrap-Up

* There just isn't "The Transformer"; there is a whole zoo of transformers
* Transformers can be distinguished by their architecture and how they are trained
* There is considerable overlap in the tasks they can perform
* Foundation models like transformers are causing a paradigm shift in machine learning

https://www.embarc.de/blog/transformer-zoo/