ML Papers

03 May, 2024

What is ML Papers Explained?

ML Papers Explained is a GitHub repo with explanations of the key concepts in Machine Learning.

Language Models

| Paper | Date | Description |
|---|---|---|
| Transformer | June 2017 | An Encoder-Decoder model that introduced the multi-head attention mechanism for the language translation task. |
| Elmo | February 2018 | Deep contextualized word representations that capture both intricate aspects of word usage and contextual variations across language contexts. |
| GPT | June 2018 | A Decoder-only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations. |
| BERT | October 2018 | Introduced pre-training for Encoder Transformers. Uses a unified architecture across different tasks. |
| Transformer XL | January 2019 | Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism. |
| GPT 2 | February 2019 | Demonstrates that language models begin to learn various language processing tasks without any explicit supervision. |
| UniLM | May 2019 | Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks. |
| XLNet | June 2019 | Extension of the Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives. |
| RoBERTa | July 2019 | Built upon BERT by carefully optimizing hyperparameters and training data size to improve performance on various language tasks. |
| Sentence BERT | August 2019 | A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine similarity. |
| Tiny BERT | September 2019 | Uses attention transfer and task-specific distillation for distilling BERT. |
| ALBERT | September 2019 | Presents certain parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. |
| Distil BERT | October 2019 | Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective. |
| T5 | October 2019 | A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format. |
| BART | October 2019 | An encoder-decoder model pretrained to reconstruct the original text from corrupted versions of it. |
| UniLMv2 | February 2020 | Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks. |
| FastBERT | April 2020 | A speed-tunable encoder with adaptive inference time, having branches at each transformer output to enable early outputs. |
| MobileBERT | April 2020 | A compressed and faster version of BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer. |
| Longformer | April 2020 | Introduces a linearly scalable attention mechanism, allowing handling of texts of extended length. |
| GPT 3 | May 2020 | Demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance. |
| DeBERTa | June 2020 | Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training. |
| DeBERTa v2 | June 2020 | Enhanced version of DeBERTa featuring a new vocabulary, nGiE integration, optimized attention mechanisms, additional model sizes, and improved tokenization. |
| Codex | July 2021 | A GPT language model finetuned on publicly available code from GitHub. |
| FLAN | September 2021 | An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions. |
| T0 | October 2021 | A fine-tuned encoder-decoder model trained on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets. |
| Gopher | December 2021 | Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B on 152 tasks. |
| LaMDA | January 2022 | Transformer-based models specialized for dialog, which are pre-trained on public dialog data and web text. |
| Instruct GPT | March 2022 | Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent. |
| Chinchilla | March 2022 | Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws). |
| PaLM | April 2022 | A 540B-parameter, densely activated Transformer trained using Pathways, an ML system that enables highly efficient training across multiple TPU Pods. |
| GPT-NeoX-20B | April 2022 | An autoregressive LLM trained on the Pile, and the largest dense model that had publicly available weights at the time of submission. |
| Flamingo | April 2022 | Visual Language Models enabling seamless handling of interleaved visual and textual data, and facilitating few-shot learning on large-scale web corpora. |
| OPT | May 2022 | A suite of decoder-only pre-trained transformers with parameter ranges from 125M to 175B, OPT-175B being comparable to GPT-3. |
| Flan T5, Flan PaLM | October 2022 | Explores instruction finetuning with a particular focus on scaling the number of tasks, scaling the model size, and finetuning on chain-of-thought data. |
| BLOOM | November 2022 | A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology. |
| Galactica | November 2022 | An LLM trained on scientific data, thus specializing in scientific knowledge. |
| ChatGPT | November 2022 | An interactive model designed to engage in conversations, built on top of GPT 3.5. |
| LLaMA | February 2023 | A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained exclusively on publicly available datasets. |
| Alpaca | March 2023 | A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. |
| GPT 4 | March 2023 | A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs. |
| PaLM 2 | May 2023 | Successor of PaLM, trained on a mixture of different pre-training objectives in order to understand different aspects of language. |
| LIMA | May 2023 | A LLaMA model fine-tuned on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. |
| Falcon | June 2023 | An open-source LLM trained on properly filtered and deduplicated web data alone. |
| LLaMA 2 | July 2023 | Successor of LLaMA. LLaMA 2-Chat is optimized for dialogue use cases. |
| Humpback | August 2023 | LLaMA finetuned using instruction backtranslation. |
| Code LLaMA | August 2023 | LLaMA 2 based LLM for code. |
| GPT-4V | September 2023 | A multimodal model that combines text and vision capabilities, allowing users to instruct it to analyze image inputs. |
| LLaMA 2 Long | September 2023 | A series of long-context LLMs that support effective context windows of up to 32,768 tokens. |
| Mistral 7B | October 2023 | Leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with a reduced inference cost. |
| Llemma | October 2023 | An LLM for mathematics, formed by continued pretraining of Code Llama on a mixture of scientific papers, web data containing mathematics, and mathematical code. |
| CodeFusion | October 2023 | A diffusion code generation model that iteratively refines entire programs based on encoded natural language, overcoming the limitation of auto-regressive models in code generation by allowing reconsideration of earlier tokens. |
| Zephyr 7B | October 2023 | Utilizes dDPO and AI Feedback (AIF) preference data to achieve superior intent alignment in chat-based language modeling. |
| Gemini 1.0 | December 2023 | A family of highly capable multimodal models, trained jointly across image, audio, video, and text data for the purpose of building a model with strong generalist capabilities across modalities. |
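Most of the models above build on the multi-head attention mechanism introduced by the Transformer. As a rough pure-Python sketch (not any paper's reference code): the feature dimension is split into heads, each head runs scaled dot-product attention independently, and the head outputs are concatenated. The learned per-head projections and the output projection are omitted for brevity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_head_attention(queries, keys, values, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.
    Real transformers also apply learned projections per head; omitted here."""
    d = len(queries[0])
    assert d % num_heads == 0
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        heads.append(attention([q[sl] for q in queries],
                               [k[sl] for k in keys],
                               [v[sl] for v in values]))
    # Concatenate head outputs along the feature dimension.
    return [sum((head[i] for head in heads), []) for i in range(len(queries))]
```

Because each head attends over a lower-dimensional slice, the model can learn different attention patterns per head at roughly the cost of a single full-width attention.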

Language Models for Retrieval

| Paper | Date | Description |
|---|---|---|
| Dense Passage Retriever | April 2020 | Shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. |
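A rough sketch of the dual-encoder idea: once questions and passages have been embedded by their respective encoders, retrieval reduces to ranking passages by inner product with the question embedding. The embeddings below are toy placeholders, not learned vectors.

```python
def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def retrieve(question_emb, passage_embs, top_k=1):
    """Rank passage embeddings by inner product with the question
    embedding, as in a dual-encoder retriever. Embeddings are assumed
    to be precomputed by the two encoders (toy inputs here)."""
    ranked = sorted(range(len(passage_embs)),
                    key=lambda i: dot(question_emb, passage_embs[i]),
                    reverse=True)
    return ranked[:top_k]
```

In practice the passage index is built offline and searched with an approximate nearest-neighbor library rather than an exhaustive sort.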

Vision Models

| Paper | Date | Description |
|---|---|---|
| Vision Transformer | October 2020 | Images are segmented into patches, which are treated as tokens, and a sequence of linear embeddings of these patches is input to a Transformer. |
| DeiT | December 2020 | A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens. |
| Swin Transformer | March 2021 | A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision. |
| BEiT | June 2021 | Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens to pretrain vision Transformers. |
| MobileViT | October 2021 | A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs. |
| Masked AutoEncoder | November 2021 | An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision. |
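The patch-tokenization step described in the Vision Transformer entry can be sketched as follows. This is an illustration only: the nested-list image format stands in for a tensor, and the learned linear projection and position embeddings that follow in the real model are omitted.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists: rows, columns, channels)
    into non-overlapping patch_size x patch_size patches, each flattened
    into a single vector -- the ViT tokenization step."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for r in range(0, h, patch_size):
        for c in range(0, w, patch_size):
            patch = []
            for dr in range(patch_size):
                for dc in range(patch_size):
                    patch.extend(image[r + dr][c + dc])  # append channels
            patches.append(patch)
    return patches
```

A 224x224 image with 16x16 patches yields 196 such tokens, which is why ViT sequence lengths stay manageable.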

Convolutional Neural Networks

| Paper | Date | Description |
|---|---|---|
| LeNet | December 1998 | Introduced Convolutions. |
| Alex Net | September 2012 | Introduced ReLU activation and Dropout to CNNs. Winner of ILSVRC 2012. |
| VGG | September 2014 | Used a large number of small-size filters in each layer to learn complex features. Achieved SOTA in ILSVRC 2014. |
| Inception Net | September 2014 | Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales. |
| Inception Net v2 / Inception Net v3 | December 2015 | Design optimizations of the Inception Modules which improved performance and accuracy. |
| Res Net | December 2015 | Introduced residual connections, which are shortcuts that bypass one or more layers in the network. Winner of ILSVRC 2015. |
| Inception Net v4 / Inception ResNet | February 2016 | Hybrid approach combining Inception Net and ResNet. |
| Dense Net | August 2016 | Each layer receives input from all the previous layers, creating a dense network of connections between the layers, allowing the network to learn more diverse features. |
| Xception | October 2016 | Based on InceptionV3 but uses depthwise separable convolutions instead of inception modules. |
| Res Next | November 2016 | Built over ResNet, introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups. |
| Mobile Net V1 | April 2017 | Uses depthwise separable convolutions to reduce the number of parameters and the computation required. |
| Mobile Net V2 | January 2018 | Built upon the MobileNetV1 architecture, uses inverted residuals and linear bottlenecks. |
| Mobile Net V3 | May 2019 | Uses AutoML to find the best possible neural network architecture for a given problem. |
| Efficient Net | May 2019 | Uses a compound scaling method to scale the network's depth, width, and resolution to achieve high accuracy with a relatively low computational cost. |
| NF Net | February 2021 | An improved class of Normalizer-Free ResNets that match the performance of batch-normalized networks, offer faster training times, and introduce an adaptive gradient clipping technique to overcome instabilities associated with deep ResNets. |
| Conv Mixer | January 2022 | Processes image patches using standard convolutions for mixing spatial and channel dimensions. |
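The parameter savings behind the depthwise separable convolutions used by Xception and the MobileNets can be verified with simple arithmetic. These are illustrative helper functions (biases omitted), not any library's API:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution:
    every output channel has a k x k filter over all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weight count of a depthwise separable convolution:
    a depthwise k x k stage (one filter per input channel)
    followed by a 1 x 1 pointwise stage, as in MobileNet."""
    return k * k * c_in + c_in * c_out
```

For a 3x3 convolution from 64 to 128 channels this is 73,728 versus 8,768 weights, roughly an 8x reduction, which is where the MobileNet efficiency comes from.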

Object Detection

| Paper | Date | Description |
|---|---|---|
| SSD | December 2015 | Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map. |
| Feature Pyramid Network | December 2016 | Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids. |
| Focal Loss | August 2017 | Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples. |
| DETR | May 2020 | A novel object detection model that treats object detection as a set prediction problem, eliminating the need for hand-designed components. |
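The down-weighting described in the Focal Loss entry comes from a modulating factor on the standard cross-entropy loss. A minimal binary sketch of the published formulation, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss. p is the predicted probability of the
    positive class, y the binary label. With gamma=0 and alpha=1
    this reduces to plain cross-entropy; larger gamma shrinks the
    loss on well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 2, an easy positive predicted at p = 0.99 contributes roughly four orders of magnitude less loss than a hard one at p = 0.1, so the sea of easy background examples no longer dominates training.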

Region-based Convolutional Neural Networks

| Paper | Date | Description |
|---|---|---|
| RCNN | November 2013 | Uses selective search for region proposals, CNNs for feature extraction, and an SVM for classification, followed by box-offset regression. |
| Fast RCNN | April 2015 | Processes the entire image through a CNN, employs RoI Pooling to extract feature vectors from RoIs, followed by classification and bounding-box regression. |
| Faster RCNN | June 2015 | A region proposal network (RPN) and a Fast R-CNN detector that collaboratively predict object regions by sharing convolutional features. |
| Mask RCNN | March 2017 | Extends Faster R-CNN to solve instance segmentation tasks by adding a branch for predicting an object mask in parallel with the existing branch. |
| Cascade RCNN | December 2017 | Proposes a multi-stage approach where detectors are trained with progressively higher IoU thresholds, improving selectivity against false positives. |
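The IoU thresholds that Cascade R-CNN progressively raises are computed from plain box overlap. A minimal sketch of the underlying metric, assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes
    given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Each cascade stage then labels a proposal positive only if its IoU with a ground-truth box exceeds that stage's threshold (e.g. 0.5, then 0.6, then 0.7).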

Document AI

| Paper | Date | Description |
|---|---|---|
| Table Net | January 2020 | An end-to-end deep learning model designed for both table detection and structure recognition. |
| Donut | November 2021 | An OCR-free Encoder-Decoder Transformer model. The encoder takes in images; the decoder takes in prompts and encoded images to generate the required text. |
| DiT | March 2022 | An Image Transformer pre-trained (self-supervised) on document images. |
| UDoP | December 2022 | Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation. |

Layout Transformers

| Paper | Date | Description |
|---|---|---|
| Layout LM | December 2019 | Utilizes BERT as the backbone and adds two new input embeddings: a 2-D position embedding and an image embedding (the latter only for downstream tasks). |
| LamBERT | February 2020 | Utilizes RoBERTa as the backbone and adds layout embeddings along with relative bias. |
| Layout LM v2 | December 2020 | Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage, to learn end-to-end cross-modal interaction. |
| Structural LM | May 2021 | Utilizes BERT as the backbone and feeds text, 1D, and (2D cell-level) embeddings to the transformer model. |
| Doc Former | June 2021 | An encoder-only transformer with a CNN backbone for visual feature extraction; combines text, vision, and spatial features through a multi-modal self-attention layer. |
| LiLT | February 2022 | Introduced a bi-directional attention complementation mechanism (BiACM) to accomplish the cross-modal interaction of text and layout. |
| Layout LM V3 | April 2022 | A unified text-image multimodal Transformer to learn cross-modal representations, that takes as input the concatenation of text and image embeddings. |
| ERNIE Layout | October 2022 | Reorganizes tokens using layout information, combines text and visual embeddings, and utilizes multi-modal transformers with spatial-aware disentangled attention. |
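The 2-D position embedding introduced by LayoutLM can be sketched as follows: each normalized bounding-box coordinate is looked up in a learned table (one table for x coordinates, one for y), and the resulting vectors are summed before being added to the usual token embeddings. The plain-list tables below stand in for learned embedding layers and are purely illustrative.

```python
def layout_embedding(bbox, x_table, y_table):
    """LayoutLM-style 2-D position embedding sketch: look up each
    normalized box coordinate (x0, y0, x1, y1) in its coordinate
    table and sum the four vectors element-wise."""
    x0, y0, x1, y1 = bbox
    vecs = [x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(v[j] for v in vecs) for j in range(len(vecs[0]))]
```

In the real model the coordinates are first normalized to a 0-1000 grid so the tables stay a fixed size regardless of page resolution.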

Tabular Deep Learning

| Paper | Date | Description |
|---|---|---|
| Entity Embeddings | April 2016 | Maps categorical variables into continuous vector spaces through neural network learning, revealing intrinsic properties. |
| Wide and Deep Learning | June 2016 | Combines memorization of specific patterns with generalization of similarities. |
| Deep and Cross Network | August 2017 | Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering. |
| Tab Transformer | December 2020 | Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings. |
| Tabular ResNet | June 2021 | An MLP with skip connections. |
| Feature Tokenizer Transformer | June 2021 | Transforms all features (categorical and numerical) into embeddings and applies a stack of Transformer layers to the embeddings. |
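The tokenization step described in the Feature Tokenizer Transformer entry can be sketched as follows: each numerical feature x becomes the vector x * w for a learned weight vector w, and each categorical feature becomes an embedding lookup, yielding one token per feature for the Transformer stack. The weights and tables below are toy placeholders, not learned parameters.

```python
def feature_tokenize(num_values, cat_indices, num_weights, cat_tables):
    """Feature-tokenizer sketch: scale a learned vector by each
    numerical value, look up an embedding for each categorical index,
    and return the resulting list of per-feature tokens."""
    tokens = []
    for x, w in zip(num_values, num_weights):
        tokens.append([x * wj for wj in w])          # numerical token
    for idx, table in zip(cat_indices, cat_tables):
        tokens.append(list(table[idx]))              # categorical token
    return tokens
```

The real model also prepends a CLS token and adds per-feature biases before the Transformer layers.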


| Paper | Date | Description |
|---|---|---|
| ColD Fusion | December 2022 | A method enabling the benefits of multitask learning through distributed computation without data sharing, improving model performance. |
| Are Emergent Abilities of Large Language Models a Mirage? | April 2023 | Presents an alternative explanation for emergent abilities: they are created by the researcher's choice of metrics, not by fundamental changes in model-family behaviour on specific tasks with scale. |
| Scaling Data-Constrained Language Models | May 2023 | Investigates scaling language models in data-constrained regimes. |
| An In-depth Look at Gemini's Language Abilities | December 2023 | A third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models, with reproducible code and fully transparent results. |

Literature Reviewed

Reading Lists