This Week in Machine Learning – 23 July 2021

This week in machine learning, I share a few interesting open source projects covering model deployment, a neural search framework, and a flexible data structure for machine learning datasets.

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [github][documentation]

Parallelformers helps us deploy big transformer-based models on multiple GPUs for inference. It is designed to make model parallelization easy: many Huggingface transformer models can be parallelized with a single line of code.
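Based on the project's README, usage looks roughly like the sketch below; the model id is just an example, and exact argument names may vary across versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# Any sufficiently large causal LM works; this id is just an example.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# The single line that shards the model across GPUs for inference.
parallelize(model, num_gpus=2, fp16=True)

inputs = tokenizer("Parallelformers makes deployment", return_tensors="pt")
outputs = model.generate(**inputs, max_length=40)
print(tokenizer.decode(outputs[0]))
```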

Jina: Cloud-native neural search framework for any kind of data [github]

Building a search engine is hard, especially when our data are not text. Neural search leverages state-of-the-art deep neural networks to perform information retrieval, enabling retrieval over any kind of data such as images, video, audio, and 3D meshes. Jina is a neural search framework that lets us build a neural search engine as a service quickly. It employs a distributed architecture and is cloud-native by design. Take a look at its quick demo.

Jina Architecture. Image taken from Jina’s github page.
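As a taste of the programming model, here is a tiny sketch loosely based on Jina's 2.x-era API. The character-count encoder is a stand-in of mine; a real pipeline would plug in a neural encoder and an indexer, and exact APIs may differ across versions:

```python
import numpy as np
from jina import Document, DocumentArray, Executor, Flow, requests

class CharCountEncoder(Executor):
    """Toy encoder: a real app would compute neural embeddings here."""
    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        for doc in docs:
            doc.embedding = np.array([len(doc.text)], dtype=float)

# A Flow chains Executors into a (potentially distributed) pipeline.
flow = Flow().add(uses=CharCountEncoder)
with flow:
    flow.post("/index", inputs=Document(text="hello neural search"))
```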

Meerkat: Flexible data structure for complex machine learning datasets [github][blog]

With the variety of machine learning data types (such as images, graphs, videos, time-series), a simple data abstraction for data wrangling greatly helps ML practitioners interact with high-dimensional, multi-modal data. Inspired by the Pandas DataFrame and combining the capabilities of recent data abstractions in deep learning frameworks (such as PyTorch Dataset and TensorFlow Dataset), the Meerkat DataPanel provides a simple data abstraction that offers the best of both worlds. The Meerkat DataPanel supports:

  1. Complex data types (for example: images, graphs, videos, time-series)
  2. Datasets larger than RAM, with efficient I/O under the hood
  3. Multi-modal datasets
  4. Data creation and manipulation
  5. Data selection
  6. Inspection in interactive environments

Note that PyTorch Dataset and TensorFlow Dataset support items 1-3 in the above list, whereas the Pandas DataFrame supports items 4-6. Read their blog for more detailed examples of how the Meerkat DataPanel helps us wrangle our datasets.
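A minimal sketch of what this looks like; the column names and file paths are hypothetical, and the exact API may differ from the released version:

```python
import meerkat as mk

# A DataPanel mixes simple columns with complex ones in one table.
dp = mk.DataPanel({
    "id": [0, 1, 2],
    "label": ["cat", "dog", "cat"],
    "image": mk.ImageColumn.from_filepaths(["0.jpg", "1.jpg", "2.jpg"]),
})

print(dp["label"])   # pandas-like column access
row = dp[0]          # images are loaded lazily, only when accessed
```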

That’s all for this week. See you next week.

This Week in Machine Learning – 16 July 2021

Following my study of unsupervised domain adaptation last week, I write a short note on a domain divergence survey for this week in machine learning. In domain adaptation, domain divergence measures the distance between two domains, and reducing this divergence is key to adapting machine learning models to a new target domain.

A Survey and Empirical Analysis on Domain Divergences [paper]

As a measure of the distance between two domains, domain divergence is a major predictor of performance in the target domain. Domain divergence measures can be used to predict the performance drop when a model is applied to a target domain. This paper reviews 12 domain divergence measures and reports a correlation analysis of those measures on part-of-speech tagging, named entity recognition, and sentiment analysis across more than 130 domain adaptation scenarios.

Taxonomy of Divergence measures. Figure taken from (Kashyap et al., 2021).

Domain divergence measures fall into three categories (a small numerical illustration follows the list):

  1. Geometric measures
    Geometric measures use vector space distance metrics such as cosine similarity, Manhattan distance, and Euclidean distance to measure the distance between features extracted from instances of different domains. These measures are easy to compute, but do not work well in high-dimensional spaces.
  2. Information-theoretic measures
    Information-theoretic measures capture the distance between probability distributions, for example cross entropy, f-divergences (KL and JS divergence), Renyi divergence, and Wasserstein distance.
  3. Higher-order measures
    Higher-order measures consider higher-order moment matching of random variables or divergence in a projected space. Maximum Mean Discrepancy (MMD), CMD, and CORAL are measures utilizing higher-order moment matching, whereas Proxy A-distance (PAD) uses a classifier to measure the distance between the source and target domains.

    Note that KL divergence is related to both Renyi divergence and first-order moment matching.
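To make the first two categories concrete, here is a toy computation over made-up term distributions for two domains (in practice the distributions would come from domain corpora or extracted features):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy term distributions over a shared vocabulary for two domains.
source = np.array([0.5, 0.3, 0.1, 0.1])   # e.g. news
target = np.array([0.1, 0.2, 0.4, 0.3])   # e.g. scientific text

# Geometric measure: cosine similarity between the two distributions.
cosine = source @ target / (np.linalg.norm(source) * np.linalg.norm(target))

# Information-theoretic measure: JS divergence
# (scipy returns the JS distance, i.e. the square root).
js = jensenshannon(source, target) ** 2
print(f"cosine={cosine:.3f}, JS divergence={js:.3f}")
```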

How do we apply divergence measures?

According to this survey, divergence measures can be applied in three different ways:

  1. Data selection
    One way is to use divergence measures to select a subset of the source-domain data that shares similar characteristics with the target domain, for example using cosine similarity or JS divergence to select data for POS tagging.
  2. Learning representations
    Another way is to learn new representations that are invariant to the data source. The domain adversarial neural network (DANN) uses a domain classifier as a Proxy A-distance (PAD) to measure the distance between the source and target domains. A good domain-invariant representation should confuse the domain classifier so that it cannot tell whether an input comes from the source or the target domain. Another example is using MMD to reduce the discrepancy between the representations of the two domains.
  3. Decisions in the wild
    Predicting performance drops on a target domain where no labeled data is available is an important practical problem. Previous work uses information-theoretic measures like cross entropy, Renyi divergence, and KL divergence to predict performance drops in a new domain.

Correlation of performance drops with divergence measures. PAD is the most reliable measure for indicating performance drops on the target domain. Table taken from (Kashyap et al., 2021).

The authors also run experiments to examine the correlation between divergence measures and performance drops in the target domain. More detailed experimental settings and discussion can be found in the paper. Here are some recommendations from the authors based on their experiments:

  • PAD is a reliable indicator of performance drop.
  • JS divergence is easy to compute and can be a strong baseline.
  • Cosine similarity is not a reliable indicator of performance drop.
  • One dataset is not one domain, but a cluster of representations from multiple domains.

That’s all for today. Stay safe and see you next week!

This Week in Machine Learning – 9 July 2021

This week in machine learning, I study unsupervised domain adaptation in NLP, reading a relevant survey paper in this area. Here are my short summary notes.

A Survey on Neural Unsupervised Domain Adaptation in NLP [paper][github]

In many NLP applications, labeled data is very limited. Although deep neural networks work well for supervised learning, particularly with pre-trained language models achieving state-of-the-art performance on NLP tasks, learning from unlabeled data under domain shift remains a challenge. In this setting, we have two problems to tackle:

  1. The target domain and the source domain do not follow the same underlying distribution (note that the task stays the same). For example,
    • The source domain is news and Wikipedia articles, and the target domain is scientific literature.
    • A more extreme case is cross-lingual adaptation, where the source domain is in one language and the target domain is in another.
  2. Labeled data in the target domain is scarce, so we need unsupervised domain adaptation techniques to handle this issue.

This survey summarizes relevant work on unsupervised domain adaptation, where we mitigate the domain shift by learning only from unlabeled target data. The authors also discuss the notion of domain (what constitutes a domain?) and suggest a better term: variety space.

Taxonomy of Domain Adaptation. Figure taken from (Ramponi and Plank, 2020).

Domain adaptation methods can be categorized into three groups of approaches:

  1. Model-centric approaches: redesign parts of the model by modifying the feature space or the loss function.
    • Feature-centric methods rely on feature augmentation or feature generalization.
      • Feature augmentation methods use common shared features (called pivots) to construct an aligned space, for example structural correspondence learning (SCL), spectral feature alignment (SFA), and the pivot-based language model (PBLM).
      • Feature generalization utilizes autoencoders to find good latent representations that can be transferred across domains. Stacked denoising autoencoders (SDA) and marginalized stacked denoising autoencoders (mSDA) have been used for unsupervised domain adaptation.
    • Loss-centric methods regularize or modify the model through the loss. They can be divided into two groups: domain adversaries and instance-level reweighting.
      • Domain adversarial training is probably the most popular method for neural unsupervised domain adaptation. The idea is inspired by the GAN approach, which closes the gap between real and synthetic data distributions by learning a representation from which real and synthetic data cannot be distinguished. Brought into the domain adaptation setting, a cross-domain representation is achieved if a domain classifier cannot distinguish whether the input comes from the source or the target domain. The popular domain adversarial neural network (DANN) uses a gradient reversal layer to achieve a cross-domain representation (a minimal sketch of such a layer appears after this list). A more recent approach utilizes the Wasserstein distance for the same purpose.
      • Instance-level reweighting methods assign a weight to each training instance proportional to its similarity to the target domain.
      • Other methods explicitly reweight the loss based on domain discrepancy information, such as maximum mean discrepancy (MMD) and kernel mean matching (KMM).
  2. Data-centric approaches: focus on extracting signal from the data.
    • Pseudo-labeling uses a trained model to bootstrap initial ‘pseudo’ gold labels on the unlabeled instances. Pseudo-labeling applies semi-supervised methods such as self-training, co-training, and tri-training.
    • Data selection aims to find the best-matching data for the new domain using domain similarity measures such as JS divergence or topic distributions.
    • Pre-training models leverage large unlabeled data. Starting from a pre-trained transformer model and fine-tuning on a small amount of labeled data has become standard practice in NLP due to its high performance on many tasks. In domain adaptation, pre-training can be divided into:
      1. Pre-training alone
      2. Adaptive pre-training involving multiple rounds of pre-training
        • Multi-phase pre-training: pre-training followed by two or more phases of secondary pre-training, from broad domain to specific domain to task-specific domain, for example BioBERT, AdaptaBERT, DAPT, and TAPT.
        • Auxiliary-task pre-training: pre-training followed by multiple stages of auxiliary-task pre-training involving intermediate labeled-data tasks.
  3. Hybrid approaches: the intersection of model-centric and data-centric.
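As referenced in the list above, here is a minimal PyTorch sketch of a gradient reversal layer, the mechanism DANN uses to push the encoder toward domain-invariant features:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradients flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: features = encoder(x); domain_logits = domain_clf(grad_reverse(features)).
# The domain classifier learns to tell domains apart, while the reversed
# gradients train the encoder to make that impossible.
```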

One of the authors, Barbara Plank, created a Github repo to collate all relevant papers in this area.

That’s all. Stay safe and see you next week.

This Week in Machine Learning – 2 July 2021

Hello, I am sharing the einops library, an MLOps course, and the ACM Turing Lecture on deep learning for AI for this week in machine learning.

Writing better deep learning code with einops [github][website]

Einops (Einstein-inspired Notation for Operations) is a simple, flexible, and powerful tensor operation library for writing more readable code. The library supports many backends, such as numpy, pytorch, tensorflow, jax, gluon, tf.keras, cupy, chainer, and mxnet.

Using this library, we can focus on the input and output interface of tensor operations rather than how they are computed, making our code more readable and easier to maintain. It also reduces the chance of making unnecessary errors. Watch the short video below, and the pytorch-with-einops comparison, to get a better sense of how to make our code more readable with this library.
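A small example of the style, with shapes shown in comments:

```python
import numpy as np
from einops import rearrange, reduce

x = np.random.rand(32, 3, 64, 64)  # batch, channels, height, width

# Reshape into 8x8 patches by naming axes,
# instead of juggling .transpose/.reshape calls.
patches = rearrange(x, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=8, p2=8)

# Global average pooling, readable at a glance.
pooled = reduce(x, "b c h w -> b c", "mean")
print(patches.shape, pooled.shape)  # (32, 64, 192) (32, 3)
```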

MLOps Course [website][github]

Goku Mohandas created a project-based course on how to apply ML to build a production-grade product. The course walks through the product development and iteration cycle. It covers product planning, data transformation, modeling, reproducibility, scripting, testing, and production. More details on lesson highlights can be seen in Goku’s tweet thread.

MLOps Course by Goku Mohandas.

ACM Turing Lecture: Deep Learning for AI [article]

In this article, Bengio, LeCun, and Hinton review the rise of deep learning in AI, describe recent advances, and discuss future directions. The rise of deep learning is attributed to deep architectures, unsupervised pre-training, and the success of rectified linear units (ReLUs). Deep learning also led to breakthroughs in speech and object recognition. Several recent advances are briefly mentioned in the article: attention and the transformer architecture, unsupervised and self-supervised learning, contrastive learning, and variational auto-encoders.

The discussion of the future of deep learning suggests several improvements based on current AI limitations compared to human learning, such as the ability to generalize from fewer trials, and adaptability and robustness to changes in distribution (out-of-distribution generalization). In addition, applying deep learning to tasks that require a deliberate sequence of steps is another exciting future direction.

Stay safe and see you next week!

This Week in Machine Learning – 25 June 2021

This week in machine learning features the NAACL 2021 best short paper, which is related to PET, discussed last week. Let’s see the main ideas of this paper.

How Many Data Points is a Prompt Worth? [paper][blog][github]

The standard approach to transfer learning with a pretrained model for classification is to attach a head layer that takes the pretrained representation and predicts the output class. The alternative approach is to use a prompt: reformulate the task into a task-specific string and ask the model to produce a textual output corresponding to a class label. The prompt-based approach provides a more flexible way to inject extra task-specific guidance when fine-tuning the model. PET (which we discussed last week) has shown the effectiveness of prompting in low-data regimes.

This NAACL 2021 best short paper presents a very nice analysis comparing the head-based and prompt-based approaches, aiming to quantify how many data points a prompt is worth. The authors take RoBERTa-large and compare the performance of the head-based and prompt-based approaches across different amounts of available data (starting from 10 data points and increasing exponentially). For the prompt-based approach, they follow the PET model (Schick and Schütze, 2021).

Using both the head-based and prompt-based performance curves across all data points, the data advantage is estimated by first isolating the y-axis band between the lowest and the highest accuracy at which the two curves match (shown as the cross-hatched region in the figure below). Then, the area between the two linearly-interpolated curves, divided by the height of the band, gives the average data point advantage.

Prompting versus Head classifier performance on SuperGLUE. The highlighted cross-hatched region shows the data advantage. Figure taken from (Scao and Rush, 2021).
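To make the metric concrete, here is a toy sketch with hypothetical accuracy curves; the average data advantage is the area between the inverted curves divided by the band height:

```python
import numpy as np

# Hypothetical accuracy curves: accuracy as a function of dataset size n.
n = np.array([10, 30, 100, 300, 1000, 3000])
acc_head = np.array([0.52, 0.58, 0.66, 0.74, 0.80, 0.84])
acc_prompt = np.array([0.60, 0.67, 0.73, 0.78, 0.82, 0.85])

# The y-axis band where both curves cover the same accuracies.
lo = max(acc_head.min(), acc_prompt.min())
hi = min(acc_head.max(), acc_prompt.max())

# For each accuracy level y in the band, invert the curves by linear
# interpolation to find how many data points each approach needs.
ys = np.linspace(lo, hi, 200)
n_head = np.interp(ys, acc_head, n)
n_prompt = np.interp(ys, acc_prompt, n)

# Average horizontal gap = area between curves / band height.
advantage = np.trapz(n_head - n_prompt, ys) / (hi - lo)
print(f"the prompt is worth ~{advantage:.0f} data points on average")
```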

Experimental results on the SuperGLUE benchmark show a prompt-based data advantage on all tasks except WiC. The advantage is most notable in the low-data regime.

The authors also analyze the impact of the verbalizer in the prompt-based approach. To do this, they introduce a null verbalizer (a random first name replacing “yes”, “no”, “maybe”, “right”, “wrong”). They find that the verbalizer matters especially when training data is scarce: the null verbalizer reduces prompt-based performance in the low-data regime. However, with more training data, the prompt-based model can adapt to the null verbalizer.

Comparison of full prompt and null verbalizer. Figure taken from (Scao and Rush, 2021).

That’s all for this week. Stay safe and see you next week!

This Week in Machine Learning – 18 June 2021

We have the NAACL 2021 best paper, a great survey of Transformer models, a free NLP course, and an open source NLP library for machine learning this week. Let’s take a look.

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners [paper][blog]

The GPT-3 model has shown superior few-shot performance on many NLP tasks. However, GPT-3 is gigantic, and training it requires computing power out of reach for most people. This NAACL 2021 best paper proposes an effective alternative: training few-shot learners using small language models. The main idea is to reformulate tasks as cloze questions and perform gradient-based learning with a few of these cloze-question training examples.

Reformulating sentiment classification as a cloze question. Figure taken from Schick’s blog.

The algorithm is called pattern-exploiting training (PET). It needs two components:

  1. Pattern: a conversion of input into a cloze question
  2. Verbalizer: an expression of the output using one or more words.

In the figure above, we convert the sentence “Excellent pizza!” into “Excellent pizza! It was ____”. Then we ask the model to verbalize “good” for positive sentiment or “bad” for negative sentiment.
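Here is a minimal sketch of the cloze-scoring step with a Huggingface masked language model. PET itself also fine-tunes on the patterns and distills over several pattern-verbalizer pairs, so this only shows the zero-shot scoring idea:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

pattern = "Excellent pizza! It was <mask>."             # the pattern
verbalizer = {"positive": " good", "negative": " bad"}  # label -> word

inputs = tokenizer(pattern, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
logits = model(**inputs).logits[0, mask_pos]

# Score each label by the logit of its verbalizer token at the mask.
scores = {label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
          for label, word in verbalizer.items()}
print(max(scores, key=scores.get))
```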

The paper also presents extensive experiments identifying the factors contributing to strong PET performance, such as the choice of patterns and verbalizers, the use of both unlabeled and labeled data, and properties of the underlying language model.

A Survey of Transformers [paper]

This paper presents a comprehensive and detailed review of Transformer variants. The authors organize their survey into architectural modifications, pre-training, and applications. Highly recommended reading for anyone who wants to study various aspects of Transformer models.

Categorization of Transformer variants. Note the color grouping on Transformer module. Figure taken from (Lin et al., 2021).
The taxonomy of Transformers introduced in the paper. Figure taken from (Lin et al., 2021).

At the end of the paper, the authors emphasize several directions for further improvement of Transformers:

  1. Lack of theoretical analysis of Transformer models.
  2. Better global interaction mechanisms beyond attention. A few papers I mentioned in a blog post a few weeks ago show efforts in this area.
  3. A unified framework for multimodal data.

Hugging Face Course [website]

One of the most popular NLP libraries just released the first part of the free Hugging Face course on natural language processing using the Hugging Face ecosystem. The first part focuses on Transformer models in Hugging Face, including how to fine-tune and share them.

Overview of Hugging Face course. Figure taken from the Hugging Face course website.

The second and third parts will focus on the datasets and tokenizers libraries, then dive deep into NLP tasks and developing specialized architectures.

Graph4NLP [github]

Graph4NLP is an easy-to-use library for deep learning on graphs, focused on NLP research. It is built upon the highly-optimized runtime of DGL (Deep Graph Library). The library helps users apply GNNs to various NLP tasks such as text classification, semantic parsing, neural machine translation, summarization, knowledge graph construction, and natural language generation.

Graph4NLP architecture. Figure taken from Graph4NLP github page.

The figure below shows the computing flow of Graph4NLP, which reflects the standard workflow for using graph-based deep learning approaches in NLP tasks.

Graph4NLP computing flow. Figure taken from Graph4NLP github page.

A more in-depth tutorial, including recent advances in deep learning on graphs for NLP, can be found in their NAACL 2021 Deep Learning on Graphs for Natural Language Processing tutorial slides.

That’s all for this week. How’s your machine learning week? Stay safe and see you next week.

This Week in Machine Learning – 11 June 2021

My machine learning this week is about long text generation. I also encountered an open source computer vision framework for self-supervised learning, and a course on reproducible machine learning. Let’s take a look.

Progressive Generation of Long Text with Pretrained Language Models [paper][github]

When generating long text, current large-scale fine-tuned pretrained language models exhibit issues such as excessive repetition or incoherence between sentences that are far apart. One strategy for long text generation is content planning before the actual generation, but content planning may require another task, such as summarization or semantic role labeling, which demands additional resources and labeled data. More recent non-monotonic generation approaches require specialized architectures, making them hard to integrate with existing pretrained language models.

This paper proposes a simple strategy for long text generation (~1000 tokens) that can still leverage existing pretrained models. The main idea is to progressively generate text in multiple stages, breaking complex text generation into multiple easier steps, from a skeleton to a full passage. Each stage refines the output of the previous stage by adding finer details. The algorithm is called Progressive Generation (ProGen).

Progressive Generation (ProGen) for long text generation. Figure taken from (Tan et al., 2021).

To achieve this, ProGen uses limited vocabularies in earlier stages based on word importance: for example, the first stage uses only the 15% most important words in the corpus, the second stage the 20% most important, and so on. The training data for each stage can then be easily constructed by dropping all tokens that do not belong to that stage’s vocabulary. We can add noise to improve robustness and reduce data bias. Finally, we fine-tune a pretrained language model for each stage.

ProGen algorithm. Algorithm section taken from (Tan et al., 2021).
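A toy sketch of the stage-data construction; the paper ranks words by a TF-IDF-based importance score, while this sketch uses raw frequency as a stand-in:

```python
from collections import Counter

def stage_vocab(corpus_tokens, fraction):
    """Top `fraction` of word types by frequency, standing in for the
    paper's TF-IDF-based word importance."""
    counts = Counter(corpus_tokens)
    k = max(1, int(len(counts) * fraction))
    return {w for w, _ in counts.most_common(k)}

def to_stage(tokens, vocab):
    # Build a stage's training text by dropping out-of-vocabulary tokens.
    return [t for t in tokens if t in vocab]

corpus = "the cat sat on the mat and the cat purred".split()
for frac in (0.15, 0.45, 1.0):  # progressively larger stage vocabularies
    print(frac, to_stage(corpus, stage_vocab(corpus, frac)))
```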

We can see that ProGen is conceptually very similar to non-monotonic generation, but generation is still performed in an autoregressive, left-to-right fashion, making it more compatible with existing pretrained language models. Compared to content planning strategies, ProGen has more flexible intermediate stages; typical content planning strategies have only two.

Example of 3-stage long text generation on CNN News domain. Figure taken from (Tan et al., 2021).

Lightly [github]

Schematic framework of Lightly

Lightly is a computer vision framework for self-supervised learning. It helps us curate datasets better, find edge cases, and improve data quality. The framework supports several models such as MoCo, SimCLR, SimSiam, Barlow Twins, BYOL, and NNCLR.

Reproducible Deep Learning Course [website]

I stumbled upon a great resource: a practical reproducible deep learning course by Simone Scardapane. The course covers the tools you need for a reproducible deep learning workflow, such as git, hydra, data version control, docker, weights & biases, and continuous integration. Great course by Simone.

Reproducible Deep Learning Course

That’s all for this week. How’s your machine learning week? Leave a comment below.

Stay safe and see you next week!

This Week in Machine Learning – 4 June 2021

Hello, my machine learning reading this week focuses on alternative architectures to self-attention in the Transformer. In particular, I read three papers: FNet, gMLP, and the Attention-Free Transformer (AFT). Let’s see the main ideas in these papers.

FNet: Mixing Tokens with Fourier Transforms [paper]

The self-attention mechanism in the Transformer architecture serves as an inductive bias, connecting each input token to every other token with learned weights. The mechanism has proven very effective, achieving state-of-the-art results on many NLP tasks. However, standard self-attention requires time and memory quadratic in the sequence length, limiting its applicability to long sequences. There have been several efforts to reduce this quadratic complexity by sparsifying or compressing the attention matrix (see the Efficient Transformers survey by Tay et al., 2020).

This paper proposes a different approach: replacing the self-attention sublayer in the Transformer architecture with a Fourier transform. The Fourier transform can be computed quickly on GPUs and TPUs, and it carries no learned weights, making the model smaller than a standard Transformer. The resulting architecture, FNet, is thus faster and uses less memory than a standard Transformer, and it can handle long sequences.

FNet encoder architecture, replacing self-attention block with Fourier transformation block. Figure taken from (Lee-Thorp et al., 2021).
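The token-mixing step is simple enough to sketch in a few lines of PyTorch, following the paper: FFT along the hidden dimension, then along the sequence dimension, keeping only the real part:

```python
import torch

def fourier_mixing(x):
    """FNet-style token mixing: 2D FFT over hidden then sequence dimensions,
    keeping only the real part. No learned parameters."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(8, 128, 256)  # batch, sequence, hidden
mixed = fourier_mixing(x)     # same shape, tokens now mixed
```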

The authors also build another model (the linear encoder), replacing the self-attention sublayer with two linear sublayers. It was originally introduced as a baseline for FNet, but it surprisingly performs very well too. They also experiment with a hybrid FNet-attention model, replacing the final two Fourier sublayers of FNet with self-attention sublayers. The hybrid model improves on the base FNet with a small additional training cost. The improvement indicates that self-attention is important, but we do not need many self-attention sublayers to achieve strong performance. Here is a great YouTube video by Yannic Kilcher explaining FNet.

Yannic Kilcher’s FNet Explanation

Pay Attention to MLPs [paper]

gMLP is another approach to replacing the Transformer, using MLP layers with gating. gMLP consists of channel projections and a spatial projection with multiplicative gating. This alternative is simpler and cheaper to compute than the self-attention layer in a Transformer.

gMLP architecture. Figure taken from (Liu et al., 2021)

The main component of gMLP is the spatial gating unit, which captures spatial interactions, offering an alternative mechanism for capturing higher-order relationships. The authors perform several experiments comparing gMLP with the Vision Transformer (ViT), DeiT (ViT with improved regularization), and BERT. They also study the importance of gating in gMLP, its scaling properties, and the usefulness of adding a tiny self-attention module. Attaching a tiny single-head self-attention to the gating function enables gMLP to outperform a Transformer of similar capacity.

aMLP architecture: gMLP with tiny attention. Figure taken from (Liu et al., 2021).
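Here is a rough PyTorch sketch of the spatial gating unit described above; the split, normalization, and near-identity initialization follow the paper's description, but details may differ from the official implementation:

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Sketch of gMLP's spatial gating unit (Liu et al., 2021):
    split channels, mix one half across the sequence, gate multiplicatively."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)  # mixes tokens
        nn.init.zeros_(self.spatial_proj.weight)         # start near identity
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x):            # x: (batch, seq_len, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                 # (batch, seq_len, dim // 2)
```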

Attention-Free Transformer [paper]

This paper replaces the standard multi-head attention in the Transformer with a more efficient element-wise operation between the query and a combination of key and value. This replacement brings the computation down to linear time and space complexity with respect to input and model size. The Attention-Free Transformer (AFT) performs a weighted average of the values (V), with weights derived from the keys (K) and a set of learned pairwise position biases.

Illustration of Attention-Free Transformer (AFT). Figure taken from (Zhai et al., 2021).
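A naive PyTorch sketch of the AFT-full operation; it materializes an O(T^2) intermediate tensor and skips the numerical stabilization a real implementation would need:

```python
import torch

def aft_full(q, k, v, w):
    """Sketch of AFT-full (Zhai et al., 2021).
    q, k, v: (batch, seq, dim); w: (seq, seq) learned position biases."""
    # weights[b, t, t', d] = exp(K[b, t', d] + w[t, t'])
    weights = torch.exp(k).unsqueeze(1) * torch.exp(w).unsqueeze(0).unsqueeze(-1)
    num = (weights * v.unsqueeze(1)).sum(dim=2)   # weighted sum of values
    den = weights.sum(dim=2)                      # normalization per position
    return torch.sigmoid(q) * num / den           # element-wise gate by query
```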

These papers show recent efforts to improve the efficiency of Transformer models by modifying or replacing their most expensive component: the self-attention layer.

That’s all for this week. Enjoy and see you next week!

This Week in Machine Learning – 28 May 2021

This week in machine learning is filled with GAN-related papers from Nvidia. DatasetGAN shows the usefulness of GANs (generative adversarial networks) in generating simulated training data at scale. Another paper from Nvidia shows a nice GAN application for controllable 3D rendering from a single 2D image. Let’s take a look at the idea behind each paper.

DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [website][paper]

The availability of labeled data is one big bottleneck for building effective machine learning systems. There is a huge amount of unlabeled data out there, but high-quality labeled datasets are still limited because creating them takes a huge amount of time, human resources, and money.

DatasetGAN is a simple semi-supervised approach to synthesizing massive, high-quality datasets with minimal human annotation. It shows that we can synthesize high-quality semantically segmented images using a GAN, exploiting the semantic knowledge (such as viewpoint and object identity) encoded in its high-dimensional latent space.

DatasetGAN generates large high-quality labeled image datasets leveraging StyleGAN and a handful of human-annotated images. Figure taken from (Zhang et al., 2021).

DatasetGAN uses StyleGAN as the rendering engine, then adds a style-interpreter component to synthesize labels from StyleGAN latent vectors. The style interpreter consists of an ensemble of MLP classifiers that predict the label of each pixel based on features from the upsampled StyleGAN latent vectors.

Architecture of DatasetGAN. Figure taken from (Zhang et al., 2021).

The authors perform experiments on images of bedrooms, cars, heads, birds, and cats for semantic segmentation and keypoint detection. They also showcase an application of this approach to animatable 3D object reconstruction from an image.

Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering [paper][blog]

This paper reconstructs 3D objects from 2D images, a task often known as inverse graphics. Most existing approaches rely on the availability of 3D labels to train a model. The authors present a way to extract and disentangle 3D knowledge learned by a GAN using differentiable rendering. With this approach, they obtain high-quality 3D renderings with low annotation effort.

StyleGAN generates synthetic data efficiently, and an inverse graphics network predicts 3D properties from images. Figure taken from (Zhang et al., 2021).

The approach aims to disentangle camera viewpoint, shape, texture, light, and background. It combines two state-of-the-art components. In the first, StyleGAN is used as a multi-view data generator. Each layer in StyleGAN controls a different image attribute: early layers control viewpoint, while intermediate and later layers control shape, texture, and background. They leverage this knowledge and use the first four layers of StyleGAN to disentangle camera viewpoints.

The second component is the inverse graphics neural network DIB-R (Chen et al., 2019), which predicts 3D shape and texture from an image. These 3D properties are then mapped back into StyleGAN latent vectors, allowing them to control StyleGAN’s rendering based on specific properties.

3D properties are mapped into StyleGAN, enabling controllable 3D rendering. Figure taken from (Zhang et al., 2021).

The model is still unable to predict lighting correctly, and disentangling the background remains a challenge. Moreover, predicting shapes of out-of-distribution objects (such as the Batmobile or the Flintstones car) is another significant challenge.

3D reconstruction from a monocular photograph.

That’s all for this week. Stay safe, and see you next week!

This Week in Machine Learning – 21 May 2021

Hello everyone, this week I read a survey paper on number representation in NLP, encountered a GPT-3 replica in the huggingface model hub, and looked at Project Starline. Here are a few short notes on those.

Representing Numbers in NLP: a Survey and a Vision [paper]

Numbers are an important and integral part of text, yet they rarely get special consideration when processing text. Many NLP systems treat numbers as ordinary words, and subword tokenization such as BPE breaks numbers into arbitrary tokens: for example, 1234 might be split into 1-234, 12-34, or 123-4.
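We can observe this arbitrariness directly with a standard tokenizer; the exact splits depend on the learned BPE merges:

```python
from transformers import AutoTokenizer

# GPT-2's BPE vocabulary; other subword tokenizers behave similarly.
tok = AutoTokenizer.from_pretrained("gpt2")
for number in ["1234", "56789", "3.14159"]:
    print(number, tok.tokenize(number))
```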

This paper surveys recent numeracy work in NLP and categorizes it into seven numeracy tasks:

  1. Simple arithmetic: arithmetic operation such as addition, subtraction over number alone.
  2. Numeration: decoding a string form to its numeric value.
  3. Magnitude comparison: ability to perform comparison on two or more numbers.
  4. Arithmetic word problems: ability to perform simple arithmetic posed as natural language word problems.
  5. Exact facts: understanding numbers in commonsense knowledge.
  6. Measurement estimation: ability to approximately estimate measurements of objects along certain dimensions.
  7. Numerical language modeling: making numeric predictions in completing text.

Numeracy tasks categorized by granularity and units. Table taken from (Thawani et al., 2021).

The numeracy tasks are categorized following two dimensions:

  1. Granularity. Whether the encoding of the number is exact or approximate.
  2. Units. Whether the numbers are abstract or grounded.

The authors group number representations into string-based and real-based. String-based representations treat numbers as strings, with several tweaks. Real-based representations perform computation using the numerical value of the number. A detailed summary of each representation can be found in the paper.

Numeracy in NLP. Table taken from (Thawani et al., 2021)

The authors present a few practical takeaways to guide the design of number representations:

  • For string-based representations:
    • Scientific notation is superior to decimal notation.
    • Character-level tokenization outperforms subword-level tokenization.
  • For real-based representations:
    • A log scale is preferred over a linear scale, as suggested by the cognitive science literature (though this lacks rigorous study).
    • Binning (with a dense cross-entropy loss) works better than continuous value prediction (with an MAE loss).

They also call for a unified and more holistic solution to numeracy, including a benchmark covering different numeracy subtasks to incentivize research progress in this area.

GPT-Neo: GPT-3 Replica by EleutherAI [model][article]

OpenAI’s GPT-3 has been powering many applications across domains from creativity to productivity. Through the OpenAI API, GPT-3 generates about 4.5 billion words per day. However, access to the OpenAI API is not free and is still limited to a few companies and developers, and training GPT-3 from scratch demands computing power most people cannot afford.

EleutherAI replicated the GPT-3 architecture and trained the model on the Pile. The Pile is an 825GB open source dataset for language modeling, combining text from 22 sources such as English Wikipedia, OpenWebText2, PubMed Central, ArXiv, Github, Stack Exchange, Ubuntu IRC, and the US Patent and Trademark Office.

Treemap of Pile data sources showing the size of each source. Figure taken from (Gao et al., 2020).

The trained model is called GPT-Neo. With 2.7B parameters, it is comparable to the smallest GPT-3 model. GPT-Neo is a great free alternative to GPT-3, and it is available in the huggingface model hub.
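Using it is as simple as pointing the transformers pipeline at the hub id, something like:

```python
from transformers import pipeline

# The model id on the huggingface hub; the weights are a large download.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")
print(generator("The Pile is", max_length=30)[0]["generated_text"])
```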

Project Starline [blog]

Imagine having a long-distance conversation with a loved one, but seeing the person in real-life size and three dimensions through a sort of magic window. Google Research applies computer vision, machine learning, spatial audio, and real-time compression, together with a light field display system, to create a magic window that gives a sense of volume and depth without the need for additional headsets or glasses. The result feels like the person is right in front of you, just like they are really there.

Project Starline combines 3D imaging, real-time compression, and 3D display to create a “magic window”.

That’s all for this week. Stay safe and see you next week!