This Week in Machine Learning – 28 May 2021

This week in machine learning is filled with GAN-related papers from Nvidia. DatasetGAN shows the usefulness of GAN (Generative Adversarial Network) in generating simulated training data at scale. Another paper from Nvidia shows a nice GAN application for controllable 3D rendering from a single 2D image. Let’s take a look at the idea on each paper.

DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [website][paper]

The availability of labeled data is one big bottleneck to have an effective machine learning system. There are huge amount of unlabeled data out there, but high-quality labeled datasets are still limited because it takes huge amount of time, human resources, and money to create such datasets.

DatasetGAN is simple semi-supervised approach to synthesize massive large high-quality dataset with minimal human annotation. It shows that we can synthesize high-quality semantically segmented images using GAN, utilizing semantic knowledge (such as viewpoint, object identity) in its high dimensional latent space.

DatasetGAN generates large high-quality labeled image datasets leveraging StyleGAN and a handful human-annotated images. Figure taken from (Zhang et al., 2021).

DatasetGAN uses StyleGAN as the rendering engine, and then adding style-interpreter component to synthesize labels from StyleGAN latent vectors. Style interpreter consists of an ensemble of MLP classifiers to predict the label on each pixel based on the features from the upsampled StyleGAN latent vectors.

Architecture of DatasetGAN. Figure taken from (Zhang et al., 2021).

The authors perform experiments on bedrooms, cars, heads, birds, and cats images for semantic segmentation, and keypoint detection. They also showcase an application of this approach to perform animatable 3D object reconstruction from an image.

Image GANs Meet Differentiable Rendering for Inverse Graphics and Interpretable 3D Neural Rendering [paper][blog]

This paper reconstructs 3D objects from 2D images, often known as inverse graphics task. Most of existing approaches rely on the availability 3D labels to train a model. The authors of this paper present a way to extracts and disentangle 3D knowledge learned from GAN using differentiable rendering. Using this approach, they are able to obtain high-quality 3D rendering with low annotation effort.

StyleGAN to generate synthetic data efficiently, and Inverse Graphic Network to predicts 3D properties from images. Figure taken from (Zhang et al., 2021).

The approach aims to disentangle camera viewpoints, shapes, texture, light, and background. It combines two state-of-the art renderer as main components. In the first component, they utilize StyleGAN as multi-view data generator. Each layer in StyleGAN controls different image attribute, for example early layers control viewpoint, while intermediate and high layers control shape, texture, and background. They leverage on this knowledge, and uses first four layers of StyleGAN to disentangle camera viewpoints.

The second component is inverse graphics neural network DIB-R (Chen et al., 2019) to predict 3D shapes and texture from an image. This model enables them to detect 3D properties from image. This 3D properties are then mapped back into StyleGAN latent vectors, allowing them to control the StyleGAN rendering based on certain properties.

3D properties are mapped into StyleGAN, enabling controllable 3D rendering. Figure taken from (Zhang et al., 2021).

The model is still unable to predict correct lighting, and disentangling background is still a challenge. Moreover, predicting shapes of out-of-distribution objects (such as Batmobile and Flintstone car) is also another significant challenge.

3D reconstruction from monocular photograph.

That’s all for this week. Stay safe, and see you next week!

This Week in Machine Learning – 21 May 2021

Hello everyone, I read a survey paper on number representation in NLP, encounter a GPT-3 replica model in huggingface model hub, and project Starline. Here few short notes on those.

Representing Numbers in NLP: a Survey and a Vision [paper]

Numbers are important and integral part of text, and yet numbers rarely get special consideration when processing text. Many NLP systems are commonly treated numbers as words, and subword tokenization such as BPE breaks numbers into arbitrary tokens, for example 1234 might be splitted into 1-234 or 12-34 or 123-4.

This paper surveys recent numeracy work in NLP and categorized them into seven numeracy tasks:

  1. Simple arithmetic: arithmetic operation such as addition, subtraction over number alone.
  2. Numeration: decoding a string form to its numeric value.
  3. Magnitude comparison: ability to perform comparison on two or more numbers.
  4. Arithmetic word problems: ability to perform simple arithmetic from a composition.
  5. Exact facts: understanding numbers in commonsense knowledge.
  6. Measurement estimation: approximately guess measure of objects along certain dimensions.
  7. Numerical language modeling: making numeric predictions in completing text.
Numeracy tasks categorized into granularity and units. into Table taken from (Thawani et al., 2021).

The numeracy tasks are categorized following two dimensions:

  1. Granularity. Whether the encoding of the number is exact or approximate.
  2. Units. Whether the numbers are abstract or grounded.

The author groups number representation into string-based and real-based representation. String-based representation treat numbers as strings, with several tweaks. Real-based representation performs computation using the numerical value of the number. Detail summary on each representation can be read in the paper.

Numeracy in NLP. Table taken from (Thawani et al., 2021)

The authors present few practical takeaways to guide the design of number representation:

  • For string-based representation:
    • Scientific notation is superior to decimal notation
    • Character level tokenization outperforms subword level tokenization.
  • For real-based representation:
    • Log scale is preferred over linear scale as inspired by cognitive science literature (but it lacks of rigorous study)
    • Binning (dense cross entropy loss) works better than continuous value prediction (MAE loss)

They also call for unified and more holistic solution to numeracy. This involves a benchmark covering different numeracy subtasks to incentive research progress in numeracy.

GPT-Neo: GPT-3 Replica by EleutherAI [model][article]

OpenAI GPT-3 has been powering many applications in various domains from creativity to productivity. Through OpenAI API, GPT-3 generates about 4.5 billions words per day. However, access to the OpenAI API is not free and still very limited to few companies and developers. And, training GPT-3 from scratch demands a lot of computing power that most people cannot afford.

EleutherAI replicates GPT-3 architecture and trains the model on the Pile dataset. The Pile dataset is 825GB open source data for language modeling, combining texts from 22 sources such as English Wikipedia, OpenWebText2, PubMed Central, ArXiv, Github, Stack Exchange, Ubuntu IRC, and the US Patent and Trademark Office.

Treemap of Pile data sources showing size of each sources. Figure taken from (Gao et al., 2020).

The trained model is called GPT-Neo. The model has 2.7B parameters, and it is comparable to the smallest GPT-3 model. This model is a great free alternative to GPT-3, and it is available in the huggingface model hub.

Project Starline [blog]

Imagine having a long-distance conversation with your loved one, but you can see the person in real-life size, and three dimensions though a sort of magic windows. Google research applies technologies in computer vision, machine learning, spatial audio and real time compression, as well as light field display system to create a magic window that gives a sense of volume and depth without the needs for additional headsets or glasses. The result is a feeling of a person right in front of you, just like he/she is right there.

Project Starline combines 3D imaging, real time compression, and 3D display to create “Magic Window”

That’s all for this week. Stay safe and see you next week!

This Week in Machine Learning – 14 May 2021

Hello! Hope you have a great week. I encounter an interesting machine learning papers, one course and one open source project this week:

Emerging Properties in Self-supervised Vision Transformers [paper][github][blog post]

Self-supervised Vision Transformer with no supervision. Figure taken from (Caron et al., 2021).

Recent Vision Transformers (ViT) model, adopting Transformer model in NLP, has shown promising results toward generic and scalable architectures for computer vision tasks. This paper study self-supervised ViT model and discuss two emerging properties:

  1. Self-supervised ViT features contain explicit information about semantic segmentation of an image.
  2. Self-supervised ViT features also an excellent k-NN classifiers.
The Vision Transformer treats an image as a sequence of patches, analogous to a series of word embeddings in NLP Transformer model. Figure taken from Nabil’s blog post.

From the findings, the authors develop a self-supervised learning framework called DINO (Knowledge Distilation with no labels). As indicated in the name, the framework uses knowledge distillation strategy to train the model. But instead of using pre-trained model as teacher and running knowledge distillation as post processing step to self-supervised pre-training, the teacher network also performs distillation from student network using self-supervision objective. In other word, both student and teacher network are doing codistillation.

Nabil Madali has a great blog post discussing more detail about this paper.

Machine Learning Engineering for Production (MLOps) Specialization [url]

Coursera just launched a new course for building production end-to-end ML systems. Bringing machine learning models to production systems involves many tasks such as discovering data issue and data drift, conducting error analysis, managing computation and scaling. MLOps course discusses how to conceptualize, build, and maintain integrated machine learning systems that continuously operate in production. You will get yourself familiar with the capabilities, challanges, and consequences of machine learning in production.

Course website:

Opyrator: Quickly Turn Machine Learning Codes into Microservices [github]

Figure taken from Opyrator Github Repo.

This open source project combines FastAPI, Streamlit, and pydantic to quickly make your python functions into production-ready microservices. It utilizes FastAPI to automatically generate HTTP API, and Streamlit to automatically generate a web UI. A very useful tool to quickly showcase your machine learning models.

Opyratory demo website:

Figure taken from Opyrator Github Repo.

Stay safe, and see you next week!

This Week in Machine Learning – 7 May 2021

Hello everyone! Starting this week, I am going to summarize my notes in a weekly review post. Here are five machine learning projects / resources / research papers / softwares that I find interesting to explore this week:

Geometric Foundations of Deep Learning [website] [blog post][paper][talk]

This paper outlines geometric unification of a broad class of machine learning problem, providing a common mathematical framework to derive the most successful neural network architectures such as CNNs, RNNs, GNNs, and Transformers. The work is motivated by Felix Klein’s Erlangen Programme which approaches geometry as the study of invariants.

In this light, the authors study symmetries, a certain type of transformation that preserves an object or a structure or a system, and show a general blueprint of Geometric Deep Learning which typically consists of a sequence of equivariant layers, followed by an invariant global polling.

Geometric deep learning blueprint. Figure taken from (Bronstein et al., 2021).
Geometric deep learning architectures. Figure taken from (Bronstein et al., 2021).

The general blueprint can be applied to different types of geometric domains such grids, groups (global symmetry transformations in homogeneous space), graphs, geodesics (metric structures on manifolds), and gauges (local reference frames defined on tangent and vector bundles).

Five geometric domains. Figure taken from (Bronstein et al., 2021).

Explainable AI Cheat Sheet [website]

Jay Alammar creates a cheat sheet for Explainable AI. As more and more machine learning models being deployed in mission critical and high-stake applications such as medical diagnosis, it is important to ensure that the models make decision for the right reason. Jay categorizes explainable AI into five key categories:

  1. Interpretable models by design such as KNN, linear models, logistic regression
  2. Model agnostic methods, for example: SHAP, LIME, and pertubation
  3. Model specific methods, for example: using attention, gradient saliency, and integrated gradients
  4. Example based methods (to uncover insight about a model) using adversarial examples, counterfactual explanations, and influence functions.
  5. Neural representation methods: feature visualization, activation maximization, SVCCA, TCAV, and probes.

Check the website for the links to relevant papers on each key categories.

Brief overview of Explainable AI Cheat Sheet

MOOC for getting started with scikit-learn [website][github repo]

This MOOC is developed by scikit-learn core developers. It offers an in-depth introduction to predictive modeling using scikit-learn. The course covers the whole pipeline of predictive modeling including data exploration, modeling (using linear model, decision tree, and ensemble models), hyperparameters tuning, and model evaluation. Highly recommended for beginners!

MixingBoard from Microsoft Research [paper][github]

MixingBoard is an open-source platform for quickly building knowledge grounded stylized text generation demos, unifying text generation algorithms in a shared codebase. It also provides CLI, web, and RESTful API interface. The platform has several modules to build a text processing assistant and conversational AI demos. Each module tackles a specific task needed to build the demos such as conditioned text generation, stylized generation, knowledge grounded generation, and constrained generation.

GPT2, DialoGPT, and SpaceFusion can be utilized for conditioned text generation. StyleFusion enables stylized generation via latent interpolation using soft-edit and soft-retrieval strategy. For knowledge grounded generation, it combines knowledge passage retrieval, machine reading comprehension using BERT, content transfer, and knowledge grounded response generation. Finally, hard or soft constraint can be used for constrained generation during decoding stage to encourage the generated texts contain the desired phrases.

The architecture of MixingBoard, composed of basic tools, algorithms, tasks into integrated demos. Figure taken from (Gao et al., 2021).

The Web Conference 2021 Best Paper: Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach. [paper][project page][talk]

The best paper of the Web Conference 2021 addresses an important mental health care. The authors present a great application of text rewriting, transforming low-empathy conversational posts to higher empathy. The task facilitates an empathic conversation which rarely expressed in online mental health support. A reinforcement learning agent, PARTNER, is developed to perform sentence-level edits for more empathic conversational posts.

PARTNER observes seeker and response posts, and performs two actions:

  1. Determine a position in the response span for insertion or replacement.
  2. Generate candidate empathic sentences.

It uses four reward functions aim to increase empathy, maintain text fluency, sentence coherence, context specificity, and diversity.

PARTNER: a deep reinforcement learning agent for empathic rewriting. Figure taken from (Sharma et al., 2021).

Hope you enjoy this post. See you next week!