This Week in Machine Learning – 14 May 2021

Hello! Hope you have a great week. I came across one interesting machine learning paper, one course, and one open source project this week:

Emerging Properties in Self-supervised Vision Transformers [paper][github][blog post]

Self-supervised Vision Transformer with no supervision. Figure taken from (Caron et al., 2021).

The recent Vision Transformer (ViT) model, which adapts the Transformer architecture from NLP, has shown promising results toward generic and scalable architectures for computer vision tasks. This paper studies self-supervised ViT models and discusses two emerging properties:

  1. Self-supervised ViT features contain explicit information about semantic segmentation of an image.
  2. Self-supervised ViT features also make excellent k-NN classifiers.
The Vision Transformer treats an image as a sequence of patches, analogous to a series of word embeddings in NLP Transformer model. Figure taken from Nabil’s blog post.
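
To make the patch view concrete, here is a minimal sketch in plain Python. The function name `image_to_patches` and the toy 4×4 single-channel image are assumptions for illustration only; a real ViT additionally projects each flattened patch with a learned linear layer and adds position embeddings.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (a list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row, like one "word" in a sentence.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 image whose pixel values are simply 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)  # a sequence of 4 patches, 4 values each
```

The resulting list plays the role of the token sequence that the Transformer consumes.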

Based on these findings, the authors develop a self-supervised learning framework called DINO (knowledge distillation with no labels). As the name indicates, the framework trains the model with a knowledge distillation strategy. But instead of using a pre-trained model as the teacher and running knowledge distillation as a post-processing step after self-supervised pre-training, the teacher network is built dynamically during training and distillation itself serves as the self-supervised objective. The setup is closely related to codistillation, where student and teacher share the same architecture and learn together; in DINO, however, the teacher is updated with an exponential moving average of the student weights rather than by distilling from the student.
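
The two ingredients of this kind of self-distillation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the teacher is kept as an exponential moving average of the student (as described in the DINO paper), the temperatures and momentum are illustrative values, the function names are mine, and the centering of teacher outputs used in the paper is left out.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher and the student output distributions."""
    p_teacher = softmax(teacher_logits, t_teacher)
    p_student = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move the teacher weights slowly toward the student weights."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


# The loss is small when the student agrees with the teacher, large otherwise.
loss_agree = distillation_loss([2.0, 0.0], [2.0, 0.0])
loss_disagree = distillation_loss([0.0, 2.0], [2.0, 0.0])
new_teacher = ema_update([1.0, 0.0], [0.0, 0.0], momentum=0.9)
```

Only the student would receive gradients from this loss; the teacher changes solely through `ema_update`.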

Nabil Madali has a great blog post discussing this paper in more detail.

Machine Learning Engineering for Production (MLOps) Specialization [url]

Coursera just launched a new course on building end-to-end production ML systems. Bringing machine learning models to production systems involves many tasks, such as discovering data issues and data drift, conducting error analysis, and managing computation and scaling. The MLOps course discusses how to conceptualize, build, and maintain integrated machine learning systems that operate continuously in production. You will become familiar with the capabilities, challenges, and consequences of machine learning in production.

Opyrator: Quickly Turn Machine Learning Code into Microservices [github]

Figure taken from Opyrator Github Repo.

This open source project combines FastAPI, Streamlit, and pydantic to quickly turn your Python functions into production-ready microservices. It uses FastAPI to automatically generate an HTTP API and Streamlit to automatically generate a web UI. It is a very useful tool for quickly showcasing your machine learning models.

Figure taken from Opyrator Github Repo.

Stay safe, and see you next week!

This Week in Machine Learning – 7 May 2021

Hello everyone! Starting this week, I am going to summarize my notes in a weekly review post. Here are five machine learning projects / resources / research papers / software that I found interesting to explore this week:

Geometric Foundations of Deep Learning [website] [blog post][paper][talk]

This paper outlines a geometric unification of a broad class of machine learning problems, providing a common mathematical framework from which to derive the most successful neural network architectures, such as CNNs, RNNs, GNNs, and Transformers. The work is motivated by Felix Klein’s Erlangen Programme, which approaches geometry as the study of invariants.

In this light, the authors study symmetries, transformations that preserve an object, a structure, or a system, and present a general blueprint of Geometric Deep Learning, which typically consists of a sequence of equivariant layers followed by an invariant global pooling layer.
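
As a small illustration of the blueprint, the sketch below uses a set of numbers as the domain and permutations as the symmetry group. The layer form (each element mixed with the sum over the set) and the constants `a` and `b` are my own choices for illustration and are not taken from the paper.

```python
def equivariant_layer(xs, a=2.0, b=0.5):
    """Permutation-equivariant map: permuting the input permutes the output."""
    total = sum(xs)
    return [a * x + b * total for x in xs]


def invariant_pool(xs):
    """Permutation-invariant global pooling: the order of elements is ignored."""
    return sum(xs)


def permute(xs, order):
    """Apply a permutation given as a list of source indices."""
    return [xs[i] for i in order]


xs = [1.0, 2.0, 3.0]
order = [2, 0, 1]

# Equivariance: transforming before or after the layer gives the same result.
layer_then_permute = permute(equivariant_layer(xs), order)
permute_then_layer = equivariant_layer(permute(xs, order))

# Invariance: the pooled output does not change under the permutation.
pooled_original = invariant_pool(equivariant_layer(xs))
pooled_permuted = invariant_pool(equivariant_layer(permute(xs, order)))
```

Swapping the set for a grid and permutations for translations gives the same picture for convolution followed by global pooling.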

Geometric deep learning blueprint. Figure taken from (Bronstein et al., 2021).
Geometric deep learning architectures. Figure taken from (Bronstein et al., 2021).

The general blueprint can be applied to different types of geometric domains, such as grids, groups (global symmetry transformations in homogeneous space), graphs, geodesics (metric structures on manifolds), and gauges (local reference frames defined on tangent and vector bundles).

Five geometric domains. Figure taken from (Bronstein et al., 2021).

Explainable AI Cheat Sheet [website]

Jay Alammar has created a cheat sheet for Explainable AI. As more and more machine learning models are deployed in mission-critical and high-stakes applications such as medical diagnosis, it is important to ensure that the models make decisions for the right reasons. Jay groups explainable AI into five key categories:

  1. Interpretable models by design such as KNN, linear models, logistic regression
  2. Model agnostic methods, for example: SHAP, LIME, and perturbation
  3. Model specific methods, for example: using attention, gradient saliency, and integrated gradients
  4. Example based methods (to uncover insight about a model) using adversarial examples, counterfactual explanations, and influence functions.
  5. Neural representation methods: feature visualization, activation maximization, SVCCA, TCAV, and probes.
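
To give a flavor of the model agnostic category, here is a minimal perturbation sketch: each feature is replaced by a baseline value and the change in the model output is used as its importance score. The function name, the zero baseline, and the toy linear model are assumptions for illustration; this is not how SHAP or LIME are implemented.

```python
def perturbation_importance(model, x, baseline=0.0):
    """Score each feature by how much the output changes when it is replaced."""
    reference = model(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = baseline  # knock out one feature at a time
        scores.append(abs(reference - model(perturbed)))
    return scores


def toy_model(x):
    # Hypothetical black-box model: feature 0 matters most, feature 2 not at all.
    return 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


scores = perturbation_importance(toy_model, [1.0, 1.0, 1.0])
```

The method only calls the model as a function, which is what makes it model agnostic.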

Check the website for links to relevant papers in each category.

Brief overview of Explainable AI Cheat Sheet

MOOC for getting started with scikit-learn [website][github repo]

This MOOC is developed by scikit-learn core developers. It offers an in-depth introduction to predictive modeling using scikit-learn. The course covers the whole pipeline of predictive modeling, including data exploration, modeling (using linear models, decision trees, and ensemble models), hyperparameter tuning, and model evaluation. Highly recommended for beginners!

MixingBoard from Microsoft Research [paper][github]

MixingBoard is an open-source platform for quickly building knowledge-grounded stylized text generation demos, unifying text generation algorithms in a shared codebase. It also provides CLI, web, and RESTful API interfaces. The platform has several modules for building text processing assistant and conversational AI demos. Each module tackles a specific task needed to build the demos, such as conditioned text generation, stylized generation, knowledge-grounded generation, and constrained generation.

GPT2, DialoGPT, and SpaceFusion can be utilized for conditioned text generation. StyleFusion enables stylized generation via latent interpolation using soft-edit and soft-retrieval strategies. For knowledge-grounded generation, it combines knowledge passage retrieval, machine reading comprehension using BERT, content transfer, and knowledge-grounded response generation. Finally, hard or soft constraints can be applied during the decoding stage to encourage the generated text to contain the desired phrases.

The architecture of MixingBoard, composed of basic tools, algorithms, tasks into integrated demos. Figure taken from (Gao et al., 2021).

The Web Conference 2021 Best Paper: Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach. [paper][project page][talk]

The best paper of the Web Conference 2021 addresses an important problem in mental health care. The authors present a great application of text rewriting: transforming low-empathy conversational posts into higher-empathy ones. The task facilitates empathic conversation, which is rarely expressed in online mental health support. A reinforcement learning agent, PARTNER, is developed to perform sentence-level edits that make conversational posts more empathic.

PARTNER observes seeker and response posts, and performs two actions:

  1. Determine a position in the response span for insertion or replacement.
  2. Generate candidate empathic sentences.
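
The two actions can be pictured as a single sentence-level edit on the response post. The sketch below is purely illustrative: the function `apply_edit`, its arguments, and the example sentences are hypothetical and do not come from the PARTNER code base, where both the position and the candidate sentence are produced by learned policies.

```python
def apply_edit(response_sentences, position, candidate, mode):
    """Apply one sentence-level edit: insert at, or replace, the given position."""
    edited = list(response_sentences)  # leave the original post untouched
    if mode == "insert":
        edited.insert(position, candidate)
    elif mode == "replace":
        edited[position] = candidate
    else:
        raise ValueError("mode must be 'insert' or 'replace'")
    return edited


# Hypothetical response post, split into sentences.
response = ["Don't worry.", "Try to sleep more."]
candidate = "It must be really hard to go through this."

inserted = apply_edit(response, 0, candidate, "insert")
replaced = apply_edit(response, 0, candidate, "replace")
```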

It uses four reward functions that aim to increase empathy while maintaining text fluency, sentence coherence, and context specificity and diversity.

PARTNER: a deep reinforcement learning agent for empathic rewriting. Figure taken from (Sharma et al., 2021).

Hope you enjoy this post. See you next week!

SEAMLS 2019 Highlights

About two weeks ago, I was in Jakarta attending the Southeast Asia Machine Learning School (SEAMLS) 2019. SEAMLS is a five-day event for learning the current state of the art in machine learning and deep learning. It aims to inspire, encourage, and educate more machine learning engineers, researchers, and data scientists within the Southeast Asia region. I am very fortunate to have been selected from about 1,200 applicants. In this post, I will share a few things that I learned and that caught my attention in each lecture.

Day 1:

Math foundation by Cheng Soon Ong [slides].

The first lecture focuses on the math foundations, covering linear algebra, analytic geometry, matrix decomposition, vector calculus, probability and distributions, as well as continuous optimization.

I recommend Cheng Soon’s book for learning more about the math foundations of machine learning.

Machine learning basics by Lee Wee Sun.

Lee Wee Sun delivers two lectures: introduction to machine learning [slides], and machine learning basics [slides]. In his first lecture, he briefly explains common machine learning problems: supervised learning, unsupervised learning, and reinforcement learning. He also presents some examples of machine learning tasks such as classification and regression, data compression, and generative models.

In his second lecture, he talks about the basic and fundamental concepts of machine learning:

  1. Loss function: we use a loss function to measure our success.
    • Common loss functions: 0-1 loss, square loss, and absolute loss.
    • Empirical risk minimization
    • Overfitting
    • Regularization
    • I.I.D. assumption
  2. Maximum likelihood
    • Maximum likelihood and minimizing empirical risk. For some distributions D, maximizing the log likelihood is equivalent to empirical risk minimization with appropriate loss functions.
    • Maximum a posteriori (MAP) estimation
    • Bayes estimation
    • Unsupervised learning: density estimation, mixture model, autoregressive model
    • Generative Adversarial Network (GAN)
  3. Model selection
    • validation set
    • k-fold cross validation, leave-one-out cross validation
    • what to do when learning fails
  4. Feature selection and generation
    • filter: mutual information, information gain
    • wrapper: forward selection, backward elimination
    • sparsity-inducing norms: l1 regularization, LASSO
    • feature transformation and normalization: centering, unit range, standardization, clipping, sigmoidal transformation, logarithmic transformation, TF-IDF transformation
    • dimension reduction: autoencoder, PCA
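
The first items of the outline above (common losses and empirical risk minimization) can be sketched as follows. The toy dataset and the family of candidate predictors `y = w * x` are made up for illustration; they are not from the lecture.

```python
def zero_one_loss(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0


def square_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def absolute_loss(y_true, y_pred):
    return abs(y_true - y_pred)


def empirical_risk(loss, predictor, data):
    """Average loss of a predictor over the training sample."""
    return sum(loss(y, predictor(x)) for x, y in data) / len(data)


# Hypothetical training data: (x, y) pairs.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]

# A tiny hypothesis class: y = w * x for a few candidate slopes.
candidates = {w: (lambda x, w=w: w * x) for w in (1.0, 2.0, 3.0)}

# Empirical risk minimization: pick the hypothesis with the lowest average loss.
risks = {w: empirical_risk(square_loss, f, data) for w, f in candidates.items()}
best_w = min(risks, key=risks.get)
```

With a richer hypothesis class the same procedure can fit the sample too closely, which is where overfitting and regularization from the outline come in.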

Some of the material in his lecture is taken from a few chapters of Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David. I encourage you to read the book to understand the fundamental theory underlying machine learning principles. The book is available here (free for personal use only).

Practical session on TensorFlow and Google Colab by Subhodeep Moitra.

We have a practical session on the first day of SEAMLS. In this session, I learned about Seedbank, a cool website hosting a collection of interactive machine learning examples.

Day 2:

Simple unsupervised learning by Wray Buntine [slides].

We start the second day with the foundations of unsupervised learning. Wray talks about various aspects of unsupervised learning:

  • Clustering
    • How to build the cluster? How many clusters?
    • Extension to clustering: hierarchical clustering, soft clustering, “multi-level” clustering (LDA), matrix factorization, multi-view tensor factorization
    • What are the clusters used for?
      • Typical uses of clustering: discovery tool, generative model, market segmentation, etc.
      • Typical uses of matrix factorization: dimensionality reduction, summarization of relational content, recommender system, etc.
  • Models and algorithms:
    • Distance-based models: k-means, hierarchical algorithms
    • Generative models
      • Gaussian mixture model (GMM)
      • Matrix factorization
  • GMM algorithms:
    • Greedy local search
    • Gibbs sampling
    • Gradient-based search
    • Variational method
  • Foundations:
    • Partition theory and label switching problem
    • Cost function
    • Evaluation
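
As a concrete example of a distance-based model from the outline above, here is a minimal k-means sketch on one-dimensional data. The data points, the initial centers, and the fixed number of iterations are arbitrary choices for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(members) / len(members) if members else centers[k]
            for k, members in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

The result depends on the initial centers and on the chosen number of clusters, which is exactly the "how many clusters?" question raised in the talk.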

Neural network basics (+tricks of the trade) by Chris Dyer [slides].

Chris begins his talk with examples of neural network applications, followed by explanations on why we favor neural networks.

He explains the details of feed-forward networks, bias and variance in neural networks, differentiable losses, learning optimization, minibatching, and implementation details. He also goes through the details of computing derivatives, including numeric differentiation (ND), symbolic differentiation (SD), and automatic differentiation (AD).
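
The difference between numeric and automatic differentiation can be shown with a small sketch. The `Dual` class below is a minimal forward-mode AD implementation written for this post (supporting only `+` and `*`); it is an illustration, not the mechanism used by any particular deep learning framework, most of which rely on reverse-mode AD.

```python
class Dual:
    """A value paired with its derivative, for forward-mode automatic differentiation."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def f(x):
    return 3 * x * x + 2 * x + 1  # f'(x) = 6x + 2


def numeric_derivative(func, x, h=1e-5):
    """Central finite difference: approximate and sensitive to the step size h."""
    return (func(x + h) - func(x - h)) / (2 * h)


def automatic_derivative(func, x):
    """Exact up to floating point: derivatives are propagated through each operation."""
    return func(Dual(x, 1.0)).deriv


nd = numeric_derivative(f, 2.0)
ad = automatic_derivative(f, 2.0)
```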

The second part of Chris’ talk is about tricks of the trade for working with neural networks. Neural networks are powerful, but they have a lot of things to tune. He gives some general advice:

  • Make sure you can overfit on a few examples
  • Try to get a model that overfits on the training data (i.e. reduce bias)
  • Then try to improve the overfit model (i.e. reduce the variance)
Text in blue: controls bias. Text in red: controls variance. Text in black: controls both bias and variance. Text in boldface is discussed in depth by Chris.

Optimization in neural networks is hard because there is no guarantee of convergence. But we have a few tricks:

  • Make sure inputs and weights are statistically well-behaved. Input normalization allows you to use larger learning rates. Carefully scaling the variance during weight initialization can significantly help with learning.
  • Use large numbers of parameters
  • Use better optimizers
    • Smarter update rules: momentum, RMSProp, Adam
  • Better internal representation
    • Dropout
    • Batch and layer normalization
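
As an example of a smarter update rule, here is a minimal sketch of gradient descent with momentum on a one-dimensional quadratic. The objective, learning rate, and momentum value are arbitrary illustrative choices; RMSProp and Adam additionally rescale each step using running statistics of the squared gradients, which is not shown here.

```python
def sgd_momentum(grad, x, steps=300, lr=0.1, momentum=0.9):
    """Gradient descent where a velocity term accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(x)
        x = x + velocity
    return x


def grad_quadratic(x):
    # Gradient of the toy objective (x - 3)^2, whose minimum is at x = 3.
    return 2.0 * (x - 3.0)


x_min = sgd_momentum(grad_quadratic, x=0.0)
```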

One interesting research question he asked during the talk: Does dropout increase or decrease interpretability?

Day 3:

All talks on the third day are related to natural language processing.

Sequence modeling and embeddings by Yun-Nung (Vivian) Chen [slides].

In this talk, Vivian explains the basics of word representation, word embeddings, and recurrent neural networks. In word representation, we want to capture the meaning of a word. Two types of representation are used for this. The first is knowledge-based representation, such as WordNet. However, this approach requires laborious annotation effort, the annotation can be subjective, and newly invented words need to be added manually to the knowledge base. It is also difficult to compute word similarity in a knowledge base. The second type leverages an available corpus (and is thus called corpus-based representation). The representation can be atomic, using one-hot vectors, or neighbor-based, using SVD or word embeddings.
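
The contrast between atomic and neighbor-based corpus representations can be seen in a tiny example. The two-sentence corpus and the helper names are made up for illustration; a real neighbor-based representation would apply SVD or learn embeddings on top of counts like these.

```python
import math


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus.
corpus = [["cat", "drinks", "milk"], ["dog", "drinks", "water"]]
vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Atomic representation: one dimension per word, no notion of similarity."""
    return [1 if i == index[word] else 0 for i in range(len(vocab))]


def cooccurrence(word, window=1):
    """Neighbor-based representation: counts of words seen within the window."""
    counts = [0] * len(vocab)
    for sentence in corpus:
        for i, current in enumerate(sentence):
            if current != word:
                continue
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    counts[index[sentence[j]]] += 1
    return counts


sim_one_hot = cosine(one_hot("cat"), one_hot("dog"))
sim_neighbors = cosine(cooccurrence("cat"), cooccurrence("dog"))
```

Any two distinct one-hot vectors are orthogonal, while "cat" and "dog" share the neighbor "drinks" and therefore end up similar.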

In the word embeddings part, she explains the skip-gram model, CBOW, and GloVe in more detail, as well as the evaluation of word embeddings.
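
To illustrate what the skip-gram model is trained on, the sketch below generates (center word, context word) pairs from a sentence. The function name and the window size are my own choices; the actual model then learns embeddings by predicting the context word from the center word, while CBOW goes the other way and predicts the center word from its context.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) training pairs for every word in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
```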

In the last part of her talk, Vivian discusses recurrent neural networks, LSTM, GRU, bidirectional RNNs, and sequence prediction applications.

Neural sequence generation by Kyunghyun Cho [slides].

I learned about new types of neural sequence generation besides the traditional unbounded autoregressive model. Kyunghyun introduces two models: iterative parallel decoding and non-monotonic sequential generation. In iterative parallel decoding, the idea is that the decoder iteratively refines the generated sequence (in a denoising fashion). In this framework, tokens are generated in parallel instead of one word at a time.

In the non-monotonic sequential generation framework, we don’t assume a pre-specified generation order (such as left to right in standard monotonic sequential generation). Instead, a word is generated at an arbitrary position, and then the words to its left and right are generated recursively, following a binary tree.
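
The binary-tree view can be sketched as follows: each node holds a word, its left subtree holds the words that end up before it, and its right subtree the words after it. The example tree is hand-built, and reporting the generation order as a pre-order traversal is an assumption made for illustration; in the actual model the words are chosen by a learned policy.

```python
class Node:
    def __init__(self, word, left=None, right=None):
        self.word = word
        self.left = left
        self.right = right


def read_sentence(node):
    """In-order traversal recovers the final left-to-right sentence."""
    if node is None:
        return []
    return read_sentence(node.left) + [node.word] + read_sentence(node.right)


def generation_order(node):
    """Pre-order traversal: a word first, then its left and right contexts."""
    if node is None:
        return []
    return [node.word] + generation_order(node.left) + generation_order(node.right)


# Hypothetical tree: "are" is generated first, then "how" to its left, and so on.
tree = Node("are", left=Node("how"), right=Node("you", right=Node("?")))

sentence = read_sentence(tree)
order = generation_order(tree)
```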

Besides neural sequence generation, Kyunghyun Cho also discusses learning and inference on neural dialogue models.

Practical session on Language Model by Rewon Child [slides].

Rewon gives a practical session on NLP using TensorFlow. We learned to implement a simple transformer model in this session.

Day 4:

Convolutional Neural Networks by Viorica Patraucean [slides]. 

Viorica introduces many aspects of convolutional neural networks (convnets):

  • Convnets taxonomy
  • Convnets inductive biases
    • Hierarchical representation: abstraction increases with depth and size of receptive field
    • locality of the data
    • translation invariance
  • Convolutional layer
  • Pooling
  • Strided convolutions and deconvolutions
  • Initialization in convnets
  • Other convolutions: 1×1 kernels, dilated convolutions, separable convolutions, grouped convolutions, dynamic lightweight convolutions
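
Several of the items above (convolutional layer, strided convolution, dilated convolution) differ only in how the kernel is slid over the input. The one-dimensional sketch below, without padding, is my own simplification for illustration; the signal and kernel values are arbitrary.

```python
def conv1d(signal, kernel, stride=1, dilation=1):
    """Valid (no padding) 1D cross-correlation, the operation used in convnet layers."""
    span = dilation * (len(kernel) - 1) + 1  # input positions covered by the kernel
    outputs = []
    for start in range(0, len(signal) - span + 1, stride):
        outputs.append(
            sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        )
    return outputs


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]

plain = conv1d(signal, kernel)                # kernel moves one step at a time
strided = conv1d(signal, kernel, stride=2)    # kernel skips every other position
dilated = conv1d(signal, kernel, dilation=2)  # kernel taps are spread apart
```

The same weights are reused at every position, which is where the locality and translation biases listed above come from.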

She also discusses many computer vision tasks, including image classification, object detection, multiple-object detection, and semantic segmentation.

Multi-modal Machine Learning by Douwe Kiela [slides].

Douwe gives an interesting talk discussing the meaning of multi-modal, multi-modal representation learning, multi-modal fusion, joint embedding spaces, and fusion by attention.

He also presents some applications of multi-modal machine learning: image captioning, visual question answering (VQA), visual reasoning, visual dialogue, and embodied QA.

Practical session on CNNs by Mike Chrzanowski [slides].

In this practical session, Mike leads us to build a VGG-like convnet.

Day 5:

Deep probabilistic graphical models by Hung Bui [slides].

Hung gives a nice introduction to deep probabilistic graphical models. I suggest going through his slides if you are interested in this topic. He explains latent variable models, probabilistic graphical models, and their relation to deep probabilistic graphical models. He also covers training a deep probabilistic graphical model via variational inference and the reparameterization trick. At the end of his talk, he touches a bit on representation learning with deep probabilistic graphical models.
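
The reparameterization trick can be illustrated for a single Gaussian latent variable: instead of sampling z directly from N(mu, sigma^2), we sample parameter-free noise and transform it, so that z becomes a differentiable function of mu and sigma. The function name, the seed, and the parameter values below are arbitrary choices for this sketch.

```python
import random


def sample_reparameterized(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise."""
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on the parameters
    z = mu + sigma * eps       # dz/dmu = 1 and dz/dsigma = eps
    return z, eps


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
mu, sigma = 2.0, 0.5
samples = [sample_reparameterized(mu, sigma, rng)[0] for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
```

Because the randomness sits entirely in `eps`, gradients of a loss on `z` can flow back to `mu` and `sigma`.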

Attention, self-attention, transformer, and BERT by Manaal Faruqui [slides].

Manaal introduces the basics of the attention mechanism, followed by self-attention, the transformer architecture, and BERT. He also touches a bit on the interpretability of attention.
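
For reference, scaled dot-product attention fits in a few lines of plain Python. This is a simplified sketch: the learned query, key, and value projections, multiple heads, and masking of a real transformer are omitted, and the two-token input is made up.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Each query returns a weighted average of the values, weighted by key similarity."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Self-attention: queries, keys, and values all come from the same sequence.
x = [[1.0, 0.0], [0.0, 1.0]]
out = attention(x, x, x)
```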

Machine learning in Biomedicine by Truyen Tran [slides].

In the final lecture of SEAMLS, Truyen shows many applications of machine learning techniques in the biomedicine domain. He presents various problems and applications on set, sequence, and graph data in biomedicine.

Besides lectures, there are three panel discussions and poster sessions. Overall, this event is well organized, and I really like the program. The program covers a good balance of fundamental and more advanced topics. The full program is available here, along with the slides. I highly recommend going through the slides on topics you want to learn.

Exciting NLP Developments in 2018

2018 has been an exciting year for me. I encountered, explored, and learned many exciting ideas and efforts in NLP. In this post, I briefly summarize the NLP developments and efforts that excited me in 2018.

1. Translation without Parallel Data

Recent successes in unpaired image-to-image translation, such as DiscoGAN and CycleGAN, inspire work toward a similar goal in NLP, more specifically in the areas of machine translation and text generation. In the unpaired translation setting, we eliminate the need to build parallel corpora, which is a very tedious and slow process.

Difference between paired and unpaired training data. Each example is a correspondence between xi and yi in paired training data, whereas the correspondence information is not available in unpaired training data. Figure taken from (Zhu et al., 2017).

When dealing with the unpaired translation setting, a machine learning model needs to discover cross-domain alignment without supervision. In the absence of correspondence mapping information, one trick that has been shown to be effective is back-translation. In back-translation, after we translate from source to target, we translate the target back to the source. Many models utilize this back-translated source as a reference to guide the translation quality.
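
The round trip can be sketched with toy word-level "translation models" implemented as lookup tables. Everything here (the tables, the example sentence, and the token mismatch rate used as an error signal) is a made-up illustration; real systems use neural translation models and a differentiable reconstruction loss.

```python
def translate(tokens, table):
    """Toy word-by-word translation; unknown words are copied unchanged."""
    return [table.get(token, token) for token in tokens]


def reconstruction_error(source, forward_table, backward_table):
    """Translate source -> target -> source and measure how much was lost."""
    target = translate(source, forward_table)
    back = translate(target, backward_table)
    return sum(s != b for s, b in zip(source, back)) / len(source)


# Hypothetical translation tables.
en_to_fr = {"the": "le", "cat": "chat", "sleeps": "dort"}
good_fr_to_en = {"le": "the", "chat": "cat", "dort": "sleeps"}
bad_fr_to_en = {"le": "the", "chat": "dog", "dort": "sleeps"}

source = ["the", "cat", "sleeps"]
error_good = reconstruction_error(source, en_to_fr, good_fr_to_en)
error_bad = reconstruction_error(source, en_to_fr, bad_fr_to_en)
```

No parallel sentence pair is needed: the source sentence itself serves as the reference for the round trip.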

Learning from unpaired training data is a perfect fit for unsupervised machine translation systems, where we only have access to monolingual corpora. Recent works (Lample et al., 2018a; Lample et al., 2018b) show that good initialization, language modeling, and back-translation are three principles for success in building an unsupervised machine translation model.

Three principles of unsupervised machine translation. Figure taken from (Lample et al., 2018b).

Besides language translation, this approach has been used in many text generation problems, such as author attribute anonymity (Shetty et al., 2018), text generation with attribute control (Logeswaran et al., 2018), and text style transfer (Zhang et al., 2018).

2. Text Style Transfer

Example of neural style transfer for image. Figure taken from (Jing et al., 2018)

Motivated by the growing interest in Neural Style Transfer in computer vision (see the review by Jing et al., 2018), there are a number of recent works on style transfer for text generation. Text style transfer aims to rewrite a given text in a different linguistic style while preserving the content of the original text.

Many works on text style transfer formulate the problem as learning a disentangled latent representation (Bengio et al., 2014) from the input text, producing a latent representation that consists of content and style components. Using this representation, one can easily control and modify the style component, while keeping the content representation intact, to generate the output text. We hope that the generated output has the same content but a different style.

The adversarial autoencoder (Makhzani et al., 2016) is the most popular choice for learning a disentangled latent representation for text style transfer (Shen et al., 2017; Fu et al., 2018; Yang et al., 2018; Zhao et al., 2018; Zhao et al., 2018). Another approach leverages the idea of back-translation from unsupervised machine translation (Prabhumoye et al., 2018; Subramanian et al., 2018). The style transfer tasks include sentiment modification, translating offensive sentences into non-offensive ones, and paper-news title transfer.

Disentangling content and style representation. Figure taken from (Shen et al., 2017) slides.

Despite the many works on text style transfer, a fundamental question remains: what constitutes a style? Is sentiment modification a good example of text style transfer? Tikhonov and Yamshchikov (2018) suggest that style has to be orthogonal to semantics, and thus any semantically relevant information could be expressed in any style. Furthermore, the evaluation of generated text also remains a big question. How do we accurately measure the style while at the same time ensuring that the meaning is preserved? Current evaluations are mainly based on style classifier accuracy and human evaluation. Pang and Gimpel (2018) suggest that style classifier accuracy alone is not sufficient to evaluate text style transfer, and propose combining it with semantic similarity and fluency metrics to better assess non-parallel textual transfer.

3. Deep Contextual Representations

Word embeddings have been an important building block of many deep learning models for NLP tasks. A word embedding encodes information from each word in the input text to be processed by deep learning models. Many methods, such as word2vec and GloVe, have been developed to generate word embeddings that capture the linguistic contexts of words. Recent work on word embeddings involves refining (retrofitting) learned word embeddings with external information such as semantic lexicons (Faruqui et al., 2015) and knowledge graphs (Lengerich et al., 2018).

Traditionally, word embedding vectors are pre-trained using a shallow neural network on a language modeling task. Using large general-purpose corpora such as Wikipedia or Google News, one can learn good embedding vectors to be used for many NLP tasks. In 2018, we saw more efforts to exploit contextual information in deep neural networks, so that word representations are derived from multiple layers of a deep network. OpenAI GPT (Radford et al., 2018), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2018) show that deep contextual representations give large improvements on a broad range of NLP tasks.

BERT, OpenAI GPT, and ELMo architectures. Figure taken from (Devlin et al., 2018)

Jay Alammar has a great blog post explaining deep contextual representations. I highly recommend reading his post to better understand this concept.

4. Information Extraction for Scientific Literature

With the constant increase in scientific papers published every day, there is a growing need for scientific knowledge discovery tools. Information extraction from scientific papers has become an important application of natural language technology. Meta, a scientific discovery tool for biomedical research, extracts various medical concepts and makes them available to search and follow. The Allen AI team behind SemanticScholar has developed a scalable system to construct a literature graph in order to facilitate algorithmic discovery in the scientific literature (Ammar et al., 2018). SemEval 2018 also has one track for this task. SciIE (Luan et al., 2018) employs a multitask approach for identifying entities, relations, and coreference in scientific papers.

In the area of social science, NYU Coleridge Initiative hosts Rich Context Competition which aims to automatically discover research datasets, methods, and fields in social science research publications. The competition finalists (from GESIS, KAIST, Paderborn University, and Allen AI) will be presenting their work in Rich Context Competition Workshop on 15 February 2019. They will webcast the workshop. Register here! 🙂

Another related event in this area, co-located with NAACL 2019, is the Extracting Structured Knowledge from Scientific Publications (ESSP) workshop.

5. Datasets, Datasets, and Datasets

We see the birth of more and more datasets for more specific and challenging NLP tasks in 2018. 15 new datasets were presented in EMNLP 2018 alone (Sebastian Ruder summarizes them in his EMNLP 2018 highlight). There are also NLP benchmarks based on established datasets for studying and evaluating NLP models:

  • GLUE benchmark (Wang et al., 2018): a benchmark consisting of nine sentence- and sentence-pair language understanding tasks, i.e. CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI.
  • decaNLP (McCann et al., 2018): a benchmark for a multitask NLP challenge consisting of question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, semantic parsing, and commonsense reasoning.

More and more datasets will definitely trigger rapid NLP advancement in solving various NLP tasks. A nice crowdsourcing effort led by Sebastian Ruder to track NLP progress can be seen at

There are some great AI 2018 roundups in various blog posts that inspired me to write this post. I highly recommend reading them:

Looking forward to a more exciting 2019.

Elasticsearch Workshop at FOSSASIA 2016

Living Analytics Research Centre (LARC) and Elastic gave a workshop titled Elasticsearch: You know, for search! and more! at FOSSASIA 2016 last week. It was an introduction to Elasticsearch, and we shared our experience of using Elasticsearch at LARC. The crowd was great, and we had a bunch of questions related to Elasticsearch, particularly about how we utilize Elasticsearch in our lab. Here is our slide deck:

And here is the slide deck from Elastic (by @medcl):