July 2021 - Elastic AI

This Week in Machine Learning – 30 July 2021

For this week in machine learning, we look into biases in AI system from CACM (Communication of ACM) article for August 2021. The article nicely explains overview of biases that can present in our AI systems. Here are my short note.

Biases in AI Systems [article]

Machine learning is a complex system, and it involves learning from a large dataset with a predefined objective function. It is well-known too that Machine learning systems exhibit biases that comes from every part of the machine learning pipeline: from dataset development to problem formulation to algorithm development to evaluation stage.

Detecting, measuring, and mitigating biases in machine learning system, and furthermore, developing fair AI algorithm are not easy and still active research areas. This article provides a taxonomy of biases in the machine learning pipeline.

Machine learning pipeline begins with dataset creation, and this process includes data collection, and data annotation. During this process, we may encounter 4 types of biases:

Sampling bias: caused by selecting particular types of instances more than others.
Measurement bias: caused by errors in human measurement or due to certain intrinsic habits of people in capturing data.
Label bias: associated with inconsistencies in the labeling process.
Negative set bias: caused by not having enough negative samples.

Next stage is problem formulation. In this stage, biases are cause by how a problem is defined. For example, in creditworthiness prediction using AI, the problem can be formulated based on various business reasons (such as maximize profit margin) other than fairness and discrimination.

Taxonomy of bias types along the AI pipeline. Figure taken from the article (Srinivasan and Chander, 2021).

On algorithm and data analysis, several types of biases that potentially can occur in your system:

Sample selection bias: caused by selection of data instances as a result of conditioning on some variables in the dataset
Confounding bias: it happens because the machine learning algorithm does not take into account all the information in the data. Two types of confounding bias:
- omitted variable bias
- proxy variable bias, for example zip code might be indicative of race
Design bias: caused by the limitation of the algorithm or other constraints on the system. It could be in the form of algorithm bias, ranking bias, or presentation bias.

Here are several types of biases on the evaluation and validation stage:

Human evaluation bias: caused by human evaluator in validating the machine learning performance.
Sample treatment bias: selected test set for machine learning evaluation may be biased.
Validation and test dataset bias: due to selection of inappropriate dataset for testing.

Beside the taxonomy of bias types, the authors also give some practical guidelines for machine learning developer:

Domain-specific knowledge is crucial in defining and detecting bias.
It is important to understand features that are sensitive to the application.
Datasets used for analysis should be representative of the true population under consideration, as much as possible.
Have an appropriate standard for data annotation to get consistent labels.
Identify all features that may be associated with the target feature is important. Features that are associated with both input and output can lead to biased estimates.
Restricting to some subset of the dataset can lead to unwanted sample selection bias.
Avoid sample treatment bias when evaluating machine learning performance.

That’s all, and see you next week.

This Week in Machine Learning – 23 July 2021

I share few interesting open source projects related to machine learning deployment, neural search framework, and flexible machine learning data structure for this week in machine learning.

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [github][documentation]

Parallelformers helps us to deploy most big transformer-based model on multiple gpus for inference. It is designed to make model parallelization easier, and we can parallelize many Huggingface transformer models with a single line of code.

Jina: Cloud-native neural search framework for any kind of Data [github]

Building a search engine is hard, especially when our data are not text data. Neural search leverages on state-of-the art deep neural network to perform information retrieval, and it enables retrieval of any kinds of structured data such as images, video, audio, 3D mesh, etc. Jina is a neural search framework allowing us to build neural search engine as a service quickly. It employs distributed architecture, and cloud-native by design. Take a look at its quick demo.

Jina Architecture. Image taken from Jina’s github page.

Meerkat: Flexible data structure for complex machine learning datasets [github][blog]

With various type of machine learning data (such as images, graphs, videos, time-series), a simple data abstraction for data wrangling greatly helps ML practitioners to interact with high-dimensional, and multi-modal data. Inspired by Panda DataFrame and combined with the capability of from recent data abstraction in deep learning framework such as PyTorch Dataset and Tensorflow Dataset), Meerkat DataPanel provides a simple data abstraction that offers best of the both world. Meerkat DataPanel supports:

Complex datatype (for example: images, graph, videos, time-series)
Datasets that are larger-than-RAM with efficient I/O under-the-hood
Multi-modal datasets
Data creation and manipulation
Data selection
Inspection in interactive environment

Note that PyTorch Dataset and Tensorflow Dataset support number 1-3 in above list, whereas Panda DataFrame supports number 4-6 in above list. Read their blog for more detail examples on how Meerkat DataPanel helps us wrangle our datasets.

That’s all for this week. See you next week.

This Week in Machine Learning – 16 July 2021

Following my study on unsupervised domain adaptation last week, I write a short note on domain divergences survey for this week in machine learning. In domain adaptation, domain divergence measures the distance between the two domains, and reducing the divergence is important to adapt machine learning models to a new target domain.

A Survey and Empirical Analysis on Domain Divergences [paper]

As a measure of the distance between two domains, domain divergence is a major predictor of performance in target domain. Domain divergence measures can be utilized to predict performance drop when a model is applied in a target domain. This paper reviews 12 domain divergence measures, and reports a correlation analysis study on those divergence measures on part-of-speech tagging, named entity recognition, and sentiment analysis on more than 130 domain adaptation scenarios.

Taxonomy of Divergence measures. Figure taken from (Kashyap et al., 2021).

Domain divergence can be categorized into three categories:

Geometric measures
Geometric measures use vector space distance metric such as Cosine similarity, Manhattan, and Eucledian distance to measure the distance between features extracted from instances from different domains . These measures are easy to compute, but do not work well in a high dimensional space.
Information theoretic
Information theoretic measures capture the distance between probability distributions, for example cross entropy, f-divergences (KL and JS divergence), Renyi divergence, and Wasserstein distance.
Higher-order measures
Higher-order measures consider higher order moment matching of random variables or divergence in a projected space. Maximum Mean Discrepancy (MMD), CMD, and CORAL are measures utilizing higher order moment matching, whereas Proxy-a-distance (PAD) uses a classifier to measure the distance between source and target domain.

Note that KL divergence is related to both Renyi divergence and first-order moment matching.

How do we apply divergence measures?

According to this survey paper, the divergence measures can be applied in three different ways:

Data selection
One way is to use divergence measures to select a subset of data from source domain that share similar characteristics to the target domain. For example, using cosine similarity or JS divergence to select data for POS tagging.
Learning representation
Another way is to learn a new representations that are invariant to the data sources. Domain Adversarial Neural Network (DANN) uses a domain classifier as a proxy-a-distance (PAD) to measure the distance between source and target domain. A good domain-invariant representation should be able to confuse the domain classifier whether it comes from source or target domain. Another example is to use MMD to reduce the discrepancy between representations belonging to the two domains.
Decision in the wild
Predicting performance drops on target domain where no labelled data are available is an important practical problem. Some previous work uses information theoretic measures like cross entropy, Renyi and KL divergence to predict performance drops in a new domain.

Correlation of performance drops with divergence measures. PAD is the most reliable measures to indicate performance drops on target domain. Table taken from (Kashyap et al., 2021).

The authors also run an experiment to see the correlation between divergence measures and performance drops in target domain. More detail experiment setting and discussion can be read in the paper. Here are some recommendation from the authors based on their experiments:

PAD is a reliable indicator of performance drop.
JS divergence is easy to compute, and can be a strong baseline.
Cosine similarity is not a reliable indicator of performance drop.
One dataset is not one domain, but a cluster representations of multiple domains.

That’s all for today. Stay safe and see you next week!

This Week in Machine Learning – 9 July 2021

For machine learning week, I study about unsupervised domain adaptation in NLP this week, reading a relevant survey paper in this area. Here are my short summary notes.

A Survey on Neural Unsupervised Domain Adaptation in NLP [paper][github]

In many NLP applications, the availability of the labeled data is very limited. Although deep neural network works well for supervised learning, particularly with pre-trained language model to achieve state-of-the-art performance on NLP tasks, there is still a challenge to learn from unlabeled data under domain shift. In this setting, we have two problems to tackle:

The target domain and the source domain do not follow the same underlying distribution (Note that the task is the same). For example,
- The source domain is news and wikipedia articles, and the target domain is scientific literatures.
- The more extreme case is cross-lingual adaptation, the source domain is in one language, and the target domain is in another language.
The scarcity of labeled data in the target domain, and we need unsupervised domain adaptation techniques to handle this issue.

This survey summarizes relevant work on unsupervised domain adaptation where we mitigate the domain shift by learning only from unlabeled target data. The authors also discuss the notion of domain (what constitute a domain?), and suggest to use better term: variety space.

Taxonomy of Domain Adaptation. Figure taken from (Ramponi and Plank, 2020).

Domain adaptation methods can be categorized into 3 groups of approaches:

Model-centric approaches: redesign parts of the model by modifying feature space or loss function.
- Feature-centric methods by feature augmentation or feature generalization.
  - Feature augmentation methods uses common shared features (called pivots) to construct aligned space, for example structural correspondence learning (SCL), spectral feature alignment (SFA), and pivot-based language model (PBLM).
  - Feature generalization utilizes autoencoder to find good latent representation that can be transferred across domains. Stacked denoising autoencoder (SDA), marginalized stacked denoising autoencoder (MSDA) have been used for unsupervised domain adaptation.
- Loss-centric methods to regularize or modify model parameters. It can be divided into two groups: domain adversaries and instance-level reweighting.
  - Domain adversaries is probably the most popular methods for neural unsupervised domain adaptation. The idea is inspired by GAN approach which tries to minimize real and synthetic data distribution by learning a representation that cannot be distinguish between real and synthetic data. Bringing that into the domain adaptation setting, cross-domain representation can be achieved, if a domain classifier cannot distinguish whether the input comes from the source or the target domain. A popular Domain adversarial neural network (DANN) uses a gradient reveral layer to achieve cross-domain representation. A more recent approach utilizes Wassertein distance to achieve cross-domain representation.
  - The idea of instance-level reweighting methods is to assign a weight on each training instance proportional to its similarity to the target domain.
  - Other methods explicity reweight the loss based on domain discrepancy information, such as maximum mean discrepancy (MMD), and kernel mean matching (KMM).
Data-centric approaches: focuses on getting signal from the data.
- Pseudo-labeling using a trained labels to bootstrap initial ‘pseudo’ gold labels on the unlabeled instances. Pseudo-labeling applies semi-supervised methods such as self-training, co-training, and tri-training.
- Data selection aims to find the best matching data for the new domain using domain similarity measures such as JS divergence or topic distribution.
- Pre-training models leveraging large unlabeled data. Starting from a pre-trained transformer model followed by fine-tuning on small amount of labeled data has become the standard practice in NLP due to high performance on many NLP tasks. In domain adaptation, pre-training can be divided into:
  1. Pre-training alone
  2. Adaptive pre-training involving multiple rounds of pre-training
    - Multi-phase pre-training: pre-training followed by two or more phases of secondary pre-training, from broad domain to specific domain to task-specific domain, for example: BioBERT, AdaptaBERT, DAPT, TAPT.
    - Auxiliary-task pre-training: pre-training followed by multiple stages of auxiliary task pre-training involving intermediate labeled-data tasks.
Hybrid approaches: intersection between model-centric and data-centric.

One of the authors, Barbara Plank, created a Github repo to collate all relevant papers in this area.

That’s all. Stay safe and see you next week.

This Week in Machine Learning – 2 July 2021

Hello, I am sharing einops library, an MLOps course, and ACM Turing Lecture on deep learning for AI for this week in machine learning.

Writing better deep learning codes with einops [github][website]

Einops (Einstein-inspired Notation for Operations) is a simple, flexible, and powerful tensor operation library for more readable codes. This library supports many backend such as numpy, pytorch, tensorflow, jax, gluon, tf.keras, cupy, chainer, and mxnet.

Using this library, we can focus on the input and output interface of tensor operations, rather than how it is computed, making our code more readable and easier to maintain. It also reduces the chance of making unnecessary errors in our code. Watch the short video below, and pytorch with einops comparison to get a better sense of how to make our code more readable with this library.

MLOps Course [website][github]

Goku Mohandas created a project-based course on how to apply ML to build production grade product. The course walks through product development and iteration cycle. It covers product planning, data transformation, modeling, reproducibility, scripting, testing, and production. More details on lesson highlight can be seen in Goku’s tweet thread.

ACM Turing Lecture: Deep Learning for AI [article]

In this article, Bengio, LeCun, and Hinton review the rise of deep learning in AI, describe recent advances in deep learning, and discuss the future directions of deep learning in AI. The rise of deep learning attributed by deep architecture, unsupervised pre-training, success of Rectified Linear units (ReLUs). Deep learning also leads some breakthrough in speech and object recognitions. Several recent advances in deep learning are briefly mentioned in this article: Attention and transformer architecture, unsupervised and self-supervised learning, contrastive learning, and variational auto-encoder.

The discussion on future of deep learning suggests several improvement based on some current AI limitation compare to human learning such as: ability to generalize faster without too many trials, and adaptability and robustnes to changes in distribution (out-of-distribution generalization). In addition to this, applying deep learning on tasks which require a deliberate sequence of steps also another exciting future direction.

Stay safe and see you next week!