For this week in machine learning, I am sharing two interesting tutorials from VLDB and KDD conferences this week.
Managing ML Pipelines: Feature Stores and the Coming Wave of Embedding Ecosystems [slides][paper]
Machine learning pipeline is an iterative process involving data curation, feature engineering, training and deploying models, as well as monitoring and maintenance of the deployed models. In large system with many downstream tasks, a feature store is important to standardize and manage feature generation and workflows in using the features. With the advent of self-supervised pre-trained embedding models as features, feature store faces new challenges to manage embeddings.
This VLDB 2021 tutorial gives an overview of the machine learning pipeline and feature store. Then, it introduces embeddings and the challenges faced by feature store in dealing with embeddings. Finally, it introduces recent solutions to some of the challenges and discussion on the future direction.
All You Need to Know to Build a Product Knowledge Graph [website]
This tutorial by Amazonian at KDD 2021 presents best practices in building scalable product knowledge graph. Building product knowledge graph is more challenging than generic knowledge graph due to the sparsity of the data, the complexity of the product domains, evolving taxonomies, and noise in the data.
The tutorial covers the solutions to answer the challenges in building product knowledge graph including knowledge extraction, knowledge cleaning, and ontology construction. Finally, it concludes with some practical tips and future directions.
That’s all for this week. Stay safe, and see you next week.
Geometric Deep Learning (DGL) course publishes great course materials including slides and video recording on the topic of Geometric Foundations of Deep Learning (which I shared some time ago). The course was delivered as part of African Master’s in Machine Intelligence (AMMI 2021). I will take time to go through all the lecture videos of this course.
BERTopic is a topic modeling technique that performs a density-based clustering on document representation (encoded using a transformer-based model). BERTopic utilizes class-based TF-IDF (c-TF-IDF) to get important words on each topic.
This library provides easy-to-use API to perform BERTopic, and visualize topics. In addition, it also supports various pre-trained models such as Sentence Transformer, Flair (allows you to use Huggingface pre-trained transformer models), Spacy, Gensim, and Universal Sentence Encoder (USE).
How to avoid machine learning pitfalls: a guide for academic researchers [paper]
This paper discusses common mistakes when using machine learning techniques, and how to avoid them. The paper is very suitable to anyone new in machine learning area. It covers pitfalls in every stage of machine learning development including data preparation, model development, evaluation to reporting. Here is the outline of the paper.
That’s all for this week. Stay safe and see you next week.
This week, I read a paper on pre-processing training data for Language models. Here is a short summary.
Deduplicating Training Data Makes Language Models Better [paper][github]
This paper shows that datasets for language modeling contain many long repetitive substrings and near-duplicate examples, for example a single 61-word sentence repeated over 60,000 times in C4 dataset. To address this issue, the authors of this paper propose two scalable deduplication methods to detect and remove duplicate sequences.
The advantages of training a language model on deduplicated datasets are as follow:
Reducing the rate of emitting memorized training data.
Removing train-test overlap that is common in non-deduplicated datasets. Train-test overlap causes model overfitting.
More efficient model training due to smaller datasets.
Deduplicating training data does not hurt perplexity.
Naive method to perform deduplication using exact string matching on all example pairs is not scalable. Two deduplication methods are introduced in this paper:
Removing exact substring duplication using suffix array (ExactSubstr)
Approximate matching with MinHash (NearDup)
More details on each method can be read in the paper. The authors also release the source codes.
Below are the percentage of duplicate examples in standard LM datasets detected by ExactSubstr and NearDup methods, as well as the impact of deduplicating training set on validation perplexity.