When not to use Categorical Encoding
Imagine you’re shopping online. There are millions of product listings, and they often use different terms to describe similar items: a “couch” might also be listed as a “sofa.” Now, if we took an online marketplace dataset in which ‘product listing’ were its own column, traditional categorical encoding would struggle with this semantic overlap, and encoding every unique listing would leave you with millions of columns. This is exactly why traditional categorical encoding is not always the best method for large-scale datasets with columns full of diverse, free-text descriptions. Because how in the world would you encode a column like that?
Challenges with Traditional Categorical Encoding
Our whole problem is that traditional categorical encoding methods assign a unique identifier to each category, which blows up the feature space when there are many unique categories and discards the semantic similarity between them.
1. High Cardinality:
High cardinality refers to variables that have a large number of unique values. For example, if every unique product listing description were encoded into its own column, you would end up with millions of columns.
2. Semantic Similarity:
Traditional encoding techniques do not capture semantic similarity between different terms. This leads to a loss of meaningful information and can degrade the performance of the model.
3. Scalability Issues:
As the number of unique categories increases, the feature space grows right along with it: with one-hot encoding, that is one new column per category. This makes the model larger and harder to train, and it quickly runs into memory and computational limits. The sketch below shows how fast this adds up.
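To make the cardinality problem concrete, here is a minimal sketch using pandas and a hypothetical `product_listing` column (the toy listings are invented for illustration):

```python
# A minimal sketch of how one-hot encoding scales with cardinality,
# using a hypothetical 'product_listing' column.
import pandas as pd

# Toy marketplace data: every listing description is slightly different.
df = pd.DataFrame({
    "product_listing": [
        "grey fabric couch", "gray sofa 3-seater",
        "xyz 14 oz pure himalayan water", "low fat 1% milk 16 oz",
    ],
    "price": [499, 520, 2, 3],
})

encoded = pd.get_dummies(df, columns=["product_listing"])
print(encoded.shape)              # (4, 5): one new column per unique listing
print(encoded.columns.tolist())

# With millions of unique listings this would create millions of columns,
# and "grey fabric couch" and "gray sofa 3-seater" would still be treated
# as completely unrelated categories.
```

Every new unique description adds another column, and none of the resulting columns know that a couch and a sofa are the same thing.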
What can we do instead?
Addressing the Challenges with Advanced Techniques
Given these challenges, it’s clear that we need a more sophisticated approach to handle categorical data in large-scale datasets. This is where advanced techniques like Word2Vec come into play. Let’s see what this is and why it can be a better alternative.
Word2Vec Algorithm
Word2Vec is an algorithm that converts words into numerical vectors while preserving their relationships and meanings. It trains a shallow neural network on a large text corpus and learns word associations from the contexts in which words appear. For example, if one product is named “xyz 14 oz pure himalayan water” and another is “low fat 1% milk 16 oz”, the two share context (ounce measurements, beverage-related terms), so the algorithm assigns them similar vectors and keeps them apart from electronics or furniture products. It does this by placing words with similar meanings close to each other in a high-dimensional space.
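You can see this closeness directly with pretrained vectors. The sketch below assumes gensim’s downloader and its pretrained Google News Word2Vec model (the download is large, on the order of 1.6 GB, and happens on first use):

```python
# A minimal sketch of semantic similarity with pretrained Word2Vec vectors,
# loaded through gensim's downloader (large download on first use).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# Words with similar meanings sit close together in vector space.
print(vectors.similarity("couch", "sofa"))   # high similarity
print(vectors.similarity("couch", "milk"))   # noticeably lower

# Nearest neighbours of "couch" are semantically related furniture terms.
print(vectors.most_similar("couch", topn=3))
```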
Word2Vec maps each word in the vocabulary to a unique vector in a high-dimensional space such that words with similar meanings or usage contexts end up close to each other. This spatial closeness reflects the semantic similarity between the words. Since every word is converted into a vector of our preferred size, let’s say 5, a product listing can be represented by just 5 additional columns in the dataset (for example, by averaging the vectors of its words) instead of millions of encoded features!
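Here is a minimal sketch of that idea with gensim: it trains Word2Vec on a handful of hypothetical, tokenized product listings and averages the word vectors of each listing into a single 5-dimensional feature vector. The corpus and parameters are toy values for illustration only.

```python
# A minimal sketch: train Word2Vec on tokenized product listings, then turn
# each listing into a fixed-size vector (size 5) by averaging its word vectors.
import numpy as np
from gensim.models import Word2Vec

listings = [
    "xyz 14 oz pure himalayan water",
    "low fat 1% milk 16 oz",
    "grey fabric couch 3 seater",
    "wooden sofa with grey cushions",
]
tokenized = [listing.lower().split() for listing in listings]

# vector_size=5 keeps the example small; real projects typically use 50-300.
model = Word2Vec(sentences=tokenized, vector_size=5, window=3,
                 min_count=1, epochs=100)

def listing_vector(tokens, model):
    """Average the word vectors of a listing into one fixed-size vector."""
    word_vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(word_vectors, axis=0)

features = np.vstack([listing_vector(t, model) for t in tokenized])
print(features.shape)  # (4, 5): five numeric columns, regardless of vocabulary size
```

However many unique listings the marketplace has, the feature matrix stays at 5 columns, and listings that share vocabulary and context end up with similar vectors.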
Word2Vec provides a powerful alternative. If you found this article helpful, consider trying Word2Vec in your next project and share your experiences with us in the comment section!