# One-hot encoding versus dummy encoding

{% hint style="warning" %}
*The following description is written by ChatGPT.*
{% endhint %}

Dummy encoding is essentially the same as one-hot encoding, but with a slight difference in implementation that has implications for certain statistical analyses.

#### One-Hot Encoding

In one-hot encoding, each category of a categorical variable is transformed into a binary column (1 for presence of the feature, 0 for absence). For a variable with `n` categories, one-hot encoding creates `n` binary columns.
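As a minimal sketch of the above, `pandas.get_dummies` produces one binary column per category (the `color` column and its values are illustrative):

```python
import pandas as pd

# A small categorical column with n = 3 distinct levels.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category -> 3 columns.
one_hot = pd.get_dummies(df["color"], dtype=int)
print(one_hot)
#    blue  green  red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      1    0
```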

#### Dummy Encoding

Dummy encoding is a technique often used in the context of regression analysis. It also converts categorical variables into binary columns, but to avoid multicollinearity (a situation where predictor variables are correlated), one level of the categorical variable is dropped. So, for a variable with `n` categories, dummy encoding creates `n−1` binary columns. The dropped category acts as a reference level, and the model coefficients for the dummy variables represent the effect of each category relative to this reference.
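In pandas, the same function implements dummy encoding when `drop_first=True` is passed; a sketch with the same illustrative data as before:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Dummy encoding: drop one level, leaving n - 1 = 2 columns.
# The dropped level ("blue", first alphabetically) is the reference:
# a row of all zeros means "blue".
dummies = pd.get_dummies(df["color"], drop_first=True, dtype=int)
print(dummies)
#    green  red
# 0      0    1
# 1      1    0
# 2      0    0
# 3      1    0
```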

While the terms are frequently used interchangeably, particularly in machine learning contexts where multicollinearity is less of a concern due to regularization techniques, the distinction is important in regression models where multicollinearity can make the coefficients unreliable.

## Why do we use dummy encoding instead of one-hot encoding?

To avoid the problem of [multi-collinearity](https://www.investopedia.com/terms/m/multicollinearity.asp#:~:text=Multicollinearity%20is%20a%20statistical%20concept,in%20less%20reliable%20statistical%20inferences.).

> Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. Multicollinearity **can lead to skewed or misleading results** when a researcher or analyst attempts to determine how well each independent variable can be used most effectively to predict or understand the dependent variable in a statistical model.

One-hot encoding has a related problem in linear models: because the `n` binary columns always sum to 1 (matching the intercept column), the design matrix is rank-deficient, so the fitted linear function is not unique — a constant can be shifted between the intercept and the category coefficients without changing the predictions.
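This non-uniqueness can be made concrete by checking the rank of the design matrix; a small NumPy sketch (the matrices are illustrative):

```python
import numpy as np

# Full one-hot encoding of a 3-category variable over 4 observations.
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])
intercept = np.ones((4, 1))

# Intercept + all n one-hot columns: the one-hot columns sum to the
# intercept column, so the 4 columns are linearly dependent.
X_full = np.hstack([intercept, one_hot])

# Intercept + (n - 1) dummy columns: full column rank.
X_dummy = np.hstack([intercept, one_hot[:, 1:]])

print(np.linalg.matrix_rank(X_full))   # 3, despite having 4 columns
print(np.linalg.matrix_rank(X_dummy))  # 3, equal to its 3 columns
```

Because `X_full` has rank 3 but 4 columns, infinitely many coefficient vectors give identical predictions; dropping one level restores a unique least-squares solution.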

## References

* [*What is the difference between one-hot and dummy encoding?* from Data Science Stack Exchange](https://datascience.stackexchange.com/questions/98172/what-is-the-difference-between-one-hot-and-dummy-encoding)
* [Investopedia definition of Multi-collinearity](https://www.investopedia.com/terms/m/multicollinearity.asp)

