Hult EdTech Club
One-hot encoding versus dummy encoding


Last updated 1 year ago


The following description is written by ChatGPT.

Dummy encoding is essentially the same as one-hot encoding, but with a slight difference in implementation that has implications for certain statistical analyses.

One-Hot Encoding

In one-hot encoding, each category of a categorical variable is transformed into a binary column (1 for presence of the feature, 0 for absence). For a variable with n categories, one-hot encoding creates n binary columns.
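As a sketch (the DataFrame and category names here are hypothetical), pandas' `get_dummies` produces exactly this: one binary column per category.

```python
import pandas as pd

# Hypothetical categorical column with 3 categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category (n categories -> n columns).
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the three `color_*` columns, marking its category.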

Dummy Encoding

Dummy encoding is a technique often used in the context of regression analysis. It also converts categorical variables into binary columns, but to avoid multicollinearity (a situation where predictor variables are correlated), one level of the categorical variable is dropped. So, for a variable with n categories, dummy encoding creates n−1 binary columns. The dropped category acts as a reference level, and the model coefficients for the dummy variables represent the effect of each category relative to this reference.
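A minimal sketch of dummy encoding, assuming the same hypothetical `color` column as above: passing `drop_first=True` to `get_dummies` drops the first category (alphabetically, here `blue`) as the reference level.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Dummy encoding: drop one category as the reference level (n -> n-1 columns).
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True, dtype=int)
print(dummies)
```

A row of all zeros now identifies the reference category (`blue`); in a regression, the coefficients on `color_green` and `color_red` are interpreted relative to it.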

The terms are frequently used interchangeably, particularly in machine learning contexts, where regularization makes multicollinearity less of a concern. The distinction matters in regression models, where multicollinearity can make the coefficient estimates unreliable.

Why do we use dummy encoding instead of one-hot encoding?

To avoid the problem of multicollinearity.

Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model. Multicollinearity can lead to skewed or misleading results when a researcher or analyst attempts to determine how well each independent variable can be used most effectively to predict or understand the dependent variable in a statistical model.

One-hot encoding still has a problem when used alongside an intercept term: the fitted linear function is not unique, because the n binary columns sum exactly to the intercept column (the so-called dummy variable trap).
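This non-uniqueness can be shown numerically. With some hypothetical labels, a design matrix that pairs an intercept with all n one-hot columns is rank-deficient (its columns are linearly dependent), while dropping one column restores full column rank:

```python
import numpy as np

# Hypothetical labels for a 3-category variable, observed 6 times.
labels = np.array([0, 1, 2, 1, 0, 2])
one_hot = np.eye(3)[labels]  # 6 x 3 one-hot matrix

# Intercept + all 3 one-hot columns: the one-hot columns sum to the
# intercept column, so the 4-column matrix only has rank 3.
X_full = np.column_stack([np.ones(6), one_hot])

# Dummy encoding: drop the first category. Intercept + 2 columns,
# linearly independent, so the matrix has full column rank.
X_dummy = np.column_stack([np.ones(6), one_hot[:, 1:]])

print(np.linalg.matrix_rank(X_full))   # 3, despite having 4 columns
print(np.linalg.matrix_rank(X_dummy))  # 3, equal to its column count
```

Because `X_full` is rank-deficient, infinitely many coefficient vectors produce identical fitted values, so ordinary least squares cannot pin down a unique solution; `X_dummy` avoids this.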

References

  • What is the difference of one-hot encoding and dummy encoding? (Stack Overflow)
  • Multicollinearity (Investopedia)