You spent weeks learning algorithms, tuning parameters, and running experiments. Yet your model’s performance remains mediocre. Before you switch to a fancier algorithm or collect more data, consider this: feature engineering might be the missing piece.
Most beginners obsess over choosing the perfect algorithm while ignoring the quality of their input features. Building better machine learning models often depends more on how you prepare your features than which algorithm you choose. Poor features doom even the most sophisticated models. Great features make simple algorithms perform brilliantly.
Feature engineering in machine learning is the process of creating, transforming, and selecting input variables to help your model learn patterns more effectively. It’s part art and part science, requiring domain knowledge combined with technical skills. Let me show you the techniques that separate amateur models from production-quality systems.
Why feature engineering matters more than you think
Raw data rarely comes in a form that models can use effectively. Features might be on different scales, contain irrelevant information, or miss important relationships that exist in combinations of variables.
A simple example illustrates this. Suppose you're predicting house prices with separate features for length and width. Your model struggles to find patterns. But multiply length by width to create an area feature and performance improves dramatically. That's feature engineering creating value.
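As a minimal sketch of that idea (the column names and values here are made up for illustration):

import pandas as pd

# Hypothetical house data: combine length and width into an area feature
houses = pd.DataFrame({'length': [10, 12, 15], 'width': [8, 10, 9]})
houses['area'] = houses['length'] * houses['width']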
The relationship between features and target might be non-linear. Temperature affects ice cream sales, but not in a straight line. Sales spike between 70 and 90 degrees but level off above 90. Creating temperature ranges or polynomial features helps your model capture this pattern.
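For example, binning temperature into ranges is a one-liner with pandas (the thresholds and labels here are illustrative assumptions):

import pandas as pd

temps = pd.DataFrame({'temperature': [55, 72, 85, 95, 101]})

# Bucket raw temperatures into the ranges described above
temps['temp_range'] = pd.cut(
    temps['temperature'],
    bins=[-float('inf'), 70, 90, float('inf')],
    labels=['below_70', '70_to_90', 'above_90']
)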
Features on vastly different scales cause problems too. If one feature ranges from 0 to 1 and another from 0 to 1000000, the large numbers can dominate learning. Proper scaling puts all features on comparable ranges so the model weighs them appropriately.
Categorical variables like color or city names can’t directly feed into most algorithms. You need to encode them as numbers in ways that preserve meaningful information without creating false relationships.
Good feature engineering often lifts model performance far more than swapping algorithms. It's frequently the difference between a model that barely works and one that's production ready.
Feature scaling techniques
Feature scaling transforms numerical features to similar ranges. Two main approaches handle this: standardization and normalization.
Standardization transforms features to have mean zero and standard deviation one. Each value becomes how many standard deviations away from the mean it sits. This works well when your data roughly follows a normal distribution.
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# Sample data with different scales
data = pd.DataFrame({
    'age': [25, 35, 45, 55, 65],
    'income': [30000, 50000, 70000, 90000, 110000],
    'score': [0.6, 0.7, 0.8, 0.9, 0.95]
})

# Standardize features: mean 0, standard deviation 1
scaler = StandardScaler()
data_standardized = pd.DataFrame(
    scaler.fit_transform(data),
    columns=data.columns
)

print("Original data:")
print(data)
print("\nStandardized data:")
print(data_standardized)
Normalization scales features to a fixed range, typically 0 to 1. Each value becomes its position between the minimum and maximum values. This works well when you don’t want to assume any particular distribution.
from sklearn.preprocessing import MinMaxScaler

# Normalize to the 0-1 range
normalizer = MinMaxScaler()
data_normalized = pd.DataFrame(
    normalizer.fit_transform(data),
    columns=data.columns
)

print("Normalized data (0 to 1):")
print(data_normalized)
When to use which? Standardization suits algorithms that assume roughly centered, Gaussian-like inputs or train with gradient descent, such as logistic regression, SVMs, and neural networks. Normalization works well for distance-based algorithms like k-nearest neighbors, or whenever you need features bounded to a fixed range. Try both and see what performs better for your specific problem.
Encoding categorical variables
Categorical features need conversion to numerical form. The encoding method you choose affects how your model interprets relationships between categories.
One-hot encoding creates binary columns for each category. If you have a color feature with red, blue, and green, you get three columns: is_red, is_blue, is_green. Each row has 1 in one column and 0 in the others.
from sklearn.preprocessing import OneHotEncoder

# Sample categorical data
colors = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue']
})

# One-hot encode: one binary column per category
encoder = OneHotEncoder(sparse_output=False)
colors_encoded = pd.DataFrame(
    encoder.fit_transform(colors),
    columns=encoder.get_feature_names_out()
)

print("One-hot encoded colors:")
print(colors_encoded)
One-hot encoding works great for nominal categories with no inherent order. But it creates many columns when you have lots of categories. A feature with 100 cities creates 100 new columns.
Label encoding assigns each category a number. Red becomes 0, blue becomes 1, green becomes 2. This is compact but creates a false ordering. The model might think blue is somehow between red and green.
from sklearn.preprocessing import LabelEncoder
# Label encode
label_encoder = LabelEncoder()
colors['color_encoded'] = label_encoder.fit_transform(colors['color'])
print("Label encoded:")
print(colors)
Use label encoding only for ordinal categories with natural ordering like education level or product ratings. For nominal categories without ordering, prefer one-hot encoding despite the extra columns.
Target encoding uses the relationship between category and target. Each category gets encoded as the average target value for that category. This can be powerful but risks overfitting if not done carefully with cross-validation.
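A minimal sketch of the idea with pandas, using a made-up city and price table; in practice you would compute the category means on training folds only to avoid leakage:

# Hypothetical data: encode 'city' by the mean target ('price') per city
df = pd.DataFrame({
    'city': ['austin', 'boston', 'austin', 'chicago', 'boston'],
    'price': [300, 500, 320, 400, 520]
})

# Mean target value per category (fit on training data only in practice)
city_means = df.groupby('city')['price'].mean()
df['city_encoded'] = df['city'].map(city_means)
print(df)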
Creating new features from existing ones
Feature extraction creates new features from combinations or transformations of existing features. This lets you capture relationships the model can’t discover on its own.
Polynomial features create interactions and powers of existing features. From features x and y, you create x squared, y squared, and x times y. This helps linear models capture non-linear patterns.
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features up to degree 2: x1, x2, x1^2, x1*x2, x2^2
X = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 3, 4, 5, 6]
})

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = pd.DataFrame(
    poly.fit_transform(X),
    columns=poly.get_feature_names_out()
)

print("Polynomial features:")
print(X_poly)
Date and time features need special handling. Extract components like day of week, month, hour, or whether it’s a weekend. These often reveal patterns invisible in raw timestamps.
# Create date features
dates = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D')
})

dates['day_of_week'] = dates['date'].dt.dayofweek
dates['month'] = dates['date'].dt.month
dates['is_weekend'] = dates['date'].dt.dayofweek >= 5

print("Date features extracted:")
print(dates)
Domain knowledge guides feature creation. If you’re predicting customer churn, create features like days since last purchase, average purchase value, or purchase frequency. These domain-specific features often outperform raw transaction data.
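As a sketch, here is how such churn features might be derived from a hypothetical transaction log with customer_id, amount, and date columns (the names and reference date are assumptions for illustration):

# Hypothetical transaction log
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount': [50, 80, 20, 30, 25],
    'date': pd.to_datetime(['2024-01-05', '2024-02-10',
                            '2024-01-20', '2024-02-01', '2024-02-15'])
})

today = pd.Timestamp('2024-03-01')

# One row per customer: recency, average spend, and purchase frequency
customer_features = transactions.groupby('customer_id').agg(
    days_since_last_purchase=('date', lambda d: (today - d.max()).days),
    avg_purchase_value=('amount', 'mean'),
    purchase_count=('amount', 'count')
).reset_index()
print(customer_features)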
Common feature engineering mistakes to avoid
- Adding too many features, which introduces noise and invites overfitting
- Data leakage: building features from information that won't be available at prediction time
- Fitting scalers on the full dataset before the train-test split, which leaks test statistics into training (see the sketch below)
- Blindly adding polynomial features, which can explode the feature count and overfit
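To avoid the scaling leak, a minimal sketch of the correct order: split first, then fit the scaler on the training set only and reuse its statistics on the test set.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test = train_test_split(data, test_size=0.4, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_test_scaled = scaler.transform(X_test)         # reuse training statistics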
Selecting the most important features
Not all features improve your model. Some add noise, cause overfitting, or waste computational resources. Feature selection removes features that don’t contribute meaningful information.
Correlation-based selection removes features highly correlated with each other. If two features have correlation above 0.9, they provide redundant information. Keep one and drop the other.
# Calculate pairwise feature correlations
correlation_matrix = data.corr()
print("Feature correlations:")
print(correlation_matrix)

# Find highly correlated pairs (|correlation| above 0.9)
high_corr = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            high_corr.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

print("\nHighly correlated features:")
for pair in high_corr:
    print(f"{pair[0]} <-> {pair[1]}: {pair[2]:.2f}")
Model-based selection uses a simple model to identify important features. Train a decision tree or linear model and examine feature importances. Remove features with very low importance scores.
from sklearn.ensemble import RandomForestRegressor

# Feature importance from a random forest
X_train = data[['age', 'income']]
y_train = data['score']

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature importances:")
print(importance_df)
Recursive feature elimination systematically removes features and measures performance. Start with all features, remove the least important, and repeat until performance starts degrading.
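A minimal sketch with scikit-learn's RFE, reusing the small X_train and y_train from above; the linear estimator and n_features_to_select value are illustrative choices:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Recursively drop the weakest feature until the target count remains
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1)
rfe.fit(X_train, y_train)

print("Selected features:", list(X_train.columns[rfe.support_]))
print("Feature ranking (1 = kept):", dict(zip(X_train.columns, rfe.ranking_)))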
The goal isn’t having the most features. It’s having the right features that capture signal without noise.
Feature selection for small vs large datasets
Feature selection depends heavily on dataset size. With small datasets, using too many features often causes overfitting. Simple methods like correlation filtering, univariate statistical tests, or L1 regularization (Lasso) work best because they reduce noise while keeping the model stable.
For large datasets, models can handle more features safely. Tree-based models such as Random Forest or Gradient Boosting can automatically identify important features. Dimensionality reduction techniques like PCA are also useful to speed up training without losing much information.
Example (small dataset using Lasso):
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

# Lasso's L1 penalty shrinks weak coefficients to zero;
# SelectFromModel keeps only the features with nonzero weights
lasso = Lasso(alpha=0.01)
selector = SelectFromModel(lasso)
X_selected = selector.fit_transform(X, y)
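For the large-dataset case, a minimal PCA sketch; X stands for any numeric feature matrix, and the variance threshold is an illustrative choice:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Scale first, since PCA is sensitive to feature variance
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95 percent of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)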
Choosing the right feature selection method for your dataset size improves generalization and prevents unnecessary model complexity.
Putting it all together
Feature engineering in machine learning combines all these techniques into a workflow. Start with understanding your data and problem domain. Identify what information might help predictions.
Scale numerical features appropriately. Encode categorical variables correctly. Create new features from combinations or transformations. Select the most valuable features and discard noise.
Test each transformation’s impact on model performance. Some transformations help, others hurt, and you won’t know until you measure. Keep changes that improve validation performance.
Document your feature engineering pipeline. You’ll need to apply the exact same transformations to new data when making predictions. Sklearn pipelines help automate this process and prevent errors.
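As a sketch, a scikit-learn Pipeline can bundle the preprocessing steps and the model so the same transformations run at training and prediction time; the column names and the final estimator here are assumptions for illustration:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])

model = Pipeline([
    ('preprocess', preprocess),
    ('regressor', LinearRegression())
])

# model.fit(X_train, y_train) now applies the exact same
# transformations that model.predict(X_new) will apply later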
The best feature engineering comes from deeply understanding your problem domain. What factors actually drive your target variable? What relationships exist that raw features don’t capture? Domain experts often suggest features that dramatically improve performance.
Remember that feature engineering is iterative. You try ideas, measure results, and refine based on what works. The first attempt rarely produces optimal features. Experimentation and measurement guide you toward better solutions.
Ready to optimize your models even further? Check out our guide on hyperparameter tuning explained to learn how to systematically find the best settings for your algorithms after you’ve built great features.

