Lasso on dense and sparse data
We show that linear_model.Lasso provides the same results for dense and sparse data, and that fitting is faster in the sparse case.
from time import time
from scipy import linalg, sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
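Comparing the two Lasso implementations on dense data
We create a linear regression problem that is suitable for the Lasso, that is, one with more features than samples. We then store the data matrix in both dense (the usual) and sparse format, and train a Lasso on each. We measure the runtime of both fits and check that they learned the same model by computing the Euclidean norm of the difference between their learned coefficients. Because the data is dense, we expect the dense data format to give the better runtime.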
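Before the full experiment, the two storage formats and the norm-based check can be tried on a toy problem. This is only a minimal sketch: the names `X_toy`, `X_toy_sp`, and `w` are hypothetical and do not appear in the example, and the check is applied to matrix-vector products rather than to fitted coefficients.

```python
import numpy as np
from scipy import linalg, sparse

# A tiny matrix stored densely, and a copy of it in sparse COO format
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])
X_toy_sp = sparse.coo_matrix(X_toy)

# Hypothetical coefficient vector, used only for illustration
w = np.array([0.5, -1.0, 2.0])

# The same product computed from each storage format
pred_dense = X_toy @ w
pred_sparse = X_toy_sp @ w

# Euclidean norm of the difference: near zero when both formats agree
diff = linalg.norm(pred_dense - pred_sparse)
print(f"Distance between predictions : {diff:.2e}")
```

A distance close to zero indicates that both formats encode the same values. The example applies the same kind of check to the fitted `coef_` vectors.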
X, y = make_regression(n_samples=200, n_features=5000, random_state=0)
# create a copy of X in sparse format
X_sp = sparse.coo_matrix(X)
alpha = 1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
t0 = time()
sparse_lasso.fit(X_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")
t0 = time()
dense_lasso.fit(X, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")
# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")
Sparse Lasso done in 0.287s
Dense Lasso done in 0.028s
Distance between coefficients : 1.01e-13
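One plausible contributor to the slower sparse fit above is storage overhead: when almost every entry is nonzero, a sparse format has to keep index arrays on top of the values. The sketch below illustrates this on a smaller random matrix. The names `X_dense` and `X_coo` and the matrix shape are assumptions made for illustration, and the timings above may also depend on other factors that this sketch does not measure.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# A fully dense matrix: every entry is drawn from a continuous distribution
X_dense = rng.standard_normal((200, 500))
X_coo = sparse.coo_matrix(X_dense)

dense_bytes = X_dense.nbytes
# COO keeps one value plus a row index and a column index per stored entry
coo_bytes = X_coo.data.nbytes + X_coo.row.nbytes + X_coo.col.nbytes

print(f"Stored entries : {X_coo.nnz} of {X_dense.size}")
print(f"Dense storage  : {dense_bytes} bytes")
print(f"COO storage    : {coo_bytes} bytes")
```

When nothing can be skipped, the sparse representation is strictly larger than the dense one, so a sparse format only pays off once most entries are zero.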
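Comparing the two Lasso implementations on sparse data
We make the previous problem sparse by replacing all small values with 0 and run the same comparison as above. Because the data is now sparse, we expect the implementation that uses the sparse data format to be faster.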
# make a copy of the previous data
Xs = X.copy()
# make Xs sparse by replacing the values lower than 2.5 with 0
Xs[Xs < 2.5] = 0.0
# create a copy of Xs in sparse format
Xs_sp = sparse.coo_matrix(Xs)
Xs_sp = Xs_sp.tocsc()
# compute the proportion of non-zero entries in the data matrix
print(f"Matrix density : {(Xs_sp.nnz / float(X.size) * 100):.3f}%")
alpha = 0.1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
t0 = time()
sparse_lasso.fit(Xs_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")
t0 = time()
dense_lasso.fit(Xs, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")
# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")
Matrix density : 0.626%
Sparse Lasso done in 0.347s
Dense Lasso done in 0.687s
Distance between coefficients : 8.06e-12
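The reported matrix density can be sanity-checked with a short calculation. The sketch below assumes that the features produced by make_regression with its default settings are standard normal, so the fraction of entries that survive the 2.5 threshold should be close to the upper-tail probability of a standard normal at 2.5. The names `observed_density` and `expected_density` are hypothetical.

```python
from scipy.stats import norm
from sklearn.datasets import make_regression

# Same data-generating call as in the example
X, _ = make_regression(n_samples=200, n_features=5000, random_state=0)

# Entries that stay nonzero after Xs[Xs < 2.5] = 0.0 are those >= 2.5
observed_density = (X >= 2.5).mean()

# Upper-tail probability P(Z >= 2.5), assuming N(0, 1) features
expected_density = norm.sf(2.5)

print(f"Observed density : {observed_density * 100:.3f}%")
print(f"Expected density : {expected_density * 100:.3f}%")
```

Under that assumption the two values should agree to within sampling noise, which is consistent with a density well below 1% in the run above.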
Total running time of the script: (0 minutes 1.384 seconds)
Related examples
L1-based models for sparse signals
Joint feature selection with multi-task Lasso
Lasso and Elastic Net
Lasso model selection: AIC-BIC / cross-validation