![cover](/upload/数据挖掘(机器学习).jpg)
Data Mining: Assignment 09
Data Mining: Lab 9, Feature Selection
Code: GitHub
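Objectives and Requirements
Work through practical examples in Python to understand the basic principles of data mining with feature selection algorithms, deepen that understanding, and learn the methods and steps for applying the algorithms in practice.
Content and Principles
- Use practical examples to understand the basic principles of feature selection algorithms and deepen understanding of them
- Implement data input, parameter setting, and result analysis for a feature selection algorithm in Python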
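Procedure
- For the dataset german_clean, draw 100 random samples and use the Relief method to rank the features by importance.
- Using the feature importance ranking from step 1, take the first 700 rows of the dataset as the training set and the last 300 rows as the validation set, and find the best feature subset for classification with logistic regression. (Note: the target field is labeled 1 and 2; recode it to 0 and 1.)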
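
As a reminder of what step 1 computes, the Relief relevance statistic for feature $j$ is commonly written as below. The notation is an assumption chosen here for illustration: $x_{i,\mathrm{nh}}$ is the near-hit of sample $x_i$ (its nearest neighbor of the same class), $x_{i,\mathrm{nm}}$ is its near-miss (nearest neighbor of the other class), and $\operatorname{diff}$ is the absolute difference for numeric features, usually after scaling each feature to $[0, 1]$.

$$
\delta^{j} = \sum_{i} \left[ -\operatorname{diff}\left(x_i^{j}, x_{i,\mathrm{nh}}^{j}\right)^{2} + \operatorname{diff}\left(x_i^{j}, x_{i,\mathrm{nm}}^{j}\right)^{2} \right]
$$

A larger $\delta^{j}$ means feature $j$ separates the classes better: it differs little from the near-hit and a lot from the near-miss.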
Results and Analysis
Answer01

    import numpy as np
    import pandas as pd


    def get_data(dataset):
        """Integer-encode the categorical feature columns; keep numeric ones as-is."""
        new_data = pd.DataFrame()
        for one in dataset.columns[:-1]:
            col = dataset[one]
            first = str(col.iloc[0])
            # Heuristic: the column is numeric if its first value looks like a number
            # (integer, decimal, or negative).
            if first.split(".")[0].isdigit() or first.isdigit() or \
                    first.split('-')[-1].split(".")[-1].isdigit():
                new_data[one] = col
            else:
                # Map each category to an integer code in order of first appearance,
                # so the encoding is the same on every run.
                keys = list(col.unique())
                new = dict(zip(keys, range(len(keys))))
                new_data[one] = col.map(new)
        new_data[dataset.columns[-1]] = dataset[dataset.columns[-1]]  # target column
        return new_data


    def get_diff(dataset, i, j, mode=None):
        """Smallest absolute difference on feature j between sample i and the
        same-class ('nh') or other-class ('nm') samples."""
        same_class = dataset[:, -1] == dataset[i, -1]
        if mode == 'nh':
            mask = same_class.copy()
            mask[i] = False  # exclude sample i itself
        elif mode == 'nm':
            mask = ~same_class
        else:
            raise ValueError("mode must be 'nh' or 'nm'")
        candidates = dataset[mask, j].astype(float)
        if candidates.size == 0:
            return np.inf
        return np.abs(candidates - float(dataset[i, j])).min()


    def relief(dataset):
        """Relevance statistic for each feature column (the last column is the label)."""
        m, n = dataset.shape
        r = []  # relevance statistics
        for j in range(n - 1):
            rj = 0
            for i in range(m):
                diff_nh = get_diff(dataset, i, j, mode='nh')
                diff_nm = get_diff(dataset, i, j, mode='nm')
                rj += -(diff_nh ** 2) + (diff_nm ** 2)
            r.append(rj)
        return r


    if __name__ == '__main__':
        data = pd.read_csv('data/german_clean.csv')
        new_data = get_data(data)
        # Draw 100 random rows (with replacement) from the 1000-row dataset
        np.random.seed(64)
        arr_random = np.random.randint(0, 1000, 100)
        data_random = new_data.iloc[arr_random].to_numpy()
        rf = np.array(relief(data_random))
        # argsort is ascending, so the most important features are printed last
        index = rf.argsort()
        for rank, i in enumerate(index, start=1):
            print("{:2d} feature: {:25} importance: {}".format(rank, data.columns[i], rf[i]))

![image-20221116091634719](https://owen-resource.oss-cn-hangzhou.aliyuncs.com/images/image-20221116091634719.png)
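
For comparison, textbook Relief picks each sample's near-hit and near-miss using the distance over all features, and only then accumulates the per-feature differences. The following is a minimal sketch of that variant, assuming numeric features, two classes, min-max scaling, and Manhattan distance. The function name `relief_weights` and the toy data are made up for illustration; this is not the code that produced the result above.

```python
import numpy as np


def relief_weights(X, y, n_samples=100, seed=0):
    """Textbook-style Relief for a two-class problem.

    X: (m, n) numeric feature matrix, y: (m,) class labels.
    Returns one relevance weight per feature (larger = more relevant).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Min-max scale each feature to [0, 1] so no feature dominates by its units.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # constant features: avoid division by zero
    Xs = (X - X.min(axis=0)) / span

    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    for i in rng.integers(0, len(Xs), size=n_samples):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to every row
        dist[i] = np.inf                       # never pick the sample itself
        same = y == y[i]
        hit = np.where(same, dist, np.inf).argmin()    # nearest same-class row
        miss = np.where(~same, dist, np.inf).argmin()  # nearest other-class row
        weights += -(Xs[i] - Xs[hit]) ** 2 + (Xs[i] - Xs[miss]) ** 2
    return weights


# Toy data (made up): feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
y_toy = np.repeat([0, 1], 50)
X_toy = np.column_stack([y_toy + rng.normal(0, 0.1, 100), rng.normal(0, 1, 100)])
w = relief_weights(X_toy, y_toy)
print(w.argsort()[::-1])  # feature indices, most relevant first
```

Scaling matters here: without it, a wide-ranging column such as a raw credit amount would dominate both the neighbor search and the accumulated weights.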
Answer02

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    # Features selected from the Answer01 ranking
    feature_list = ['credit_amount_new', 'savings_status', 'job', 'foreign_worker', 'purpose', 'existing_credits', 'checking_status', 'duration_new', 'age', 'credit_amount']
    df = pd.read_csv('data/german_clean.csv')
    data = df[feature_list].to_numpy()
    target = df.iloc[:, 20].to_numpy().copy()  # class label column, as a writable 1-D array

    # One-hot encode every selected feature; .toarray() returns a dense matrix
    enc = OneHotEncoder()
    data = enc.fit_transform(data).toarray()

    # Recode the labels from 1/2 to 0/1
    target[target == 1] = 0
    target[target == 2] = 1

    # First 700 rows for training, last 300 rows for validation
    train_data, train_target = data[:700], target[:700]
    test_data, test_target = data[700:], target[700:]

    reg = LogisticRegression()
    reg.fit(train_data, train_target)
    print(reg.score(test_data, test_target))  # accuracy on the validation set

![image-20221116092550904](https://owen-resource.oss-cn-hangzhou.aliyuncs.com/images/image-20221116092550904.png)
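
Step 2 asks for the best feature subset. One simple way to look for it, sketched here under assumptions, is to evaluate the top-1, top-2, ... features of the ranking and keep the prefix with the highest validation accuracy. The helper `best_prefix_subset`, the `max_iter=1000` setting, and the synthetic data are illustrative and not from the original experiment. Only prefixes of the ranking are tried, not every possible subset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def best_prefix_subset(X_train, y_train, X_val, y_val, ranked):
    """Try the top-1, top-2, ... features of `ranked` and keep the best prefix.

    `ranked` lists column indices from most to least important.
    Returns (best_columns, best_validation_accuracy).
    """
    best_cols, best_acc = [], -1.0
    for k in range(1, len(ranked) + 1):
        cols = list(ranked[:k])
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, cols], y_train)
        acc = model.score(X_val[:, cols], y_val)
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_cols, best_acc = cols, acc
    return best_cols, best_acc


# Synthetic stand-in data (made up): only column 0 carries signal.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, 1000)
X_demo = rng.normal(size=(1000, 4))
X_demo[:, 0] += 2.0 * y_demo
cols, acc = best_prefix_subset(X_demo[:700], y_demo[:700],
                               X_demo[700:], y_demo[700:], ranked=[0, 2, 1, 3])
print(cols, round(acc, 3))
```

With the real data, `ranked` would be the Relief ordering from Answer01 (most important first), and the columns would be encoded before fitting.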
This is an original article, licensed under CC BY-NC-ND 4.0. When reposting it in full, please credit Owen.