<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>Basic Tmux Usage</title>
<link href="/2019/05/07/Tmux%E7%9A%84%E5%9F%BA%E6%9C%AC%E4%BD%BF%E7%94%A8/"/>
<url>/2019/05/07/Tmux%E7%9A%84%E5%9F%BA%E6%9C%AC%E4%BD%BF%E7%94%A8/</url>
<content type="html"><![CDATA[<h1 id="Tmux是什么"><a href="#Tmux是什么" class="headerlink" title="Tmux是什么"></a>What Is Tmux</h1>
<p>Tmux is an excellent terminal multiplexer that lets you run multiple terminal sessions inside a single terminal window. You can also use it to keep sessions running in the background and to attach to or detach from them as needed.</p>
<p>Using Tmux when connecting to a server solves a number of annoying problems:</p>
<ul><li>To work in several directories at once, you have to open multiple terminals and switch back and forth.</li><li>When a script is running and the connection to the server drops, the server kills the process.</li><li>Every time you ssh into the server you have to change to the working directory again and restart your processes; the previous working state is not preserved.</li><li>...</li></ul>
<h1 id="安装Tmux"><a href="#安装Tmux" class="headerlink" title="安装Tmux"></a>Installing Tmux</h1>
<h2 id="在Mac-OS中上安装"><a href="#在Mac-OS中上安装" class="headerlink" title="在Mac OS中上安装"></a>Installing on Mac OS</h2>
<ul><li>Install Homebrew</li></ul>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">/bin/bash -c <span class="string">"<span class="variable">$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)</span>"</span></span><br></pre></td></tr></table></figure>
<ul><li>Install Tmux</li></ul>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">brew install tmux</span><br></pre></td></tr></table></figure>
<h2 id="在Ubuntu中安装"><a href="#在Ubuntu中安装" class="headerlink" title="在Ubuntu中安装"></a>Installing on Ubuntu</h2>
<p>Run the following command in a terminal:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">sudo apt-get install tmux</span><br></pre></td></tr></table></figure>
<h1 id="Tmux的配置文件"><a href="#Tmux的配置文件" class="headerlink" title="Tmux的配置文件"></a>The Tmux Configuration File</h1>
<p>How to write a Tmux configuration file is not covered here; look it up online if you are interested. Instead, here is the <a href="https://github.com/gpakosz/.tmux" target="_blank" rel="noopener">GitHub repository</a> of a ready-made configuration written by an experienced user.</p>
<p>To download it, run the following commands one at a time:</p>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">cd</span> ~</span><br><span class="line">git <span class="built_in">clone</span> https://github.com/gpakosz/.tmux.git</span><br><span class="line">ln -s -f .tmux/.tmux.conf</span><br><span class="line">cp .tmux/.tmux.conf.local .</span><br></pre></td></tr></table></figure>
<p>After downloading, close all terminals and restart Tmux for the configuration to take effect. For the full feature list of this configuration, see the author's GitHub repository. Only a few important points are listed here:</p>
<ul><li>Both Ctrl-a and Ctrl-b work as the Tmux prefix.</li><li>Maximize the current pane into a new window: &lt;prefix&gt; + (press it again to return to the original layout).</li><li>Toggle mouse mode: &lt;prefix&gt; m</li></ul>
<h1 id="Tmux的基本结构"><a href="#Tmux的基本结构" class="headerlink" title="Tmux的基本结构"></a>Basic Structure of Tmux</h1>
<div class="table-container"><table><thead><tr><th>Unit</th><th>Description</th></tr></thead><tbody><tr><td>server</td><td>Server; one server can contain multiple sessions</td></tr><tr><td>session</td><td>Session; one session can contain multiple windows</td></tr><tr><td>window</td><td>Window; one window can contain multiple panes</td></tr><tr><td>pane</td><td>Pane</td></tr></tbody></table></div>
<h1 id="Tmux基本操作"><a href="#Tmux基本操作" class="headerlink" title="Tmux基本操作"></a>Basic Tmux Operations</h1>
<p>Basic operations amount to managing sessions, windows, and panes: creating, closing, renaming, attaching, detaching, selecting, and so on. The default Tmux prefix is <strong>Ctrl+b</strong> (referred to as <strong>prefix</strong> below). Press the prefix combination, release it, and then press the command key.</p>
<h2 id="会话管理-session"><a href="#会话管理-session" class="headerlink" title="会话管理(session)"></a>Session Management (session)</h2>
<h3 id="常用命令"><a href="#常用命令" class="headerlink" title="常用命令"></a>Common Commands</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">tmux new  <span class="comment"># Create a session with a default name (the new command in Tmux command mode does the same; this holds for the other commands too, so the tmux terminal commands are not listed in later sections)</span></span><br><span class="line">tmux new -s session-name  <span class="comment"># Create a session named session-name (commonly used)</span></span><br><span class="line">tmux ls  <span class="comment"># List sessions</span></span><br><span class="line">tmux a  <span class="comment"># Attach to the most recent session</span></span><br><span class="line">tmux a -t session-name  <span class="comment"># Attach to the specified session</span></span><br><span class="line">tmux rename -t s1 s2  <span class="comment"># Rename session s1 to s2</span></span><br><span class="line">tmux <span class="built_in">kill</span>-session  <span class="comment"># Close the most recently used session</span></span><br><span class="line">tmux <span class="built_in">kill</span>-session -t s1  <span class="comment"># Close session s1</span></span><br><span class="line">tmux <span class="built_in">kill</span>-session -a -t s1  <span class="comment"># Close all sessions except s1</span></span><br><span class="line">tmux <span class="built_in">kill</span>-server  <span class="comment"># Close all sessions</span></span><br></pre></td></tr></table></figure>
<h3 id="常用快捷键"><a href="#常用快捷键" class="headerlink" title="常用快捷键"></a>Common Key Bindings</h3>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">prefix s  <span class="comment"># List sessions</span></span><br><span class="line">prefix $  <span class="comment"># Rename the session</span></span><br><span class="line">prefix d  <span class="comment"># Detach from the current session</span></span><br><span class="line">prefix D  <span class="comment"># Choose a session to detach</span></span><br></pre></td></tr></table></figure>
<h2 id="窗口管理-window"><a href="#窗口管理-window" class="headerlink" title="窗口管理(window)"></a>Window Management (window)</h2>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">prefix c  <span class="comment"># Create a new window (commonly used)</span></span><br><span class="line">prefix ,  <span class="comment"># Rename the current window</span></span><br><span class="line">prefix &  <span class="comment"># Close the current window</span></span><br><span class="line">prefix 0~9  <span class="comment"># Select the window numbered 0~9</span></span><br><span class="line">prefix w  <span class="comment"># List all windows and switch between them</span></span><br><span class="line">prefix n  <span class="comment"># Go to the next window (next)</span></span><br><span class="line">prefix p  <span class="comment"># Go to the previous window (previous)</span></span><br><span class="line">prefix l  <span class="comment"># Go to the last used window</span></span><br><span class="line">prefix .  <span class="comment"># Change the index of the current window</span></span><br><span class="line">prefix '  <span class="comment"># Switch to a window by index (may be greater than 9)</span></span><br><span class="line">prefix f  <span class="comment"># Search for a pane by its visible content</span></span><br></pre></td></tr></table></figure>
<h2 id="窗格管理-pane"><a href="#窗格管理-pane" class="headerlink" title="窗格管理(pane)"></a>Pane Management (pane)</h2>
<figure class="highlight bash"><table><tr><td class="code"><pre><span class="line">prefix %  <span class="comment"># Split horizontally (panes side by side)</span></span><br><span class="line">prefix "  <span class="comment"># Split vertically (panes stacked)</span></span><br><span class="line">prefix Up|Down|Left|Right  <span class="comment"># Switch panes in the arrow direction</span></span><br><span class="line">prefix x  <span class="comment"># Close the current pane</span></span><br><span class="line">prefix q  <span class="comment"># Show pane numbers</span></span><br><span class="line">prefix o  <span class="comment"># Switch to the next pane (clockwise)</span></span><br><span class="line">prefix Ctrl+o  <span class="comment"># Rotate the panes in the current window</span></span><br><span class="line">prefix }  <span class="comment"># Swap with the next pane</span></span><br><span class="line">prefix {  <span class="comment"># Swap with the previous pane</span></span><br><span class="line">prefix Space  <span class="comment"># Rearrange all panes in the current window</span></span><br><span class="line">prefix !  <span class="comment"># Move the current pane into a new window</span></span><br><span class="line">prefix t  <span class="comment"># Show the time in the current pane</span></span><br><span class="line">prefix z  <span class="comment"># Zoom the current pane (press again to restore)</span></span><br><span class="line">prefix i  <span class="comment"># Show information about the current window</span></span><br></pre></td></tr></table></figure>
<h1 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h1>
<ol><li><a href="https://zhuanlan.zhihu.com/p/43687973" target="_blank" rel="noopener">手把手教你使用终端复用神器Tmux</a></li><li><a href="http://blog.jobbole.com/87584/" target="_blank" rel="noopener">Tmux 速成教程:技巧和调整</a></li></ol>]]></content>
<categories>
<category> Tmux </category>
</categories>
<tags>
<tag> Tmux </tag>
</tags>
</entry>
<entry>
<title>Monthly Housing Rent Prediction Big Data Competition: A Summary</title>
<link href="/2019/01/13/%E4%BD%8F%E6%88%BF%E6%9C%88%E7%A7%9F%E9%87%91%E9%A2%84%E6%B5%8B%E5%A4%A7%E6%95%B0%E6%8D%AE%E8%B5%9B%E6%80%BB%E7%BB%93/"/>
<url>/2019/01/13/%E4%BD%8F%E6%88%BF%E6%9C%88%E7%A7%9F%E9%87%91%E9%A2%84%E6%B5%8B%E5%A4%A7%E6%95%B0%E6%8D%AE%E8%B5%9B%E6%80%BB%E7%BB%93/</url>
<content type="html"><![CDATA[<h1 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Preface</h1>
<p>I entered this competition with classmates. Because we signed up late and had no prior experience with this kind of data analysis competition, our final result was not very good. Still, I learned a lot, and I now have a much clearer picture of the whole data analysis workflow.<br>Now that the competition is over, this post summarizes it and improves the code. With the improved model, the predictions rank 3rd on leaderboard A.</p>
<p>The dataset and code are on <a href="https://github.com/RunningGump/rental-prediction" target="_blank" rel="noopener">GitHub</a>.</p>
<h1 id="赛题"><a href="#赛题" class="headerlink" title="赛题"></a>The Task</h1>
<p>The competition data consists of four months of rental prices and basic housing information for a certain region; the organizers anonymized the data.<br>Participants train a model on the housing information and monthly rent in the dataset, then predict the monthly rent of the homes in the test set from their housing information. The data is split into a training set and a test set. The training set was collected during the first three months and has 196539 records. The test set was collected during the fourth month and has 56279 records; compared with the training set it adds an "id" field (a unique id for each home) and has no "月租金" (monthly rent) field, while all other fields are the same.<br>The evaluation metric is RMSE (root mean squared error), a common metric for regression.<br>The training set contains the following fields:</p>
<div class="table-container"><table><thead><tr><th>Field</th><th>Description</th></tr></thead><tbody><tr><td>时间</td><td>Time at which the housing record was collected</td></tr><tr><td>小区名</td><td>Residential compound the home belongs to, anonymized</td></tr><tr><td>小区房屋出租数量</td><td>Number of homes for rent in the compound, anonymized, ordering preserved</td></tr><tr><td>楼层</td><td>Floor level (high, middle, low), anonymized</td></tr><tr><td>总层数</td><td>Total number of floors in the building, anonymized</td></tr><tr><td>房屋面积</td><td>Floor area of the home, anonymized</td></tr><tr><td>房屋朝向</td><td>Orientation of the home</td></tr><tr><td>居住状态</td><td>Occupancy status, indicating whether the home is already rented or occupied, anonymized</td></tr><tr><td>卧室数量</td><td>Number of bedrooms</td></tr><tr><td>厅的数量</td><td>Number of living rooms</td></tr><tr><td>卫的数量</td><td>Number of bathrooms</td></tr><tr><td>出租方式</td><td>Whether the home is rented as a whole, anonymized</td></tr><tr><td>区</td><td>District-level administrative unit of the home, anonymized</td></tr><tr><td>位置</td><td>Business area where the compound is located, anonymized</td></tr><tr><td>地铁线路</td><td>Number indicating the subway line, anonymized</td></tr><tr><td>地铁站点</td><td>Subway station near the home, anonymized</td></tr><tr><td>距离</td><td>Distance from the home to the subway station, anonymized</td></tr><tr><td>装修情况</td><td>Decoration grade of the home, anonymized</td></tr><tr><td>月租金</td><td>Monthly rent, the label, anonymized</td></tr></tbody></table></div>
<h1 id="分析"><a href="#分析" class="headerlink" title="分析"></a>Analysis</h1>
<p>This post covers the following parts: <strong>data cleaning</strong>, <strong>feature construction</strong>, <strong>model training</strong>, and <strong>model blending</strong>. This competition taught me that <strong>feature engineering</strong> plays a decisive role in the whole data analysis process. Only the features that improved model performance are listed here; the other features I tested are on my GitHub.</p>
<h1 id="数据清洗"><a href="#数据清洗" class="headerlink" title="数据清洗"></a>Data Cleaning</h1>
<p>The scatter plot of floor area against monthly rent looks like this:</p>
<p><img src="/2019/01/13/住房月租金预测大数据赛总结/before.png" alt="image-20190112164608211"></p>
<p>Samples with a floor area greater than 0.1 are clearly outliers, so they are removed.</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Remove outliers (this improved the score)</span></span><br><span class="line">train_df = train_df.drop(train_df[(train_df[<span class="string">'房屋面积'</span>] > <span class="number">0.1</span>)].index)</span><br></pre></td></tr></table></figure>
<p>After removing the outliers, the scatter plot of floor area against monthly rent looks like this:</p>
<p><img src="/2019/01/13/住房月租金预测大数据赛总结/after.png" alt="image-20190112164947005"></p>
<p>In my tests, removing the outliers improved model performance.</p>
<h1 id="特征构造"><a href="#特征构造" class="headerlink" title="特征构造"></a>Feature Construction</h1>
<p>My approach was to build features from common sense first, for example the total number of rooms (bedrooms, living rooms, and bathrooms), bedroom area, living room area, bathroom area, and relative floor height. After that I applied the usual tricks, for example LabelEncoder encoding of categorical features, linear combinations of several features, and ratio features.<br>The RMSE obtained with the original features serves as the baseline; a constructed feature is kept only if the RMSE after adding it beats the baseline.</p>
<h2 id="根据常识构造特征"><a href="#根据常识构造特征" class="headerlink" title="根据常识构造特征"></a>Features from Common Sense</h2>
<p>Building features from common sense means using what we already know to infer which features correlate strongly with monthly rent.</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Total rooms, average area per room, bedroom area, living room area, bathroom area</span></span><br><span class="line">train_df[<span class="string">'总房间数'</span>]=train_df[<span class="string">'卧室数量'</span>] + train_df[<span class="string">'厅的数量'</span>] + train_df[<span class="string">'卫的数量'</span>]</span><br><span class="line">test_df[<span class="string">'总房间数'</span>]=test_df[<span class="string">'卧室数量'</span>] + test_df[<span class="string">'厅的数量'</span>] + test_df[<span class="string">'卫的数量'</span>]</span><br><span class="line">train_df[<span class="string">'Area/Room'</span>]=train_df[<span class="string">'房屋面积'</span>] / (train_df[<span class="string">'总房间数'</span>]+<span class="number">1</span>)</span><br><span class="line">test_df[<span class="string">'Area/Room'</span>]=test_df[<span class="string">'房屋面积'</span>] / (test_df[<span class="string">'总房间数'</span>]+<span class="number">1</span>)</span><br><span class="line">train_df[<span class="string">'卧室面积'</span>]=train_df[<span class="string">'房屋面积'</span>]*(train_df[<span class="string">'卧室数量'</span>]/train_df[<span class="string">'总房间数'</span>])</span><br><span class="line">test_df[<span class="string">'卧室面积'</span>]=test_df[<span class="string">'房屋面积'</span>]*(test_df[<span class="string">'卧室数量'</span>]/test_df[<span class="string">'总房间数'</span>])</span><br><span class="line">train_df[<span class="string">'厅的面积'</span>]=train_df[<span class="string">'房屋面积'</span>]*(train_df[<span class="string">'厅的数量'</span>]/train_df[<span class="string">'总房间数'</span>])</span><br><span class="line">test_df[<span class="string">'厅的面积'</span>]=test_df[<span class="string">'房屋面积'</span>]*(test_df[<span class="string">'厅的数量'</span>]/test_df[<span class="string">'总房间数'</span>])</span><br><span class="line">train_df[<span class="string">'卫的面积'</span>]=train_df[<span class="string">'房屋面积'</span>]*(train_df[<span class="string">'卫的数量'</span>]/train_df[<span class="string">'总房间数'</span>])</span><br><span class="line">test_df[<span class="string">'卫的面积'</span>]=test_df[<span class="string">'房屋面积'</span>]*(test_df[<span class="string">'卫的数量'</span>]/test_df[<span class="string">'总房间数'</span>])</span><br><span class="line"><span class="comment"># Count the subway station records for each compound</span></span><br><span class="line">temp = train_df.groupby(<span class="string">'小区名'</span>)[<span class="string">'地铁站点'</span>].count().reset_index()</span><br><span class="line">temp.columns = [<span class="string">'小区名'</span>,<span class="string">'地铁站点数量'</span>]</span><br><span class="line">train_df = train_df.merge(temp, how = <span class="string">'left'</span>,on = <span class="string">'小区名'</span>)</span><br><span class="line">test_df = test_df.merge(temp, how = <span class="string">'left'</span>,on = <span class="string">'小区名'</span>)</span><br></pre></td></tr></table></figure>
<h2 id="根据套路构造特征"><a href="#根据套路构造特征" class="headerlink" title="根据套路构造特征"></a>Features from Standard Tricks</h2>
<p>These include encoding categorical or discrete data (for example LabelEncoder or one-hot encoding), ratio features, linear combinations of features, and so on.</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"><span class="keyword">from</span> sklearn.preprocessing <span class="keyword">import</span> LabelEncoder</span><br><span class="line"></span><br><span class="line"><span class="comment"># Label-encode 房屋朝向 (orientation), fitting on train and test together</span></span><br><span class="line">lb_encoder=LabelEncoder()</span><br><span class="line">lb_encoder.fit(pd.concat([train_df.loc[:,<span class="string">'房屋朝向'</span>], test_df.loc[:,<span class="string">'房屋朝向'</span>]]))</span><br><span class="line">train_df.loc[:,<span class="string">'房屋朝向'</span>]=lb_encoder.transform(train_df.loc[:,<span class="string">'房屋朝向'</span>])</span><br><span class="line">test_df.loc[:,<span class="string">'房屋朝向'</span>]=lb_encoder.transform(test_df.loc[:,<span class="string">'房屋朝向'</span>])</span><br><span class="line"><span class="comment"># Build the 相对高度 (relative height) feature</span></span><br><span class="line">train_df[<span class="string">'相对高度'</span>]=train_df[<span class="string">'楼层'</span>] / (train_df[<span class="string">'总楼层'</span>] + <span class="number">1</span>)</span><br><span class="line">test_df[<span class="string">'相对高度'</span>]=test_df[<span class="string">'楼层'</span>] / (test_df[<span class="string">'总楼层'</span>] + <span class="number">1</span>)</span><br><span class="line"><span class="comment"># Build the (相对高度 * 卧室面积) feature</span></span><br><span class="line">train_df[<span class="string">'PerFloorBedroomArea'</span>] = train_df[<span class="string">'相对高度'</span>] * train_df[<span class="string">'卧室面积'</span>]</span><br><span class="line">test_df[<span class="string">'PerFloorBedroomArea'</span>] = test_df[<span class="string">'相对高度'</span>] * test_df[<span class="string">'卧室面积'</span>]</span><br><span class="line"><span class="comment"># Other features can be tried as well...</span></span><br></pre></td></tr></table></figure>
<h1 id="模型训练"><a href="#模型训练" class="headerlink" title="模型训练"></a>Model Training</h1>
<p>I trained two models, XGBoost and LightGBM, on essentially the same features. As single models, XGBoost scored 1.84 online and LightGBM scored 1.88. Hyperparameters can be tuned with <code>sklearn.model_selection.GridSearchCV</code> from Scikit-learn (I did not spend much time on tuning; interested readers can tune further). The parameters of my two models are listed below:</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> xgboost <span class="keyword">as</span> xgb</span><br><span class="line"><span class="keyword">import</span> lightgbm <span class="keyword">as</span> lgb</span><br><span class="line"></span><br><span class="line"><span class="comment"># XGBoost model parameters</span></span><br><span class="line">xgb.XGBRegressor(max_depth=<span class="number">8</span>, <span class="comment"># Tree depth; larger values overfit more easily</span></span><br><span class="line">                 n_estimators=<span class="number">3880</span>, <span class="comment"># Best number of boosting rounds</span></span><br><span class="line">                 learning_rate=<span class="number">0.1</span>, <span class="comment"># Learning rate</span></span><br><span class="line">                 n_jobs=<span class="number">-1</span>) <span class="comment"># Use all CPU cores</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># LightGBM model parameters</span></span><br><span class="line">lgb.LGBMRegressor(objective=<span class="string">'regression'</span>, <span class="comment"># Objective: regression</span></span><br><span class="line">                  num_leaves=<span class="number">900</span>, <span class="comment"># Number of leaves</span></span><br><span class="line">                  learning_rate=<span class="number">0.1</span>, <span class="comment"># Learning rate</span></span><br><span class="line">                  n_estimators=<span class="number">3141</span>, <span class="comment"># Best number of boosting rounds</span></span><br><span class="line">                  bagging_fraction=<span class="number">0.7</span>, <span class="comment"># Fraction of samples drawn when building a tree</span></span><br><span class="line">                  feature_fraction=<span class="number">0.6</span>, <span class="comment"># Fraction of features drawn when building a tree</span></span><br><span class="line">                  reg_alpha=<span class="number">0.3</span>, <span class="comment"># L1 regularization</span></span><br><span class="line">                  reg_lambda=<span class="number">0.3</span>, <span class="comment"># L2 regularization</span></span><br><span class="line">                  min_data_in_leaf=<span class="number">18</span>,</span><br><span class="line">                  min_sum_hessian_in_leaf=<span class="number">0.001</span>)</span><br></pre></td></tr></table></figure>
<h1 id="模型融合"><a href="#模型融合" class="headerlink" title="模型融合"></a>Model Blending</h1>
<p>I blended the XGBoost and LightGBM predictions with a weighted average and adjusted the two weights repeatedly to improve performance.</p>
<figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line">lgb_df = pd.read_csv(<span class="string">"./result/lgb.csv"</span>)</span><br><span class="line">xgb_df = pd.read_csv(<span class="string">"./result/xgb.csv"</span>)</span><br><span class="line">res = pd.DataFrame()</span><br><span class="line">res[<span class="string">'id'</span>] = lgb_df[<span class="string">'id'</span>]</span><br><span class="line"><span class="comment"># Weights chosen according to the online scores</span></span><br><span class="line"><span class="comment"># 0.62/0.38 1.82066</span></span><br><span class="line"><span class="comment"># 0.65/0.35 1.82041</span></span><br><span class="line"><span class="comment"># 0.66/0.34 1.82039</span></span><br><span class="line"><span class="comment"># 0.67/0.33 worse</span></span><br><span class="line">res[<span class="string">'price'</span>] = lgb_df[<span class="string">'price'</span>] * <span class="number">0.34</span> + xgb_df[<span class="string">'price'</span>] * <span class="number">0.66</span></span><br><span class="line">res.to_csv(<span class="string">'./result/new.csv'</span>, index=<span class="keyword">False</span>)</span><br></pre></td></tr></table></figure>
<p>The blended model finally scored 1.82 online and ranked third overall, which shows that blending can improve prediction quality.</p>
<h1 id="总结"><a href="#总结" class="headerlink" title="总结"></a>Summary</h1>
<ol><li><p>Model training improves performance only to a certain extent; feature engineering sets the upper bound, and finding features that correlate strongly with the target is the key to winning.</p></li><li><p>Tree-based algorithms do not measure variables in a vector space. A numeric value acts only as a category symbol with no ordering implied, so one-hot encoding is not required.</p></li><li><p>Tree-based algorithms do not need feature normalization.</p></li><li><p>Tree-based algorithms are not good at capturing correlations between different features.</p></li><li><p>Both LightGBM and XGBoost can learn from NaN as part of the data, so missing values can be left untreated.</p></li><li><p>Holding out part of the provided training set as a test set gave worse online scores than training on the entire training set.</p></li></ol>]]></content>
<tags>
<tag> 数据分析 </tag>
<tag> xgboost </tag>
<tag> lightgbm </tag>
</tags>
</entry>
<entry>
<title>Fixing Math Formulas That Do Not Render in Hexo</title>
<link href="/2018/12/05/%E6%88%90%E5%8A%9F%E8%A7%A3%E5%86%B3%E5%9C%A8hexo%E4%B8%AD%E6%97%A0%E6%B3%95%E6%98%BE%E7%A4%BA%E6%95%B0%E5%AD%A6%E5%85%AC%E5%BC%8F%E7%9A%84%E9%97%AE%E9%A2%98/"/>
<url>/2018/12/05/%E6%88%90%E5%8A%9F%E8%A7%A3%E5%86%B3%E5%9C%A8hexo%E4%B8%AD%E6%97%A0%E6%B3%95%E6%98%BE%E7%A4%BA%E6%95%B0%E5%AD%A6%E5%85%AC%E5%BC%8F%E7%9A%84%E9%97%AE%E9%A2%98/</url>
<content type="html"><![CDATA[<h1 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Preface</h1>
<p>A personal blog built on Hexo has trouble rendering math formulas with the default setup. The screenshot below shows a formula rendering error from my earlier blog:</p>
<p><img src="/2018/12/05/成功解决在hexo中无法显示数学公式的问题/before.jpg" alt="问题截图"></p>
<p>After some searching online I solved the problem. Here is the screenshot after the fix:</p>
<p><img src="/2018/12/05/成功解决在hexo中无法显示数学公式的问题/after.jpg" alt="问题截图"></p>
<p>The steps I took are written down below for anyone who needs them.</p>
<h1 id="解决步骤"><a href="#解决步骤" class="headerlink" title="解决步骤"></a>Steps</h1>
<h2 id="更换Hexo的markdown渲染引擎"><a href="#更换Hexo的markdown渲染引擎" class="headerlink" title="更换Hexo的markdown渲染引擎"></a>Replace Hexo's Markdown Renderer</h2>
<p>Run the two commands below in order. The first uninstalls the default renderer, hexo-renderer-marked; the second installs hexo-renderer-kramed, which fixes some bugs in hexo-renderer-marked.</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">npm uninstall hexo-renderer-marked --save</span><br><span class="line">npm install hexo-renderer-kramed --save</span><br></pre></td></tr></table></figure>
<h2 id="修改node-modules-kramed-lib-rules-inline-js文件"><a href="#修改node-modules-kramed-lib-rules-inline-js文件" class="headerlink" title="修改node_modules\kramed\lib\rules\inline.js文件"></a>Edit node_modules\kramed\lib\rules\inline.js</h2>
<p>The hexo-renderer-kramed renderer still has some syntax conflicts. In the blog root directory, open node_modules\kramed\lib\rules\inline.js and change the value of the escape variable on line 11 as follows:</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">// escape: /^\\([\\`*{}\[\]()#$+\-.!_>])/,</span><br><span class="line"> escape: /^\\([`*\[\]()#$+\-.!_>])/,</span><br></pre></td></tr></table></figure>
<p>This change removes the escaping of \, {, and } from the original rule.</p>
<p>The em variable on line 20 needs a corresponding change as well.</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">// em: /^\b_((?:__|[\s\S])+?)_\b|^\*((?:\*\*|[\s\S])+?)\*(?!\*)/,</span><br><span class="line"> em: /^\*((?:\*\*|[\s\S])+?)\*(?!\*)/,</span><br></pre></td></tr></table></figure>
<h2 id="在主题中开启mathjax开关"><a href="#在主题中开启mathjax开关" class="headerlink" title="在主题中开启mathjax开关"></a>Enable MathJax in the Theme</h2>
<p>In the blog root directory, open themes/next/_config.yml and change the math option from its default false to true, as follows:</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta">#</span><span class="bash"> Math Equations Render Support</span></span><br><span class="line">math:</span><br><span class="line">  enable: true</span><br><span class="line">  per_page: true</span><br><span class="line">  engine: mathjax</span><br></pre></td></tr></table></figure>
<h2 id="在文章的Front-matter里打开mathjax开关"><a href="#在文章的Front-matter里打开mathjax开关" class="headerlink" title="在文章的Front-matter里打开mathjax开关"></a>Enable MathJax in the Post Front-matter</h2>
<p>If a post uses math formulas, turn on the mathjax switch in its Front-matter. Posts without formulas can leave it out.</p>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">---</span><br><span class="line">title: index.html</span><br><span class="line">date: 2018-12-5 01:30:30</span><br><span class="line">tags:</span><br><span class="line">mathjax: true</span><br><span class="line">---</span><br></pre></td></tr></table></figure>
<h2 id="重启hexo"><a href="#重启hexo" class="headerlink" title="重启hexo"></a>Restart Hexo</h2>
<figure class="highlight shell"><table><tr><td class="code"><pre><span class="line">hexo clean # Clear the cached files</span><br><span class="line">hexo g -d # Generate and deploy the site</span><br></pre></td></tr></table></figure>
<p>At this point, the problem of math formulas not rendering in Hexo is solved.</p>
<h1 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a>References</h1>
<p><a href="https://www.jianshu.com/p/7ab21c7f0674" target="_blank" rel="noopener">在Hexo中渲染MathJax数学公式</a></p>]]></content>
<categories>
<category> Hexo </category>
</categories>
<tags>
<tag> hexo </tag>
<tag> mathjax </tag>
</tags>
</entry>
<entry>
<title>Cracking Click CAPTCHAs That Depend on Word Order</title>
<link href="/2018/11/19/%E7%A0%B4%E8%A7%A3%E5%90%AB%E8%AF%AD%E5%BA%8F%E9%97%AE%E9%A2%98%E7%9A%84%E7%82%B9%E5%87%BB%E9%AA%8C%E8%AF%81%E7%A0%81/"/>
<url>/2018/11/19/%E7%A0%B4%E8%A7%A3%E5%90%AB%E8%AF%AD%E5%BA%8F%E9%97%AE%E9%A2%98%E7%9A%84%E7%82%B9%E5%87%BB%E9%AA%8C%E8%AF%81%E7%A0%81/</url>
<content type="html"><![CDATA[<h1 id="设计思路"><a href="#设计思路" class="headerlink" title="设计思路"></a>设计思路</h1><h2 id="前言"><a href="#前言" class="headerlink" title="前言"></a>前言</h2><p><a href="http://www.gsxt.gov.cn/index.html" target="_blank" rel="noopener">国家企业信用信息公示系统</a>中的验证码是按语序点击汉字,如下图所示:</p><p><img src="/2018/11/19/破解含语序问题的点击验证码/gsxt.png" alt="验证码"></p><p>即,如果依次点击:‘无’,‘意’,‘中’,‘发’,‘现’,就会通过验证。</p><p>本项目的<strong>破解思路</strong>主要分为以下步骤:</p><ol><li>使用目标探测网络YOLOV2进行<strong>汉字定位</strong></li><li>设计算法进行<strong>汉字切割</strong></li><li>使用darknet的分类器进行<strong>汉字识别</strong></li><li>设计算法进行<strong>汉字纠错与语序识别</strong></li></ol><p><a href="https://github.com/RunningGump/gsxt_captcha" target="_blank" rel="noopener">Github仓库直通车</a></p><h2 id="汉字定位与汉字识别"><a href="#汉字定位与汉字识别" class="headerlink" title="汉字定位与汉字识别"></a>汉字定位与汉字识别</h2><p>本项目的汉字定位和汉字识别部分都是基于<code>darknet</code>框架进行训练的。本项目对它们使用的训练网络并没有太高要求,只需懂得如何使用darknet就可以了,关于如何使用darknet框架训练汉字定位模型和汉字识别模型可查阅<strong>模型训练文档</strong>以及<a href="https://pjreddie.com/darknet/" target="_blank" rel="noopener">官方文档</a>的YOLO和Train a Classifier部分。那么,下面主要对汉字切割和语序识别进行讲解,最后再对整个破解程序进行讲解。</p><h2 id="汉字切割算法"><a href="#汉字切割算法" class="headerlink" title="汉字切割算法"></a>汉字切割算法</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">seg_one_img</span><span class="params">(img_path, rets)</span>:</span></span><br><span class="line"> img = cv2.imread(img_path)</span><br><span class="line"> hanzi_list = [] <span class="comment"># 用于记录每个汉字对应的坐标:key为切割后汉字图片路径,value为中心点坐标</span></span><br><span class="line"> <span class="comment"># 对定位框进行遍历</span></span><br><span class="line"> <span class="keyword">for</span> ret <span class="keyword">in</span> rets:</span><br><span class="line"> per_dict = {}</span><br><span class="line"> <span class="keyword">if</span> ret[<span class="number">1</span>] > <span class="number">0.5</span>: <span class="comment"># 只取置信度大于0.5的定位框</span></span><br><span class="line"> coordinate = ret[<span class="number">2</span>] <span class="comment"># ret[2]为定位器返回的归一化坐标(x,y,w,h)</span></span><br><span class="line"> center = (int(coordinate[<span class="number">0</span>]*<span class="number">344</span>), int(coordinate[<span class="number">1</span>]*<span class="number">384</span>)) <span class="comment">#汉字定位框中心点坐标</span></span><br><span class="line"> origin = (coordinate[<span class="number">0</span>] - coordinate[<span class="number">2</span>]/<span class="number">2</span>, </span><br><span class="line"> coordinate[<span class="number">1</span>] - coordinate[<span class="number">3</span>]/<span class="number">2</span>) <span class="comment"># 
top-left corner of the box (normalized)</span></span><br><span class="line">            <span class="comment"># expand the box slightly (2 px up and left, 4 px down and right) so the whole character is cropped</span></span><br><span class="line">            x = int(origin[<span class="number">0</span>]*<span class="number">344</span> - <span class="number">2</span>)</span><br><span class="line">            x_plus_w = int((origin[<span class="number">0</span>] + coordinate[<span class="number">2</span>])*<span class="number">344</span> + <span class="number">4</span>)</span><br><span class="line">            y = int(origin[<span class="number">1</span>]*<span class="number">384</span> - <span class="number">2</span>)</span><br><span class="line">            y_plus_h = int((origin[<span class="number">1</span>] + coordinate[<span class="number">3</span>])*<span class="number">384</span> + <span class="number">4</span>)</span><br><span class="line">            <span class="comment"># the expanded box may cross the image border (e.g. a character right at the edge); fix() clamps it</span></span><br><span class="line">            x, y, x_plus_w, y_plus_h = fix(x, y, x_plus_w, y_plus_h)</span><br><span class="line">            <span class="comment"># crop the character and save it</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                hanzi_img = img[y:y_plus_h, x:x_plus_w]  <span class="comment"># crop</span></span><br><span class="line">                normal_img = cv2.resize(hanzi_img, (<span class="number">65</span>, <span class="number">65</span>),</span><br><span class="line">                                        interpolation=cv2.INTER_CUBIC)  <span class="comment"># normalize the crop to 65*65*3</span></span><br><span class="line">                path = <span class="string">'hanzi_img/{}_label.jpg'</span>.format(timestamp())  <span class="comment"># timestamp() is a helper not shown here</span></span><br><span class="line">                cv2.imwrite(path, normal_img)</span><br><span class="line">                per_dict[path] = center</span><br><span class="line">                hanzi_list.append(per_dict)</span><br><span class="line">            <span class="keyword">except</span> Exception:</span><br><span class="line">                print(<span class="string">'#'</span>*<span class="number">20</span>)</span><br><span class="line">                print(<span class="string">'irregular image encountered'</span>)</span><br><span class="line">    <span class="keyword">return</span> hanzi_list</span><br><span class="line"></span><br><span class="line"><span class="comment"># clamp the box coordinates: any value past the image border is set to the border</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">fix</span><span class="params">(x, y, x_plus_w, y_plus_h)</span>:</span></span><br><span class="line">    x = <span class="number">0</span> <span class="keyword">if</span> x < <span class="number">0</span> <span class="keyword">else</span> x</span><br><span class="line">    y = <span class="number">0</span> <span class="keyword">if</span> y < <span class="number">0</span> <span class="keyword">else</span> y</span><br><span class="line">    x_plus_w = <span class="number">344</span> <span class="keyword">if</span> x_plus_w > <span class="number">344</span> <span class="keyword">else</span> x_plus_w  <span class="comment"># image width</span></span><br><span class="line">    y_plus_h = <span class="number">384</span> <span class="keyword">if</span> y_plus_h > <span class="number">384</span> <span class="keyword">else</span> y_plus_h  <span class="comment"># image height</span></span><br><span class="line">    <span class="keyword">return</span> x, y, x_plus_w, y_plus_h</span><br></pre></td></tr></table></figure><p>The <code>seg_one_img</code> function segments the characters of one CAPTCHA image. The cropped character images are saved to the <code>hanzi_img</code> folder under the current directory, and the function returns a list of dictionaries (key: path of the character image, value: its center coordinates). Note that the boxes returned by the localization interface are normalized and must be converted to pixel coordinates. The CAPTCHA image is 344 × 384 × 3 (width × height × channels, matching the scale factors in the code), so for example (0.25, 0.75) maps to (0.25 × 344, 0.75 × 384) = (86, 288).</p><p><strong>Outline of the algorithm:</strong></p><p>Segment one image (image path, boxes returned by the localization interface):</p><pre><code>for each detection box with confidence above 0.5:
    compute the center and the top-left corner of the box;
    expand the box by a few pixels on every side;
    clamp any coordinate that falls outside the image;
    crop the character;</code></pre><p>The box is expanded so that the whole character is cropped whenever possible. In testing, some boxes were located correctly but had a fairly low IoU, which means a small part of the character could fall outside the box. Expanding the box gives the later recognition step a better input.</p><h2 id="语序识别算法"><a href="#语序识别算法" class="headerlink" 
title="语序识别算法"></a>Word-Order Recognition Algorithm</h2><p>The word-order recognition algorithm combines two functions, <strong>word-order recognition with jieba segmentation</strong> and <strong>word-order recognition with a search engine</strong>, which are explained in turn below.</p><h3 id="使用结巴分词识别语序"><a href="#使用结巴分词识别语序" class="headerlink" title="使用结巴分词识别语序"></a>Word-Order Recognition with jieba Segmentation</h3><p>This part uses <code>jieba</code>, a Python library for Chinese word segmentation. For background, read the <a href="https://github.com/fxsjy/jieba" target="_blank" rel="noopener">jieba documentation on GitHub</a> first; the code below shows how jieba is used to recognize word order.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span 
class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> itertools <span class="keyword">import</span> permutations</span><br><span class="line"><span class="keyword">import</span> jieba</span><br><span class="line"></span><br><span class="line"><span class="comment"># recognize word order with jieba segmentation</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">recog_order_jieba</span><span class="params">(str)</span>:</span></span><br><span class="line">    l = len(str)  <span class="comment"># number of characters in the input string</span></span><br><span class="line">    word_list = _permutation(str)  <span class="comment"># all permutations of the input</span></span><br><span class="line">    possible_words = []  <span class="comment"># candidates whose word order may be correct</span></span><br><span class="line">    <span class="keyword">for</span> word <span class="keyword">in</span> word_list:  <span class="comment"># iterate over all permutations</span></span><br><span class="line">        seg_list = jieba.lcut(word, cut_all=<span class="keyword">True</span>)  <span class="comment"># segment this permutation with jieba (full mode)</span></span><br><span class="line">        index = find_longest(seg_list)  <span class="comment"># index of the longest string in the segmentation result</span></span><br><span class="line">        <span class="keyword">if</span> len(seg_list[index]) == l:  <span class="comment"># the longest segment is as long as the input, so keep it as a candidate</span></span><br><span class="line">            possible_words.append(seg_list[index])</span><br><span class="line">    <span class="keyword">if</span> len(possible_words) == <span class="number">1</span>:  <span class="comment"># exactly one candidate after the loop: it is the answer</span></span><br><span class="line">        <span class="keyword">return</span> possible_words[<span class="number">0</span>]</span><br><span class="line">    <span class="keyword">elif</span> len(possible_words) > <span class="number">1</span>:  <span class="comment"># 
several candidates: return the one with the highest word frequency</span></span><br><span class="line">        <span class="keyword">return</span> highest_frequency(possible_words)</span><br><span class="line">    <span class="keyword">else</span>:  <span class="comment"># no candidate at all: return 0</span></span><br><span class="line">        <span class="keyword">return</span> <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># all permutations of the input characters</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">_permutation</span><span class="params">(str, r=None)</span>:</span></span><br><span class="line">    word_list = list(permutations(str, r))</span><br><span class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> range(len(word_list)):</span><br><span class="line">        word_list[i] = <span class="string">''</span>.join(word_list[i])</span><br><span class="line">    <span class="keyword">return</span> word_list</span><br><span class="line"></span><br><span class="line"><span class="comment"># index of the longest word in a list</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">find_longest</span><span class="params">(list)</span>:</span></span><br><span class="line">    l = <span class="number">0</span></span><br><span class="line">    index = <span class="number">0</span></span><br><span class="line">    <span class="keyword">for</span> i, word <span class="keyword">in</span> enumerate(list):</span><br><span class="line">        <span class="keyword">if</span> len(word) > l:</span><br><span class="line">            l = len(word)</span><br><span class="line">            index = i</span><br><span class="line">    <span class="keyword">return</span> index</span><br><span class="line"></span><br><span class="line"><span class="comment"># given a list of words, return the one with the highest frequency in the jieba dictionary</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">highest_frequency</span><span 
class="params">(possible_words)</span>:</span></span><br><span class="line">    word_dict = file2dict(<span class="string">'dict.txt'</span>)</span><br><span class="line">    possible_dict = {}  <span class="comment"># key: word frequency, value: word</span></span><br><span class="line">    <span class="keyword">for</span> possible_word <span class="keyword">in</span> possible_words:</span><br><span class="line">        possible_dict[word_dict[possible_word]] = possible_word</span><br><span class="line">    sortedList = sortedDictValues(possible_dict)</span><br><span class="line">    print(sortedList)</span><br><span class="line">    <span class="keyword">return</span> sortedList[<span class="number">-1</span>][<span class="number">1</span>]</span><br><span class="line"></span><br><span class="line"><span class="comment"># sort the input dictionary by key and return (key, value) pairs</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">sortedDictValues</span><span class="params">(di)</span>:</span></span><br><span class="line">    <span class="keyword">return</span> [(k, di[k]) <span class="keyword">for</span> k <span class="keyword">in</span> sorted(di.keys())]</span><br><span class="line"></span><br><span class="line"><span class="comment"># load a dictionary file into a dict</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">file2dict</span><span class="params">(filename)</span>:</span></span><br><span class="line">    <span class="keyword">with</span> open(filename, encoding=<span class="string">'utf-8'</span>) <span class="keyword">as</span> f:</span><br><span class="line">        array_lines = f.readlines()</span><br><span class="line">    returnDict = {}</span><br><span class="line">    <span class="comment"># parse each line into word -> frequency</span></span><br><span class="line">    <span class="keyword">for</span> line <span class="keyword">in</span> array_lines:</span><br><span class="line">        line = line.strip()</span><br><span class="line">        listFromLine = line.split()</span><br><span class="line">        returnDict[listFromLine[<span class="number">0</span>]] = int(listFromLine[<span 
class="number">1</span>])</span><br><span class="line">    <span class="keyword">return</span> returnDict</span><br></pre></td></tr></table></figure><p>A concrete example illustrates the algorithm:</p><p>Input: ‘到马功成’</p><ol><li>Get the length of the string:</li></ol><p><code>l = 4</code></p><ol start="2"><li><p>Get all permutations of the string:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[<span class="string">'到马功成'</span>, <span class="string">'到马成功'</span>, <span class="string">'到功马成'</span>, <span class="string">'到功成马'</span>, <span class="string">'到成马功'</span>, <span class="string">'到成功马'</span>, <span class="string">'马到功成'</span>, <span class="string">'马到成功'</span>, <span class="string">'马功到成'</span>, <span class="string">'马功成到'</span>, <span class="string">'马成到功'</span>, <span class="string">'马成功到'</span>, <span class="string">'功到马成'</span>, <span class="string">'功到成马'</span>, <span class="string">'功马到成'</span>, <span class="string">'功马成到'</span>, <span class="string">'功成到马'</span>, <span class="string">'功成马到'</span>, <span class="string">'成到马功'</span>, <span class="string">'成到功马'</span>, <span class="string">'成马到功'</span>, <span class="string">'成马功到'</span>, <span class="string">'成功到马'</span>, <span class="string">'成功马到'</span>]</span><br></pre></td></tr></table></figure></li><li><p>Segment each permutation with jieba, and print the index of the longest element in the result:</p></li></ol> <figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line">[<span class="string">'到'</span>, <span class="string">'马'</span>, <span class="string">'功'</span>, <span class="string">'成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'到'</span>, <span class="string">'马'</span>, <span class="string">'成功'</span>]</span><br><span class="line"><span class="number">2</span></span><br><span class="line">[<span class="string">'到'</span>, <span class="string">'功'</span>, <span class="string">'马'</span>, <span class="string">'成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'到'</span>, <span class="string">'功'</span>, <span class="string">'成'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'到'</span>, <span class="string">'成'</span>, <span class="string">'马'</span>, <span 
class="string">'功'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'到'</span>, <span class="string">'成功'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">1</span></span><br><span class="line">[<span class="string">'马到功成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'马到成功'</span>, <span class="string">'成功'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'马'</span>, <span class="string">'功'</span>, <span class="string">'到'</span>, <span class="string">'成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'马'</span>, <span class="string">'功'</span>, <span class="string">'成'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'马'</span>, <span class="string">'成'</span>, <span class="string">'到'</span>, <span class="string">'功'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'马'</span>, <span class="string">'成功'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span class="number">1</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'到'</span>, <span class="string">'马'</span>, <span class="string">'成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'到'</span>, <span class="string">'成'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'马'</span>, <span class="string">'到'</span>, <span 
class="string">'成'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'马'</span>, <span class="string">'成'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'成'</span>, <span class="string">'到'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'功'</span>, <span class="string">'成'</span>, <span class="string">'马'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成'</span>, <span class="string">'到'</span>, <span class="string">'马'</span>, <span class="string">'功'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成'</span>, <span class="string">'到'</span>, <span class="string">'功'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成'</span>, <span class="string">'马'</span>, <span class="string">'到'</span>, <span class="string">'功'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成'</span>, <span class="string">'马'</span>, <span class="string">'功'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成功'</span>, <span class="string">'到'</span>, <span class="string">'马'</span>]</span><br><span class="line"><span class="number">0</span></span><br><span class="line">[<span class="string">'成功'</span>, <span class="string">'马'</span>, <span class="string">'到'</span>]</span><br><span class="line"><span 
class="number">0</span></span><br></pre></td></tr></table></figure><ol start="4"><li><p>After the loop, the segments of length l = 4 have been added to the possible_words list:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">['马到功成', '马到成功'] # the possible_words list</span><br></pre></td></tr></table></figure></li><li><p>Two candidates now have a possibly correct word order. Every entry in the jieba dictionary carries a word frequency, for example:</p></li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">马利诺夫斯基 3 nrt</span><br><span class="line">马到功成 3 i</span><br><span class="line">马到成功 313 i</span><br><span class="line">马刺进 2 nr</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p>The second field of each line is the word frequency, so comparing frequencies (313 for 马到成功 versus 3 for 马到功成) settles the final word order:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">马到成功</span><br></pre></td></tr></table></figure><h3 id="使用搜索引擎识别语序"><a href="#使用搜索引擎识别语序" class="headerlink" title="使用搜索引擎识别语序"></a>Word-Order Recognition with a Search Engine</h3><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> threading</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">all_related = []  <span class="comment"># module-level list shared by the search threads; it must exist before search() runs</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># query the search engine for a keyword and return the list of related words</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">search_engine</span><span class="params">(word)</span>:</span></span><br><span class="line">    headers = {</span><br><span class="line">        <span class="string">'User-Agent'</span>: <span class="string">'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'</span></span><br><span class="line">    }</span><br><span class="line">    r = requests.get(<span class="string">'https://www.baidu.com/s?wd='</span> + word, headers=headers)</span><br><span class="line">    html = etree.HTML(r.text)</span><br><span class="line">    related_words1 = html.xpath(<span class="string">'//*[@id="rs"]/table//tr//th/a/text()'</span>)  <span class="comment"># words in the related-searches block</span></span><br><span class="line">    related_words2 = html.xpath(<span class="string">'//div[@id="content_left"]//a//em/text()'</span>)  <span class="comment"># highlighted words in the result entries</span></span><br><span class="line">    related_words = related_words1 + related_words2</span><br><span class="line">    <span class="keyword">return</span> 
related_words</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># thread target: run one Baidu search for the given string and collect the related words</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">search</span><span class="params">(word)</span>:</span></span><br><span class="line">    related_words = search_engine(word)</span><br><span class="line">    <span class="keyword">global</span> all_related</span><br><span class="line">    all_related.extend(related_words)  <span class="comment"># extend in place so concurrent threads do not overwrite each other's results</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># recognize word order with the search engine</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">search_engine_recog</span><span class="params">(str)</span>:</span></span><br><span class="line">    word_list = _permutation(str)  <span class="comment"># all permutations</span></span><br><span class="line">    <span class="keyword">global</span> flags</span><br><span class="line">    flags = [<span class="number">0</span>] * len(word_list)  <span class="comment"># one counter per permutation</span></span><br><span class="line">    threads = []</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> word <span class="keyword">in</span> word_list:  <span class="comment"># one Baidu search per permutation</span></span><br><span class="line">        thread = threading.Thread(target=search, args=[word])</span><br><span class="line">        threads.append(thread)</span><br><span class="line">        thread.start()</span><br><span class="line">    <span class="keyword">for</span> thread <span class="keyword">in</span> threads:</span><br><span class="line">        thread.join()</span><br><span class="line">    <span class="keyword">global</span> all_related  <span class="comment"># results of the Baidu searches for all permutations</span></span><br><span class="line">    <span class="keyword">for</span> i, word <span class="keyword">in</span> enumerate(word_list):  <span class="comment"># iterate over all permutations</span></span><br><span 
class="line">        flag = <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> related_word <span class="keyword">in</span> all_related:  <span class="comment"># count how many related words contain this permutation</span></span><br><span class="line">            <span class="keyword">if</span> word <span class="keyword">in</span> related_word:</span><br><span class="line">                flag = flag + <span class="number">1</span></span><br><span class="line">        flags[i] = flag</span><br><span class="line">    all_related = []  <span class="comment"># reset for the next call</span></span><br><span class="line">    index = flags.index(max(flags))  <span class="comment"># index of the largest count</span></span><br><span class="line">    <span class="keyword">return</span> word_list[index]</span><br></pre></td></tr></table></figure><p>Again, a concrete example illustrates the algorithm:</p><p>Input: ‘现无中意发’</p><ol><li>Get all permutations of the input string:</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[<span class="string">'现无中意发'</span>, <span class="string">'现无中发意'</span>, <span class="string">'现无意中发'</span>, <span class="string">'现无意发中'</span>, <span class="string">'现无发中意'</span>, <span class="string">'现无发意中'</span>, <span class="string">'现中无意发'</span>, <span class="string">'现中无发意'</span>, <span class="string">'现中意无发'</span>, <span class="string">'现中意发无'</span>, <span class="string">'现中发无意'</span>, <span class="string">'现中发意无'</span>, <span class="string">'现意无中发'</span>, <span class="string">'现意无发中'</span>, <span class="string">'现意中无发'</span>, <span class="string">'现意中发无'</span>, <span class="string">'现意发无中'</span>, <span class="string">'现意发中无'</span>, <span class="string">'现发无中意'</span>, <span class="string">'现发无意中'</span>, <span class="string">'现发中无意'</span>, <span class="string">'现发中意无'</span>, <span class="string">'现发意无中'</span>, <span class="string">'现发意中无'</span>, <span class="string">'无现中意发'</span>, <span class="string">'无现中发意'</span>, <span class="string">'无现意中发'</span>, 
<span class="string">'无现意发中'</span>, <span class="string">'无现发中意'</span>, <span class="string">'无现发意中'</span>, <span class="string">'无中现意发'</span>, <span class="string">'无中现发意'</span>, <span class="string">'无中意现发'</span>, <span class="string">'无中意发现'</span>, <span class="string">'无中发现意'</span>, <span class="string">'无中发意现'</span>, <span class="string">'无意现中发'</span>, <span class="string">'无意现发中'</span>, <span class="string">'无意中现发'</span>, <span class="string">'无意中发现'</span>, <span class="string">'无意发现中'</span>, <span class="string">'无意发中现'</span>, <span class="string">'无发现中意'</span>, <span class="string">'无发现意中'</span>, <span class="string">'无发中现意'</span>, <span class="string">'无发中意现'</span>, <span class="string">'无发意现中'</span>, <span class="string">'无发意中现'</span>, <span class="string">'中现无意发'</span>, <span class="string">'中现无发意'</span>, <span class="string">'中现意无发'</span>, <span class="string">'中现意发无'</span>, <span class="string">'中现发无意'</span>, <span class="string">'中现发意无'</span>, <span class="string">'中无现意发'</span>, <span class="string">'中无现发意'</span>, <span class="string">'中无意现发'</span>, <span class="string">'中无意发现'</span>, <span class="string">'中无发现意'</span>, <span class="string">'中无发意现'</span>, <span class="string">'中意现无发'</span>, <span class="string">'中意现发无'</span>, <span class="string">'中意无现发'</span>, <span class="string">'中意无发现'</span>, <span class="string">'中意发现无'</span>, <span class="string">'中意发无现'</span>, <span class="string">'中发现无意'</span>, <span class="string">'中发现意无'</span>, <span class="string">'中发无现意'</span>, <span class="string">'中发无意现'</span>, <span class="string">'中发意现无'</span>, <span class="string">'中发意无现'</span>, <span class="string">'意现无中发'</span>, <span class="string">'意现无发中'</span>, <span class="string">'意现中无发'</span>, <span class="string">'意现中发无'</span>, <span class="string">'意现发无中'</span>, <span class="string">'意现发中无'</span>, <span class="string">'意无现中发'</span>, <span class="string">'意无现发中'</span>, <span class="string">'意无中现发'</span>, 
<span class="string">'意无中发现'</span>, <span class="string">'意无发现中'</span>, <span class="string">'意无发中现'</span>, <span class="string">'意中现无发'</span>, <span class="string">'意中现发无'</span>, <span class="string">'意中无现发'</span>, <span class="string">'意中无发现'</span>, <span class="string">'意中发现无'</span>, <span class="string">'意中发无现'</span>, <span class="string">'意发现无中'</span>, <span class="string">'意发现中无'</span>, <span class="string">'意发无现中'</span>, <span class="string">'意发无中现'</span>, <span class="string">'意发中现无'</span>, <span class="string">'意发中无现'</span>, <span class="string">'发现无中意'</span>, <span class="string">'发现无意中'</span>, <span class="string">'发现中无意'</span>, <span class="string">'发现中意无'</span>, <span class="string">'发现意无中'</span>, <span class="string">'发现意中无'</span>, <span class="string">'发无现中意'</span>, <span class="string">'发无现意中'</span>, <span class="string">'发无中现意'</span>, <span class="string">'发无中意现'</span>, <span class="string">'发无意现中'</span>, <span class="string">'发无意中现'</span>, <span class="string">'发中现无意'</span>, <span class="string">'发中现意无'</span>, <span class="string">'发中无现意'</span>, <span class="string">'发中无意现'</span>, <span class="string">'发中意现无'</span>, <span class="string">'发中意无现'</span>, <span class="string">'发意现无中'</span>, <span class="string">'发意现中无'</span>, <span class="string">'发意无现中'</span>, <span class="string">'发意无中现'</span>, <span class="string">'发意中现无'</span>, <span class="string">'发意中无现'</span>]</span><br></pre></td></tr></table></figure><ol start="2"><li><p>Run a Baidu search for every permutation and collect the related words.</p><p>The Baidu search is implemented with a crawler, which scrapes two kinds of nodes: (1) the words highlighted in red in each search result entry, and (2) the words in the related-searches block at the bottom of each results page.</p></li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">[<span class="string">'中意隆鑫航发基地'</span>, <span class="string">'我对你中意红包怎么发'</span>, <span class="string">'中意保险几号发工资'</span>, 
<span class="string">'当时不知曲中意现已成为曲中人'</span>, <span class="string">'中意空调现E5怎么办'</span>, <span class="string">'初闻不知曲中意现已成为曲中人'</span>, <span class="string">'中意'</span>, <span class="string">'中意在线'</span>, <span class="string">'初闻不知曲中意,再听已是曲中人'</span>, <span class="string">'发中意'</span>, <span class="string">'没有中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>,<span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中意隆鑫航发基地'</span>, <span class="string">'我对你中意红包怎么发'</span>, <span class="string">'中意保险几号发工资'</span>, <span class="string">'当时不知曲中意现已成为曲中人'</span>, <span class="string">'中意空调现E5怎么办'</span>, <span class="string">'初闻不知曲中意现已成为曲中人'</span>, <span class="string">'中意'</span>, <span class="string">'中意在线'</span>, <span class="string">'初闻不知曲中意,再听已是曲中人'</span>, <span class="string">'现发中意无'</span>,<span class="string">'没有中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中发发型'</span>, <span class="string">'中发'</span>, <span class="string">'中发卷发'</span>, <span class="string">'中发烫发'</span>, <span class="string">'中发发型图片'</span>, <span class="string">'中发图片'</span>, <span class="string">'中发编发'</span>, <span class="string">'中发烫发图片'</span>, <span class="string">'中发白'</span>, <span 
class="string">'中发无意现'</span>, <span class="string">'中发'</span>, <span class="string">'中发'</span>, <span class="string">'无意'</span>, <span class="string">'中发'</span>, <span class="string">'中发'</span>, <span class="string">'中发'</span>, <span class="string">'中发'</span>, <span class="string">'中发'</span>, <span class="string">'中意隆鑫航发基地'</span>, <span class="string">'我对你中意红包怎么发'</span>, <span class="string">'中意保险几号发工资'</span>, <span class="string">'当时不知曲中意现已成为曲中人'</span>, <span class="string">'中意空调现E5怎么办'</span>, <span class="string">'初闻不知你'</span>, <span class="string">'中意在线用户登录'</span>, <span class="string">'我只中意你'</span>, <span class="string">'中意保险可靠吗'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'无中意'</span>, <span class="string">'无中意'</span>, <span class="string">'中意'</span>, <span class="string">'发现'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'无中意'</span>, <span class="string">'中意'</span>, <span class="string">'无中意'</span>, <span class="string">'意什么什么发'</span>, <span class="string">'意()()发'</span>, <span class="string">'发现的近意词是什么'</span>, <span class="string">'发现的进意词'</span>, <span class="string">'发现意'</span>, <span class="string">'微信里的发现是什么意是'</span>, <span class="string">'意料之中什么意思'</span>, <span class="string">'中译意'</span>, <span class="string">'发现'</span>, <span class="string">'无意中发现'</span>, </span><br><span class="line"> ......................................................................................</span><br><span class="line"> ......................all_realated列表比较长,中间部分的词省略............................</span><br><span class="line"> .....................................................................................</span><br><span class="line"> <span class="string">'无意'</span>, <span class="string">'发现'</span>, <span class="string">'无意中发现'</span>, <span class="string">'无意中发现'</span>, <span 
class="string">'无意中'</span>, <span class="string">'无意中发现'</span>, <span class="string">'初闻不知曲中意现已成为曲中人'</span>, <span class="string">'中意'</span>, <span class="string">'中意在线'</span>, <span class="string">'初闻不知曲中意,再听已是曲中人'</span>, <span class="string">'发中意无'</span>, <span class="string">'没有中意'</span>, <span class="string">'中意'</span>, <span class="string">'无中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'没有中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意'</span>, <span class="string">'中意'</span>, <span class="string">'没有'</span>, <span class="string">'中意隆鑫航发基地'</span>, <span class="string">'我对你中意红包怎么发'</span>, <span class="string">'中意保险几号发工资'</span>, <span class="string">'当时不知曲中意现已成为曲中人'</span>, <span class="string">'中意空调现E5怎么办'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'意无限'</span>, <span class="string">'有意瞄准无意击发'</span>, <span class="string">'无意击发'</span>, <span class="string">'有意瞄准无意击发意思'</span>, <span class="string">'有意激无意发'</span>, <span class="string">'无意和别人换了鞋有什么说发'</span>, <span class="string">'无意发了视频'</span>, <span class="string">'有意瞄准,无意击发解释'</span>, <span class="string">'有意瞄准无意击发柴静'</span>, <span class="string">'无意栽花犹发蕊'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'中无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'发发'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'中无意'</span>, <span class="string">'无意'</span>, <span class="string">'意什么什么发'</span>, <span class="string">'意()()发'</span>, <span class="string">'发意生是什么意思'</span>, <span 
class="string">'发意症'</span>, <span class="string">'意发'</span>, <span class="string">'意发游戏'</span>, <span class="string">'意什么发成语接龙'</span>, <span class="string">'向发意'</span>, <span class="string">'意发股份'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'无意中'</span>, <span class="string">'意什么什么发'</span>, <span class="string">'意()()发'</span>, <span class="string">'发意生是什么意思'</span>, <span class="string">'发意症'</span>, <span class="string">'意发'</span>, <span class="string">'意发游戏'</span>, <span class="string">'意什么发成语接龙'</span>, <span class="string">'向发意'</span>, <span class="string">'意发股份'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'没有'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'无意'</span>, <span class="string">'发发无意'</span>, <span class="string">'无意'</span>, <span class="string">'意什么什么发'</span>, <span class="string">'意()()发'</span>, <span class="string">'发意生是什么意思'</span>, <span class="string">'发意症'</span>, <span class="string">'意发'</span>, <span class="string">'意发游戏'</span>, <span class="string">'意什么发成语接龙'</span>, <span class="string">'向发意'</span>, <span class="string">'意发股份'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span 
class="string">'意中'</span>, <span class="string">'发意中现无'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意什么什么发'</span>, <span class="string">'意()()发'</span>, <span class="string">'发意生是什么意思'</span>, <span class="string">'发意症'</span>, <span class="string">'意发'</span>, <span class="string">'意发游戏'</span>, <span class="string">'意什么发成语接龙'</span>, <span class="string">'向发意'</span>, <span class="string">'意发股份'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>, <span class="string">'发意中无现'</span>, <span class="string">'意中'</span>, <span class="string">'意中'</span>]</span><br></pre></td></tr></table></figure><ol><li>通过一个嵌套循环来统计每一个排列在all_relalated列表中出现的次数(排列是列表元素的子串)。</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">9</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span 
class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">125</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>,<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">7</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">12</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span 
class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">1</span>, <span class="number">1</span>]</span><br></pre></td></tr></table></figure><ol><li>找到标志位最大的索引,返回word_list列表中该索引值对应的排列。</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">无意中发现</span><br></pre></td></tr></table></figure><h2 id="完整破解程序"><a href="#完整破解程序" class="headerlink" title="完整破解程序"></a>完整破解程序</h2><h3 id="程序讲解"><a href="#程序讲解" class="headerlink" title="程序讲解"></a>程序讲解</h3><p>通过多次试验会发现,使用结巴分词识别语序和搜索引擎识别语序各有利弊,使用结巴分词的优点是速度很快,缺点是对于一些不是词语的语序识别会识别不出来。而搜索引擎识别语序,语序识别能力强,但是比较慢。所以在破解程序中我将二者结合了一下,充分使用了各自的优点。</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span 
class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="keyword">from</span> darknet <span class="keyword">import</span> load_net, load_meta, detect, classify, load_image</span><br><span class="line"><span class="keyword">from</span> segment <span class="keyword">import</span> seg_one_img, load_dtc_module</span><br><span 
class="line"><span class="keyword">from</span> recog_order <span class="keyword">import</span> search_engine_recog, recog_order_jieba</span><br><span class="line"><span class="keyword">import</span> time</span><br><span class="line"><span class="keyword">import</span> cv2</span><br><span class="line"><span class="keyword">from</span> PIL <span class="keyword">import</span> Image</span><br><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">import</span> copy</span><br><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">from</span> itertools <span class="keyword">import</span> permutations</span><br><span class="line"><span class="keyword">from</span> functools <span class="keyword">import</span> reduce</span><br><span class="line"></span><br><span class="line"><span class="comment"># 求多个列表的组合</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">combination</span><span class="params">(*lists)</span>:</span> </span><br><span class="line"> total = reduce(<span class="keyword">lambda</span> x, y: x * y, map(len, lists)) </span><br><span class="line"> retList = [] </span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> range(<span class="number">0</span>, total): </span><br><span class="line"> step = total </span><br><span class="line"> tempItem = [] </span><br><span class="line"> <span class="keyword">for</span> l <span class="keyword">in</span> lists: </span><br><span class="line"> step /= len(l) </span><br><span class="line"> tempItem.append(l[int(i/step % len(l))]) </span><br><span class="line"> retList.append(tuple(tempItem)) </span><br><span class="line"> <span class="keyword">return</span> retList </span><br><span class="line"></span><br><span class="line"><span class="comment"># 加载模块</span></span><br><span 
class="line"><span class="function"><span class="keyword">def</span> <span class="title">load_clasiify_module</span><span class="params">(cfg, weights, data)</span>:</span></span><br><span class="line"> net = load_net(cfg, weights, <span class="number">0</span>)</span><br><span class="line"> meta = load_meta(data)</span><br><span class="line"> <span class="keyword">return</span> net, meta </span><br><span class="line"></span><br><span class="line"><span class="comment"># 使用新字典记录坐标,注意字典是无序的!!</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">recordCoordinate</span><span class="params">(wordList, hanziList)</span>:</span></span><br><span class="line"> center = {}</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> range(len(wordList)):</span><br><span class="line"> center[wordList[i]] = [center <span class="keyword">for</span> center <span class="keyword">in</span> hanziList[i].values()][<span class="number">0</span>]</span><br><span class="line"> <span class="keyword">return</span> center</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># 破解函数</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">crack</span><span class="params">(img_path, dtc_modu, classify_modu, k)</span>:</span></span><br><span class="line"> <span class="comment"># 定位汉字,得到多个矩形框</span></span><br><span class="line"> print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'定位汉字'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line"> d = time.time()</span><br><span class="line"> rets = detect(dtc_modu[<span class="number">0</span>], dtc_modu[<span class="number">1</span>], img_path.encode()) </span><br><span class="line"> print(<span 
class="string">'定位汉字耗时{}'</span>.format(time.time() - d))</span><br><span class="line"> l = len(rets)</span><br><span class="line"> <span class="comment"># 设置阈值</span></span><br><span class="line"> <span class="keyword">if</span> l > k:</span><br><span class="line"> <span class="keyword">return</span> <span class="number">0</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 切割图片,得到切割后的汉字图片</span></span><br><span class="line"> print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'切割图片'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line"> s = time.time()</span><br><span class="line"> hanzi_list = seg_one_img(img_path, rets)</span><br><span class="line"> print(hanzi_list)</span><br><span class="line"> print(<span class="string">'切割图片耗时{}'</span>.format(time.time() - s))</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 汉字识别,得到汉字字符串</span></span><br><span class="line"> print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'汉字识别'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line"> r = time.time()</span><br><span class="line"> all_hanzi_lists = [] <span class="comment"># 存储所有汉字的列表</span></span><br><span class="line"> <span class="comment"># 提取路径存入列表</span></span><br><span class="line"> paths = []</span><br><span class="line"> <span class="keyword">for</span> per <span class="keyword">in</span> hanzi_list:</span><br><span class="line"> paths.extend([i <span class="keyword">for</span> i <span class="keyword">in</span> per.keys()])</span><br><span class="line"> </span><br><span class="line"> <span class="keyword">for</span> path <span class="keyword">in</span> paths: <span class="comment"># 对切割的汉字图片进行遍历</span></span><br><span 
class="line"> hanzis = []</span><br><span class="line"> img = load_image(path.encode(), <span class="number">0</span> , <span class="number">0</span>)</span><br><span class="line"> res = classify(classify_modu[<span class="number">0</span>], classify_modu[<span class="number">1</span>], img)</span><br><span class="line"> print(res[<span class="number">0</span>:<span class="number">5</span>])</span><br><span class="line"> <span class="keyword">if</span> res[<span class="number">0</span>][<span class="number">1</span>] < <span class="number">0.95</span>: <span class="comment"># 对置信度<0.95的汉字</span></span><br><span class="line"> <span class="keyword">for</span> hz <span class="keyword">in</span> res[<span class="number">0</span>:<span class="number">5</span>]: <span class="comment"># 对识别的top5进行遍历,此处可修改</span></span><br><span class="line"> hanzi = (<span class="string">'\\'</span> + hz[<span class="number">0</span>].decode(<span class="string">'utf-8'</span>)).encode(<span class="string">'utf-8'</span>).decode(<span class="string">'unicode_escape'</span>) </span><br><span class="line"> hanzis.append(hanzi)</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> hanzi = (<span class="string">'\\'</span> + res[<span class="number">0</span>][<span class="number">0</span>].decode(<span class="string">'utf-8'</span>)).encode(<span class="string">'utf-8'</span>).decode(<span class="string">'unicode_escape'</span>)</span><br><span class="line"> hanzis.append(hanzi)</span><br><span class="line"> all_hanzi_lists.append(hanzis) </span><br><span class="line"> print(all_hanzi_lists)</span><br><span class="line"> hanzi_combination = combination(*all_hanzi_lists)</span><br><span class="line"> hanzi_combination_connect = []</span><br><span class="line"> <span class="keyword">for</span> words <span class="keyword">in</span> hanzi_combination:</span><br><span class="line"> hanzi_combination_connect.append(<span 
class="string">''</span>.join(words))</span><br><span class="line"> print(hanzi_combination_connect)</span><br><span class="line"> print(<span class="string">'汉字识别耗时{}'</span>.format(time.time() - r))</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"> <span class="comment"># 识别语序</span></span><br><span class="line"> hanzi_center = []</span><br><span class="line"> jieba_flag = <span class="number">0</span></span><br><span class="line"> o = time.time()</span><br><span class="line"> print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'语序识别'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line"> <span class="keyword">for</span> words <span class="keyword">in</span> hanzi_combination_connect: <span class="comment"># 对每一个组合进行结巴分词</span></span><br><span class="line"> <span class="comment"># 此处对汉字的坐标进行记录</span></span><br><span class="line"> hanzi_center = recordCoordinate(words, hanzi_list)</span><br><span class="line"> print(hanzi_center, <span class="string">'jiaba'</span>)</span><br><span class="line"> o = time.time()</span><br><span class="line"> rec_word_possible = recog_order_jieba(words)</span><br><span class="line"> <span class="keyword">if</span> rec_word_possible: <span class="comment"># 如果遇到正确的词,则标志位置1</span></span><br><span class="line"> jieba_flag = <span class="number">1</span> </span><br><span class="line"> <span class="keyword">break</span></span><br><span class="line"> <span class="keyword">if</span> jieba_flag:</span><br><span class="line"> rec_word = rec_word_possible</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> hanzi_center = recordCoordinate(hanzi_combination_connect[<span class="number">0</span>], hanzi_list)</span><br><span class="line"> print(hanzi_center,<span class="string">'engine'</span>)</span><br><span class="line"> rec_word = 
search_engine_recog(hanzi_combination_connect[<span class="number">0</span>])</span><br><span class="line"> print(<span class="string">'语序识别结果:{}'</span>.format(rec_word))</span><br><span class="line"> print(<span class="string">'语序识别耗时{}'</span>.format(time.time() - o))</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Output the coordinates in the correct word order</span></span><br><span class="line"> print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'最终结果'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line"> centers = []</span><br><span class="line"> <span class="keyword">for</span> i <span class="keyword">in</span> rec_word:</span><br><span class="line"> centers.append(hanzi_center[i])</span><br><span class="line"> print(<span class="string">'正确语序的坐标:{}'</span>.format(centers))</span><br><span class="line"> print(<span class="string">'总耗时{}'</span>.format(time.time() - d))</span><br><span class="line"> print(rec_word)</span><br><span class="line"> <span class="keyword">return</span> centers</span><br></pre></td></tr></table></figure><p>As before, a worked example illustrates how the cracking program works:</p><p>Input:</p><p><img src="/2018/11/19/破解含语序问题的点击验证码/crack.jpg" alt="破解图片"></p><p>1. Locate the characters with the character detection model. This yields 4 bounding boxes; in each one, the first element is the class hanzi, the second is the confidence of the box, and the third is the normalized coordinates of the box (x, y, w, h).</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[(<span class="string">b'hanzi'</span>, <span class="number">0.8764635920524597</span>, (<span class="number">0.672152578830719</span>, <span class="number">0.355495423078537</span>, <span class="number">0.17341256141662598</span>, <span class="number">0.16976206004619598</span>)), (<span class="string">b'hanzi'</span>, <span class="number">0.8573136329650879</span>, (<span class="number">0.625790536403656</span>, <span class="number">0.7956624627113342</span>, <span 
class="number">0.15850003063678741</span>, <span class="number">0.13232673704624176</span>)), (<span class="string">b'hanzi'</span>, <span class="number">0.857090175151825</span>, (<span class="number">0.8480002284049988</span>, <span class="number">0.5595549941062927</span>, <span class="number">0.18965952098369598</span>, <span class="number">0.1373395025730133</span>)), (<span class="string">b'hanzi'</span>, <span class="number">0.8561009168624878</span>, (<span class="number">0.29499194025993347</span>, <span class="number">0.49679434299468994</span>, <span class="number">0.16142778098583221</span>, <span class="number">0.16253654658794403</span>))]</span><br></pre></td></tr></table></figure><p>2. Crop the image according to the bounding boxes and output a list of dictionaries (the <code>key</code> is the relative path of a character image and the <code>value</code> is the center coordinates of that character).</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[{<span class="string">'hanzi_img/15343037353537.jpg'</span>: (<span class="number">231</span>, <span class="number">136</span>)}, {<span class="string">'hanzi_img/15343037353541.jpg'</span>: (<span class="number">215</span>, <span class="number">305</span>)}, {<span class="string">'hanzi_img/15343037353543.jpg'</span>: (<span class="number">291</span>, <span class="number">214</span>)}, {<span class="string">'hanzi_img/15343037353546.jpg'</span>: (<span class="number">101</span>, <span class="number">190</span>)}]</span><br></pre></td></tr></table></figure><p>3. Recognize the characters with the character recognition model. Because the model sometimes gets a character wrong, we partly compensate by keeping the top-5 candidates for every character whose top-1 confidence is below 0.95, and then forming every combination of the candidates across the characters.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">[(<span class="string">b'u7269'</span>, <span class="number">0.9999469518661499</span>), (<span class="string">b'u7545'</span>, <span 
class="number">1.4645341252617072e-05</span>), (<span class="string">b'u626c'</span>, <span class="number">8.120928214339074e-06</span>), (<span class="string">b'u629b'</span>, <span class="number">6.87056399328867e-06</span>), (<span class="string">b'u573a'</span>, <span class="number">5.69164603803074e-06</span>)]</span><br><span class="line">[(<span class="string">b'u4e73'</span>, <span class="number">0.8303858637809753</span>), (<span class="string">b'u96c5'</span>, <span class="number">0.015525326132774353</span>), (<span class="string">b'u90e8'</span>, <span class="number">0.015043874271214008</span>), (<span class="string">b'u578b'</span>, <span class="number">0.008989457972347736</span>), (<span class="string">b'u64ad'</span>, <span class="number">0.008476710878312588</span>)]</span><br><span class="line">[(<span class="string">b'u52a8'</span>, <span class="number">0.9996626377105713</span>), (<span class="string">b'u5e7c'</span>, <span class="number">7.360828021774068e-05</span>), (<span class="string">b'u7ead'</span>, <span class="number">3.684992407215759e-05</span>), (<span class="string">b'u9645'</span>, <span class="number">2.4325390768353827e-05</span>), (<span class="string">b'u529f'</span>, <span class="number">2.07898483495228e-05</span>)]</span><br><span class="line">[(<span class="string">b'u901a'</span>, <span class="number">0.5683982372283936</span>), (<span class="string">b'u57d4'</span>, <span class="number">0.09509918093681335</span>), (<span class="string">b'u94fa'</span>, <span class="number">0.0750967487692833</span>), (<span class="string">b'u54fa'</span>, <span class="number">0.02881033532321453</span>),(<span class="string">b'u5398'</span>, <span class="number">0.009910903871059418</span>)]</span><br></pre></td></tr></table></figure><p>The second and fourth characters have a top-1 confidence below 0.95 (about 0.83 and 0.57), so we keep the top-5 candidates for each of them.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">[[<span class="string">'物'</span>], [<span class="string">'乳'</span>, <span class="string">'雅'</span>, <span class="string">'部'</span>, <span class="string">'型'</span>, <span class="string">'播'</span>], [<span class="string">'动'</span>], [<span class="string">'通'</span>, <span class="string">'埔'</span>, <span class="string">'铺'</span>, <span class="string">'哺'</span>, <span class="string">'厘'</span>]]</span><br></pre></td></tr></table></figure><p>If we kept only the highest-confidence candidate for each character, the recognition result would be <code>物,乳,动,通</code>, which is wrong: the fourth character should be 哺, which appears only among the top-5 candidates. That is why we keep the top 5 for low-confidence characters.</p><p>Combining the four candidate lists gives 1 × 5 × 1 × 5 = 25 combinations:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[<span class="string">'物乳动通'</span>, <span class="string">'物乳动埔'</span>, <span class="string">'物乳动铺'</span>, <span class="string">'物乳动哺'</span>, <span class="string">'物乳动厘'</span>, <span class="string">'物雅动通'</span>, <span class="string">'物雅动埔'</span>, <span class="string">'物雅动铺'</span>, <span class="string">'物雅动哺'</span>, <span class="string">'物雅动厘'</span>, <span class="string">'物部动通'</span>, <span class="string">'物部动埔'</span>, <span class="string">'物部动铺'</span>, <span class="string">'物部动哺'</span>, <span class="string">'物部动厘'</span>, <span class="string">'物型动通'</span>, <span class="string">'物型动埔'</span>, <span class="string">'物型动铺'</span>, <span class="string">'物型动哺'</span>, <span class="string">'物型动厘'</span>, <span class="string">'物播动通'</span>, <span class="string">'物播动埔'</span>, <span class="string">'物播动铺'</span>, <span class="string">'物播动哺'</span>, <span class="string">'物播动厘'</span>]</span><br></pre></td></tr></table></figure><p>4. Recognize the word order by combining jieba word segmentation with the search engine. Iterate over the combinations from the previous step and try jieba first; as soon as jieba recognizes one of them, return that result directly. If jieba recognizes none of them, run the search engine only on the highest-confidence combination and return its result. In this example jieba recognizes the word correctly and returns:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">哺乳动物</span><br></pre></td></tr></table></figure><p>5. Return the coordinates of the corresponding characters.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">[(101, 190), (215, 305), (291, 214), (231, 136)]</span><br></pre></td></tr></table></figure><p><strong>Summary:</strong> jieba needs roughly 0.000xx s to check the word order of one combination (the exact figure depends on the machine), so iterating over many combinations is cheap. A search-engine lookup, by contrast, takes 1-15 s per combination (depending on the number of characters), so when jieba fails, only the highest-confidence combination is sent to the search engine. Overall this works well: most CAPTCHAs are resolved in under 1 s, and the few that need the search engine take 2-16 s, again depending on the number of characters.</p><h3 id="测试正确率"><a href="#测试正确率" class="headerlink" title="测试正确率"></a>Testing accuracy</h3><p><strong>Files needed:</strong></p><p>The <code>python/valid</code> directory and the <code>python/valid.txt</code> file: <code>python/valid</code> holds the CAPTCHA images, and <code>python/valid.txt</code> lists each image file name together with its correct word order, for example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">verifyCode1531875004.jpg--研究委员会</span><br><span class="line">verifyCode1531873597.jpg--求才若渴</span><br><span class="line">verifyCode1531874810.jpg--不久的将来</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p><strong>Test script:</strong></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Load the character detection model</span></span><br><span class="line">print(<span class="string">'\n'</span>*<span class="number">2</span> + <span class="string">'Loading models'</span> + <span class="string">'\n'</span> + <span class="string">'*'</span>*<span class="number">80</span>)</span><br><span class="line">dtc_modu = load_dtc_module(<span class="string">b'../cfg/yolo-origin.cfg'</span>,</span><br><span class="line">    <span class="string">b'../jiyan/backup/yolo-origin_49200.weights'</span>, <span class="string">b'../cfg/yolo-origin.data'</span>)</span><br><span class="line"><span class="comment"># Load the character classification model</span></span><br><span class="line">classify_modu = load_clasiify_module(<span class="string">b"../cfg/chinese_character.cfg"</span>,</span><br><span class="line">    <span class="string">b"../chinese_classify/backup/chinese_character.backup"</span>, <span class="string">b"../cfg/chinese.data"</span>)</span><br><span class="line">cwd = os.getcwd()</span><br><span class="line">IMG_DIR = cwd.replace(<span class="string">"python"</span>, <span class="string">"python/valid/"</span>)</span><br><span class="line"><span class="keyword">with</span> open(<span class="string">'valid.txt'</span>) <span class="keyword">as</span> f:</span><br><span class="line">    lines = f.readlines()</span><br><span class="line">right = <span class="number">0</span></span><br><span class="line">num = len(lines)</span><br><span class="line"><span class="keyword">for</span> line <span class="keyword">in</span> lines:</span><br><span class="line">    line = line.strip()</span><br><span class="line">    rec_word = crack(IMG_DIR + line[:<span class="number">24</span>], dtc_modu, classify_modu, <span class="number">5</span>)  <span class="comment"># the file name is the first 24 characters</span></span><br><span class="line">    <span class="keyword">if</span> rec_word == line[<span class="number">26</span>:]:  <span class="comment"># the expected phrase follows the "--" separator</span></span><br><span class="line">        right = right + <span class="number">1</span></span><br><span class="line">    <span class="keyword">elif</span> rec_word == <span class="number">0</span>:</span><br><span class="line">        num = num - <span class="number">1</span></span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        print(<span class="string">'#'</span>*<span class="number">20</span> + line[<span class="number">26</span>:]+<span class="string">' '</span> + rec_word)</span><br><span class="line">print(<span class="string">'accuracy={}'</span>.format(right/num))</span><br></pre></td></tr></table></figure><p><strong>Note:</strong> to measure accuracy, change the final <code>return centers</code> of the cracking interface to <code>return rec_word</code>.</p><hr><h1 id="模型训练文档"><a href="#模型训练文档" class="headerlink" title="模型训练文档"></a>Model training guide</h1><p>This part explains how to work with the <strong>detector</strong> and the <strong>classifier</strong>, covering <strong>training data preparation</strong>, <strong>model training</strong>, and <strong>evaluation of the training results</strong>.</p><h2 id="依赖"><a href="#依赖" class="headerlink" title="依赖"></a>Dependencies</h2><ol><li>python3.6</li><li>opencv3</li><li>numpy</li></ol><h2 id="文件结构"><a href="#文件结构" class="headerlink" title="文件结构"></a>File structure</h2><p>The tree below lists only the more important files, each annotated with its purpose.<br><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br></pre></td><td class="code"><pre><span class="line">.</span><br><span class="line">├── Makefile  # darknet build configuration</span><br><span class="line">├── README.md</span><br><span class="line">├── cfg</span><br><span class="line">│   ├── chinese.data  # classifier training config</span><br><span class="line">│   ├── chinese_character.cfg  # classifier network config</span><br><span class="line">│   ├── yolo-origin.cfg  # YOLOv2 detector network config</span><br><span class="line">│   ├── yolo-origin.data  # YOLOv2 detector training config</span><br><span class="line">│   ├── yolov3.cfg  # YOLOv3 detector network config</span><br><span class="line">│   └── yolov3.data  # YOLOv3 detector training config</span><br><span class="line">├── chinese_classify</span><br><span class="line">│   ├── backup  # classifier weights are saved here</span><br><span class="line">│   ├── data</span><br><span class="line">│   │   ├── train  # classifier training set</span><br><span class="line">│   │   ├── valid  # classifier validation set</span><br><span class="line">│   │   ├── labels.txt  # labels of all classifier training samples</span><br><span class="line">│   │   ├── train.list  # paths of all classifier training samples</span><br><span class="line">│   │   └── valid.list  # paths of all classifier validation samples</span><br><span class="line">│   ├── new_img  # classifier samples are stored here after labelling</span><br><span class="line">│   ├── old_img  # classifier samples are stored here before labelling</span><br><span class="line">│   └── label_hanzi.py  # script for labelling classifier samples</span><br><span class="line">├── darknet  # darknet binary</span><br><span class="line">├── examples</span><br><span class="line">├── getmap.py  # computes the detector's mAP</span><br><span class="line">├── include</span><br><span class="line">├── jiyan</span><br><span class="line">│   ├── backup  # detector weights are saved here</span><br><span class="line">│   ├── data</span><br><span class="line">│   │   ├── train  # detector training set</span><br><span class="line">│   │   ├── valid  # detector validation set</span><br><span class="line">│   │   ├── train.txt  # paths of all detector training samples</span><br><span class="line">│   │   ├── valid.txt  # paths of all detector validation samples</span><br><span class="line">│   │   └── yolo.names  # detector labels; there is only one, hanzi</span><br><span class="line">│   ├── get_pic.py  # script that crawls CAPTCHAs from the gsxt site</span><br><span class="line">│   └── raw_img  # the 20,000 Geetest CAPTCHA images crawled initially (already used for character recognition)</span><br><span class="line">├── python</span><br><span class="line">│   ├── hanzi_img  # cropped character images saved while cracking a CAPTCHA</span><br><span class="line">│   ├── valid  # test set, 500 images (for measuring the accuracy of the cracking interface)</span><br><span class="line">│   ├── valid.txt  # test-set annotation file</span><br><span class="line">│   ├── crack_pro.py  # CAPTCHA cracking interface with some character error correction</span><br><span class="line">│   ├── darknet.py  # interface for calling the detector and the classifier</span><br><span class="line">│   ├── recog_order.py  # word-order recognition interface</span><br><span class="line">│   ├── segment.py  # character cropping interface</span><br><span class="line">│   └── dict.txt  # dictionary with word frequencies, used for word-order recognition</span><br><span class="line">├── results  # the generated anchor.txt is stored here</span><br><span class="line">├── scripts</span><br><span class="line">├── src  # darknet source code</span><br><span class="line">└── tools</span><br><span class="line">    ├── voc_label.py  # converts .xml labels to .txt labels</span><br><span class="line">    ├── generate_anchorsv2.py  # generates anchors for YOLOv2</span><br><span class="line">    └── generate_anchorsv3.py  # generates anchors for YOLOv3</span><br></pre></td></tr></table></figure></p><h3 id="Makefile文件配置"><a href="#Makefile文件配置" 
class="headerlink" title="Makefile文件配置"></a>Makefile文件配置</h3><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">GPU=1 # 置1,使用GPU训练</span><br><span class="line">CUDNN=0</span><br><span class="line">OPENCV=1# 置1,开启opencv</span><br><span class="line">OPENMP=0</span><br><span class="line">DEBUG=0</span><br><span class="line"></span><br><span class="line"># 若make时,下面这行出现问题,可强制执行此行</span><br><span class="line">ARCH= -D_FORCE_INLINES -gencode arch=compute_30,code=sm_30 \</span><br><span class="line"> -gencode arch=compute_35,code=sm_35 \</span><br><span class="line"> -gencode arch=compute_50,code=[sm_50,compute_50] \</span><br><span class="line"> -gencode arch=compute_52,code=[sm_52,compute_52]</span><br><span class="line"># -gencode arch=compute_20,code=[sm_20,sm_21] \ This one is deprecated?</span><br><span class="line"></span><br><span class="line"># This is what I use, uncomment if you know your arch and want to specify</span><br><span class="line"># ARCH= -gencode arch=compute_52,code=compute_52</span><br><span class="line"></span><br><span class="line">VPATH=./src/:./examples</span><br><span class="line">SLIB=libdarknet.so</span><br><span class="line">ALIB=libdarknet.a</span><br><span 
class="line">EXEC=darknet</span><br><span class="line">OBJDIR=./obj/</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><blockquote><p><strong>Tip</strong>: after modifying the Makefile or the source code, run <code>make clean</code> and then <code>make</code> for the changes to take effect.</p></blockquote><h2 id="定位器训练"><a href="#定位器训练" class="headerlink" title="定位器训练"></a>Detector training</h2><h3 id="训练数据准备"><a href="#训练数据准备" class="headerlink" title="训练数据准备"></a>Preparing the training data</h3><h4 id="样本的获取"><a href="#样本的获取" class="headerlink" title="样本的获取"></a>Collecting samples</h4><p>Samples for training the detector are crawled from the <a href="http://www.gsxt.gov.cn/index.html" target="_blank" rel="noopener">国家企业信用信息公示系统 (National Enterprise Credit Information Publicity System)</a> with the script <code>jiyan/get_pic.py</code>.</p><h4 id="样本的标注"><a href="#样本的标注" class="headerlink" title="样本的标注"></a>Annotating samples</h4><p>Annotate the sample images with <strong>labelimg</strong>. Installing and using <strong>labelimg</strong> is well documented online and is not covered here.</p><p>After an image is annotated, labelimg produces an <strong>.xml label</strong> for it, but training needs <strong>.txt labels</strong>, so the <strong>.xml labels</strong> have to be converted to <strong>.txt labels</strong>. The conversion script is <code>tools/voc_label.py</code>.</p><h4 id="文件准备"><a href="#文件准备" class="headerlink" title="文件准备"></a>Preparing the files</h4><ol><li>Detector network config file <code>cfg/yolo-origin.cfg</code></li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br></pre></td><td class="code"><pre><span class="line">[net]</span><br><span class="line"><span class="comment"># Training</span></span><br><span class="line">batch=64 <span class="comment"># parameters are updated once every batch samples</span></span><br><span class="line">subdivisions=16 <span class="comment"># if memory is limited, the batch is split into subdivisions sub-batches, each of size batch/subdivisions</span></span><br><span class="line">height=416 <span class="comment"># height of the input image</span></span><br><span class="line">width=416 <span class="comment"># width of the input image</span></span><br><span class="line">channels=3 <span class="comment"># number of channels of the input image</span></span><br><span class="line">momentum=0.9 <span class="comment"># momentum, an acceleration technique for gradient descent; 0.9 is recommended</span></span><br><span class="line">decay=0.0005 <span class="comment"># weight-decay regularisation term, to prevent overfitting</span></span><br><span class="line">angle=0 <span class="comment"># generate extra training samples by rotating images</span></span><br><span class="line">saturation = 1.5 <span 
class="comment"># generate extra training samples by varying saturation</span></span><br><span class="line">exposure = 1.5 <span class="comment"># generate extra training samples by varying exposure</span></span><br><span class="line">hue=.1 <span class="comment"># generate extra training samples by varying hue</span></span><br><span class="line"></span><br><span class="line">learning_rate=0.001 <span class="comment"># learning rate</span></span><br><span class="line">burn_in=1000 </span><br><span class="line">max_batches = 80200 <span class="comment"># training stops once max_batches is reached</span></span><br><span class="line">policy=steps <span class="comment"># policy for adjusting the learning rate</span></span><br><span class="line">steps=40000,60000 <span class="comment"># adjust the learning rate at these batch numbers (batch_num)</span></span><br><span class="line">scales=.1,.1 <span class="comment"># learning-rate multipliers, applied cumulatively: at iteration 40000 the learning rate is divided by 10</span></span><br><span class="line"><span class="comment"># at iteration 60000 it is divided by 10 again, relative to the previous learning rate</span></span><br><span class="line">[convolutional]</span><br><span class="line">batch_normalize=1</span><br><span class="line">filters=32</span><br><span class="line">size=3</span><br><span class="line">stride=1</span><br><span class="line">pad=1</span><br><span class="line">activation=leaky</span><br><span class="line">......</span><br><span class="line">......</span><br><span class="line">[convolutional]</span><br><span class="line">size=1</span><br><span class="line">stride=1</span><br><span class="line">pad=1</span><br><span class="line">filters=30 <span class="comment"># the number of filters in the last convolutional layer before region is fixed; 30 in this project</span></span><br><span class="line">activation=linear</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">[region]</span><br><span class="line">anchors = 2.1882101346810248, 1.7273508247089326, 2.5877106474986844, 2.3101804619114943, 1.8459417510394331, 1.7319281925870702, 2.125632169606032, 2.0649405635139693, 2.458238797504399, 1.9738465447154578</span><br><span class="line"><span class="comment"># anchor (prior) boxes; can be generated with generate_anchorsv2.py</span></span><br><span class="line">bias_match=1</span><br><span 
class="line">classes=1 <span class="comment"># number of object classes to detect; 1 in this project, namely hanzi</span></span><br><span class="line">coords=4 <span class="comment"># the 4 coordinates of each box: tx, ty, tw, th</span></span><br><span class="line">num=5 <span class="comment"># number of boxes predicted per grid cell; must match the number of anchors</span></span><br><span class="line">softmax=1 <span class="comment"># use softmax as the activation function</span></span><br><span class="line">jitter=.3 <span class="comment"># add noise by jittering to curb overfitting</span></span><br><span class="line">rescore=1</span><br><span class="line"></span><br><span class="line">object_scale=5</span><br><span class="line">noobject_scale=1</span><br><span class="line">class_scale=1</span><br><span class="line">coord_scale=1</span><br><span class="line"></span><br><span class="line">absolute=1</span><br><span class="line">thresh=.6</span><br><span class="line">random=1 <span class="comment"># when random is 1, Multi-Scale Training is enabled: images of randomly varying sizes are used for training.</span></span><br></pre></td></tr></table></figure><p><strong>Key parameters:</strong> </p><ul><li><p><code>filters</code> should be 30, computed as:</p><script type="math/tex; mode=display">\text{filters} = \text{num}\cdot (\text{classes} + \text{coords} + 1)</script><script type="math/tex; mode=display">30 = 5\cdot (1 + 4 + 1)</script></li><li><p><code>num</code> is the number of boxes predicted per grid cell.</p></li><li><p><code>classes</code> is the total number of classes; 1 in this project.</p></li><li><p><code>coords</code> is the four regressed values, <code>&lt;x&gt; &lt;y&gt; &lt;width&gt; &lt;height&gt;</code>: the position of the object's centre and the object's size, both relative to the image. In a label file each line starts with the class index (0 here), followed by these four values. For example, the label for <code>不</code> is <code>0 0.20 0.76 0.16 0.16</code>, the label for <code>忘</code> is <code>0 0.75 0.25 0.15 0.17</code>, the label for <code>初</code> is <code>0 0.25 0.20 0.17 0.17</code>, and the label for <code>心</code> is <code>0 0.60 0.62 0.16 0.18</code>.</p><p><img src="/2018/11/19/破解含语序问题的点击验证码/不忘初心.png" alt="验证码"></p><p>The label file for this image is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">0 0.25 0.20 0.17 0.17</span><br><span class="line">0 0.75 0.25 0.15 
0.17</span><br><span class="line">0 0.60 0.62 0.16 0.18</span><br><span class="line">0 0.20 0.76 0.16 0.16</span><br></pre></td></tr></table></figure></li></ul><ol start="2"><li>Detector training config file <code>cfg/yolo-origin.data</code></li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">classes= 1 <span class="comment"># number of classes; 1 in this project</span></span><br><span class="line">train = jiyan/data/train.txt <span class="comment"># path to train.txt</span></span><br><span class="line">valid = jiyan/data/valid.txt <span class="comment"># path to valid.txt</span></span><br><span class="line">names = /home/geng/gsxt_captcha/jiyan/data/yolo.names <span class="comment"># path to the file with the class names</span></span><br><span class="line">backup = jiyan/backup <span class="comment"># where the trained weights are saved</span></span><br></pre></td></tr></table></figure><p><strong>Key parameters:</strong></p><ul><li><p><code>train</code>: the path to <code>train.txt</code>, whose format is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">/home/geng/darknet/jiyan/data/train/1530946690.jpg</span><br><span class="line">/home/geng/darknet/jiyan/data/train/3062060956.jpg</span><br><span class="line">/home/geng/darknet/jiyan/data/train/3062062538.jpg</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p>That is, <code>train.txt</code> holds the paths of all training samples. The <code>train</code> directory in those paths must contain both the samples and their labels:</p><figure class="highlight bash"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">1528780328740.jpg 1530947539.jpg 3062055104.jpg 3062097132.jpg</span><br><span class="line">1528780328740.txt 1530947539.txt 3062055104.txt 3062097132.txt</span><br><span class="line">1528780333363.jpg 1530947544.jpg 3062055126.jpg 3062097142.jpg</span><br><span class="line">1528780333363.txt 1530947544.txt 3062055126.txt 3062097142.txt</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p><strong>Generating train.txt:</strong> the path file <code>train.txt</code> can be generated with the following shell command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">find `<span class="built_in">pwd</span>`/train -name \*.jpg > train.txt</span><br></pre></td></tr></table></figure></li><li><p><code>valid</code>: the path to <code>valid.txt</code>, whose format is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">/home/geng/darknet/jiyan/data/valid/1530953866.jpg</span><br><span class="line">/home/geng/darknet/jiyan/data/valid/1530954184.jpg</span><br><span class="line">/home/geng/darknet/jiyan/data/valid/1530952279.jpg</span><br><span class="line">......</span><br><span 
class="line">......</span><br></pre></td></tr></table></figure><p>That is, <code>valid.txt</code> holds the paths of all validation samples. <strong>Note:</strong> the <code>valid</code> directory in those paths must contain the samples together with both their <code>.txt</code> labels and their <code>.xml</code> labels:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">1530952510.jpg 1530953474.jpg 1530955039.jpg 3062099066.jpg 3062107032.jpg</span><br><span class="line">1530952510.txt 1530953474.txt 1530955039.txt 3062099066.txt 3062107032.txt</span><br><span class="line">1530952510.xml 1530953474.xml 1530955039.xml 3062099066.xml 3062107032.xml</span><br><span class="line">1530952529.jpg 1530953479.jpg 1530955044.jpg 3062099078.jpg 3062107044.jpg</span><br><span class="line">1530952529.txt 1530953479.txt 1530955044.txt 3062099078.txt 3062107044.txt</span><br><span class="line">1530952529.xml 1530953479.xml 1530955044.xml 3062099078.xml 3062107044.xml</span><br></pre></td></tr></table></figure><p><strong>Generating valid.txt:</strong> the path file <code>valid.txt</code> can be generated with the following shell command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">find `<span class="built_in">pwd</span>`/valid -name \*.jpg > valid.txt</span><br></pre></td></tr></table></figure></li></ul><h3 id="模型训练"><a href="#模型训练" class="headerlink" title="模型训练"></a>Model training</h3><p>Once the config files and the dataset are ready, training can be started with:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet detector train cfg/yolo-origin.data cfg/yolo-origin.cfg</span><br></pre></td></tr></table></figure><p>To continue training from existing weights, use:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet detector train cfg/yolo-origin.data cfg/yolo-origin.cfg jiyan/backup/yolo-origin.backup</span><br></pre></td></tr></table></figure><h3 id="训练结果评估"><a href="#训练结果评估" class="headerlink" title="训练结果评估"></a>Evaluating the training results</h3><p>The accuracy metric used in object detection is mAP (mean average precision); the closer mAP is to 1, the better the detector localises. To compute mAP, first generate detection results with the trained model, then run the <code>getmap.py</code> script on them.</p><p><strong>Generating detection results:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet detector valid cfg/yolo-origin.data cfg/yolo-origin.cfg jiyan/backup/yolo-origin.backup</span><br></pre></td></tr></table></figure><p>The detection results are written to <code>results/comp4_det_test_hanzi.txt</code>.</p><p><strong>Computing mAP:</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">python getmap.py results/comp4_det_test_hanzi.txt jiyan/data/valid.txt hanzi</span><br></pre></td></tr></table></figure><h2 id="分类器训练"><a href="#分类器训练" class="headerlink" title="分类器训练"></a>Classifier training</h2><h3 id="训练数据准备-1"><a href="#训练数据准备-1" class="headerlink" title="训练数据准备"></a>Preparing the training data</h3><h4 id="样本的获取-1"><a href="#样本的获取-1" class="headerlink" title="样本的获取"></a>Collecting samples</h4><p>Classifier samples are obtained by cropping the characters located by the detector. The cropping script is <code>python/segment.py</code>.</p><h4 id="样本的标注-1"><a href="#样本的标注-1" class="headerlink" title="样本的标注"></a>Annotating samples</h4><p>Annotating classifier samples means labelling the cropped character images. <strong>In practice this amounts to renaming the image files</strong>; the reason is explained under file preparation below. The process looks like this:</p><div class="table-container"><table><thead><tr><th style="text-align:center">Character</th><th style="text-align:center">Unicode code point</th><th style="text-align:center">File name after cropping</th><th style="text-align:center">File name after labelling</th></tr></thead><tbody><tr><td style="text-align:center">健</td><td style="text-align:center">\u5065</td><td style="text-align:center">15310991303473_label.jpg</td><td 
style="text-align:center">15310991303473_u5065.jpg</td></tr><tr><td style="text-align:center">声</td><td style="text-align:center">\u58f0</td><td style="text-align:center">15310991261628_label.jpg</td><td style="text-align:center">15310991261628_u58f0.jpg</td></tr></tbody></table></div><p>The labelling script for character recognition is already written; see <code>label_hanzi.py</code> (under <code>chinese_classify/</code>).</p><h4 id="文件准备-1"><a href="#文件准备-1" class="headerlink" title="文件准备"></a>Preparing the files</h4><ol><li>Classifier network config file <code>cfg/chinese_character.cfg</code></li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br></pre></td><td 
class="code"><pre><span class="line">[net]</span><br><span class="line">batch=64</span><br><span class="line">subdivisions=1</span><br><span class="line">height=64</span><br><span class="line">width=64</span><br><span class="line">channels=3</span><br><span class="line">max_crop=64</span><br><span class="line">min_crop=64</span><br><span class="line">angle=7</span><br><span class="line">hue=.1</span><br><span class="line">saturation=.75</span><br><span class="line">exposure=.75</span><br><span class="line"></span><br><span class="line">learning_rate=0.1</span><br><span class="line">policy=poly</span><br><span class="line">power=4</span><br><span class="line">max_batches = 45000</span><br><span class="line">momentum=0.9</span><br><span class="line">decay=0.0005</span><br><span class="line"></span><br><span class="line">[convolutional]</span><br><span class="line">batch_normalize=1</span><br><span class="line">filters=128</span><br><span class="line">size=3</span><br><span class="line">stride=1</span><br><span class="line">pad=1</span><br><span class="line">activation=leaky</span><br><span class="line"></span><br><span class="line">......</span><br><span class="line">......</span><br><span class="line"></span><br><span class="line">[convolutional]</span><br><span class="line">filters=3604 <span class="comment"># 此处需要与chinese_classify/data/labels.txt内的标签个数一致</span></span><br><span class="line">size=1</span><br><span class="line">stride=1</span><br><span class="line">pad=1</span><br><span class="line">activation=leaky</span><br><span class="line"></span><br><span class="line">[avgpool]</span><br><span class="line"></span><br><span class="line">[softmax]</span><br><span class="line">groups=1</span><br><span class="line"></span><br><span class="line">[cost]</span><br><span class="line"><span 
class="built_in">type</span>=sse</span><br></pre></td></tr></table></figure><p><strong>Key parameters:</strong></p><ul><li><code>filters</code> of the last convolutional layer: the value here is 3604, but note that it must equal the number of class labels in <code>chinese_classify/data/labels.txt</code>.</li></ul><ol start="2"><li>Classifier training config file <code>cfg/chinese.data</code></li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">classes=3604 <span class="comment"># number of classes, i.e. the number of distinct characters</span></span><br><span class="line">train = chinese_classify/data/train.list <span class="comment"># path to train.list</span></span><br><span class="line">valid = chinese_classify/data/valid.list <span class="comment"># path to valid.list</span></span><br><span class="line">labels = chinese_classify/data/labels.txt <span class="comment"># path to labels.txt</span></span><br><span class="line">backup = chinese_classify/backup <span class="comment"># where weights are saved during training</span></span><br><span class="line">top=100</span><br></pre></td></tr></table></figure><p><strong>Key parameters:</strong></p><ul><li><p><code>classes</code> is the number of classes.</p></li><li><p><code>train</code> is the path to <code>train.list</code>, whose format is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">/home/geng/darknet/chinese_classify/data/train/15310999153792_u5347.jpg</span><br><span class="line">/home/geng/darknet/chinese_classify/data/train/15310991401861_u4e58.jpg</span><br><span 
class="line">/home/geng/darknet/chinese_classify/data/train/15310993054970_u4e4b.jpg</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure><p>In other words, <code>train.list</code> stores the paths of all training samples. The training images themselves must be placed in the <code>train</code> directory that those paths point to, for example:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">15310993303114_u5b9d.jpg 15310996251829_u5174.jpg 15310999168611_u6765.jpg</span><br><span class="line">15310993303400_u90e8.jpg 15310996252114_u4efb.jpg 15310999168899_u5145.jpg</span><br><span class="line">15310993303403_u5185.jpg 15310996252119_u4f55.jpg 15310999168901_u7535.jpg</span><br><span class="line">......</span><br><span class="line">......</span><br></pre></td></tr></table></figure></li><li><p><code>valid</code> has the same format as <code>train</code>, except that it refers to the validation set.</p></li><li><p><code>labels</code> is the path to the class label file <code>labels.txt</code>, whose contents look like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">......</span><br><span class="line">......</span><br><span class="line">u6bd3</span><br><span class="line">u5347</span><br><span class="line">u90f8</span><br><span class="line">u5d58</span><br><span class="line">u8426</span><br><span class="line">......</span><br><span
class="line">......</span><br></pre></td></tr></table></figure><p>darknet assigns labels by matching each file path against the strings in <code>labels.txt</code>; the string that matches becomes the label of the image. For example, the label of <code>/home/geng/darknet/chinese_classify/data/train/15310999153792_u5347.jpg</code> is <code>u5347</code>, because <code>u5347</code> appears in <code>labels.txt</code>. An image path must therefore never contain more than one string from <code>labels.txt</code>: if the path is <code>somepath1/data2/3.jpg</code> and <code>labels.txt</code> contains 1, 2 and 3, <strong>the image is treated as matching the labels 1, 2 and 3 at once and darknet reports an error</strong>.</p></li><li><p><code>top</code> sets how many of the top predictions count when accuracy is computed during validation. For example, <code>top=100</code> means a prediction counts as correct if the true label is among the 100 classes with the highest probability.</p></li></ul><h3 id="模型训练-1"><a href="#模型训练-1" class="headerlink" title="模型训练"></a>Model training</h3><p>Once the configuration files and the dataset are ready, training can start with the following command:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet classifier train cfg/chinese.data cfg/chinese_character.cfg</span><br></pre></td></tr></table></figure><p>To continue training from existing weights, use this command instead:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet classifier train cfg/chinese.data cfg/chinese_character.cfg chinese_classify/backup/chinese_character.backup</span><br></pre></td></tr></table></figure><p><strong>Note:</strong> if adding samples introduces new character labels into <code>labels.txt</code>, you can no longer continue from the existing weights and must retrain from scratch.</p><h3 id="训练结果评估-1"><a href="#训练结果评估-1" class="headerlink" title="训练结果评估"></a>Evaluating the training result</h3><p>Validating the classifier is simple: it is evaluated directly by accuracy, that is, <strong>the number of correctly classified samples / the total number of samples</strong> in the validation set. The validation command is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./darknet classifier valid cfg/chinese.data cfg/chinese_character.cfg chinese_classify/backup/chinese_character.backup</span><br></pre></td></tr></table></figure><h1 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a>References</h1><p><a href="https://pjreddie.com/darknet/"
target="_blank" rel="noopener">Darknet official website</a></p><p>You Only Look Once: Unified, Real-Time Object Detection</p><p><a href="https://cos120.github.io/crack/" target="_blank" rel="noopener">https://cos120.github.io/crack/</a></p><h1 id="免责声明"><a href="#免责声明" class="headerlink" title="免责声明"></a>Disclaimer</h1><p><strong>This project is for academic exchange only; commercial use of any kind is prohibited!</strong></p>]]></content>
<categories>
<category> Deep Learning </category>
</categories>
<tags>
<tag> yolo </tag>
<tag> click CAPTCHA </tag>
<tag> National Enterprise Credit Information Publicity System </tag>
</tags>
</entry>
<entry>
<title>Experiencing the Activity Lifecycle</title>
<link href="/2018/11/16/%E4%BD%93%E9%AA%8C%E6%B4%BB%E5%8A%A8%E7%9A%84%E7%94%9F%E5%91%BD%E5%91%A8%E6%9C%9F/"/>
<url>/2018/11/16/%E4%BD%93%E9%AA%8C%E6%B4%BB%E5%8A%A8%E7%9A%84%E7%94%9F%E5%91%BD%E5%91%A8%E6%9C%9F/</url>
<content type="html"><![CDATA[<p>Before working through this small experiment, it is best to first learn the four states of the activity lifecycle and the seven callback methods defined in the Activity class.</p><h2 id="新建一个项目"><a href="#新建一个项目" class="headerlink" title="新建一个项目"></a>Create a new project</h2><ol><li>Create a new project named ActivityLifeCycleTest and click Next.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116085352727.png" alt="image-20181116085352727"></p><ol><li>Keep the default options and click Next again.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116085453492.png" alt="image-20181116085453492"></p><ol><li>Choose an empty project template and let Android Studio create the activity and layout for us automatically.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116085840469.png" alt="image-20181116085840469"></p><ol><li>Keep the default values for both the activity name and the layout name.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116090722988.png" alt="image-20181116090722988"></p><p>At this point the main activity has been created.</p><h2 id="创建两个子活动"><a href="#创建两个子活动" class="headerlink" title="创建两个子活动"></a>Create two child activities</h2><p>Create two child activities named NormalActivity and DialogActivity.</p><ol><li>Create the NormalActivity child activity.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116091658116.png" alt="image-20181116091658116"></p><p>Name the layout activity_normal and click Finish.</p><p><img src="/2018/11/16/体验活动的生命周期/image-20181116091844066.png" alt="image-20181116091844066"></p><ol><li>Create the DialogActivity child activity in the same way, naming its layout activity_dialog.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116092419983.png" alt="image-20181116092419983"></p><p>Both child activities have now been created.</p><h2 id="编写活动的布局文件"><a href="#编写活动的布局文件" class="headerlink" title="编写活动的布局文件"></a>Write the layout files for the activities</h2><ol><li>Edit activity_normal.xml and replace its contents with the following:</li></ol><figure class="highlight xml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><?xml
version="1.0" encoding="utf-8"?></span><br><span class="line"><span class="tag"><<span class="name">LinearLayout</span> <span class="attr">xmlns:android</span>=<span class="string">"http://schemas.android.com/apk/res/android"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:orientation</span>=<span class="string">"vertical"</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">TextView</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"wrap_content"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:text</span>=<span class="string">"This is a normal activity"</span>/></span></span><br><span class="line"></span><br><span class="line"><span class="tag"></<span class="name">LinearLayout</span>></span></span><br></pre></td></tr></table></figure><p>This layout uses a single TextView to display one line of text.</p><ol><li>Edit activity_dialog.xml and replace its contents with the following:</li></ol><figure class="highlight xml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><?xml
version="1.0" encoding="utf-8"?></span><br><span class="line"><span class="tag"><<span class="name">LinearLayout</span> <span class="attr">xmlns:android</span>=<span class="string">"http://schemas.android.com/apk/res/android"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:orientation</span>=<span class="string">"vertical"</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">TextView</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"wrap_content"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:text</span>=<span class="string">"This is a dialog activity"</span>/></span></span><br><span class="line"></span><br><span class="line"><span class="tag"></<span class="name">LinearLayout</span>></span></span><br></pre></td></tr></table></figure><ol><li>Edit the <code><activity></code> tags in AndroidManifest.xml so that DialogActivity uses a dialog-style theme.</li></ol><figure class="highlight xml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span
class="name">activity</span> <span class="attr">android:name</span>=<span class="string">".MainActivity"</span>></span></span><br><span class="line"><span class="tag"><<span class="name">intent-filter</span>></span></span><br><span class="line"><span class="tag"><<span class="name">action</span> <span class="attr">android:name</span>=<span class="string">"android.intent.action.MAIN"</span> /></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">category</span> <span class="attr">android:name</span>=<span class="string">"android.intent.category.LAUNCHER"</span> /></span></span><br><span class="line"><span class="tag"></<span class="name">intent-filter</span>></span></span><br><span class="line"><span class="tag"></<span class="name">activity</span>></span></span><br><span class="line"><span class="tag"><<span class="name">activity</span> <span class="attr">android:name</span>=<span class="string">".NormalActivit"</span> /></span></span><br><span class="line"><span class="tag"><<span class="name">activity</span> <span class="attr">android:name</span>=<span class="string">".DialogActivity"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:theme</span>=<span class="string">"@style/Theme.AppCompat.Dialog"</span>></span></span><br><span class="line"><span class="tag"></<span class="name">activity</span>></span></span><br></pre></td></tr></table></figure><ol><li>Edit activity_main.xml to redefine the layout of the main activity, replacing its contents with the following:</li></ol><figure class="highlight xml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><?xml version="1.0" encoding="utf-8"?></span><br><span class="line"><span class="tag"><<span class="name">LinearLayout</span> <span class="attr">xmlns:android</span>=<span class="string">"http://schemas.android.com/apk/res/android"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:orientation</span>=<span class="string">"vertical"</span>></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">Button</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:id</span>=<span class="string">"@+id/start_normal_activity"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"wrap_content"</span> </span></span><br><span class="line"><span class="tag"><span class="attr">android:text</span>=<span class="string">"Start NormalActivity"</span>/></span></span><br><span class="line"></span><br><span class="line"><span class="tag"><<span class="name">Button</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:id</span>=<span class="string">"@+id/start_dialog_activity"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_width</span>=<span 
class="string">"match_parent"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:layout_height</span>=<span class="string">"wrap_content"</span></span></span><br><span class="line"><span class="tag"><span class="attr">android:text</span>=<span class="string">"Start DialogActivity"</span>/></span></span><br><span class="line"></span><br><span class="line"><span class="tag"></<span class="name">LinearLayout</span>></span></span><br></pre></td></tr></table></figure><p>We added two buttons to the LinearLayout: one starts NormalActivity and the other starts DialogActivity.</p><h2 id="修改MainActivity"><a href="#修改MainActivity" class="headerlink" title="修改MainActivity"></a>Modify MainActivity</h2><figure class="highlight java"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span
class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">package</span> cn.edu.pku.gengzehao.activitylifecycletest;</span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> android.content.Intent;</span><br><span class="line"><span class="keyword">import</span> android.support.v7.app.AppCompatActivity;</span><br><span class="line"><span class="keyword">import</span> android.os.Bundle;</span><br><span class="line"><span class="keyword">import</span> android.util.Log;</span><br><span class="line"><span class="keyword">import</span> android.view.View;</span><br><span class="line"><span class="keyword">import</span> android.widget.Button;</span><br><span class="line"></span><br><span class="line"><span class="keyword">public</span> <span class="class"><span 
class="keyword">class</span> <span class="title">MainActivity</span> <span class="keyword">extends</span> <span class="title">AppCompatActivity</span> <span class="keyword">implements</span> <span class="title">View</span>.<span class="title">OnClickListener</span> </span>{</span><br><span class="line"></span><br><span class="line"><span class="keyword">public</span> <span class="keyword">static</span> <span class="keyword">final</span> String TAG = <span class="string">"MainActivity"</span>;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onCreate</span><span class="params">(Bundle savedInstanceState)</span> </span>{</span><br><span class="line"><span class="keyword">super</span>.onCreate(savedInstanceState);</span><br><span class="line">Log.d(TAG, <span class="string">"onCreate"</span>);</span><br><span class="line">setContentView(R.layout.activity_main);</span><br><span class="line"></span><br><span class="line">Button startNormalActivity = (Button) findViewById(R.id.start_normal_activity);</span><br><span class="line">startNormalActivity.setOnClickListener(<span class="keyword">this</span>);</span><br><span class="line"></span><br><span class="line">Button startDialogActivity = (Button) findViewById(R.id.start_dialog_activity);</span><br><span class="line">startDialogActivity.setOnClickListener(<span class="keyword">this</span>);</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">public</span> <span class="keyword">void</span> <span class="title">onClick</span><span class="params">(View v)</span> </span>{</span><br><span 
class="line"><span class="keyword">if</span> (v.getId() == R.id.start_normal_activity){</span><br><span class="line">Intent intent = <span class="keyword">new</span> Intent(MainActivity.<span class="keyword">this</span>, NormalActivit.class);</span><br><span class="line">startActivity(intent);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> (v.getId() == R.id.start_dialog_activity){</span><br><span class="line">Intent intent = <span class="keyword">new</span> Intent(MainActivity.<span class="keyword">this</span>, DialogActivity.class);</span><br><span class="line">startActivity(intent);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onStart</span><span class="params">()</span></span>{</span><br><span class="line"><span class="keyword">super</span>.onStart();</span><br><span class="line">Log.d(TAG, <span class="string">"onStart"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onResume</span><span class="params">()</span></span>{</span><br><span class="line"><span class="keyword">super</span>.onResume();</span><br><span class="line">Log.d(TAG, <span class="string">"onResume"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onPause</span><span class="params">()</span></span>{</span><br><span class="line"><span 
class="keyword">super</span>.onPause();</span><br><span class="line">Log.d(TAG, <span class="string">"onPause"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onStop</span><span class="params">()</span></span>{</span><br><span class="line"><span class="keyword">super</span>.onStop();</span><br><span class="line">Log.d(TAG, <span class="string">"onStop"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onDestroy</span><span class="params">()</span></span>{</span><br><span class="line"><span class="keyword">super</span>.onDestroy();</span><br><span class="line">Log.d(TAG, <span class="string">"onDestroy"</span>);</span><br><span class="line">}</span><br><span class="line"><span class="meta">@Override</span></span><br><span class="line"><span class="function"><span class="keyword">protected</span> <span class="keyword">void</span> <span class="title">onRestart</span><span class="params">()</span></span>{</span><br><span class="line"><span class="keyword">super</span>.onRestart();</span><br><span class="line">Log.d(TAG, <span class="string">"onRestart"</span>);</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">}</span><br></pre></td></tr></table></figure><p>In the onCreate() method we register click listeners for the two buttons: clicking the first button starts NormalActivity and clicking the second starts DialogActivity. Each of the seven Activity callback methods then logs its own name, so the activity lifecycle can be followed more intuitively by watching the log.</p><h2 id="运行程序"><a href="#运行程序" class="headerlink" title="运行程序"></a>Run the app</h2><ol><li>Start the app:</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116105015976.png"
alt="image-20181116105015976"></p><p>When the app starts, logcat shows the log below. When MainActivity is created for the first time, onCreate(), onStart() and onResume() are executed in that order.</p><p><img src="/2018/11/16/体验活动的生命周期/image-20181116105139134.png" alt="image-20181116105139134"></p><ol><li>Click the first button to start NormalActivity.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116105613231.png" alt="image-20181116105613231"></p><p>Log output when NormalActivity is opened:</p><p><img src="/2018/11/16/体验活动的生命周期/image-20181116105804164.png" alt="image-20181116105804164"></p><p>Because NormalActivity completely covers MainActivity, both onPause() and onStop() are executed.</p><ol><li>Press the Back key to return to MainActivity. The log output is as follows.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116110125194.png" alt="image-20181116110125194"></p><p>Because MainActivity had already entered the stopped state, onRestart() is executed first, followed by onStart() and onResume(). Note that onCreate() is not executed this time.</p><ol><li>Now click the second button to start DialogActivity.</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116120841989.png" alt="image-20181116120841989"></p><p>Log output when DialogActivity is opened:</p><p><img src="/2018/11/16/体验活动的生命周期/image-20181116120948371.png" alt="image-20181116120948371"></p><p>Only onPause() is executed; onStop() is not. This is because DialogActivity does not completely cover MainActivity, so MainActivity only enters the paused state and never reaches the stopped state.</p><ol><li>Press the Back key to return to MainActivity again. The log output is:</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116121404897.png" alt="image-20181116121404897"></p><p>As you can see, only onResume() is executed.</p><ol><li>Finally, press the Back key in MainActivity to exit the app. The log output is:</li></ol><p><img src="/2018/11/16/体验活动的生命周期/image-20181116121609853.png" alt="image-20181116121609853"></p><p>onPause(), onStop() and onDestroy() are executed in that order, and MainActivity is finally destroyed.</p><h2 id="参考文献"><a href="#参考文献" class="headerlink" title="参考文献"></a>References</h2><p>《第一行代码 Android》 (First Line of Code: Android), by Guo Lin</p>]]></content>
<categories>
<category> Android </category>
</categories>
</entry>
<entry>
<title>Migrating a Hexo Blog to a New Computer</title>
<link href="/2018/10/29/Hexo%E5%8D%9A%E5%AE%A2%E8%BF%81%E7%A7%BB%E5%88%B0%E5%8F%A6%E4%B8%80%E5%8F%B0%E7%94%B5%E8%84%91/"/>
<url>/2018/10/29/Hexo%E5%8D%9A%E5%AE%A2%E8%BF%81%E7%A7%BB%E5%88%B0%E5%8F%A6%E4%B8%80%E5%8F%B0%E7%94%B5%E8%84%91/</url>
<content type="html"><![CDATA[<h1 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Preface</h1><p>I switched to a MacBook Pro in October, and the Hexo blog I had set up stalled as a result. I believe that whatever you start you should see through to the end, and I did not want to forget why I set up the blog in the first place. Hence this post.</p><h1 id="迁移思路"><a href="#迁移思路" class="headerlink" title="迁移思路"></a>Migration approach</h1><p>Create a branch in the GitHub repository that already holds the generated Hexo static pages, and use that branch to manage the Hexo environment files.</p><h1 id="迁移步骤"><a href="#迁移步骤" class="headerlink" title="迁移步骤"></a>Migration steps</h1><h2 id="1-在旧机器上克隆Github上面生成的静态文件到本地"><a href="#1-在旧机器上克隆Github上面生成的静态文件到本地" class="headerlink" title="1.在旧机器上克隆Github上面生成的静态文件到本地"></a>1. On the old machine, clone the generated static files from GitHub</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://github.com/username/username.github.io.git</span><br></pre></td></tr></table></figure><h2 id="2-针对克隆到本地的文件中,将除去-git文件的所有文件都删除"><a href="#2-针对克隆到本地的文件中,将除去-git文件的所有文件都删除" class="headerlink" title="2.针对克隆到本地的文件中,将除去.git文件的所有文件都删除"></a>2. In the cloned directory, delete everything except the <code>.git</code> directory</h2><h2 id="3-将旧机器中所有文件-gitignore文件中包含的文件除外)拷贝到我们克隆下来的文件内"><a href="#3-将旧机器中所有文件-gitignore文件中包含的文件除外)拷贝到我们克隆下来的文件内" class="headerlink" title="3.将旧机器中所有文件(.gitignore文件中包含的文件除外)拷贝到我们克隆下来的文件内"></a>3. Copy all files from the Hexo directory on the old machine (except those listed in <code>.gitignore</code>) into the cloned directory</h2><h2 id="4-创建并切换到一个叫hexo的分支"><a href="#4-创建并切换到一个叫hexo的分支" class="headerlink" title="4. 创建并切换到一个叫hexo的分支"></a>4. Create and switch to a branch named hexo</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git checkout -b hexo</span><br></pre></td></tr></table></figure><h2 id="5-提交复制过来的文件到暂存取"><a href="#5-提交复制过来的文件到暂存取" class="headerlink" title="5. 提交复制过来的文件到暂存取"></a>5.
Add the copied files to the staging area</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git add -A</span><br></pre></td></tr></table></figure><h2 id="6-提交"><a href="#6-提交" class="headerlink" title="6.提交"></a>6. Commit</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git commit -m <span class="string">"create a new branch file"</span></span><br></pre></td></tr></table></figure><h2 id="7-推送到Github"><a href="#7-推送到Github" class="headerlink" title="7.推送到Github"></a>7. Push to GitHub</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git push --<span class="built_in">set</span>-upstream origin hexo</span><br></pre></td></tr></table></figure><p>The hexo branch now exists on GitHub. Next, we set up the environment on the new computer.</p><h2 id="8-新电脑配置环境"><a href="#8-新电脑配置环境" class="headerlink" title="8.新电脑配置环境"></a>8. Set up the environment on the new computer</h2><p>Install Node.js; search online for the installation instructions for your operating system.</p><p>Install git. For a git tutorial I recommend <a href="https://link.jianshu.com/?t=https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000" target="_blank" rel="noopener">Liao Xuefeng's git tutorial</a>.</p><p>Install hexo: <code>npm install -g hexo-cli</code></p><h2 id="9-clone远程仓库到本地"><a href="#9-clone远程仓库到本地" class="headerlink" title="9.clone远程仓库到本地"></a>9. Clone the remote repository</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> -b hexo https://github.com/username/username.github.io.git</span><br></pre></td></tr></table></figure><h2 id="10-根据package-json安装依赖"><a href="#10-根据package-json安装依赖" class="headerlink" title="10.根据package.json安装依赖"></a>10. Install the dependencies listed in <code>package.json</code></h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td
class="code"><pre><span class="line">npm install *** --save</span><br></pre></td></tr></table></figure><p>Replace <code>***</code> with the dependency packages listed in <code>package.json</code>. Alternatively, running <code>npm install</code> with no package name inside the cloned directory installs every dependency listed in <code>package.json</code> in one step.</p><h2 id="11-开始写文章"><a href="#11-开始写文章" class="headerlink" title="11.开始写文章"></a>11. Start writing</h2><p>We can now create a post with <code>hexo n "post title"</code>!</p><h2 id="12-提交hexo环境文件"><a href="#12-提交hexo环境文件" class="headerlink" title="12. 提交hexo环境文件"></a>12. Commit the hexo environment files</h2><p><code>git add .</code></p><p><code>git commit -m "some description"</code></p><p><code>git push origin hexo</code></p><h2 id="13-发布文章"><a href="#13-发布文章" class="headerlink" title="13.发布文章"></a>13. Publish the post</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">hexo g -d</span><br></pre></td></tr></table></figure><p>That completes the migration of the Hexo blog! From now on, writing a new post only requires repeating steps 11, 12 and 13.</p>]]></content>
<categories>
<category> Hexo </category>
</categories>
<tags>
<tag> Hexo迁移 </tag>
</tags>
</entry>
<entry>
<title>Crawling Site-Wide Dangdang Product Information with Scrapy's CrawlSpider Class</title>
<link href="/2018/09/01/%E5%9F%BA%E4%BA%8EScrapy%E6%A1%86%E6%9E%B6%E7%9A%84CrawlSpider%E7%B1%BB%E7%88%AC%E5%8F%96%E5%BD%93%E5%BD%93%E5%85%A8%E7%BD%91%E5%95%86%E5%93%81%E4%BF%A1%E6%81%AF/"/>
<url>/2018/09/01/%E5%9F%BA%E4%BA%8EScrapy%E6%A1%86%E6%9E%B6%E7%9A%84CrawlSpider%E7%B1%BB%E7%88%AC%E5%8F%96%E5%BD%93%E5%BD%93%E5%85%A8%E7%BD%91%E5%95%86%E5%93%81%E4%BF%A1%E6%81%AF/</url>
<content type="html"><![CDATA[<h1 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Preface</h1><p>This project crawls product information and product reviews from the Dangdang website and stores them in two database tables: a product table and a review table.</p><p>The project uses the CrawlSpider class of the Scrapy framework to crawl product information across the whole Dangdang site and save it to a MySQL database. Dangdang's anti-crawling measure is a limit on the request frequency per IP, so the project uses the <code>scrapy-rotating-proxies</code> middleware to manage a pool of proxy IPs. How the proxy IPs themselves are collected is covered in another post of mine.</p><p>CrawlSpider is a subclass of Spider. A plain Spider is designed to crawl only the pages in its start_urls list, whereas CrawlSpider defines rules for following the links found in crawled pages: it extracts links from each page and keeps crawling them.</p><p>GitHub: <a href="https://github.com/RunningGump/crawl_dangdang" target="_blank" rel="noopener">https://github.com/RunningGump/crawl_dangdang</a></p><h1 id="依赖"><a href="#依赖" class="headerlink" title="依赖"></a>Dependencies</h1><ol><li>scrapy 1.5.0</li><li>python3.6</li><li>mysql 5.7.24</li><li>the pymysql library</li><li>the scrapy-rotating-proxies library</li><li>the fake-useragent library</li></ol><h1 id="创建项目"><a href="#创建项目" class="headerlink" title="创建项目"></a>Creating the project</h1><p>First we need to create a Scrapy project, using the <code>scrapy startproject</code> command in a shell:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy startproject Dangdang</span><br><span class="line">New Scrapy project <span class="string">'Dangdang'</span>, using template directory <span class="string">'/usr/local/lib/python3.6/dist-packages/scrapy/templates/project'</span>, created <span class="keyword">in</span>:</span><br><span class="line"> /home/geng/Dangdang</span><br><span class="line"></span><br><span class="line">You can start your first spider with:</span><br><span class="line"> <span class="built_in">cd</span> Dangdang</span><br><span class="line"> scrapy genspider example example.com</span><br></pre></td></tr></table></figure><p>After creating the project named <code>Dangdang</code>, enter the new project directory:</p><figure class="highlight bash"><table><tr><td
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> Dangdang</span><br></pre></td></tr></table></figure><p>Then create a spider with <code>scrapy genspider -t &lt;template&gt; &lt;name&gt; &lt;domain&gt;</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy genspider -t crawl dd dangdang.com</span><br><span class="line">Created spider <span class="string">'dd'</span> using template <span class="string">'crawl'</span> <span class="keyword">in</span> module:</span><br><span class="line">  Dangdang.spiders.dd</span><br></pre></td></tr></table></figure><p>Now return to the parent directory with <code>cd ..</code> and use the <code>tree</code> command to list the files in the project directory. The output looks like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> ..</span><br><span class="line">$ tree Dangdang/</span><br><span class="line">Dangdang/</span><br><span class="line">├── Dangdang</span><br><span class="line">│   ├── __init__.py</span><br><span class="line">│   ├── items.py</span><br><span class="line">│   ├── middlewares.py</span><br><span class="line">│   ├── pipelines.py</span><br><span class="line">│ 
├── __pycache__</span><br><span class="line">│   │   ├── __init__.cpython-36.pyc</span><br><span class="line">│   │   └── settings.cpython-36.pyc</span><br><span class="line">│   ├── settings.py</span><br><span class="line">│   └── spiders</span><br><span class="line">│       ├── dd.py</span><br><span class="line">│       ├── __init__.py</span><br><span class="line">│       └── __pycache__</span><br><span class="line">│           └── __init__.cpython-36.pyc</span><br><span class="line">└── scrapy.cfg</span><br><span class="line"></span><br><span class="line">4 directories, 11 files</span><br></pre></td></tr></table></figure><p>At this point the project has been created successfully.</p><h1 id="rules"><a href="#rules" class="headerlink" title="rules"></a>rules</h1><p>The rules attribute contains one or more Rule objects, each of which defines a rule for how the site is crawled.</p><p> <strong>Parameters:</strong></p><p><code>link_extractor</code>: a Link Extractor object that defines which links to extract.</p><p><code>callback</code>: the callback function that processes and parses the responses for the links obtained by link_extractor.</p><p><strong>Note:</strong> when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the crawl spider will no longer work.</p><p><code>follow</code>: a boolean that specifies whether the links extracted from each response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.</p><p><code>process_links</code>: names the spider method that is called with each list of links obtained from link_extractor. It is mainly used to filter links. </p><p><code>process_request</code>: names the spider method that is called for every request extracted by this rule. It is used to filter requests.</p><h1 id="LinkExtrator"><a href="#LinkExtrator" class="headerlink" title="LinkExtrator"></a>LinkExtractor</h1><p><strong>Parameters:</strong></p><p><code>allow</code>: URLs that match the given regular expression (or list of regular expressions) are extracted. If it is empty, every link matches.</p><p><code>deny</code>: URLs that match this regular expression (or list of regular expressions) are not extracted.</p><p><code>allow_domains</code>: the domains whose links will be extracted.</p><p><code>deny_domains</code>: the domains whose links will not be extracted.</p><p><code>restrict_xpaths</code>: XPath expressions that work together with allow, so that only links that both lie inside the regions selected by the XPath and match the regular expression are extracted.</p><h1 id="项目代码"><a href="#项目代码" class="headerlink" title="项目代码"></a>Project code</h1><h2 id="编写item-py文件"><a href="#编写item-py文件" class="headerlink" title="编写item.py文件"></a>Writing items.py</h2><figure class="highlight python"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DangdangItem</span><span class="params">(scrapy.Item)</span>:</span></span><br><span class="line">    goods_id = scrapy.Field()  <span class="comment"># product ID</span></span><br><span class="line">    category = scrapy.Field()  <span class="comment"># product category</span></span><br><span class="line">    title = scrapy.Field()  <span class="comment"># product name</span></span><br><span class="line">    link = scrapy.Field()  <span class="comment"># product link</span></span><br><span class="line">    price = scrapy.Field()  <span class="comment"># product price</span></span><br><span class="line">    comment_num = scrapy.Field()  <span class="comment"># number of reviews</span></span><br><span class="line">    good_comment_num = scrapy.Field()  <span class="comment"># number of positive reviews</span></span><br><span class="line">    mid_comment_num = scrapy.Field()  <span class="comment"># number of neutral reviews</span></span><br><span class="line">    bad_comment_num = scrapy.Field()  <span class="comment"># number of negative reviews</span></span><br><span class="line">    rate = scrapy.Field()  <span class="comment"># positive-review rate</span></span><br><span class="line">    source = scrapy.Field()  <span class="comment"># product origin</span></span><br><span class="line">    detail = scrapy.Field()  <span class="comment"># product details</span></span><br><span class="line">    img_link = scrapy.Field()  <span class="comment"># product image links</span></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">CommentItem</span><span class="params">(scrapy.Item)</span>:</span></span><br><span class="line">    goods_id = scrapy.Field()  <span class="comment"># product ID</span></span><br><span class="line">    comment = scrapy.Field()  <span class="comment"># review text</span></span><br><span class="line">    score = scrapy.Field()  <span class="comment"># rating given with the review</span></span><br><span class="line">    time = scrapy.Field()  <span class="comment"># time of the review</span></span><br></pre></td></tr></table></figure><h2 id="编写pipeline-py文件"><a href="#编写pipeline-py文件" class="headerlink" title="编写pipeline.py文件"></a>Writing pipelines.py</h2><p>The database must be created in advance. This project uses a database named <code>dd</code> that contains two tables, <code>goods</code> and <code>comments</code>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="keyword">import</span> pymysql</span><br><span class="line"><span class="keyword">from</span> scrapy.conf <span class="keyword">import</span> settings</span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"><span class="keyword">from</span> Dangdang.items <span class="keyword">import</span> DangdangItem</span><br><span class="line"><span class="keyword">from</span> Dangdang.items <span class="keyword">import</span> CommentItem</span><br><span class="line"></span><br><span class="line"><span class="comment">## The pipeline is disabled by default; enable it in settings.py</span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DangdangPipeline</span><span class="params">(object)</span>:</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">process_item</span><span class="params">(self, item, spider)</span>:</span></span><br><span class="line">        <span class="comment"># Connect to the database</span></span><br><span class="line">        conn = pymysql.connect(host=<span class="string">"localhost"</span>,user=<span class="string">"root"</span>,passwd=<span class="string">'******'</span>,db=<span class="string">"dd"</span>,use_unicode=<span class="keyword">True</span>, charset=<span class="string">"utf8"</span>)</span><br><span class="line">        cur = conn.cursor()  <span class="comment"># Get a cursor, the object used to execute MySQL statements from Python</span></span><br><span class="line">        print(<span class="string">"mysql connect success"</span>)  <span class="comment"># Debug output: confirms that execution reached this point</span></span><br><span class="line">        <span class="comment"># Logic for storing Dangdang product information</span></span><br><span class="line">        <span class="keyword">if</span> isinstance(item, DangdangItem):  <span class="comment"># Check whether the incoming item is a DangdangItem</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                goods_id = item[<span class="string">"goods_id"</span>]</span><br><span class="line">                category = item[<span class="string">"category"</span>]</span><br><span class="line">                title = item[<span class="string">"title"</span>]</span><br><span class="line">                <span class="keyword">if</span> len(title)><span class="number">40</span>:</span><br><span class="line">                    title = title[<span class="number">0</span>:<span class="number">40</span>] + <span class="string">'...'</span></span><br><span class="line">                link = item[<span class="string">"link"</span>]</span><br><span class="line">                img_link = item[<span class="string">'img_link'</span>]</span><br><span class="line">                price = item[<span class="string">"price"</span>]</span><br><span class="line">                comment_num = item[<span class="string">"comment_num"</span>]</span><br><span class="line">                good_comment_num = item[<span class="string">"good_comment_num"</span>]</span><br><span class="line">                mid_comment_num = item[<span class="string">"mid_comment_num"</span>]</span><br><span class="line">                bad_comment_num = item[<span class="string">"bad_comment_num"</span>]</span><br><span class="line">                rate = item[<span class="string">"rate"</span>]</span><br><span class="line">                source = item[<span class="string">"source"</span>]</span><br><span class="line">                detail = item[<span class="string">"detail"</span>]</span><br><span class="line"></span><br><span class="line">                sql = <span class="string">"INSERT INTO goods(goods_id,category,title,price,comment_num,good_comment_num,mid_comment_num,bad_comment_num,rate,source,detail,link,img_link) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"</span></span><br><span class="line">                params = (goods_id,category,title,price,comment_num,good_comment_num,mid_comment_num,bad_comment_num,rate,source,detail,link,img_link)</span><br><span class="line">                print(sql, params)</span><br><span class="line">            <span class="keyword">except</span> Exception <span class="keyword">as</span> err:</span><br><span class="line">                print(err,<span class="string">'ERROR1111!!!'</span>)</span><br><span class="line"></span><br><span class="line">            <span class="string">'''########################################################</span></span><br><span class="line"><span class="string">            Execute the SQL statement and store the product in the goods table</span></span><br><span class="line"><span class="string">            ########################################################'''</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                cur.execute(sql, params)  <span class="comment"># Run the INSERT; the driver escapes the parameter values, so quotes in the data cannot break the statement</span></span><br><span class="line">                print(<span class="string">"insert goods success"</span>)  <span class="comment"># Debug output</span></span><br><span class="line">            <span class="keyword">except</span> Exception <span class="keyword">as</span> err:</span><br><span class="line">                print(err)</span><br><span class="line">                conn.rollback()  <span class="comment"># Roll back the transaction so the data returns to its state before this operation. If a transaction contains several ordered operations and one of them fails, the ones that already succeeded are rolled back as well.</span></span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                conn.commit()  <span class="comment"># No exception was raised, so commit the transaction</span></span><br><span class="line"></span><br><span class="line">        <span class="keyword">elif</span> isinstance(item, CommentItem):  <span class="comment"># Check whether the incoming item is a CommentItem</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                <span class="comment"># Read the review fields from the item</span></span><br><span class="line">                goods_id = item[<span class="string">"goods_id"</span>]</span><br><span class="line">                comment = item[<span class="string">"comment"</span>]</span><br><span class="line">                score = item[<span class="string">"score"</span>]</span><br><span class="line">                comment_time = item[<span class="string">"time"</span>]</span><br><span class="line"></span><br><span class="line">                sql2 = <span class="string">"INSERT INTO comments(goods_id,comment,score,comment_time) VALUES (%s,%s,%s,%s)"</span></span><br><span class="line">                params2 = (goods_id,comment,score,comment_time)</span><br><span class="line">                print(sql2, params2)</span><br><span class="line">            <span class="keyword">except</span> Exception <span class="keyword">as</span> err:</span><br><span class="line">                print(err,<span class="string">'ERROR222!!!'</span>)</span><br><span class="line"></span><br><span class="line">            <span class="string">'''########################################################</span></span><br><span class="line"><span class="string">            Execute the SQL statement and store the review in the comments table</span></span><br><span class="line"><span class="string">            ########################################################'''</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                cur.execute(sql2, params2)  <span class="comment"># Run the INSERT; the driver escapes the parameter values</span></span><br><span class="line">                print(<span class="string">"insert comments success"</span>)  <span class="comment"># Debug output</span></span><br><span class="line">            <span class="keyword">except</span> Exception <span class="keyword">as</span> err:</span><br><span class="line">                print(err)</span><br><span class="line">                conn.rollback()  <span class="comment"># Roll back the transaction so the data returns to its state before this operation</span></span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                conn.commit()  <span class="comment"># No exception was raised, so commit the transaction</span></span><br><span class="line"></span><br><span class="line">        conn.close()  <span class="comment"># Close the connection</span></span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> item  <span class="comment"># Scrapy expects process_item to return an item</span></span><br></pre></td></tr></table></figure><h2 id="编写middlewares-py"><a href="#编写middlewares-py" class="headerlink" title="编写middlewares.py"></a>Writing middlewares.py</h2><p>This project adds a <code>RandomUserAgentMiddleWare</code> middleware that sets a random User-Agent on requests. Append the following middleware to the end of <code>middlewares.py</code>:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> fake_useragent <span class="keyword">import</span> UserAgent</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">RandomUserAgentMiddleWare</span><span class="params">(object)</span>:</span></span><br><span class="line">    <span class="string">"""</span></span><br><span class="line"><span class="string">    Set a random User-Agent on requests so the crawler is not banned by User-Agent.</span></span><br><span class="line"><span class="string">    """</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self,crawler)</span>:</span></span><br><span class="line">        super(RandomUserAgentMiddleWare, self).__init__()</span><br><span class="line">        self.ua = UserAgent()</span><br><span class="line">        self.ua_type = crawler.settings.get(<span class="string">"RANDOM_UA_TYPE"</span>, <span class="string">"random"</span>)</span><br><span class="line"></span><br><span class="line"><span class="meta">    @classmethod</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">from_crawler</span><span class="params">(cls, crawler)</span>:</span></span><br><span class="line">        <span class="keyword">return</span> cls(crawler)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">process_request</span><span class="params">(self, request, spider)</span>:</span></span><br><span class="line">        <span class="function"><span class="keyword">def</span> <span class="title">get_ua_type</span><span class="params">()</span>:</span></span><br><span class="line">            <span class="keyword">return</span> getattr(self.ua, self.ua_type)  <span class="comment"># Look up the attribute of ua named by ua_type, e.g. self.ua.random when ua_type is "random"</span></span><br><span class="line"></span><br><span class="line">        request.headers.setdefault(<span class="string">'User-Agent'</span>, get_ua_type())</span><br></pre></td></tr></table></figure><h2 id="修改settings-py-文件"><a href="#修改settings-py-文件" class="headerlink" title="修改settings.py 文件"></a>Editing settings.py</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line">BOT_NAME = <span class="string">'Dangdang'</span></span><br><span class="line"></span><br><span class="line">SPIDER_MODULES = [<span class="string">'Dangdang.spiders'</span>]</span><br><span class="line">NEWSPIDER_MODULE = <span class="string">'Dangdang.spiders'</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Do not obey robots.txt</span></span><br><span class="line">ROBOTSTXT_OBEY = <span class="keyword">False</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Set the download delay to 0 to speed up crawling</span></span><br><span class="line">DOWNLOAD_DELAY = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Disable cookies (enabled by default)</span></span><br><span class="line">COOKIES_ENABLED = <span class="keyword">False</span> </span><br><span class="line"></span><br><span class="line"><span class="comment"># Enable the downloader middlewares we need; for crawling Dangdang, entries 1, 2 and 4 can also be commented out.</span></span><br><span class="line">DOWNLOADER_MIDDLEWARES = {</span><br><span class="line">    <span class="string">'rotating_proxies.middlewares.RotatingProxyMiddleware'</span>: <span class="number">610</span>,</span><br><span class="line">    <span class="string">'rotating_proxies.middlewares.BanDetectionMiddleware'</span>: <span class="number">620</span>,</span><br><span class="line">    <span class="string">'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'</span>: <span class="keyword">None</span>,</span><br><span class="line">    <span class="string">'Dangdang.middlewares.RandomUserAgentMiddleWare'</span>: <span class="number">400</span>,</span><br><span class="line">}</span><br><span class="line"><span class="comment"># Path to the proxy IP file; change this to your own path</span></span><br><span class="line">ROTATING_PROXY_LIST_PATH = <span class="string">'/home/geng/Projects/Dangdang/proxy.txt'</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Use a random User-Agent</span></span><br><span class="line">RANDOM_UA_TYPE = <span class="string">"random"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Enable the pipeline</span></span><br><span class="line">ITEM_PIPELINES = {</span><br><span class="line">    <span class="string">'Dangdang.pipelines.DangdangPipeline'</span>: <span class="number">300</span>,</span><br><span class="line">}</span><br></pre></td></tr></table></figure><h2 id="spider文件-dd-py-编写"><a href="#spider文件-dd-py-编写" class="headerlink" title="spider文件(dd.py)编写"></a>Writing the spider file (dd.py)</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span 
class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># -*- coding: utf-8 -*-</span></span><br><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"><span class="keyword">from</span> scrapy.linkextractors <span class="keyword">import</span> LinkExtractor</span><br><span class="line"><span class="keyword">from</span> scrapy.spiders <span class="keyword">import</span> CrawlSpider, Rule</span><br><span class="line"><span class="keyword">from</span> Dangdang.items <span class="keyword">import</span> DangdangItem</span><br><span class="line"><span class="keyword">from</span> Dangdang.items <span class="keyword">import</span> CommentItem</span><br><span class="line"><span class="keyword">import</span> re</span><br><span class="line"><span class="keyword">import</span> urllib.request</span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">DdSpider</span><span class="params">(CrawlSpider)</span>:</span></span><br><span class="line">    name = <span class="string">'dd'</span></span><br><span class="line">    allowed_domains = [<span class="string">'dangdang.com'</span>]</span><br><span class="line">    start_urls = [<span class="string">'http://category.dangdang.com/'</span>]</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Analyze the site's URLs and write rules that extract the links to product detail pages</span></span><br><span class="line">    rules = (</span><br><span class="line">        Rule(LinkExtractor(allow=<span class="string">r'/cp\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.html$|/pg\d+-cp\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.html$'</span>, deny=<span class="string">r'/cp98.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.html'</span>),</span><br><span class="line">             follow=<span class="keyword">True</span>),</span><br><span class="line">        Rule(LinkExtractor(allow=<span class="string">r'/cid\d+.html$|/pg\d+-cid\d+.html$'</span>, deny=<span class="string">r'/cp98.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.\d{2}.html'</span>),</span><br><span class="line">             follow=<span class="keyword">True</span>),</span><br><span class="line">        Rule(LinkExtractor(allow=<span class="string">r'product.dangdang.com/\d+.html$'</span>, restrict_xpaths=(<span class="string">"//p[@class='name']/a"</span>)),</span><br><span class="line">             callback=<span class="string">'parse_item'</span>,</span><br><span class="line">             follow=<span class="keyword">False</span>),  <span class="comment"># allow combined with restrict_xpaths works well and filters links more precisely.</span></span><br><span class="line">    )</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Parse the product detail page</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse_item</span><span class="params">(self, response)</span>:</span></span><br><span class="line">        item = DangdangItem()  <span class="comment"># Instantiate the item</span></span><br><span class="line">        commment_item = CommentItem()</span><br><span class="line">        item[<span class="string">"category"</span>] = response.xpath(<span class="string">'//*[@id="breadcrumb"]/a[1]/b/text()'</span>).extract_first()+<span class="string">'>'</span>+response.xpath(<span class="string">'//*[@id="breadcrumb"]/a[2]/text()'</span>).extract_first()+<span class="string">'>'</span>+response.xpath(<span class="string">'//*[@id="breadcrumb"]/a[3]/text()'</span>).extract_first()</span><br><span class="line">        item[<span class="string">"title"</span>] = response.xpath(<span class="string">"//*[@id='product_info']/div[1]/h1/@title"</span>).extract_first()</span><br><span class="line">        item[<span class="string">"detail"</span>] = json.dumps(response.xpath(<span class="string">"//*[@id='detail_describe']/ul//li/text()"</span>).extract(),ensure_ascii=<span class="keyword">False</span>)</span><br><span class="line">        item[<span class="string">"link"</span>] = response.url</span><br><span class="line">        item[<span class="string">"img_link"</span>] =json.dumps(response.xpath(<span class="string">"//div[@class='img_list']/ul//li/a/@data-imghref"</span>).extract())</span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            item[<span class="string">"price"</span>] = response.xpath(<span class="string">"//*[@id='dd-price']/text()"</span>).extract()[<span class="number">1</span>].strip()</span><br><span class="line">        <span class="keyword">except</span> IndexError <span class="keyword">as</span> e:</span><br><span class="line">            item[<span class="string">"price"</span>] = response.xpath(<span class="string">"//*[@id='dd-price']/text()"</span>).extract()[<span class="number">0</span>].strip()</span><br><span class="line">        item[<span class="string">"comment_num"</span>] = response.xpath(<span class="string">"//*[@id='comm_num_down']/text()"</span>).extract()[<span class="number">0</span>]</span><br><span class="line"></span><br><span class="line">        <span class="keyword">try</span>:</span><br><span class="line">            item[<span class="string">"source"</span>] = response.xpath(<span class="string">"//*[@id='shop-geo-name']/text()"</span>).extract()[<span class="number">0</span>].replace(<span class="string">'\xa0至'</span>,<span class="string">''</span>)</span><br><span class="line">        <span class="keyword">except</span> IndexError <span class="keyword">as</span> e:</span><br><span class="line">            item[<span class="string">"source"</span>] = <span class="string">'当当自营'</span></span><br><span class="line"> </span><br><span class="line">        <span class="comment"># Extract the product ID from the URL with a regular expression</span></span><br><span class="line">        goodsid = re.compile(<span class="string">r'/(\d+)\.html'</span>).findall(response.url)[<span class="number">0</span>] </span><br><span class="line">        commment_item[<span class="string">'goods_id'</span>] = goodsid</span><br><span class="line">        item[<span class="string">"goods_id"</span>] = goodsid</span><br><span class="line"></span><br><span class="line">        <span class="string">'''########################################################</span></span><br><span class="line"><span class="string">        Extract the positive-review rate from the request found by inspecting network traffic</span></span><br><span class="line"><span class="string">        ########################################################'''</span></span><br><span class="line">        <span class="comment"># Extract categoryPath from the source of the detail page</span></span><br><span class="line">        script = response.xpath(<span class="string">"/html/body/script[1]/text()"</span>).extract()[<span class="number">0</span>]</span><br><span class="line">        categoryPath = re.compile(<span class="string">r'.*categoryPath":"(.*?)","describeMap'</span>).findall(script)[<span class="number">0</span>]</span><br><span class="line">        <span class="comment"># Build the URL of the request that returns the positive-review rate</span></span><br><span class="line">        rate_url = <span class="string">"http://product.dangdang.com/index.php?r=comment%2Flist&productId="</span>+str(goodsid)+<span class="string">"&categoryPath="</span>+str(categoryPath)+<span class="string">"&mainProductId="</span>+str(goodsid)</span><br><span class="line">        r = requests.get(rate_url)</span><br><span class="line">        data_dict = json.loads(r.text)</span><br><span class="line">        item[<span class="string">"rate"</span>] = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'summary'</span>][<span class="string">'goodRate'</span>]</span><br><span class="line">        item[<span class="string">"good_comment_num"</span>] = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'summary'</span>][<span class="string">'total_crazy_count'</span>]</span><br><span class="line">        item[<span class="string">"mid_comment_num"</span>] = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'summary'</span>][<span class="string">'total_indifferent_count'</span>]</span><br><span class="line">        item[<span class="string">"bad_comment_num"</span>] = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'summary'</span>][<span class="string">'total_detest_count'</span>]</span><br><span class="line">        <span class="keyword">yield</span> item</span><br><span class="line"></span><br><span class="line">        <span class="string">'''#####################################################</span></span><br><span class="line"><span class="string">        Clean and crawl the reviews and ratings</span></span><br><span class="line"><span class="string">        #####################################################'''</span></span><br><span class="line">        html_str = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'html'</span>]</span><br><span class="line">        html = etree.HTML(html_str)</span><br><span class="line">        comment_items = html.xpath(<span class="string">'//div[@class="comment_items clearfix"]'</span>)</span><br><span class="line">        pageIndex = <span class="number">1</span></span><br><span class="line">        <span class="keyword">while</span> comment_items: </span><br><span class="line">            pageIndex += <span class="number">1</span></span><br><span class="line">            <span class="keyword">for</span> item <span class="keyword">in</span> comment_items:</span><br><span class="line">                comment_unit = item.xpath(<span class="string">'.//div[@class="describe_detail"][1]/span[not(@class="icon")]/text()'</span>)</span><br><span class="line">                score = item.xpath(<span class="string">'.//div[@class="pinglun"]/em/text()'</span>)[<span class="number">0</span>]</span><br><span class="line">                time = item.xpath(<span class="string">'.//div[@class="items_right"]/div[@class="starline clearfix"][1]/span[1]/text()'</span>)[<span class="number">0</span>]</span><br><span class="line">                comment = <span class="string">' '</span>.join(comment_unit)</span><br><span class="line">                commment_item[<span class="string">"comment"</span>] = comment </span><br><span class="line">                commment_item[<span 
class="string">'score'</span>] = score</span><br><span class="line"> commment_item[<span class="string">"time"</span>] = time</span><br><span class="line"> <span class="keyword">yield</span> commment_item</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"> rate_url = <span class="string">"http://product.dangdang.com/index.php?r=comment%2Flist&productId="</span>+str(goodsid)+<span class="string">"&categoryPath="</span>+str(categoryPath)+<span class="string">"&mainProductId="</span>+str(goodsid) + <span class="string">"&pageIndex="</span> + str(pageIndex)</span><br><span class="line"> r = requests.get(rate_url)</span><br><span class="line"> data_dict = json.loads(r.text)</span><br><span class="line"> html_str = data_dict[<span class="string">'data'</span>][<span class="string">'list'</span>][<span class="string">'html'</span>]</span><br><span class="line"> html = etree.HTML(html_str)</span><br><span class="line"> comment_items = html.xpath(<span class="string">'//div[@class="comment_items clearfix"]'</span>)</span><br></pre></td></tr></table></figure><h1 id="建立数据库"><a href="#建立数据库" class="headerlink" title="建立数据库"></a>Setting up the database</h1><p>This project uses a MySQL database named <code>dd</code> containing two tables, <code>goods</code> and <code>comments</code>, with the following structure:</p><p>Fields of the <code>goods</code> table:</p><p><code>goods_id</code>, <code>category</code>, <code>title</code>, <code>price</code>, <code>comment_num</code>, <code>good_comment_num</code>, <code>mid_comment_num</code>, <code>bad_comment_num</code>, <code>rate</code>, <code>source</code>, <code>detail</code>, <code>link</code>, <code>img_link</code>. In order, these are the product ID (primary key), product category, product title, product price, number of comments, number of positive comments, number of neutral comments, number of negative comments, positive-rating rate, product source, product details, product link (unique), and product image links.</p><p>Fields of the <code>comments</code> table:</p><p><code>comments_id</code>, <code>goods_id</code>, <code>comment</code>, <code>score</code>, <code>comment_time</code>. In order, these are the comment ID (primary key), product ID, comment text, product rating, and comment time.</p><p>Creating the database and tables is not covered here; please look it up yourself.</p><h1 id="使用方法"><a href="#使用方法" class="headerlink" title="使用方法"></a>Usage</h1><p>Once the steps above are complete, start the crawl by running the following command in a shell:</p><figure 
class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy crawl dd</span><br></pre></td></tr></table></figure><h1 id="结果展示"><a href="#结果展示" class="headerlink" title="结果展示"></a>Results</h1><p>In total, <strong>214673</strong> product records and <strong>1375225</strong> comment records were crawled. (This is not all of Dangdang's products; I crawled only part of the catalog.)</p><p><strong>Product table:</strong></p><p><img src="/2018/09/01/基于Scrapy框架的CrawlSpider类爬取当当全网商品信息/goods.png" alt="结果展示"></p><p><strong>Comment table:</strong></p><p><img src="/2018/09/01/基于Scrapy框架的CrawlSpider类爬取当当全网商品信息/comments.png" alt="结果展示"></p>]]></content>
<categories>
<category> python3网络爬虫 </category>
</categories>
<tags>
<tag> 爬虫 </tag>
<tag> scrapy </tag>
</tags>
</entry>
<entry>
<title>Crawling Xici proxies with Scrapy and verifying that they work</title>
<link href="/2018/08/02/%E5%9F%BA%E4%BA%8Escrapy%E6%A1%86%E6%9E%B6%E7%88%AC%E5%8F%96%E8%A5%BF%E5%88%BA%E4%BB%A3%E7%90%86%E5%B9%B6%E9%AA%8C%E8%AF%81%E6%9C%89%E6%95%88%E6%80%A7/"/>
<url>/2018/08/02/%E5%9F%BA%E4%BA%8Escrapy%E6%A1%86%E6%9E%B6%E7%88%AC%E5%8F%96%E8%A5%BF%E5%88%BA%E4%BB%A3%E7%90%86%E5%B9%B6%E9%AA%8C%E8%AF%81%E6%9C%89%E6%95%88%E6%80%A7/</url>
<content type="html"><![CDATA[<h1 id="前言"><a href="#前言" class="headerlink" title="前言"></a>Introduction</h1><p>When crawling a website, some sites limit how often a user may send requests, and crawling too fast gets your IP banned; using proxies helps avoid the ban. This project uses the Scrapy framework to crawl the <a href="http://www.xicidaili.com/" target="_blank" rel="noopener">Xici proxy site</a>, checks whether each crawled proxy works, and writes the working proxies to a JSON file.</p><p>GitHub repository: <a href="https://github.com/RunningGump/crawl_xiciproxy" target="_blank" rel="noopener">https://github.com/RunningGump/crawl_xiciproxy</a></p><h1 id="依赖"><a href="#依赖" class="headerlink" title="依赖"></a>Dependencies</h1><ol><li>Python 3.6</li><li>Scrapy 1.5.0</li></ol><h1 id="创建项目"><a href="#创建项目" class="headerlink" title="创建项目"></a>Creating the project</h1><p>First, create a Scrapy project by running the <code>scrapy startproject</code> command in a shell:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy startproject xiciproxy</span><br><span class="line">New Scrapy project <span class="string">'xiciproxy'</span>, using template directory <span class="string">'/usr/local/lib/python3.6/dist-packages/scrapy/templates/project'</span>, created <span class="keyword">in</span>:</span><br><span class="line">    /home/geng/xiciproxy</span><br><span class="line"></span><br><span class="line">You can start your first spider with:</span><br><span class="line">    <span class="built_in">cd</span> xiciproxy</span><br><span class="line">    scrapy genspider example example.com</span><br></pre></td></tr></table></figure><p>With the project named <code>xiciproxy</code> created, change into the new project directory:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> xiciproxy</span><br></pre></td></tr></table></figure><p>Then use <code>scrapy genspider <name> <domain></code> to create a spider:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy genspider xici xicidaili.com</span><br><span class="line">Created spider <span class="string">'xici'</span> using template <span class="string">'basic'</span> <span class="keyword">in</span> module:</span><br><span class="line">  xiciproxy.spiders.xici</span><br></pre></td></tr></table></figure>
<p>Now go back to the parent directory with <code>cd ..</code> and list the files in the project directory with the <code>tree</code> command. The output looks like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">cd</span> ..</span><br><span class="line">$ tree xiciproxy</span><br><span class="line">xiciproxy</span><br><span class="line">├── scrapy.cfg</span><br><span class="line">└── xiciproxy</span><br><span class="line">    ├── __init__.py</span><br><span class="line">    ├── items.py</span><br><span class="line">    ├── middlewares.py</span><br><span class="line">    ├── pipelines.py</span><br><span class="line">    ├── __pycache__</span><br><span class="line">    │   ├── __init__.cpython-36.pyc</span><br><span class="line">    │   └── settings.cpython-36.pyc</span><br><span class="line">    ├── settings.py</span><br><span class="line">    └── spiders</span><br><span class="line">        ├── __init__.py</span><br><span class="line">        ├── __pycache__</span><br><span class="line">        │   └── __init__.cpython-36.pyc</span><br><span class="line">        └── xici.py</span><br><span class="line"></span><br><span class="line">4 directories, 11 files</span><br></pre></td></tr></table></figure>
<p>At this point the project has been created successfully.</p><h1 id="分析页面"><a href="#分析页面" class="headerlink" title="分析页面"></a>Analyzing the page</h1><p>Before writing the spider, first analyze the pages to be crawled. All mainstream browsers ship with tools or extensions for inspecting pages; here we use Chrome's developer tools.</p><h2 id="链接信息"><a href="#链接信息" class="headerlink" title="链接信息"></a>Link structure</h2><p>Open <a href="http://www.xicidaili.com/" target="_blank" rel="noopener">http://www.xicidaili.com/</a> in Chrome. Clicking <code>国内高匿代理</code> (domestic high-anonymity proxies) and <code>国内普通代理</code> (domestic ordinary proxies) and paging through the results reveals the following URL pattern:</p><p><code>http://www.xicidaili.com/param1/param2</code></p><p>In param1, <code>nn</code> stands for high-anonymity proxies and <code>nt</code> for ordinary proxies; param2 is the page number: 1, 2, 3, 4, …</p><h2 id="数据信息"><a href="#数据信息" class="headerlink" title="数据信息"></a>Data</h2><p>High-anonymity proxies are generally used for crawling. A high-anonymity proxy does not alter the client's request, so to the server it looks as if a real client browser is visiting; the client's real IP is hidden, and the server does not detect that a proxy is in use.</p><p>This section uses high-anonymity proxies as the example for extracting data from the page. Open <a href="http://www.xicidaili.com/nn" target="_blank" rel="noopener">http://www.xicidaili.com/nn</a> in Chrome, press <code>F12</code> to open the developer tools, and click Elements to view the HTML. Each proxy's information is wrapped in a <code>tr</code> tag, as shown below:</p><p><img src="/2018/08/02/基于scrapy框架爬取西刺代理并验证有效性/xici1.png" alt="一个tr标签对应一条代理信息"></p><p>Now look at a single <code>tr</code> tag on its own:</p><figure class="highlight html"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="tag"><<span class="name">tr</span> <span class="attr">class</span>=<span class="string">"odd"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"country"</span>></span><span class="tag"><<span class="name">img</span> <span class="attr">src</span>=<span class="string">"http://fs.xicidaili.com/images/flag/cn.png"</span> <span class="attr">alt</span>=<span class="string">"Cn"</span>></span><span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>115.198.35.213<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>6666<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"/2018-07-20/zhejiang"</span>></span>浙江杭州<span class="tag"></<span class="name">a</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"country"</span>></span>高匿<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>HTTPS<span 
class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"country"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">title</span>=<span class="string">"0.144秒"</span> <span class="attr">class</span>=<span class="string">"bar"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">class</span>=<span class="string">"bar_inner fast"</span> <span class="attr">style</span>=<span class="string">"width:85%"</span>></span></span><br><span class="line"> </span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"country"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">title</span>=<span class="string">"0.028秒"</span> <span class="attr">class</span>=<span class="string">"bar"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">class</span>=<span class="string">"bar_inner fast"</span> <span class="attr">style</span>=<span class="string">"width:96%"</span>></span></span><br><span class="line"> </span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> </span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>15天<span 
class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>18-08-04 15:33<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br></pre></td></tr></table></figure><p>From this we can see that the IP address is in <code>td[2]</code>, the port is in <code>td[3]</code>, and the type (http/https) is in <code>td[6]</code>.</p><h1 id="程序编写"><a href="#程序编写" class="headerlink" title="程序编写"></a>Writing the program</h1><p>With the page analyzed, the next step is to write the spider. Most of the work in this project is in <code>xici.py</code>; <code>settings.py</code> needs only minor changes.</p><h2 id="实现spider"><a href="#实现spider" class="headerlink" title="实现spider"></a>Implementing the spider</h2><p>This means writing the <code>xici.py</code> file, shown below:</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span 
class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> scrapy</span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">scrapy crawl xici -o out.json -a num_pages=50 -a typ=nn</span></span><br><span class="line"><span class="string">`out.json` is the JSON file that receives the working proxies, `num_pages` is the number of pages to crawl, and `typ` is the proxy type: `nn` for high-anonymity, `nt` for ordinary.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">XiCiSpider</span><span class="params">(scrapy.Spider)</span>:</span></span><br><span class="line">    <span class="comment"># Unique identifier of this spider</span></span><br><span class="line">    name = <span class="string">'xici'</span></span><br><span class="line">    <span class="comment"># Arguments given with -a on the command line are passed to the spider's __init__ method</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, num_pages=<span class="number">5</span>, typ=<span class="string">'nn'</span>, *args, **kwargs)</span>:</span></span><br><span class="line">        super().__init__(*args, **kwargs)</span><br><span class="line">        self.num_pages = int(num_pages)</span><br><span class="line">        self.typ = typ</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Define the starting requests</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">start_requests</span><span class="params">(self)</span>:</span></span><br><span class="line">        <span class="keyword">for</span> page <span class="keyword">in</span> range(<span class="number">1</span>, self.num_pages + <span class="number">1</span>):</span><br><span class="line">            url = <span class="string">'http://www.xicidaili.com/{}/{}'</span>.format(self.typ, page)</span><br><span class="line">            <span class="keyword">yield</span> scrapy.Request(url=url)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Parse the page returned in the response</span></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">parse</span><span class="params">(self, response)</span>:</span></span><br><span class="line">        proxy_list = response.xpath(<span class="string">'//table[@id = "ip_list"]/tr[position()>1]'</span>)</span><br><span class="line">        <span class="keyword">for</span> tr <span class="keyword">in</span> proxy_list:</span><br><span class="line">            <span class="comment"># Extract the proxy's ip, port and scheme (http or https)</span></span><br><span class="line">            ip = tr.xpath(<span class="string">'td[2]/text()'</span>).extract_first()</span><br><span class="line">            port = tr.xpath(<span class="string">'td[3]/text()'</span>).extract_first()</span><br><span class="line">            scheme = tr.xpath(<span class="string">'td[6]/text()'</span>).extract_first()</span><br><span class="line"></span><br><span class="line">            <span class="comment"># Send a request through the crawled proxy to http(s)://httpbin.org/ip to check that the proxy works</span></span><br><span class="line">            url = <span class="string">'%s://httpbin.org/ip'</span> % scheme</span><br><span class="line">            proxy = <span class="string">'%s://%s:%s'</span> % (scheme, ip, port)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">            meta = {</span><br><span class="line">                <span class="string">'proxy'</span>: proxy,</span><br><span class="line">                <span class="string">'dont_retry'</span>: <span class="keyword">True</span>,</span><br><span class="line">                <span class="string">'download_timeout'</span>: <span class="number">5</span>,</span><br><span class="line">                <span class="comment"># The ip field below is passed on to check_available so it can test whether the proxy hides the IP</span></span><br><span class="line">                <span class="string">'_proxy_ip'</span>:ip,</span><br><span class="line">            }</span><br><span class="line">            <span class="keyword">yield</span> scrapy.Request(url, callback=self.check_available, meta=meta, dont_filter=<span class="keyword">True</span>)</span><br><span class="line"></span><br><span class="line">    <span class="function"><span class="keyword">def</span> <span class="title">check_available</span><span class="params">(self, response)</span>:</span></span><br><span class="line">        proxy_ip = response.meta[<span class="string">'_proxy_ip'</span>]</span><br><span class="line">        <span class="comment"># Check whether the proxy hides the real IP</span></span><br><span class="line">        <span class="keyword">if</span> proxy_ip == json.loads(response.text)[<span class="string">'origin'</span>]:</span><br><span class="line">            <span class="keyword">yield</span>{</span><br><span class="line">                <span class="string">'proxy'</span>: response.meta[<span class="string">'proxy'</span>]</span><br><span class="line">            }</span><br></pre></td></tr></table></figure><h2 id="修改配置文件"><a href="#修改配置文件" class="headerlink" title="修改配置文件"></a>Modifying the settings file</h2><ul><li>Change USER_AGENT: the Xici proxy site inspects the user-agent of each request to decide whether it comes from a real user or from a bot.</li><li>Do not obey the robots protocol: sites use the robots protocol to tell search engines which pages may be crawled and which may not, and it usually disallows crawling the valuable information, so we do not obey it.</li><li>Disable cookies: if you do not need cookies, do not let the server see yours.</li></ul><p>The changes in <code>settings.py</code> are as follows:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">USER_AGENT = "Mozilla/5.0 
(X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"</span><br><span class="line">ROBOTSTXT_OBEY = False</span><br><span class="line">COOKIES_ENABLED = False</span><br></pre></td></tr></table></figure><p>With <code>xici.py</code> and <code>settings.py</code> written, the project is complete.</p><h1 id="使用方法"><a href="#使用方法" class="headerlink" title="使用方法"></a>Usage</h1><p>To use it, run the following command in a shell:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ scrapy crawl xici -o out.json -a num_pages=10 -a typ=nn</span><br></pre></td></tr></table></figure><p>Here <code>out.json</code> is the JSON file that receives the working proxies, <code>num_pages</code> is the number of pages to crawl, and <code>typ</code> is the type of proxy to crawl: <code>nn</code> for high-anonymity proxies, <code>nt</code> for ordinary proxies.</p><blockquote><p><strong>Note</strong>: while validating proxies, the program raises timeout exceptions for proxies that do not work. Ignore these exceptions and let the program run until it finishes.</p></blockquote>]]></content>
<categories>
<category> python3网络爬虫 </category>
</categories>
<tags>
<tag> 爬虫 </tag>
<tag> 西刺代理 </tag>
</tags>
</entry>
</search>