Python Scrapy Data Scraping: Storing the Data in MySQL
2016-7-5 邪哥
Following on from the project setup in the previous post, this article covers storing the scraped data in the database.
For topics such as multi-level page crawling, see the previous post, consult the Scrapy documentation, or watch for the follow-up articles.
With the preamble out of the way, let's first create the MySQL table.
We create it directly in the test database of the MySQL instance installed earlier:
CREATE TABLE `base_province` (
  `province_code` tinyint(2) NOT NULL,
  `province_name` varchar(8) DEFAULT '' COMMENT 'province name',
  `codenum` bigint(12) DEFAULT '0' COMMENT 'unified statistics code',
  `status` tinyint(1) DEFAULT '0' COMMENT 'status',
  PRIMARY KEY (`province_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='province info table';

Next, install the Python MySQLdb extension:
# Due to permission requirements, we again operate as root
[root@localhost ~]# /usr/local/python/2.7.12/bin/pip install mysql-python
Collecting mysql-python
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 112kB 65kB/s
Installing collected packages: mysql-python
  Running setup.py install for mysql-python ... done
Successfully installed mysql-python-1.2.5
[193942 refs]

Now modify the spider:
[root@localhost ~]# su jeen
[jeen@localhost root]$ cd /data/jeen/pyscrapy/govstats
[jeen@localhost govstats]$ vi govstats/spiders/jarea.py

# At the top of the file, import MySQLdb
import MySQLdb

# Flesh out the parse method
print "==Province==URL" + response.url
conn = MySQLdb.connect(host='localhost', user='root', passwd='root',
                       port=3306, db='test', charset='utf8')
cur = conn.cursor()
tds = response.css(".provincetr").xpath('./td')
for td in tds:
    code = td.xpath("./a/@href").extract()[0]
    code = code.replace('.html', '')
    name = td.xpath(".//text()").extract()[0]
    name = name.strip()
    codenum = code + ("0" * (12 - len(code)))
    value = [code, name, codenum]
    sql1 = "INSERT INTO `base_province` (`province_code`, `province_name`, `codenum`) VALUES "
    sql2 = " (%s, %s, %s)"
    sql = "".join([sql1, sql2])
    try:
        cur.execute(sql, value)
    except:
        print 'data exist'
    print ":".join(value)
conn.commit()
cur.close()
conn.close()

# Save and quit, then test the crawl...
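The codenum line above builds the 12-digit unified statistics code by right-padding the short province code with zeros (e.g. "65" becomes "650000000000", matching the crawl output later in this post). A minimal standalone sketch of that logic, with the function name pad_codenum being ours rather than anything in the original spider:

```python
def pad_codenum(code, width=12):
    """Right-pad a numeric code string with zeros to the full
    unified-statistics-code width (12 digits by default)."""
    return code + "0" * (width - len(code))
```

The same thing can be written as code.ljust(12, "0"), which also behaves sensibly if the input is already 12 characters long.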
[jeen@localhost govstats]$ scrapy crawl jarea
.....
ImportError: libmysqlclient.so.18: cannot open shared object file: No such file or directory
[244992 refs]

# libmysqlclient.so.18 was not found, so we need to expose it to the dynamic linker.
# Following the earlier LNMP setup, switch back to root for this
# (tip: you can keep multiple terminal windows open).
[root@localhost ~]# cd soft/
[root@localhost soft]# cat lnmysql.sh
src=/usr/local/mysql/lib
dest=/usr/lib
for i in `ls $src | egrep "libmysql."`
do
    ln -s $src/$i $dest/$i
done
[root@localhost soft]# ./lnmysql.sh
[root@localhost soft]# ldconfig
# Creating symlinks was covered before, so no need to repeat it here.

Now try the crawl-and-insert again:
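Before re-running the crawl, you can check from Python whether the dynamic linker is able to locate a shared library at all, using only the standard library. This is a diagnostic sketch of our own, not part of the original post; the helper name can_load is an assumption:

```python
import ctypes.util

def can_load(libname):
    """Return True if the dynamic linker can locate the given
    library (searched the same way `ldconfig`'s cache is consulted)."""
    return ctypes.util.find_library(libname) is not None
```

For example, can_load("mysqlclient") should flip from False to True once the symlinks are created and ldconfig has refreshed its cache.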
[jeen@localhost govstats]$ scrapy crawl jarea
....
64:宁夏回族自治区:640000000000
65:新疆维吾尔自治区:650000000000
2016-07-05 17:20:08 [scrapy] INFO: Closing spider (finished)
....
2016-07-05 17:20:08 [scrapy] INFO: Spider closed (finished)
[315847 refs]

# The crawl completed successfully; go have a look at the rows in the database :)
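The spider relies on the primary key on province_code to reject duplicate rows when the crawl is re-run, which is what makes the insert idempotent. That pattern can be sketched with the standard-library sqlite3 module standing in for MySQL (the table and values are illustrative; MySQLdb uses %s placeholders where sqlite3 uses ?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base_province ("
             "province_code TEXT PRIMARY KEY, "
             "province_name TEXT, codenum TEXT)")

# The third row repeats a primary key, simulating a re-crawl.
rows = [
    ("64", "Ningxia", "640000000000"),
    ("65", "Xinjiang", "650000000000"),
    ("65", "Xinjiang", "650000000000"),
]
for row in rows:
    try:
        conn.execute("INSERT INTO base_province VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        pass  # duplicate primary key: the row is already stored
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM base_province").fetchone()[0]
```

Catching the driver's specific integrity error, as done here, is preferable to the bare except in the spider above, since a bare except would also silently swallow connection failures and SQL syntax errors.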