Python Scrapy Data Scraping: Storing the Data in a Database

2016-7-5 邪哥

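Following the earlier post on creating the scraping project, this one covers storing the scraped data in a database.

As for multi-level page crawling and similar topics, as mentioned in the previous section, you can consult the Scrapy documentation yourself or watch for later posts.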


Enough preamble. First, let's create the MySQL table.

We create it directly in the test database of the MySQL instance installed earlier.

CREATE TABLE `base_province` (
  `province_code` tinyint(2) NOT NULL,
  `province_name` varchar(8) DEFAULT '' COMMENT 'province name',
  `codenum` bigint(12) DEFAULT '0' COMMENT 'unified code',
  `status` tinyint(1) DEFAULT '0' COMMENT 'status',
  PRIMARY KEY (`province_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='province information table';
Next, install the Python MySQLdb extension.
# Because of permission issues, we still run this as root
[root@localhost ~]# /usr/local/python/2.7.12/bin/pip install mysql-python
Collecting mysql-python
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 112kB 65kB/s 
Installing collected packages: mysql-python
  Running setup.py install for mysql-python ... done
Successfully installed mysql-python-1.2.5
[193942 refs]
Modify the spider file
[root@localhost ~]# su jeen 
[jeen@localhost root]$ cd /data/jeen/pyscrapy/govstats
[jeen@localhost govstats]$ vi govstats/spiders/jarea.py
# Import MySQLdb at the top
import MySQLdb

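# Fill in the parse method
        print "==Province==URL" + response.url
        # Connection settings match the local test setup; adjust them for your environment
        conn = MySQLdb.connect(host='localhost', user='root', passwd='root', port=3306, db='test', charset='utf8')
        cur = conn.cursor()
        # Parameterized INSERT; MySQLdb escapes the values itself
        sql = ("INSERT INTO `base_province` (`province_code`, `province_name`, `codenum`) "
               "VALUES (%s, %s, %s)")
        # Each <td> in a .provincetr row holds one province link
        tds = response.css(".provincetr").xpath('./td')
        for td in tds:
            # The href looks like "64.html"; strip the suffix to get the province code
            code = td.xpath("./a/@href").extract()[0]
            code = code.replace('.html', '')
            name = td.xpath(".//text()").extract()[0]
            name = name.strip()
            # Right-pad the code with zeros to get the 12-digit unified code
            codenum = code + ("0" * (12 - len(code)))
            value = [code, name, codenum]
            try:
                cur.execute(sql, value)
            except MySQLdb.IntegrityError:
                # Duplicate primary key: the row was inserted on an earlier run
                print 'data exist'
            print ":".join(value)
        conn.commit()
        cur.close()
        conn.close()
# Save and exit when done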
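The per-cell logic in parse can also be pulled out into a small pure function, which makes it easy to check without running a crawl. This is only a minimal sketch, assuming the link target always has the form NN.html; build_province_row is a hypothetical helper name and is not part of Scrapy or of the spider above.

```python
# -*- coding: utf-8 -*-
def build_province_row(href, text, width=12):
    """Turn a link target such as "64.html" and its link text into
    the [code, name, codenum] values inserted into base_province.

    Hypothetical helper for illustration; it mirrors the loop body in parse.
    """
    code = href.replace('.html', '')
    name = text.strip()
    # Right-pad with zeros so the unified code is `width` digits long
    codenum = code.ljust(width, '0')
    return [code, name, codenum]


row = build_province_row('64.html', u' 宁夏回族自治区 ')
print(u":".join(row))
```

Inside parse, the loop body would then reduce to building the row from the extracted href and text and passing it to cur.execute.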
Test the crawl...
[jeen@localhost govstats]$ scrapy crawl jarea  
.....
ImportError: libmysqlclient.so.18: cannot open shared object file: No such file or directory
[244992 refs]
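# libmysqlclient.so.18 was not found and has to be linked in again, so following the earlier LNMP environment setup we once more
# work as root. PS: you can open several terminal windows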
[root@localhost ~]# cd soft/
[root@localhost soft]# cat lnmysql.sh 
src=/usr/local/mysql/lib
dest=/usr/lib
 
for i in `ls $src | egrep "libmysql."`
do
        ln -s $src/$i $dest/$i 
done 
[root@localhost soft]# ./lnmysql.sh 
[root@localhost soft]# ldconfig
# Creating the symlinks was covered before, so it is not explained again here
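For readers who prefer to keep helper steps like this in Python, the loop in lnmysql.sh could be sketched roughly as follows. It is an illustrative equivalent, not what the setup above actually ran: link_mysql_libs is a hypothetical name, the default paths are the ones from the script, and the filter is a prefix match rather than the egrep pattern. Unlike the shell version, it skips names that already exist in the destination.

```python
import os


def link_mysql_libs(src='/usr/local/mysql/lib', dest='/usr/lib'):
    """Symlink every libmysql* file from src into dest.

    Returns the list of link paths that were created.
    """
    created = []
    for name in sorted(os.listdir(src)):
        # Similar filter to `egrep "libmysql."` in the shell script
        if not name.startswith('libmysql'):
            continue
        link = os.path.join(dest, name)
        if os.path.lexists(link):
            # Leave existing files or links untouched
            continue
        os.symlink(os.path.join(src, name), link)
        created.append(link)
    return created
```

Writing into /usr/lib still requires root, and ldconfig has to be run afterwards, as in the transcript above.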
Try the crawl and database insert again
[jeen@localhost govstats]$ scrapy crawl jarea     
....
64:宁夏回族自治区:640000000000
65:新疆维吾尔自治区:650000000000
2016-07-05 17:20:08 [scrapy] INFO: Closing spider (finished)
....
2016-07-05 17:20:08 [scrapy] INFO: Spider closed (finished)
[315847 refs]
# Crawl finished
Success. Go take a look at the data in the database :)
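To check the result from Python rather than from the mysql client, a query along these lines could be used. This is a minimal sketch: fetch_provinces is a hypothetical helper that accepts any DB-API connection, and the demo uses an in-memory sqlite3 database as a stand-in so that it runs without a MySQL server. With the setup from this post you would pass in the connection returned by MySQLdb.connect instead.

```python
# -*- coding: utf-8 -*-
import sqlite3


def fetch_provinces(conn):
    """Return all rows of base_province ordered by province code."""
    cur = conn.cursor()
    cur.execute("SELECT province_code, province_name, codenum "
                "FROM base_province ORDER BY province_code")
    rows = cur.fetchall()
    cur.close()
    return rows


# Stand-in database (assumption: sqlite3 replaces MySQL here),
# filled with the two rows shown in the crawl output above
demo = sqlite3.connect(':memory:')
demo.execute("CREATE TABLE base_province ("
             "province_code INTEGER PRIMARY KEY, "
             "province_name TEXT, codenum INTEGER)")
demo.executemany("INSERT INTO base_province VALUES (?, ?, ?)",
                 [(65, u'新疆维吾尔自治区', 650000000000),
                  (64, u'宁夏回族自治区', 640000000000)])
for row in fetch_provinces(demo):
    print(row)
```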

