baiduzhidaoSpider

Baidu Zhidao (百度知道)

Preface

This project crawls data from zhidao.baidu.com. Although Baidu Zhidao is not a very authoritative community, it is an interesting source of scattered knowledge and so-called experience. As we know, truth is hard to come by, especially when it rests on personal experience. This project helps you build a database of Baidu Zhidao content containing questions, best answers, other answers, and related fields such as the number of thumbs-up, post time, and author.

This project is built on the Scrapy framework with Python 2.7 on Windows. Python 3.x may cause unexpected errors.
Redis is used to remove duplicates, so a Redis database and the scrapy-redis module for Python 2.7 must be installed.
For persistent storage of data, MongoDB is required.
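
To illustrate how Redis-based duplicate removal works in principle, here is a minimal sketch. It is not the project's code: scrapy-redis fingerprints whole requests, while this simplified version hashes only the URL. The `InMemorySet` class is a hypothetical stand-in for a Redis set so the example runs without a server, and the key name and URLs are assumptions chosen for illustration.

```python
import hashlib


class InMemorySet(object):
    """Hypothetical stand-in for a Redis set; mimics the return value of SADD."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present, nothing was added
        members.add(member)
        return 1  # newly added


def fingerprint(url):
    """Hash a URL into a fixed-length hex string (simplified fingerprint)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def is_new(store, url, key="zhidaospider:dupefilter"):
    """Return True the first time a URL is seen, False on every repeat."""
    # The key name is an assumption; any set key would work.
    return store.sadd(key, fingerprint(url)) == 1
```

With a real Redis client, `store` could be a `redis.StrictRedis` instance, since its `sadd` method likewise returns the number of members actually added.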

Download this project and open it in your IDE (e.g. PyCharm).
First, set REDIS_HOST and REDIS_PORT in settings.py. If you use a local Redis database, the defaults are fine.
Similarly, set the host, port, database name, and sheet (collection) name in redisTomongo.py.
In cmd, run "scrapy crawl zhidaospider", then run redisTomongo.py. The data will be stored in MongoDB automatically.
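
The Redis-to-MongoDB step can be sketched roughly as follows. This is a guess at the general approach, not the contents of the project's script: it assumes the crawled items sit in a Redis list as JSON strings, and the key name `zhidaospider:items`, the function name `drain_items`, and the database and collection names in the comments are all hypothetical.

```python
import json


def drain_items(redis_client, collection, key="zhidaospider:items"):
    """Pop JSON items from a Redis list and insert them into a collection.

    Returns the number of items moved.
    """
    moved = 0
    while True:
        raw = redis_client.lpop(key)
        if raw is None:  # the list is empty, nothing left to move
            break
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        collection.insert_one(json.loads(raw))
        moved += 1
    return moved


# Possible usage with real clients (requires the redis and pymongo packages;
# host, port, database and collection names below are assumptions):
#
#   import redis, pymongo
#   r = redis.StrictRedis(host="localhost", port=6379)
#   sheet = pymongo.MongoClient("localhost", 27017)["zhidao"]["questions"]
#   print(drain_items(r, sheet))
```

The function only relies on `lpop` and `insert_one`, so it can be exercised with simple fake objects before pointing it at live databases.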

This project is an initial version; you can add new items to it and improve the details.
If you like this project, please star it. Thanks.
