
pyspider

A Powerful Spider (Web Crawler) System in Python. TRY IT NOW!

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # entry point; scheduled to run once every 24 hours
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # pages fetched here are considered fresh for 10 days
        # and will not be re-crawled within that window
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
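In the sample above, repeated `self.crawl` calls for the same page are deduplicated: pyspider's default handler derives a task id from the URL (an MD5 digest), and a task with a known id is not scheduled again while it is still fresh. A minimal standalone sketch of that idea (the helper names here are hypothetical, not pyspider APIs):

```python
import hashlib

def taskid_for(url: str) -> str:
    """Mimic the default task id: the MD5 hex digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

seen = set()

def should_crawl(url: str) -> bool:
    """Schedule a URL only if its task id has not been seen before."""
    tid = taskid_for(url)
    if tid in seen:
        return False
    seen.add(tid)
    return True

print(should_crawl("http://scrapy.org/"))  # first request is scheduled
print(should_crawl("http://scrapy.org/"))  # duplicate is skipped
```

The `age` setting in `@config(age=...)` controls how long a finished task stays fresh before the same URL may be crawled again.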

Demo

Installation

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/
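Following the Quickstart above, a minimal local setup looks like this (assuming Python and pip are available; the all-in-one `pyspider` command and the default port 5000 are taken from the Quickstart):

```shell
# install pyspider and its dependencies from PyPI
pip install pyspider

# start all components (scheduler, fetcher, processor, webui) in one process
pyspider
```

The web UI is then served at http://localhost:5000/, where you can create a project and edit the handler script in the browser.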

Contribute

TODO

v0.4.0

  • a visual scraping interface like Portia

License

Licensed under the Apache License, Version 2.0