Penetration testing tools | arachnado, a what-you-see-is-what-you-get crawler tool built on scrapy

First, the Git address: https://github.com/TeamHG-Memex/arachnado

The library went online in August of last year. Both the author's code and the overall UI are quite good.

The author's demo video was downloaded from YouTube and re-uploaded to Youku.

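The overall result is really good. It is built on tornado, so it is efficient, and it wraps some of the scrapyd web service APIs. All data is stored in MongoDB, and you are free to customize it. Unfortunately, for now the only way to customize the overall crawl logic is to modify the code in the spider. That code is not complicated, though, so you can study it and wrap some APIs of your own.

1. Customizing the spider: can it crawl any website?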

https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/spider.py

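import datetime
import logging

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# helper functions from arachnado/utils.py in the same repository
from .utils import add_scheme_if_missing, get_netloc

class ArachnadoSpider(scrapy.Spider):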
    """
    A base spider that contains common attributes and utilities for all
    Arachnado spiders
    """
    crawl_id = None
    domain = None
    motor_job_id = None
    def __init__(self, *args, **kwargs):
        super(ArachnadoSpider, self).__init__(*args, **kwargs)
        # don't log scraped items
        logging.getLogger("scrapy.core.scraper").setLevel(logging.INFO)
    def get_page_item(self, response, type_='page'):
        return {
            'crawled_at': datetime.datetime.utcnow(),
            'url': response.url,
            'status': response.status,
            'headers': response.headers,
            'body': response.body_as_unicode(),
            'meta': response.meta,
            '_type': type_,
        }
class CrawlWebsiteSpider(ArachnadoSpider):
    """
    A spider which crawls all the website.
    To run it, set its ``crawl_id`` and ``domain`` arguments.
    """
    name = 'crawlwebsite'
    custom_settings = {
        'DEPTH_LIMIT': 10,
    }
    def __init__(self, *args, **kwargs):
        super(CrawlWebsiteSpider, self).__init__(*args, **kwargs)
        self.start_url = add_scheme_if_missing(self.domain)
    def start_requests(self):
        self.logger.info("Started job %s#%d for domain %s",
                         self.motor_job_id, self.crawl_id, self.domain)
        yield scrapy.Request(self.start_url, self.parse_first,
                             dont_filter=True)
    def parse_first(self, response):
        # If there is a redirect in the first request, use the target domain
        # to restrict crawl instead of the original.
        self.domain = get_netloc(response.url)
        self.crawler.stats.set_value('arachnado/start_url', self.start_url)
        self.crawler.stats.set_value('arachnado/domain', self.domain)
        allow_domain = self.domain
        if self.domain.startswith("www."):
            allow_domain = allow_domain[len("www."):]
        self.get_links = LinkExtractor(
            allow_domains=[allow_domain]
        ).extract_links
        for elem in self.parse(response):
            yield elem
    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            self.logger.info("non-HTML response is skipped: %s" % response.url)
            return
        yield self.get_page_item(response)
        for link in self.get_links(response):
            yield scrapy.Request(link.url, self.parse)

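In effect, the spider only follows the ordinary links within the site and saves each page it fetches. It does no processing of any specific data, so it can crawl any site, but extracting particular fields means changing parse yourself.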
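The domain restriction in parse_first can be illustrated without scrapy. The sketch below is only an approximation under stated assumptions: add_scheme_if_missing and get_netloc are hypothetical re-implementations (the real ones live in the repo's utils module and may differ), and is_in_scope is a made-up helper that imitates how allow_domains accepts a host when it equals the domain or is a subdomain of it.

```python
from urllib.parse import urlparse


def add_scheme_if_missing(url):
    """Hypothetical stand-in: prefix "http://" when the input has no scheme."""
    return url if "://" in url else "http://" + url


def get_netloc(url):
    """Hypothetical stand-in: return the host[:port] part of a URL."""
    return urlparse(url).netloc


def allowed_domain(response_url):
    """Mirror parse_first: use the netloc of the final URL, minus a leading "www."."""
    domain = get_netloc(response_url)
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain


def is_in_scope(link, allow_domain):
    """Made-up helper: accept the domain itself and any of its subdomains."""
    host = urlparse(link).netloc.lower()
    return host == allow_domain or host.endswith("." + allow_domain)


if __name__ == "__main__":
    start_url = add_scheme_if_missing("example.com")
    # Pretend the first request was redirected to the www host.
    domain = allowed_domain("https://www.example.com/home")
    print(start_url, domain)
    print(is_in_scope("https://blog.example.com/post/1", domain))
    print(is_in_scope("https://notexample.com/", domain))
```

Because the allowed domain is taken from the first response rather than from the input, a start URL that redirects still keeps the crawl on the target site.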

2. How the data is processed
All items go through an item pipeline. See the code:
https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/motor_exporter/pipelines.py

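The items are then saved to MongoDB. Each stored document carries the fields built by get_page_item: crawled_at, url, status, headers, body, meta and _type.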
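To make the pipeline idea concrete, here is a minimal sketch of a pipeline-shaped class that receives such an item and writes it to a collection. It is not the code in pipelines.py: MongoStorePipeline, to_mongo_document and FakeCollection are hypothetical names, the collection is assumed to be any object with an insert_one() method (as pymongo collections have), and the headers are assumed to arrive as a mapping of bytes to lists of bytes.

```python
import datetime


def to_mongo_document(item):
    """Copy the item and turn header keys/values into plain strings.

    Assumption: headers map bytes keys to lists of bytes values.
    """
    def _text(value):
        return value.decode("utf-8", "replace") if isinstance(value, bytes) else value

    doc = dict(item)
    doc["headers"] = {
        _text(key): [_text(v) for v in values]
        for key, values in dict(item.get("headers", {})).items()
    }
    return doc


class MongoStorePipeline:
    """Hypothetical pipeline; `collection` is anything with insert_one()."""

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Store a cleaned copy and pass the original item on unchanged.
        self.collection.insert_one(to_mongo_document(item))
        return item


class FakeCollection:
    """In-memory stand-in for a MongoDB collection, for this demo only."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


if __name__ == "__main__":
    collection = FakeCollection()
    pipeline = MongoStorePipeline(collection)
    pipeline.process_item({
        "crawled_at": datetime.datetime(2016, 1, 1),
        "url": "http://example.com/",
        "status": 200,
        "headers": {b"Content-Type": [b"text/html"]},
        "body": "<html></html>",
        "meta": {},
        "_type": "page",
    }, spider=None)
    print(collection.docs[0]["headers"])
```

Swapping FakeCollection for a real collection object would be the place to connect this to MongoDB. Processing specific data, as opposed to whole pages, would also fit naturally in process_item.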

Author: brucetang
GitHub: https://github.com/BruceDone
