Saturday, November 23, 2019

Automated web scraping with async in Python 3.8


The code below loads the page at a given URL and counts the quote nodes found in its HTML after the page's JavaScript has generated them.
import asyncio
from pyppeteer import launch
from pyquery import PyQuery as pq

async def main():
    print('start main()')
    browser = await launch()
    print('browser ready')
    page = await browser.newPage()
    print('Page ready')
    await page.goto('http://quotes.toscrape.com/js/')
    print('goto ready')
    doc = pq(await page.content())
    print('pq ready')
    print('Quotes:', doc('.quote').length)
    await browser.close()
    print('browser closed')


if __name__ == '__main__':
    # main() is a coroutine function, so it has to run on an event loop.
    asyncio.run(main())

It has been a while since I last updated this Blogger blog, so here is some new content.
pyppeteer: https://github.com/miyakogi/pyppeteer
Chromium automation library

PyQuery: jQuery-like library for Python
async: keyword for defining coroutines, used to write concurrent code in Python (the async/await syntax has been available since Python 3.5)
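
To make the "concurrent code" part more concrete, here is a minimal sketch that uses only the standard library. The names fake_fetch and scrape_all are hypothetical and not part of the script above, and asyncio.sleep stands in for a real page load, so this only illustrates how several coroutines might overlap their waiting time, not how pyppeteer itself behaves.

```python
import asyncio
import time


async def fake_fetch(name, delay):
    # Hypothetical stand-in for a real page load such as page.goto(...):
    # it waits `delay` seconds without blocking the event loop.
    await asyncio.sleep(delay)
    return f'{name} done'


async def scrape_all():
    # gather() runs the coroutines concurrently and returns their
    # results in the same order the coroutines were passed in.
    return await asyncio.gather(
        fake_fetch('page-1', 0.2),
        fake_fetch('page-2', 0.2),
        fake_fetch('page-3', 0.2),
    )


if __name__ == '__main__':
    start = time.perf_counter()
    results = asyncio.run(scrape_all())
    elapsed = time.perf_counter() - start
    print(results)
    print(f'elapsed: {elapsed:.2f}s')
```

Because the three waits overlap, the total run time should be close to the longest single delay rather than the sum of all three.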