stayhigh, Python: March 2016

Sunday, March 20, 2016

Investigating how to work around Koding's "Keep VM always on" restriction

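What is Koding?
Koding is a service that provides developers with a cloud development environment. Registered users get one free virtual machine (VM) with basic configuration files and a ready-made environment. If you need a free VM to practice web development, register at https://koding.com/; once registered, you have a free VM for practicing programming. The drawback is that a free account's VM is shut down every 30 minutes; only paid plans offer the "Keep VM always on" feature.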

Skipping the registration steps, this is the screen after logging in to Koding:

Click koding-vm-0 on the left to view its settings. The General tab shows "Keep VM always on", but free users cannot enable it:

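Koding officially suggests that free users keep the VM running by logging in to the Koding website, which got me thinking about how to get around the limit. In this experiment I used Selenium, a web automation tool, to write a program that logs in automatically at intervals of less than 30 minutes, and this successfully kept the VM running. In short, the approach is to use Selenium to log in to the Koding account automatically and then start the VM.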

Python sample code:
https://github.com/stayhigh/koding-vm-active-selenium/blob/master/koding_login_active_vm.py
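
The periodic-login idea can be sketched as a small keep-alive loop. This is a minimal sketch under stated assumptions, not the code in the linked repository: `keep_alive` and `koding_login` are hypothetical names, and the login URL and form field names are guesses that would have to be replaced with the real page's locators.

```python
import time


def keep_alive(login, interval_seconds, rounds, sleep=time.sleep):
    """Call login() `rounds` times, pausing `interval_seconds` between calls.

    Returns a list of booleans, one per attempt (True = login succeeded).
    """
    results = []
    for attempt in range(rounds):
        try:
            login()
            results.append(True)
        except Exception as exc:  # keep looping even if one attempt fails
            print("login attempt %d failed: %s" % (attempt + 1, exc))
            results.append(False)
        if attempt < rounds - 1:
            sleep(interval_seconds)
    return results


def koding_login(username, password):
    """Hypothetical Selenium login; the URL and field names are assumptions."""
    # Imported here so keep_alive() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://koding.com/Login")  # assumed login URL
        driver.find_element(By.NAME, "username").send_keys(username)  # assumed field name
        password_box = driver.find_element(By.NAME, "password")  # assumed field name
        password_box.send_keys(password)
        password_box.submit()
    finally:
        driver.quit()
```

For example, `keep_alive(lambda: koding_login(user, pw), interval_seconds=25 * 60, rounds=48)` would log in every 25 minutes; the interval only needs to be shorter than the 30-minute shutdown window.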

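Selenium's web automation offers two important kinds of wait, and they differ as follows:
explicit wait (waits for a specific condition to be met, up to the specified number of seconds)
implicit wait (polls the DOM for a set time whenever an element is not immediately found)
The official documentation specifically warns not to mix the two, because doing so can lead to unpredictable, longer wait times.
http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp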
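
To make the difference concrete, here is a minimal sketch. `wait_until` is a hypothetical pure-Python helper that mimics the idea of an explicit wait (poll a condition until it returns something truthy or a timeout expires); it is not Selenium's implementation. The two `selenium_*` functions show the corresponding Selenium calls, with the element id `login-button` being an assumption.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or `timeout` seconds pass.

    Illustrates the idea behind an explicit wait; raises TimeoutError on failure.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


def selenium_explicit_wait(driver):
    """Explicit wait: block until one specific condition holds (at most 10 s)."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # "login-button" is a hypothetical element id.
    return WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.ID, "login-button"))
    )


def selenium_implicit_wait(driver):
    """Implicit wait: every later find_element call polls the DOM for up to 10 s."""
    driver.implicitly_wait(10)
```

The explicit form is tied to a single condition at a single point in the script, while the implicit form is a driver-wide setting, which is one reason the two are best not combined.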

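The sample code uses Selenium, but other automation tools are also worth a look:

xdotool: convenient to use from shell scripts; it can simulate keyboard and mouse activity
Python modules for keyboard and mouse automation are listed below: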

- pyautogui: a Python module that mainly provides GUI automation by controlling the keyboard and mouse.
Installing pyautogui:
On Windows, there are no other modules to install.
On OS X run sudo pip3 install pyobjc-framework-Quartz, sudo pip3 install pyobjc-core, and then sudo pip3 install pyobjc.
On Linux, run sudo pip3 install python3-xlib, sudo apt-get install scrot, sudo apt-get install python3-tk, and sudo apt-get install python3-dev. (Scrot is a screenshot program that PyAutoGUI uses.)
For more on pyautogui, see: https://automatetheboringstuff.com/chapter18/

For other Python automation tools, see: http://schurpf.com/python-automation/
- SendKeysCtypes
- PYHK
- win32gui
- pywinauto
- mouse

The HTML script tag's three execution modes explained: prefer async!

Source: http://peter.sh/experiments/asynchronous-and-deferred-javascript-execution-explained/

Asynchronous and deferred JavaScript execution explained

The HTML <script> element allows you to define when the JavaScript code in your page should start executing. The “async” and “defer” attributes were added to WebKit early September. Firefox has been supporting them quite a while already. Does your browser support the attributes?

- Normal execution <script>
- This is the default behavior of the <script> element. Parsing of the HTML code pauses while the script is executing. For slow servers and heavy scripts this means that displaying the webpage will be delayed.

- Deferred execution <script defer>
- Simply put: delaying script execution until the HTML parser has finished. A positive effect of this attribute is that the DOM will be available for your script. However, since not every browser supports defer yet, don’t rely on it!

- Asynchronous execution <script async>
- Don’t care when the script will be available? Asynchronous is the best of both worlds: HTML parsing may continue and the script will be executed as soon as it’s ready. I’d recommend this for scripts such as Google Analytics.

Summarizing the key points of the English text above, the three forms of the script tag are:
- <script>
- <script defer>
- <script async>

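Three terms appear in the figure above: parser | net | execution
- net: the period during which the JavaScript file is downloaded
- execution: the period during which the JavaScript file is executed
- parser: the period during which the browser parses the HTML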

As the figure shows, async is very useful in web page design: the JavaScript file is downloaded while the parser keeps working. Prefer async!
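
The timing differences among the three modes can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: `script_timeline` is a hypothetical function, the millisecond values are made up, and the model covers a single external script while ignoring everything else a real browser does.

```python
def script_timeline(mode, parse_ms, tag_at_ms, download_ms, exec_ms):
    """Return (exec_start_ms, parse_done_ms) for one external script.

    parse_ms   total time the parser needs for the HTML on its own
    tag_at_ms  parsing time already spent when the <script> tag is reached
    """
    downloaded = tag_at_ms + download_ms
    if mode == "normal":
        # Parser stops at the tag and waits for download and execution.
        exec_start = downloaded
        parse_done = parse_ms + download_ms + exec_ms
    elif mode == "defer":
        # Download runs alongside parsing; execution waits for the parser.
        exec_start = max(parse_ms, downloaded)
        parse_done = parse_ms
    elif mode == "async":
        # Download runs alongside parsing; execution starts once downloaded
        # and pauses the parser if it is still running.
        exec_start = downloaded
        parse_done = parse_ms + exec_ms if downloaded < parse_ms else parse_ms
    else:
        raise ValueError("unknown mode: %r" % mode)
    return exec_start, parse_done


for mode in ("normal", "defer", "async"):
    print(mode, script_timeline(mode, parse_ms=100, tag_at_ms=20,
                                download_ms=50, exec_ms=10))
# normal (70, 160)
# defer (100, 100)
# async (70, 110)
```

In this model both defer and async remove the download stall from parsing; async runs the script earliest, while defer leaves parsing uninterrupted.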

Logging in to GitHub automatically with Selenium

The following tools are used to automate the GitHub login:
selenium: a web automation tool, see http://www.seleniumhq.org/
Python's getpass module: prompts the user for a password without echoing it

The sample code is in stayhigh's GitHub repository:
https://github.com/stayhigh/github-login-selenium/blob/master/github_login.py

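What the code does:
The script logs in to GitHub with redhero0702@gmail.com as the user name. To use it yourself, change
github_account = "redhero0702@gmail.com"
to your own GitHub user name. For example, if your user name is your_github_account@gmail.com:
github_account = "your_github_account@gmail.com"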
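
How Selenium and getpass fit together can be sketched as follows. This is not the code in the linked repository: `github_login` and `main` are hypothetical names, and the element ids `login_field` and `password` are assumptions about the login page markup that may need adjusting.

```python
import getpass

GITHUB_LOGIN_URL = "https://github.com/login"


def github_login(driver, account, password):
    """Fill in and submit the GitHub login form using an existing WebDriver.

    The element ids below are assumptions about the login page markup.
    """
    driver.get(GITHUB_LOGIN_URL)
    # Selenium's By.ID constant is the plain string "id".
    driver.find_element("id", "login_field").send_keys(account)
    password_box = driver.find_element("id", "password")
    password_box.send_keys(password)
    password_box.submit()


def main():
    # Imported here so github_login() can be exercised without Selenium.
    from selenium import webdriver

    github_account = "your_github_account@gmail.com"
    # getpass reads the password without echoing it to the terminal.
    password = getpass.getpass("GitHub password for %s: " % github_account)
    driver = webdriver.Firefox()
    try:
        github_login(driver, github_account, password)
    finally:
        driver.quit()
```

Reading the password with getpass keeps it out of the source file, so only the account name has to be edited in the script.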

Monday, March 14, 2016

Python Scrapy quick guide and tutorial

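# Install the Google Chrome extension SelectorGadget to quickly obtain selectors, XPath expressions, and similar information; it is especially handy for web scraping

# Install the Firefox add-on SQLite Manager to inspect database contents: http://www.minwt.com/website/server/4964.html

# Create a new Scrapy project named apple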

stayhigh@stayhighnet:/Users/stayhigh/projects/apple $ scrapy startproject apple

# Run the apple Scrapy project (scrapy crawl takes the spider's name)

stayhigh@stayhighnet:/Users/stayhigh/projects/apple $ scrapy crawl apple

# Run the apple Scrapy project and write the output to a.json in JSON format

stayhigh@stayhighnet:/Users/stayhigh/projects/apple $ scrapy crawl apple -o a.json -t json

# Run a crawl that can be paused and resumed, storing its state in the job1 directory

stayhigh@stayhighnet:/Users/stayhigh/projects/apple $ scrapy crawl apple -s JOBDIR=job1

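# View the apple project's directory structure

- crawler.py: the user-defined spider, which scrapes pages by subclassing scrapy.Spider

- items.py: defines the data fields

- pipelines.py: defines how scraped items are processed after the spider yields them

- settings.py: the settings file, used to enable features such as the commonly used item pipeline; remember to point the setting at the ApplePipeline class in pipelines.py, that is, apple.pipelines.ApplePipeline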

ITEM_PIPELINES = {
    'apple.pipelines.ApplePipeline': 300,
}
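
As an illustration of what the class named in ITEM_PIPELINES might look like, here is a minimal sketch of pipelines.py that writes each item to SQLite. The table name, the `title` and `url` fields, and the `apple.sqlite` file name are all hypothetical; the original project's pipeline may do something different.

```python
import sqlite3


class ApplePipeline(object):
    """Store each scraped item in a SQLite table (hypothetical schema)."""

    def __init__(self, db_path="apple.sqlite"):
        self.db_path = db_path  # hypothetical database file name
        self.conn = None

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS apple (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item.
        self.conn.execute(
            "INSERT INTO apple (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.conn.close()
```

The number 300 in ITEM_PIPELINES is the pipeline's order: when several pipelines are enabled, items pass through them from the lowest value to the highest.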

stayhigh@stayhighnet:/Users/stayhigh/projects/apple $ tree

├── a.json
├── apple
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── crawler.py
│       └── crawler.pyc
└── scrapy.cfg

# How to crawl multiple pages

from scrapy.spiders import CrawlSpider

# The spider class in crawler.py inherits from CrawlSpider

class AppleCrawler(CrawlSpider):

Sunday, March 20, 2016