Wednesday, December 28, 2016

Machine Learning Cheatsheet

A collection of cheat sheets that machine learning practitioners can use to quickly review their tooling.
Python speeds up many of these workflows, and the list also covers a number of related data-analysis tools.
Source: http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html


Cheat sheets for Python: 
Python is a popular choice for beginners, yet still powerful enough to back some of the world's most popular products and applications. Its design makes the programming experience feel almost as natural as writing in English. The Python basics and Python Debugger cheatsheets cover the important syntax beginners need to get started. Community-provided libraries such as NumPy, SciPy, scikit-learn, and pandas are heavily relied on, and the NumPy/SciPy/Pandas Cheat Sheet provides a quick refresher on them. 

Cheat sheets for R: 
R's ecosystem has expanded so much that a lot of referencing is needed. The R Reference Card covers most of the R world in a few pages. RStudio has also published a series of cheatsheets for the R community. The data visualization with ggplot2 sheet seems to be a favorite, as it helps when you are creating graphs of your results. 

Cheat sheets for MySQL & SQL: 
For a data scientist, the basics of SQL are as important as those of any other language. Both Pig and Hive Query Language are closely associated with SQL, the original Structured Query Language. SQL cheatsheets provide a five-minute quick guide to get you started; from there you can explore Hive and MySQL!

Cheat sheets for Spark: 
Apache Spark is an engine for large-scale data processing. For certain applications, such as iterative machine learning, Spark can be up to 100x faster than Hadoop (using MapReduce). The essentials of Apache Spark cheatsheet explains its place in the big data ecosystem, walks through setup and creation of a basic Spark application, and explains commonly used actions and operations.

Cheat sheets for Hadoop & Hive: 
Hadoop emerged as an untraditional tool to solve what was thought to be unsolvable, by providing an open-source software framework for the parallel processing of massive amounts of data. Explore the Hadoop cheatsheets for useful commands when using Hadoop on the command line. A combination of SQL & Hive functions is another one to check out.

Cheat sheets for Machine learning: 
We often find ourselves spending time wondering which algorithm is best, and then going back to our big reference books! These cheat sheets ask about both the nature of your data and the problem you're working to address, and then suggest an algorithm for you to try.

Cheat sheets for Django : 
Django is a free and open-source web application framework written in Python. If you are new to Django, you can go over these cheatsheets, brainstorm the core concepts quickly, and then dive deeper into each one.

Wednesday, April 13, 2016

Common Doxygen problems

Details to watch out for when configuring the wizard:
  • In Step 1, when setting the working directory, do not set the target directory by copy-pasting the string
  • Remember to set DOT_PATH to /usr/local/bin. If it is not set, running doxygen fails with the error message "sh: dot: command not found"; install graphviz and double-check that DOT_PATH is correct.
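As a quick sanity check, the PATH lookup the shell performs for dot can be reproduced with Python's standard library (a minimal sketch; /usr/local/bin is the typical Homebrew location and may differ on your system):

```python
import shutil

# shutil.which performs the same PATH search the shell does for "dot",
# the Graphviz binary that Doxygen's DOT_PATH must be able to reach.
dot_path = shutil.which("dot")

if dot_path is None:
    # This is the situation that produces "sh: dot: command not found":
    # install graphviz and point DOT_PATH (e.g. /usr/local/bin) at it.
    print("dot not found on PATH")
else:
    print("dot found at", dot_path)
```

If the script prints a path, setting DOT_PATH to that path's directory should make the error go away.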

For the usage of the related commands and the comment-writing format, refer to:

GNU parallel usage examples

The examples use 1.txt and 2.txt as the input files:
File 1.txt contains:
A
B
C
File 2.txt contains:
D
E
F

Example command 1:
parallel echo ::: $(cat 1.txt) ::: $(cat 2.txt) 2>/dev/null

Output:
B F
C D
B E
B D
C E
A F
A E
A D
C F

Example command 2:
parallel -k echo ::: $(cat 1.txt) ::: $(cat 2.txt) 2>/dev/null

Output:
A D
A E
A F
B D
B E
B F
C D
C E
C F
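The nine lines produced by both commands are exactly the Cartesian product of the two input sources; a small Python sketch makes the -k ordering explicit:

```python
from itertools import product

# parallel ... ::: A B C ::: D E F runs one job per element of the
# Cartesian product of the two input sources. Without -k, jobs print
# in whatever order they finish; -k forces output back into the
# generation order, which is exactly itertools.product order:
pairs = [f"{a} {b}" for a, b in product("ABC", "DEF")]
print(pairs)
```

With -k the parallel output matches this list line for line; without -k the same nine lines appear in arbitrary order.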

Tuesday, April 12, 2016

Using lynx to quickly save a page's hyperlinks as a list, then downloading the URL list with wget

Overview

When building crawlers, it is often handy to quickly dump every hyperlink that appears on a single page into a list, for later use with wget.

This killer download workflow has three parts:
  1. Extract all URLs linked from a given page and save them to a file
  2. Download all of those URLs with wget
  3. Run multiple download jobs in parallel with GNU parallel
Worked example

The example below uses a URL from the gstreamer official site:
https://gstreamer.freedesktop.org/data/events/gstreamer-conference/2015/

1) Dump the URL list with the following command:
lynx -dump https://gstreamer.freedesktop.org/data/events/gstreamer-conference/2015/ | grep http | awk '{print $2}' > urls.txt
2) Download the URL list with wget:
wget -i urls.txt
3) Run multiple download jobs in parallel with GNU parallel
For parallel usage, see: https://www.gnu.org/software/parallel/parallel_tutorial.html
cat urls.txt | parallel wget {}

Combining the ideas in steps (1), (2), and (3), the whole thing can be automated as a shell script.
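The link-extraction half of the pipeline can also be done without lynx. This stdlib-only Python sketch mirrors steps (1)-(2) on a literal HTML snippet (in a real script the HTML would come from urllib.request.urlopen, and the collected list would be written to urls.txt):

```python
from html.parser import HTMLParser

# Equivalent of lynx -dump | grep http | awk '{print $2}':
# collect every absolute href on the page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

# A literal snippet keeps the sketch self-contained.
html = '<a href="https://example.com/a.pdf">a</a><a href="https://example.com/b.pdf">b</a>'
collector = LinkCollector()
collector.feed(html)
print("\n".join(collector.links))  # the equivalent of urls.txt
```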



Tip (to monitor download progress in a directory, use the watch command to observe the directory's file listing):
watch -d ls

Tip (if zombie processes appear, the following command kills the zombies' parent processes): 
kill $(ps -A -ostat,ppid | awk '/[zZ]/{print $2}')

Because a zombie process has already exited, it cannot be removed with kill directly; instead, remove it by terminating its parent process. (Once the parent dies, the zombie is inherited by init, which waits on it and clears its entry from the process table.)

Reference: http://stackoverflow.com/questions/16944886/how-to-kill-zombie-process

A zombie is already dead, so you cannot kill it. To clean up a zombie, it must be waited on by its parent, so killing the parent should work to eliminate the zombie. (After the parent dies, the zombie will be inherited by init, which will wait on it and clear its entry in the process table.) If your daemon is spawning children that become zombies, you have a bug. Your daemon should notice when its children die and wait on them to determine their exit status.

Monday, April 11, 2016

How to check and set the pager for the 'man' command


The two most common pagers are:
  • /bin/more
  • /bin/less
How to change the pager quickly from the command line:
  • Use the file command to inspect the symlink
  • Use readlink -f to resolve the final target of the link chain
  • To re-point the link at a different pager, use e.g. ln -sf /bin/less /etc/alternatives/pager
  • (Because /etc/alternatives/pager already exists, re-creating the link requires the -f flag)


Sunday, April 3, 2016

NAT and firewall traversal techniques


Reference: http://www.cs.nccu.edu.tw/~lien/Writing/NGN/firewall.htm

About Network Address Translation (NAT):
  • Why NAT? A workaround for the shortage of IPv4 addresses
  • What is NAT? A technique that rewrites the source or destination IP address of packets as they pass through a router or firewall.
  • How many NAT types? 
    • Full cone NAT
    • Address-restricted cone NAT
    • Port-restricted cone NAT
    • Symmetric NAT
Full cone NAT, also known as one-to-one NAT
  • Once an internal address (iAddr:port1) has been mapped to an external address (eAddr:port2), all packets from iAddr:port1 are sent out via eAddr:port2, and any external host can reach iAddr:port1 by sending packets to eAddr:port2.
Address-restricted cone NAT
  • Once an internal address (iAddr:port1) has been mapped to an external address (eAddr:port2), all packets from iAddr:port1 are sent out via eAddr:port2. An external host (hostAddr:any) can reach iAddr:port1 by sending packets to eAddr:port2 only if iAddr:port1 previously sent a packet to hostAddr:any. "any" means the port is not restricted.
Port-restricted cone NAT
Like an address-restricted cone NAT, but the restriction also covers the port.
  • Once an internal address (iAddr:port1) has been mapped to an external address (eAddr:port2), all packets from iAddr:port1 are sent out via eAddr:port2. An external host (hostAddr:port3) can reach iAddr:port1 only if iAddr:port1 previously sent a packet to hostAddr:port3.
Symmetric NAT
  • Each request from the same internal IP and port to a specific destination address and port is mapped to a unique external IP address and port.
    Packets from the same internal IP and port to different destinations or ports use different mappings.
  • Only an external host that has previously received a packet from the internal host can send packets back.
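The cone/symmetric distinction boils down to what the translation table is keyed on. A toy Python sketch (purely illustrative, not real NAT code; port numbers are made up):

```python
# External ports handed out by the (toy) NAT, in order.
ext_port = iter(range(62000, 62100))

full_cone = {}   # keyed only by the internal (addr, port)
symmetric = {}   # keyed by internal (addr, port) AND the destination

def map_full_cone(src):
    # Full cone: one internal socket gets one external port,
    # reused for every destination.
    if src not in full_cone:
        full_cone[src] = next(ext_port)
    return full_cone[src]

def map_symmetric(src, dst):
    # Symmetric: each (internal socket, destination) pair gets
    # its own external port.
    key = (src, dst)
    if key not in symmetric:
        symmetric[key] = next(ext_port)
    return symmetric[key]

src = ("10.0.0.5", 4000)
# Full cone reuses the same mapping no matter who we talk to:
assert map_full_cone(src) == map_full_cone(src)
# Symmetric allocates a distinct external port per destination:
p1 = map_symmetric(src, ("1.2.3.4", 80))
p2 = map_symmetric(src, ("5.6.7.8", 80))
print(p1 != p2)  # True
```

This is exactly why STUN fails on symmetric NAT: the external port a STUN server observes is not the port the NAT will use toward a different peer.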

Common firewall/NAT traversal techniques:
  • UPnP (Universal Plug and Play)
  • STUN (Simple Traversal of UDP Through Network Address Translators), RFC 3489
  • TURN (Traversal Using Relay NAT)
  • ALG (Application Layer Gateway)
  • ICE (Interactive Connectivity Establishment)
UPnP drawback: the NAT must support the UPnP protocol
STUN drawback: cannot traverse symmetric NAT
TURN drawback: the TURN server must carry the relayed connection bandwidth
ALG drawback: for network-security reasons, administrators will not let user applications control their NAT


Related open-source tools:
pystun, reports your external IP address and NAT type: https://github.com/jtriley/pystun
pjnath, open-source ICE, STUN, and TURN library: http://www.pjsip.org/pjnath/docs/html/

Sunday, March 20, 2016

Working around Koding's "VM always on" limitation

What is Koding?
Koding provides developers with a cloud development environment: each registered user gets one free virtual machine with a basic configuration. If you need a free VM for practicing web development, register an account at https://koding.com/; after registering you get a free virtual machine to practice programming on. The one shortcoming is that free accounts have their VM shut down every 30 minutes; only the paid plan offers the "Keep VM always on" feature.

Skipping the registration steps, here is the screen after logging in to Koding:


Click the koding-vm-0 settings on the left; the General tab shows "Keep VM always on", which free users cannot enable:


Koding officially suggests that free users keep the VM alive by logging in to the site, which raised the question of how to work around the limit. This experiment uses the web-automation tool selenium: a script makes the machine log in automatically within every 30-minute window, which successfully keeps the VM running. In short, the approach is to use selenium to automate logging in to the Koding account and then starting the VM.

Python sample code:
https://github.com/stayhigh/koding-vm-active-selenium/blob/master/koding_login_active_vm.py

When using selenium's automation features, note the difference between its two important kinds of waits:
explicit wait (wait for certain conditions, up to a specified number of seconds)
implicit wait (DOM polling, waiting for elements to appear)
The official site specifically warns not to mix the two, as doing so can cause unpredictable wait times.
http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp
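Conceptually, an explicit wait is just condition polling with a deadline (an implicit wait instead sets a driver-wide element-lookup timeout). The explicit case can be sketched in plain Python; this is illustrative only, not selenium's actual implementation:

```python
import time

# A stdlib sketch of what WebDriverWait(driver, timeout).until(cond)
# does: re-check a condition until it holds or the deadline passes.
def wait_until(condition, timeout=2.0, poll=0.05):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True      # condition met before the timeout
        time.sleep(poll)     # back off briefly, then re-check
    return False             # timed out

# A condition that is already true returns almost immediately...
print(wait_until(lambda: True, timeout=2.0))   # True
# ...while a condition that never holds costs the full timeout.
print(wait_until(lambda: False, timeout=0.2))  # False
```

Mixing the two wait styles is risky precisely because the implicit timeout applies inside every element lookup the explicit wait's condition performs, compounding the delays.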


This example uses selenium for the automation, but there are other tools worth considering:

xdotool: very handy in shell scripts; it simulates keyboard and mouse actions
Python modules for keyboard and mouse automation are listed below:

 - pyautogui: a Python module that provides GUI automation by controlling the keyboard and mouse.
How to install pyautogui:
On Windows, there are no other modules to install.
On OS X run sudo pip3 install pyobjc-framework-Quartz, sudo pip3 install pyobjc-core, and then sudo pip3 install pyobjc.
On Linux, run sudo pip3 install python3-xlib, sudo apt-get install scrot, sudo apt-get install python3-tk, and sudo apt-get install python3-dev. (Scrot is a screenshot program that PyAutoGUI uses.)
For details on pyautogui, see: https://automatetheboringstuff.com/chapter18/

For other Python automation tools, see: http://schurpf.com/python-automation/
- SendKeysCtypes
- PYHK
- win32gui
- pywinauto
- mouse

The three attributes of the HTML script tag explained! Prefer async

Reference: http://peter.sh/experiments/asynchronous-and-deferred-javascript-execution-explained/

Asynchronous and deferred JavaScript execution explained


The HTML <script> element allows you to define when the JavaScript code in your page should start executing. The “async” and “defer” attributes were added to WebKit early September. Firefox has been supporting them quite a while already. Does your browser support the attributes?

- Normal execution <script>
- This is the default behavior of the <script> element. Parsing of the HTML code pauses while the script is executing. For slow servers and heavy scripts this means that displaying the webpage will be delayed.

- Deferred execution <script defer>
- Simply put: delaying script execution until the HTML parser has finished. A positive effect of this attribute is that the DOM will be available for your script. However, since not every browser supports defer yet, don’t rely on it!

- Asynchronous execution <script async>
- Don’t care when the script will be available? Asynchronous is the best of both worlds: HTML parsing may continue and the script will be executed as soon as it’s ready. I’d recommend this for scripts such as Google Analytics.


Recapping the key points of the English original above, the script tag comes in three forms:
- <script>
- <script defer>
- <script async>

Three terms appear in the original article's timing diagram: parser | net | execution
- net is the period spent downloading the JavaScript file
- execution is the period spent executing the JavaScript file
- parser is the period the HTML parser spends parsing the page

As you can see, the async attribute is extremely useful in practice: the JavaScript file is downloaded while the parser keeps working. Prefer async!

Automating GitHub login with selenium

The automated GitHub login mainly uses the following tools:
selenium: a web automation tool, see http://www.seleniumhq.org/
Python's getpass module: prompts the user for a password

The sample code is on stayhigh's GitHub:
https://github.com/stayhigh/github-login-selenium/blob/master/github_login.py

What the code does:
The scenario uses redhero0702@gmail.com as the user name to log in to GitHub. If needed, change
github_account = "redhero0702@gmail.com"
to your own GitHub user name. For example, if your user name is your_github_account@gmail.com:
github_account = "your_github_account@gmail.com"


Monday, March 14, 2016

A quick guide and tutorial for Python Scrapy


# Install the Chrome plugin SelectorGadget to quickly grab selectors and XPath expressions - especially useful for web scraping
# Install the Firefox plugin SQLite Manager to inspect database contents: http://www.minwt.com/website/server/4964.html

#Create a new scrapy project named apple
stayhigh@stayhighnet:/Users/stayhigh/projects/apple  $  scrapy startproject apple

#Run the apple scrapy project
stayhigh@stayhighnet:/Users/stayhigh/projects/apple  $  scrapy crawl apple

#Run the apple scrapy project and write the output to a.json in JSON format
stayhigh@stayhighnet:/Users/stayhigh/projects/apple  $  scrapy crawl apple -o a.json -t json

#Run the crawl as a resumable job, keeping its state in the job1 directory
stayhigh@stayhighnet:/Users/stayhigh/projects/apple  $  scrapy crawl apple -s JOBDIR=job1

#The apple project's directory layout
- crawler.py is the user-defined spider; it subclasses scrapy.Spider to fetch pages
- items.py defines the data fields
- pipelines.py defines the post-processing pipeline for scraped items
- settings.py is the configuration file that enables features such as the pipeline; remember to reference apple.pipelines.ApplePipeline from pipelines.py when configuring it

ITEM_PIPELINES = {
    'apple.pipelines.ApplePipeline': 300,
}

stayhigh@stayhighnet:/Users/stayhigh/projects/apple  $ tree
.
├── a.json
├── apple
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── crawler.py
│       └── crawler.pyc
└── scrapy.cfg

#How to crawl across multiple pages
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# The spider class in crawler.py subclasses CrawlSpider instead of scrapy.Spider
class AppleCrawler(CrawlSpider):
    name = 'apple'
    # Rule objects tell the spider which links to follow and which callback parses them
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)