Tuesday, June 26, 2012

Making a Bishoujo Game Easily with Ren'Py

Ren'Py download: http://www.renpy.org/latest.html


First, before creating a New Project, take a look at the launcher's options screen.
The Text Editor and jEdit settings in the top half change which editor is used; we normally won't touch these, so you can ignore them for now. The part that matters is Projects Directory below, which chooses where new projects are created and which folder old projects are read from. Set it however you like; if you ever create a new project and can't find its folder afterwards, open this option and check whether it ended up somewhere else.

Now we can create our first project with Ren'Py. As mentioned last time, after you press New Project the program asks you to enter a name:
Any name will do as long as it's in English; the name has no connection to the project's later contents, so just pick something you'll recognize. I'll call mine Demo and press Enter, then pick any theme and color scheme. The program then jumps back to the main screen, except that the project name in the top-left corner is now the one we just created.
Let's look at the folder where projects live and see what's new. On my machine it's the same folder as the Ren'Py program itself:
A new Demo folder has appeared at the top left. The project name only affects the folder name; everything else, such as the game window's title, can be adjusted later. Open the Demo folder and see what's inside:
There is a game folder and a README.html. README.html is the user manual; you can edit it to suit your needs and ship it with the game later, or leave it alone, since the official template is already quite thorough, if only in English. Next, open the game folder:

What are these files? The .rpy files are script files; the files with the same names but different extensions are extra data jEdit uses when opening those scripts. If you ever need to update something, only the .rpy files matter. Now, back to the Ren'Py main program:

Click Launch first to see what Ren'Py offers out of the box:
A new window pops up. Its colors and button styles depend on the theme and color scheme you picked earlier, but the five buttons are always the same. Quit needs no introduction. Help opens the README.html mentioned before; yes, Ren'Py can call the browser directly to open an HTML file or a URL, something KiriKiri (krkr) can also do but NScripter cannot.
Let's look at Preferences first, in other words the config screen; the terminology differs, but everything here can also be adjusted later by editing the scripts:

Ta-da~~ a fully featured config screen, with plenty of shortcuts at the bottom right. After Choices decides whether, in skip mode, skipping continues or stops after you pick an option at a branch; Transitions toggles the scene-change effects on or off. Just have a look for now. Next is the Load screen; click Load Game at the bottom right. The Save Game button is grayed out because we haven't actually started a game yet:
A bare-bones screen, but functionally complete. Ren'Py has an auto-save feature whose files go on the Auto page, Quick holds the quick saves, and the rest should need no explanation. Press Return to go back to the title screen and hit Start Game to see the default in-game view:

What's the checkered pattern in the background? Anyone who has used a graphics editor will recognize it: it means nothing is there. Note the bottom-right corner of the dialogue box:
These are shortcut buttons the game sets up for you from the start. Auto means auto-advance, Q.Save is quick save, and Q.Load is quick load, so even if you know nothing about programming, Ren'Py is still very convenient.

Next, what the mouse buttons do. Left click advances to the next line, the middle button hides the dialogue box so you can see the background (an unusual design; apologies to those whose mouse has only two buttons), and right click jumps straight to the Save screen:
Pick any slot and save~
Each slot stores a screenshot and the save time. As for capturing the current line of dialogue into the slot, I unfortunately haven't found a way to do it; krkr makes that easier. Also, unlike krkr, whose "previous line" shows a backlog of past text, Ren'Py rolls the whole game back to the previous statement, replaying sound and effects as well. Try it yourself once you have a longer stretch of dialogue.

Since this game has nothing in it yet, close it and return to the Ren'Py main program. Time to look at jEdit! Press Edit Script. Being a Java program, it takes a little while on first launch; don't click it repeatedly in a hurry or you'll end up with a pile of jEdit windows~
A rather advanced, colorful editor. One basic but crucial concept first: in a Ren'Py script, the whitespace at the start of each line matters. Look at the two lines under label start:. Blank lines don't matter, but notice that each e "... ..." line starts a few columns in; those leading spaces must not be removed, because Ren'Py uses each line's starting column to decide which lines belong to the same block. We'll go over this again when we cover the syntax in detail; for now, just know this is what a fresh script looks like.
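For reference, here is a minimal sketch of what the generated script.rpy looks like (the character name and lines are illustrative rather than your exact generated file, but the indentation structure is the point):

define e = Character('Eileen')

label start:

    e "You've created a new Ren'Py game."

    e "Once you add a story, pictures, and music, you can release it to the world!"

    return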

Finally there's one thorny problem left to deal with: Chinese. This being Taiwan, how could we not use Chinese? But try replacing the English inside e "... ..." with Chinese characters and you'll find it simply won't display, because Ren'Py doesn't ship with a Chinese font, so of course it can't render one.

So what now, make an all-English game!? Of course not. First find a Chinese font file (extension .ttf), whichever one you like, and drop it into the game folder under Demo. I used MSJHBD.TTF.
I forget exactly which font that is; the point is that it's a Traditional Chinese font. Back in jEdit, open the options.rpy tab and find lines 148 and 152:
Change them to the following:
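The original screenshot is gone, so purely as an illustration (assuming the theme-based options.rpy Ren'Py shipped around this version; your line numbers and exact style names may differ), each of the two lines amounts to pointing a font setting at the .ttf you just dropped into game/:

    ## before: style.default.font = "DejaVuSans.ttf"
    style.default.font = "MSJHBD.TTF"

Line 152 gets the same treatment, with its font value likewise swapped to "MSJHBD.TTF".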
Remember what I said earlier: watch the whitespace in front of these lines. Not a single space more or less, or the program will report an error. In jEdit, Ctrl+S is save and Ctrl+Z is undo, basically the same as Word, and as long as you don't close jEdit you can keep undoing all the way back to the state the file was first opened in, which is quite handy. Once done, run the project again~

Yay~~ the Chinese shows up!!!
The advantage of bundling your own font file is that Ren'Py doesn't care whether the current OS has a Traditional Chinese font installed. Even if you take the whole program to a Japanese Windows machine, it still displays Traditional Chinese. That matters a great deal for a Unicode program, and it saves players in other locales from switching system language just to run it.

Last of all, what if English really isn't your thing? Go to the official site: http://renpy.org/wiki/renpy/doc/translations/Translations lists many languages. How do we use it?
Being in Taiwan, we naturally pick Traditional Chinese~
Copy the code from that page, go back to jEdit, and create a new buffer via File → New:

Then save the file in the same folder as the scripts, under the filename translations.rpy:
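The file copied from the wiki is the authoritative version; purely as an illustrative sketch, an old-style (pre-Ren'Py 6.15) translation file is essentially a dictionary mapping the interface's English strings to translated ones, along these lines:

init python:
    config.translations = {
        u"Start Game": u"開始遊戲",
        u"Load Game": u"讀取遊戲",
        u"Preferences": u"設定",
        u"Quit": u"離開",
    }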

Go back to the options.rpy tab and add one line, as shown:
With all that done, run the project again and see what it looks like:

It's all in Chinese now!!! Even the confirmation box!!

If you're unhappy with a translation, just open translations.rpy and edit it. Following the official method you can even turn the whole launcher into a Traditional Chinese version, but I think that carries some risk, since locale issues can still crop up, and it turns all the English buttons Chinese as well. I prefer deciding per project whether to apply the translations; some buttons are better left in English anyway. That's it for this week; next week we start on the script syntax.

Monday, June 25, 2012

Handling HTTP Basic authentication simply with Python


Introduction

This tutorial aims to explain and illustrate what basic authentication is, and how to deal with it from Python. You can download the code from this tutorial from the Voidspace Python Recipebook.
The first example, So Let's Do It, shows how to do it manually. This illustrates how authentication works.
The second example, Doing it Properly, shows how to handle it automatically - with a handler.
These examples make use of the Python module urllib2. It provides a simple interface for fetching pages across the internet: the urlopen function. It also provides a more complex interface for specialised situations in the form of openers and handlers. These are often confusing even to intermediate-level programmers. For a good introduction to using urllib2, read my urllib2 tutorial.

Basic Authentication

There is a system for requiring a username/password before a client can visit a webpage. This is called authentication and is implemented on the server. It allows a whole set of pages (called a realm) to be protected by authentication.
These schemes are defined by the HTTP spec, and so whilst Python supports authentication, it doesn't document it very well. HTTP documentation comes in the form of RFCs [1], which are technical documents and so not the most readable.
The two normal [2] authentication schemes are basic and digest authentication. Between these two, basic is overwhelmingly the most common. As you might guess, it is also the simpler of the two.
A summary of basic authentication goes like this:
  • client makes a request for a webpage
  • server responds with an error, requesting authentication
  • client retries request - with authentication details encoded in request
  • server checks details and sends the page requested, or another error
The following sections cover these steps in more detail.

Making a Request

A client is any program that makes requests over the internet. It could be a browser - or it could be a python program. When a client asks for a web page, it is sending a request to a server. The request is made up of headers with information about the request. These are the 'http request headers'.

Getting A Response

When the request reaches the server it sends a response back. The request may still fail (the page may not be found for example), but the response will still contain headers from the server. These are 'http response headers'.
If there is a problem then this response will include an error code that describes the problem. You will already be familiar with some of these codes - 404: Page Not Found, 500: Internal Server Error, etc. If this happens, an exception [3] will be raised by urllib2, and it will have a 'code' attribute. The code attribute is an integer that corresponds to the http error code [4].

Error 401 and realms

If a page requires authentication then the error code is 401. Included in the response headers will be a 'WWW-authenticate' header. This tells us the authentication scheme the server is using for this page and also something called a realm. It is rarely just a single page that is protected by authentication but a section - a 'realm' of a website. The name of the realm is included in this header line.
The 'WWW-Authenticate' header line looks like WWW-Authenticate: SCHEME realm="REALM".
For example, if you try to access the popular website admin application cPanel your browser will be sent a header that looks like : WWW-Authenticate: Basic realm="cPanel"
If the client already knows the username/password for this realm then it can encode them into the request headers and try again. If the username/password combination is correct, then the request will succeed as normal. If the client doesn't know the username/password it should ask the user. This means that if you enter a protected 'realm' the client effectively has to request each page twice. The first time it will get an error code and be told what realm it is attempting to access - the client can then get the right username/password for that realm (on that server) and repeat the request.
HTTP is a 'stateless' protocol. This means that a server using basic authentication won't 'remember' you are logged in [5] and will need to be sent the right header for every protected page you attempt to access.

First Example

Suppose we attempt to fetch a webpage protected by basic authentication:
import urllib2

theurl = 'http://www.someserver.com/somepath/someprotectedpage.html'
req = urllib2.Request(theurl)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    if hasattr(e, 'code'):
        if e.code != 401:
            print 'We got another error'
            print e.code
        else:
            print e.headers
            print e.headers['www-authenticate']
Note
If the exception has a 'code' attribute it also has an attribute called 'headers'. This is a dictionary-like object with all the headers in - but you can also print it to display all the headers. See the last line, which displays the 'www-authenticate' header line that ought to be present whenever you get a 401 error.
Typical output from the example above looks like:
WWW-Authenticate: Basic realm="cPanel"
Connection: close
Set-Cookie: cprelogin=no; path=/
Server: cpsrvd/9.4.2

Content-type: text/html

Basic realm="cPanel"
You can see the authentication scheme and the 'realm' part of the 'www-authenticate' header. Assuming you know the username and password you can then navigate around that website - whenever you get a 401 error with the same realm you can just encode the username/password into your request headers and your request should succeed.

The Username/Password

Let's assume you need to access pages which are all in the same realm. Assuming you have got the username and password from the user, you can extract the realm from the header. Then whenever you get a 401 error in the same realm you know the username and password to use. So the only detail left is knowing how to encode the username/password into the request header. This is done by encoding it as a base64 string. It doesn't actually look like clear text - but it is only the vaguest of 'encryption'. This means basic authentication is just that - basic. Anyone sniffing your traffic who sees an authentication request header will be able to extract your username and password from it. Many websites, like Yahoo or eBay, use JavaScript hashing/encryption and other tricks to authenticate a login. This is much harder to detect and mimic from Python! You may need to use a proxy server and see what information your browser is actually sending to the website [6].

base64

There is a very simple base64 recipe over on the ActiveState Python Cookbook (it's actually in the comments of that page). It shows how to encode a username/password into a request header. It goes like this:
import base64
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
req.add_header("Authorization", "Basic %s" % base64string)
Where req is our request object like in the first example.
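To see what this produces, here is what the encoding looks like with the example credentials used in the next example (the [:-1] strips the trailing newline that encodestring appends):

>>> import base64
>>> base64.encodestring('johnny:XXXXXX')[:-1]
'am9obm55OlhYWFhYWA=='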

So Let's Do It

Let's wrap all this up with an example that shows accessing a page, extracting the realm, then doing the authentication. We'll use a regular expression to pull the scheme and realm out of the response header:
import urllib2
import sys
import re
import base64
from urlparse import urlparse

theurl = 'http://www.someserver.com/somepath/somepage.html'
# if you want to run this example you'll need to supply
# a protected page with your username and password

username = 'johnny'
password = 'XXXXXX'            # a very bad password

req = urllib2.Request(theurl)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    # here we *want* to fail
    pass
else:
    # If we don't fail then the page isn't protected
    print "This page isn't protected by authentication."
    sys.exit(1)

if not hasattr(e, 'code') or e.code != 401:
    # we got an error - but not a 401 error
    print "This page isn't protected by authentication."
    print 'But we failed for another reason.'
    sys.exit(1)

authline = e.headers['www-authenticate']
# this gets the www-authenticate line from the headers
# which has the authentication scheme and realm in it


authobj = re.compile(
    r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]''',
    re.IGNORECASE)
# this regular expression is used to extract scheme and realm
matchobj = authobj.match(authline)

if not matchobj:
    # if the authline isn't matched by the regular expression
    # then something is wrong
    print 'The authentication header is badly formed.'
    print authline
    sys.exit(1)

scheme = matchobj.group(1)
realm = matchobj.group(2)
# here we've extracted the scheme
# and the realm from the header
if scheme.lower() != 'basic':
    print 'This example only works with BASIC authentication.'
    sys.exit(1)

base64string = base64.encodestring(
                '%s:%s' % (username, password))[:-1]
authheader = "Basic %s" % base64string
req.add_header("Authorization", authheader)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    # here we shouldn't fail if the username/password is right
    print "It looks like the username or password is wrong."
    sys.exit(1)
thepage = handle.read()
When the code has run, the contents of the page we've fetched are saved as a string in the variable 'thepage'. Note that the regular expression matches the realm with [^'"]+ rather than the more obvious \w+; a pattern using \w+ doesn't work where there is a space in the realm. This gives us the regular expression:
r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]'''
Warning
If you are writing an HTTP client of any sort that has to deal with basic authentication, don't do it this way. The next example, which uses a handler, is the right way of doing it.

Doing it Properly

In actual fact the proper way to do BASIC authentication with Python is to install an opener that uses an authentication handler. The authentication handler needs a password manager - and then you're away.
Every time you use urlopen you are using handlers to deal with your request - whether you know it or not. The default opener has handlers for all the standard situations installed [7]. What we need to do is create an opener that has a handler that can deal with basic authentication. The right handler for our needs is called urllib2.HTTPBasicAuthHandler. As I mentioned, it also needs a password manager - urllib2.HTTPPasswordMgr.
Unfortunately our friend HTTPPasswordMgr has a slight problem - you must already know the realm you're fetching. Luckily it has a near cousin HTTPPasswordMgrWithDefaultRealm. Despite the keyboard busting name, it's a bit more friendly to use. If you don't know the name of the realm - then pass in None for the realm, and it will try the username and password you give it - whatever the realm. Seeing as you are going to specify a specific URL, it is likely that this will be sufficient. If you aren't convinced then you can always use HTTPPasswordMgr and extract the realm from the authentication header the first time you meet it.
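If you do know the realm up front (from a previous 401 response, say), a plain HTTPPasswordMgr works too; a minimal sketch, reusing the 'cPanel' realm from the earlier example:

passman = urllib2.HTTPPasswordMgr()
passman.add_password('cPanel', theurl, username, password)
# this manager only offers the password for that exact realm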
This example goes through the following steps:
  • establish the top-level URL, username and password
  • create our password manager (with default realm)
  • give the password to the manager
  • create the handler with the manager
  • create an opener with the handler installed
At this point we have a choice. We can either use the open method of the opener directly. This leaves urllib2.urlopen using the default opener. Alternatively we can make our opener the default one. This means all future calls to urlopen will use this opener. As all openers have the default handlers installed as well as the ones you pass in, it shouldn't break urlopen to do this. In the example below we install it, making it the default opener:
import urllib2

theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
# a great password

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, theurl, username, password)
# because we have put None at the start it will always
# use this username/password combination for urls
# for which `theurl` is a super-url

authhandler = urllib2.HTTPBasicAuthHandler(passman)
# create the AuthHandler

opener = urllib2.build_opener(authhandler)

urllib2.install_opener(opener)
# All calls to urllib2.urlopen will now use our handler
# Make sure not to include the protocol in with the URL, or
# HTTPPasswordMgrWithDefaultRealm will be very confused.
# You must (of course) use it when fetching the page though.

pagehandle = urllib2.urlopen(theurl)
# authentication is now handled automatically for us
Hurrah - not so bad, hey?
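Alternatively, if you'd rather not install the opener as the global default, use its open method directly and leave urlopen untouched:

opener = urllib2.build_opener(authhandler)
pagehandle = opener.open(theurl)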

A Word About Cookies

Some websites may also use cookies alongside authentication. Luckily there is a library that will allow you to have automatic cookie management without having to think about it. This is ClientCookie. In Python 2.4 it becomes part of the Python standard library as cookielib. See my article on cookielib for an example of how to use it.
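A minimal sketch of combining cookie handling with the authentication handler from the example above (assuming Python 2.4+, where cookielib is in the standard library):

import urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), authhandler)
urllib2.install_opener(opener)
# urlopen now sends our credentials and keeps any cookies the server sets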

Footnotes

[1]http://www.faqs.org/rfcs/rfc2617.html is the RFC that describes basic and digest authentication
[2]There is also a M$ proprietary authentication scheme called NTLM, but it's usually found on intranets - I've never had to deal with it live on the web.
[3]An HTTPError, which is a subclass of IOError
[4]See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for a full list of error codes
[5]Or at least state management is a separate subject. Using cookies the server may well have details of your session - but you will still need to authenticate each request.
[6]See this comp.lang.python thread for suggestions of several proxy servers that can do this.
[7]See the urllib2 tutorial for a slightly more detailed discussion of openers and handlers.

Python: a practical picture-scraping script


# -*- coding: utf-8 -*-
import re
import time
import urllib2

# set the target webpage
target = 'http://www.twbbs.net.tw/760449.html'
content = urllib2.urlopen(target).read()

# headers that simulate a normal browser visit
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer':'http://www.twbbs.net.tw/760449.html'
}


# collect every src attribute on the page into the urls list
urls = re.findall(r'src=[\'"]?([^\'" >]+)', content)
picurls = []


# keep only the picture links we want, judged by host and extension
for url in urls:
    if url.startswith('http://tinypic.com/') and (url.endswith('jpg') or url.endswith('png') or url.endswith('gif')):
        picurls.append(url)

# start the scraping process
for idx,eachpic in enumerate(picurls):
    req = urllib2.Request(
        url = str(eachpic),
        headers = headers
    )
    result = urllib2.urlopen(req).read()
    # the file must be opened in binary mode ("wb")
    picf = open("D:\\onepiecePics\\"+str(idx+1)+'.jpg',"wb")
    picf.write(result)   # write the bytes we already fetched, not a second download
    picf.close()
    print str(idx+1)+'.jpg picture Saved.'
    # pause between requests to avoid looking like a robot;
    # a random delay each iteration would be even better
    time.sleep(1)
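One caveat: open() won't create the output folder for you, so make sure D:\onepiecePics exists before running the loop, for example:

import os
if not os.path.isdir('D:\\onepiecePics'):
    os.makedirs('D:\\onepiecePics')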

Scraping web pages and images with Python


Source: http://gae-django-cms.appspot.com/
These scripts all have one thing in common: they are web-related and always need some way of fetching links. Together with simplecd, a half-crawler, half-website project, that has given me a fair amount of crawling and scraping experience, which I'll sum up here so that future projects don't mean repeating the same work.
-
1. The most basic page fetch
import urllib2
content = urllib2.urlopen('http://XXXX').read()
-
2. Using a proxy server
This is useful in certain situations, for example when your IP has been banned, or when the number of visits per IP is limited, and so on.
import urllib2
proxy_support = urllib2.ProxyHandler({'http':'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
-
3. When login is required
Login is rather more troublesome, so let me break the problem down:
-
3.1 Handling cookies
import urllib2, cookielib
cookie_support= urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
That's right - if you want to use a proxy and cookies at the same time, just add proxy_support and change the opener to
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
-
3.2 Handling forms
To log in you have to fill in a form. How? First use a tool to capture the content you're actually submitting.
For instance, I usually use Firefox with the HttpFox extension to see exactly what I'm sending.
Let me just give an example, using VeryCD: first find the POST request you send, and the POST form fields:
You can see that VeryCD requires username, password, continueURI, fk and login_submit. Of these, fk is randomly generated (not very randomly, actually; it looks like the epoch time run through some simple encoding) and has to be obtained from the page, which means you must visit the page once first and use a regular expression or similar to extract the fk value from the returned data, as sketched below. continueURI, as the name suggests, can be anything; login_submit is fixed, as the page source shows; and username and password are obvious.
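A rough sketch of that first visit (the pattern is purely illustrative; the real one depends on what VeryCD's page markup actually looked like):

import re
import urllib2

# fetch the login page once, then pull the fk value out of it
login_page = urllib2.urlopen('http://www.verycd.com/').read()
fk = re.search(r'''fk['"]?\s*[:=]\s*['"]([^'"]+)['"]''', login_page).group(1)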
-
Good - with the data to fill in, we can now generate the postdata:
import urllib
postdata=urllib.urlencode({
    'username':'XXXXX',
    'password':'XXXXX',
    'continueURI':'http://www.verycd.com/',
    'fk':fk,
    'login_submit':'登錄'
    })
-
Then build the HTTP request and send it:
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata)
result = urllib2.urlopen(req).read()
-
3.3 Disguising yourself as a browser
Some sites resent being visited by crawlers and refuse all their requests.
At that point we need to disguise ourselves as a browser, which can be done by modifying the headers in the HTTP packet:
#…
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...
-
3.4 Beating the anti-leech check
Some sites have so-called anti-leech protection, which is actually very simple once spelled out: they check whether the Referer in the headers of your request is their own site. So, just as in 3.3, we only need to set the Referer in the headers to that site. Take cnBeta, famed for its exposés, as an example:
#...
headers = {
    'Referer':'http://www.cnbeta.com/articles'
}
#...
headers is a dict, so you can put in any header you like as camouflage. For example, some too-clever-by-half sites love prying into people's privacy: when someone visits through a proxy, they insist on reading X-Forwarded-For from the headers to learn the visitor's real IP. Well then, just change X-Forwarded-For to something amusing and tease them a little, heh, as below.
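For instance (the address can of course be anything you fancy):

#...
headers = {
    'X-Forwarded-For': '8.8.8.8'  # any IP you like
}
#...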
-
3.5 The ultimate trick
Sometimes even after doing 3.1-3.4 your requests are still rejected. Then there's nothing for it but to dutifully copy in every header you saw in HttpFox; that usually does it.
If even that fails, you can only resort to the ultimate trick: selenium, which drives the browser directly to make the visit. Whatever the browser can do, this can do. Similar tools include pamie, watir, and so on and so forth.
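A minimal sketch, assuming the selenium Python bindings and Firefox are installed:

from selenium import webdriver

driver = webdriver.Firefox()          # a real browser window opens
driver.get('http://www.example.com')  # the site sees a genuine browser
html = driver.page_source             # the rendered page, scripts and all
driver.quit()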
-
4. Concurrent fetching with multiple threads
If a single thread is too slow, you need multiple threads. Here's a simple thread-pool template.
The program merely prints the numbers 0 through 9, but you can see that it does so concurrently.
from threading import Thread
from Queue import Queue
from time import sleep
# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many tasks there are
q = Queue()
NUM = 2
JOBS = 10
# the actual handler, responsible for processing a single task
def do_something_using(arguments):
    print arguments
# the worker thread, which keeps pulling tasks off the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()
# spawn NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
# enqueue the JOBS tasks
for i in range(JOBS):
    q.put(i)
# wait for all the jobs to finish
q.join()
5. Dealing with CAPTCHAs
What do you do when you run into a CAPTCHA? Two cases:
-
1. CAPTCHAs of the Google kind: hopeless, nothing to be done.
-
2. Simple CAPTCHAs: a limited number of characters, using only simple translation or rotation plus noise, without distortion. These can still potentially be handled. The general idea is to rotate the characters back, remove the noise, and segment the individual characters; once segmented, use a feature-extraction method (PCA, for example) to reduce dimensionality and build a feature library, then compare each CAPTCHA against the library. This is fairly involved and more than one blog post can cover, so I won't expand on it here; get a relevant textbook and study it properly.
-
3. In fact some CAPTCHAs are still quite weak; I won't name names, but I've extracted CAPTCHAs with very high accuracy using the method in point 2, so point 2 really is feasible.
-
6. Summary
Basically, every situation I've encountered has been solved smoothly with the methods above.