python 15－斑的家

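This chapter is about the web.

Suppose we want to extract data from a web page.

First we need to know its URL.

Then we work out which parts of the page are useful to us.

This process has a name: screen scraping.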

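We use the HTMLParser module to help us find

which tags contain the data we need.

As the parser reads the document, each part of a tag triggers an event.

The handlers are handle_starttag, handle_data, and handle_endtag.

These are methods the parser calls as it goes; we override them in a subclass.

They fire on a start tag, on the text enclosed by a tag, and on an end tag, respectively.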
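
To make the order of these calls visible, here is a minimal sketch that only records each event. The class name EventLogger and the sample HTML string are made up for illustration; the import falls back to the Python 2 module name used in these notes.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class EventLogger(HTMLParser):
    """Hypothetical helper: remember every handler call in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, attrs))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = EventLogger()
logger.feed('<a href="www.xxx.org">home</a>')
logger.close()
for event in logger.events:
    print(event)
```

The three recorded events come out in document order: the start tag with its attributes, the enclosed text, then the end tag.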

The example below comes from the book.

# Python 2 code: in Python 3 these imports live in urllib.request and html.parser.
from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):
    in_link = False  # True while we are inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # list of (name, value) pairs -> dict

        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []  # pieces of text seen inside this link
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a':
            if self.in_link:
                # Print the link text followed by its URL
                print '%s (%s)' % (''.join(self.chunks), self.url)
                self.in_link = False

text = urlopen('http://python.org/community/jobs').read()
parser = Scraper()
parser.feed(text)
parser.close()

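attrs gives us key/value pairs, for example href="www.xxx.org":

href is the key and the URL is the value.

The code above finds every link that has a URL and prints the link text followed by that URL.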
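
The same idea can be tried without a network connection. The sketch below is a variation, not the book's code: it collects the links into a list instead of printing them, and it parses a made-up HTML string. LinkCollector, collect_links, and SAMPLE are names invented here.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class LinkCollector(HTMLParser):
    """Hypothetical variant of the scraper: store (text, url) pairs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # e.g. {'href': 'http://www.xxx.org'}
        if tag == 'a' and 'href' in attrs:
            self.in_link = True
            self.chunks = []
            self.url = attrs['href']

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.links.append((''.join(self.chunks), self.url))
            self.in_link = False


def collect_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample input: one real link and one anchor without href.
SAMPLE = ('<p>See <a href="http://www.xxx.org">the <b>xxx</b> site</a> '
          'and <a name="top">no href</a>.</p>')
print(collect_links(SAMPLE))
```

Only the first anchor is reported, because the second has no href attribute. Its text is joined from several handle_data calls, which is why the chunks list is needed.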

Another handy module is Beautiful Soup.

It can also be used to parse HTML files.

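Next, CGI programs.

We are going to write a CGI program in Python.

First, start Apache.

Then put the program under cgi-bin.

On my machine that is /var/www/cgi-bin/

Remember to make the program executable,

for example chmod 755 xxx.py

The program must also begin with the line #! /usr/bin/python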
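
The two requirements above (a #! first line and the executable bit) can be checked from Python. This is only a sketch: is_cgi_ready is a hypothetical helper, and it works on a throwaway temporary file rather than a real cgi-bin directory.

```python
import os
import shutil
import stat
import tempfile


def is_cgi_ready(path):
    """Return True if the file starts with '#!' and its owner may execute it."""
    with open(path, 'rb') as handle:
        has_shebang = handle.read(2) == b'#!'
    owner_can_execute = bool(os.stat(path).st_mode & stat.S_IXUSR)
    return has_shebang and owner_can_execute


# Create a stand-in script in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'xxx.py')
with open(demo_path, 'w') as handle:
    handle.write("#! /usr/bin/python\nprint('Content-type: text/plain')\n")

before = is_cgi_ready(demo_path)   # freshly created files are not executable
os.chmod(demo_path, 0o755)         # same effect as: chmod 755 xxx.py
after = is_cgi_ready(demo_path)
print(before)
print(after)

shutil.rmtree(demo_dir)            # clean up the throwaway directory
```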

A basic skeleton:

#! /usr/bin/python

print 'Content-type: text/plain'
print  # an empty print emits the blank line that ends the headers
print 'Hello, world!'
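
What a CGI script writes to standard output is the header lines, then one empty line, then the body. A small sketch of that layout, where cgi_response is a hypothetical helper and not part of any library:

```python
def cgi_response(body, content_type='text/plain'):
    """Build CGI output: header line, blank line, then the body."""
    return 'Content-type: %s\n\n%s\n' % (content_type, body)


response = cgi_response('Hello, world!')

# The first empty line is where the server stops reading headers.
headers, body = response.split('\n\n', 1)
print(repr(headers))
print(repr(body))
```

Without the empty line the server cannot tell where the headers stop, and it typically reports an error instead of showing the page.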

Some web pages are interactive, for example ones with a form.

With the cgi module we can easily read

the values sent by GET or POST.

#! /usr/bin/python

import cgi
form = cgi.FieldStorage()  # key/value pairs the client submitted with the form
name = form.getvalue('name', 'world')  # value for the key 'name', or 'world' if it is missing

# Everything between the triple quotes is plain text.
# The blank line after the header ends the headers.
print """Content-type: text/html

<html>
<head>
<title>Greeting Page</title>
</head>
<body>
<h1>Hello, %s! </h1>

<form action='form.py'>
Change name <input type='text' name='name' />
<input type="submit" />
</form>

</body>
</html>
""" % name

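The "value or default" lookup can be tried outside a web server. The sketch below is written for Python 3 and swaps in urllib.parse.parse_qs for cgi.FieldStorage (the cgi module was removed in Python 3.13), so it is an approximation of what getvalue does, not the same code path. get_value and greeting_page are hypothetical names, and the html.escape call is an addition so that submitted markup is shown as text.

```python
from html import escape
from urllib.parse import parse_qs


def get_value(query_string, key, default):
    """Return the first value for key in a query string, or default."""
    values = parse_qs(query_string).get(key)
    return values[0] if values else default


def greeting_page(query_string):
    """Render the heading of the greeting page for a given query string."""
    name = get_value(query_string, 'name', 'world')
    # Escape the value so characters such as < are displayed, not interpreted.
    return '<h1>Hello, %s!</h1>' % escape(name)


print(greeting_page('name=Ban'))
print(greeting_page(''))
print(greeting_page('name=%3Cb%3E'))
```

The second call shows the default taking over when the form has not been submitted yet.
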
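Debugging a program that runs behind a web server is awkward: all we can do

is check whether what shows up on the page is right.

The cgitb module helps a lot here: when the script raises an exception, it displays the traceback on the page.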

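#! /usr/bin/python

import cgitb

cgitb.enable()  # show a formatted traceback in the browser on errors
print 'Content-type: text/html'
print  # blank line ends the headers
print 1/0  # deliberately raises ZeroDivisionError
print 'Hello World'  # never reached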
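
The idea behind cgitb can be imitated with the standard traceback module: catch the exception and send the traceback back as the page. This is a rough stand-in, not how cgitb is implemented, and run_and_report and broken_page are hypothetical names.

```python
import traceback


def run_and_report(func):
    """Call func; if it raises, return a plain-text page with the traceback."""
    try:
        return func()
    except Exception:
        return 'Content-type: text/plain\n\n' + traceback.format_exc()


def broken_page():
    # Same deliberate mistake as in the cgitb example.
    return 'Content-type: text/plain\n\n%s' % (1 / 0)


report = run_and_report(broken_page)
print(report)
```

The printed report names ZeroDivisionError and the line that raised it, which is the information otherwise hidden behind a blank page or a server error.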

lettice0913
