Archive for the 'Python' Category

Moin wiki farm

The setup has two parts:

  1. Steps 0-2: all of them are required.
  2. Steps 3, 4 and 5: web server configuration; you only need one of them.

Everything is installed under $HOME, so no root privileges are needed.


step 0. install/upgrade Moin to 1.9.2 (when upgrading, I recommend deleting the old release first; I didn't do this before, and it caused a big problem)

A fresh install is not covered here.

This is the upgrade from 1.8.3 to 1.9.2:

python setup.py install

gavin_kou@shadow:~/downloads/python/moin-1.9.2$ ll /home/gavin_kou/local/lib/python2.5/site-packages/moin-* -d
drwxrwxr-x 6 gavin_kou pg2184500 4096 2009-06-03 02:32 /home/gavin_kou/local/lib/python2.5/site-packages/moin-1.8.3-py2.5.egg
-rw-rw-r-- 1 gavin_kou pg2184500 3183 2010-03-10 02:26 /home/gavin_kou/local/lib/python2.5/site-packages/moin-1.9.2-py2.5.egg-info

>>> import MoinMoin.version

>>> MoinMoin.version.release
'1.9.2'

There is a special directory, ~/local/share/moin/, which contains all of the wiki initialization data: the data and underlay directories, plus some files in the server directory (e.g. moin.fcgi is the FastCGI server script and moin.cgi is the CGI script) and in the config directory (e.g. config/wikifarm/farmconfig.py is the sample wiki farm configuration):
gavin_kou@shadow:~/sites/wiki/bin$ ll ~/local/share/moin/
total 16
drwxrwxr-x 5 gavin_kou pg2184500 4096 2010-03-10 02:26 config
drwxrwxr-x 7 gavin_kou pg2184500 4096 2010-03-10 02:26 data
drwxrwxr-x 2 gavin_kou pg2184500 4096 2010-03-10 02:26 server
drwxrwxr-x 3 gavin_kou pg2184500 4096 2010-03-10 02:26 underlay

step 1. design the directory structure. All the settings are based on this structure; if you change the structure,
DO NOT forget to change the corresponding configuration files.

wiki/
├─bin/
│ ├─mointwisted
│ ├─mointwisted.py
│ ├─moin.fcgi
│ ├─moin.cgi
│ └─moin
├─config/
│ ├─farmconfig.py
│ ├─hackgou.py
│ └─hiking.py
├─data/
│ ├─hackgou/
│ │ ├─data/
│ │ └─underlay/
│ ├─hiking/
│ │ ├─data/
│ │ └─underlay/
│ └─user/
└─static/
  └─htdocs/

and create it with:

mkdir -p wiki/{bin,config,data/hackgou,data/hiking,static}

step 1 (continued). copy the data (the data and underlay directories):
cp -rp ~/local/share/moin/data ~/local/share/moin/underlay wiki/data/hackgou
cp -rp ~/local/share/moin/data ~/local/share/moin/underlay wiki/data/hiking

step 2. copy ~/local/share/moin/config/wikifarm/farmconfig.py and ~/local/share/moin/config/wikifarm/mywiki.py to your config directory (mywiki.py is copied once per wiki):

cp ~/local/share/moin/config/wikifarm/farmconfig.py wiki/config/
cp ~/local/share/moin/config/wikifarm/mywiki.py wiki/config/hackgou.py
cp ~/local/share/moin/config/wikifarm/mywiki.py wiki/config/hiking.py

Then edit the wikis list in farmconfig.py:

wikis = [
("hiking", r'^http://hackgou.itbbq.com/wiki/hiking.*$'),
("hackgou", r'^http://hackgou.itbbq.com/.*$'),
]
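
The order of the entries matters here, on the assumption that MoinMoin checks the patterns in list order and uses the first one that matches, so the more specific hiking pattern has to come before the catch-all hackgou one. A minimal sketch of that lookup, using only the standard re module; pick_wiki is a hypothetical helper for illustration, not MoinMoin's own function:

```python
import re

# Same list as in farmconfig.py above: (wiki name, URL regex).
wikis = [
    ("hiking", r'^http://hackgou.itbbq.com/wiki/hiking.*$'),
    ("hackgou", r'^http://hackgou.itbbq.com/.*$'),
]


def pick_wiki(url, wiki_list=wikis):
    """Return the name of the first wiki whose regex matches url, or None.

    Hypothetical helper: it only imitates a first-match-wins lookup.
    """
    for name, pattern in wiki_list:
        if re.match(pattern, url):
            return name
    return None


if __name__ == "__main__":
    print(pick_wiki("http://hackgou.itbbq.com/wiki/hiking/FrontPage"))
    print(pick_wiki("http://hackgou.itbbq.com/wiki/FrontPage"))
    # With the order reversed, the catch-all pattern shadows "hiking".
    print(pick_wiki("http://hackgou.itbbq.com/wiki/hiking/FrontPage",
                    list(reversed(wikis))))
```

If the two entries were swapped, every hiking URL would be served by the hackgou wiki, which is an easy mistake to make when adding a new wiki to the farm.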

Add the following code to both hackgou.py and hiking.py:

import os
app_root = os.path.realpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '..'))
data_root = os.path.join(app_root, 'data')
data_dir = os.path.join(data_root, __name__, 'data')
data_underlay_dir = os.path.join(data_root, __name__, 'underlay')
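
To see what these lines resolve to, here is a small sketch that computes the same paths for a given config file location. The function wiki_paths and the example path /home/user/sites/wiki are made up for illustration; in the real config files __file__ and __name__ supply those values.

```python
import os


def wiki_paths(config_file, wiki_name):
    """Mirror the path logic above for a config file and wiki name.

    Hypothetical helper: config_file plays the role of __file__ and
    wiki_name the role of __name__ (the module name, e.g. "hackgou").
    """
    app_root = os.path.realpath(
        os.path.join(os.path.dirname(os.path.realpath(config_file)), '..'))
    data_root = os.path.join(app_root, 'data')
    return {
        'app_root': app_root,
        'data_dir': os.path.join(data_root, wiki_name, 'data'),
        'data_underlay_dir': os.path.join(data_root, wiki_name, 'underlay'),
    }


if __name__ == "__main__":
    # Example location only; substitute the real path of your config file.
    paths = wiki_paths('/home/user/sites/wiki/config/hackgou.py', 'hackgou')
    for key in sorted(paths):
        print(key, '=', paths[key])
```

Because the wiki name comes from the module name, the same snippet works unchanged in every per-wiki config file, as long as the file name matches the directory name under data/.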

step 3. configure mointwisted

Add the following lines to mointwisted.py:

import sys, os
app_root = os.path.realpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '..'))
sys.path.insert(0, os.path.join(app_root, 'config'))

step 4. configure WSGI

cp ~/local/share/moin/server/moin.wsgi bin/

Change the following code:

app_root = os.path.realpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '..'))
sys.path.insert(0, os.path.join(app_root, 'config'))

application = make_application(shared=os.path.join(app_root, 'static', 'htdocs'))

step 5. configure FCGI; two files are involved: moin.fcgi and .htaccess
Copy ~/local/share/moin/server/moin.fcgi and edit it.

Change the following lines:

import sys, os

sys.path.insert(0, '/home/gavin_kou/local/lib/python2.5/site-packages')

app_root = os.path.realpath(os.path.join(os.path.dirname(os.path.realpath(__file__)), '..', 'wiki'))  # based on the directory structure above
sys.path.insert(0, os.path.join(app_root, 'config'))

from MoinMoin import log
# Enable logging for troubleshooting. log.conf is the logger config file;
# if you enable it, please make sure the file is correct.
log.load_config(os.path.join(app_root, 'config', 'log.conf'))
logging = log.getLogger(__name__)

from MoinMoin.web.serving import make_application
app = make_application(shared=os.path.join(app_root, 'static', 'htdocs'))  # <-- adapt to the directory structure above

fix_script_name = '/wiki'

Add the following lines to .htaccess:

AddHandler fastcgi-script .fcgi
RewriteRule wiki(/?.*) /moin.fcgi/$1 [L]

Other advanced settings:

user_dir = os.path.join(app_root, 'data', 'user')
# Session data is stored in cache_dir/__session__, so wikis that
# share the cache_dir also share the login information (SSO).
cache_dir = os.path.join(app_root, 'var', 'cache')

Crawler collection

Python-based crawlers:

  1. Atomisator: http://atomisator.ziade.org/ to build custom RSS feeds
  2. Orchid: http://pypi.python.org/pypi/Orchid/1.1
    Orchid is a python crawler I developed for one of my graduate courses. It is a generic multi-threaded web crawler complete with documentation. We used this crawler to locate web pages which contained malicious code. However, the logic of what to do with the crawled pages is implemented in a separate class and therefore Orchid can easily be used for any application which requires crawling the web
  3. Ruya: http://pypi.python.org/pypi/Ruya/1.0
    Ruya is a Python-based breadth-first, level-, delayed, event-based-crawler for crawling English, Japanese websites. It is targeted solely towards developers who want crawling functionality in their projects using API, and crawl control
  4. HarvestMan:
    HarvestMan (with a capital ‘H’ and a capital ‘M’) is a webcrawler program. HarvestMan belongs to a family of
    programs frequently addressed as webcrawlers, webbots, web-robots, offline browsers etc.
    These programs are used to crawl a distributed network of computers like the Internet and download files locally
    1. http://code.google.com/p/harvestman-crawler/
    2. http://www.harvestmanontheweb.com/
  5. Scrapy: http://dev.scrapy.org/
    Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
    Even though Scrapy was originally designed for screen scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
    The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.
  6. Webstemmer : http://www.unixuser.org/~euske/python/webstemmer/index.html
    Webstemmer is a web crawler and HTML layout analyzer that automatically extracts main text of a news site without having banners, ads and/or navigation links mixed up

Other crawlers:

  1. Droids: http://incubator.apache.org/droids/
  2. Heritrix: http://crawler.archive.org/articles/user_manual/creating.html


Questioning once more how Django + mod_python handles environment variables

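Today, while upgrading a Django-based system to V2,
I found that ~/ was not expanded to /home/hackgou (Apache runs as the hackgou account),
nor to /root (Apache is started by root). That seemed very strange;
even os.path.expanduser('~/') did not help. So I suspected that os.environ['HOME'] was wrong,
because expanduser relies on that variable to expand ~/. I then tried SetEnv HOME /home/hackgou/, but that did not work either.
Later, in ModPythonHandler in django/core/handlers/modpython.py, I saw that the Django developers had already noticed
that mod_python ignores Apache's SetEnv directive: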
# mod_python fakes the environ, and thus doesn’t process SetEnv. This fixes that
os.environ.update(req.subprocess_env)

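But this contradicts SetEnv DJANGO_SETTINGS_MODULE app.settings, which is baffling:
it does not set a correct value itself, nor does it honor what the administrator specifies. What to do?
Later I found some similar cases at http://code.google.com/p/modwsgi/wiki/ApplicationIssues,
which mentions a bug with sudo that makes HOME differ from the HOME you get when root starts Apache. Great, that is exactly the bug I want.
I gave the hackgou account sudo rights and started Apache with sudo, and sure enough every expanduser('~/') worked.
Besides that, the page mentions code like
import os, pwd
os.environ["HOME"] = pwd.getpwuid(os.getuid()).pw_dir
which seems able to solve the problem. But I really dislike this approach:
1. Django does not seem to have a place for code that the whole app needs; so much for DRY.
2. Modifying environment variables seems to have become a habit for Django. I previously hit a TIME_ZONE problem caused precisely by Django modifying Apache's environment variables,
which polluted the environment of other applications. Remember that one Apache process is shared by many applications: besides this Django app,
other applications, such as other Django apps or PHP, may be handled in the same Apache process space
(I say "may" because this depends on Apache's own settings and has nothing to do with Python), so their working environment gets polluted.
There seems to be no good way to solve this, and I am quite unhappy with how sloppily Django, or rather mod_python, handles it.
I don't know whether there is a better, more thorough solution.
If anyone knows, I would be most grateful to hear it.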
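
For reference, the workaround quoted above can at least be wrapped so that it is scoped and reversible. This is only a sketch for a POSIX system (the pwd module is Unix-only); home_from_passwd and patched_home are names invented here, and since os.environ is process-wide, other code running in the same process still sees the temporary value while the block is active.

```python
import contextlib
import os
import pwd


def home_from_passwd():
    """Look up the current user's home directory in the passwd database."""
    return pwd.getpwuid(os.getuid()).pw_dir


@contextlib.contextmanager
def patched_home():
    """Temporarily point HOME at the passwd entry, then restore it."""
    saved = os.environ.get("HOME")
    os.environ["HOME"] = home_from_passwd()
    try:
        yield os.environ["HOME"]
    finally:
        # Put back whatever was there before, including "not set at all".
        if saved is None:
            os.environ.pop("HOME", None)
        else:
            os.environ["HOME"] = saved


if __name__ == "__main__":
    with patched_home():
        print(os.path.expanduser("~/"))
```

This does not remove the objection about polluting a shared process; it only limits how long the modified value is visible.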

from Crypto import Util fails

Conch in Twisted uses Crypto.
Crypto is a module from pycrypto: http://www.amk.ca/python/code/crypto
It gives Python support for many cryptographic algorithms:
* Hash functions: MD2, MD4, RIPEMD, SHA256.
* Block encryption algorithms: AES, ARC2, Blowfish, CAST, DES, Triple-DES, IDEA, RC5.
* Stream encryption algorithms: ARC4, simple XOR.
* Public-key algorithms: RSA, DSA, ElGamal, qNEW.
* Protocols: All-or-nothing transforms, chaffing/winnowing.
* Miscellaneous: RFC1751 module for converting 128-bit keys into a set of English words, primality testing.
* Some demo programs (currently all quite old and outdated).
Many of these used to be subject to US export restrictions, but times have changed. With easy_install they are easy to install and use.
I had this same problem the first time I used it. I solved it then and promptly forgot about it; today I ran into it again and had to think for a long time before I remembered.
The palest ink beats the best memory, so it is better to write it down!
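
A quick way to check for this before conch trips over it is to try the import and print a hint when it fails. The helper name have_module is made up for this sketch, and the install hint simply repeats the easy_install route mentioned above (pycrypto being the package that provides Crypto).

```python
def have_module(name):
    """Return True if the named module can be imported, else False."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if have_module("Crypto.Util"):
        print("Crypto is available")
    else:
        # Assumption: the package is installed the way the post describes.
        print("Crypto is missing; try: easy_install pycrypto")
```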

Let easy_install build your own little Python world

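When playing with or using Python on Linux, you often need to install extra Python libraries.
But because of permissions, we usually cannot write to system directories such as /usr/local,
and these extra libraries insist on a site-packages directory to install into.
You can pass options such as pure-lib, but that often still fails, especially now that many Python
libraries build their installers with setuptools, which requires exactly this.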

[gavin@Korea downloads]$ ll /usr/local/python24/lib/python2.4/site-packages/
total 24
drwxr-xr-x  2 root root 4096 Dec 18 16:05 PIL
-rw-r--r--  1 root root    4 Dec 18 16:05 PIL.pth
-rw-r--r--  1 root root  119 Dec 18 14:59 README

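That is a problem: every file belongs to root, and everyone else can only look.
There is actually a neat solution: install a virtual Python in your own directory.
Create lib, include and other directories under your own home to form an independent little Python world. Any Python library missing system-wide can then be installed there yourself.
It needs no root privileges and still meets your needs, killing two birds with one stone.
Of course, we don't need to reinvent the wheel for these installation steps. Download

http://peak.telecommunity.com/dist/virtual-python.py

and run this script with the Python you prefer (some environments, such as DreamHost, offer several Python versions).
Running virtual-python.py automatically creates the required directories under ~/ (~/bin, ~/lib, ~/include), the chosen Python version, and symlinks to the .h header files, .py library files and so on that Python depends on.
It also copies an executable python into ~/bin/. From then on, run your Python programs with ~/bin/python and it will automatically find the libraries installed in your little world. If you would rather not have bin, lib and include directly under ~, pass virtual-python.py a --prefix option: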

[gavin@Korea bin]$ python24 virtual-python.py --prefix=~/python-lib
[gavin@Korea bin]$ pwd
/home/gavin/python-lib/bin
[gavin@Korea bin]$ ll
total 2360
-rwxrwxr-x  1 gavin gavin 2404367 Dec 18 16:53 python

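This puts bin, lib and include under ~/python-lib.
Now you can use easy_install to install the extra libraries you need.
But wait: the easy_install available at this point is the system-wide one, so it would install things
under /usr/local or similar. So we have to install an easy_install for our own environment.
Download ez_setup.py and run it with the new interpreter: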

wget http://peak.telecommunity.com/dist/ez_setup.py
~/python-lib/bin/python ez_setup.py
Downloading http://cheeseshop.python.org/packages/2.4/s/setuptools/setuptools-0.6c3-py2.4.egg
Processing setuptools-0.6c3-py2.4.egg
creating /home/gavin/python-lib/lib/python2.4/site-packages/setuptools-0.6c3-py2.4.egg
Extracting setuptools-0.6c3-py2.4.egg to /home/gavin/python-lib/lib/python2.4/site-packages
Adding setuptools 0.6c3 to easy-install.pth file
Installing easy_install script to /home/gavin/python-lib/bin
Installing easy_install-2.4 script to /home/gavin/python-lib/bin

Installed /home/gavin/python-lib/lib/python2.4/site-packages/setuptools-0.6c3-py2.4.egg
Processing dependencies for setuptools==0.6c3
[gavin@Korea downloads]$ ll ~/python-lib/bin/
total 2376
-rwxr-xr-x  1 gavin gavin     298 Dec 18 17:02 easy_install
-rwxr-xr-x  1 gavin gavin     306 Dec 18 17:02 easy_install-2.4
-rwxrwxr-x  1 gavin gavin 2404367 Dec 18 16:53 python

OK, my own easy_install is now installed, and I can use it to install
anything I want without worrying about permissions:

[gavin@Korea downloads]$ ll ~/python-lib/lib/python2.4/site-packages/

It looks like I don't have simplejson yet. OK, let's install it first:

[gavin@Korea downloads]$ ~/python-lib/bin/easy_install simplejson
Searching for simplejson
Reading http://www.python.org/pypi/simplejson/
Reading http://undefined.org/python/#simplejson
Reading http://www.python.org/pypi/simplejson/1.4
Best match: simplejson 1.4
Downloading http://cheeseshop.python.org/packages/2.4/s/simplejson/simplejson-1.4-py2.4.egg#md5=4f18e31fd095cd54e5015e7b7a147093
Processing simplejson-1.4-py2.4.egg
Moving simplejson-1.4-py2.4.egg to /home/gavin/python-lib/lib/python2.4/site-packages
Adding simplejson 1.4 to easy-install.pth file

Installed /home/gavin/python-lib/lib/python2.4/site-packages/simplejson-1.4-py2.4.egg
Processing dependencies for simplejson
[gavin@Korea downloads]$ ll ~/python-lib/lib/python2.4/site-packages/
total 76
-rw-rw-r--  1 gavin gavin   241 Dec 18 17:07 easy-install.pth
lrwxrwxrwx  1 gavin gavin    51 Dec 18 16:53 PIL -> /usr/local/python24/lib/python2.4/site-packages/PIL
lrwxrwxrwx  1 gavin gavin    55 Dec 18 16:53 PIL.pth -> /usr/local/python24/lib/python2.4/site-packages/PIL.pth
lrwxrwxrwx  1 gavin gavin    54 Dec 18 16:53 README -> /usr/local/python24/lib/python2.4/site-packages/README
drwxrwxr-x  4 gavin gavin  4096 Dec 18 17:02 setuptools-0.6c3-py2.4.egg
-rw-rw-r--  1 gavin gavin    29 Dec 18 17:02 setuptools.pth
-rw-rw-r--  1 gavin gavin 35898 Dec 18 17:07 simplejson-1.4-py2.4.egg
[gavin@Korea downloads]$

Very nice:

[gavin@Korea downloads]$ ~/python-lib/bin/python
Python 2.4.4 (#1, Dec 18 2006, 14:54:46)
[GCC 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simplejson
>>> dir(simplejson)
['JSONDecoder', 'JSONEncoder', '__all__', '__builtins__', '__doc__',
'__file__', '__loader__', '__name__', '__path__', '__version__',
'decoder', 'dump', 'dumps', 'encoder', 'load', 'loads', 'read',
'scanner', 'write']
>>>
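
To double-check that a library really comes from the private tree rather than the system one, you can look at where the interpreter and the module live. A small sketch, written for a current Python 3 rather than the 2.4 shown above; module_location is a hypothetical helper, and json is used only as a stand-in for whatever you installed (for example simplejson).

```python
import os
import sys


def module_location(name):
    """Import the named module and return the directory it was loaded from.

    Returns None for modules that have no __file__ (e.g. built-ins).
    """
    module = __import__(name)
    path = getattr(module, "__file__", None)
    if path is None:
        return None
    return os.path.dirname(os.path.abspath(path))


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    print("json from:  ", module_location("json"))
```

If the reported directory sits under your private prefix (here ~/python-lib), the interpreter is picking up your own copy.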

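This approach is a great convenience on shared hosts such as DreamHost: everything is your own and independent, so you no longer need to worry about Django failing because it cannot find PIL or something else, and you don't need to add a pile of sys.path entries to the fcgi dispatcher script. Lovely!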