Sunday, December 7, 2008
Weekly Twitter 2008 #49
Tuesday, October 18, 2005
regex compilation in Perl
Yesterday, while writing a simple Perl script to process a fairly large text file, I found the speed disappointing. It occurred to me that repeatedly matching the regular expressions might have some cost; I remembered that a regular expression that never changes can be compiled, and a compiled one should run faster (Python's re module provides a compile function for exactly this). For Perl, though, I couldn't recall anything similar, and after digging through the Perl documentation for quite a while I still found nothing.
In the end Google gave me the answer (perhaps I should have asked it first). It is very simple: just append a modifier to the expression:
o - Compile a regular expression once
If you ever end up with a really long regular expression, you can use this modifier to compile it before it's used. This means that long and complicated expressions don't have to be compiled each time they're used.
The only thing you must remember is that if you use this modifier, you are promising Perl that you won't attempt to change it while the script is running. If you do, it won't be taken into account. There won't be an example of how to use this modifier, since if you're able to write regular expressions this long and complicated, you're way ahead of anything I could tell you in this article!
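As a quick illustration (my own minimal sketch, not from the article quoted above; the pattern and variable names are made up), the modifier simply goes after the closing delimiter:

#!/usr/bin/env perl
# /o asks Perl to compile an interpolated pattern only once.
my $pattern = 'ERROR\s+\d+';    # imagine this comes from a config file

while (my $line = <STDIN>) {
    # Without /o the interpolated pattern may be recompiled on every iteration;
    # with /o it is compiled the first time and then reused.
    print $line if $line =~ /$pattern/o;
}

In more recent Perl the qr// operator is the cleaner way to precompile a pattern once and reuse it.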
Wednesday, August 10, 2005
This language called Perl...
I saw a "global programming language popularity ranking" on /.cn, and what I did not expect was that Perl ranked fourth, behind only Java, C and C++.
As it happens, two weeks ago my former department at the company started a project with BT (British Telecom, not BitTorrent) that was developed in Perl. In that whole big department nobody knew the language, so my former manager called and asked me to come over and help out for two weeks.
I am fairly familiar with Perl 4, but I know rather less about the packages, references and so on that came with Perl 5. Since I had not really used it for a long time (I later moved on to Python), I quickly grabbed two e-books to skim (Advanced Perl Programming and the Perl Cookbook). As I still prefer books made of paper, I also planned to hunt for a couple in the bookstores over the weekend. Who would have thought that after two big bookstores and two small ones there were hardly any Perl books to be found (I only saw O'Reilly's Learning Perl, a Perl for C++ Programmers, and perhaps a CGI Programming with Perl). china-pub and Dangdang had nothing good either. I remember seeing plenty of them in the past; how come... no wonder so few people know the language.
Back to the Perl language itself: it relies on far too many conventions. Conventions, special variables and special syntax are everywhere. As an example, the first chapter of Advanced Perl Programming discusses references:
$s = \('a', 'b', 'c'); # WARNING: probably not what you think
What does $s point to? A reference to the list ('a', 'b', 'c')? Well, not quite:
As it happens, this is identical to
$s = (\'a', \'b', \'c'); # List of references to scalars
An enumerated list always yields the last element in a scalar context (as in C), which means that $s contains a reference to the constant string c.
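For comparison, here is a small sketch of my own showing what one usually wants instead (an anonymous array reference), next to what \(...) really does:

# An anonymous array reference -- probably what was intended:
my $s = ['a', 'b', 'c'];
print $s->[2], "\n";            # prints "c"

# The backslash distributes over a list, so in list context you get
# a list of references to the individual scalars:
my @refs = \('a', 'b', 'c');    # same as (\'a', \'b', \'c')
print ${$refs[0]}, "\n";        # prints "a"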
Writing things in Perl sometimes goes quite smoothly, but debugging them is painful enough, and unless you put in plenty of comments up front you won't understand the code when you come back to it. At least the many small tools I later rewrote in Python became easy to write and easy to read.
-------- BTW: It saddens me to see the popularity of Delphi/Pascal keep falling. A couple of days ago I saw that Bob Swart (Dr. Bob) had put up a "Forever Loyal to Delphi" banner on his website; that it has come to this makes it all the more depressing.
Friday, April 8, 2005
Convert CHM contents to normal HTML contents
I have some eBooks in CHM or SRM format. Now I want to copy them to my cell phone. Since neither CHM nor SRM is supported there, I chose the PalmDoc (.pdb) format.
Yes, I can convert a pack of HTML files into one PDB file. But:
1) Some CHM books don't have a contents page; they rely on the CHM table of contents instead. Without a contents page, browsing the resulting PDB file would not be a pleasant experience.
2) SRM books can be exported as CHM files, and none of them have a contents page either.
Then came this simple recipe.
I remember that two or three years ago I used to do these things in Perl. Perl's regex support is very powerful. The only problem is that after a few days the script tends to become unreadable. :-(
Python is different from Perl. This recipe is quite simple, isn't it?
#!/usr/bin/env python
from sgmllib import SGMLParser
import htmlentitydefs
# from chmmaker import HHCWriter    # not used below
import os

class SiteMapParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        # some temp variables
        self.level = 0
        self.link_url = ""
        self.link_title = ""

    def start_ul(self, attrs):
        self.on_section_starts()

    def end_ul(self):
        self.on_section_ends()

    def start_param(self, attrs):
        if len(attrs) > 1:
            if attrs[0][0] == 'name':
                if attrs[0][1] == 'Name':
                    self.link_title = attrs[1][1]
                elif attrs[0][1] == "Local":
                    self.link_url = attrs[1][1]

    def start_object(self, attrs):
        self.link_title = ""
        self.link_url = ""

    def end_object(self):
        self.on_link_found(self.link_title, self.link_url)

    def on_section_starts(self):
        self.level = self.level + 1

    def on_section_ends(self):
        self.level = self.level - 1

    def on_link_found(self, title, url):
        # you can override this
        if title and url:
            print " " * self.level + "%s [%s]" % (title, url)

class ContentParser(SiteMapParser):
    """A simple class to convert CHM contents (foo.hhc) to normal HTML contents."""
    def reset(self):
        print "<HTML><HEAD></HEAD><BODY>"
        SiteMapParser.reset(self)

    def on_section_starts(self):
        print "<ul>"

    def on_section_ends(self):
        print "</ul>"

    def on_link_found(self, title, url):
        print '<li><a href="%s">%s</a></li>' % (url, title)

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print "Usage: %s foo.hhc" % sys.argv[0]
        sys.exit()
    trans = ContentParser()
    fh = open(sys.argv[1], "r")
    try:
        trans.feed(fh.read())
    except:
        pass
    trans.close()
    fh.close()
# vim:expandtab softtabstop=4
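For reference, the foo.hhc sitemap the script reads is itself a small HTML-like file. A hand-written sketch (not taken from any real book, just to show the tags the parser looks at: ul, object and the Name/Local params) looks roughly like this:

<HTML><BODY>
<UL>
  <LI><OBJECT type="text/sitemap">
        <param name="Name" value="Chapter 1">
        <param name="Local" value="ch01.html">
      </OBJECT>
  <UL>
    <LI><OBJECT type="text/sitemap">
          <param name="Name" value="Section 1.1">
          <param name="Local" value="ch01s01.html">
        </OBJECT>
  </UL>
</UL>
</BODY></HTML>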
Friday, December 17, 2004
Hyperlink support in html2rtf.pl
Add the following code at line 257:
# now href
$urlobj_data1 = "{\\field{\\*\\fldinst {\\fs24\\insrsid13071880 \\hich\\af1\\dbch\\af13\\loch\\f1 \\hich\\af1\\dbch\\af13\\loch\\f1 HYPERLINK";
$urlobj_data2 = "\\hich\\af1\\dbch\\af13\\loch\\f1 }{\\fs24\\insrsid13071880\\charrsid13071880 {\\*\\datafield 00d0c9ea79f9bace118c8200aa004ba90b0200000003000000e0c9ea79f9bace118c8200aa004ba90b5a0000";
# $urlobj_data3 is the URL (Unicode) in hex code. e.g. http://www.zope.org/Members/Brian/PythonNet/
$urlobj_data3 = "0068007400740070003a002f002f007700770077002e007a006f00700065002e006f00720067002f004d0065006d0062006500720073002f0042007200690061006e002f0050007900740068006f006e004e00650074002f";
$urlobj_data4 = "000000}}}{\\fldrslt {\\cs15\\fs24\\ul\\cf2\\insrsid13071880\\charrsid13071880 \\hich\\af1\\dbch\\af13\\loch\\f1 ";
# turn <a href="...">...</a> pairs into RTF HYPERLINK fields
$instream =~ s/<a href="([^"]+)"[^>]*>/$urlobj_data1 . " \"$1\" " . $urlobj_data2 . &url_str2hex($1) . $urlobj_data4/ige;
$instream =~ s/<\/a>/}}}/ig;
The implementation of url_str2hex is given below; it can be placed anywhere in the script:
# input: http://
# output: 0068007400740070003a002f002f
sub url_str2hex {
    local($s);
    $s = $_[0];
    $out = "";
    $i = 0;
    while ($i < length($s)) {
        $ch = substr($s, $i, 1);
        #printf "%04x\n", ord($ch);
        $out = $out . sprintf("%04x", ord($ch));
        $i++;
    }
    # printf $out;    # debug output only
    return $out;
}
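To illustrate, a quick usage sketch matching the input/output noted in the comments above ($hex is just a throwaway variable name):

$hex = &url_str2hex("http://");
# $hex now holds "0068007400740070003a002f002f"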
html2rtf.pl is available at: http://fresh.t-systems-sfr.com/unix/src/www/.warix/html2rtf.pl.html