巴蛮子的新万花筒: Convert CHM contents to normal HTML contents

Friday, April 8, 2005

Convert CHM contents to normal HTML contents

What do I want?

I have some eBooks in CHM or SRM format that I want to copy to my cellphone. The phone supports neither format, so I chose the PalmDoc (.pdb) format instead.

Yes, I can convert a pack of HTML files into one PDB file. But:
1) Some CHM books don't have a contents page. They rely on the CHM contents (the .hhc file) instead. Without a contents page, browsing the resulting PDB file is not a happy experience.
2) SRM books can be exported as CHM files, but none of them has a contents page either.

Then came this simple recipe.

I remember that two or three years ago I used to do these things in Perl. Perl's regex support is very powerful. The only problem is that after a few days the script becomes unreadable. :-(
Python is different from Perl. This recipe is so simple, isn't it?

#!/usr/bin/env python

# sgmllib exists only in Python 2; it was removed in Python 3
from sgmllib import SGMLParser

class SiteMapParser(SGMLParser):
   def reset(self):
       SGMLParser.reset(self)
       # some temp variables
       self.level = 0
       self.link_url = ""
       self.link_title = ""

   def start_ul(self, attrs):
       self.on_section_starts()

   def end_ul(self):
       self.on_section_ends()

   def start_param(self, attrs):
       if len(attrs)>1:
           if attrs[0][0]=='name':
               if attrs[0][1]=='Name':
                   self.link_title=attrs[1][1]
               elif attrs[0][1]=="Local":
                   self.link_url=attrs[1][1]

   def start_object(self, attrs):
       self.link_title = ""
       self.link_url = ""

   def end_object(self):
       self.on_link_found(self.link_title, self.link_url)

   def on_section_starts(self):
       self.level = self.level + 1

   def on_section_ends(self):
       self.level = self.level - 1

   def on_link_found(self, title, url):
       # you can override this
       if title and url:
           print "  " * self.level + "%s [%s]" % (title, url)

class ContentParser(SiteMapParser):
   """ A simple class to convert CHM contents (foo.hhc) to a normal HTML contents """
   def reset(self):
       print "&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;"
       SiteMapParser.reset(self)

   def on_section_starts(self):
       print "&lt;ul&gt;"

   def on_section_ends(self):
       print "&lt;/ul&gt;"

   def on_link_found(self, title, url):
       print '<li><a href="%s">%s</a></li>' % (url, title)

if __name__=='__main__':
   import sys
   if len(sys.argv)<2:
       print "Usage: %s foo.hhc" % sys.argv[0]
       sys.exit()

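   trans=ContentParser()
   fh=open(sys.argv[1], "r")
   try:
       try:
           trans.feed(fh.read())
       except Exception:
           # the .hhc markup may be malformed; keep whatever was parsed so far
           pass
       trans.close()
   finally:
       fh.close()
   # close the tags that ContentParser.reset() opened
   print "</BODY></HTML>"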
# vim:expandtab softtabstop=4
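
The recipe above depends on sgmllib, which is no longer available in Python 3. As a rough sketch of the same idea for Python 3, the version below uses the standard html.parser module instead. It is not the original recipe: the names `entries`, `to_html` and `SAMPLE_HHC` are made up for illustration, and it collects the links into a list before rendering instead of printing from the callbacks. The sample markup assumes the usual .hhc layout of nested UL lists holding OBJECT elements with "Name" and "Local" params.

```python
from html import escape
from html.parser import HTMLParser


class SiteMapParser(HTMLParser):
    """Collect (level, title, url) entries from .hhc sitemap markup."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.title = ""
        self.url = ""
        self.entries = []  # hypothetical attribute, not in the original recipe

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.level += 1
        elif tag == "object":
            self.title = self.url = ""
        elif tag == "param":
            # look the attributes up by name instead of relying on their order
            params = dict(attrs)
            if params.get("name") == "Name":
                self.title = params.get("value") or ""
            elif params.get("name") == "Local":
                self.url = params.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "ul":
            self.level -= 1
        elif tag == "object" and self.title and self.url:
            self.entries.append((self.level, self.title, self.url))


def to_html(entries):
    """Render the collected entries as nested <ul> lists (hypothetical helper)."""
    out = ["<html><head></head><body>"]
    depth = 0
    for level, title, url in entries:
        while depth < level:
            out.append("<ul>")
            depth += 1
        while depth > level:
            out.append("</ul>")
            depth -= 1
        out.append('<li><a href="%s">%s</a></li>' % (escape(url), escape(title)))
    out.extend(["</ul>"] * depth)  # close any lists that are still open
    out.append("</body></html>")
    return "\n".join(out)


# Made-up sample in the assumed .hhc layout
SAMPLE_HHC = """
<UL>
  <LI><OBJECT type="text/sitemap">
    <param name="Name" value="Chapter 1">
    <param name="Local" value="ch1.html">
  </OBJECT>
  <UL>
    <LI><OBJECT type="text/sitemap">
      <param name="Name" value="Section 1.1">
      <param name="Local" value="ch1.html#s1">
    </OBJECT>
  </UL>
</UL>
"""

parser = SiteMapParser()
parser.feed(SAMPLE_HHC)
parser.close()
print(to_html(parser.entries))
```

Keeping the entries in a list makes it easy to reuse them, for example to write a plain-text outline instead of HTML. Titles and URLs are escaped here, which the original recipe does not do.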
