Code Tags and Syntax Highlighting

UPDATE: Now that I’m using WordPress, I no longer rely on this. With markdown and markdown syntax highlighting plugins, all this logic is automatically handled for me at page render time. This is still a good reference for people using blogspot or implementing their own syntax highlighting on the back-end. Also, as Adam has mentioned, for blogspot there are other alternatives available that are simpler to set up if you don’t want to do your own implementation.

I really like the flexibility of blogspot. It’s one of very few free blogging platforms that allow me to customize just about any feature of my blog, from custom CSS stylesheets to unique Javascript snippets. Naturally, I was surprised that there was no <code> tag to wrap code blocks in, despite so many existing programmer blogs. Luckily, a quick search revealed another blog that resolved this via a simple CSS tag. So I followed the same advice, but somehow staring at monotone text in a code block was about as satisfying as coding in notepad.

As I started investigating, I found multiple solutions, ranging from online parsers to various language libraries. My favorite was pygments, a Python library (are you surprised?) for syntax highlighting. After a little more searching, I found an article that not only explains how to use pygments but also includes a quick script for highlighting Python files. The script had a small glitch: instead of indir=r'', I had to use indir=r'.', so the code I started with ended up looking like this:

from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
import os

formatter = HtmlFormatter()

indir = r'.'
for x in os.listdir(indir):
    if x.endswith('.py'):
        infile = os.path.join(indir, x)
        # Swap only the trailing extension, not every '.py' in the path.
        outfile = os.path.splitext(infile)[0] + '.html'
        # Read the whole source file into one string.
        with open(infile) as f:
            text = f.read()
        html = highlight(text, PythonLexer(), formatter)
        with open(outfile, 'w') as f:
            f.write(html)
        print('finished : ' + x)

I then decided to go a step further and update the script to handle multiple programming/markup languages, since I also use CSS, HTML, XML, and Javascript in my blog. But creating a separate file for every snippet of code I decide to use in a blog post didn’t seem like a good solution.

Instead, I decided to parse out and highlight all the <code> tags from a single file. This would allow me to type an entire blog post as one text file and run it through the parser once instead of combining multiple files into a single post. I was initially thinking of using Python’s regular expression library, but decided to use an XML parsing library instead. It just seemed like the right tool for the job: it can handle additional attributes added to the <code> tag, as well as edit the document tree in-place. In other words, with an XML parser I’m much less likely to screw something up with my future “improvements”.

There are several XML libraries available in Python; I decided to go with xml.dom. The main reason I chose it is that its naming scheme is the same as that of Pyjamas and Javascript’s DOM methods (unlike lxml, whose method names feel awkward to me). xml.dom also happens to be the library Pyjamas Desktop uses for XML parsing (I’ll write another tutorial on that soon).

As far as applying correct syntax highlighting based on the language, there are a few ways to go about it. One is to use the guess_lexer() function from pygments, which can usually detect the correct language automatically. However, a lot of my code blocks are two-liners that would probably look the same in many different languages, so I decided against that. As a blog writer, I should know what language I’m posting the code in anyway, so it’s very easy for me to specify it myself. Since I’m using a real XML parser, I can add as many attributes to my <code> tag as I wish. So I decided to specify the language via the “class” attribute (e.g. <code class='python'>).

I then imported the lexers I wanted to use, and defined a dictionary mapping “class” names to lexers (note that omitting the class attribute entirely defaults to the non-highlighting TextLexer):

from pygments.lexers import (HtmlLexer, JavascriptLexer, CssLexer, XmlLexer,
                             PythonLexer, TextLexer)
import xml.dom.minidom

lexers = {'': TextLexer,
          'html': HtmlLexer,
          'js': JavascriptLexer,
          'css': CssLexer,
          'xml': XmlLexer,
          'python': PythonLexer}
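
The default entry works because getAttribute() returns an empty string when the attribute is missing. The sketch below illustrates that lookup on its own. It uses plain strings as stand-ins for the pygments lexer classes so it runs without pygments, and the pick_lexer_name() helper and its fallback for unknown class names are my own additions for illustration, not part of the script in this post.

```python
import xml.dom.minidom

# Stand-in values: strings instead of the real pygments lexer classes.
lexer_names = {'': 'TextLexer',
               'python': 'PythonLexer',
               'css': 'CssLexer'}

doc = xml.dom.minidom.parseString(
    '<root><code>plain</code>'
    '<code class="python">x = 1</code>'
    '<code class="ruby">puts 1</code></root>')


def pick_lexer_name(element):
    """Hypothetical helper: map a <code> element to a lexer name."""
    # getAttribute() returns '' when the attribute is absent, which is why
    # the '' key acts as the default entry.
    css_class = element.getAttribute('class')
    # Assumed extension: unknown classes fall back to the default entry
    # instead of raising KeyError.
    return lexer_names.get(css_class, lexer_names[''])


chosen = [pick_lexer_name(e) for e in doc.getElementsByTagName('code')]
print(chosen)
```

With a plain lexers[...] lookup, as used in the rest of this post, a misspelled class name raises KeyError instead, which may be preferable if you would rather catch typos early.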

The next step was to loop through all of these code tags, selecting the correct lexer based on the “class” attribute, and applying it to the contents. The last step was to convert the XML DOM tree back to xml format before outputting it to the file. Trying something like this, however, will not work:

for element in code_tags:
    lexer = lexers[element.getAttribute('class')]()
    element.firstChild.nodeValue = highlight(
        element.firstChild.nodeValue, lexer, formatter)
html = doc.toxml()

The problem with the above code is that xml.dom automatically converts the < and > angle brackets to &lt; and &gt;, respectively, to avoid interpreting the node value as part of the tree. This means that all of the pretty syntax highlighting markup pygments added will be shown to the user as literal text instead of getting interpreted by the browser. Since I’m outputting all of the “XML” back to the document, I figured there is no harm in xml.dom “interpreting” the pygments output, and replaced the last line of the for loop with the following:

element.replaceChild(xml.dom.minidom.parseString(
    highlight(element.firstChild.nodeValue, lexer, formatter))
    .documentElement, element.firstChild)
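
To see the difference between the two approaches in isolation, here is a small sketch that needs only the standard library. The fake_highlighted string is a hand-written stand-in for highlighter output (real pygments output is wrapped in its own <div> and <pre>), and the <pre> wrapper is added here only so the stand-in has a single root element to parse.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<root><code>x = 1</code></root>')
code = doc.getElementsByTagName('code')[0]

# Hand-written stand-in for what a highlighter might return.
fake_highlighted = '<b>x</b> = 1'

# Approach 1: assign the markup as text. minidom escapes it on output.
code.firstChild.nodeValue = fake_highlighted
escaped = code.toxml()

# Approach 2: parse the markup first, then swap the parsed node into the tree.
wrapper = xml.dom.minidom.parseString('<pre>' + fake_highlighted + '</pre>')
code.replaceChild(wrapper.documentElement, code.firstChild)
preserved = code.toxml()

print(escaped)    # the <b> tag comes out as &lt;b&gt;
print(preserved)  # the <b> tag survives as real markup
```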

This worked great for Python, CSS, and Javascript parsing, but HTML and XML got “interpreted” into the tree before I could pass them into pygments (the markup had become child element nodes, so element.firstChild.nodeValue no longer held the code). The trick was to apply the reverse of the solution I used to get normal highlighting working. We do this via a toxml() call, which converts the tree structure into a string. The problem is, if you call toxml() on the entire element, you’ll end up also syntax-highlighting the <code> tag itself, which we’re trying to keep hidden from the viewer. And since a NodeList doesn’t have a toxml() method, you’ll simply have to loop through each child individually as follows:

for child in element.childNodes:
    element.replaceChild(xml.dom.minidom.parseString(
        highlight(child.toxml(), lexer, formatter))
        .documentElement, child)
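
The following sketch shows, without pygments, why serializing the children rather than the element matters. The sample markup is made up for illustration.

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<root><code class="html"><p>Hi <b>there</b></p></code></root>')
code = doc.getElementsByTagName('code')[0]

# The <p> markup was parsed into the tree, so the first child is an element
# node and its nodeValue is None rather than a string of code.
first_value = code.firstChild.nodeValue

# Serializing the whole element includes the <code> wrapper itself.
whole = code.toxml()

# Serializing each child separately recovers only the inner markup.
inner = ''.join(child.toxml() for child in code.childNodes)

print(whole)
print(inner)
```

Note that the loop in this post highlights each child separately, so a <code> block with several top-level children ends up as several highlighted fragments rather than one.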

The work is almost complete. If you’ve been following along, adding these lines to your code and attempting to run it, you probably noticed that xml.dom throws a syntax error when you try to parse your text file. The reason is that xml.dom requires all of the file contents to be inside a single root element, and this is true of all XML content (e.g. webpages are always inside an <html> tag, SVG images are always inside an <svg> tag). My initial solution while testing this was to create a fake <root> tag around the text body. It works, but this wouldn’t be much of a convenience tool if it made me jump through extra hoops. For my final solution I decided to get rid of the manual tags. It’s really not hard: all you have to do is add <root> to the beginning of the string and </root> to the end before parsing it with xml.dom. Then, at the end, simply remove the last 7 characters (</root>) and the first 28 (the 6 characters of <root> plus the 22-character <?xml version="1.0" ?> declaration that xml.dom slaps on). The final version of the code ended up looking as follows:

from pygments import highlight
from pygments.lexers import (HtmlLexer, JavascriptLexer, CssLexer, XmlLexer,
                             PythonLexer, TextLexer)
from pygments.formatters import HtmlFormatter
import os
import xml.dom.minidom

lexers = {'': TextLexer,
          'html': HtmlLexer,
          'js': JavascriptLexer,
          'css': CssLexer,
          'xml': XmlLexer,
          'python': PythonLexer}

formatter = HtmlFormatter()

indir = r'.'
for x in os.listdir(indir):
    if x.endswith('.txt'):
        infile = os.path.join(indir, x)
        # Swap only the trailing extension, not every '.txt' in the path.
        outfile = os.path.splitext(infile)[0] + '.html'
        # Wrap the post in a temporary <root> element so it parses as XML.
        with open(infile) as f:
            text = '<root>' + f.read() + '</root>'
        doc = xml.dom.minidom.parseString(text)
        code_tags = doc.getElementsByTagName('code')
        for element in code_tags:
            lexer = lexers[element.getAttribute('class')]()
            for child in element.childNodes:
                element.replaceChild(xml.dom.minidom.parseString(
                    highlight(child.toxml(), lexer, formatter))
                    .documentElement, child)
        # Drop the XML declaration plus <root> (28 chars) and </root> (7 chars).
        html = doc.toxml()[28:-7]
        with open(outfile, 'w') as f:
            f.write(html)
        print('finished : ' + x)
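
The root-wrapping trick can be checked on its own with a short sketch like the one below. The sample post text is made up, and the joined variant is an alternative I am suggesting rather than what the script above does: it serializes the children of the root element, so it does not depend on the exact length of the XML declaration.

```python
import xml.dom.minidom

# Made-up sample post: loose text plus a <code> tag, with no root element.
post = 'Some text, then <code class="python">x = 1</code> and more text.'

# Wrap the fragment in a temporary root element so it parses as XML.
doc = xml.dom.minidom.parseString('<root>' + post + '</root>')
serialized = doc.toxml()

# What toxml() prepends: the 22-character declaration plus '<root>' (6).
prefix = '<?xml version="1.0" ?><root>'
sliced = serialized[len(prefix):-len('</root>')]

# Alternative: serialize only the children of the temporary root.
joined = ''.join(child.toxml() for child in doc.documentElement.childNodes)

print(len(prefix))
print(sliced == post, joined == post)
```

One edge case worth knowing about: an empty input file serializes as <root/>, so the fixed-offset slice would not line up in that case.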

If you want, you can take this a step further by customizing the <code> tag for each language you use. If you want something like line numbers, just add the following to the formatter initialization line:

formatter = HtmlFormatter(linenos=True)

Also, don’t forget to generate the CSS classes that actually define the colors you highlight with, and to place them into your blog’s CSS stylesheet. To generate the stylesheet, just run the following command in bash:

pygmentize -S default -f html > style.css

And just to prove that my new parser works, I’ve used it to highlight this entire blog post (note: this was originally posted on Blogspot, and this WordPress post no longer uses this mechanism). When I have the time, I might go back and highlight my old posts the same way as well.

This entry was posted in How To by Alexander Tsepkov.

About Alexander Tsepkov

Founder and CEO of Pyjeon. He started out with C++, but switched to Python as his main programming language due to its clean syntax and productivity. He often uses other languages for his work as well, such as JavaScript, Perl, and RapydScript. His posts tend to cover user experience, design considerations, languages, web development, Linux environment, as well as challenges of running a start-up.
