Serialising Gtk TextBuffers to HTML

While working on Gourmet, we needed to convert the content of a Gtk TextBuffer holding Pango rich-text markup to HTML, the format in which the application stores its data.

I couldn’t find an easy way to do it out of the box, so I ended up writing my own converter from the Gtk serialised content to HTML.

It uses the html module from the Python standard library. As we already had BeautifulSoup4 as a dependency, it uses that as well, although all the work could be done with html alone.

First, we defined a class derived from Gtk.TextBuffer that overrides the get_text method, so that it returns the content as plain text by default, or as HTML when include_hidden_chars is set:

from typing import Optional

from gi.repository import Gtk


class PangoBuffer(Gtk.TextBuffer):

    def get_text(self,
                 start: Optional[Gtk.TextIter] = None,
                 end: Optional[Gtk.TextIter] = None,
                 include_hidden_chars: bool = False) -> str:
        """Get the buffer content.

        If `include_hidden_chars` is set, then the html markup content is
        returned. If False, then the text only is returned."""
        if start is None:
            start = self.get_start_iter()
        if end is None:
            end = self.get_end_iter()

        if include_hidden_chars is False:
            return super().get_text(start, end, include_hidden_chars=False)
        else:
            # Serialise with the built-in tagset format, then convert to html.
            format_ = self.register_serialize_tagset()
            content = self.serialize(self, format_, start, end)
            return PangoToHtml().feed(content)

The important parts are within the else block. I would have preferred to develop my own serialiser, but the documentation is sparse. We therefore use the built-in serialiser, which produces binary content.

This content is basically XML markup with an extra header and footer:

 # Truncated for legibility.

<tag id="12" priority="12"> </tag>  # Tags can be empty
<tag name="italic" priority="2">
    <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
</tag>
<tag id="7" priority="7">
    <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />
    <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
    <attr name="weight" type="gint" value="700" />
</tag>
<apply_tag name="italic">This is italic</apply_tag>
<apply_tag id="1">. </apply_tag>
<apply_tag id="2">This is italic</apply_tag>
<apply_tag id="3">\n            </apply_tag>
<apply_tag id="7">This is bold, italic, and has background colouring.</apply_tag>

From this, we can establish that tags are not sorted, and they can have either an id or a name.

Tags that carry an id instead of a name are referred to as anonymous, and are typically created by Pango when deserialising content.

Named tags are typically the ones defined in your application:

tag_bold = TextBuffer.create_tag("bold", weight=Pango.Weight.BOLD)
tag_italic = TextBuffer.create_tag("italic", style=Pango.Style.ITALIC)
tag_underline = TextBuffer.create_tag("underline", underline=Pango.Underline.SINGLE)

The header contains a binary length mark that does not necessarily decode as valid characters when calling bytes.decode, so it has to be removed before decoding the rest to an XML string.
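
The header-stripping step can be tried in isolation. The following is a minimal sketch, not Gourmet's code: the sample bytes are invented for the example (real serialised content has a different header), and strip_header is a hypothetical helper name. It only assumes that the XML payload starts at <text_view_markup>, as in the converter shown further down.

```python
# Invented sample: some binary header bytes followed by the XML payload.
# The trailing 0xff byte is not valid UTF-8, like a raw length field can be.
SAMPLE = (
    b"HEADER-0001\x00\x00\x00\xff"
    b"<text_view_markup><tags></tags>"
    b"<text>caf\xc3\xa9</text></text_view_markup>"
)


def strip_header(data: bytes) -> str:
    """Drop everything before <text_view_markup> and decode the rest."""
    start = data.find(b"<text_view_markup>")
    if start == -1:
        raise ValueError("no <text_view_markup> element found")
    return data[start:].decode("utf-8")


xml = strip_header(SAMPLE)
print(xml)
```

Decoding SAMPLE directly raises UnicodeDecodeError because of the header bytes, which is why the slice has to happen on the bytes object, before decoding.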

Then the PangoToHtml class does the actual job:

from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

from bs4 import BeautifulSoup
from bs4.element import Tag
from gi.repository import Pango

class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Because Pango can only deserialize a subset of HTML, the encoding here uses
    a subset of HTML. Moreover, only the Pango markup used within Gourmet is
    handled, although expanding it is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, e.g. underline, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """
    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List[str] = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

    # These are the tags supported by our parser.
    # If your application uses more, extend the dictionary here to add your tags.
    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    @staticmethod
    def pango_to_html_hex(val: str) -> str:
        """Convert a Pango colour string to an html hex colour.

        Pango strings have the format 'ffff:ffff:ffff' (for white), with 16
        bits per channel. Each channel is scaled down to 8 bits, and the three
        are joined into a single string such as '#ffffff'.
        """
        red, green, blue = val.split(":")
        red = hex(255 * int(red, base=16) // 65535)[2:].zfill(2)
        green = hex(255 * int(green, base=16) // 65535)[2:].zfill(2)
        blue = hex(255 * int(blue, base=16) // 65535)[2:].zfill(2)
        return f"#{red}{green}{blue}"

    def feed(self, data: bytes) -> str:
        """Convert the serialised content of a buffer to an html string.

        Unlike with a plain HTMLParser, the whole content must be passed at
        once; chunks are not supported.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as valid characters.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # The remaining is serialized Pango footer, which we don't need.

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined, normal, and links (coloured and underlined)
        soup = BeautifulSoup(tags, features="lxml")
        tags = soup.find_all("tag")

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            # The tag may have a name, for named tags, or else an id
            tag_name = tag.attrs.get('id')
            tag_name = tag.attrs.get('name', tag_name)

            attributes = [c for c in tag.contents if isinstance(c, Tag)]
            for attribute in attributes:
                vtype = attribute['type']
                value = attribute['value']
                name = attribute['name']

                if vtype == "GdkColor":  # Convert colours to html
                    if name in ['foreground-gdk', 'background-gdk']:
                        opening, closing = self.tag2html[name]
                        hex_color = self.pango_to_html_hex(value)
                        opening = opening.format(hex_color)
                    else:
                        continue  # unsupported colour attribute, skip it
                else:
                    opening, closing = self.tag2html[value]

                opening_tags += opening
                closing_tags = closing + closing_tags   # closing tags are FILO

            # Tags without supported attributes are stored with empty markup,
            # so that every apply_tag still finds an entry.
            tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed
        # all of the content.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO

        # Let HTMLParser walk the text; the handle_* methods below do the work.
        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The only tag we care about in pango markup is "apply_tag". This could
        # be made an assert, but we let our parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('id')  # A tag may have a name, or else an id
            tag_name = attrs.get('name', tag_name)
            tags = self.tags.get(tag_name)

            if tags is not None:
                self.current_opening_tags, closing_tag = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        # Prefix the text with the html tags opened by the enclosing apply_tag.
        self.markup_text += self.current_opening_tags + data

    def handle_endtag(self, tag: str) -> None:
        # The stack can be empty, e.g. when the closing "text" tag is reached.
        if self.current_closing_tags:
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
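
To make the colour conversion in pango_to_html_hex concrete, here is a standalone sketch of the same arithmetic, assuming the input always has three colon-separated hexadecimal channels of up to 16 bits each. The function name pango_colour_to_html is made up for this illustration and is not part of Gourmet.

```python
def pango_colour_to_html(val: str) -> str:
    """Scale three 16-bit hex channels ('rrrr:gggg:bbbb') to '#rrggbb'."""
    channels = val.split(":")
    if len(channels) != 3:
        raise ValueError(f"expected three channels, got {val!r}")

    html = "#"
    for channel in channels:
        # Map the 0..65535 range onto 0..255 with integer arithmetic.
        scaled = 255 * int(channel, base=16) // 65535
        html += f"{scaled:02x}"
    return html


print(pango_colour_to_html("ffff:ffff:ffff"))  # white
print(pango_colour_to_html("0:0:ffff"))        # pure blue
```

The value "0:0:ffff" is the background colour from the serialised sample above; it comes out as "#0000ff".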

According to its documentation, HTMLParser "serves as the basis for parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML". That fits our needs: we want to handle start tags, end tags, and the content in between.
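
To see the order in which HTMLParser invokes these callbacks on Pango-style markup, the following sketch simply records each event. EventLogger is a hypothetical class written for this illustration, and the input string is a hand-written fragment rather than real serialised output.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class EventLogger(HTMLParser):
    """Record every start tag, text chunk and end tag, in order."""

    def __init__(self) -> None:
        super().__init__()
        self.events: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<text><apply_tag name="italic">Hi</apply_tag></text>')
logger.close()

for event in logger.events:
    print(event)
```

Note that the enclosing text element triggers the start and end callbacks too, which is why a converter built this way has to cope with tags other than apply_tag.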

Within the serialised content, tags are referred to by their name or id, so the tag definitions must be processed beforehand.

In this case, I chose to use BeautifulSoup, as it offers an easy way to go through the XML tags in a simple loop.
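
For comparison, the same pre-processing can be sketched with the standard library alone. This is an illustrative alternative under stated assumptions, not the code used in Gourmet: TagCollector and VALUE_TO_HTML are hypothetical names, the mapping is hard-coded and reduced to italic and bold (so that Pango is not required), and unsupported attributes are silently skipped.

```python
from html.parser import HTMLParser
from typing import Dict, Optional, Tuple

# Reduced, hard-coded mapping from attribute values to html tag pairs.
VALUE_TO_HTML: Dict[str, Tuple[str, str]] = {
    "PANGO_STYLE_ITALIC": ("<i>", "</i>"),
    "700": ("<b>", "</b>"),
}


class TagCollector(HTMLParser):
    """Build {tag name or id: (opening html, closing html)}."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: Dict[str, Tuple[str, str]] = {}
        self._current: Optional[str] = None  # tag being defined, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "tag":
            # Named tags take precedence; anonymous tags only have an id.
            self._current = attributes.get("name") or attributes.get("id")
            if self._current is not None:
                self.tags[self._current] = ("", "")
        elif tag == "attr" and self._current is not None:
            pair = VALUE_TO_HTML.get(attributes.get("value"))
            if pair is None:
                return  # unsupported attribute, skipped in this sketch
            opening, closing = self.tags[self._current]
            # Closing tags are prepended so that they close in reverse order.
            self.tags[self._current] = (opening + pair[0], pair[1] + closing)

    def handle_endtag(self, tag):
        if tag == "tag":
            self._current = None


DEFINITIONS = """
<tags>
 <tag name="italic" priority="2">
  <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
 </tag>
 <tag id="7" priority="7">
  <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
  <attr name="weight" type="gint" value="700" />
 </tag>
 <tag id="12" priority="12"> </tag>
</tags>
"""

collector = TagCollector()
collector.feed(DEFINITIONS)
collector.close()
print(collector.tags)
```

The self-closing attr elements reach handle_starttag because HTMLParser's default handle_startendtag forwards to it. The BeautifulSoup loop is shorter to read, which is the trade-off behind the choice made here.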

Could the whole thing have been done with only BeautifulSoup, or only the html library? Probably yes, but in Gourmet we also support all sorts of links, so the end result differs a bit, and we needed the flexibility that HTMLParser offers.

There are also unit tests available.