I have been working on a personal automation project. In short, I write most of my notes in markdown, so I wanted to grab them and store them in a specific format with annotations (e.g. everything under the heading "deployment notes" is labeled as such in the final data file). These are not high-volume, large files. I write them by hand; I am talking about 10-20 KB files, with most of the content being pasted code/request snippets. I am not looking for efficiency.
Blackfriday is the markdown parser for Hugo, so I was somewhat familiar with it. Since version 2, it exposes a markdown parser that produces an AST we can walk.
In this post, I am going to describe what I learned during the process and how I leveraged Blackfriday's markdown parser in some hacky ways to get annotated data. To visualize the AST (Abstract Syntax Tree) generated by Blackfriday, I used gographviz.
A simple package, parse, and the code can be found here:
Given a markdown file with the following structure, grab data and annotate them.
# title
) headings. The solution will not pass coding interviews. But I decline all coding interviews anyway; checkmate, interviewers.
I will use this data:
# Heading 1
## Heading 1-1
Content of heading 1-1.
More lines in heading 1-1.
## Heading 1-2
Content of heading 1-2.
More lines in heading 1-2.
## Heading 1-3
* https://example.net
* email: [email protected]
* address: 123 street name
* https://google.com
* email: blahblah
* https://parsiya.net
* https://parsiya.io
# Heading 2
## Heading 2-1
Heading 2-1 content.
First, we need to get the headings and their content. Using the markdown parser for this purpose did not work. In the AST, the content of each heading is not a child of the heading, instead everything is a child of root. See below for a representation of AST using gographviz.
Instead, I used regex. I am trash at regex but I somehow got this to work.
package main

import (
	"fmt"
	"regexp"
)

var testData = `
# Heading 1
... // removed
`

func main() {
	reStr := "(?m)^\\s*#{1}\\s*([^#\\n]+)$"
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatch(testData, -1)
	for _, match := range result {
		fmt.Println(match)
	}
}
The result is a [][]string. Each []string has two items: the first is the complete match (the whole heading line) and the second is the captured group (just the heading text). Run it in the Go playground at https://play.golang.org/p/09CPo4Cz32Z or see regex.code on Github.
[
# Heading 1 Heading 1]
[
# Heading 2 Heading 2]
This only returns the headings, but we want all the content. By switching the regex method to FindAllStringSubmatchIndex, we can get the index of these items.
func main() {
	reStr := "(?m)^\\s*#{1}\\s*([^#\\n]+)$"
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatchIndex(testData, -1)
	for _, match := range result {
		fmt.Println(match)
	}
}
Run it in the Go playground: https://play.golang.org/p/jizwSQHTRG4.
[0 12 3 12]
[338 350 341 350]
Each match is a slice of four offsets. The first two span the complete match and the last two span the heading text, so the heading text is between match[2] and match[3]. In other words, testData[match[2]:match[3]].
We can also do the same for level 2 headings. Just change the regex to (?m)^\\s*#{2}\\s*([^#\\n]+)$ (the number of #s to look for). I ended up with the following function that can give me the heading and content of every heading level.
// RawHeading represents a heading title and the raw content under it
// (including any subheadings).
type RawHeading struct {
	Title   string
	Content string
}

// Heading reads a markdown string and returns a slice of RawHeadings.
func Heading(content string, level int) (fi []RawHeading, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic in parse.Heading %v", r)
		}
	}()

	if level < 1 {
		level = 1
	}

	// Split into different sections.
	// TODO: Find better regex.
	// Narrator voice: This never happened.
	reStr := fmt.Sprintf("(?m)^\\s*#{%d}\\s*([^#\\n]+)$", level)
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatchIndex(content, -1)
	/*
		Returns slices of four ints.
		First two are the complete heading, including the #.
		Last two are only the heading name.
		The rest of the heading will be from the last number of one to start of the next.
		I will forget how this works, but it works. Don't touch it future Parsia.
	*/
	for i := range result {
		var raw RawHeading
		section := result[i]
		headingTextStart := section[2]
		headingTextEnd := section[3]
		raw.Title = content[headingTextStart:headingTextEnd]

		var startOfNextHeading int
		// Check for last item, last item continues to the end. Slice ends are
		// exclusive, so len(content) (not len(content)-1) keeps the last character.
		if i == len(result)-1 {
			startOfNextHeading = len(content)
		} else {
			startOfNextHeading = result[i+1][0]
		}
		// Trim whitespace from start and ending of content.
		raw.Content = strings.TrimSpace(content[section[3]:startOfNextHeading])
		fi = append(fi, raw)
	}
	return fi, nil
}
Note the name of the package. It might not make sense inside main, but all these functions are part of a package named parse. When using them from outside, we call parse.Heading(...), which sounds nice. For more information on package naming conventions, please read this Go blog post: https://blog.golang.org/package-names.
Running it gives us what we want: https://play.golang.org/p/YncrOEfiBxC.
func main() {
	levelOnes, err := Heading(testData, 1)
	if err != nil {
		panic(err)
	}
	for _, l1 := range levelOnes {
		fmt.Println(l1.Title)
		fmt.Println(l1.Content)
		fmt.Println("--------------------")
	}
}
And the result:
Heading 1
## Heading 1-1
Content of heading 1-1.
More lines in heading 1-1.
## Heading 1-2
Content of heading 1-2.
More lines in heading 1-2.
## Heading 1-3
* https://example.net
* email: [email protected]
* address: 123 street name
* https://google.com
* email: blahblah
* https://parsiya.net
* https://parsiya.io
--------------------
Heading 2
## Heading 2-1
Heading 2-1 content.
--------------------
This function also works for sub-headings, meaning we can pass the .Content of each level-one heading to get its level-twos: https://play.golang.org/p/ugfY0D_nEWT or heading.go on Github.
func main() {
	levelOnes, err := Heading(testData, 1)
	if err != nil {
		panic(err)
	}
	for _, l1 := range levelOnes {
		fmt.Println("Level 1 title:", l1.Title)
		levelTwos, _ := Heading(l1.Content, 2)
		for _, l2 := range levelTwos {
			fmt.Println("Level 2 title:", l2.Title)
			fmt.Println("Level 2 content:", l2.Content)
			fmt.Println("********************")
		}
		fmt.Println("--------------------")
	}
}
Result:
Level 1 title: Heading 1
Level 2 title: Heading 1-1
Level 2 content: Content of heading 1-1.
More lines in heading 1-1.
********************
Level 2 title: Heading 1-2
Level 2 content: Content of heading 1-2.
More lines in heading 1-2.
********************
Level 2 title: Heading 1-3
Level 2 content: * https://example.net
* email: [email protected]
* address: 123 street name
* https://google.com
* email: blahblah
* https://parsiya.net
* https://parsiya.io
********************
--------------------
Level 1 title: Heading 2
Level 2 title: Heading 2-1
Level 2 content: Heading 2-1 content.
********************
--------------------
Problem (somewhat) solved.
This was perhaps the easiest part of the task. Most content in that file is either used as plaintext or passed directly to Blackfriday for parsing to HTML. When importing blackfriday, be sure to use version 2.0. You can either use the new Go modules or use the pinned package at https://gopkg.in/russross/blackfriday.v2.
To generate HTML, use Run. Pass the text as a byte slice, indicate what kind of extensions to use (or pass nothing for a set of standard extensions), and get the HTML bytes.
mdBytes := blackfriday.Run([]byte(input), blackfriday.WithNoExtensions())
mdStr := string(mdBytes)
We also want to remove the <p> and </p> tags from the result. It can be done with a strings.Replacer. Pass strings in pairs where the first one is the match and the next is the replacement. In this code snippet, we are replacing both tags with nothing (i.e. removing them). Then call Replacer.Replace.
removePTags := strings.NewReplacer("<p>", "", "</p>", "")
out := removePTags.Replace(mdStr)
The final RichText function looks like this. Call it with parse.RichText.
// RichText returns a string with the formatted rich text section.
func RichText(input string) string {
	// Richtext content can be passed to markdown safely.
	md := string(blackfriday.Run([]byte(input), blackfriday.WithNoExtensions()))
	// Remove <p> and </p>.
	removePTags := strings.NewReplacer("<p>", "", "</p>", "")
	out := removePTags.Replace(md)
	// Trim whitespace.
	return strings.TrimSpace(out)
}
Which results in (reproduce it by running richtext.go):
$ go run richtext.go
This is line one.
This is line two.
This is a list:
<ul>
<li>item1</li>
<li>item2</li>
</ul>
In this section, we will learn how to use the parser and then visualize the resulting AST.
Here's a quick start:
* Create a parser with New. Pass the extensions you want, or nothing for a standard set.
* Call Parse on the *Markdown to get a *Node. The root node is of type Document.
* Call Walk on a Node to traverse the sub-tree under that node. This can be used for any node.
* Walk takes a callback function of type NodeVisitor: type NodeVisitor func(node *Node, entering bool) WalkStatus.

Most of the parsing logic happens inside the callback function.

* A node is visited twice: once when entering it, with entering == true, and again when leaving, after all its children are visited, with entering == false.
* NodeVisitor can be an in-line or anonymous function. This allows us to use the parent function's variables.
* The return value of NodeVisitor is of type WalkStatus. We can use it to control the traversal:
    * GoToNext = Default, go to next node.
    * SkipChildren = Skip all children of current node.
    * Terminate = Terminate the traversal.
* Depending on the NodeType, some extra struct fields (e.g. HeadingData) might be populated. The text content of every node (if any) is stored in Literal.

Let's do some parsing on our test data to see what the AST looks like (see parse-print.go):
// PrintNode returns a string representation of the node.
func PrintNode(n *blackfriday.Node) string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("Type: %v - ", n.Type))
	sb.WriteString(fmt.Sprintf("Title: %v - ", n.Title))
	sb.WriteString(fmt.Sprintf("Parent: %v - ", n.Parent))
	sb.WriteString(fmt.Sprintf("Literal: %v", string(n.Literal)))
	sb.WriteString("\n--------------------")
	return sb.String()
}
Output of parse-print.go
It might have been easier to just print the node as a JSON string. But each node contains a lot of children and I got a stack overflow (as in literally).
Text does not really help. Getting a visualization of the AST helps a lot more. There are probably better ways of doing it (I did not like using the global variable counter) but it works and was reasonably simple to figure out in 30 minutes or so. gographviz creates graphs for us and then returns the dot file. These are text files that can be passed to any number of implementations (including web services) to generate pictures (e.g. svg, png, etc.).
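The magic is in viz.go. At each node we:
* Create a unique ID for the node from the global counter.
* Add the node to the graph with a label. If the label attribute is set, it is displayed in the node, otherwise the ID will be used.
* Support multi-line labels by adding \n to the string and enclosing it in double quotes. See the Label function.
* Build the label from the node type and, if the node has any, its Literal.
* If the node is not root (parentID is not ""), add an edge from its parent to the node.
* Do the same for every child of the node (see the for loop in the code).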
var counter = 0

// Viz adds a node to the graph and adds an edge to its parent.
func Viz(graph *gographviz.Graph, graphName, parentID string, node *blackfriday.Node) {
	myID := strconv.Itoa(counter)
	attrs := make(map[string]string)
	attrs[string(gographviz.Label)] = Label(node)
	graph.AddNode(graphName, myID, attrs)
	// If not root, add an edge to parent.
	// TODO: How can we eliminate this check to speed things up?
	if parentID != "" {
		graph.AddEdge(parentID, myID, true, nil)
	}
	// Increase counter.
	counter++
	child := node.FirstChild
	for child != nil {
		Viz(graph, graphName, myID, child)
		child = child.Next
	}
}

// Label returns a label for the node. Label is "Node.Type\nNode.String()".
func Label(node *blackfriday.Node) string {
	var sb strings.Builder
	// We might need to add a new line to label, so we need to enclose the
	// label in double-quotes.
	sb.WriteString("\"")
	sb.WriteString(node.Type.String())
	if len(node.Literal) != 0 {
		sb.WriteString("\\n" + node.String())
	}
	sb.WriteString("\"")
	return sb.String()
}
Run viz-ast.go to get the graph.dot file, then generate the graph. Here's a 210KB copy (open in new tab and zoom, it's around 3500*800 pixels):
testData AST
The Github repository also contains a 30KB svg version. As I mentioned above, most items are children of root.
Let's do a bit more and look at headings. To make our life easier, we expand our package with some helper functions. We already know that a heading's title (if any) is its first and only child. There is one edge case: not every heading has a title.
Headings in AST
// IsHeading returns true if node is type heading.
func IsHeading(n *blackfriday.Node) bool {
	return n.Type == blackfriday.Heading
}

// HeadingTitle returns the title of the heading by returning the Literal of its
// first child.
func HeadingTitle(n *blackfriday.Node) string {
	// Check if it has a child and it's of type Text. Headings might not have titles.
	if n.FirstChild != nil && n.FirstChild.Type == blackfriday.Text {
		return string(n.FirstChild.Literal)
	}
	// This is not exactly idiomatic because successful return value should be
	// the last return. However, this looks clearer.
	return ""
}

// PrintHeading returns the information of a Heading node.
func PrintHeading(n *blackfriday.Node) string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("Heading Title: %s - ", HeadingTitle(n)))
	sb.WriteString(fmt.Sprintf("Heading Level: %d - ", n.HeadingData.Level))
	sb.WriteString(fmt.Sprintf("Heading HeadingID: %s - ", n.HeadingData.HeadingID))
	sb.WriteString(fmt.Sprintf("Heading IsTitleBlock: %v", n.HeadingData.IsTitleblock))
	sb.WriteString("\n")
	sb.WriteString(PrintNode(n))
	return sb.String()
}
I am passing the responsibility of checking the node type to the caller. Next, we modify the anonymous NodeVisitor:
func main() {
	md := blackfriday.New(blackfriday.WithNoExtensions())
	rootNode := md.Parse([]byte(testData))
	// rootNode is always of NodeType "Document" or 0.
	rootNode.Walk(func(node *blackfriday.Node, entering bool) blackfriday.WalkStatus {
		if parse.IsHeading(node) {
			fmt.Println(parse.PrintHeading(node))
		}
		return blackfriday.GoToNext
	})
}
And finally run parse-print-heading.go to see heading levels and titles:
Output of parse-print-heading.go
Armed with our knowledge of the markdown parser, we can move on to lists. To just visualize the list:
* Walk the tree until we find a node of type List with a parent of type Document.
* Pass that node to parse.Viz.
* Cancel the walk by returning blackfriday.Terminate. This works for this document, where we only have one list. In a document with multiple top-level lists, we must return blackfriday.SkipChildren instead.
func main() {
	md := blackfriday.New(blackfriday.WithNoExtensions())
	rootNode := md.Parse([]byte(testData))

	g := gographviz.NewGraph()
	g.SetName("list")
	g.SetDir(true)

	rootNode.Walk(func(node *blackfriday.Node, entering bool) blackfriday.WalkStatus {
		// Check if node has a parent, otherwise we will panic when we check
		// the parent type.
		if node.Parent != nil {
			if node.Type == blackfriday.List && node.Parent.Type == blackfriday.Document {
				parse.Viz(g, "list", "", node)
				return blackfriday.Terminate
			}
		}
		return blackfriday.GoToNext
	})

	fi, err := os.Create("graph-list.dot")
	if err != nil {
		panic(err)
	}
	defer fi.Close()
	fi.WriteString(g.String())
}
Run viz-list.go to generate graph-list.dot of the following list:
* https://example.net
* email: [email protected]
* address: 123 street name
* https://google.com
* email: blahblah
* https://parsiya.net
* https://parsiya.io
List in AST (open in new tab for full size image)
We learned quite a few tricks: how to use Blackfriday's markdown parser and how to generate graphs. The graph generation from the AST is pretty cool and I am sure it will come in handy later on.