Blackfriday's Parser and Generating graphs with gographviz

Blackfriday's Parser and Generating graphs with gographviz
2018-10-29 00:31:34 Author: parsiya.net(查看原文) 阅读量:71 收藏

I have been working on a personal automation project. In short, I write most of my notes in markdown so I wanted to grab them and store them in a specific format with annotations (e.g. everything under heading deployment notes is labeled as such in the final data file). These are not high volume, large files. I have written them manually, I am talking about a 10-20 KB file (with most content being pasted code/request snippets). I am not looking for efficiency.

Blackfriday is the markdown parser for Hugo, so I was somewhat familiar with it. Since version 2, it has a markdown parser.

In this post, I am going to describe what I learned during the process and how I leveraged Blackfriday's markdown parser in some hacky ways to get annotated data. To visualize the AST (Abstract Syntax Tree) generated by Blackfriday, I used gographviz.

A simple package parse and code can be found here:

https://github.com/parsiya/Parsia-Code/tree/master/markdown-parsing

Given a markdown file with the following structure, grab data and annotate them.

The file can have multiple level-one (e.g. # title) headings.
Each level-one heading has multiple sub-headings. All sub-headings are considered level-two headings (even if they are not) and treated equally.
The content of each sub-heading must be annotated with the title of that sub-heading.
The content of sub-headings can be one of the two following forms:
- Free form: Free form text can be markdown or plaintext. It's passed to Blackfriday and the HTML result is stored in the final data file.
- Lists: Lists are used to create lists of items. Lists can have two levels. Each sub-level is associated with its previous level.

The solution will not pass coding interviews. But I decline all coding interviews anyway, checkmate ~~atheists~~ interviewers.

Test Data

I will use this data:

# Heading 1

## Heading 1-1
Content of heading 1-1.

More lines in heading 1-1.

## Heading 1-2
Content of heading 1-2.

More lines in heading 1-2.

## Heading 1-3
* https://example.net
    * email: [email protected]
    * address: 123 street name
* https://google.com
    * email: blahblah
* https://parsiya.net
* https://parsiya.io

# Heading 2

## Heading 2-1
Heading 2-1 content.

Level-One Heading and Their Content

First, we need to get the headings and their content. Using the markdown parser for this purpose did not work. In the AST, the content of each heading is not a child of the heading, instead everything is a child of root. See below for a representation of AST using gographviz.

Instead, I used regex. I am trash at regex but I somehow got this to work.

package main

import (
	"fmt"
	"regexp"
)

var testData = `
# Heading 1
... // removed
`

func main() {
	reStr := "(?m)^\\s*#{1}\\s*([^#\\n]+)$"
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatch(testData, -1)

	for _, match := range result {
		fmt.Println(match)
	}
}

The result is a [][]string. Each []string has two items, the first one is the complete line and the second is the match (just the heading). Run it in the Go playground at https://play.golang.org/p/09CPo4Cz32Z or see regex.code on Github.

[
# Heading 1 Heading 1]
[
# Heading 2 Heading 2]

This only returns the headings, but we want all the content. By switching the regex method to FindAllStringSubmatchIndex, we can get the index of these items.

func main() {
	reStr := "(?m)^\\s*#{1}\\s*([^#\\n]+)$"
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatchIndex(testData, -1)

	for _, match := range result {
		fmt.Println(match)
	}
}

Run it in the Go playground: https://play.golang.org/p/jizwSQHTRG4.

[0 12 3 12]
[338 350 341 350]

Heading: Everything between result[2] and result[3]. In other words, testData[result[2]:result[3]].
Content: Everything between the last number in one result and the first one in the next result (or until the end of the document if it's the last result).

We can also do the same for level 2 headings. Just change the regex to (?m)^\\s*#{2}\\s*([^#\\n]+)$ (the number of #s to look for). I ended up with the following function that can give me heading and content of every heading level.

// RawHeading represents a heading, raw content, and subheadings (if any).
type RawHeading struct {
	Title   string
	Content string
}

// Heading reads a markdown string and returns a slice of RawHeadings.
func Heading(content string, level int) (fi []RawHeading, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic in parse.Heading %v", r)
		}
	}()

	if level < 1 {
		level = 1
	}
	// Split into different sections.
    // TODO: Find better regex.
    // Narrator voice: This never happened.
	reStr := fmt.Sprintf("(?m)^\\s*#{%d}\\s*([^#\\n]+)$", level)
	re := regexp.MustCompile(reStr)
	result := re.FindAllStringSubmatchIndex(content, -1)

	/*
    Returns slices of four ints.
    First two are the complete heading, including the #.
    Last two are only the heading name.
    The rest of the heading will be from the last number of one to start of the next.
    I will forget how this works, but it works. Don't touch it future Parsia.
	*/
	for i := range result {
		var raw RawHeading
		section := result[i]
		headingTextStart := section[2]
		headingTextEnd := section[3]

		raw.Title = content[headingTextStart:headingTextEnd]

		var startOfNextHeading int
		// Check for last item, last item continues to the end.
		if i == len(result)-1 {
			startOfNextHeading = len(content) - 1
		} else {
			startOfNextHeading = result[i+1][0]
		}
		// Trim whitespace from start and ending of content.
		raw.Content = strings.TrimSpace(content[section[3]:startOfNextHeading])
		fi = append(fi, raw)
	}
	return fi, nil
}

Note the name of the package. It might not make sense in the main but all these functions are part of a package parse. When using them outside, we call parse.Heading(...) which sounds nice. For more information on package naming conventions, please read this Golang.org blog post: https://blog.golang.org/package-names.

Running it gives us what we want: https://play.golang.org/p/YncrOEfiBxC.

func main() {
	levelOnes, err := Heading(testData, 1)
	if err != nil {
		panic(err)
	}
	for _, l1 := range levelOnes {
		fmt.Println(l1.Title)
		fmt.Println(l1.Content)
		fmt.Println("--------------------")
	}
}

And the result:

Heading 1
## Heading 1-1
Content of heading 1-1.

More lines in heading 1-1.

## Heading 1-2
Content of heading 1-2.

More lines in heading 1-2.

## Heading 1-3
* https://example.net
    * email: [email protected]
    * address: 123 street name
* https://google.com
    * email: blahblah
* https://parsiya.net
* https://parsiya.io
--------------------
Heading 2
## Heading 2-1
Heading 2-1 content.
--------------------

This function also works for sub-headings, meaning we can pass .Content of level-ones to get level-twos: https://play.golang.org/p/ugfY0D_nEWT or heading.go on Github.

func main() {
	levelOnes, err := Heading(testData, 1)
	if err != nil {
		panic(err)
	}
	for _, l1 := range levelOnes {
		fmt.Println("Level 1 title:", l1.Title)
		levelTwos, _ := Heading(l1.Content, 2)
		for _, l2 := range levelTwos {
			fmt.Println("Level 2 title:", l2.Title)
			fmt.Println("Level 2 content:", l2.Content)
			fmt.Println("********************")
		}
		fmt.Println("--------------------")
	}
}

Result:

Level 1 title: Heading 1
Level 2 title: Heading 1-1
Level 2 content: Content of heading 1-1.

More lines in heading 1-1.
********************
Level 2 title: Heading 1-2
Level 2 content: Content of heading 1-2.

More lines in heading 1-2.
********************
Level 2 title: Heading 1-3
Level 2 content: * https://example.net
    * email: [email protected]
    * address: 123 street name
* https://google.com
    * email: blahblah
* https://parsiya.net
* http://parsiya.i
********************
--------------------
Level 1 title: Heading 2
Level 2 title: Heading 2-1
Level 2 content: Heading 2-1 content
********************
--------------------

Problem (somewhat) solved.

Markdown to HTML with Blackfriday

This was perhaps the easiest part of the task. Most content in that file is either used as plaintext or passed directly to Blackfriday for parsing to HTML. When importing blackfriday, be sure to use version 2.0. You can either use the new Go modules or use the pinned package at https://gopkg.in/russross/blackfriday.v2.

To generate HTML, use Run. Pass the text as a byte slice, indicate what kind of extensions to use (or pass nothing for a set of standard extensions), and get the HTML bytes.

mdBytes := blackfriday.Run([]byte(input), blackfriday.WithNoExtensions())
mdStr := string(mdBytes)

We also want to remove the <p> and </p> tags from the result. It can be done with a strings.Replacer. Pass strings in pairs where first one is the match and next is the replacement. In this code snippet, we are replacing both tags with nothing (i.e. removing them). Then call Replacer.Replace.

removePTags := strings.NewReplacer("<p>", "", "</p>", "")
out := removePTags.Replace(md)

The final RichText function looks like this. Call it with parse.RichText.

// RichText returns a string with the formatted rich text section.
func RichText(input string) string {
	// Richtext content can be passed to markdown safely.
	md := string(blackfriday.Run([]byte(input), blackfriday.WithNoExtensions()))
	// Remove <p> and </p>.
	removePTags := strings.NewReplacer("<p>", "", "</p>", "")
	out := removePTags.Replace(md)
	// Trim whitespace.
	return strings.TrimSpace(out)
}

Which results in (reproduce it by running richtext.go):

$ go run richtext.go
This is line one.

This is line two.

This is a list:

<ul>
<li>item1</li>
<li>item2</li>
</ul>

Blackfriday's Markdown AST

In this section, we will learn how to use the parser and then visualize the resulting AST.

Here's a quick start:

Use the New function to get a *Markdown pointer. Pass options for markdown extensions to New or nothing for a standard set.
Use the Parse method on the *Markdown to get a *Node.
- This node points to the root of the AST. This root node will always be of NodeType Document.
Use the Walk on a Node to traverse the sub-tree under that node. This can be used for any node.
This method has a callback function of type NodeVisitor or type NodeVisitor func(node *Node, entering bool) WalkStatus.

NodeVisitor

Most of the parsing logic happens inside the callback function.

It is called twice for every node, one when first visiting it with entering == true and the other when leaving after all its children are visited with entering == false.
- Useful when you want to gather information about all children of a node.
NodeVisitor can be an in-line or anonymous function. This allows us to use the parent function variables.
The return value of NodeVisitor is of type WalkStatus. We can use it to control the parser.
- GotoNext = Default, go to next node.
- SkipChildren = Skip all children of current node.
- Terminate = Terminate the traversal.
Depending on NodeType some extra struct fields (e.g. HeadingData) might be populated. The text content of every node (if any) is stored in Literal.

Print Everything

Let's do some parsing on our test data to see how the AST looks like (see parse-print.go):

// PrintNode returns a string representation of the node.
func PrintNode(n *blackfriday.Node) string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("Type: %v - ", n.Type))
	sb.WriteString(fmt.Sprintf("Title: %v - ", n.Title))
	sb.WriteString(fmt.Sprintf("Parent: %v - ", n.Parent))
	sb.WriteString(fmt.Sprintf("Literal: %v", string(n.Literal)))
	sb.WriteString("\n--------------------")
	return sb.String()
}

Output of parse-print.go

It might have been easier to just print the node as a JSON string. But each node contains a lot of children and I got a stack overflow (as in literally).

AST Visualization with gographviz

Text does not really help. Getting a visualization of the AST helps a lot more. There are probably better ways of doing it (I did not like using the global variable counter) but it works and was reasonably simple to figure out in 30 minutes or so. gographviz creates graphs for us and then returns the dot file. These are text files that can be passed to any number of implementations (including web services) to generate pictures (e.g. svg, png, etc.).

The magic is in viz.go. At each node we:

Create a unique ID for this node using counter.
Add the node's type and the first 16 chars of text (if any) to the label.
- If added, label attribute is displayed in the node, otherwise ID will be used.
- We can add new lines to labels in by adding \n to the string and enclosing them in double quotes. See Label function.
- Node.String() returns the first 16 chars of Literal.
Add the node to the graph with the label and ID.
If the node is not root (root's parent ID is passed with ""), add an edge from its parent to the node.
Increase counter.

Do the same for every child of the node (see for in code).

var counter = 0

// Viz adds a node to the graph and adds an edge to its parent.
func Viz(graph *gographviz.Graph, graphName, parentID string, node *blackfriday.Node) {

	myID := strconv.Itoa(counter)
	attrs := make(map[string]string)
	attrs[string(gographviz.Label)] = Label(node)
	graph.AddNode(graphName, myID, attrs)
	// If not root, add an edge to parent.
	// TODO: How can we eliminate this check to speed things up?
	if parentID != "" {
		graph.AddEdge(parentID, myID, true, nil)
	}
	// Increase counter.
	counter++
	child := node.FirstChild
	for child != nil {
		Viz(graph, graphName, myID, child)
		child = child.Next
	}
}

// Label returns a label for the node. Label is "Node.Type\n\Node.String()".
func Label(node *blackfriday.Node) string {
	var sb strings.Builder
	// We might need to add a new line to label, so we need to enclose the
	// label in double-quotes.
	sb.WriteString("\"")
	sb.WriteString(node.Type.String())
	if len(node.Literal) != 0 {
		sb.WriteString("\\n" + node.String())
	}
	sb.WriteString("\"")
	return sb.String()
}

Run viz-ast.go to get the graph.dot file, then generate the graph. Here's a 210KB copy (open in new tab and zoom, it's around 3500*800 pixels):

testData AST

The Github repository also contains a 30KB svg version. As I mentioned above, most items are children of root.

A Closer Look at Headings

Let's do a bit more and look at headings. To make our life easier, we expand our package with some helper functions. We already know heading's title (if any) is its first and only child. There's an edge case, not every heading has a title.

Headings in AST

// IsHeading returns true if node is type heading.
func IsHeading(n *blackfriday.Node) bool {
	return n.Type == blackfriday.Heading
}

// HeadingTitle returns the title of the heading by returning the Literal of its
// first child.
func HeadingTitle(n *blackfriday.Node) string {
	// Check if it has a child and its of type Text. Headings might not have titles.
	if n.FirstChild != nil && n.FirstChild.Type == blackfriday.Text {
		return string(n.FirstChild.Literal)
	}
	// This is not exactly idiomatic because successful return value should be
	// the last return. However, this looks clearer.
	return ""
}

// PrintHeading returns the information of a Heading node.
func PrintHeading(n *blackfriday.Node) string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("Heading Title: %s - ", HeadingTitle(n)))
	sb.WriteString(fmt.Sprintf("Heading Level: %d - ", n.HeadingData.Level))
	sb.WriteString(fmt.Sprintf("Heading HeadingID: %s - ", n.HeadingData.HeadingID))
	sb.WriteString(fmt.Sprintf("Heading IsTitleBlock: %v", n.HeadingData.IsTitleblock))
	sb.WriteString("\n")
	sb.WriteString(PrintNode(n))

	return sb.String()
}

I am passing the responsibility of checking the node type to the caller. Next, we modify the anonymous NodeVisitor:

func main() {
	md := blackfriday.New(blackfriday.WithNoExtensions())
	rootNode := md.Parse([]byte(testData))
	// rootNode is always of NodeType "Document" or 0.

	rootNode.Walk(func(node *blackfriday.Node, entering bool) blackfriday.WalkStatus {
		if parse.IsHeading(node) {
			fmt.Println(parse.PrintHeading(node))
		}
		return blackfriday.GoToNext
	})
}

And finally run parse-print-heading.go to see heading levels and titles:

Output of parse-print-heading.go

A Closer Look at Lists

Armed with our knowledge of markdown parser, we can move on to lists. To just visualize the list:

Traverse the tree until we get to a node of type List with a parent of type Document.
Pass the node to parse.Viz.

Cancel walk by return blackfriday.Terminate. This works for this document where we only have one list, in a document with multiple top-level lists, we must return blackfriday.SkipChildren.

func main() {
	md := blackfriday.New(blackfriday.WithNoExtensions())
	rootNode := md.Parse([]byte(testData))

	g := gographviz.NewGraph()
	g.SetName("list")
	g.SetDir(true)

	rootNode.Walk(func(node *blackfriday.Node, entering bool) blackfriday.WalkStatus {
		// Check if node has a parent, otherwise we will panic when we check
		// the panret type.
		if node.Parent != nil {
			if node.Type == blackfriday.List && node.Parent.Type == blackfriday.Document {
				parse.Viz(g, "list", "", node)
				return blackfriday.Terminate
			}
		}
		return blackfriday.GoToNext
	})
	fi, err := os.Create("graph-list.dot")
	if err != nil {
		panic(err)
	}
	defer fi.Close()
	fi.WriteString(g.String())
}

Run viz-list.go to generate graph-list.dot of the following list:

* https://example.net
    * email: [email protected]
    * address: 123 street name
* https://google.com
    * email: blahblah
* https://parsiya.net
* https://parsiya.io

List in AST (open in new tab for full size image

We learned quite a few tricks. Learned how to use Blackfriday's markdown parser and how to generate graphs. The graph generation from AST is pretty cool and I am sure will come handy later on.

文章来源: https://parsiya.net/blog/2018-10-28-blackfridays-parser-and-generating-graphs-with-gographviz/
如有侵权请联系:admin#unsafe.sh