unhtml: HTML unmarshaler
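I recently needed to write a Go project that parses HTML, and found a library online called goquery. Its API is quite nice and it supports essentially all CSS selectors, but writing that kind of code by hand still gets tedious. So I wondered: why can't we just call Unmarshal(HTML) directly, the way Go's json and xml libraries work?
So I spent two days hacking together unhtml (GitHub: github.com/Hexilee/unhtml).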
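For readers less familiar with the pattern being borrowed, here is a minimal sketch of tag-driven decoding with the standard encoding/json package. The User type, its json tags, and the decodeUser helper are illustrative assumptions and do not come from unhtml.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// User is an illustrative type; the json tags say which JSON key
// feeds which field.
type User struct {
	Name      string `json:"name"`
	Age       uint   `json:"age"`
	LikeLemon bool   `json:"like_lemon"`
}

// decodeUser is a hypothetical helper that fills a User from raw JSON
// using nothing but the struct tags above.
func decodeUser(data []byte) (User, error) {
	var u User
	err := json.Unmarshal(data, &u)
	return u, err
}

func main() {
	u, err := decodeUser([]byte(`{"name":"Hexilee","age":20,"like_lemon":true}`))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", u) // prints {Name:Hexilee Age:20 LikeLemon:true}
}
```

The appeal is that the mapping lives in the type declaration, so no per-field parsing code has to be written.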
Example & Performance
Suppose we have this HTML:
    var AllTypeHTML = []byte(`
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title</title>
    </head>
    <body>
        <div id="test">
            <ul>
                <li>0</li>
                <li>1</li>
                <li>2</li>
                <li>3</li>
            </ul>
            <div>
                <p>Hexilee</p>
                <p>20</p>
                <p>true</p>
            </div>
            <p>Hello World!</p>
            <p>10</p>
            <p>3.14</p>
            <p>true</p>
        </div>
    </body>
    </html>
    `)
If you want to parse it into a struct:
    package example

    type (
        PartTypesStruct struct {
            Slice   []int
            Struct  TestUser
            String  string
            Int     int
            Float64 float64
            Bool    bool
        }

        TestUser struct {
            Name      string
            Age       uint
            LikeLemon bool
        }
    )
Using goquery directly, you would have to write this:
    package example

    import (
        "bytes"
        "github.com/PuerkitoBio/goquery"
        "strconv"
    )

    func parsePartTypesLogically() (PartTypesStruct, error) {
        doc, err := goquery.NewDocumentFromReader(bytes.NewReader(AllTypeHTML))
        partTypes := PartTypesStruct{}
        if err == nil {
            selection := doc.Find(partTypes.Root())
            partTypes.Slice = make([]int, 0)
            selection.Find(`ul > li`).Each(func(i int, selection *goquery.Selection) {
                Int, parseErr := strconv.Atoi(selection.Text())
                if parseErr != nil {
                    err = parseErr
                }
                partTypes.Slice = append(partTypes.Slice, Int)
            })
            if err == nil {
                partTypes.Struct.Name = selection.Find(`#test > div > p:nth-child(1)`).Text()
                Int, parseErr := strconv.Atoi(selection.Find(`#test > div > p:nth-child(2)`).Text())
                if err = parseErr; err == nil {
                    partTypes.Struct.Age = uint(Int)
                    Bool, parseErr := strconv.ParseBool(selection.Find(`#test > div > p:nth-child(3)`).Text())
                    if err = parseErr; err == nil {
                        partTypes.Struct.LikeLemon = Bool
                        String := selection.Find(`#test > p:nth-child(3)`).Text()
                        Int, parseErr := strconv.Atoi(selection.Find(`#test > p:nth-child(4)`).Text())
                        if err = parseErr; err != nil {
                            return partTypes, err
                        }
                        Float64, parseErr := strconv.ParseFloat(selection.Find(`#test > p:nth-child(5)`).Text(), 0)
                        if err = parseErr; err != nil {
                            return partTypes, err
                        }
                        Bool, parseErr := strconv.ParseBool(selection.Find(`#test > p:nth-child(6)`).Text())
                        if err = parseErr; err != nil {
                            return partTypes, err
                        }
                        partTypes.String = String
                        partTypes.Int = Int
                        partTypes.Float64 = Float64
                        partTypes.Bool = Bool
                    }
                }
            }
        }
        return partTypes, err
    }
That is painful to write.
Now all you need to write is this:
    package main

    import (
        "encoding/json"
        "fmt"

        "github.com/Hexilee/unhtml"
    )

    type (
        PartTypesStruct struct {
            Slice   []int    `html:"ul > li"`
            Struct  TestUser `html:"#test > div"`
            String  string   `html:"#test > p:nth-child(3)"`
            Int     int      `html:"#test > p:nth-child(4)"`
            Float64 float64  `html:"#test > p:nth-child(5)"`
            Bool    bool     `html:"#test > p:nth-child(6)"`
        }

        TestUser struct {
            Name      string `html:"p:nth-child(1)"`
            Age       uint   `html:"p:nth-child(2)"`
            LikeLemon bool   `html:"p:nth-child(3)"`
        }
    )

    // Root returns the selector of the element to start from.
    func (PartTypesStruct) Root() string {
        return "#test"
    }

    func main() {
        // AllTypeHTML is the byte slice shown earlier.
        allTypes := PartTypesStruct{}
        if err := unhtml.Unmarshal(AllTypeHTML, &allTypes); err != nil {
            fmt.Println("unmarshal failed:", err)
            return
        }
        result, _ := json.Marshal(&allTypes)
        fmt.Println(string(result))
    }
and you get the result (pretty-printed here for readability):
    {
        "Slice": [
            0,
            1,
            2,
            3
        ],
        "Struct": {
            "Name": "Hexilee",
            "Age": 20,
            "LikeLemon": true
        },
        "String": "Hello World!",
        "Int": 10,
        "Float64": 3.14,
        "Bool": true
    }
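To make the tag-driven approach a little less magical, here is a minimal sketch of how a decoder can walk a struct with reflection and fill its fields from `html` tags. This is not unhtml's actual implementation: decode and Page are hypothetical names, and the lookup map (selector to text) stands in for a real CSS selector engine such as goquery.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// decode is a hypothetical helper. For every settable field of the struct
// that dst points to, it reads the `html` tag, looks the selector up in
// lookup, and converts the text to the field's type.
func decode(lookup map[string]string, dst interface{}) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Struct {
		return fmt.Errorf("dst must be a non-nil pointer to a struct")
	}
	v = v.Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		selector, tagged := t.Field(i).Tag.Lookup("html")
		text, found := lookup[selector]
		field := v.Field(i)
		if !tagged || !found || !field.CanSet() {
			continue // skip untagged, unmatched, or unexported fields
		}
		switch field.Kind() {
		case reflect.String:
			field.SetString(text)
		case reflect.Int:
			n, err := strconv.ParseInt(text, 10, 64)
			if err != nil {
				return err
			}
			field.SetInt(n)
		case reflect.Float64:
			f, err := strconv.ParseFloat(text, 64)
			if err != nil {
				return err
			}
			field.SetFloat(f)
		case reflect.Bool:
			b, err := strconv.ParseBool(text)
			if err != nil {
				return err
			}
			field.SetBool(b)
		default:
			return fmt.Errorf("unsupported kind %s for field %s", field.Kind(), t.Field(i).Name)
		}
	}
	return nil
}

// Page is an illustrative type reusing selectors from the example above.
type Page struct {
	Title string  `html:"#test > p:nth-child(3)"`
	Count int     `html:"#test > p:nth-child(4)"`
	Ratio float64 `html:"#test > p:nth-child(5)"`
	Flag  bool    `html:"#test > p:nth-child(6)"`
}

func main() {
	lookup := map[string]string{
		"#test > p:nth-child(3)": "Hello World!",
		"#test > p:nth-child(4)": "10",
		"#test > p:nth-child(5)": "3.14",
		"#test > p:nth-child(6)": "true",
	}
	var p Page
	if err := decode(lookup, &p); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", p) // prints {Title:Hello World! Count:10 Ratio:3.14 Flag:true}
}
```

Every field costs a tag lookup, a kind switch, and a reflective set at run time, which is the price paid for not writing the per-field code by hand.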
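Development efficiency goes up a lot! But it undoubtedly relies heavily on reflection, which raises concerns about its runtime performance. So I wrote two benchmarks: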
    func BenchmarkUnmarshalPartTypes(b *testing.B) {
        assert.NotNil(b, AllTypeHTML)
        for i := 0; i < b.N; i++ {
            partTypes := PartTypesStruct{}
            assert.Nil(b, Unmarshal(AllTypeHTML, &partTypes))
        }
    }

    func BenchmarkParsePartTypesLogically(b *testing.B) {
        assert.NotNil(b, AllTypeHTML)
        for i := 0; i < b.N; i++ {
            _, err := parsePartTypesLogically()
            assert.Nil(b, err)
        }
    }
Results:
    > go test -bench=.
    goos: darwin
    goarch: amd64
    pkg: github.com/Hexilee/unhtml
    BenchmarkUnmarshalPartTypes-4          30000    54096 ns/op
    BenchmarkParsePartTypesLogically-4     30000    45188 ns/op
    PASS
    ok      github.com/Hexilee/unhtml    4.098s
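Runtime performance is somewhat lower (54096 ns/op versus 45188 ns/op, roughly 20% slower), but this is only a small HTML document for demonstration and testing. When parsing more complex, real-world HTML, the two approaches perform very similarly.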
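As a quick check on the numbers reported above, the relative overhead can be computed directly from the two ns/op figures. The overheadPercent helper is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// overheadPercent reports how much slower a is than b, as a percentage of b.
func overheadPercent(a, b float64) float64 {
	return (a - b) / b * 100
}

func main() {
	const unmarshalNs = 54096.0 // BenchmarkUnmarshalPartTypes, ns/op
	const manualNs = 45188.0    // BenchmarkParsePartTypesLogically, ns/op
	fmt.Printf("%.1f%% slower\n", overheadPercent(unmarshalNs, manualNs)) // prints 19.7% slower
}
```

In absolute terms the gap is about 9 microseconds per document on this input.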
For caveats and additional features, see the README.