Difference between revisions of "Hypergraph Format"
David Chiang (talk | contribs) (→JSON) |
David Chiang (talk | contribs) (→JSON) |
||
Line 26: | Line 26: | ||
* '''nodes''': a list of Node objects | * '''nodes''': a list of Node objects | ||
* '''edges''': a list of Edge objects | * '''edges''': a list of Edge objects | ||
− | * '''root''': a | + | * '''root''': a node id, which is an integer index into the '''nodes''' list |
A Node object has the following optional fields: | A Node object has the following optional fields: | ||
Line 34: | Line 34: | ||
An Edge object has the following required fields: | An Edge object has the following required fields: | ||
* '''head''': a node id | * '''head''': a node id | ||
− | * '''tails''': a (possibly empty) list of | + | * '''tails''': a (possibly empty) list of node ids |
and the following optional fields: | and the following optional fields: | ||
* '''label''': string | * '''label''': string |
Revision as of 11:40, 8 November 2010
Overall goal
Make it easy to share packed representations across NLP applications. Therefore we want a spec that is primarily easy to use from a variety of different platforms and languages. A memory efficient and fast representation is also useful.
Serialization library options
JSON
Pro:
- Implementations in every language (often packaged with language).
- Human readable
- Already used in CDec for forest output
Con:
- Space inefficient
- Requires custom parser for speed
- Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
Proposed schema:
A Forest object has the following required fields:
- nodes: a list of Node objects
- edges: a list of Edge objects
- root: a node id, which is an integer index into the nodes list
A Node object has the following optional fields:
- label: string
- features: a FeatureVector object
An Edge object has the following required fields:
- head: a node id
- tails: a (possibly empty) list of node ids
and the following optional fields:
- label: string
- features: a FeatureVector object
A FeatureVector object has arbitrary fields with float values.
Proposed extensions (yea or nay?):
When a hypergraph represents a set of trees, the Node.labels will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.tails, a string "a" would be shorthand for {label: "a"}
In Node.label, a value of null means that the tree node is labeled with epsilon, the empty string. This is not the same as ""': the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
- David Chiang: nay, this should be left to the application. An empty Edge.tails list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (t, PRO, pro, etc.).
When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued id field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a nodealiases field which is an object mapping from symbolic names to numeric ids.
Protocol Buffers
Pro:
- Conversion to and from JSON (protobuf-json)
- Very fast to read (particularly in C++ and Java, hopefully soon in python)
- Very space efficient
- Implementations in Java, C++ and Python; generates typed stubs in those languages
Con:
- No implementations for Perl, C#, or other languages commonly used by NLP folks
- Requires a separate library; adds an external dependency to spec
- "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section describing SetTotalBytesLimit on this page.
- "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."
Variation of SLF (Standard Lattice Format)
Pro:
- Blindingly fast.
- Could be implemented to work lazy/streaming.
Con:
- Requires a custom format
- Probably need specialized language bindings.