Hypergraph Format
Latest revision as of 04:15, 25 June 2012
Overall goal
The goal is to make it easy to share packed representations across NLP applications. The spec should therefore be, first of all, easy to use from a variety of platforms and languages. A memory-efficient and fast representation is also useful.
Serialization library options
JSON
JSON description: http://www.json.org/
Pro:
- Implementations in every language (often packaged with language).
- Human readable
- Already used in cdec for forest output
Con:
- Space inefficient
- Requires custom parser for speed
- Need additional code to check for well-formed hypergraphs, since there is no schema for JSON objects
- Some languages (e.g., Python) do not natively support event-driven parsers for JSON, so it is hard to process a JSON file without first loading the entire thing. Since parse forests can be big in real applications, event-driven parsers that construct a hypergraph library's internal data structure are crucial. For example, loading the example hypergraph using Python's json.load() takes almost 10 minutes and 7 GB of memory. I understand the desire for familiarity and simplicity, but this scaling behavior makes me worry that the format won't be usable for real applications.
  - For Python, yajl + ijson (http://pypi.python.org/pypi/ijson/) or yajl-py (http://pykler.github.com/yajl-py/) might address these concerns.
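To make the streaming concern concrete, here is a minimal sketch of reading Edge objects one at a time with ijson. The function name iter_edges is made up for illustration, and the fallback to json.load() (used when ijson is not installed) loads the whole document and so gives no memory savings. It assumes the top-level object has an "edges" list as in the proposed schema.

```python
import io
import json


def iter_edges(fileobj):
    """Yield Edge objects one at a time from a binary file holding a Forest."""
    try:
        import ijson  # third-party streaming JSON parser; optional here
    except ImportError:
        # Fallback: parses the entire document at once (no streaming).
        yield from json.load(fileobj).get("edges", [])
        return
    # "edges.item" selects each element of the top-level "edges" array.
    yield from ijson.items(fileobj, "edges.item")


# Tiny in-memory example; a real forest would be read from a file opened in "rb" mode.
example = io.BytesIO(
    b'{"nodes": [{"label": "S"}, {"label": "a"}],'
    b' "edges": [{"head": 0, "tails": [1]}], "root": 0}'
)
for edge in iter_edges(example):
    print(edge["head"], edge["tails"])
```

A hypergraph library would replace the print call with code that adds each edge to its own internal data structure, so the parsed JSON never has to exist in memory all at once.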
Proposed schema:
A Forest object has the following required fields:
- nodes: a list of Node objects
- edges: a list of Edge objects
- root: a node id, which is an integer index into the nodes list
An Edge object has the following required fields:
- head: a node id
- tails: a (possibly empty) list of node ids
A Node or Edge object has the following optional fields:
- label: string
- features: a FeatureVector object
and any other application-specific fields.
A FeatureVector object has arbitrary fields with float values.
Example: http://www.isi.edu/~chiang/software/forest/example
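Since JSON has no schema, the well-formedness check has to be written by hand. The following is a rough sketch of what such a check might look like for the required fields listed above; validate_forest is a hypothetical name, and the sketch only covers required fields and node ids, not labels or features.

```python
def validate_forest(forest):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for field in ("nodes", "edges", "root"):
        if field not in forest:
            problems.append(f"missing required field: {field}")
    if problems:
        return problems

    num_nodes = len(forest["nodes"])

    def is_node_id(x):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(x, int) and not isinstance(x, bool) and 0 <= x < num_nodes

    if not is_node_id(forest["root"]):
        problems.append("root is not a valid node id")
    for i, edge in enumerate(forest["edges"]):
        if not is_node_id(edge.get("head")):
            problems.append(f"edge {i}: bad or missing head")
        tails = edge.get("tails")
        if not isinstance(tails, list):
            problems.append(f"edge {i}: tails must be a list")
        elif not all(is_node_id(t) for t in tails):
            problems.append(f"edge {i}: tails contains an invalid node id")
    return problems


forest = {
    "nodes": [{"label": "S"}, {"label": "a"}],
    "edges": [{"head": 0, "tails": [1]}],
    "root": 0,
}
print(validate_forest(forest))
```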
Proposed extensions (yea or nay?) / Open questions
- (ChrisD) question: some libraries don't represent node-level features internally (at least Joshua & cdec), so these would need to denormalize node-level features to either all incoming or outgoing edges of the node in question. This may not be completely straightforward to do. Should we possibly consider just edge-level features?
  - David: Since edges are not shared, it should always be easy to propagate node features down to its edges (i.e., to the edges which it is the head of). I would favor, however, eliminating node features.
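The propagation David describes could look roughly like the sketch below. It assumes features combine additively (as in a linear model), which the discussion does not state, and the function name is hypothetical. Nodes that are not the head of any edge keep their features, since there is nowhere to push them.

```python
def push_node_features_to_edges(forest):
    """Add each node's features to every edge it heads, modifying forest in place."""
    nodes = forest["nodes"]
    heads = set()
    for edge in forest["edges"]:
        heads.add(edge["head"])
        node_feats = nodes[edge["head"]].get("features")
        if not node_feats:
            continue
        edge_feats = edge.setdefault("features", {})
        for name, value in node_feats.items():
            # Assumption: feature values are summed when combined.
            edge_feats[name] = edge_feats.get(name, 0.0) + value
    # Drop node-level features only where they were copied to at least one edge.
    for node_id in heads:
        nodes[node_id].pop("features", None)
    return forest
```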
- When a hypergraph represents a set of trees, the Node.labels will be the labels of the tree nodes. It might be convenient to allow a shorthand for leaf/terminal labels: in Edge.tails, a string "a" would be shorthand for {label: "a"}
  - David: yea
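If the shorthand were adopted, a reader could normalize it away on load. A minimal sketch, assuming each string in Edge.tails becomes its own fresh Node (the proposal does not say whether identical strings should share a node); the function name is made up for illustration.

```python
def expand_terminal_shorthand(forest):
    """Replace string entries in Edge.tails with ids of newly created Nodes."""
    nodes = forest["nodes"]
    for edge in forest["edges"]:
        new_tails = []
        for tail in edge["tails"]:
            if isinstance(tail, str):
                # "a" is shorthand for {label: "a"}: append the node, refer to it by index.
                nodes.append({"label": tail})
                tail = len(nodes) - 1
            new_tails.append(tail)
        edge["tails"] = new_tails
    return forest
```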
- In Node.label, a value of null means that the tree node is labeled with epsilon, the empty string. This is not the same as "": the former would not contribute anything to the yield of the tree, whereas the latter would contribute a token of length 0.
  - David: nay, this should be left to the application. An empty Edge.tails list has the same effect. And people who care about explicit empty nodes might want to distinguish several kinds of empty nodes (t, PRO, pro, etc.).
  - ChrisD: nay. agree with David.
- When a hypergraph represents a CFG, the Nodes will be the nonterminal symbols and the Edges will be the productions. It will be ugly for numeric Node ids to appear in the productions, so symbolic names might be preferable. Perhaps a Node object can have a string-valued id field by which it can be referred to. Con: who is going to guarantee that the names are unique? Alternatively, a Forest object can have a nodealiases field which is an object mapping from symbolic names to numeric ids.
  - ChrisD: I'm in favor of referring to nodes/nonterminals with a numeric id for consistency enforcement (which is admittedly ugly), but supporting optional string aliases/labels for applications that care about such things.
  - David: on the fence about this one
- Another possible extension for edges: it may be useful to encode synchronous forests (for example, imagine the forest of derivations over an input lattice). Do we want an optional alt_tails field, or a vector of tails (one per language)?
  - David: IMO that would take us beyond hypergraphs. But nothing would stop you from adding your own fields:
      { head: 123, tails: [456, 789],
        french: ["le", 456, "que", 789],
        english: ["the", 456, "that", 789],
        chinese: [789, "de", 456] }
    I don't think the standard needs to specify exactly how this is done.
- What do we think about non-coaccessible states? Is a forest well-formed if it contains elements that cannot be reached from the root?
  - David: yes, I don't think the format should care
  - ChrisQ: agree. Provider of forest should ensure that it is well formed
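Even if the format itself does not care, a provider or consumer may want to find the unreachable elements. The sketch below collects the node ids reachable from the root by following edges from head to tails; the function name is hypothetical.

```python
from collections import defaultdict


def reachable_from_root(forest):
    """Return the set of node ids reachable from forest["root"]."""
    incoming = defaultdict(list)  # node id -> edges that have it as head
    for edge in forest["edges"]:
        incoming[edge["head"]].append(edge)

    seen = {forest["root"]}
    stack = [forest["root"]]
    while stack:
        node = stack.pop()
        for edge in incoming[node]:
            for tail in edge["tails"]:
                if tail not in seen:
                    seen.add(tail)
                    stack.append(tail)
    return seen
```

Any node id not in the returned set, and any edge whose head is not in it, cannot take part in a tree rooted at root.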
- Should Edge have an optional weight field? logweight?
  - David: yea, I think it should be called weight, and the weight of the forest is the sum-product: the sum, over all trees in the forest, of the product of the weights of the edges in each tree.
  - ChrisQ: agree.
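As an illustration of the sum-product reading of weight, here is a small sketch that computes the forest weight recursively. It rests on several assumptions not settled on this page: a missing weight counts as 1.0, a node with no incoming edges counts as a leaf with weight 1.0, and the forest is acyclic (the sketch raises an error on a cycle and does not attempt to handle one).

```python
from collections import defaultdict


def forest_weight(forest):
    """Sum over trees rooted at root of the product of edge weights."""
    incoming = defaultdict(list)  # node id -> edges that have it as head
    for edge in forest["edges"]:
        incoming[edge["head"]].append(edge)

    inside = {}          # memoized weight of each node
    in_progress = set()  # nodes on the current recursion path

    def visit(node):
        if node in inside:
            return inside[node]
        if node in in_progress:
            raise ValueError("cycle detected; this sketch handles acyclic forests only")
        in_progress.add(node)
        if not incoming[node]:
            total = 1.0  # assumption: a node heading no edges is a leaf
        else:
            total = 0.0
            for edge in incoming[node]:
                product = edge.get("weight", 1.0)  # assumption: default weight 1.0
                for tail in edge["tails"]:
                    product *= visit(tail)
                total += product
        in_progress.discard(node)
        inside[node] = total
        return total

    return visit(forest["root"])
```

Deep forests may hit Python's recursion limit; an iterative traversal in topological order would avoid that.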
- Should an Edge with empty tails be allowed? If so, should the following two forests be considered equivalent:
    { nodes: [ { label: "a" } ],
      edges: [ ] }

    { nodes: [ { label: "a" } ],
      edges: [ { head: 0, tails: [ ] } ] }
  - David: yes, tailless edges should be allowed, otherwise it's not nice to represent the set of trees { (a b) , (a (b c)) }. But the two example forests above should be considered equivalent since they generate the same set of trees.
- A tree can be represented as a Forest where every node has only one incoming edge, but is there any desire for a more concise representation of a tree belonging to an existing Forest? Like a list of Edge ids?
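One way the "list of Edge ids" idea might work, purely as a sketch: take an Edge id to be the index of the edge in the edges list (the schema above does not define edge ids, so this is an assumption), and let the list pick at most one edge per node. The helper below, with a made-up name, prints the selected tree in bracketed form.

```python
def tree_to_string(forest, edge_ids):
    """Render the tree selected by edge_ids (indices into forest["edges"])."""
    chosen = {}  # node id -> the single edge selected for that node
    for edge_id in edge_ids:
        edge = forest["edges"][edge_id]
        if edge["head"] in chosen:
            raise ValueError(f"more than one edge selected for node {edge['head']}")
        chosen[edge["head"]] = edge

    def render(node):
        # Assumes the selected edges contain no cycle.
        label = forest["nodes"][node].get("label", str(node))
        edge = chosen.get(node)
        if edge is None or not edge["tails"]:
            return label
        children = [render(tail) for tail in edge["tails"]]
        return "(" + " ".join([label] + children) + ")"

    return render(forest["root"])


# The set of trees { (a b), (a (b c)) } packed into one forest.
forest = {
    "nodes": [{"label": "a"}, {"label": "b"}, {"label": "b"}, {"label": "c"}],
    "edges": [
        {"head": 0, "tails": [1]},
        {"head": 0, "tails": [2]},
        {"head": 2, "tails": [3]},
    ],
    "root": 0,
}
print(tree_to_string(forest, [0]))
print(tree_to_string(forest, [1, 2]))
```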
- Can/should we require that edges come after nodes?
Protocol Buffers
Protocol Buffers description: http://code.google.com/p/protobuf/
Implementation sketch: http://github.com/srush/hypergraph
Pro:
- Conversion to and from JSON (protobuf-json: http://code.google.com/p/protobuf-json/)
- Very fast to read (particularly in C++ and Java, hopefully soon in Python)
- Very space efficient
- Implementations in Java, C++ and Python; generates typed stubs in those languages
Con:
- No implementations for Perl, C#, or other languages commonly used by NLP folks
- Requires a separate library; adds an external dependency to spec
- "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory". Some of the limits are described in the section on SetTotalBytesLimit at http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html.
- "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."
Variation of SLF (Standard Lattice Format)
SLF specification: http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html
Pro:
- Blindingly fast.
- Could be implemented to work lazy/streaming.
Con:
- Requires a custom format
- Probably need specialized language bindings.
Tiburon Format
Tiburon specification: http://www.isi.edu/licensed-sw/tiburon/
Bindings/Libraries/Software
Python
- (David) this module could be easily adapted to the new format: http://www.isi.edu/~chiang/software/forest/forest.py. In 400 lines it has an Earley-style parser that does inside-outside with correct handling of cycles, and lazy k-best derivations.
C++
- (ChrisD) I'll add support for whatever we come up with to the cdec hypergraph library. This should make my hypergraph MERT available to non-cdec decoders with very little overhead.
Software
- (David) I've written a web app for visually exploring forests that I will move over to the new format if it's JSON (currently it uses XML). Please e-mail me for the link if you want to play; it uses a lot of bandwidth.