Difference between revisions of "Hypergraph Format"

From ACL Wiki
Line 6: Line 6:
  * Implementations in every language (often packaged with language).
  * Human readable
+ * Already used in CDec for forest output

  Con:
Line 21: Line 22:
  * Very space efficient
  * Implementations in every language (although requires a separate library)
+ * Automatically generates typed stubs

  Con:
Line 28: Line 30:
  * "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

- == SLF (Standard Lattice Format) ==
+ == Variation of SLF (Standard Lattice Format) ==

  [http://labrosa.ee.columbia.edu/doc/HTKBook21/node257.html SLF Specification]
+
+ Pro:
+ * Blindingly fast.
+ * Could be implemented to work lazy/streaming.
+
+ Con:
+ * Requires a custom format
+ * Probably need specialized language bindings.
+
+ == Tiburon Format ==
+
+ [http://www.isi.edu/licensed-sw/tiburon/ Tiburon Specification]
Revision as of 19:49, 6 November 2010

JSON

JSON Description

Pro:

  • Implementations in every language (often packaged with language).
  • Human readable
  • Already used in CDec for forest output

Con:

  • Space inefficiency
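
As a rough illustration of the trade-off above, here is a minimal sketch of a hypergraph written to and read from JSON using Python's standard `json` module. The schema (the `nodes`, `edges`, `head`, `tail`, `rule`, and `weight` field names) is a hypothetical one chosen for this example; it is not the format CDec emits.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not CDec's actual forest output format.
hypergraph = {
    "nodes": [
        {"id": 0, "label": "NP"},
        {"id": 1, "label": "VP"},
        {"id": 2, "label": "S"},
    ],
    "edges": [
        # A hyperedge connects a list of tail nodes to a single head node.
        {"head": 2, "tail": [0, 1], "rule": "S -> NP VP", "weight": -1.5},
    ],
}

# Serialize to indented, human-readable text, then parse it back.
text = json.dumps(hypergraph, indent=2)
restored = json.loads(text)

# Compact separators drop the whitespace, trading readability for size.
compact = json.dumps(hypergraph, separators=(",", ":"))
```

Even in the compact form, every edge repeats its field names as text, which is where much of the space overhead comes from.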

Protocol Buffers

Protocol Buffer Description

Implementation Sketch

Pro:

  • Conversion to and from JSON (protobuf-json)
  • Very fast to read (particularly in C++ and Java, hopefully soon in Python)
  • Very space efficient
  • Implementations in every language (although requires a separate library)
  • Automatically generates typed stubs

Con:

  • "It's really easy to get up to some of the data size limits that are in place to prevent malicious data from having the PB parser allocate too much memory"

  • "You typically have to create a full hypergraph protocol buffer object before you can serialize it, so you either have to use the PB data structures internally in your code or you have to copy your data structure. While doing this copy, you can end up with two copies of the forest in memory, which is bad for memory usage."

Variation of SLF (Standard Lattice Format)

SLF Specification

Pro:

  • Blindingly fast.
  • Could be implemented to work lazily (streaming).

Con:

  • Requires a custom format
  • Probably needs specialized language bindings.
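
To make the lazy/streaming point concrete, here is a minimal sketch of a reader for a line-oriented format with SLF-style `KEY=value` fields. It assumes node lines start with `I=` and arc lines with `J=`, with `S`/`E` as start and end nodes; the hypergraph variation itself is not specified on this page, so the sample data and the `iter_records` helper are hypothetical.

```python
import io
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield one dict of KEY=value fields per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Simplification: values containing spaces or quotes are not handled.
        yield dict(field.split("=", 1) for field in line.split())


# Hypothetical sample data in an SLF-like layout.
sample = io.StringIO(
    "VERSION=1.0\n"
    "N=3 L=2\n"
    "I=0 t=0.00\n"
    "I=1 t=0.50\n"
    "I=2 t=1.00\n"
    "J=0 S=0 E=1 W=the a=-12.5\n"
    "J=1 S=1 E=2 W=cat a=-20.0\n"
)

# The generator parses one line at a time; only the arcs are kept here.
arcs = [rec for rec in iter_records(sample) if "J" in rec]
```

Because `iter_records` is a generator over any iterable of lines, the same code works on an open file handle without reading the whole lattice into memory first.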

Tiburon Format

Tiburon Specification