What is the protobuf file format

Protocol Buffers (protobuf) is a language-neutral, platform-neutral, extensible mechanism for serializing structured data: think XML or JSON, but smaller and faster. A protobuf schema describes once how the data is structured, after which generated source code can be used to write and read that data to and from a variety of data streams, in a variety of languages. Protobuf is a binary format that is not self-describing: the encoded data carries field numbers and values, but no field names. You therefore need the schema, and code generated from it, to deserialize a file and access the data.

Protobuf was developed by Google in the early 2000s as a way to communicate between systems, and it is now also the standard format for GTFS-RT.

Why protobuf is used for GTFS-RT

Protobuf is the standard for GTFS-RT data. Since all producers use the same protobuf format and schema, consumers only have to write their application once, after which it can read the feeds of any producer.

When we publish data in real time, new data is published and fetched every 15 seconds (TripUpdates, ServiceAlerts) or even every 3 seconds (VehiclePositions). For TripUpdates, this means 5,760 updates per operator per day. A file size reduction of 500 kB would then save roughly 2.9 GB of data transfer per day, just for this one feed with one consumer. This matters both for producers, since data transfer can make up a significant part of the hosting costs, and for consumers such as you, who also have to pay for transfer.
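
The arithmetic behind these figures can be checked with a short sketch. The 15-second and 3-second intervals and the 500 kB saving come from the text above; the function names are made up for illustration, and decimal units (1 GB = 1,000,000 kB) are an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def fetches_per_day(interval_seconds: int) -> int:
    """Number of times a feed is fetched per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


def daily_saving_gb(interval_seconds: int, saving_kb_per_file: float) -> float:
    """Transfer saved per day for one feed and one consumer, in decimal GB."""
    total_kb = fetches_per_day(interval_seconds) * saving_kb_per_file
    return total_kb / 1_000_000  # assumption: 1 GB = 1,000,000 kB


print(fetches_per_day(15))       # TripUpdates and ServiceAlerts
print(fetches_per_day(3))        # VehiclePositions
print(daily_saving_gb(15, 500))  # saving for a 500 kB smaller TripUpdates file
```

With these assumptions the 15-second feed is fetched 5,760 times per day, and a 500 kB reduction adds up to about 2.88 GB per day for a single consumer.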

Thanks to its smaller size, protobuf can also bring advantages in decoding performance. Systems can decode and handle messages more quickly, which becomes especially important when you want to consume a large number of feeds.
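
To give a rough feeling for the size difference, the sketch below hand-encodes one arrival event as protobuf wire bytes and compares it with compact JSON. It is an illustration only, not a benchmark: the helper names are hypothetical, the sample values are arbitrary, and the field numbers (delay = 1, time = 2, uncertainty = 3) are assumed from the GTFS-RT StopTimeEvent message, which is not shown in this document.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Tag (field number plus wire type 0) followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


event = {"delay": 173, "time": 1608129893, "uncertainty": 0}

# Assumed field numbers for StopTimeEvent: delay=1, time=2, uncertainty=3.
as_protobuf = (
    encode_varint_field(1, event["delay"])
    + encode_varint_field(2, event["time"])
    + encode_varint_field(3, event["uncertainty"])
)
as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")

print(len(as_protobuf), "bytes as protobuf")
print(len(as_json), "bytes as compact JSON")
```

For this single event the sketch gives 11 bytes of protobuf against 47 bytes of JSON. The ratio for a complete feed depends on the data and on any transport compression, so treat the numbers as indicative only.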

Protobuf schemas

A protobuf schema defines how the information inside a message is structured. Take a look at this excerpt from the GTFS-RT protobuf schema:

// Realtime update for arrival and/or departure events for a given stop on a
// trip. Updates can be supplied for both past and future events.
// The producer is allowed, although not required, to drop past events.
message StopTimeUpdate {
    // The update is linked to a specific stop either through stop_sequence or
    // stop_id, so one of the fields below must necessarily be set.
    // See the documentation in TripDescriptor for more information.

    // Must be the same as in stop_times.txt in the corresponding GTFS feed.
    optional uint32 stop_sequence = 1;
    // Must be the same as in stops.txt in the corresponding GTFS feed.
    optional string stop_id = 4;

    optional StopTimeEvent arrival = 2;
    optional StopTimeEvent departure = 3;

    // The relation between this StopTime and the static schedule.
    enum ScheduleRelationship {
        // The vehicle is proceeding in accordance with its static schedule of
        // stops, although not necessarily according to the times of the schedule.
        // At least one of arrival and departure must be provided. If the schedule
        // for this stop contains both arrival and departure times then so must
        // this update.
        SCHEDULED = 0;

        // The stop is skipped, i.e., the vehicle will not stop at this stop.
        // Arrival and departure are optional.
        SKIPPED = 1;

        // No data is given for this stop. The main intention for this value is to
        // give the predictions only for part of a trip, i.e., if the last update
        // for a trip has a NO_DATA specifier, then StopTimes for the rest of the
        // stops in the trip are considered to be unspecified as well.
        // Neither arrival nor departure should be supplied.
        NO_DATA = 2;
    }
    optional ScheduleRelationship schedule_relationship = 5 [default = SCHEDULED];

    // The extensions namespace allows 3rd-party developers to extend the
    // GTFS Realtime Specification in order to add and evaluate new features
    // and modifications to the spec.
    extensions 1000 to 1999;
}

Every field is described: whether it is optional, its data type, and its field number (the number after the equals sign), which identifies the field in the binary data and is needed to decode the file. Note that these schemas are not meant as a specification of the information content, but as a specification of the serialization and deserialization. If you are looking for the GTFS-RT specification (not the protobuf schema), you can find it here.
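
The role of the field numbers can be made concrete with a minimal sketch that reads raw bytes without any generated code. It only handles two of the protobuf wire types (varint and length-delimited), the function names are hypothetical, and the sample stop_id "A1" is invented. The number-to-name table is copied from the StopTimeUpdate excerpt above.

```python
# Field numbers copied from the StopTimeUpdate excerpt above.
STOP_TIME_UPDATE_FIELDS = {
    1: "stop_sequence",
    2: "arrival",
    3: "departure",
    4: "stop_id",
    5: "schedule_relationship",
}


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def read_fields(data: bytes):
    """Yield (field_number, value) pairs; handles wire types 0 and 2 only."""
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited: strings, nested messages
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        yield field_number, value


# stop_sequence = 49 (field 1, varint) and stop_id = "A1" (field 4, string)
raw = bytes([0x08, 0x31, 0x22, 0x02]) + b"A1"
decoded = {STOP_TIME_UPDATE_FIELDS[number]: value for number, value in read_fields(raw)}
print(decoded)
```

The bytes themselves only say "field 1 holds 49" and "field 4 holds two bytes"; the names stop_sequence and stop_id, and the fact that field 4 is text rather than a nested message, come from the schema. In practice you would use generated code instead of a hand-written reader like this one.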

Decoding a protobuf file

When you decode a protobuf file, you end up with structured data. The exact data structures used to hold this data can differ between programming languages (for example, lists versus arrays), but the content will be the same. Below you can see a fragment of a GTFS-RT TripUpdate file after it has been decoded from protobuf.

TripUpdate.pb, fragment deserialized to protobuf text format:

header {
  gtfs_realtime_version: "2.0"
  incrementality: FULL_DATASET
  timestamp: 1608130478
}
entity {
  id: "205330500466160320"
  trip_update {
    trip {
      trip_id: "205330000096654402"
      start_date: "20201216"
      schedule_relationship: SCHEDULED
    }
    stop_time_update {
      stop_sequence: 49
      arrival {
        delay: 173
        time: 1608129893
        uncertainty: 0
      }
      departure {
        delay: 190
        time: 1608129910
        uncertainty: 0
      }
      stop_id: "9022020526118002"
    }
  }
}
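
The numeric values in this fragment are easier to read once converted. In GTFS-RT, time is a POSIX timestamp and delay is given in seconds, with positive values meaning late. The sketch below converts the arrival and departure from the fragment; the helper name is hypothetical, and deriving the scheduled time as time minus delay is an assumption made for illustration.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> datetime:
    """Convert a POSIX timestamp (seconds since 1970-01-01 UTC) to a datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Values copied from the stop_time_update in the fragment above.
arrival = {"delay": 173, "time": 1608129893}
departure = {"delay": 190, "time": 1608129910}

for name, event in (("arrival", arrival), ("departure", departure)):
    predicted = to_utc(event["time"])
    # Assumption: scheduled time = predicted time - delay.
    scheduled = to_utc(event["time"] - event["delay"])
    print(
        f"{name}: predicted {predicted:%H:%M:%S} UTC, "
        f"scheduled {scheduled:%H:%M:%S} UTC, {event['delay']} s late"
    )
```

Under that assumption both events trace back to a scheduled time of 14:42:00 UTC on 16 December 2020, which matches the start_date of the trip.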

There are protobuf libraries and tools for almost all programming languages. You can find some examples for Java, Python and C# below.
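
As one possible starting point in Python, the sketch below uses the gtfs-realtime-bindings package, which ships code generated from the GTFS-RT schema. It assumes the package is installed (pip install gtfs-realtime-bindings); the function name and the choice of which fields to return are made up for illustration.

```python
def summarize_trip_updates(data: bytes) -> list[tuple[str, str, int]]:
    """Decode a GTFS-RT feed and return (trip_id, stop_id, arrival_delay) rows."""
    # Imported lazily so this module still loads without the bindings installed.
    # Assumes: pip install gtfs-realtime-bindings
    from google.transit import gtfs_realtime_pb2

    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(data)  # raw bytes of the .pb file

    rows = []
    for entity in feed.entity:
        if not entity.HasField("trip_update"):
            continue  # skip entities that carry vehicle positions or alerts
        trip_id = entity.trip_update.trip.trip_id
        for update in entity.trip_update.stop_time_update:
            # Unset optional fields read as their defaults (0 for delay).
            rows.append((trip_id, update.stop_id, update.arrival.delay))
    return rows
```

Pass the function the bytes of a downloaded feed, for example the contents of a TripUpdate.pb file opened in binary mode. The generated classes expose the same field names as the schema, so the decoded structure mirrors the fragment shown above.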

Further reading