Monthly Archives: January 2017

TensorFlow: Retrieving serving signatures from an exported model

A simple question, but it took me many hours of digging into the code to figure out :-(. Someone should have documented this.

Basically, you just import the meta graph, then unpack the protobuf object from the serving_signatures collection. I really don't understand why it is not added to the signature def. Anyway, you can later just call read_serving_signature(path/to/export.meta) to retrieve the exported signatures. This is very helpful if you want to implement a generic serving interface for TensorFlow.

I also made a gist here for reference.


CouchDB – my first try

I recently had an interesting task involving data ingestion. The use case: we have hundreds of GB of CSV files, each containing 10k – 10m records with around 20 – 40 fields and some common schema. The goal is to ingest all the files into a single data store so we can query the data in these files efficiently.

My initial thought was to use MongoDB, because there is no well-defined schema and I already have experience with MongoDB. But this time I wanted to give CouchDB a try. Why did I choose CouchDB? Mainly because I have heard and read about it for a while but have never tried it on a real use case. To be clear, I wanted to try CouchDB to get some hands-on experience with it, so that I can use it in the future.

The goal is quite simple: import all CSV files into CouchDB and then make some simple filter queries. I first used Docker to start a single CouchDB instance from an image on Docker Hub. First impression: the API is quite simple; all you need to do is make HTTP calls. This is a good document for starters. There is also a nice CouchDB Python client with a very straightforward API, and CouchDB has a very nice built-in UI. However, I quickly stumbled into issues.
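To give a feel for how simple the HTTP interface is, here is a minimal sketch using only the standard library. It assumes a CouchDB instance is already running on localhost:5984 (e.g. started via Docker), and the database name is made up for illustration.

```python
# Minimal sketch of CouchDB's HTTP API using only the standard library.
# Assumes CouchDB is reachable at localhost:5984; "csv_records" is a
# hypothetical database name.
import json
import urllib.request

COUCH = "http://localhost:5984"

def couch_request(method, path, body=None):
    """Send a JSON request to CouchDB and return the decoded response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        COUCH + path, data=data, method=method,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    couch_request("PUT", "/csv_records")               # create the database
    couch_request("POST", "/csv_records",              # insert one document
                  {"name": "alice", "amount": 42})
```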

The first issue is that insertion speed is quite slow. I was inserting row by row. I searched for bulk insert; there is a REST API, but I couldn't find the equivalent Python implementation (hmmm). I did some googling and found the mpcouch Python package, which utilizes multiple threads for insertion, and it did improve the insertion speed a lot. However, after one hour only 9m records were inserted.
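The bulk REST API in question is the _bulk_docs endpoint, which accepts many documents in a single request. A rough sketch of batching CSV rows into _bulk_docs payloads (database URL and batch size are illustrative assumptions):

```python
# Sketch: batch CSV rows into CouchDB _bulk_docs payloads to avoid
# row-by-row inserts. The db_url and batch_size are assumptions.
import csv
import json
import urllib.request

def rows_to_bulk_payloads(csv_path, batch_size=1000):
    """Yield {"docs": [...]} payloads of up to batch_size rows each."""
    with open(csv_path, newline="") as f:
        batch = []
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) == batch_size:
                yield {"docs": batch}
                batch = []
        if batch:
            yield {"docs": batch}

def bulk_insert(db_url, payload):
    """POST one batch to <db>/_bulk_docs."""
    req = urllib.request.Request(
        db_url + "/_bulk_docs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```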

The second issue is disk usage. I was quite surprised by how much disk space CouchDB consumed. After just the first 10 thousand records, it had already used 7MB, while the actual CSV was only 440KB.

After 3 hours, I had managed to ingest 25m records, but they consumed 15.2GB of disk. The actual sample is 18GB of CSV files with 280m records in total. Proportionally, dumping all the CSV files into CouchDB could take more than 100GB of storage. I also did some more research, and disk usage really is a known issue with CouchDB. So if you plan to use CouchDB in production, make sure you also plan for plenty of storage.
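The back-of-the-envelope projection behind that claim, using the numbers measured above:

```python
# Projected disk usage if all 280m records were ingested, extrapolating
# linearly from the 25m records that consumed 15.2GB.
records_ingested = 25_000_000
disk_used_gb = 15.2
total_records = 280_000_000

projected_gb = disk_used_gb / records_ingested * total_records
print(round(projected_gb, 1))  # ~170.2 GB, well over 100GB
```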

Also, CouchDB views are not straightforward for querying and aggregating data. You have to write a JavaScript function to create a view, and building a view actually consumes a lot of disk as well. Another thing: when I search for CouchDB, there are not many recent articles, and the articles that do exist are quite old. This gives me the feeling that the CouchDB community is not very active.
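To illustrate what writing a view involves: views live in a design document whose map function is a JavaScript string, which you can push over the same HTTP API. The database, design document, and field names below are made up for illustration.

```python
# Sketch: define a CouchDB view by pushing a design document. The map
# function itself must be JavaScript, embedded here as a string; names
# ("reports", "by_name", doc.name) are hypothetical.
import json
import urllib.request

design_doc = {
    "_id": "_design/reports",
    "views": {
        "by_name": {
            # JavaScript map function, emitting one row per document
            "map": "function (doc) { if (doc.name) { emit(doc.name, 1); } }",
            # built-in reduce that counts documents per key
            "reduce": "_count",
        }
    },
}

def push_design_doc(db_url, doc):
    """PUT the design document so CouchDB can build the view index."""
    req = urllib.request.Request(
        db_url + "/" + doc["_id"],
        data=json.dumps(doc).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The view would then be queried at:
#   GET <db>/_design/reports/_view/by_name?group=true
```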

So, in summary, CouchDB is not the answer for this particular case: slow ingestion and disk consumption are my main concerns. There are surely many other use cases where CouchDB fits well. But going through this exercise helped me understand more about CouchDB and how it works. Hopefully I will have a chance to come back to it in the future.