I recently had an interesting data ingestion task. The use case: we have hundreds of GB of CSV files, each file containing 10k to 10m records with around 20 to 40 fields and some common schema. The goal is to ingest all the files into a single data store so we can query the data in them efficiently.
My initial thought was to use MongoDB, because there is no well-defined schema and because I have experience with it. But this time I wanted to give CouchDB a try. Why CouchDB? Mainly because I have heard and read about it for a while but never tried it on a real use case. To be clear, I wanted some hands-on experience with CouchDB so that I can use it in the future.
The goal is quite simple: import all the CSV files into CouchDB and then run some simple filter queries. I first used Docker to start a single CouchDB instance from an image on Docker Hub. My first impression was that the API is quite simple: all you need to do is make a POST call. This is a good document for starters. There is also a nice CouchDB client with a very straightforward API, and CouchDB has a very nice built-in UI. However, I quickly ran into issues.
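For reference, the single-document POST mentioned above can be sketched with only the Python standard library. This is a minimal sketch, not the client code used in this experiment: the server address `http://localhost:5984`, the database name `records`, and the helper `build_insert_request` are all assumptions for illustration, and a real CouchDB 3.x server would also need admin credentials on the request.

```python
import json
import urllib.request


def build_insert_request(base_url, db, doc):
    """Build (but do not send) a POST request that stores one JSON document.

    POSTing a JSON body to /{db} asks CouchDB to create a document
    with a server-generated _id.
    """
    return urllib.request.Request(
        url=f"{base_url}/{db}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address, database name, and row contents.
req = build_insert_request(
    "http://localhost:5984", "records", {"name": "alice", "age": "30"}
)

# To actually send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

One request per row is exactly the pattern that turns out to be slow at this scale.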
The first issue was that insertion is quite slow. I was inserting row by row. I searched for bulk insert: there is a REST API for it, but I couldn't find the equivalent Python implementation (hmmm). After some googling I found the mpcouch Python package, which uses multiple threads for insertion, and it did improve the insertion speed a lot. Even so, after one hour only 9m records had been inserted.
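The bulk REST API in question is CouchDB's `_bulk_docs` endpoint, which accepts many documents in one POST. Below is a minimal sketch of calling it directly from Python with the standard library, as an alternative to row-by-row inserts. It is not what mpcouch does internally and was not benchmarked here. The helper names (`batched_rows`, `bulk_docs_request`, `ingest`), the batch size of 1000, and the absence of authentication are assumptions for illustration.

```python
import csv
import io
import json
import urllib.request


def batched_rows(csv_file, batch_size):
    """Yield lists of at most batch_size rows (as dicts) from an open CSV file."""
    batch = []
    for row in csv.DictReader(csv_file):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def bulk_docs_request(base_url, db, docs):
    """Build a POST to /{db}/_bulk_docs carrying all docs in one request body."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{db}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(csv_path, base_url, db, batch_size=1000):
    """Send one CSV file to CouchDB in batches; needs a running server."""
    with open(csv_path, newline="") as f:
        for batch in batched_rows(f, batch_size):
            req = bulk_docs_request(base_url, db, batch)
            with urllib.request.urlopen(req) as resp:
                resp.read()  # per-document results; ignored in this sketch


# Offline demonstration of the batching on an in-memory CSV.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
print([len(b) for b in batched_rows(sample, 2)])
```

Batching reduces the number of HTTP round trips, which is usually the dominant cost of row-by-row inserts. Whether it would have been enough for 280m records is untested here.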
The second issue was disk usage. I was quite surprised by how much disk space CouchDB consumed. After just the first 10 thousand records it had already used 7MB, while the actual CSV is only 440KB.
After 3 hours, I had managed to ingest 25m records, but they consumed 15.2GB of disk. The actual sample is 18GB of CSV files containing 280m records. Proportionally, dumping all the CSV files into CouchDB could take more than 100GB of storage (about 170GB at this rate). I also did some more research, and disk usage really is an issue with CouchDB. So if you plan to use CouchDB in production, make sure you also plan for plenty of storage.
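The proportional estimate can be checked with a few lines of arithmetic. This sketch assumes disk usage grows linearly with the number of records and treats 1GB as 10^9 bytes. The function name `projected_disk_gb` is made up for illustration.

```python
def projected_disk_gb(used_gb, records_done, records_total):
    """Linear extrapolation of disk usage from a partial ingest."""
    return used_gb / records_done * records_total


# Figures from the experiment: 15.2GB used after 25m of 280m records.
projected = projected_disk_gb(15.2, 25_000_000, 280_000_000)
bytes_per_record = 15.2e9 / 25_000_000
blowup_vs_csv = projected / 18  # the raw CSV sample is 18GB

print(f"projected disk usage: {projected:.1f} GB")
print(f"on-disk bytes per record: {bytes_per_record:.0f}")
print(f"size relative to raw CSV: {blowup_vs_csv:.1f}x")
```

Under these assumptions the full sample lands at roughly 170GB, a little over nine times the size of the raw CSV.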
So in summary, for this particular case, CouchDB is not the answer. Slow ingestion and disk consumption are my main concerns. There are surely many other use cases where CouchDB fits. Still, going through this exercise helped me understand more about CouchDB and how it works. Hopefully I will have a chance to come back to it in the future.