Efficient Dataset Storage: Beyond CSVs

Published: February 3, 2023

As a data scientist, one of the crucial aspects of your work is managing and storing datasets efficiently. But how should you store them? CSVs are common, and handy when you want to share and read your data, but other file formats can be far more efficient in terms of both speed and disk space.

In this blog post, I'll explore alternative file formats that can outperform CSVs, especially when dealing with larger datasets.

[Figures: Size on Disk, Load Time, and Save Time by File Type and Number of Rows]

Binary formats like Pickle and Parquet offer better performance in both speed and compression. Pickles are a lot faster to read and write than CSVs, but not much smaller on disk. Alternatives like Parquet (or even Feather) offer better compression as well.
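If you want to reproduce numbers like the ones plotted above, here's a minimal benchmark sketch. It assumes pandas with pyarrow installed (needed for Parquet and Feather), and uses a made-up two-column DataFrame and placeholder file names:

import os
import time

import numpy as np
import pandas as pd

# Toy DataFrame; in practice you'd benchmark your own dataset
n_rows = 1_000_000
df = pd.DataFrame({
    'a': np.random.rand(n_rows),
    'b': np.random.randint(0, 100, n_rows),
})

# (format name, save function, load function, file path)
formats = [
    ('csv', df.to_csv, pd.read_csv, 'bench.csv'),
    ('pickle', df.to_pickle, pd.read_pickle, 'bench.pkl'),
    ('parquet', df.to_parquet, pd.read_parquet, 'bench.parquet'),
    ('feather', df.to_feather, pd.read_feather, 'bench.feather'),
]

for name, save, load, path in formats:
    start = time.perf_counter()
    save(path)
    save_time = time.perf_counter() - start

    start = time.perf_counter()
    load(path)
    load_time = time.perf_counter() - start

    size_mb = os.path.getsize(path) / 1e6
    print(f'{name:8s} save={save_time:.2f}s '
          f'load={load_time:.2f}s size={size_mb:.1f}MB')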

Both of these binary formats preserve your data types too! And for a little extra load time, Parquet will save you space on disk, which becomes more and more important as your datasets grow.
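To see the data-type point concretely, here's a quick sketch (file names are placeholders): a datetime column round-trips through Parquet intact, while a CSV round-trip hands it back as plain strings unless you re-parse it.

import pandas as pd

df = pd.DataFrame({
    'when': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'value': [1.5, 2.5],
})

df.to_csv('demo.csv', index=False)
df.to_parquet('demo.parquet')

# 'when' comes back as object (strings) from the CSV
print(pd.read_csv('demo.csv').dtypes)
# 'when' stays datetime64[ns] from the Parquet file
print(pd.read_parquet('demo.parquet').dtypes)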

With Parquet, you also have the option to read only selected columns, making it even more efficient for large datasets:

import pandas as pd

# load only selected columns of a Parquet file
master_ref = pd.read_parquet('./datasets/master_ref.parquet',
                             columns=['short_1', 'short_2'])
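This works because Parquet is a columnar format: each column is stored contiguously, so the reader can skip the columns you don't ask for instead of scanning every row.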

Overall, CSVs may not always be the best option, especially for large datasets. Consider binary formats for better speed & compression.