Overview
In this lesson I will explain the different kinds of data file formats used in big data. They are widely used but rarely talked about. Anyone aspiring to be a data engineer, data analyst, or ML engineer should be aware of this hidden magic in big data.
Why do we even need a data file format?
- Imagine storing all the information you receive in random order. Will this help? No, it won't.
- Companies can receive anywhere from tens to thousands of GB of data every day. If this data is not stored in a proper format, understanding it will be difficult.
- The more time you spend sorting through the data, the more opportunities the company misses to retain customers or generate more orders and revenue.
This is why data file formats are used in a data lake.
Different kinds of file formats
We have CSV, JSON, ORC, Parquet, and Avro. These are the leading data file formats.
Which one to choose, and why, depends entirely on the company's use case.
Use-case 1:
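If you are looking at total sales from a table, say orders, the query only needs to scan one column of that table, sale_amount.
A columnar format stores all the values of a column together, so the query can read sale_amount without touching the other columns. Hence storing this table in a columnar format is the best way to go.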
File formats to choose: ORC, Parquet
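To make the idea concrete, here is a minimal sketch in plain Python of the same data laid out row by row and column by column. It does not use ORC or Parquet themselves; the orders data and variable names are made up for illustration, and the only point is to show which values a total-sales query has to touch in each layout.

```python
# Hypothetical orders table with the three columns named in this lesson.
rows = [
    {"user_id": 1, "item_name": "keyboard", "sale_amount": 40.0},
    {"user_id": 2, "item_name": "mouse", "sale_amount": 15.5},
    {"user_id": 1, "item_name": "monitor", "sale_amount": 120.0},
]

# Row-oriented layout: each record is kept together, so summing one
# field still walks through every full record.
total_from_rows = sum(record["sale_amount"] for record in rows)

# Column-oriented layout: all values of a column are kept together.
columns = {
    name: [record[name] for record in rows]
    for name in ("user_id", "item_name", "sale_amount")
}

# Only the sale_amount column is read; user_id and item_name are never touched.
total_from_columns = sum(columns["sale_amount"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same total. The difference is how much data has to be read to get it, which is what matters once the table has many columns and millions of rows.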
Use-case 2:
If you are trying to understand customer behavior, for example which kinds of items customers are ordering,
you have to scan row-level information spanning multiple columns such as user_id, item_name, and sale_amount.
Here, storing the data in a row-oriented format is the right choice.
File formats to choose: CSV, JSON, Avro
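As a rough illustration of row-level access, the sketch below writes a few made-up orders as JSON Lines (one JSON record per line) and as CSV using only the Python standard library, then reads them back one whole record at a time. The sample data and variable names are assumptions for the example; Avro is left out because it needs a third-party library.

```python
import csv
import io
import json

# Hypothetical orders, one dict per row.
orders = [
    {"user_id": 1, "item_name": "keyboard", "sale_amount": 40.0},
    {"user_id": 2, "item_name": "mouse", "sale_amount": 15.5},
    {"user_id": 1, "item_name": "monitor", "sale_amount": 120.0},
]

# JSON Lines: every line holds one complete record.
jsonl = "\n".join(json.dumps(order) for order in orders)

# Reading a line gives all columns of that row at once, which suits
# questions like "which items did each user order?".
items_by_user = {}
for line in jsonl.splitlines():
    record = json.loads(line)
    items_by_user.setdefault(record["user_id"], []).append(record["item_name"])

# CSV: the same rows, written and read back through an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "item_name", "sale_amount"])
writer.writeheader()
writer.writerows(orders)
buffer.seek(0)
csv_records = list(csv.DictReader(buffer))  # note: CSV values come back as strings

print(items_by_user)
print(csv_records[0])
```

In both formats a single read returns every field of an order, so no stitching across columns is needed to analyze one customer's purchases.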
Conclusion
It's important to know that there is no one-size-fits-all format. The question you must ask is "How will this data be used?" The file format should be chosen based on the answer.