This document covers the essential file formats used in data science, such as CSV, JSON, and Excel. It explains their structure, how to read and write them in Python, and the advantages and limitations of each format for data storage and exchange.
File formats are fundamental to data science, enabling the storage, exchange, and analysis of data. Understanding the structure and use cases of different file formats is crucial for efficient data handling.
| Format | Description | Typical Use Cases |
|---|---|---|
| CSV | Comma-separated values; plain text, tabular data | Data export/import, spreadsheets |
| JSON | JavaScript Object Notation; hierarchical, human-readable | APIs, config files, web data |
| Excel | Microsoft Excel format; supports formulas, formatting | Business data, analytics |
CSV files store tabular data in plain text, with each line representing a row and columns separated by commas. They are widely used for data exchange due to their simplicity and compatibility.
```csv
name,age,score
Alice,30,85
Bob,25,90
```
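To make the row-and-column structure concrete, here is a minimal sketch that parses the sample above from an in-memory string using `csv.DictReader`. The variable names are illustrative. Note that every field comes back as a string, so numeric columns need explicit conversion.

```python
import csv
import io

# The sample data from above, held in memory instead of a file.
sample = "name,age,score\nAlice,30,85\nBob,25,90\n"

# DictReader treats the first line as the header and maps each row to a dict.
rows = list(csv.DictReader(io.StringIO(sample)))

# CSV carries no type information, so numeric fields are converted by hand.
ages = [int(row["age"]) for row in rows]

print(rows[0]["name"])  # Alice
print(ages)             # [30, 25]
```

Using `io.StringIO` keeps the example self-contained; with a real file, the same `DictReader` call works on the open file object.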
JSON is a lightweight, human-readable format for representing structured data. It supports nested objects and arrays, making it suitable for complex data structures.
```json
[
  { "name": "Alice", "age": 30, "score": 85 },
  { "name": "Bob", "age": 25, "score": 90 }
]
```
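As a rough illustration of nesting, the sketch below parses the sample array from a string, then attaches a list and a nested object to the first record. The `tags` and `address` fields are hypothetical additions, not part of the sample data.

```python
import json

# The sample array from above as a JSON string.
text = (
    '[{"name": "Alice", "age": 30, "score": 85},'
    ' {"name": "Bob", "age": 25, "score": 90}]'
)

people = json.loads(text)  # a list of dicts; numbers stay numbers

# Hypothetical nested fields, added only to show arrays and objects nesting.
people[0]["tags"] = ["analyst", "admin"]
people[0]["address"] = {"city": "Paris"}

# Serialize and parse again to confirm the nested structure survives.
round_tripped = json.loads(json.dumps(people))

print(type(people[0]["age"]).__name__)      # int
print(round_tripped[0]["address"]["city"])  # Paris
```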
Excel files come in two main formats: the legacy binary XLS and the newer XLSX, which is a ZIP archive of XML files. Both support multiple sheets, formulas, and formatting, and they are commonly used in business and analytics.
Python provides libraries for working with these formats:
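- `csv` for CSV files
- `json` for JSON files
- `pandas` for all formats, including Excel

```python
import csv

# newline='' lets the csv module handle line endings itself
with open('data.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # each row is a list of strings
```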
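Writing is the mirror image of reading. The following sketch, which assumes a throwaway path in a temporary directory, writes the sample rows with `csv.writer` and reads them back to confirm the round trip.

```python
import csv
import os
import tempfile

rows = [["name", "age", "score"], ["Alice", 30, 85], ["Bob", 25, 90]]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as file:
    csv.writer(file).writerows(rows)

# Read the file back; the integers written above return as strings.
with open(path, newline="") as file:
    read_back = list(csv.reader(file))

print(read_back[1])  # ['Alice', '30', '85']
```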
```python
import json

with open('data.json', 'r') as file:
    data = json.load(file)  # parses into Python lists and dicts
    print(data)
```
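For the writing direction, a minimal sketch with `json.dump` might look like this. The file path is a placeholder in a temporary directory, and `indent=2` is just one reasonable choice for readable output.

```python
import json
import os
import tempfile

data = [
    {"name": "Alice", "age": 30, "score": 85},
    {"name": "Bob", "age": 25, "score": 90},
]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.json")

# indent=2 pretty-prints the output so it stays easy to read.
with open(path, "w", encoding="utf-8") as file:
    json.dump(data, file, indent=2)

# Load the file again; types and structure are preserved.
with open(path, encoding="utf-8") as file:
    loaded = json.load(file)

print(loaded == data)  # True
```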
```python
import pandas as pd

# Reading .xlsx files requires an Excel engine such as openpyxl
df = pd.read_excel('data.xlsx')
print(df.head())  # first five rows
```
| Format | Advantages | Limitations |
|---|---|---|
| CSV | Simple, widely supported | No data types, no hierarchy |
| JSON | Supports complex/nested data, human-readable | Larger files, not ideal for tabular data |
| Excel | Rich features, formulas, formatting | Proprietary, larger file size |
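Two of the trade-offs in the table can be checked directly. The sketch below serializes the same records to CSV and to JSON in memory, then compares the types that come back and the size of each text. It uses only the standard library, and the records are the sample data used earlier.

```python
import csv
import io
import json

records = [
    {"name": "Alice", "age": 30, "score": 85},
    {"name": "Bob", "age": 25, "score": 90},
]

# Serialize to CSV in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "age", "score"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Serialize to JSON; the keys are repeated for every record.
json_text = json.dumps(records)

# Parse both back and inspect what survived.
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
from_json = json.loads(json_text)

print(type(from_csv[0]["age"]).__name__)   # str
print(type(from_json[0]["age"]).__name__)  # int
print(len(csv_text) < len(json_text))      # True
```

The size gap comes from JSON repeating each key per record, while CSV states the column names once in the header.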
Understanding file formats is essential for effective data science workflows. Choosing the right format depends on the data structure, use case, and compatibility requirements. Python makes it easy to read and write these formats for analysis and sharing.
(2) CSV is a plain text format where each line represents a row and columns are separated by commas, making it ideal for tabular data exchange.
(1) JSON supports complex and nested data structures, making it suitable for hierarchical data and web APIs.
(3) pandas is a powerful Python library that can read and write Excel files, as well as CSV and JSON formats.
| Format | Use Case |
|---|---|
| A. CSV | 1. Data export/import, spreadsheets |
| B. JSON | 2. APIs, config files, web data |
| C. Excel | 3. Business data, analytics |
A-1, B-2, C-3.
(2) CSV does not support data types or hierarchical structures, which limits its use for complex data.
JSON files are human-readable and support nested data structures.
True. JSON is designed to be both human-readable and capable of representing complex, nested data structures.
(2) The csv module can only read plain text; Excel-specific features like formatting and formulas will be lost.
(2) Compatibility with the tools and systems involved is the most important factor when selecting a file format for data exchange.