This document explains methods for retrieving data from various sources, including CSV and JSON files, and highlights practical considerations when working with these formats using Python and Pandas.
Retrieving Data from Different Sources
- CSV Files: Comma Separated Values files are widely used for storing tabular data. They can be easily read into Pandas DataFrames.
- JSON Files: JavaScript Object Notation files are commonly used for structured data storage. They can also be read into Pandas DataFrames.
Downloading Data Files
For this exercise, the Iris dataset is used which contains information about different species of iris flowers. The dataset is available in both CSV and JSON formats. Download the files from the following links:
To download the files directly from a Jupyter Notebook, use the following code:
1# Download the CSV file
2!wget -O data/iris_data.csv https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv
3
4# Download the JSON file
5!wget -O data/iris_data.json https://gist.githubusercontent.com/lisawagner/f7cbb8bae9743cca9c12c7b9682adfee/raw/iris_data.json
Why Use !wget in Jupyter Notebook
In Jupyter Notebook, the ! character is used to execute shell commands directly from a code cell. This allows you to run commands like wget without leaving the notebook environment. These commands are called magic commands. For example:
!wgetruns thewgetcommand to download files.!mkdir -p dataensures thedatafolder exists before downloading files.
This approach is convenient for managing files and directories while working interactively in a notebook. Alternatively, download the files directly from the links above and save them in a folder named data in a working directory. Ensure the data folder exists before running the code.
Format of Raw CSV File
Format of Raw JSON File
The raw iris.json downloaded contents are shown below:
iris_data.json
1{
2 "a": "b"
3}
Reading CSV Files
CSV (Comma Separated Values) files store data in rows separated by commas. These files can be easily read into Pandas DataFrames using the read_csv function. Below is an example:
1import pandas as pd
2
3file_path = "data/iris_data.csv"
4data = pd.read_csv(file_path)
5print(data.iloc[:5]) # Displays the first five rows
The output will display columns such as sepal length, sepal width, petal length, petal width, and species for the first five rows.
Useful Arguments for read_csv
Separator (
sep): Specify the delimiter using thesepargument. For example:1data = pd.read_csv("data/tab_separated_file.tsv", sep="\t")Use
delim_whitespace=Truefor files where columns are separated by whitespace.Header (
header): Define which row to use as column headers. For example:1data = pd.read_csv("data/no_header_file.csv", header=None)Column Names (
names): Manually specify column names as a list:1column_names = ["col1", "col2", "col3"] 2data = pd.read_csv("data/file.csv", names=column_names, header=None)Null Values (
na_values): Specify values to treat as null:1data = pd.read_csv("data/file.csv", na_values=["NA", "N/A", "99"])Data Types (
dtype): Enforce specific data types for columns:1data = pd.read_csv("data/file.csv", dtype={"col1": int, "col2": float})Chunking (
chunksize): Read large files in chunks:1for chunk in pd.read_csv("data/large_file.csv", chunksize=1000): 2 print(chunk.head())
Writing CSV Files
To write a DataFrame to a CSV file, use the to_csv function:
1data.to_csv("output.csv", index=False)
Reading JSON Files
JSON (JavaScript Object Notation) files are widely used for storing data in a structured format, resembling Python dictionaries. They are common in NoSQL databases and APIs.
Example JSON Structure
A JSON file may look like this:
1[
2 {
3 "sepal_length": 5.1,
4 "sepal_width": 3.5,
5 "petal_length": 1.4,
6 "petal_width": 0.2,
7 "species": "setosa"
8 },
9 {
10 "sepal_length": 4.9,
11 "sepal_width": 3.0,
12 "petal_length": 1.4,
13 "petal_width": 0.2,
14 "species": "setosa"
15 }
16]
Reading JSON Files with Pandas
To read a JSON file into a Pandas DataFrame, use the read_json function:
1data = pd.read_json("data/iris_data.json")
2print(data.head())
Useful Arguments for read_json
Orientation (
orient): Specify the file’s structure. Options include:records: Each JSON object is a row.index: Keys are row indices.
1data = pd.read_json("data/iris_data.json", orient="records")Lines (
lines): For JSON files with one object per line:1data = pd.read_json("data/iris_data_lines.json", lines=True)
Writing JSON Files
To write a DataFrame to a JSON file, use the to_json function:
1data.to_json("output.json", orient="records", lines=True)
Handling Nested JSON
For deeply nested JSON files, consider using the json_normalize function from Pandas:
1from pandas import json_normalize
2import json
3
4with open("data/nested.json") as f:
5 nested_data = json.load(f)
6
7data = json_normalize(nested_data, record_path=["nested_key"], meta=["meta_key"])
Customizing the Reading Process: - sep: Specify the delimiter (default is a comma ,). - header: Define the row number to use as column names (default is the first row). - dtype: Set the data type for specific columns. - na_values: Define additional strings to recognize as missing values.
Example with Custom Parameters:
1df = pd.read_csv( 2 'file.csv', 3 sep=',', # Delimiter 4 header=0, # First row as header 5 dtype={'column_name': 'int'}, # Specify data type 6 na_values=['NA', 'null'] # Treat 'NA' and 'null' as missing values 7) 8print(df.info()) # Display DataFrame structureCommon Errors and Solutions:
- FileNotFoundError: Ensure the file path is correct.
- ParserError: Check if the file has a consistent structure and matches the specified delimiter.
- MemoryError: Use the chunksize parameter for large files.
Conclusion
This document outlines methods for retrieving data from CSV and JSON files using Pandas. It also highlights key arguments and considerations for handling these formats effectively, ensuring smooth data processing workflows. By understanding the nuances of these file formats, you can streamline your data analysis tasks and avoid common pitfalls.
FAQ
Reading a CSV file into a Pandas DataFrame is one of the most common tasks in data analysis. A CSV (Comma-Separated Values) file is a plain text file that stores tabular data, where each line represents a row and columns are separated by a delimiter (usually a comma).
In Pandas, you can use the read_csv() method to load a CSV file into a DataFrame. This method provides several parameters to customize the reading process, such as specifying the delimiter, handling missing values, and defining data types for columns.
Basic Example:
1import pandas as pd
2df = pd.read_csv('file.csv', sep=',', header=0, dtype={'column_name': 'int'})
To read the given data into a Pandas DataFrame, you can use the read_csv() function with appropriate parameters. Since the data is space-separated, you need to specify the delimiter as a space (sep=' '), and you may also need to handle multiple spaces using the delim_whitespace=True parameter.
Example:
1import pandas as pd
2df = pd.read_csv('data.txt', sep=' ', header=None)
JSON is better for hierarchical or nested structured data, while CSV is more suitable for flat, tabular data.
Example:
- JSON: A dataset representing a company’s organizational structure, where each employee has nested details like name, position, and a list of subordinates.
- CSV: A dataset containing sales records with columns like
Date,Product,Quantity, andPrice.






