Retrieving Data from CSV and JSON Files

Methods for retrieving data from various sources including CSV and JSON files with practical considerations using Python and Pandas

This document explains methods for retrieving data from various sources, including CSV and JSON files, and highlights practical considerations when working with these formats using Python and Pandas.


Retrieving Data from Different Sources

  1. CSV Files: Comma Separated Values files are widely used for storing tabular data. They can be easily read into Pandas DataFrames.
  2. JSON Files: JavaScript Object Notation files are commonly used for structured data storage. They can also be read into Pandas DataFrames.

Downloading Data Files

For this exercise, the Iris dataset is used which contains information about different species of iris flowers. The dataset is available in both CSV and JSON formats. Download the files from the following links:

To download the files directly from a Jupyter Notebook, use the following code:

1# Download the CSV file
2!wget -O data/iris_data.csv https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv
3
4# Download the JSON file
5!wget -O data/iris_data.json https://gist.githubusercontent.com/lisawagner/f7cbb8bae9743cca9c12c7b9682adfee/raw/iris_data.json

Why Use !wget in Jupyter Notebook

In Jupyter Notebook, the ! character is used to execute shell commands directly from a code cell. This allows you to run commands like wget without leaving the notebook environment. These commands are called magic commands. For example:

  • !wget runs the wget command to download files.
  • !mkdir -p data ensures the data folder exists before downloading files.

This approach is convenient for managing files and directories while working interactively in a notebook. Alternatively, download the files directly from the links above and save them in a folder named data in a working directory. Ensure the data folder exists before running the code.


Format of Raw CSV File


Format of Raw JSON File

The raw iris.json downloaded contents are shown below:

iris_data.json
1{
2  "a": "b"
3}

Reading CSV Files

CSV (Comma Separated Values) files store data in rows separated by commas. These files can be easily read into Pandas DataFrames using the read_csv function. Below is an example:

1import pandas as pd
2
3file_path = "data/iris_data.csv"
4data = pd.read_csv(file_path)
5print(data.iloc[:5])  # Displays the first five rows

The output will display columns such as sepal length, sepal width, petal length, petal width, and species for the first five rows.

Useful Arguments for read_csv

  • Separator (sep): Specify the delimiter using the sep argument. For example:

    1data = pd.read_csv("data/tab_separated_file.tsv", sep="\t")
    

    Use delim_whitespace=True for files where columns are separated by whitespace.

  • Header (header): Define which row to use as column headers. For example:

    1data = pd.read_csv("data/no_header_file.csv", header=None)
    
  • Column Names (names): Manually specify column names as a list:

    1column_names = ["col1", "col2", "col3"]
    2data = pd.read_csv("data/file.csv", names=column_names, header=None)
    
  • Null Values (na_values): Specify values to treat as null:

    1data = pd.read_csv("data/file.csv", na_values=["NA", "N/A", "99"])
    
  • Data Types (dtype): Enforce specific data types for columns:

    1data = pd.read_csv("data/file.csv", dtype={"col1": int, "col2": float})
    
  • Chunking (chunksize): Read large files in chunks:

    1for chunk in pd.read_csv("data/large_file.csv", chunksize=1000):
    2    print(chunk.head())
    

Writing CSV Files

To write a DataFrame to a CSV file, use the to_csv function:

1data.to_csv("output.csv", index=False)

Reading JSON Files

JSON (JavaScript Object Notation) files are widely used for storing data in a structured format, resembling Python dictionaries. They are common in NoSQL databases and APIs.

Example JSON Structure

A JSON file may look like this:

 1[
 2  {
 3    "sepal_length": 5.1,
 4    "sepal_width": 3.5,
 5    "petal_length": 1.4,
 6    "petal_width": 0.2,
 7    "species": "setosa"
 8  },
 9  {
10    "sepal_length": 4.9,
11    "sepal_width": 3.0,
12    "petal_length": 1.4,
13    "petal_width": 0.2,
14    "species": "setosa"
15  }
16]

Reading JSON Files with Pandas

To read a JSON file into a Pandas DataFrame, use the read_json function:

1data = pd.read_json("data/iris_data.json")
2print(data.head())

Useful Arguments for read_json

  • Orientation (orient): Specify the file’s structure. Options include:

    • records: Each JSON object is a row.
    • index: Keys are row indices.
    1data = pd.read_json("data/iris_data.json", orient="records")
    
  • Lines (lines): For JSON files with one object per line:

    1data = pd.read_json("data/iris_data_lines.json", lines=True)
    

Writing JSON Files

To write a DataFrame to a JSON file, use the to_json function:

1data.to_json("output.json", orient="records", lines=True)

Handling Nested JSON

For deeply nested JSON files, consider using the json_normalize function from Pandas:

1from pandas import json_normalize
2import json
3
4with open("data/nested.json") as f:
5    nested_data = json.load(f)
6
7data = json_normalize(nested_data, record_path=["nested_key"], meta=["meta_key"])

Customizing the Reading Process: - sep: Specify the delimiter (default is a comma ,). - header: Define the row number to use as column names (default is the first row). - dtype: Set the data type for specific columns. - na_values: Define additional strings to recognize as missing values.

  • Example with Custom Parameters:

    1df = pd.read_csv(
    2    'file.csv',
    3    sep=',',  # Delimiter
    4    header=0,  # First row as header
    5    dtype={'column_name': 'int'},  # Specify data type
    6    na_values=['NA', 'null']  # Treat 'NA' and 'null' as missing values
    7)
    8print(df.info())  # Display DataFrame structure
    
  • Common Errors and Solutions:

    • FileNotFoundError: Ensure the file path is correct.
    • ParserError: Check if the file has a consistent structure and matches the specified delimiter.
    • MemoryError: Use the chunksize parameter for large files.

Conclusion

This document outlines methods for retrieving data from CSV and JSON files using Pandas. It also highlights key arguments and considerations for handling these formats effectively, ensuring smooth data processing workflows. By understanding the nuances of these file formats, you can streamline your data analysis tasks and avoid common pitfalls.


FAQ

Reading a CSV file into a Pandas DataFrame is one of the most common tasks in data analysis. A CSV (Comma-Separated Values) file is a plain text file that stores tabular data, where each line represents a row and columns are separated by a delimiter (usually a comma).

In Pandas, you can use the read_csv() method to load a CSV file into a DataFrame. This method provides several parameters to customize the reading process, such as specifying the delimiter, handling missing values, and defining data types for columns.

Basic Example:

1import pandas as pd
2df = pd.read_csv('file.csv', sep=',', header=0, dtype={'column_name': 'int'})

The ! command allows you to run bash commands from the Jupyter Notebook kernel. This is particularly useful for executing shell commands that interact with the operating system or perform file operations such as to downloading files or executing system commands directly from a Jupyter Notebook.

To read the given data into a Pandas DataFrame, you can use the read_csv() function with appropriate parameters. Since the data is space-separated, you need to specify the delimiter as a space (sep=' '), and you may also need to handle multiple spaces using the delim_whitespace=True parameter.

Example:

1import pandas as pd
2df = pd.read_csv('data.txt', sep=' ', header=None)

JSON is better for hierarchical or nested structured data, while CSV is more suitable for flat, tabular data. Example:

  • JSON: A dataset representing a company’s organizational structure, where each employee has nested details like name, position, and a list of subordinates.
  • CSV: A dataset containing sales records with columns like Date, Product, Quantity, and Price.

Yes, you can use the names parameter in pd.read_csv() to specify custom column names. Ensure to set header=None if the file does not have a header row.

You can handle missing values by using the na_values parameter in pd.read_csv() to define which values should be treated as null. Additionally, Pandas provides methods like fillna() and dropna() for further processing.

For nested JSON files, you can use the json_normalize() function from Pandas to flatten the structure and extract specific fields.

The file read_csv is used for reading flat, tabular data stored in CSV files, while read_json is designed for structured data in JSON format, supporting various orientations like records and index.

Use the chunksize parameter when working with large CSV files to read the data in smaller, manageable chunks instead of loading the entire file into memory.

Yes, you can use the to_json() function in Pandas to write a DataFrame to a JSON file. You can specify parameters like orient and lines to control the output format.

The orient parameter specifies the structure of the JSON file, such as records for row-based data or index for key-value pairs. It helps Pandas interpret the file correctly.