This document explains methods for retrieving data from SQL and NoSQL databases, APIs, and Cloud data sources, highlighting practical considerations and Python code examples for seamless data integration.
Retrieving Data from Different Sources
- SQL Databases: Structured Query Language databases are relational databases with fixed schemas. They are widely used for data storage and retrieval.
- NoSQL Databases: Non-relational databases that offer flexibility in data storage and retrieval. They are often faster and more scalable than SQL databases.
- APIs: Application Programming Interfaces allow access to data from various providers, enabling seamless integration with external data sources.
- Cloud Data Sources: Cloud platforms provide data storage and retrieval services, allowing users to access data from anywhere with an internet connection.
Working with SQL Databases
SQL (Structured Query Language) databases are relational databases with fixed schemas. Examples include Microsoft SQL Server, Postgres, MySQL, AWS Redshift, Oracle DB, and IBM Db2. Python libraries such as sqlite3, SQLAlchemy, Psycopg2 (for Postgres), and ibm_db (for Db2) can be used to connect to these databases.
Example Reading Data from SQLite
The following example demonstrates how to connect to an SQLite database and retrieve data using Python:
1import sqlite3 as sq3
2import pandas as pd
3
4# Define the database path
5path = "data/classic_rock.db"
6
7# Establish a connection
8con = sq3.connect(path)
9
10# Write the SQL query
11query = "SELECT * FROM rock_songs"
12
13# Execute the query and load the data into a Pandas DataFrame
14df = pd.read_sql(query, con)
This code retrieves all rows from the rock_songs table and stores them in a Pandas DataFrame.
Additional Example Downloading and Analyzing a Database
Download a database file and analyze its contents.
1# Download the database
2!wget -P data https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0232EN-SkillsNetwork/asset/baseball.db
3
4# Define the database path
5path = 'data/baseball.db'
6
7# Establish a connection
8con = sq3.connect(path)
9
10# Query to retrieve all data from a table
11query = "SELECT * FROM allstarfull"
12allstar_observations = pd.read_sql(query, con)
13
14# Query to evaluate top 3 players
15best_query = """
16SELECT playerID, sum(GP) AS num_games_played, AVG(startingPos) AS avg_starting_position
17FROM allstarfull
18GROUP BY playerID
19ORDER BY num_games_played DESC, avg_starting_position ASC
20LIMIT 3
21"""
22best_players = pd.read_sql(best_query, con)
23print(best_players)
This example demonstrates downloading a database, retrieving data, and analyzing it to evaluate the top 3 players.
Working with NoSQL Databases
Overview of NoSQL Databases
NoSQL databases are non-relational and vary in structure. They are often faster and more flexible than SQL databases for certain applications. Data is typically stored in JSON format. Common types of NoSQL databases include:
- Document Databases: Store data as documents (e.g., MongoDB).
- Key-Value Stores: Use key-value pairs for data storage.
- Graph Databases: Maintain relationships, such as social network connections.
- Wide Column Stores: Group related columns into column families.
Example: Reading Data from MongoDB
The following example demonstrates how to connect to a MongoDB database and retrieve data:
1from pymongo import MongoClient
2import pandas as pd
3
4# Establish a connection
5con = MongoClient("mongodb://username:password@host:port/")
6
7# Access a specific database
8db = con.database_name
9
10# Query a collection
11cursor = db.collection_name.find({}) # Select all documents
12
13# Convert the cursor to a Pandas DataFrame
14df = pd.DataFrame(list(cursor))
This code retrieves all documents from the specified collection and converts them into a Pandas DataFrame.
Working with APIs and Cloud Data Sources
APIs
APIs (Application Programming Interfaces) allow access to data from various providers. For example, Twitter APIs can be used to retrieve tweets, and Amazon APIs can provide marketing data. APIs simplify the process of connecting to external data sources.
Example: Reading Data from a URL
Data can also be accessed directly from online sources, such as the UC Irvine Machine Learning Library:
1import pandas as pd
2
3# Define the URL
4data_url = "https://example.com/dataset.csv"
5
6# Read the data into a Pandas DataFrame
7df = pd.read_csv(data_url)
This code retrieves a CSV file from the specified URL and loads it into a Pandas DataFrame.
Conclusion
This document provides an overview of retrieving data from SQL and NoSQL databases, APIs, and Cloud data sources. It includes practical Python examples for connecting to these sources and loading data into Pandas DataFrames. Understanding these methods is essential for efficient data integration and analysis.
Jupyter Notebook
Note: This exercise file is taken verbatim from the IBM Machine Learning course materials.
- Jupyter lite server is running in JupyterLite lab.
- Exercise file can be seen in this PDF.
FAQ
sqlite3, SQLAlchemy, or Psycopg2. SQL queries are used to fetch data from relational tables.boto3 for AWS, google-cloud-storage for Google Cloud, or azure-storage-blob for Azure.wget, curl or Python libraries like requests, and then analyzed using SQL queries with Python libraries.pymongo library, querying a collection, and converting the results into a Pandas DataFrame.pd.read_csv for CSV files or pd.read_json for JSON files.sqlite3, pymongo, and requests, which simplify data retrieval and integration from various sources, including databases, APIs, and cloud platforms.





