<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Module-2 on Ghafoor's Personal Blog</title><link>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/</link><description>Recent content in Module-2 on Ghafoor's Personal Blog</description><generator>Hugo</generator><language>en</language><managingEditor>noreply@example.com (AG Sayyed)</managingEditor><webMaster>noreply@example.com (AG Sayyed)</webMaster><copyright>Copyright © 2024-2026 AG Sayyed. All Rights Reserved.</copyright><atom:link href="http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/index.xml" rel="self" type="application/rss+xml"/><item><title>Outliers and Missing Values</title><link>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/004-outliers-part-1/</link><pubDate>Mon, 31 Mar 2025 14:11:35 +0000</pubDate><author>noreply@example.com (AG Sayyed)</author><guid>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/004-outliers-part-1/</guid><description>&lt;p class="lead text-primary"&gt;
This document provides a comprehensive guide on handling missing values and outliers in datasets, including techniques for detection, imputation, and removal, along with practical Python code examples. It also discusses the impact of these issues on machine learning models and offers strategies for effective data preprocessing.
&lt;/p&gt;


&lt;hr&gt;
&lt;h2 id="understanding-data-quality-issues"&gt;Understanding Data Quality Issues&lt;/h2&gt;
&lt;p&gt;Before diving into specific techniques, it&amp;rsquo;s important to understand why missing values and outliers matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Impact on Model Performance&lt;/strong&gt;: Unhandled missing values and outliers can distort what a model learns, reducing its accuracy and reliability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bias Introduction&lt;/strong&gt;: Improper handling can lead to biased models that don&amp;rsquo;t generalize well&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Integrity&lt;/strong&gt;: Missing values and outliers often signal problems in data collection or processing that need addressing&lt;/li&gt;
&lt;/ul&gt;
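&lt;p&gt;As a rough illustration of these points, the sketch below flags outliers with the interquartile range (IQR) rule and fills missing entries with the median, using only the Python standard library. The sample readings, the 1.5 multiplier, and the helper names are assumptions chosen for illustration, not values taken from this course.&lt;/p&gt;

```python
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [12.0, 13.5, None, 12.8, 95.0, 13.1, None, 12.4]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences from the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean(values, k=1.5):
    """Drop values outside the IQR fences, then fill gaps with the median."""
    observed = [v for v in values if v is not None]
    lower, upper = iqr_bounds(observed, k)

    def inside(v):
        # Clamping v to the fences leaves it unchanged only if it lies within them.
        return max(lower, min(v, upper)) == v

    # The fill value is the median of the observed values that are not outliers.
    fill = statistics.median([v for v in observed if inside(v)])
    return [fill if v is None else v for v in values if v is None or inside(v)]


cleaned = clean(readings)
print(cleaned)
```

&lt;p&gt;Dropping outliers is only one option; depending on the data, capping them or keeping them may be more appropriate.&lt;/p&gt;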
&lt;p&gt;A systematic approach to handling these issues is essential for building robust machine learning models.&lt;/p&gt;</description></item><item><title>Data Cleaning</title><link>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/003-data-cleaning/</link><pubDate>Mon, 31 Mar 2025 13:57:30 +0000</pubDate><author>noreply@example.com (AG Sayyed)</author><guid>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/003-data-cleaning/</guid><description>&lt;p class="lead text-primary"&gt;
This document explains the importance of data cleaning in machine learning, common issues with messy data, and methods for handling duplicate data to ensure reliable model outcomes.
&lt;/p&gt;


&lt;hr&gt;
&lt;h2 id="importance-of-data-cleaning"&gt;Importance of Data Cleaning&lt;/h2&gt;
&lt;p&gt;Data cleaning is a critical step in the machine learning workflow. Models rely on accurate and clean data to produce reliable outcomes. Messy data can misrepresent relationships between features and targets, leading to the &amp;ldquo;garbage-in, garbage-out&amp;rdquo; effect. Key aspects affected by messy data include:&lt;/p&gt;</description></item><item><title>Retrieving Data from SQL and NoSQL Databases, APIs, and Cloud Data Sources</title><link>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/002-retrieving-data-part-2/</link><pubDate>Sun, 30 Mar 2025 20:47:51 +0000</pubDate><author>noreply@example.com (AG Sayyed)</author><guid>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/002-retrieving-data-part-2/</guid><description>&lt;p class="lead text-primary"&gt;
This document explains methods for retrieving data from SQL and NoSQL databases, APIs, and Cloud data sources, highlighting practical considerations and Python code examples for seamless data integration.
&lt;/p&gt;


&lt;hr&gt;
&lt;h2 id="retrieving-data-from-different-sources"&gt;Retrieving Data from Different Sources&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;SQL Databases&lt;/strong&gt;: Structured Query Language databases are relational databases with fixed schemas. They are widely used for data storage and retrieval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NoSQL Databases&lt;/strong&gt;: Non-relational databases that offer flexible schemas for data storage and retrieval. For some workloads they can be faster and easier to scale than SQL databases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;APIs&lt;/strong&gt;: Application Programming Interfaces allow access to data from various providers, enabling seamless integration with external data sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud Data Sources&lt;/strong&gt;: Cloud platforms provide data storage and retrieval services, allowing users to access data from anywhere with an internet connection.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="working-with-sql-databases"&gt;Working with SQL Databases&lt;/h2&gt;
&lt;p&gt;SQL (Structured Query Language) databases are relational databases with fixed schemas. Examples include Microsoft SQL Server, PostgreSQL, MySQL, Amazon Redshift, Oracle Database, and IBM Db2. Python libraries such as &lt;code&gt;sqlite3&lt;/code&gt; (for SQLite), &lt;code&gt;SQLAlchemy&lt;/code&gt;, &lt;code&gt;psycopg2&lt;/code&gt; (for PostgreSQL), and &lt;code&gt;ibm_db&lt;/code&gt; (for Db2) can be used to connect to such databases.&lt;/p&gt;</description></item><item><title>Retrieving Data from CSV and JSON Files</title><link>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/001-retrieving-data-part-1/</link><pubDate>Sun, 30 Mar 2025 20:43:48 +0000</pubDate><author>noreply@example.com (AG Sayyed)</author><guid>http://ghafoorsblog.com/courses/ibm/ml-content/ml-pcert/01-data-analysis-for-ml/02-module/001-retrieving-data-part-1/</guid><description>&lt;p class="lead text-primary"&gt;
This document explains methods for retrieving data from various sources, including CSV and JSON files, and highlights practical considerations when working with these formats using Python and Pandas.
&lt;/p&gt;


&lt;hr&gt;
&lt;h2 id="retrieving-data-from-different-sources"&gt;Retrieving Data from Different Sources&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;CSV Files&lt;/strong&gt;: Comma-separated values files are widely used for storing tabular data. They can be easily read into Pandas DataFrames.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JSON Files&lt;/strong&gt;: JavaScript Object Notation files are commonly used for structured data storage. They can also be read into Pandas DataFrames.&lt;/li&gt;
&lt;/ol&gt;
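&lt;p&gt;A minimal sketch of reading both formats, assuming pandas is installed. The inline sample rows and column names are made up for illustration and stand in for real files; with files on disk, a path would be passed instead of the &lt;code&gt;StringIO&lt;/code&gt; buffers.&lt;/p&gt;

```python
import io

import pandas as pd

# Hypothetical sample data standing in for real CSV and JSON files.
csv_text = "sepal_length,species\n5.1,setosa\n7.0,versicolor\n"
json_text = (
    '[{"sepal_length": 5.1, "species": "setosa"},'
    ' {"sepal_length": 7.0, "species": "versicolor"}]'
)

# read_csv and read_json accept a file path or a file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# Both calls should yield DataFrames with the same columns and rows.
print(csv_df.shape, json_df.shape)
```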
&lt;h3 id="downloading-data-files"&gt;Downloading Data Files&lt;/h3&gt;
&lt;p&gt;This exercise uses the Iris dataset, which contains information about different species of iris flowers. The dataset is available in both CSV and JSON formats. Download the files from the following links:&lt;/p&gt;</description></item></channel></rss>