Lesson 3.2: Data Collection Methods – APIs, Web Scraping, Databases
In Data Science, the first step is always collecting data. Without proper data, no analysis or model can be built. There are several methods of collecting data:
1. APIs (Application Programming Interfaces)
- APIs act as a bridge that allows applications to exchange data with servers.
- Examples:
  - Twitter API → collect tweets.
  - Weather API → fetch temperature and forecast data.
- Advantages: provides reliable, structured, and often real-time data.
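The typical API workflow is: send an HTTP request to an endpoint and parse the JSON response. A minimal sketch of the parsing step, where the field names and the canned payload are invented for illustration and stand in for a live weather API call:

```python
import json

# Hypothetical JSON payload, as a weather API might return it
# (field names here are made up for illustration).
sample_response = '{"city": "Delhi", "temp_c": 31.5, "forecast": "sunny"}'

def parse_weather(payload: str) -> dict:
    """Parse a JSON weather payload into a plain dict."""
    data = json.loads(payload)
    return {"city": data["city"], "temperature": data["temp_c"]}

# In a real project the payload would be fetched over HTTP first,
# e.g. with the `requests` library, before being parsed like this.
weather = parse_weather(sample_response)
print(weather["city"], weather["temperature"])  # Delhi 31.5
```

Because the response is structured JSON, no cleaning step is needed — one reason APIs are preferred when available.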
2. Web Scraping
- Web scraping means automatically extracting data from websites.
- Tools: BeautifulSoup, Scrapy, Selenium (Python libraries).
- Examples:
  - Collecting product prices from Amazon/Flipkart.
  - Extracting news headlines from websites.
- Note: always check the website’s Terms of Use to avoid legal issues.
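At its core, scraping means parsing HTML and pulling out the elements of interest. A sketch using only Python's standard-library `html.parser` (BeautifulSoup offers the same idea with a much friendlier API); the HTML snippet and its class names are invented for illustration:

```python
from html.parser import HTMLParser

# A tiny invented page; a real scraper would download this HTML first.
page = """
<html><body>
  <h2 class="headline">Markets rally on tech earnings</h2>
  <p>Some article text.</p>
  <h2 class="headline">New open-data portal launched</h2>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""
    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())

parser = HeadlineParser()
parser.feed(page)
print(parser.headlines)
```

With BeautifulSoup the same extraction collapses to roughly one `find_all` call, which is why it is the usual choice.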
3. Databases
- Data is often stored in databases (SQL or NoSQL).
- SQL databases: MySQL, PostgreSQL, Oracle (for structured data).
- NoSQL databases: MongoDB, Cassandra (for semi-structured/unstructured data).
- Data scientists use SQL queries to fetch large datasets.
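A minimal sketch of fetching data with a SQL query, using Python's built-in `sqlite3` driver and an in-memory table. The table and its values are invented; with a server database such as MySQL or PostgreSQL, only the connection setup would differ:

```python
import sqlite3

# In-memory database for illustration; a real project would connect
# to a server-based database such as MySQL or PostgreSQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("laptop", 55000.0), ("phone", 20000.0), ("laptop", 60000.0)],
)

# The kind of aggregate query a data scientist might run to pull a dataset.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('laptop', 115000.0), ('phone', 20000.0)]
conn.close()
```

Pushing aggregation into the SQL query, as here, lets the database do the heavy lifting instead of transferring every raw row.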
4. Other Sources
- CSV/Excel files → e.g., Kaggle datasets, government open data portals.
- Sensors/IoT devices → e.g., fitness trackers collecting health data.
- Surveys & questionnaires → data collected directly from users.
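Reading a CSV file is often the simplest of these routes. A sketch with Python's standard `csv` module, using an inline string (with invented fitness-tracker values) in place of a downloaded file; in practice `pandas.read_csv` is the usual one-liner:

```python
import csv
import io

# Inline CSV standing in for a downloaded dataset file.
raw = "name,steps\nAsha,9500\nRavi,12000\n"

# csv.DictReader maps each data row to a dict keyed by the header row.
rows = list(csv.DictReader(io.StringIO(raw)))
steps = [int(r["steps"]) for r in rows]
print(rows[0]["name"], sum(steps))  # Asha 21500
```

Note that `csv` yields every field as a string, so numeric columns need an explicit conversion like the `int(...)` above.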
✅ Summary:
- APIs → structured, reliable, real-time data.
- Web Scraping → extracting data from websites.
- Databases → SQL/NoSQL storage systems.
- Files/Sensors/Surveys → additional sources.
👉 The choice of data source depends on the project’s requirements and on what data is available.
