- Introduction to Data Science/Analytics
- Why do companies need data scientists/analysts?
- Data Analytics: OLAP vs Data Mining
- What is Data Science? Why Data Science?
- Data-driven product engineering
- How to become a data scientist, and the skill set of a data scientist
- Career opportunities & hiring companies
- Business problems in Data Science
- Predictive Analytics Problems: Classification, Regression, Recommenders. (Supervised techniques)
- Descriptive Analytics Problems: Frequent Pattern Mining, Clustering, Outlier Detection. (Unsupervised techniques)
- Prescriptive Analytics Problems: Predictive and Descriptive Problems.
- Types of Data:
- Structured data
- Unstructured data
- Semi-structured data
- Time Series data
- Business Verticals: Retail, Banking, Financial, Automobile, Social, Web, Medical, Scientific, Logistics, Real Estate, etc.
- Required Tools/technologies for Data Science
- Data Science Life Cycle for Analysis
- Required technologies for each phase of Data Science life cycle.
- Single Machine Analytic Platforms: R, Python, SAS, etc.
- Distributed Analytical Platforms: Hadoop, Spark, H2O
- Mastering the Python Language
- Python introduction and Installation
- Python basic topics
- Variables
- Decision making
- Loops
- Functions, etc.
- Python advanced topics
- Classes and OOP concepts
- Modules & packages
- File Handling
- Database handling, etc.
- Python advanced features
- Required Packages for Data Science in Python
Statistics and Mathematics for Data Scientists/Analysts
- Statistics
- Descriptive stats for single variable
- Mean, Median, Mode, Quantiles, Percentiles
- Standard Deviation, Variance
- MAD, IQR
- Descriptive stats for two variables
- Covariance
- Correlation
- Chi-squared Analysis
- Hypothesis Testing
- Inferential Statistics
- Linear Algebra.
- Ideas that need Linear Algebra
- Vector Algebra
- Ideas that map to vectors
- Understanding vector operations
- Matrix Algebra
- Ideas that map to matrices
- Understanding matrix operations
- Understanding eigenvalues and eigenvectors
- Concepts of basis
- Understanding factorization & Types
- Spectral factorization
- Eigen factorization
- SVD factorization
- Probability
- Basic Probability
- Conditional Probability
- Bayes Rule/Reasoning
- Mapping a random process to a random variable
- Properties of Random variables
- Probability Expectation
- Entropy and cross-entropy
- Estimating probability of Random variable
- Understanding standard random processes
- Understanding probability distributions
- Calculus for data scientists
- Rate of change
- Concept of limit
- Concept of derivative
- Partial derivatives & gradient
- Significance of gradient
- Concept of integration, etc.
- Data Visualizations
- Tabular form
- Using statistical methods: mean, median, mode, range, frequency, multi-dimensional tables, etc.
- Graphical form
- Bar graphs
- Histograms
- Pie graphs
- Area graphs
- Density graphs
- Scatter graphs
- Line graphs
- Box-and-whisker graphs
- Correlation graphs
- Facet plots, etc.
- Overview of Machine Learning Algorithms
- What is Machine Learning?
- ML – Software Development Life Cycle
- ML-SDLC Phases
- Data Collection
- Data Preparation
- Feature Engineering
- Model Building
- Model Evaluation
- Model Deployment
- Model Maintenance
- Types of Machine Learning Algorithms
- Supervised
- Unsupervised
- Semi-supervised
- Reinforcement Algorithms
- Data Collection Techniques
- Collecting data from Excel/csv/txt files
- Collecting data from databases
- Collecting data from services
- Collecting data via scraping (from Web)
- Data Preparation Techniques
- Structured Data Preparation
- Handling Missing Data
- Data Type Conversion
- Category to Numeric Conversion
- Numeric to Category Conversion
- Data Normalization: 0-1, Z-Score
- Handling Skewed Data: Box-Cox Idea
- Text Data Preparation/Preprocessing
- Noise removal (stop words, URLs, punctuation, etc.)
- Lexicon Normalization (Stemming & Lemmatization)
- Object Standardization (convert acronyms to dictionary words, grammar check, spell check, etc.)
- EDA (Numerical + Graphical) and Feature Engineering
- Exploring Individual Features
- Exploring Bi-Feature Relationships
- Exploring Multi-Feature Relationships
- Create new Features.
- Feature/Dimension Reduction: PCA
- Intuition behind PCA
- Covariance & Correlation
- Relating PCA to Covariance/Correlation
- Intuition to math
- Applications of PCA: Dimensionality Reduction, Image Compression
- Model Building (ML Algorithm Building)
- Mathematical understanding for each model
- Limitations for each model
- Tuning for each model
- Model scope by type of problem: Classification, Regression, Recommenders, and Association.
- Pros and Cons of each model
Supervised learning Models
- Decision trees
- Probabilistic learning (Naive Bayes)
- KNN Learning
- Linear regression
- Non-Linear regression
- Logistic regression
- Support Vector Machines (SVM)
- Ensemble Models
- Bagging
- Bagged trees
- Random Forest
- Extremely randomized trees (Extra Trees)
- Boosting
- AdaBoost
- Gradient Boosting
- Extreme Gradient Boosting
- Voting (by Stacking aggregation)
- Soft voting
- Hard voting
- Stacking
- Neural Network Model
- Apriori Model
Unsupervised Learning
- Clustering Models
- K-Means Model
- K-Medoids Model
- K-Centers Model
- Hierarchical Model
- Dimension reduction: Principal component analysis (PCA)
- Association Models.
Time Series Models
- Holt's Linear Trend
- Holt-Winters (extension of Holt's linear method for trend and seasonality)
- ARIMA Model
- SARIMAX
- Model Evaluation Techniques
- Repeated holdouts (R.H)
- K-fold Cross-validation
- Bootstrap
- Metrics-based evaluation
- Model Implementation
- Distributed/Big Data Analytics Overview
- Big Data Analytics Overview
- Platforms for Distributed Analytics: Hadoop, Spark, H2O
- Hadoop and Spark Overview & Comparison