Topic Modeling in Text Mining : LDA

Ashish R.
13/05/2017

Latent Dirichlet allocation (LDA)

Topic modeling is a method for unsupervised classification of text documents, similar to clustering on numeric data, which finds natural groups of items even when we are not sure what we are looking for. In hard clustering an entity belongs to exactly one group, whereas in topic modeling a word can belong to several topics, each with a different probability. The input to the model is a single text document or a set of documents. The output groups the content into K topics, and a label for each topic is then inferred from the most important words associated with it. The number of topics K, which plays the same role as the number of clusters in cluster analysis, has to be chosen in advance using heuristics about how many topics can reasonably be extracted from the documents. LDA treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” with each other in terms of content, rather than being separated into discrete groups.
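The "mixture" idea can be made concrete with a few lines of Python. This is only a sketch: the two documents, the two topics, and every probability in `theta` and `beta` are made-up values chosen for illustration, not the output of a fitted model. It shows how the probability of seeing a word in a document is obtained by weighting each topic's word probability by that topic's share of the document.

```python
# Hypothetical toy values for illustration only (not fitted to any data).
# theta[d][k]: share of topic k in document d (each row sums to 1).
theta = {
    "doc_sports": [0.9, 0.1],
    "doc_mixed": [0.5, 0.5],
}
# beta[k][w]: probability of word w under topic k (each row sums to 1).
beta = [
    {"match": 0.5, "team": 0.4, "vote": 0.1},  # topic 0
    {"match": 0.1, "team": 0.2, "vote": 0.7},  # topic 1
]


def word_probability(doc, word):
    """P(word | doc) = sum over topics k of theta[doc][k] * beta[k][word]."""
    return sum(share * topic.get(word, 0.0) for share, topic in zip(theta[doc], beta))


for doc in theta:
    probs = {w: round(word_probability(doc, w), 3) for w in beta[0]}
    print(doc, probs)
```

Because both documents draw on both topics, just in different proportions, neither is forced into a single group, which is the overlap described above.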

When we ask an LDA model for K topics, it describes the set of documents in terms of K groups of words. Each word (token) in a group receives a beta value, the estimated probability that the topic generates that word. The larger the beta, the more important the word is to that topic. The 6 to 10 words with the largest beta values are typically used to decide which topic the group depicts. Naming the topic relies on human judgment: a person interprets the meaning of those words in the context of the collected documents.
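Picking the top words per topic is a simple ranking step. The sketch below assumes the beta values are already available as a dictionary per topic; the topic names, words, and numbers are hypothetical, and `top_terms` is a helper introduced here for illustration rather than part of any library.

```python
# Hypothetical per-topic word probabilities (beta); a fitted model would supply these.
beta = {
    "topic_1": {"match": 0.30, "team": 0.25, "goal": 0.20, "coach": 0.15, "vote": 0.10},
    "topic_2": {"vote": 0.35, "party": 0.30, "election": 0.20, "team": 0.10, "match": 0.05},
}


def top_terms(word_probs, n=3):
    """Return the n words with the largest beta, highest first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


for topic, word_probs in beta.items():
    print(topic, top_terms(word_probs))
```

A reader looking at the two lists would probably label the first topic "sports" and the second "politics"; that labeling step remains a human decision.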

How to determine the number of topics for a set of documents

Hierarchical cluster analysis is performed on the words collected from the corpus to determine how many clusters to form. Using a distance metric such as Levenshtein distance or Hamming distance, the distances among the words are plotted in a dendrogram. The vertical axis of the dendrogram shows the chosen distance metric. Based on how the words group together, we decide which distance to use as the cutoff, and that determines the appropriate number of groups for the set of documents. This is similar to hierarchical clustering of numeric data, where Euclidean distance is usually the default.
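The procedure can be sketched in plain Python without drawing the dendrogram. The example below is a minimal illustration under several assumptions: the word list is invented, single linkage is used because the lesson does not name a linkage rule, and the cutoff values are arbitrary. Levenshtein distance is used rather than Hamming distance because Hamming distance is only defined for strings of equal length.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_words(words, cutoff):
    """Single-linkage agglomerative clustering that stops at the cutoff distance."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        # Find the closest pair of clusters (distance = closest pair of members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(levenshtein(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break  # every remaining merge would exceed the cutoff
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical word list; in practice these would be tokens from the corpus.
words = ["cluster", "clusters", "clustering", "topic", "topics", "token"]
groups = cluster_words(words, cutoff=3)
print(len(groups), groups)
```

Lowering the cutoff yields more, smaller groups and raising it yields fewer, larger ones, which is the same trade-off as choosing where to cut a dendrogram.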

 
