TF-IDF is a powerful way to classify text data. It is a well-known method for measuring how important a word is in a piece of text. This Wikipedia article has a brief introduction to TF-IDF. Now let's try to understand what we want to achieve using classification.
There are many scenarios where we need to classify text. The text can be one line, two lines, or a thousand lines. For example,
How does a news article about a deal between ITC and Brookfield get classified as "Business"?
How does a news article about the OnePlus 7 series of mobile phones get classified as "Technology"?
How does a news article on Bollywood celebrity Boney Kapoor and his daughter get classified as "Entertainment"?
In order to classify any text into categories, we require a set of pre-classified data. For example, in the above classification problem, we need a few news articles along with the category of each article. This data is usually prepared as a CSV file with the news article in the first column and the category in the second column. We call it the "Training Data".
There are many other examples of such classification problems.
Let's take one classification problem and try to build a solution:
Classify a property address into Indian city names.
As shown above, we want to predict the city name of a property address. As discussed earlier, we require training data with "Property Address" as the first column and "City Name" as the second column. We need as much training data as possible: ideally, all the city names, location names, road names, building names, and popular landmarks of all the cities in India. The more training data we have, the more accuracy we can achieve.
For the sake of understanding, let's take limited training data, only sufficient to predict these 4 property addresses.
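Purely as a hypothetical illustration of the format (the actual training rows differ), Training_Data.csv might look like this, with one comma-separated address per row alongside its city:

Property_Address,City_Name
"Vallabh Apartment, Joshi Lane, Pant Nagar, MG Road",Mumbai
"Janak Park, Shastri Nagar, Hari Nagar, MG Road",Delhi
"Yellappa Garden, Craig Park Layout, MG Road",Bangalore
"Thiruvanmiyur, MG Road",Chennai
...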
We can convert these addresses into individual keywords by splitting each address on "," (comma). A small Python script does the job.
import numpy as np
import pandas as pd

training_data = pd.read_csv("Training_Data.csv")

# Collect one row per (keyword, city) pair; building a plain list and creating
# the DataFrame once avoids the deprecated DataFrame.append
rows = []
for i in range(len(training_data)):
    keywords = training_data.iloc[i]['Property_Address'].split(',')
    for word in keywords:
        rows.append({'Training_Keyword': word.strip(),
                     'City_Name': training_data.iloc[i]['City_Name'],
                     'Keyword_Count_in_City': 1})
training_data_keywords = pd.DataFrame(rows)

# Aggregate duplicates so each (keyword, city) pair appears once with its total count
training_data_keywords = training_data_keywords.pivot_table(
    index=['Training_Keyword', 'City_Name'], values=['Keyword_Count_in_City'], aggfunc='sum')
training_data_keywords = pd.DataFrame(training_data_keywords.to_records())
Now let's try to predict which city an address belongs to. Let's take the third address, "Sree Krishna Medicos, Shaheed Mangal Pandey Marg, Janak Park, Hari Enclave, Hari Nagar", and break it into 5 keywords (we call these address-keywords).
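For instance, splitting on commas and trimming the whitespace gives:

input_address = 'Sree Krishna Medicos, Shaheed Mangal Pandey Marg, Janak Park, Hari Enclave, Hari Nagar'
address_keywords = [keyword.strip() for keyword in input_address.split(',')]
print(address_keywords)
# ['Sree Krishna Medicos', 'Shaheed Mangal Pandey Marg', 'Janak Park', 'Hari Enclave', 'Hari Nagar']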
Now, we create a simple matrix with the above address-keywords as rows and city names as columns. Each cell represents the number of times a particular keyword occurs in that city.
By adding up the occurrences for each city and taking the city with the maximum number of occurrences, we can fairly confidently predict that the third address belongs to Delhi.
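As a sketch, assuming the training_data_keywords dataframe and the address_keywords list built above, the same matrix and scoring can be produced directly with pandas:

# Keyword-by-city count matrix, restricted to the address-keywords
matrix = (training_data_keywords
          .pivot_table(index='Training_Keyword', columns='City_Name',
                       values='Keyword_Count_in_City', aggfunc='sum', fill_value=0)
          .reindex(address_keywords, fill_value=0))
# Sum each city's column and pick the city with the highest total
print(matrix.sum().idxmax())  # expected: Delhi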
A function can be built in Python to achieve this.
def predict_city(input_address, training_data_keywords):
    # Split the input address into trimmed address-keywords
    address_keyword = input_address.split(',')
    address_keyword = pd.DataFrame(address_keyword, columns=['Address_Keyword'])
    address_keyword['Address_Keyword'] = address_keyword['Address_Keyword'].apply(lambda x: x.strip())
    max_score = 0
    predicted_city = ''
    for city in training_data_keywords['City_Name'].unique():
        # Keep only the training keywords belonging to the current city
        single_city_training_data_keywords = training_data_keywords[training_data_keywords['City_Name'] == city]
        # Match address-keywords against the city's keywords; unmatched keywords get a count of 0
        merged_data = pd.merge(address_keyword, single_city_training_data_keywords, how='left',
                               left_on=['Address_Keyword'], right_on=['Training_Keyword']).fillna(0)
        # Score each address-keyword by its raw occurrence count in this city
        address_keyword[city] = address_keyword['Address_Keyword'].apply(
            lambda x: sum(merged_data[merged_data['Address_Keyword'] == x]['Keyword_Count_in_City']))
        curr_score = sum(address_keyword[city])
        if max_score < curr_score:
            max_score = curr_score
            predicted_city = city
    return predicted_city

input_address = 'Sree Krishna Medicos, Shaheed Mangal Pandey Marg, Janak Park, Hari Enclave, Hari Nagar'
predicted_city = predict_city(input_address, training_data_keywords)
print("Predicted City: ", predicted_city)  # Delhi
What if things are a little more complex? In any text classification problem, there is a possibility that one or more keywords appear in multiple categories (in our case, cities). For example, "Joshi Lane" and "Pant Nagar" each appear twice in the whole training data, and "MG Road" appears 5 times (once in each city).
Try to predict the city name for the first address using the above method.
We have a tie between Mumbai and Delhi. While working with a large volume of training data, this problem is bound to occur. A better way to deal with it is to find out "how important a keyword is to the city" and then predict the city name based on this "importance".
TF-IDF is used to calculate the importance of a keyword.
The first 2 columns (Training_Keyword and City_Name) are our training data. Now let's see how the other columns are prepared.
Keyword_Count_in_City: the number of times a particular keyword occurs in a city. Here, keywords like "Joshi Lane", "Janak Park", "Yellappa Garden" or "MG Road" each occur once in their city, hence the keyword count in city is 1 for all keywords.
Total_Keywords_in_City: the total number of keywords in a city. You will notice that this value is the same for every keyword belonging to the same city.
Total_Number_of_Documents: the total count of cities in our training data. In total, we have training data for 5 cities.
Number_of_cities_having_keyword: the number of cities in which a particular keyword occurs. Keywords like "Janak Park" or "Shastri Nagar" occur once in the whole training data, while "MG Road" occurs 5 times (once in each city).
Here, a "term" represents a training-keyword and a "document" represents a city.
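Written out with the column names used above, the standard definitions become:

TF(keyword, city) = Keyword_Count_in_City / Total_Keywords_in_City
IDF(keyword) = ln(Total_Number_of_Documents / Number_of_cities_having_keyword)
Weight(keyword, city) = TF(keyword, city) × IDF(keyword)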
TF represents the weight of a keyword within a city. In our example, we did not break keywords into individual words, or else the word "Nagar" in Delhi would have a higher term frequency and hence a higher weight.
IDF represents the "importance" of a keyword to a city. Note that "Vallabh Apartment" in Mumbai has more importance than "Joshi Lane" or "Pant Nagar", because the latter keywords are also present in Delhi. "Vallabh Apartment" is unique to Mumbai and hence commands higher importance. Similarly, keywords like "Craig Park Layout" in Bangalore or "Thiruvanmiyur" in Chennai have higher importance.
By now, you must have noticed that "MG Road" has the least importance, 0 to be precise.
Total number of documents (cities) = 5
Total number of documents (cities) with "MG Road" = 5
So, IDF("MG Road") = ln(5/5) = ln(1) = 0
The keywords which are present in all cities have 0 importance, and the keywords unique to a city have the highest importance.
TF alone cannot be used as the weight, as words like "Nagar", "Road" and "Park" are too common in property addresses and can lead to lower accuracy in prediction. IDF, on the other hand, reduces the importance of such keywords because they are common across cities. Hence, TF and IDF together can be used as the weight for our classification problem.
The weight is derived by multiplying TF and IDF, which normalises the raw frequencies. We call it the "Weight" of a keyword.
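As a quick worked example with the numbers above: "MG Road" occurs in all 5 cities, so Weight("MG Road") = TF × ln(5/5) = TF × 0 = 0 in every city, whereas a keyword unique to a single city, such as "Vallabh Apartment" in Mumbai, gets Weight = TF × ln(5/1) ≈ TF × 1.609.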
Let us create a similar matrix to the earlier one. Here, each cell represents the "Weight" of that keyword in that city.
This can be achieved by adding the TF and IDF values to our dataframe and making a small change to the prediction function.
# Total number of keywords in each city (same value for every keyword of that city)
training_data_keywords['Total_Keywords_in_City'] = training_data_keywords['City_Name'].apply(
    lambda x: len(training_data_keywords[training_data_keywords['City_Name'] == x]))
# Total number of documents (cities) in the training data; constant for every row
training_data_keywords['Total_Number_of_Documents'] = len(training_data_keywords['City_Name'].unique())
# Number of distinct cities in which each keyword occurs
training_data_keywords['Number_of_cities_having_keyword'] = training_data_keywords['Training_Keyword'].apply(
    lambda x: len(training_data_keywords[training_data_keywords['Training_Keyword'] == x]['City_Name'].unique()))
# TF, IDF and their product, as per the formulas above
training_data_keywords['Term_Frequency_(TF)'] = training_data_keywords['Keyword_Count_in_City'].astype(float) / training_data_keywords['Total_Keywords_in_City'].astype(float)
training_data_keywords['Inverse_Document_Frequency_(IDF)'] = np.log(training_data_keywords['Total_Number_of_Documents'].astype(float) / training_data_keywords['Number_of_cities_having_keyword'].astype(float))
training_data_keywords['Weight_(TFxIDF)'] = training_data_keywords['Term_Frequency_(TF)'] * training_data_keywords['Inverse_Document_Frequency_(IDF)']
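As a quick sanity check (a sketch, assuming the dataframe built above), a keyword that occurs in every city, such as "MG Road", should now carry an IDF and weight of 0:

print(training_data_keywords.loc[training_data_keywords['Training_Keyword'] == 'MG Road',
                                 ['City_Name', 'Inverse_Document_Frequency_(IDF)', 'Weight_(TFxIDF)']])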
Now, replace the column "Keyword_Count_in_City" with "Weight_(TFxIDF)" in the previous prediction function.
def predict_city_using_tfidf(input_address, training_data_keywords):
    # Split the input address into trimmed address-keywords
    address_keyword = input_address.split(',')
    address_keyword = pd.DataFrame(address_keyword, columns=['Address_Keyword'])
    address_keyword['Address_Keyword'] = address_keyword['Address_Keyword'].apply(lambda x: x.strip())
    max_score = 0
    predicted_city = ''
    for city in training_data_keywords['City_Name'].unique():
        # Keep only the training keywords belonging to the current city
        single_city_training_data_keywords = training_data_keywords[training_data_keywords['City_Name'] == city]
        # Match address-keywords against the city's keywords; unmatched keywords get weight 0
        merged_data = pd.merge(address_keyword, single_city_training_data_keywords, how='left',
                               left_on=['Address_Keyword'], right_on=['Training_Keyword']).fillna(0)
        # Score each address-keyword by its TF-IDF weight instead of its raw count
        address_keyword[city] = address_keyword['Address_Keyword'].apply(
            lambda x: sum(merged_data[merged_data['Address_Keyword'] == x]['Weight_(TFxIDF)']))
        curr_score = sum(address_keyword[city])
        if max_score < curr_score:
            max_score = curr_score
            predicted_city = city
    return predicted_city
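Usage mirrors the earlier function; as a sketch, here it is re-run on the third address (the first address, which produced the tie, can be tested the same way):

input_address = 'Sree Krishna Medicos, Shaheed Mangal Pandey Marg, Janak Park, Hari Enclave, Hari Nagar'
predicted_city = predict_city_using_tfidf(input_address, training_data_keywords)
print("Predicted City: ", predicted_city)  # expected: Delhi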