Hi everyone! This is my first ever blog post. First, let me introduce myself: I am Srikar V, from India. I'm pursuing a Bachelor of Engineering in Computer Science and am currently in my second year. I aspire to be an AWS Machine Learning Specialist by the time I graduate, or at least somewhere near that timeline. I am nowhere near my goal yet, but I'm getting there slowly. I guess that's enough about myself, so let me introduce what this blog is about.
Text Analysis of Articles from URLs
This assignment was provided to me by a company to which I had applied for an internship. They provided an "input.xlsx" file containing a dataset of URLs from which I had to extract each article's title and content and perform some analysis on it. The analysis metrics included:
- POSITIVE SCORE
- NEGATIVE SCORE
- POLARITY SCORE
- SUBJECTIVITY SCORE
- AVG SENTENCE LENGTH
- PERCENTAGE OF COMPLEX WORDS
- FOG INDEX
- COMPLEX WORD COUNT
- WORD COUNT
- SYLLABLE PER WORD
- PERSONAL PRONOUNS
- AVG WORD LENGTH
*All of the work was done in a Jupyter Notebook.
Let’s go through this assignment step-by-step:
Data Extraction (Web Scraping)
The first thing I did was load the "input.xlsx" file into my notebook. I have used BeautifulSoup to extract the article titles and content. The article title is extracted from the first <h1> tag found in the HTML of each URL, and the content is extracted from all the <p> tags found in the HTML.
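Here is a minimal sketch of the loading step, assuming pandas and that the URL column in "input.xlsx" is named 'URL' (the exact column name may differ in your copy of the file):

import pandas as pd

# Load the dataset of URLs provided by the company
df = pd.read_excel('input.xlsx')

And here is the function snippet which I applied to the dataframe in my notebook: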
import requests #for fetching the page HTML
from bs4 import BeautifulSoup #for web scraping
from fake_useragent import UserAgent #for creating a fake user agent for the request headers

ua = UserAgent()

def url_text(url):
    paragraphs = []
    data = requests.get(url, headers={"User-Agent": ua.random}).text
    soup = BeautifulSoup(data, 'lxml')
    # the title comes from the first <h1> tag (guard against pages without one)
    h1 = soup.find('h1')
    title = h1.text.strip() if h1 else ''
    for paragraph in soup.find_all('p'):
        paragraphs.append(paragraph.text.strip())
    # join with a space so words at paragraph boundaries don't get glued together
    text = title + ' ' + ' '.join(paragraphs)
    return text
# we can apply this function on the dataframe to create a new column with the extracted text
df['ARTICLE'] = df['URL'].apply(url_text)
Now we have the article title and content in our data frame.
Data Analysis
We have to extract words from the articles and perform sentiment analysis.
Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral.
The company had provided us with text files that contained the positive words, negative words, and stop words.
Stop words are words that do not contribute or hold any value for sentiment analysis. Hence we filter these stop words out of the text before performing sentiment analysis using the positive and negative word lists.
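The snippet below is a minimal sketch of how these lists can be loaded into sets; the filenames 'positive-words.txt', 'negative-words.txt', and 'stopwords.txt' are assumptions, and the actual files in the assignment may be named differently. Everything is lowercased so that later comparisons are case-insensitive:

def load_words(path):
    # Read one word per line, ignore blanks, and lowercase everything
    with open(path, encoding='utf-8', errors='ignore') as f:
        return {line.strip().lower() for line in f if line.strip()}

positive_words = load_words('positive-words.txt')
negative_words = load_words('negative-words.txt')
stop_words = load_words('stopwords.txt')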
We are going to use the RegexpTokenizer from NLTK to tokenize the text and then filter out the stop words. Here is the function snippet:
from nltk.tokenize import RegexpTokenizer, sent_tokenize

def tokenz(text):
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(str(text))
    filtered = []  # 'filter' shadows the Python built-in, so use a different name
    for token in tokens:
        token = token.lower()  # the word lists are lowercase, so compare in lowercase
        if token not in stop_words:
            filtered.append(token)
    return filtered
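As a quick sanity check on a made-up sentence (the exact output depends on the supplied stop word list):

print(tokenz("The company reported strong growth, but analysts remain cautious."))
# e.g. ['company', 'reported', 'strong', 'growth', 'analysts', 'remain', 'cautious']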
We are now going to define functions to compute the variables mentioned above, one by one:
1. POSITIVE SCORE
This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
def positive_score(text):
    tokens = tokenz(text)
    pos_score = 0
    for token in tokens:
        if token in positive_words:
            pos_score += 1
    return pos_score
2. NEGATIVE SCORE
This score is calculated by assigning a value of -1 to each word found in the Negative Dictionary and then adding up all the values. We multiply the total by -1 so that the score is a positive number.
def negative_score(text):
    tokens = tokenz(text)
    neg_score = 0
    for token in tokens:
        if token in negative_words:
            neg_score -= 1
    return neg_score * -1
3. POLARITY SCORE
This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula:
Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)
Range is from -1 to +1
def polarity_score(pos_score, neg_score):
    polar_score = (pos_score - neg_score) / ((pos_score + neg_score) + 0.000001)
    return polar_score
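For example, an article with a positive score of 10 and a negative score of 5 gets a polarity score of (10 - 5) / (15 + 0.000001) ≈ 0.33, i.e. mildly positive.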
4. SUBJECTIVITY SCORE
This is the score that determines if a given text is objective or subjective. It is calculated by using the formula:
Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)
Range is from 0 to +1
def subjectivity_score(pos_score, neg_score, text):
    tokens = tokenz(text)
    sub_score = (pos_score + neg_score) / (len(tokens) + 0.000001)
    return sub_score
5. AVG SENTENCE LENGTH
Average Sentence Length = the number of words / the number of sentences
def avg_sentence_length(text):
    words = len(tokenz(text))
    # sent_tokenize (requires a one-time nltk.download('punkt')) splits sentences
    # more reliably than text.split('.'); max() avoids dividing by zero
    sentences = max(len(sent_tokenize(text)), 1)
    return words / sentences
6. PERCENTAGE OF COMPLEX WORDS
Percentage of Complex words = the number of complex words / the number of words
def perc_complex(text):
    tokens = tokenz(text)
    complex_cnt = 0
    for token in tokens:
        vowels = 0
        # words ending with 'es' or 'ed' are not treated as complex
        if token.endswith(('es', 'ed')):
            continue
        for l in token:
            if l in 'aeiou':
                vowels += 1
        if vowels > 2:
            complex_cnt += 1
    # divide by the total number of tokens, not the length of the last token
    if len(tokens) != 0:
        return complex_cnt / len(tokens)
    return 0
7. FOG INDEX
Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)
def fog_index(avg_sent_length, percentage_complex):
    return 0.4 * (avg_sent_length + percentage_complex)
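For example, an article with an average sentence length of 20 and a complex-word percentage of 0.25 gets a Fog Index of 0.4 * (20 + 0.25) = 8.1.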
8. COMPLEX WORD COUNT
Complex words are words in the text that contain more than two syllables.
def complex_word_count(text):
    tokens = tokenz(text)
    complex_cnt = 0
    for token in tokens:
        vowels = 0
        if token.endswith(('es', 'ed')):
            continue
        for l in token:
            if l in 'aeiou':
                vowels += 1
        if vowels > 2:
            complex_cnt += 1
    return complex_cnt
9. WORD COUNT
We count the total cleaned words present in the text by
- removing the stop words
- removing any punctuation like ? ! , . from the words before counting.
def word_count(text):
    cnt = len(tokenz(text))
    return cnt
10. SYLLABLE PER WORD
We count the number of syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions: words ending with "es" or "ed" are skipped, so their vowels are not counted as syllables.
def syllable_count_word(text):
    words = tokenz(text)
    words_cnt = len(words)
    vowels = 0
    for word in words:
        if word.endswith(('es', 'ed')):
            continue
        for l in word:
            if l in 'aeiou':
                vowels += 1
    # guard against empty articles to avoid dividing by zero
    if words_cnt != 0:
        return vowels / words_cnt
    return 0
11. PERSONAL PRONOUNS
To calculate the personal pronouns mentioned in the text, we use regex to find the counts of the words "I", "we", "my", "ours", and "us". Special care is taken so that the country name US is not included in the count: the regex match is case-sensitive, so the lowercase pattern "us" does not match "US".
import re

def cal_personal_pronouns(text):
    # case-sensitive on purpose: the lowercase 'us' will not match the country name 'US'
    pronoun_re = r'\b(I|my|we|us|ours)\b'
    matches = re.findall(pronoun_re, text)
    return len(matches)
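A quick check on a made-up sentence confirms the country name is excluded:

print(cal_personal_pronouns("I think we should leave the US before my visa expires."))
# 3 -> 'I', 'we' and 'my' are counted, while the uppercase 'US' is not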
12. AVG WORD LENGTH
Average Word Length is calculated by the formula:
Sum of the total number of characters in each word/Total number of words
def avg_word_length(text):
    words = tokenz(text)
    charcnt = 0
    for word in words:
        charcnt += len(word)  # tokens carry no surrounding whitespace, so strip() is unnecessary
    if len(words) != 0:
        return charcnt / len(words)
We have finished defining functions for each of the computed variables; now all we have to do is apply each of these functions and add a column for each computed variable to the dataframe.
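Here is a minimal sketch of that step, assuming the dataframe and functions defined above (the exact column names expected in the output file may differ):

df['POSITIVE SCORE'] = df['ARTICLE'].apply(positive_score)
df['NEGATIVE SCORE'] = df['ARTICLE'].apply(negative_score)
df['POLARITY SCORE'] = df.apply(lambda row: polarity_score(row['POSITIVE SCORE'], row['NEGATIVE SCORE']), axis=1)
df['SUBJECTIVITY SCORE'] = df.apply(lambda row: subjectivity_score(row['POSITIVE SCORE'], row['NEGATIVE SCORE'], row['ARTICLE']), axis=1)
df['AVG SENTENCE LENGTH'] = df['ARTICLE'].apply(avg_sentence_length)
df['PERCENTAGE OF COMPLEX WORDS'] = df['ARTICLE'].apply(perc_complex)
df['FOG INDEX'] = df.apply(lambda row: fog_index(row['AVG SENTENCE LENGTH'], row['PERCENTAGE OF COMPLEX WORDS']), axis=1)
df['COMPLEX WORD COUNT'] = df['ARTICLE'].apply(complex_word_count)
df['WORD COUNT'] = df['ARTICLE'].apply(word_count)
df['SYLLABLE PER WORD'] = df['ARTICLE'].apply(syllable_count_word)
df['PERSONAL PRONOUNS'] = df['ARTICLE'].apply(cal_personal_pronouns)
df['AVG WORD LENGTH'] = df['ARTICLE'].apply(avg_word_length)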
Data Export
The final task is pushing all the calculated variables to an Excel file using the following code:
# note: recent pandas versions no longer accept an 'encoding' argument here
df.to_excel("Output Data Structure.xlsx", index=False)
Thank you. I’m looking forward to any suggestions regarding my blog.