A Simple Text Analysis of Articles from URLs Using Python

Srikar V · Sep 29, 2023

Hi guys! This is my first blog ever. First, let me introduce myself: I am Srikar V, from India. I'm pursuing my Bachelor of Engineering in Computer Science and am currently in my second year. I aspire to become an AWS Machine Learning Specialist by the time I graduate, or at least somewhere near that timeline. I am nowhere near that goal yet, but I'm getting there slowly. I guess that's enough about me, so let me introduce what this blog is about.

Text Analysis of Articles from URLs

This assignment was given to me by a company I had applied to for an internship. They provided an "input.xlsx" file containing a dataset of URLs, from which I had to extract the article title and content and then perform some analysis. The analysis metrics included:

  1. POSITIVE SCORE
  2. NEGATIVE SCORE
  3. POLARITY SCORE
  4. SUBJECTIVITY SCORE
  5. AVG SENTENCE LENGTH
  6. PERCENTAGE OF COMPLEX WORDS
  7. FOG INDEX
  8. COMPLEX WORD COUNT
  9. WORD COUNT
  10. SYLLABLE PER WORD
  11. PERSONAL PRONOUNS
  12. AVG WORD LENGTH

*All of the work was done in a Jupyter Notebook.

Let’s go through this assignment step-by-step:

Data Extraction (Web Scraping)

[Figure: web scraping flow]

The first thing I did was load the "input.xlsx" file into my notebook. I used BeautifulSoup to extract the article titles and content: the title is taken from the first <h1> tag found in the page's HTML, and the content is taken from all the <p> tags. Here is the function I applied to the dataframe in my notebook:

import requests  # for fetching the page HTML
from bs4 import BeautifulSoup  # for web scraping
from fake_useragent import UserAgent  # for creating a fake user agent for the request headers

ua = UserAgent()

def url_text(url):
    paragraphs = []
    # fetch the page with a browser-like User-Agent so the request is less likely to be blocked
    data = requests.get(url, headers={"User-Agent": ua.random}).text
    soup = BeautifulSoup(data, 'lxml')
    # article title: the first <h1> tag on the page
    title = soup.find_all('h1')[0].text.strip()
    # article content: every <p> tag on the page
    p = soup.find_all('p')
    for paragraph in p:
        paragraphs.append(paragraph.text.strip())
    # join with spaces so words from different paragraphs don't run together
    text = ' '.join(paragraphs)
    text = title + " " + text
    return text

# we can apply this function on the dataframe to create a new column with the extracted text
df['ARTICLE'] = df['URL'].apply(url_text)
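For completeness: the df dataframe above comes from reading the provided workbook with pandas. A minimal sketch, assuming the sheet stores the links in a URL column (as my input file did):

import pandas as pd

# read the provided workbook into a dataframe; the URL column holds the article links
df = pd.read_excel("input.xlsx")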

Now we have the article title and content in our data frame.

Data Analysis

We have to extract words from the article and perform sentiment analysis.

Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or neutral.

The company had provided us with text files containing the positive words, negative words, and stop words.

Stop words are words that do not contribute or hold any value for sentiment analysis. Hence we filter these stop words out of the text before scoring it against the positive and negative word lists.
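Before any scoring, these three lists need to be loaded into Python. Here is a minimal sketch of how they can be read into sets; the exact file names and encoding are assumptions based on the files supplied with the assignment:

# load the provided word lists into sets for fast membership checks
# (file names and encoding are placeholders for the files supplied with the assignment)
with open("positive-words.txt", encoding="latin-1") as f:
    positive_words = set(f.read().split())

with open("negative-words.txt", encoding="latin-1") as f:
    negative_words = set(f.read().split())

with open("stopwords.txt", encoding="latin-1") as f:
    stop_words = set(f.read().split())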

We are going to use the RegexpTokenizer from NLTK to tokenize the text and then filter out the stop words. Here is the function snippet:

from nltk.tokenize import RegexpTokenizer

def tokenz(text):
    # split the text into word tokens (alphanumeric runs only, punctuation dropped)
    tokenizer = RegexpTokenizer(r'\w+')
    text = str(text)
    tokens = tokenizer.tokenize(text)
    # keep only the tokens that are not stop words
    filtered = []
    for token in tokens:
        if token not in stop_words:
            filtered.append(token)
    return filtered

We are going to define functions to compute each of the variables mentioned above, one by one:

  1. POSITIVE SCORE

This score is calculated by assigning a value of +1 to each word found in the positive dictionary and then adding up all the values.

def positive_score(text):
    tokens = tokenz(text)
    pos_score = 0
    # +1 for every token found in the positive word list
    for token in tokens:
        if token in positive_words:
            pos_score += 1
    return pos_score

2. NEGATIVE SCORE

This score is calculated by assigning a value of -1 to each word found in the negative dictionary and then adding up all the values. We multiply the total by -1 so that the reported score is a positive number.

def negative_score(text):
    tokens = tokenz(text)
    neg_score = 0
    # -1 for every token found in the negative word list
    for token in tokens:
        if token in negative_words:
            neg_score -= 1
    # flip the sign so the reported score is positive
    return neg_score * -1

3. POLARITY SCORE

This is the score that determines if a given text is positive or negative in nature. It is calculated by using the formula:

Polarity Score = (Positive Score - Negative Score) / ((Positive Score + Negative Score) + 0.000001)

Range is from -1 to +1

def polarity_score(pos_score, neg_score):
    # the +0.000001 avoids division by zero when both scores are 0
    polar_score = (pos_score - neg_score) / ((pos_score + neg_score) + 0.000001)
    return polar_score

4. SUBJECTIVITY SCORE

This is the score that determines if a given text is objective or subjective. It is calculated by using the formula:

Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

Range is from 0 to +1

def subjectivity_score(pos_score, neg_score, text):
    # total words after cleaning = tokens left after stop-word removal
    tokens = tokenz(text)
    sub_score = (pos_score + neg_score) / (len(tokens) + 0.000001)
    return sub_score

5. AVG SENTENCE LENGTH

Average Sentence Length = the number of words / the number of sentences

def avg_sentence_length(text):
    # cleaned word count divided by sentence count (sentences approximated by splitting on '.')
    words = len(tokenz(text))
    sentences = len(text.split('.'))
    avg_length = words / sentences
    return avg_length

6. PERCENTAGE OF COMPLEX WORDS

Percentage of Complex words = the number of complex words / the number of words

def perc_complex(text):
    tokens = tokenz(text)
    complex_cnt = 0
    for token in tokens:
        vowels = 0
        # exception: words ending in "es"/"ed" are not counted as complex
        if token.endswith(('es', 'ed')):
            pass
        else:
            for l in token:
                if l in ('a', 'e', 'i', 'o', 'u'):
                    vowels += 1
            # more than two vowels (syllables) -> complex word
            if vowels > 2:
                complex_cnt += 1
    # divide by the total number of cleaned words, not the length of a single token
    if len(tokens) != 0:
        return complex_cnt / len(tokens)

7. FOG INDEX

Fog Index = 0.4 * (Average Sentence Length + Percentage of Complex words)

def fog_index(avg_sent_length, percentage_complex):
    return 0.4 * (avg_sent_length + percentage_complex)

8. COMPLEX WORD COUNT

Complex words are words in the text that contain more than two syllables.

def complex_word_count(text):
    tokens = tokenz(text)
    complex_cnt = 0
    for token in tokens:
        vowels = 0
        # exception: words ending in "es"/"ed" are not counted as complex
        if token.endswith(('es', 'ed')):
            pass
        else:
            for l in token:
                if l in ('a', 'e', 'i', 'o', 'u'):
                    vowels += 1
            if vowels > 2:
                complex_cnt += 1
    return complex_cnt

9. WORD COUNT

We count the total cleaned words present in the text by

  1. removing the stop words
  2. removing any punctuation like ? ! , . from the words before counting.

Both steps are already handled by the tokenz function (the regex tokenizer strips punctuation and the stop words are filtered out), so we simply count its output:

def word_count(text):
    # tokenz already strips punctuation and removes stop words
    cnt = len(tokenz(text))
    return cnt

10. SYLLABLE PER WORD

We count the number of syllables in each word of the text by counting the vowels present in each word. We also handle some exceptions, like words ending with "es" or "ed", by not counting those as syllables.

def syllable_count_word(text):
    words = tokenz(text)
    words_cnt = len(words)
    vowels = 0
    for word in words:
        # exception: skip vowel counting for words ending in "es"/"ed"
        if word.endswith(('es', 'ed')):
            pass
        else:
            for l in word:
                if l in ('a', 'e', 'i', 'o', 'u'):
                    vowels += 1
    # average syllables (vowels) per word
    return vowels / words_cnt

11. PERSONAL PRONOUNS

To calculate the personal pronouns mentioned in the text, we use regex to find the counts of the words "I", "we", "my", "ours", and "us". Special care is taken so that the country name US is not counted.

import re

def cal_personal_pronouns(text):
    # the match is case-sensitive, so the country name "US" is not counted as "us"
    pronoun_re = r'\b(I|my|we|us|ours)\b'
    matches = re.findall(pronoun_re, text)
    return len(matches)

12. AVG WORD LENGTH

Average Word Length is calculated by the formula:

Sum of the total number of characters in each word/Total number of words

def avg_word_length(text):
    words = tokenz(text)
    charcnt = 0
    for word in words:
        charcnt += len(word.strip())
    # guard against empty articles
    if len(words) != 0:
        return charcnt / len(words)

We have finished defining functions for each of the computed variables; now all we have to do is apply each function to the dataframe and add a column for each computed variable, as sketched below.
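A minimal sketch of this step (the column names follow the metric list above; the exact layout of the output sheet was defined by the assignment):

# apply each metric function to the extracted article text, one column per variable
df['POSITIVE SCORE'] = df['ARTICLE'].apply(positive_score)
df['NEGATIVE SCORE'] = df['ARTICLE'].apply(negative_score)
df['POLARITY SCORE'] = df.apply(lambda r: polarity_score(r['POSITIVE SCORE'], r['NEGATIVE SCORE']), axis=1)
df['SUBJECTIVITY SCORE'] = df.apply(lambda r: subjectivity_score(r['POSITIVE SCORE'], r['NEGATIVE SCORE'], r['ARTICLE']), axis=1)
df['AVG SENTENCE LENGTH'] = df['ARTICLE'].apply(avg_sentence_length)
df['PERCENTAGE OF COMPLEX WORDS'] = df['ARTICLE'].apply(perc_complex)
df['FOG INDEX'] = df.apply(lambda r: fog_index(r['AVG SENTENCE LENGTH'], r['PERCENTAGE OF COMPLEX WORDS']), axis=1)
df['COMPLEX WORD COUNT'] = df['ARTICLE'].apply(complex_word_count)
df['WORD COUNT'] = df['ARTICLE'].apply(word_count)
df['SYLLABLE PER WORD'] = df['ARTICLE'].apply(syllable_count_word)
df['PERSONAL PRONOUNS'] = df['ARTICLE'].apply(cal_personal_pronouns)
df['AVG WORD LENGTH'] = df['ARTICLE'].apply(avg_word_length)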

Data Export

The final task is writing all the calculated variables to an Excel file using the following code:

df.to_excel("Output Data Structure.xlsx")

Thank you. I’m looking forward to any suggestions regarding my blog.
