khuyentran1401/Extract-text-from-article

About this project

This project extracts the text of an article with the Newspaper3k Python library, then uses NLTK (Natural Language Toolkit) to preprocess the text and find the most common words in the article.

Tools

  • Newspaper3k: tool to scrape articles
  • NLTK: tool to process text

Steps

  • Scrape articles with newspaper3k
from newspaper import Article

url = 'https://mystudentvoices.com/it-took-me-2-years-to-get-1000-followers-life-lessons-ive-learned-throughout-the-journey-9bc44f2959f0'
article = Article(url)

article.download()
article.parse()  # parse the downloaded HTML so fields such as publish_date are populated
  • Find the publish date
article.publish_date
  • Extract image
  • Find the author
  • Find the keywords
  • Find the summary
  • Preprocessing with NLTK
    • Tokenize text
    • Lowercase and remove stopwords
  • Visualize the frequency of words with Matplotlib
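
The steps above can be sketched end to end as follows. This is a minimal illustration, not the repository's actual code. The helper names (`article_fields`, `word_frequencies`, `plot_top_words`) and the small `FALLBACK_STOPWORDS` set are assumptions introduced here. The NLTK tokenizer and stopword corpus are used when they are installed and their data has been downloaded; otherwise the sketch falls back to a simple regex tokenizer and the small illustrative stopword set.

```python
import re
from collections import Counter

# Illustrative fallback only (an assumption, not NLTK's stopword list).
FALLBACK_STOPWORDS = {"a", "an", "and", "the", "on", "in", "of", "to", "is", "it"}


def article_fields(url):
    """Download an article and return the fields listed in the steps."""
    from newspaper import Article  # imported lazily; requires newspaper3k

    article = Article(url)
    article.download()
    article.parse()  # populates text, authors, publish_date, top_image
    article.nlp()    # populates keywords and summary (needs NLTK data)
    return {
        "publish_date": article.publish_date,
        "top_image": article.top_image,
        "authors": article.authors,
        "keywords": article.keywords,
        "summary": article.summary,
        "text": article.text,
    }


def tokenize(text):
    """Tokenize with NLTK if available, else with a plain regex."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)
    except (ImportError, LookupError):
        return re.findall(r"[A-Za-z]+", text)


def load_stopwords():
    """Return NLTK's English stopwords if available, else the fallback set."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return set(FALLBACK_STOPWORDS)


def word_frequencies(text):
    """Lowercase, drop non-alphabetic tokens and stopwords, then count."""
    stop = load_stopwords()
    words = [t.lower() for t in tokenize(text) if t.isalpha()]
    return Counter(w for w in words if w not in stop)


def plot_top_words(counts, n=10):
    """Draw a bar chart of the n most common words."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    top = counts.most_common(n)
    if not top:
        return
    words, freqs = zip(*top)
    plt.bar(words, freqs)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
```

A typical use would be `fields = article_fields(url)`, then `counts = word_frequencies(fields["text"])`, then `plot_top_words(counts)`. Keeping the scraping, counting, and plotting in separate functions means the counting step can be tried on any string without network access.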

Tutorial blog

Find the Medium article for this repository here