Yesterday, while using the pelican_link_to_title plugin, I ran into a 'no title found' problem. Let's focus on that and fix it.

To The Point

The issue

Yesterday I wanted to use the pelican_link_to_title plugin with a link to a file instead of an HTML page. I forgot that the plugin reads the link target as an HTML page, which causes a problem when the link points to a file.

The error I got when using a link to a file was similar to this:

plugins/pelican_link_to_title/pelican_link_to_title.py", line 36, in read_page
    title = soup.find("title").string
AttributeError: 'NoneType' object has no attribute 'string'
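
The traceback comes from calling `.string` on the result of `soup.find("title")`, which is `None` when the content has no `<title>` element. As a minimal sketch of the same situation, here is a title extractor built on the standard library's `html.parser` instead of BeautifulSoup. The names `TitleParser` and `extract_title` are hypothetical and not part of the plugin. It returns `None` for content without a title, so the caller has to check the result before using it.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        text = "".join(self._chunks).strip()
        return text or None


def extract_title(content):
    """Return the page title, or None when the content has no <title>."""
    parser = TitleParser()
    parser.feed(content)
    parser.close()
    return parser.title


# Made-up inputs: one HTML page and one piece of non-HTML file content.
html_page = "<html><head><title>My page</title></head><body></body></html>"
file_content = "%PDF-1.4 binary-looking content of a file, not HTML"

print(extract_title(html_page))     # My page
print(extract_title(file_content))  # None
```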

Fixing the problem

Originally the plugin was meant to fetch and save page titles, so it only expected HTML pages. One day I added a link to a file and it crashed. The problem was conceptual rather than technical: the idea behind the plugin never accounted for links that are not HTML pages.

To fix the problem, we need to find out whether the link actually points to an HTML page.

Checking if the link is an HTML page

While searching for a way to tell whether a URL is an HTML page, I found a StackOverflow comment that describes a solution: access the URL with a HEAD request.

Unlike a GET request, a HEAD request returns only the metadata (the response headers), without the content.

In the plugin, the part of the solution that uses the HEAD request looks like this:

r = requests.head(url_page)
# Not every response carries a Content-Type header, so default to "".
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url_page).text
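
A `Content-Type` value often carries parameters (for example `text/html; charset=utf-8`), and a server may omit the header entirely. The sketch below shows one way to make the check tolerant of both. `is_html_content_type` is a hypothetical helper, not part of the plugin, and it takes a plain mapping of headers so it can be tried without any network access.

```python
def is_html_content_type(headers):
    """Return True when the Content-Type header announces an HTML document.

    `headers` is any mapping of header names to values. Header names are
    compared case-insensitively, and a missing header counts as "not HTML".
    """
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = value
            break
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


print(is_html_content_type({"Content-Type": "text/html; charset=utf-8"}))  # True
print(is_html_content_type({"content-type": "application/pdf"}))           # False
print(is_html_content_type({}))                                            # False
```

With requests, this could be called as `is_html_content_type(requests.head(url_page).headers)`. Keep in mind that `requests.head` does not follow redirects by default, so for URLs that redirect it may be worth passing `allow_redirects=True`.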

Source of the fixed plugin:

# -*- coding: utf-8 -*-
""" This is a main script for pelican_link_to_title """
from pelican import signals
from bs4 import BeautifulSoup
import requests


def link_to_title_plugin(generator):
    "Link_to_Title plugin "
    article_ahreftag = {}
    for article in generator.articles:
        soup = BeautifulSoup(article._content, 'html.parser')
        ahref_tag = soup.find_all('ahref')
        if ahref_tag:
            article_ahreftag[article] = (ahref_tag, soup)

    for article, (p_tags, soup) in article_ahreftag.items():
        for tag in p_tags:
            url_page = tag.string
            if url_page:
                if "http://" in url_page or "https://" in url_page:
                    tag.name = "a"
                    tag.string = read_page(url_page)
                    tag.attrs = {"href": url_page}
            else:
                continue
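        # Python 2 idiom: str(soup) returns bytes there. On Python 3, str has
        # no decode(), so this line would be just: article._content = str(soup)
        article._content = str(soup).decode("utf-8")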

def read_page(url_page):
    import redis
    redconn = redis.Redis(host='localhost', port=6379, db=0)
    found = redconn.get(url_page)
    if not found:
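        header_response = requests.head(url_page)
        # Not every response carries a Content-Type header, so default to "".
        if "text/html" in header_response.headers.get("content-type", ""):
            html = requests.get(url_page).text
            soup = BeautifulSoup(html, "html.parser")
            title_tag = soup.find("title")
            if title_tag is None or not title_tag.string:
                # An HTML page without a usable <title>: fall back like a file.
                return get_non_html_page_title(url_page, header_response)
            title = title_tag.string
            redconn.set(url_page, title)
            return title
        else:
            return get_non_html_page_title(url_page, header_response)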
    else:
        return found

def get_non_html_page_title(url_page, header_response):
    file_str = url_page.split("/")[-1]
    file_ext = file_str.split(".")
    url_domain = url_page.split("//")[1].split("/")[0]
    if len(file_ext) > 1:
        # file with extension in url.
        return "Url to {} file: {} on domain: {}".format(file_ext[-1], file_str, url_domain)
    else:
        # no file with extension in url
        return "Url to: {}".format(url_page)

def register():
    """ Registers Plugin """
    signals.article_generator_finalized.connect(link_to_title_plugin)
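
`get_non_html_page_title` splits the URL by hand, which works for simple links but would treat a query string as part of the file name. As an alternative sketch (not the plugin's code), the standard library's `urllib.parse` and `posixpath` can do the splitting. The function name `describe_file_url` and the example URLs are made up for illustration, while the returned strings follow the format used in the plugin source.

```python
import posixpath
from urllib.parse import urlsplit


def describe_file_url(url_page):
    """Build a fallback link title for a URL that is not an HTML page."""
    parts = urlsplit(url_page)
    # parts.path excludes the query string and the fragment.
    file_name = posixpath.basename(parts.path)
    extension = posixpath.splitext(file_name)[1].lstrip(".")
    if extension:
        # The URL ends in a file name with an extension.
        return "Url to {} file: {} on domain: {}".format(
            extension, file_name, parts.netloc
        )
    # No file name with an extension in the URL.
    return "Url to: {}".format(url_page)


print(describe_file_url("https://example.com/docs/report.pdf?download=1"))
# Url to pdf file: report.pdf on domain: example.com
print(describe_file_url("https://example.com/downloads/"))
# Url to: https://example.com/downloads/
```

One difference from the hand-written version: `posixpath.splitext` does not treat a leading dot as an extension separator, so a name like `.bashrc` is reported as having no extension.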

Snippets

r = requests.head(url_page)
if "text/html" in r.headers.get("content-type", ""):
    # this url_page is a text/html page.
    pass

Thanks!

That's it :) Comment, share or don't - up to you.

Any suggestions for what I should blog about? Leave me a comment in the box below or poke me on Twitter: @anselmos88.

See you in the next episode! Cheers!


