In the previous post I made a quick fix with Redis that makes re-loading links a bit faster.

While searching for plugins I could use on my blog, I found this plugin's source code, which uses the content_object_init signal. Let's check whether using it makes our plugin even faster.

To The Point

To get precise and reliable performance checks, let's first change our debugging project and add more articles with links to it.

The source code of the debugging plugin used for the performance checks can be found on this branch.

Comparison

  1. Without the plugin

    Took 0.14 seconds.

  2. With the previous plugin (redis-aware)

    Took 0.17-0.20 seconds.

  3. With the content_object_init change

    First run:

    Took ~10 seconds.

    Second and subsequent runs:

    Took ~0.20 seconds.
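The gap between the first run and the later ones is what I would expect from a cache that starts empty: every link has to be downloaded once, and afterwards the title comes straight from the cache. Here is a minimal sketch of that behaviour. It is not the plugin code: a plain dict stands in for Redis, and `slow_fetch` and `make_cached_reader` are hypothetical names that only simulate downloading a page.

```python
import time


def make_cached_reader(fetch_title, cache):
    """Build a read_page-like function that checks `cache` before fetching."""
    def read_page(url):
        title = cache.get(url)
        if title is None:
            # Cache miss: take the slow path once and remember the result.
            title = fetch_title(url)
            cache[url] = title
        return title
    return read_page


def slow_fetch(url):
    """Hypothetical stand-in for downloading a page and reading its <title>."""
    time.sleep(0.02)
    return "Title of " + url


cache = {}  # assumption: a dict plays the role of Redis here
read_page = make_cached_reader(slow_fetch, cache)
urls = ["http://example.com/%d" % i for i in range(5)]

start = time.time()
first_titles = [read_page(u) for u in urls]   # every URL is a cache miss
first_run = time.time() - start

start = time.time()
second_titles = [read_page(u) for u in urls]  # every URL is a cache hit
second_run = time.time() - start

print("first run: %.3fs, second run: %.3fs" % (first_run, second_run))
```

The exact numbers depend on the machine and on the simulated delay; the point is only that the second pass never calls the fetch function.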

Source Code of changed plugin

# -*- coding: utf-8 -*-
""" This is a main script for pelican_link_to_title """
from pelican import signals
from bs4 import BeautifulSoup
import urllib


def link_to_title_plugin(generator):
    """ Link_to_Title plugin: runs once, after all articles are generated """
    article_ahreftag = {}
    for article in generator.articles:
        soup = BeautifulSoup(article._content, 'html.parser')
        ahref_tag = soup.find_all('ahref')
        if ahref_tag:
            article_ahreftag[article] = (ahref_tag, soup)

    for article, (p_tags, soup) in article_ahreftag.items():
        for tag in p_tags:
            url_page = tag.string
            # Only rewrite tags whose text is an http(s) URL.
            if url_page and ("http://" in url_page or "https://" in url_page):
                tag.name = "a"
                tag.string = read_page(url_page)
                tag.attrs = {"href": url_page}
        # soup.decode() returns unicode text, so no manual str/decode round trip.
        article._content = soup.decode()

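def read_page(url_page):
    """ Returns the title of the page at url_page, cached in redis """
    import redis
    redconn = redis.Redis(host='localhost', port=6379, db=0)
    found = redconn.get(url_page)
    if found:
        return found
    # Cache miss: download the page and remember its title.
    # urllib.urlopen is the Python 2 API (urllib.request.urlopen on Python 3).
    r = urllib.urlopen(url_page).read()
    soup = BeautifulSoup(r, "html.parser")
    title_tag = soup.find("title")
    # Fall back to the URL itself when the page has no usable <title>.
    title = title_tag.string if title_tag and title_tag.string else url_page
    redconn.set(url_page, title)
    return title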


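def content_object_init(instance):
    """ Replaces <ahref>URL</ahref> tags with titled links for one content object """
    if instance._content is None:
        return
    soup = BeautifulSoup(instance._content, "html5lib")

    for ctbl in soup.find_all('ahref'):
        url_page = ctbl.string
        # Leave tags alone unless their text is an http(s) URL.
        if not url_page or not ("http://" in url_page or "https://" in url_page):
            continue
        ctbl.name = "a"
        ctbl.attrs = {"href": url_page}
        try:
            ctbl.string = read_page(url_page)
        except Exception:
            # Keep the raw URL as the link text when the title cannot be fetched.
            pass
    # html5lib wraps the fragment in <html><head></head><body>...</body></html>,
    # so a fixed [12:-14] slice would leave "</head><body>" behind.
    # Pelican only needs what is inside <body>.
    instance._content = soup.body.decode_contents()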

def register():
    """ Registers Plugin """
    signals.content_object_init.connect(content_object_init)
    # signals.article_generator_finalized.connect(link_to_title_plugin)

Effects

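Well, as far as I can see, the effects are not much different: on repeated runs both versions take about 0.2 seconds, compared with 0.14 seconds without the plugin.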

I will need to check whether replacing the sequential urllib calls with something parallel makes it better.
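I have not measured this yet, so treat the following only as a sketch of what "something with parallelism" could look like: a thread pool from the standard library that fetches several titles at once. The names `fetch_titles`, `fetch_title` and `max_workers` are mine, not part of the plugin, and the example assumes Python 3 (`concurrent.futures` needs the `futures` backport on Python 2). A fake fetch function replaces the real download so that the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_titles(urls, fetch_title, max_workers=8):
    """Fetch titles for all urls concurrently.

    Returns a dict mapping each url to its title, or to None when
    fetch_title raised an exception for that url.
    """
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep order

    def safe_fetch(url):
        try:
            return fetch_title(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        titles = list(pool.map(safe_fetch, unique))
    return dict(zip(unique, titles))


def fake_fetch(url):
    """Hypothetical stand-in for read_page: sleeps instead of downloading."""
    time.sleep(0.05)
    if "broken" in url:
        raise IOError("cannot fetch " + url)
    return "Title of " + url


links = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",       # duplicate, fetched only once
    "http://example.com/broken",  # simulated failure
]
titles = fetch_titles(links, fake_fetch)
print(titles)
```

Since content_object_init is sent once per content object, a pool like this would only parallelise the links inside a single article unless the URLs are collected up front.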


Thanks!

That's it :) Comment, share or don't - up to you.

Any suggestions on what I should blog about? Post me a comment in the box below or poke me on Twitter: @anselmos88.

See you in the next episode! Cheers!


