Today let's focus on gathering the results of Grammarly's document checking in our mini automation project :)

This article is part of the Grammarly Selenium Pelican Automation Mini project

S0-E17/E30 :)

Grammarly Automation: Gathering the Report.

So Grammarly creates a report as a PDF. But that PDF is just a simple print-to-PDF of an HTML page.

Why not disassemble that HTML page and get all the results we need, instead of generating a PDF and gathering only part of them?

Let's check if our assumptions meet with reality :)

So we have a text; let's take this for our example:

A simple
Multiline
text
That wraps
after max
2 words.

Put this into Grammarly and gather the HTML output it will generate.

As I've found out, Grammarly will only generate the PDF for premium users.

So let's focus on disassembling the HTML.

After gathering the output from this script:

    def test_simple_text_gather_html(self):
        """ This test is supposed to print the page HTML. """
        page_login = GrammarlyLogin(self.driver)
        page_login.make_login('za2217279@mvrht.net', 'test123')
        page_new_doc = GrammarlyNewDocument(self.driver)
        page_new_doc.make_new_document("")
        page_doc = GrammarlyDocument(self.driver)
        text_to_put = "A simple \
            Multiline \
            text \
            That wraps \
            after max \
            2 words. \
        "
        page_doc.text = text_to_put
        # Give Grammarly some time to finish checking the text.
        self.sleep(10)
        print(self.driver.page_source)

I've found that there is a unique div with the CSS class _adbfa1e6-editor-page-cardsCol that does not change from document to document. Maybe it's just a hook for the React app to know where to put the checking results? Either way - we have output!
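
If you want to double-check that assumption straight from Selenium before scraping, a minimal sketch (assuming the generated class name really is stable) could be:

    # Sanity check: does the cards column div exist on the live page?
    cards = self.driver.find_elements_by_css_selector(
        'div[class="_adbfa1e6-editor-page-cardsCol"]')
    # We expect exactly one cards column per document.
    print(len(cards))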

But... the output is not perfect. Data gathered this way only gives "shape" information: not the exact position where the text is badly written, only the offending text itself and the proposed correction.

So instead of doing this sculpting with Selenium, I'll use Python's HTML extractor: Beautiful Soup.

Let's add a method that returns the HTML, which can then be fed into Beautiful Soup.

Now the GrammarlyDocument source looks like this:

import time

from page_objects import PageObject, PageElement


class GrammarlyDocument(PageObject):

    title = PageElement(css='input[type="text"]')
    text = PageElement(id_='textarea')
    # The button below is only visible for Grammarly premium users.
    score_button = PageElement(css='span[class="_ff9902-score"]')
    download_pdf_btn = PageElement(css='div[class="_d0e45e-button _d0e45e-medium"]')

    def put_title(self, title):
        self.title = title

    def put_text(self, text):
        self.text = text
        # Give Grammarly some time to check the text.
        time.sleep(10)

    def get_page_source(self):
        # PageObject keeps the webdriver handle in self.w.
        return self.w.page_source

And the test:

    def test_get_page_source(self):
        page_login = GrammarlyLogin(self.driver)
        page_login.make_login('za2217279@mvrht.net', 'test123')
        page_new_doc = GrammarlyNewDocument(self.driver)
        page_new_doc.make_new_document("")
        page_doc = GrammarlyDocument(self.driver)
        text_to_put = "A simple \n\
            Multiline \n\
            text \n\
            That wraps \n\
            after max \n\
            2 words. \n\
        "
        page_doc.put_text(text_to_put)
        # page_doc already wraps the driver, no need for a second page object.
        actual_source = page_doc.get_page_source()
        self.assertTrue("<html" in actual_source and "</html" in actual_source)

Yeah, it's a very silly test, but for now it's sufficient.
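
If we wanted it a bit stronger, we could also assert that the cards column div from before actually shows up in the source:

    # The cards column div is what we'll be scraping later on.
    self.assertIn('_adbfa1e6-editor-page-cardsCol', actual_source)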

Reverse-parsing the HTML

Let's start with what Beautiful Soup is.

It's a Python HTML extractor that comes in handy when you want to scrape data out of HTML. It's widely used in web crawlers.
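
A quick taste of how it works on a toy snippet (this is not Grammarly's markup, just an illustration):

    from bs4 import BeautifulSoup

    html = '<div class="cards"><div>First warning</div><div>Second warning</div></div>'
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find('div', {'class': 'cards'})
    # .contents gives the direct children; .text flattens each one to plain text
    for child in cards.contents:
        print(child.text)
    # First warning
    # Second warning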

Now let's make a DocumentScraper that will scrape the data from the HTML for us: at least the results data, for now.

from bs4 import BeautifulSoup


class DocumentScraper(object):

    def __init__(self, html_source):
        self.bs = BeautifulSoup(html_source, "html.parser")

    def get_issue_div(self):
        # The stable cards column div that holds all the warning cards.
        return self.bs.find('div', {'class': '_adbfa1e6-editor-page-cardsCol'})

    def get_all_warnings(self):
        return self.get_issue_div().contents

    def get_all_warnings_texts(self):
        return [element.text for element in self.get_all_warnings()]

    def iterate_over_warnings(self):
        for innerelement in self.get_all_warnings():
            print(innerelement.text)
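
Wiring it all together with the page object, the intended end-to-end flow is just a few lines (a sketch; page_doc and text_to_put are the objects from the earlier test):

    page_doc.put_text(text_to_put)
    scraper = DocumentScraper(page_doc.get_page_source())
    for warning in scraper.get_all_warnings_texts():
        print(warning)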

And the simplest test on a pre-downloaded file (the output from Selenium):

# -*- coding: utf-8 -*-
import unittest

from document_scraper import DocumentScraper


class GrammarlyScrapingTests(unittest.TestCase):

    def setUp(self):
        filename = "bs_output_test1.html"
        with open(filename, 'r') as f:
            self.data_scrape1 = f.read()

    def test1(self):
        # Sanity check that the fixture file was read in full.
        assert len(self.data_scrape1) == 21022
        scraper = DocumentScraper(self.data_scrape1)
        expected = ["Incorrect spacingwraps             after → wraps after".decode("utf-8")]
        result = list(scraper.get_all_warnings_texts())
        assert result == expected


if __name__ == "__main__":
    unittest.main()

Part 3!

There is going to be a part 3 that will sum up this mini-project and make this draft accessible.

So stay tuned :)

Acknowledgements

Thanks!

That's it :) Comment, share or don't :)

If you have any suggestions for what I should blog about in the next articles - please give me a hint :)

See you tomorrow! Cheers!


