I have to think about it


How many Plone sites are there in the UK? I took it upon myself to count them and compile a list of Plone websites in England, Scotland and Wales. I didn't do it by hand, though: I wrote a Python script that harvests links from Google for a set of keywords and then checks whether each website runs Plone.

The idea

Recently I was writing a document and used the sentence "There are hundreds of Plone sites in the UK". Standard marketing rubbish. But then I thought: how many Plone sites are out there, really? So I came up with the idea of actually counting them. Manual labour was out of the question, so I decided to write a program that would do it for me.

The Plone finder script should:

  • Query Google for Plone-related keywords
  • Parse the Google result pages to extract URLs
  • Check whether each URL points to a Plone website
  • Save results
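
The middle steps of that list can be sketched roughly as follows. This is a minimal illustration, not the script itself: `LinkCollector`, `extract_links` and `find_plone_sites` are hypothetical names, the result pages are passed in as already-downloaded HTML strings, and the Plone check is supplied by the caller as `is_plone`.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip relative links, anchors, mailto: and similar.
        if href and urlparse(href).scheme in ("http", "https"):
            self.links.append(href)


def extract_links(html):
    """Return the absolute links of one result page, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def find_plone_sites(result_pages, is_plone):
    """Check each host once and keep the URLs that pass `is_plone`.

    `result_pages` is an iterable of HTML strings; `is_plone` is a
    caller-supplied (hypothetical) function taking a URL and returning bool.
    """
    seen_hosts = set()
    found = []
    for html in result_pages:
        for url in extract_links(html):
            host = urlparse(url).netloc.lower()
            if host in seen_hosts:
                continue
            seen_hosts.add(host)
            if is_plone(url):
                found.append(url)
    return found
```

A real run would also have to skip the search engine's own navigation links and write `found` to a file; both are left out to keep the sketch short.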

Harvesting Google results

When thinking about how to find Plone websites, I decided to query a search engine. Google is the best, so it was the obvious choice. But we cannot (yet) search for "all sites built in Plone", so I had to use ordinary keywords. Those keywords should be strongly associated with Plone websites, and specific enough that they rarely appear on non-Plone sites. I looked at a few Plone sites and searched the Plone templates for a common denominator. After some thinking and testing I chose these:

  • plone
  • plone uk
  • plone england
  • plone wales
  • plone scotland
  • plone powered
  • login_form
  • join_form
  • document_view
  • base_view
  • welcome to plone
  • accessibility text size huge
  • please log in to access this part
  • registration form personal details
  • plone foundation

There are three groups here. The first is Plone plus a region. The second is common template names (links) used in Plone and hopefully nowhere else (or in few other places). The third is titles or descriptions of pages that are rarely hidden or removed: accessibility, login and footer. I even used 'welcome to plone', the standard Plone home page, and surprisingly got quite a lot of hits.

Unfortunately Google serves only the top 1000 results per query, so I could harvest links from only the first 100 pages for each keyword. I also included sponsored links.
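
With 1000 results spread over 100 pages, each page holds 10 results, so the list of pages to fetch can be generated up front. The sketch below assumes the familiar `q` and `start` query parameters on `https://www.google.com/search`; that URL shape and the `search_urls` helper are assumptions for illustration, since the exact way the queries were issued is not described here.

```python
from urllib.parse import urlencode

KEYWORDS = [
    "plone", "plone uk", "plone england", "plone wales", "plone scotland",
    "plone powered", "login_form", "join_form", "document_view", "base_view",
    "welcome to plone", "accessibility text size huge",
    "please log in to access this part",
    "registration form personal details", "plone foundation",
]

RESULTS_PER_PAGE = 10   # assumption: 1000 results / 100 pages
MAX_RESULTS = 1000      # Google's cap mentioned above


def search_urls(keyword, base="https://www.google.com/search"):
    """Yield one result-page URL per 10-result page for a keyword."""
    for start in range(0, MAX_RESULTS, RESULTS_PER_PAGE):
        yield f"{base}?{urlencode({'q': keyword, 'start': start})}"


# 15 keywords * 100 pages each
all_urls = [url for kw in KEYWORDS for url in search_urls(kw)]
```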

Validating that a site is built in Plone

How do you check whether a website is built in Plone? When you can look at it, it's easier. But how can a robot recognise it? I needed a simple and reliable test. And, like any other complicated problem, this one too has a simple solution.
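
One way such a test could look is a check of the page's HTML for Plone fingerprints. The sketch below is an illustration under assumptions, not necessarily the test the script used: it looks for a `generator` meta tag mentioning Plone and for an image whose file name contains "plone" (such as a logo). Both patterns and the name `looks_like_plone` are hypothetical.

```python
import re

# Assumed fingerprints; a real check may need more (or different) markers.
PLONE_MARKERS = (
    # <meta name="generator" content="Plone - ..."> (name before content only)
    re.compile(
        r'<meta[^>]+name=["\']generator["\'][^>]+content=["\'][^"\']*plone',
        re.IGNORECASE,
    ),
    # An image whose path contains "plone", e.g. a Plone logo.
    re.compile(
        r'<img[^>]+src=["\'][^"\']*plone[^"\']*\.(?:gif|png|jpg)',
        re.IGNORECASE,
    ),
)


def looks_like_plone(html):
    """Return True if any assumed Plone marker appears in the HTML."""
    return any(pattern.search(html) for pattern in PLONE_MARKERS)
```

Matching on markup rather than on the word "Plone" in the text keeps pages that merely write about Plone from being counted.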

Research

I harvested 1500 Google pages (15 keywords * 100 pages). It took 1282.79 s (about 21.4 min). The table below shows the results. All links: the total number of links found for a keyword on its 100 pages (normal and sponsored). Unique links: the number of links for a keyword that had not been found before. Unique/All: a ratio that shows how good a keyword is at bringing in new links.

Keyword                             All links  Unique links  Unique/All
plone                                    1243           326      26.23%
plone uk                                 1141           146      12.80%
plone england                             699           102      14.59%
plone wales                               973            76       7.81%
plone scotland                            660            48       7.27%
plone powered                             450            41       9.11%
login_form                                931           243      26.10%
join_form                                 706           236      33.43%
document_view                             629            45       7.15%
base_view                                 313            12       3.83%
welcome to plone                          749           159      21.23%
accessibility text size huge              916           457      49.89%
please log in to access this part         737           412      55.90%
registration form personal details       1703           611      35.88%
plone foundation                          623             6       0.96%
Total                                   12473          2920
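
The three columns can be computed in one pass over the harvested links. This is a minimal sketch assuming the links are held in a dict keyed by keyword, in the order the keywords were processed; `keyword_stats` is a hypothetical helper, and "unique" is taken to mean "not seen under any earlier keyword", so the order of keywords affects the numbers.

```python
def keyword_stats(links_by_keyword):
    """Return (keyword, all_links, unique_links, unique/all) rows.

    `links_by_keyword` maps each keyword to the list of links harvested
    for it; dict order is the processing order.
    """
    seen = set()
    rows = []
    for keyword, links in links_by_keyword.items():
        new_links = {link for link in links if link not in seen}
        seen |= new_links
        ratio = len(new_links) / len(links) if links else 0.0
        rows.append((keyword, len(links), len(new_links), ratio))
    return rows


sample = {"plone": ["a", "b", "a"], "plone uk": ["b", "c"]}
for keyword, total, unique, ratio in keyword_stats(sample):
    print(f"{keyword:<10} {total:>3} {unique:>3} {ratio:7.2%}")
```

For the first row of the table, 326 / 1243 gives the 26.23% shown.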

Results

This is a summary of the research: a table with the number of Plone websites found for each keyword and the total number of sites. This stage took 2762.17 s (46 min).

I found 470 Plone websites. This is an unfiltered list, so it includes websites with multiple subdomains and sites identified only by IP address. But now I can say that there are about 500 Plone sites in the UK and show this list as proof.

Keyword                             Plone sites  Plone sites/Unique links
plone                                        92                    28.22%
plone uk                                     54                    36.99%
plone england                                48                    47.06%
plone wales                                  31                    40.79%
plone scotland                               15                    31.25%
plone powered                                28                    68.29%
login_form                                   57                    23.46%
join_form                                    36                    15.25%
document_view                                18                    40.00%
base_view                                     2                    16.67%
welcome to plone                             76                    47.80%
accessibility text size huge                  5                     1.09%
please log in to access this part             6                     1.46%
registration form personal details            0                     0.00%
plone foundation                              2                    33.33%
Total                                       470
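
A first step towards filtering the raw list would be to separate the entries identified only by IP address from those with a host name. The sketch below uses the standard `ipaddress` module; `is_ip_host` and `split_sites` are hypothetical helpers, and the example URLs are made up. Collapsing multiple subdomains of one site is deliberately left out, because doing that correctly for domains such as `.co.uk` needs a public-suffix list.

```python
import ipaddress
from urllib.parse import urlparse


def is_ip_host(url):
    """Return True if the URL's host is a literal IPv4/IPv6 address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def split_sites(urls):
    """Split URLs into (named hosts, IP-only hosts), keeping order."""
    named, by_ip = [], []
    for url in urls:
        (by_ip if is_ip_host(url) else named).append(url)
    return named, by_ip


named, by_ip = split_sites([
    "http://www.example.org.uk/",       # made-up example
    "http://192.0.2.10:8080/site",      # documentation-range IP
])
```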

Future

This is not a complete list. I'm aware that I've missed a great many sites, and I can think of various reasons: a Plone website was not indexed by Google, or it fell outside the top 1000 results. I could use more keywords to look for candidate sites. Maybe I missed sites that didn't have the Plone logo image, and maybe there is a better test. But there is still a reason to be proud.

I think this is the second-longest list of Plone sites available. Plone.net lists 927 sites, 57 of them from the UK (as of 11 November 2007). So at least for a single country, my list is better.

Python code

Great article! I'm Spanish and interested in repeating the search for Spanish sites. Could you share your Python code? Thanks

Plone and Zope developer