How many Plone sites are there in the UK? I took it upon myself to count them and create a list of Plone websites. I looked for Plone websites in England, Scotland and Wales, but I didn’t do it manually: I wrote a Python script that harvests links from Google based on some keywords and then checks whether each website is a Plone website.
Recently I was writing a document and used the sentence: “There are hundreds of Plone sites in the UK.” Standard marketing rubbish. But then I thought: how many Plone sites are out there? Really. Then I came up with the idea to actually count them. Manual labour was out of the question, so I decided to write a program that would do it for me.
The Plone finder script should:
- Query Google for Plone-related keywords
- Parse the Google result pages to extract URLs
- Check whether there is a Plone website behind each URL
- Save the results
Harvesting Google results
Thinking about how to find Plone websites, I decided to query a search engine. Google is the best, so it was the obvious choice. But we cannot (yet) search for “all sites built in Plone”, so I had to use normal keywords. Those keywords should be strongly related to Plone websites, and they should be distinctive enough not to appear on non-Plone sites. I looked at a few Plone sites and searched the Plone templates to find a common denominator. After some thinking and testing I chose these:
- plone uk
- plone england
- plone wales
- plone scotland
- plone powered
- welcome to plone
- accessibility text size huge
- please log in to access this part
- registration form personal details
- plone foundation
We can see three groups here. First: Plone for a given region. Second: common template text (links) used in Plone and hopefully nowhere else (or not in many places). Third: titles or descriptions of pages that are rarely hidden or removed: accessibility, login and footer. I even used ‘welcome to plone’, which is the standard Plone home page, and surprisingly got quite a lot of hits.
Unfortunately Google serves only the top 1000 results per query, so I could only harvest links from the first 100 pages for each keyword. I also included sponsored links.
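The link-extraction step can be sketched roughly as follows. This is not the original script, only a minimal illustration using the standard library. The names `LinkCollector` and `extract_links`, the page size of 10 results, and the rule for skipping Google's own links are all assumptions made for the example, and the HTML is passed in as a string rather than fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

RESULTS_CAP = 1000     # from the post: only the top 1000 results are served
RESULTS_PER_PAGE = 10  # assumption: default number of results per page
PAGES_PER_KEYWORD = RESULTS_CAP // RESULTS_PER_PAGE  # 100 pages per keyword


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links found in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Relative links (navigation, "next page") have no scheme and are dropped.
        if href and urlsplit(href).scheme in ("http", "https"):
            self.links.append(href)


def extract_links(html, skip_hosts=("google.",)):
    """Return external links from one result page.

    skip_hosts is a hypothetical filter: any host containing one of these
    markers is treated as the search engine's own link and ignored.
    """
    parser = LinkCollector()
    parser.feed(html)
    external = []
    for link in parser.links:
        host = urlsplit(link).hostname or ""
        if not any(marker in host for marker in skip_hosts):
            external.append(link)
    return external
```

Feeding each downloaded result page through `extract_links` gives the raw list of candidate sites for a keyword; duplicates are deliberately kept at this stage so they can be counted later.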
Validating whether a site is built in Plone
How do you check whether a website is built in Plone? When we can look at it, it’s easier. But how can a robot recognise that? I needed a simple and reliable test. And like any other complicated problem, this one too has a simple solution.
I harvested 1500 Google pages (15 keywords * 100 pages). It took 1282.79s (about 21.4 min). The table below shows the results. All links: total number of links found for a keyword on 100 pages (normal and sponsored). Unique links: number of links for a keyword that had not been found before. Unique/All: a ratio that shows how good a keyword is at bringing in new links.
| Keyword | All links | Unique links | Unique/All |
|---|---|---|---|
| welcome to plone | 749 | 159 | 21.23% |
| accessibility text size huge | 916 | 457 | 49.89% |
| please log in to access this part | 737 | 412 | 55.90% |
| registration form personal details | 1703 | 611 | 35.88% |
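The three columns can be computed with a running set of links already seen. The sketch below is only an illustration of that bookkeeping: `link_stats` is a hypothetical helper, and it assumes that "All links" counts every harvested link (repeats included) while "Unique links" counts links not seen earlier, either in a previous keyword or in the same one.

```python
def link_stats(links_by_keyword):
    """Return (keyword, all_links, unique_links, ratio) rows in harvest order.

    links_by_keyword maps each keyword to the list of links harvested for it.
    A link counts as unique only the first time it is seen across all keywords.
    """
    seen = set()
    rows = []
    for keyword, links in links_by_keyword.items():
        new_links = set(links) - seen
        seen |= new_links
        ratio = len(new_links) / len(links) if links else 0.0
        rows.append((keyword, len(links), len(new_links), ratio))
    return rows


# Checking one row of the table: 159 unique links out of 749 is 21.23%.
welcome_ratio = f"{159 / 749:.2%}"
```

Because the unique count depends on what was harvested earlier, the order in which keywords are processed changes the per-keyword ratios, though not the overall total of unique links.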
This is a summary of the research: a table with the number of Plone websites found for each keyword, and the total number of sites. This stage took 2762.17s (46 min).
I found 470 Plone websites. This is an unfiltered list, so it includes websites with multiple subdomains and sites identified only by IP address. But now I can say that there are about 500 Plone sites in the UK and show this as proof.
| Keyword | Plone sites | Plone sites/Unique links |
|---|---|---|
| welcome to plone | 76 | 47.80% |
| accessibility text size huge | 5 | 1.09% |
| please log in to access this part | 6 | 1.46% |
| registration form personal details | 0 | 0.00% |
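Since the list is unfiltered, a first clean-up pass could collapse repeated hosts and set aside the entries given only as an IP address. The sketch below is one possible way to do that and was not part of the original script; `is_ip_host` and `split_hosts` are hypothetical names. It does not try to merge subdomains of the same organisation, which would need extra knowledge about domain suffixes.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_host(url):
    """Return True when the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def split_hosts(urls):
    """Split URLs into sorted, de-duplicated named hosts and IP-only hosts."""
    named, numeric = set(), set()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # not an absolute URL, nothing to classify
        if is_ip_host(url):
            numeric.add(host)
        else:
            named.add(host)
    return sorted(named), sorted(numeric)
```

The IP-only entries can then be reviewed by hand, since an address alone says nothing about whether the site is UK based.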
This is not a complete list. I’m aware that I’ve missed a great many sites, and I can think of various reasons: a Plone website was not indexed by Google or was outside the top 1000 results; I could have used more keywords to look for possible sites; maybe I missed sites that didn’t have the Plone logo image, and maybe there is a better test. But there is still a reason to be proud.
I think this is the second longest list of Plone sites available. Plone.net lists 927 sites, 57 of them from the UK (11 November 2007). So at least for a single country, my list is better.
Yeah, loads missing: there are only two of the ones I’ve worked on in there. I think it’s having difficulty picking up the more customised sites, and it’s found Zest and the van Rees brothers’ sites as UK-based.
Very interesting results though. It might be worth seeing how they’re segmented into .co.uk, .org.uk, .ac.uk, etc.
Non UK sites
Linux.co.uk
You’re missing a few: linux.co.uk and arthur-ransome.org, off the top of my head, are two which I’ve worked on.
Have you thought about searching for the Plone generator meta tag, which most Plone sites include, or using that to confirm that the site is really running Plone? It is extremely rare for the generator tag to be removed.
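A check along the lines of this suggestion might look like the sketch below. It is an illustration, not the test used for the list above: `GeneratorFinder` and `looks_like_plone` are hypothetical names, and it assumes that a Plone site announces itself with a `<meta name="generator">` tag whose content mentions "Plone". The page HTML is passed in as a string so the function can be tried without any network access.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "generator":
            self.generators.append(attributes.get("content", ""))


def looks_like_plone(html):
    """Return True when a generator meta tag mentions Plone (assumed marker)."""
    finder = GeneratorFinder()
    finder.feed(html)
    return any("plone" in generator.lower() for generator in finder.generators)
```

A page that merely contains the word "Plone" in its text does not pass, so this is stricter than keyword matching, but it will still miss sites where the tag has been removed.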