ANSWERED on Fri 2 Nov 2007 - 3:24 pm UTC by davidsarokin
Asked by kevin2kelly on Thu 1 Nov 2007 - 7:31 pm UTC:
I have some older references for this question but I am looking for more recent and more reliable figures. Older sources include:

Prefetching Hyperlinks, 1999, by Dan Duchamp
http://www.sagecertification.org/publications/library/proceedings/usits99/full_papers/duchamp/duchamp_html/doc004.html
Finds an average of 22.6 links per page.

Article in 2000 referring to a now defunct company, Linkguard
http://news.bbc.co.uk/1/hi/sci/tech/790685.stm
Finds 52 links per page.

Calculated from a secondary figure in a 2006 article on estimating pagerank
http://portal.acm.org/citation.cfm?id=1150419
Finds 4.2 links per page on 4 million .edu pages and 3.9 links per page on political pages.

That is such a wide range, I'd like source(s) with more confidence.
Request for clarification by Researcher bobbie7 on Fri 2 Nov 2007 - 5:18 am UTC:
Hello again Kevin2kelly,

I located a web standards audit of 105 Australian Government web sites performed during December 2006.
http://gdispain.site.net.au/standards/ag-website-audit-dec06/

The spreadsheet at the link below contains the data for the 105 websites.
http://gdispain.site.net.au/standards/ag-website-audit-dec06/pubs/ag-website-audit-dec06.xls

If the above link doesn't work, you may download the web site audit data (XLS - 180 KB) from page 47.
http://gdispain.site.net.au/standards/ag-website-audit-dec06/

Column DQ lists the average number of links per webpage for each of the 105 websites. For example, here are the figures for the first nine websites.

URL                      Average number of links per web page
www.aad.gov.au           50
www.abs.gov.au           57
www.accc.gov.au          38
www.accesscard.gov.au    41
www.afma.gov.au          102
www.afp.gov.au           51
www.ag.gov.au            33
www.agimo.gov.au         47
www.ags.gov.au           38

They calculated the average number of links per webpage for all 105 web sites at 43.5.

Would these figures work for you?

Thanks,
Bobbie
Question clarification by kevin2kelly on Fri 2 Nov 2007 - 6:41 am UTC:
No, the sample size of 100 websites is so small as to be meaningless, especially since they were all .gov sites.
Request for clarification by Researcher bobbie7 on Fri 2 Nov 2007 - 7:18 am UTC:
Kevin2kelly, I'll try again and if I find more relevant figures I'll let you know. In the meantime, I am unlocking this question so that other researchers can take a crack at it as well. Bobbie
Answer by Researcher davidsarokin on Fri 2 Nov 2007 - 3:24 pm UTC:
kevin2kelly,

Actually, the range you cited in your question probably isn't as large as it first appears. From what I can see from your third source, "Estimating the Global PageRank of Web Communities", the count of links was restricted to .edu links, and is not representative of the web as a whole.

And thereby hangs a tale. Counting links is not straightforward. There's a huge difference between what the eye sees in viewing a site, and what the spider is instructed to see when crawling the same site. Studies of link statistics may include or exclude, as they see fit, all sorts of hyperlinks, such as advertising links, image hyperlinks, duplicated links on the same page, and so on.

In addition, the web appears to be so strongly skewed in terms of site distribution, with a few large sites that house tens or even hundreds of thousands of links, that measures of central tendency are inherently difficult, and simple means (averages) tell quite a different tale than medians.

On top of these complexities, most of the data on link statistics arises from cyber-academics studying the composition of the web, and their insistent refusal to speak anything resembling English often makes the interpretation and comparison of their results difficult, and sometimes impossible.

With that whiny caveat to kick things off, the best overview I came across of link stats is -- far and away -- this 2003 study performed by Microsoft, IBM and HP researchers:

http://research.microsoft.com/research/sv/sv-pubs/p96-broder/p96-broder.pdf
Efficient URL Caching for World Wide Web Crawling

In it, the researchers conducted a large-scale crawl of several hundred million pages over the course of several weeks. Their findings:

"...These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links..."

So there you have it.
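To see why a mean and a median can sit so far apart, here is a minimal sketch using made-up numbers. The list of link counts is purely hypothetical (it is not data from the crawl), and the helper name `summarize` is my own; the point is only that a single page with thousands of links drags the mean far above the median.

```python
from statistics import mean, median

# Hypothetical link counts for ten pages: nine ordinary pages plus one
# directory-style page with thousands of links. Invented for illustration;
# not taken from any of the studies cited here.
links_per_page = [5, 8, 12, 17, 20, 23, 25, 30, 41, 4000]


def summarize(counts):
    """Return the mean and median number of links per page."""
    return {"mean": mean(counts), "median": median(counts)}


summary = summarize(links_per_page)
print("mean:  ", summary["mean"])
print("median:", summary["median"])
```

Dropping the one huge page from the list brings the mean back down close to the median, which is the same effect the researchers describe in their crawl.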
Websites with an average of about 62 links per page, or a median of 23 links... take your pick. Either way, the numbers are not wildly different from the first two cites in your question, and look to be quite consistent with the Australian data that bobbie7 cited.

Note, however, that the researchers counted *all* links on a page, including things like image links, in contrast to some other counting methods which only count links in anchor tags.

Secondly, the authors note that, unlike some other studies, they counted each and every link, even if it was a duplicate on the same page:

"...most studies report the number of unique links per page. The numbers above include duplicate copies of a link on a page. If we only consider unique links per page, then the average number of links is 42.74 and the median is 17..."

Last point: the authors note that their numbers are somewhat larger than other data in the literature:

"...Earlier studies reported only an average of 8 links or 17 links per page..."

and speculate that, in addition to counting all links and duplicate links, their numbers are larger because their crawl had the capacity to include very large webpages, which other studies were not able to include due to memory limitations. Thus, they were able to include in their counts some mega-large pages with many thousands of links.

However, I would point out that their non-duplicates median figure of 17 links per page is not substantially different from some of the other studies available.

Not all studies are as clear as this one, in terms of what was actually counted, and how. I've included a few of these studies below, only some of which are directly linkable, and the rest were accessed through subscription databases.
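As a rough illustration of how much the counting rule matters, here is a minimal sketch that counts links on one small, invented page four different ways. It uses Python's standard html.parser module. The class name `LinkCounter`, the function `count_links`, the sample HTML, and the choice to treat any src or href attribute outside an anchor tag as a "link" are all my own assumptions; none of the studies describe their rules in this much detail.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Collects link targets from <a> tags and, separately, from other tags."""

    def __init__(self):
        super().__init__()
        self.anchor_links = []  # href values found on <a> tags
        self.other_links = []   # src/href values on other tags (img, link, ...)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "a" and name == "href":
                self.anchor_links.append(value)
            elif tag != "a" and name in ("src", "href"):
                self.other_links.append(value)


def count_links(html):
    """Count links on one page under four different (assumed) counting rules."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    all_links = parser.anchor_links + parser.other_links
    return {
        "all_with_duplicates": len(all_links),
        "all_unique": len(set(all_links)),
        "anchors_with_duplicates": len(parser.anchor_links),
        "anchors_unique": len(set(parser.anchor_links)),
    }


# A made-up page: two distinct anchor targets, each linked twice, plus two images.
sample = """
<a href="/home">Home</a>
<a href="/home"><img src="/logo.png"></a>
<a href="/about">About</a>
<img src="/banner.png">
<a href="/about">About us</a>
"""

print(count_links(sample))
```

On this toy page the same markup yields anywhere from 2 to 6 "links" depending on the rule, which is the kind of spread that makes figures from different studies hard to compare directly.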
[analysis by Brazilian researchers of several hundred thousand web pages, though what exactly was counted is not clear]

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 57(2):208–221, 2006
Link-Based Similarity Measures for the Classification of Web Documents

"...TodoBR provides 40,871,504 links between Web pages (an average of 6.9 links per page)..."

Stochastic models for the web graph
R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal
FOCS ’00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science
IEEE Computer Society, 2000, p. 57.

"The web may be viewed as a directed graph each of whose vertices is a static HTML web page, and each of whose edges corresponds to a hyperlink from one web page to another...[with] an average degree of about 7..."

[The above is widely cited (and interpreted) as meaning there is an average of 7 links per page, though it's not clear to me what the authors actually counted, or how they derived this figure]

http://www.iit.cnr.it/staff/marco.pellegrini/papiri/www015-pellegrini.pdf
Extraction and Classification of Dense Communities in the Web
May 8–12, 2007

"...Andrei Broder et al. [6] in the year 2000 estimated the size of the indexable web graph at 200M pages and 1.5G edges (thus an average degree about 7.5 links per page, which is consistent with the average degree 8.4 of the WebBase data of 2001)..."

[Again, not entirely clear what or how things are counted. WebBase refers to an existing set of researcher-accessible data about the web which is described in this reference from the above article: J. Cho and H. Garcia-Molina. WebBase and the stanford interlib project. In 2000 Kyoto International Conference on Digital Libraries: Research and Practice, 2000.]

http://www.dcc.ufla.br/infocomp/artigos/v5.2/art07.pdf
Assessment of WWW-Based Ranking Systems for Smaller Web Sites
2006

"...The database contains 7312 pages, of which 2728 are HTML pages with outgoing links. There are a total of 22970 hyperlinks, yielding an average of approximately 8.42 outgoing hyperlinks per HTML page..."

[This small sample size study counts only outgoing links, and excludes internal links to other pages on the same website]

18th International Workshop on Database and Expert Systems Applications
Hyperlink Classification: A New Approach to Improve PageRank
2007. DEXA '07. 18th International Conference on Database and Expert Systems Applications
3-7 Sept. 2007, Page(s): 274-277

"...We fetched about 21,717 pages by open source search engine Nutch. We find that there are about 82 hyperlinks on each page on average, however, there are only twenty hyperlinks or even less are relating about the page’s topic while most of the hyperlinks are about the information about the whole Web site map or the ads..."

[appears to have been Chinese pages, but it's not totally clear]

Semantic prefetching objects of slower web site pages
Journal of Systems and Software
Volume 79, Issue 12, December 2006, Pages 1715-1724

"...The average web page contains 8.87 images per page, ranging from 1 to 25 images, with a standard deviation of 4.27. The average number of hyperlinks per web page is 10.70, ranging from 1 to 26 hyperlinks, with a standard deviation of 4.49..."

[appears to be a small sample size]

I trust the information here meets your needs (hopefully in a non-controversial manner). But if there's anything more I can do for you, just say the word.