- There is much additional information we could squeeze out of the internet to give some insight into every page. We might scan document structure, text complexity, or elementary word behavior.
We should recognize the language of a document, and we could analyze the field of documents that are not recognized as personal. We can analyze them for blank space, treatment of images, presence of an abstract, or notes. We could analyze the relation of the text to its caption and the quality of the caption; this could tell us how serious the document is and how likely it is to be scientific, and with the help of the caption we may get very good results in machine tagging of highly rated scientific documents. We can also look at who refers to this document and vice versa. We can judge basic document formatting: the selected font, background image or music, color garishness or color smoothness, bracketing, underlining, etc. All of this can give a fair picture of whether the page may be what the user searches for (or is prepared to find in a tag folder).
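The formatting heuristics above can be sketched as a tiny scoring function. This is a minimal illustration, not a real model: the features checked and the weights attached to them are invented for the example, and a serious version would use far richer signals.

```python
import re

# Hypothetical sketch of the formatting heuristics described above.
# Feature choices and weights are illustrative assumptions only.
def formality_score(html: str) -> float:
    score = 0.0
    if re.search(r"\babstract\b", html, re.IGNORECASE):
        score += 2.0                      # presence of an abstract
    if re.search(r"<h1[^>]*>.*?</h1>", html, re.IGNORECASE | re.DOTALL):
        score += 1.0                      # a proper caption/title
    lower = html.lower()
    if "bgsound" in lower or "background=" in lower:
        score -= 2.0                      # background music/image suggests a personal page
    colors = set(re.findall(r"#[0-9a-fA-F]{6}", html))
    if len(colors) > 8:
        score -= 1.0                      # garish color use
    return score

page = "<html><h1>On Three New Hormones</h1><p>Abstract: ...</p></html>"
print(formality_score(page))  # higher means more likely formal/scientific
```

A real system would combine many such weak signals rather than trust any single one, which is exactly the "particular image" the text describes.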
- There are many types of documents for which RDF or header tagging is not a step forward: multi-content documents, and very comprehensive papers where the title is mere decorum and the abstract may have to serve as the title (e.g., if you find three new hormones and describe a kilobyte of processes between them). Neither RDF nor tagging fully suffices in an epoch of disappearing language barriers on the internet (every RDF entry would need professional translation).
- There is no parenthood / external-link definition in web documents, which I consider a great mistake in the HTML architecture: we cannot say that something is part of a homepage or an external link and be 100% sure of it, even if we compare URL locations etc. There is no formal relation for us, even if one page is stored in the same directory as another. This is a potential troublemaker for context-oriented methods of page analysis.
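The URL-comparison workaround mentioned above can be sketched as follows. Because HTML declares no formal parent/external relation, this can only guess: two pages on the same host and in the same directory may still be entirely unrelated, which is the complaint the text is making. The function name and categories are illustrative assumptions.

```python
from urllib.parse import urlparse

# Hypothetical sketch of guessing page relations from URLs alone.
# Without a formal relation in HTML, this is never 100% certain.
def guess_relation(page_url: str, link_url: str) -> str:
    page, link = urlparse(page_url), urlparse(link_url)
    if page.netloc != link.netloc:
        return "external"            # different host: probably an external link
    page_dir = page.path.rsplit("/", 1)[0]
    link_dir = link.path.rsplit("/", 1)[0]
    if page_dir == link_dir:
        return "same-directory"      # maybe part of the same site section
    return "same-host"               # same host, different directory

print(guess_relation("http://example.org/home/index.html",
                     "http://example.org/home/about.html"))   # same-directory
print(guess_relation("http://example.org/home/index.html",
                     "http://other.net/page.html"))           # external
```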
- Even though I am still talking about machine / group categorization of documents, which is not the primary sense of the semantic web, it is still about better orientation in searched materials. I am not claiming this is the best we can do, and I also think the real semantic web lies in the future: one that listens to our questions and requests clarification when necessary. But that is surely as far away as intelligent self-programming systems, driven by absolute programmer dudes, that create full environments after a one-hour intensive man-machine interview.
- There is no "public entrance" to the webpage, or at least no totally unique and highly visible intelligent public noticeboard beside every "door". Consequently, there is no good advertiser standing in front of a good web entrance. Today it is all static lists of result links.
- It is almost certain that having a pre-semantic¹ or semantic² web will need stable support, administration, and voluntarism on one side, and education of the internet user on the other, so that it stays visible yet readable and does not disintegrate back into the medley we have now. It will also need to offer visible effects of taking a few extra steps as a standard user, or at least of avoiding some. Hand in hand with this goes a wider movement: propagating and understanding the current state of the internet as superable, reversible, and depending on any of us.
- Concretely, there must be control over the collected data, so that it is not hacked and stays true³. Semantic / tagging crime, theft, and spam must be controlled. There might, for example, be many physicists updating the category-checker word database with the newest top words⁴. In other words, we will not have an intelligent internet until wise people tell it how to behave intelligently and how to understand the new information available. Or do we wait for everyone to tag their own documents brilliantly, so that we only come and collect? Yes, we could "hide" some form inside WYSIWYG editors that at least asks first whether we are making a homepage or a memorial to the Queen, but many such dictatorships and standard enhancements would only cause a rapid increase of Queen memorials across the net.
1) Let us call any upgrade of the internet pre-semantic: an increase in machine folding and in user description of links.
2) Let us call our unreachable final target semantic.
3) I am not saying we can destroy fiction on the web, but in the future we may regulate this centralised information about the web. Otherwise, how would we accept some "joke" by the masses that turns our intelligent system into a permanently stupid liar (when perhaps someone's life could depend on its higher functions)?
4) For example, to identify a document as a physics paper, we need a different set of contain-test words each year. The experts may hold votes about it, or anything similar. That is only an illustration of how far such support might go. This example also makes it evident that the year of creation, or at least of the first cache, should be exposed for every document for better folding.
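The year-specific "contain-test word" idea in this footnote can be sketched in a few lines. The word lists below are invented placeholders; in the scenario described, a real database would be maintained and voted on by domain experts, and refreshed every year.

```python
# Hypothetical per-year test-word sets; the words and years are
# illustrative assumptions, not a real expert-maintained database.
PHYSICS_WORDS = {
    2004: {"quark", "boson", "entanglement", "nanotube"},
    2005: {"graphene", "boson", "dark energy", "spintronics"},
}

def looks_like_physics(text: str, year: int, threshold: int = 2) -> bool:
    """Check whether enough of that year's test words appear in the text."""
    words = PHYSICS_WORDS.get(year, set())
    hits = sum(1 for w in words if w in text.lower())
    return hits >= threshold

doc = "We measure boson decay rates in a carbon nanotube lattice."
print(looks_like_physics(doc, 2004))  # True: two 2004 test words match
```

Note how the same document can pass one year's word set and fail another's, which is why the footnote argues that a document's year of creation (or first cache) should be exposed for folding.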