There's No Magic...
So I was following some of the referrers to Mowser, and I saw that someone had actually gone to my about page and copied a section explaining the service… Well, since I just ripped off Google mobile’s TOS and About page to get something up there, I accidentally copied over something that shouldn’t be there, specifically this quote: “During this translation process, Mowser analyzes the original HTML code using sophisticated algorithms.”
That’s as far from what’s happening as possible (for Mowser at least). I’m not a particularly wonderful programmer and I’m an even worse mathematician, so I can promise you there are no magical sophisticated algorithms at work here. In fact, if you saw my presentation at Mobile Monday, you’d have heard me talk about how no transcoding service should have proprietary methods for manipulating content that’s not theirs. It should be a completely open and standard process.
That’s not saying there’s no work involved - I’m still trying to debug the markup munging process and get it solid, but from a publisher’s perspective, when their content passes through the Mowser proxy, very predictable and reliable things happen without variation. In fact, I’m using a standard DOM parser to transform the pages, with just one line of regular expression voodoo to help work around an annoying bug with script tags. (Hopefully I can get rid of that at some point too, and maybe even start using something even more standard like XSL to transform the pages instead.)
Note that as time goes on, I’m sure the specifics of this process will change slightly (hopefully there will be actual documentation to cover it then), but here are the basic steps I’m doing right now:
Step 1: Load the page and remove scripting. This is where I use that fugly RegEx hack, because if a page has unescaped tags in its JavaScript, it’ll mess up the DOM, so I just whack the scripts right away. At some point I want to start parsing some of the JavaScript on the server so that I can provide some level of interactivity through another page request, and maybe get JavaScript within hrefs to work. But for now it’s gone.
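As a rough illustration of this kind of pre-parse script stripping, here is a minimal Python sketch. The post does not show the actual expression, so the pattern and the function name `strip_scripts` are assumptions, not Mowser's code.

```python
import re

# Hypothetical pattern (not the one Mowser uses): match a <script> element,
# its attributes, and everything up to the first closing </script> tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Remove script elements from raw markup before it reaches the DOM parser."""
    return SCRIPT_RE.sub("", html)


page = '<p>a</p><script>document.write("<b>x</b>");</script><p>b</p>'
cleaned = strip_scripts(page)
```

Running this on the raw text first means the unescaped `<b>` inside the script never reaches the parser. A regular expression like this is a blunt tool, which is presumably why the post calls it a hack.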
Step 2: Parse the Header. Once the page is loaded into the DOM, I go through and look for things in the HTML header that are pertinent, such as the title, feed links, alternative mobile content link, character encoding and a handheld stylesheet. Though it’s not working right now, very soon I will automatically redirect the page if I see a mobile page link in the header. To me that means the publisher of that site does NOT want transcoding, and has requested that their mobile site be used instead. Examples of this include digg.com (pointing at DiggRiver) and cinevegasblog.com (pointing back at Mowser! Yeah!). The second one is the problem: I need to make sure I don’t cause an endless loop when redirecting automatically. This will be fixed soon.
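Here is a minimal Python sketch of that header pass, including a guard against the redirect loop. It uses the standard library's event-based `html.parser` instead of a DOM. The class name `HeadInfo`, the helper `should_redirect`, the host `proxy.example`, and the assumption that a mobile alternative is marked with `rel="alternate" media="handheld"` are all illustrative choices, not details from Mowser.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class HeadInfo(HTMLParser):
    """Collects the header fields listed in step 2."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.feeds = []
        self.mobile_url = None
        self.handheld_css = None
        self.charset = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = a.get("content", "").lower()
            if "charset" in a:
                self.charset = a["charset"].lower()
            elif a.get("http-equiv", "").lower() == "content-type" and "charset=" in content:
                self.charset = content.split("charset=")[-1].strip()
        elif tag == "link":
            rel = a.get("rel", "").lower().split()
            media = a.get("media", "").lower()
            href = a.get("href", "")
            if "stylesheet" in rel and "handheld" in media:
                self.handheld_css = href
            elif "alternate" in rel and a.get("type", "").lower() in FEED_TYPES:
                self.feeds.append(href)
            elif "alternate" in rel and "handheld" in media:
                # Assumed convention for "use my mobile site instead".
                self.mobile_url = href

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def should_redirect(mobile_url, proxy_host):
    """Redirect only if the mobile link does not point back at the proxy itself."""
    if not mobile_url:
        return False
    return urlparse(mobile_url).hostname != proxy_host


def parse_head(html):
    info = HeadInfo()
    info.feed(html)
    info.close()
    return info
```

The loop guard is just a host comparison: a site like cinevegasblog.com, whose mobile link points back at the proxy, gets transcoded as usual instead of bouncing forever.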
Step 3: Parse the Links: Now I go through and rewrite the href and src attributes (and other links) to redirect through the Mowser proxy, including rewriting forms to POST through Mowser as well. There are some bugs here too right now, but it’s working okay. Images as well as links are converted, and when the images are requested they are transformed by the server - any image wider than 150px is resized, and anything smaller is left alone. This is an alternative to a percentage method, where I would make all the images, say, 25% of the size of the original. That tends to make really small images way too small, however, so I’m sticking with the hard limit for now.
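To make the link rewriting and the two image-sizing rules concrete, here is a small Python sketch. The proxy prefix `http://proxy.example/?url=`, the function names, and the choice to keep the aspect ratio and scale wide images down to exactly 150px are assumptions for illustration; the post does not give the real URL scheme or the target size.

```python
from urllib.parse import quote, urljoin, urlparse

PROXY_PREFIX = "http://proxy.example/?url="  # hypothetical proxy endpoint


def proxy_url(page_url, link):
    """Resolve a link against the page URL and route it through the proxy."""
    absolute = urljoin(page_url, link)
    if urlparse(absolute).scheme not in ("http", "https"):
        return link  # leave mailto:, javascript:, etc. untouched
    return PROXY_PREFIX + quote(absolute, safe="")


def hard_limit_size(width, height, limit=150):
    """Hard limit: only images wider than `limit` are scaled down."""
    if width <= limit:
        return width, height
    scale = limit / width
    return limit, max(1, round(height * scale))


def percentage_size(width, height, factor=0.25):
    """Percentage method: every image shrinks by the same factor."""
    return max(1, round(width * factor)), max(1, round(height * factor))
```

A 600x400 photo comes out at 150x100 under both rules, but a 40x40 icon stays 40x40 under the hard limit and drops to 10x10 under the percentage method, which is the "way too small" problem.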
Step 4: Remove unwanted tags. The following is a list of tags that Mowser just removes altogether: ‘noscript’, ‘style’, ‘link’, ‘object’, ‘applet’, ‘embed’, ‘iframe’, ‘base’, ‘basefont’, ‘map’. I need to do a better job with frames (simply parsing out and linking to each inner frame - rather than trying any magic) and converting iframes into links as well, but for now these are the verboten tags.
Step 5: Remove unwanted attributes: There are a lot of attributes that could mess up the formatting of a page or try to enable scripting, etc., so I just whack them all. Here’s the list (copied right out of my code): ‘rel’, ‘rev’, ‘style’, ‘target’, ‘align’, ‘width’, ‘height’, ‘bgcolor’, ‘text’, ‘link’, ‘vlink’, ‘alink’, ‘size’, ‘hspace’, ‘vspace’, ‘background’, ‘border’, ‘cellspacing’, ‘cellpadding’, ‘valign’, ‘colspan’, ‘onload’, ‘onunload’, ‘onchange’, ‘onsubmit’, ‘onreset’, ‘onselect’, ‘onblur’, ‘onfocus’, ‘onkeydown’, ‘onkeypress’, ‘onkeyup’, ‘onclick’, ‘ondblclick’, ‘onmousedown’, ‘onmousemove’, ‘onmouseover’, ‘onmouseout’, ‘onmouseup’.
Step 6: Convert the ‘block’ tags: The next step is to go through and convert any tags that might mess up formatting as well, mostly tables and such. I leave the class and id attributes, however, so that the handheld stylesheets still have markers to use. I convert the following tags to ‘div’s: ‘table’, ‘tr’, ‘td’, ‘th’, ‘thead’, ‘tbody’, ‘tfoot’, ‘col’, ‘colgroup’, ‘nobr’.
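Steps 4 through 6 can be sketched together in Python. This version is a streaming rewrite built on the standard library's `html.parser`, not the DOM transformation the post describes, and the names `Linearizer`, `linearize` and `VOID_TAGS` are made up for the example. The three lists are the ones given above.

```python
from html import escape
from html.parser import HTMLParser

REMOVED_TAGS = {"noscript", "style", "link", "object", "applet", "embed",
                "iframe", "base", "basefont", "map"}
REMOVED_ATTRS = {
    "rel", "rev", "style", "target", "align", "width", "height", "bgcolor",
    "text", "link", "vlink", "alink", "size", "hspace", "vspace",
    "background", "border", "cellspacing", "cellpadding", "valign",
    "colspan", "onload", "onunload", "onchange", "onsubmit", "onreset",
    "onselect", "onblur", "onfocus", "onkeydown", "onkeypress", "onkeyup",
    "onclick", "ondblclick", "onmousedown", "onmousemove", "onmouseover",
    "onmouseout", "onmouseup",
}
BLOCK_TAGS = {"table", "tr", "td", "th", "thead", "tbody", "tfoot",
              "col", "colgroup", "nobr"}
# Assumption: these have no closing tag, so they must not open a skipped region.
VOID_TAGS = {"link", "embed", "base", "basefont", "col"}


class Linearizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a removed element

    def _open(self, tag, attrs):
        name = "div" if tag in BLOCK_TAGS else tag
        kept = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs if k not in REMOVED_ATTRS
        )
        return f"<{name}{kept}>"

    def handle_starttag(self, tag, attrs):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open(tag, attrs))
            if tag == "col":  # void element: close the replacement div at once
                self.out.append("</div>")

    def handle_startendtag(self, tag, attrs):
        if tag in REMOVED_TAGS or self.skip_depth:
            return
        if tag in BLOCK_TAGS:
            self.out.append(self._open(tag, attrs) + "</div>")
        else:
            self.out.append(self._open(tag, attrs)[:-1] + " />")

    def handle_endtag(self, tag):
        if tag in REMOVED_TAGS:
            if tag not in VOID_TAGS and self.skip_depth:
                self.skip_depth -= 1
        elif not self.skip_depth and tag != "col":
            self.out.append("</div>" if tag in BLOCK_TAGS else f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that `class` and `id` are not in the removed list, so they survive the conversion and a handheld stylesheet can still target the resulting divs.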
Step 7: Chopping up the page: This is a real pain… each phone or model type can support a different page length, and getting valid markup after chopping the pages up is a challenge. Right now I’m winging it and dividing the pages into 4k or 12k chunks, with actual hard-coded checks for BlackBerrys, which get 100k, and Opera Mini, which is redirected completely. This will definitely change as there are lots of bugs currently - expect some forms to be split in the middle and not work. Again, though, for publishers it’s important to understand that there’s little to be done to help this part of the process - it’s less about formatting options, and all about what a phone’s browser can and can’t support.
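A minimal Python sketch of the chunking idea, under several assumptions: "k" means 1024 bytes, devices are recognized by a substring of the User-Agent header, and the page has already been cut into self-contained top-level blocks so that each chunk stays valid markup. The post does not say which phones get 4k and which get 12k, so the default is left as a parameter. The names `chunk_limit` and `paginate` are hypothetical.

```python
KB = 1024  # assumption: "4k" means 4 * 1024 bytes


def chunk_limit(user_agent, default_limit=4 * KB):
    """Pick a page size for a device; None means 'do not chunk, redirect'."""
    ua = user_agent.lower()
    if "opera mini" in ua:
        return None  # redirected completely instead of being chunked
    if "blackberry" in ua:
        return 100 * KB
    return default_limit  # 4k or 12k, depending on the device


def paginate(blocks, limit):
    """Greedily pack whole blocks into pages of at most `limit` bytes.

    A single block larger than the limit still becomes its own page,
    because splitting inside a block would risk invalid markup.
    """
    pages, current, size = [], [], 0
    for block in blocks:
        block_size = len(block.encode("utf-8"))
        if current and size + block_size > limit:
            pages.append("".join(current))
            current, size = [], 0
        current.append(block)
        size += block_size
    if current:
        pages.append("".join(current))
    return pages
```

Packing whole blocks avoids cutting an element in half, but only if a form sits entirely inside one block. That is the kind of case the post warns can break.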
So what’s left should be an HTML page that is very clean and linear. You can duplicate what it would look like in Firefox by using the Web Developer Toolbar plugin and choosing Miscellaneous -> Linearize Page, or using Opera and hitting Shift-F11. And again, when the pages pass through Mowser, the handheld stylesheet is included with the page as well. So even though many of the tags have been converted, the content should not have been touched, and the publisher still has complete control over many parts of the formatting of the page by referring to the class names and ids of their tags, which still persist after the page is reformatted.
Finally, the page has a small orange bar at the top to signal to the user that they are using an adapted page, but it doesn’t include specific branding or a logo. Only the stats and menu options at the very bottom of the page, under the pagination links, link back to Mowser.
Hope that helps - now back to debugging and some cool new features.
-Russ