{"id":32,"date":"2007-01-04T11:51:00","date_gmt":"2007-01-04T11:51:00","guid":{"rendered":"http:\/\/www.tutego.de\/blog\/javainsel\/?p=32"},"modified":"2007-01-04T11:51:00","modified_gmt":"2007-01-04T11:51:00","slug":"character-%e2%80%9echunking%e2%80%9c-bei-sax","status":"publish","type":"post","link":"https:\/\/www.tutego.de\/blog\/javainsel\/2007\/01\/character-%e2%80%9echunking%e2%80%9c-bei-sax\/","title":{"rendered":"Character \u201cChunking\u201d in SAX"},"content":{"rendered":"<\/p>\n<p>A long string is not necessarily passed to the characters() method in one go.<\/p>\n<p>&lt;TSeq_sequence&gt;<br \/>&nbsp;TAACCCTAACCCTAACCCTAACCCTAAC<br \/>&lt;\/TSeq_sequence&gt;<\/p>\n<p>For this element, a (sketched) sequence of calls such as the following is conceivable:<\/p>\n<p>characters(&quot;T&quot;);<br \/>characters(&quot;AACCCTAAC&quot;);<br \/>characters(&quot;CCTAACCCTAACCCTAAC&quot;);<\/p>\n<p>The application must expect character \u201cchunking\u201d and deal with it. It cannot be switched off! Many parsers split a CDATA section so that characters() is called once after every carriage return.<\/p>\n<p>One solution is to buffer the characters, for example in a StringBuilder\/StringBuffer.<\/p>\n<p>public void characters(char[] ch, int start, int length)<br \/>throws SAXException {<br \/>currentText.append( ch, start, length );<br \/>}<\/p>\n<p>When endElement() is called, the buffer can be evaluated and then reset for the next element.<\/p>\n<p>An interesting idea is to build a delegate that behaves like a ContentHandler towards the SAX parser but intercepts characters() and stores the characters internally. 
All other methods, such as endDocument(), \u2026 are forwarded to the original.<br \/>Before forwarding, however, each method \u201cflushes\u201d, that is, it calls characters() on the original and passes on all buffered characters.<\/p>\n<p>One such implementation can be viewed at <a href=\"http:\/\/koders.com\/java\/fid607C6038FB8FD26D12EED5B32042D3EF48038AA7.aspx\">http:\/\/koders.com\/java\/fid607C6038FB8FD26D12EED5B32042D3EF48038AA7.aspx<\/a>. It is licensed under the Apache License.<\/p>\n<pre><p>\/*<br \/>* Copyright 2004 Outerthought bvba and Schaubroeck nv<br \/>*<br \/>* Licensed under the Apache License, Version 2.0 (the \"License\");<br \/>* you may not use this file except in compliance with the License.<br \/>* You may obtain a copy of the License at<br \/>*<br \/>* <a href=\"http:\/\/www.apache.org\/licenses\/LICENSE-2.0\">http:\/\/www.apache.org\/licenses\/LICENSE-2.0<\/a><br \/>*<br \/>* Unless required by applicable law or agreed to in writing, software<br \/>* distributed under the License is distributed on an \"AS IS\" BASIS,<br \/>* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.<br \/>* See the License for the specific language governing permissions and<br \/>* limitations under the License.<br \/>*\/<br \/>package org.outerj.daisy.htmlcleaner; <\/p><br \/><p>import org.xml.sax.ContentHandler;<br \/>import org.xml.sax.SAXException;<br \/>import org.xml.sax.Locator;<br \/>import org.xml.sax.Attributes; <br \/><p>class MergeCharacterEventsHandler implements ContentHandler {<br \/>private ContentHandler consumer;<br \/>private char[] ch;<br \/>private int start = 0;<br \/>private int length = 0; <br \/><p>public MergeCharacterEventsHandler(ContentHandler consumer) {<br \/>this.consumer = consumer;<br \/>} <br \/><p>public void characters(char ch[], int start, int length) throws SAXException {<br \/>char[] newCh = new char[this.length + length];<br \/>if (this.ch != null)<br \/>System.arraycopy(this.ch, this.start, newCh, 
0, this.length);<br \/>System.arraycopy(ch, start, newCh, this.length, length);<br \/>this.start = 0;<br \/>this.length = newCh.length;<br \/>this.ch = newCh;<br \/>} <br \/><p>private void flushCharacters() throws SAXException {<br \/>if (ch != null) {<br \/>consumer.characters(ch, start, length);<br \/>ch = null;<br \/>start = 0;<br \/>length = 0;<br \/>}<br \/>} <br \/><p>public void endDocument() throws SAXException {<br \/>flushCharacters();<br \/>consumer.endDocument();<br \/>} <br \/><p>public void startDocument() throws SAXException {<br \/>flushCharacters();<br \/>consumer.startDocument();<br \/>} <br \/><p>public void ignorableWhitespace(char ch[], int start, int length) throws SAXException {<br \/>flushCharacters();<br \/>consumer.ignorableWhitespace(ch, start, length);<br \/>} <br \/><p>public void endPrefixMapping(String prefix) throws SAXException {<br \/>flushCharacters();<br \/>consumer.endPrefixMapping(prefix);<br \/>} <br \/><p>public void skippedEntity(String name) throws SAXException {<br \/>flushCharacters();<br \/>consumer.skippedEntity(name);<br \/>} <br \/><p>public void setDocumentLocator(Locator locator) {<br \/>consumer.setDocumentLocator(locator);<br \/>} <br \/><p>public void processingInstruction(String target, String data) throws SAXException {<br \/>flushCharacters();<br \/>consumer.processingInstruction(target, data);<br \/>} <br \/><p>public void startPrefixMapping(String prefix, String uri) throws SAXException {<br \/>flushCharacters();<br \/>consumer.startPrefixMapping(prefix, uri);<br \/>} <br \/><p>public void endElement(String namespaceURI, String localName, String qName) throws SAXException {<br \/>flushCharacters();<br \/>consumer.endElement(namespaceURI, localName, qName);<br \/>} <br \/><p>public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {<br \/>flushCharacters();<br \/>consumer.startElement(namespaceURI, localName, qName, atts);<br \/>}<br \/>}<\/p><br 
\/><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A long string is not necessarily passed to the characters() method in one go. &lt;TSeq_sequence&gt;&nbsp;TAACCCTAACCCTAACCCTAACCCTAAC&lt;\/TSeq_sequence&gt; For this element, a (sketched) sequence of calls such as the following is conceivable: characters(&quot;T&quot;);characters(&quot;AACCCTAAC&quot;);characters(&quot;CCTAACCCTAACCCTAAC&quot;); The application must expect character \u201cchunking\u201d and deal with it. It cannot be switched off! Many parsers split a CDATA section so that characters() [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[1],"tags":[],"class_list":["post-32","post","type-post","status-publish","format-standard","hentry","category-allgemein"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/posts\/32","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/comments?post=32"}],"version-history":[{"count":0,"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/posts\/32\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/media?parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/categories?post=32"}
,{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tutego.de\/blog\/javainsel\/wp-json\/wp\/v2\/tags?post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}