1. Process XML, JSON and other Data Formats

Two important data formats for exchanging documents are XML and JSON. XML is historically the older format; JSON is now common in communication between a server and a JavaScript application. JSON documents are also popular for configuration files.

While Java SE provides several classes for reading and writing XML documents, JSON support is only available in Jakarta EE (formerly Java Enterprise Edition) or through complementary open-source libraries. Many of the tasks in this chapter, therefore, resort to external libraries.

Description languages form a significant category of document formats. They define the structure of the data. Among the most important formats are HTML, XML, JSON, and PDF.

Apart from property files and ZIP archives, Java SE does not provide support for other data formats. This applies in particular to CSV files, PDFs, and Office documents. Fortunately, dozens of open-source libraries fill this gap, so you don’t have to program this functionality yourself.

Prerequisites

know how to add Maven dependencies
know StAX
be able to write XML documents
be able to create JAXB beans from XML schema files
be able to use object XML mapping with Jakarta XML Binding
basic understanding of Jakarta JSON libraries
be able to read ZIP archives

1.1. XML processing with Java

There are different Java APIs for handling XML documents. One approach holds the complete XML document as an object tree in memory; the other works like a data stream. StAX is a pull API: the program actively pulls elements from the data stream, and it can write elements in the same streaming fashion. This processing model is well suited to large documents that should not be held completely in memory.
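The pull model can be illustrated with a minimal sketch that uses only the StAX classes from Java SE. The class name StaxSketch, the methods writeCrew and readCrew, and the element names crew and pirate are made up for this illustration; they are not part of RecipeML or of any task in this chapter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class StaxSketch {

  // Writes <crew><pirate>name</pirate>...</crew> into a string (hypothetical format).
  static String writeCrew( List<String> names ) throws XMLStreamException {
    StringWriter out = new StringWriter();
    XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter( out );
    writer.writeStartDocument();
    writer.writeStartElement( "crew" );
    for ( String name : names ) {
      writer.writeStartElement( "pirate" );
      writer.writeCharacters( name );   // special characters such as & are escaped
      writer.writeEndElement();
    }
    writer.writeEndElement();
    writer.writeEndDocument();
    writer.close();
    return out.toString();
  }

  // Pulls events from the stream and collects the text of every pirate element.
  static List<String> readCrew( String xml ) throws XMLStreamException {
    List<String> names = new ArrayList<>();
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader( new StringReader( xml ) );
    while ( reader.hasNext() ) {
      if ( reader.next() == XMLStreamConstants.START_ELEMENT
           && "pirate".equals( reader.getLocalName() ) )
        names.add( reader.getElementText() );
    }
    reader.close();
    return names;
  }

  public static void main( String[] args ) throws XMLStreamException {
    String xml = writeCrew( List.of( "Bonny Brain", "Captain CiaoCiao" ) );
    System.out.println( xml );
    System.out.println( readCrew( xml ) );
  }
}
```

The reader never builds a tree; it only sees one event at a time, which is why memory use stays small even for large inputs.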

JAXB provides an easy way to convert Java objects to XML and XML back to Java objects later. Using annotations or external configuration files, the mapping can be precisely controlled.

1.1.1. Write XML file with recipe ⭐

Captain CiaoCiao has so many recipes that he needs a database. He has several quotes for database management systems and wants to see if they can import all his recipes.

His recipes are in RecipeML format, an XML format that is loosely specified: http://www.formatdata.com/recipeml/. There is a large database at https://dsquirrel.tripod.com/recipeml/indexrecipes2.html. An example from "Key Gourmet":

<?xml version="1.0" encoding="UTF-8"?>
<recipeml version="0.5">
  <recipe>
    <head>
      <title>11 Minute Strawberry Jam</title>
      <categories>
        <cat>Canning</cat>
        <cat>Preserves</cat>
        <cat>Jams &amp; jell</cat>
      </categories>
      <yield>8</yield>
    </head>
    <ingredients>
      <ing>
        <amt>
          <qty>3</qty>
          <unit>cups</unit>
        </amt>
        <item>Strawberries</item>
      </ing>
      <ing>
        <amt>
          <qty>3</qty>
          <unit>cups</unit>
        </amt>
        <item>Sugar</item>
      </ing>
    </ingredients>
    <directions>
      <step>Put the strawberries in a pan.</step>
      <step>Add 1 cup of sugar.</step>
      <step>Bring to a boil and boil for 4 minutes.</step>
      <step>Add the second cup of sugar and boil again for 4 minutes.</step>
      <step>Then add the third cup of sugar and boil for 3 minutes.</step>
      <step>Remove from stove, cool, stir occasionally.</step>
      <step>Pour in jars and seal.</step>
    </directions>
  </recipe>
</recipeml>

Task:

Write a program that outputs an XML document in RecipeML format.

Solution

1.1.2. Check if all images have an alt attribute ⭐

Images in HTML documents should always have an alt attribute.

Task:

Implement an XHTML checker that reports whether each img tag has an alt attribute set.
As the XHTML file, use, for example, http://tutego.de/download/index.xhtml.
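One possible approach, sketched here under assumptions, is to stream through the document with StAX and inspect the attributes of every img start tag. The class name AltAttributeCheck and the method imagesWithoutAlt are hypothetical, and the sketch works on an XHTML string without a DOCTYPE; a real page may need additional handling for its DTD and entities.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class AltAttributeCheck {

  // Returns the src values of all img elements that have no alt attribute.
  // An empty alt="" counts as set; a missing src is recorded as null.
  static List<String> imagesWithoutAlt( String xhtml ) throws XMLStreamException {
    List<String> offenders = new ArrayList<>();
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader( new StringReader( xhtml ) );
    while ( reader.hasNext() ) {
      if ( reader.next() == XMLStreamConstants.START_ELEMENT
           && "img".equals( reader.getLocalName() )
           && reader.getAttributeValue( null, "alt" ) == null )
        offenders.add( reader.getAttributeValue( null, "src" ) );
    }
    reader.close();
    return offenders;
  }

  public static void main( String[] args ) throws XMLStreamException {
    String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                   + "<img src=\"a.png\" alt=\"A\"/><img src=\"b.png\"/>"
                   + "</body></html>";
    System.out.println( "img without alt: " + imagesWithoutAlt( xhtml ) );
  }
}
```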

Solution

1.1.3. Writing Java objects with JAXB ⭐

JAXB simplifies access to XML documents by allowing a convenient mapping from a Java object to an XML document and vice versa.

JAXB was included in the Standard Edition in Java 6 and removed in Java 11, so we need two dependencies, the API and an implementation:

<dependency>
  <groupId>jakarta.xml.bind</groupId>
  <artifactId>jakarta.xml.bind-api</artifactId>
  <version>4.0.0</version>
</dependency>

<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>4.0.2</version>
  <scope>runtime</scope>
</dependency>

Task:

Write JAXB beans so that we can generate the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ingredients>
    <ing>
        <amt>
            <qty>3</qty>
            <unit>cups</unit>
        </amt>
        <item>Sugar</item>
    </ing>
    <ing>
        <amt>
            <qty>3</qty>
            <unit>cups</unit>
        </amt>
    </ing>
</ingredients>

Create the classes Ingredients, Ing, and Amt.
Give the classes corresponding instance variables; it is ok if these are public.
Consider which annotation to use.

Solution

1.1.4. Read in jokes and laugh heartily ⭐⭐

Bonny Brain also laughs at simple jokes, and she can never get enough of them. She finds the site https://sv443.net/jokeapi/v2/joke/Any?format=xml on the Internet, which always provides her with new jokes.

The format is XML, which is good for transporting data, but we are Java developers and want everything in objects! With JAXB we want to read the XML files and convert them into Java objects, so we can develop custom output later.

The first step is to automatically generate JAXB beans from an XML schema file. The schema for the Joke page is as follows — don’t worry, you don’t have to understand it.

<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="data">
    <xs:complexType>
      <xs:sequence>
        <xs:element type="xs:string" name="category" />
        <xs:element type="xs:string" name="type" />
        <xs:element name="flags">
          <xs:complexType>
            <xs:sequence>
              <xs:element type="xs:boolean" name="nsfw" />
              <xs:element type="xs:boolean" name="religious" />
              <xs:element type="xs:boolean" name="political" />
              <xs:element type="xs:boolean" name="racist" />
              <xs:element type="xs:boolean" name="sexist" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element type="xs:string" name="setup" />
        <xs:element type="xs:string" name="delivery" />
        <xs:element type="xs:int" name="id" />
        <xs:element type="xs:string" name="error" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The provider does not offer a schema, so this one was generated from a sample XML response using https://www.freeformatter.com/xsd-generator.html.

Task:

Download the XML schema definition from http://tutego.de/download/jokes.xsd, and place the file in the Maven directory src/main/resources.

Add the following element to the POM file:

<build>
<plugins>
  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>jaxb2-maven-plugin</artifactId>
    <version>3.1.0</version>
    <executions>
      <execution>
        <id>xjc</id>
        <goals>
          <goal>xjc</goal>
        </goals>
      </execution>
    </executions>
    <configuration>
      <packageName>com.tutego.exercise.xml.joke</packageName>
      <sources>
        <source>src/main/resources/jokes.xsd</source>
      </sources>
      <generateEpisode>false</generateEpisode>
      <outputDirectory>${basedir}/src/main/java</outputDirectory>
      <clearOutputDir>false</clearOutputDir>
      <noGeneratedHeaderComments>true</noGeneratedHeaderComments>
      <locale>en</locale>
    </configuration>
  </plugin>
</plugins>
</build>

The plugin section includes org.codehaus.mojo:jaxb2-maven-plugin and configures it; all options are explained at https://www.mojohaus.org/jaxb2-maven-plugin/Documentation/v3.1.0/index.html.

From the command line, launch mvn generate-sources. This will generate two classes in the com.tutego.exercise.xml.joke package:
- Data
- ObjectFactory
Use JAXB to get a joke from the URL https://sv443.net/jokeapi/v2/joke/Any?format=xml and convert it to an object.

Solution

1.2. JSON

Java SE does not provide built-in support for JSON, but there are two standards from the Jakarta EE project that provide this support: Jakarta JSON Processing (JSON-P) (https://jakarta.ee/specifications/jsonp/) and Jakarta JSON Binding (JSON-B) (https://jakarta.ee/specifications/jsonb/). JSON-B allows for the mapping of Java objects to JSON and vice versa, while JSON-P provides APIs for processing JSON data. A popular alternative outside these standards is the Jackson library (https://github.com/FasterXML/jackson).

To use JSON-B, we need to add both the API and an implementation to our project’s POM. The reference implementation, Yasson, is a good choice.

<dependency>
  <groupId>jakarta.json.bind</groupId>
  <artifactId>jakarta.json.bind-api</artifactId>
  <version>3.0.0</version>
</dependency>

<dependency>
  <groupId>org.eclipse</groupId>
  <artifactId>yasson</artifactId>
  <version>3.0.0</version>
  <scope>runtime</scope>
</dependency>

1.2.1. Hacker News JSON exploit ⭐

The page Hacker News (https://news.ycombinator.com/) was briefly introduced in the chapter "Network Programming".

The URL https://hacker-news.firebaseio.com/v0/item/24857356.json returns a JSON object for the message with ID 24857356. The response (formatted, with the kids array slightly shortened) looks like this:

{
   "by":"luu",
   "descendants":257,
   "id":24857356,
   "kids":[
      24858151,
      24857761,
      24858192,
      24858887
   ],
   "score":353,
   "time":1603370419,
   "title":"The physiological effects of slow breathing in the healthy human",
   "type":"story",
   "url":"https://breathe.ersjournals.com/content/13/4/298"
}

With JSON-B this JSON can be converted into a Map:

Map map = JsonbBuilder.create().fromJson( source, Map.class );

The source can be a String, Reader or InputStream.

Task:

Write a new method Map<Object, Object> news(long id) that, using JSON-B, obtains the JSON document at "https://hacker-news.firebaseio.com/v0/item/" + id + ".json" and converts it to a Map and returns it.

Example:

news(24857356).get("title") → "The physiological effects of slow breathing in the healthy human"
news(111111).get("title") → null.

Solution

1.2.2. Read and write editor configurations as JSON ⭐⭐

The developers are working on a new editor for Captain CiaoCiao, and the configurations should be saved in a JSON file.

Task:

Write a class Settings so that the following configurations can be mapped:

{
  "editor" : {
    "cursorStyle" : "line",
    "folding" : true,
    "fontFamily" : [ "Consolas, 'Courier New', monospace" ],
    "fontSize" : 22,
    "fontWeight" : "normal"
  },
  "workbench" : {
    "colorTheme" : "Default Dark+"
  },
  "terminal" : {
    "integrated.unicodeVersion" : "11"
  }
}

The JSON file gives a good indication of the data types:
- cursorStyle is String, folding is boolean, fontFamily is an array or List.
If an attribute is not set, which means null, it should not be written.
For terminal, the contained keys and values are not known in advance; they should be stored in a Map<String, String>.

Solutions: Settings, EditorPreferences, EditorPreferencesDemo.

1.3. HTML

HTML is an important markup language. The Java standard library does not provide support for HTML documents, except for what the javax.swing.JEditorPane can do, which is to render HTML 3.2 and a subset of CSS 1.0.

To read and write valid HTML documents correctly in Java programs, and to access individual nodes, we have to turn to (open-source) libraries.

1.3.1. Load Wikipedia images with jsoup ⭐⭐

The popular open-source library jsoup (https://jsoup.org/) loads the content of web pages and represents the content in a tree in memory.

Include the following dependency in the POM:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.15.4</version>
</dependency>

Task:

Study the examples at https://jsoup.org/cookbook/extracting-data/dom-navigation and https://jsoup.org/cookbook/extracting-data/selector-syntax.
Retrieve all images from the Wikipedia main page and save them to your file system.

Solution

1.4. Office documents

Microsoft Office continues to be at the top when it comes to word processing and spreadsheets. For many years, the binary file format has been well known, and there are Java libraries for reading and writing. Meanwhile, processing Microsoft Office documents has become much easier since the documents are, at their core, XML documents that are combined into a ZIP archive. Java support is excellent.

1.4.1. Generate Word files with screenshots ⭐⭐

Read the Wikipedia entry for POI: https://de.wikipedia.org/wiki/Apache_POI.

Task:

Add the following dependency to the Maven POM to include Apache POI and the necessary dependencies for DOCX:

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>4.1.2</version>
</dependency>

Study the source code of SimpleImages.java.

Java allows you to capture screenshots, like this:

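// Assumption: SCREEN_SIZE is a constant that covers the whole primary screen
private static final Rectangle SCREEN_SIZE =
    new Rectangle( Toolkit.getDefaultToolkit().getScreenSize() );

private static byte[] getScreenCapture() throws AWTException, IOException {
  BufferedImage screenCapture = new Robot().createScreenCapture( SCREEN_SIZE );
  ByteArrayOutputStream os = new ByteArrayOutputStream();
  ImageIO.write( screenCapture, "jpeg", os );   // encode the capture as JPEG in memory
  return os.toByteArray();
}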

Write a Java program that takes a screenshot every 5 seconds for 20 seconds and attaches the image to the Word document.

Solution

1.5. Archives

Files with metadata are collected in archives. A well-known and popular archive format is ZIP, which not only combines the data but also compresses it. Many archive formats can also store the files encrypted and store checksums so that errors in the transfer can be detected later.

Java offers two ways to work with ZIP archives: since Java 7 there is a ZIP file system provider, and since Java 1.1 there are the classes ZipFile and ZipEntry in the package java.util.zip.
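Both ways can be compared in a small sketch that uses only Java SE. The class name ZipSketch, its method names, and the entry names are made up for this illustration; the sketch first creates a sample archive in a temporary file so that it does not depend on any download.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipSketch {

  // Creates a temporary ZIP archive with one small text entry per name.
  static Path createSampleZip( List<String> names ) throws IOException {
    Path zip = Files.createTempFile( "sample", ".zip" );
    try ( OutputStream out = Files.newOutputStream( zip );
          ZipOutputStream zos = new ZipOutputStream( out ) ) {
      for ( String name : names ) {
        zos.putNextEntry( new ZipEntry( name ) );
        zos.write( ("content of " + name).getBytes( StandardCharsets.UTF_8 ) );
        zos.closeEntry();
      }
    }
    return zip;
  }

  // Way 1: the classic classes ZipFile and ZipEntry.
  static List<String> namesViaZipFile( Path zip ) throws IOException {
    List<String> names = new ArrayList<>();
    try ( ZipFile zipFile = new ZipFile( zip.toFile() ) ) {
      zipFile.stream().forEach( entry -> names.add( entry.getName() ) );
    }
    return names;
  }

  // Way 2: the ZIP file system provider; entries are addressed as ordinary Paths.
  static String readViaFileSystem( Path zip, String entryName ) throws IOException {
    // The cast selects the (Path, ClassLoader) overload unambiguously
    try ( FileSystem fs = FileSystems.newFileSystem( zip, (ClassLoader) null ) ) {
      return Files.readString( fs.getPath( entryName ) );
    }
  }

  public static void main( String[] args ) throws IOException {
    Path zip = createSampleZip( List.of( "a.txt", "b.txt" ) );
    System.out.println( namesViaZipFile( zip ) );
    System.out.println( readViaFileSystem( zip, "b.txt" ) );
    Files.delete( zip );
  }
}
```

With the file system provider, the familiar methods of the Files class work on archive entries, which is convenient when the rest of the program already uses Path objects.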

1.5.1. Play insect sounds from ZIP archive ⭐⭐

Bonny Brain likes to listen to the sounds of insects and uses the WAV collection of https://catalog.data.gov/dataset/bug-bytes-sound-library-stored-product-insect-pest-sounds, where various audio files are offered for download in a ZIP.

Task:

Study the documentation at https://christian-schlichtherle.bitbucket.io/truezip/truezip-path/.

Include two dependencies in the Maven POM:

<dependency>
  <groupId>de.schlichtherle.truezip</groupId>
  <artifactId>truezip-path</artifactId>
  <version>7.7.10</version>
</dependency>

<dependency>
  <groupId>de.schlichtherle.truezip</groupId>
  <artifactId>truezip-driver-zip</artifactId>
  <version>7.7.10</version>
</dependency>

Download the ZIP with the insect sounds, but do not unpack it.
Build a TPath object for the ZIP file.
Transfer all filenames from the ZIP file into a list: Files.newDirectoryStream(…) helps here.
Write an infinite loop, and
- select a random WAV file,
- open the random file with Files.newInputStream(…), decorate it with a BufferedInputStream, and pass it to AudioSystem.getAudioInputStream(…). Play the WAV file with the following code, where ais is the AudioInputStream:
  Clip clip = AudioSystem.getClip();
  clip.open( ais );
  clip.start();
  TimeUnit.MICROSECONDS.sleep( clip.getMicrosecondLength() + 50 );
  clip.close();
  We already worked with the javax.sound API in the chapter "Exceptions".

Solution