1. Advanced String Processing

After dealing with basic data types around characters and strings in another chapter, we will now discuss advanced string processing. The topics are formatted output, regular expressions, string splitting.

Prerequisites

  • be able to format strings

  • be able to match, search, replace with regular expressions

  • be able to split strings

  • understand character encodings and UTF-8

Data types used in this chapter:

1.1. Format strings

To format strings, numbers and temporal data as a string, there are different ways to do it in Java. In the package java.text you can find the classes MessageFormat, DateFormat and DecimalFormat as well as the class Formatter and in String the method String.format(…​).The next tasks can be solved using the formatting strings from Formatter in a rather easy way.

1.1.1. Build ASCII table ⭐

Bonny Brain has installed a new app on her Aye Phone that shows her the ASCII alphabet:

$ ascii
Usage: ascii [-adxohv] [-t] [char-alias...]
   -t = one-line output  -a = vertical format
   -d = Decimal table  -o = octal table  -x = hex table  -b binary table
   -h = This help screen -v = version information
Prints all aliases of an ASCII character. Args may be chars, C \-escapes,
English names, ^-escapes, ASCII mnemonics, or numerics in decimal/octal/hex.

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

However, their Aye Phone doesn’t have such a wide screen, and the first two blocks are not visible characters anyway.

Task:

  • Write a program that prints all ASCII characters from position 32 to 127 in the same formatting as the Unix program ascii does.

  • At position 127 write DEL.

1.1.2. Aligned outputs ⭐

Captain CiaoCiao needs a table of the following type for a listing:

Dory Dab     paid
Bob Banjo    paid
Cod Buri     paid
Bugsy        not paid

Task:

  • Write a method printList(String[] names, boolean[] paid) that prints a collection on the screen. The first string array contains all names, the second array contains information whether the person has paid or not.

  • All names can be of different length, but the texts in the second column should be aligned.

  • The longest string in the first column has a distance of four spaces to the second column.

  • If passed arrays are null, a NullPointerException must be thrown.

1.2. Regular expressions and pattern recognition

Regular expressions are a curse and a blessing for many. Used incorrectly, they lead to programs that are impossible to read later, used correctly, they shorten the program massively and contribute to clarity.

The next exercises will show that we can use regular expressions to test whether

  1. a string completely matches a regular expression,

  2. a partial string exists and, if so, find out where, and then replace it.

Later, we will also use regular expressions to specify separators and decompose strings.

1.2.1. Quiz: Define regex ⭐

What do the regex look like to match or find the following?

  • a string of exactly 10 digits

  • a string of 5 to 10 digits and letters

  • a string ending in ., ! or ? like a sentence

  • a non-empty string that does not contain digits

  • an official title or a name title: Prof., Dr., Dr. med., Dr. h.c. in the string

1.2.2. Determine popularity in social media ⭐

Of course Captain CiaoCiao is active on social media, his identifier is #CaptainCiaoCiao or @CaptainCiaoCiao.

Now Captain CiaoCiao wants to know how popular he is.

Task:

  • Given an aggregated text with messages; how often does #CaptainCiaoCiao or @CaptainCiaoCiao occur there?

Example:

  • For the following input, the result is 2.

    Make me a baby #CaptainCiaoCiao
    Hey @CaptainCiaoCiao, where is the recruitment test?
    What is a hacker’s favorite pop group? The Black IP’s.

1.2.3. Detect scanned values ⭐

Bonny Brain receives scanned lists of numbers that need to be processed electronically. In the first step, it sends the scans through an OCR recognizer, and the end result is ASCII text. The numbers from the OCR recognition always look like this:

Representation of the numbers 0 to 9
 000   11   22  333  4  4 5555   6   77777  888   9999
0  00 111  2  2    3 4  4 5     6       7  8   8 9   9
0 0 0  11    2   33  4444 555  6666    7    888   9999
00  0  11   2      3    4    5 6   6   7   8   8    9
 000  11l1 2222 333     4 555   666    7    888    9

Task:

  • Given is a line from the scan with numbers from the format shown. Convert the numbers to an integer.

  • There could be missing spaces after the last digit, and there could be several spaces between the large characters.

Example:

  • If the string (written in the text block syntax) is

    String ocr = """
        4  4 77777  11   11    4  4  22
        4  4    7  111  111    4  4 2  2
        4444   7    11   11    4444   2
           4   7    11   11       4  2
           4   7   11l1 11l1      4 2222""";

    so the desired result should be 471142.

If you want to play around with the strings, you can find a way at https://patorjk.com/software/taag/#p=display&f=Alphabet&t=0123456789.

1.2.4. Quiet please! Defuse shouting texts ⭐

Captain CiaoCiao often gets letters, and the senders often SCREAM in capital letters — how nasty for his eyes.

Task:

  • Write a method String silentShoutingWords(String) that converts all capitalized words over three letters in length to lowercase.

Example:

  • silentShoutingWords("AY Captain! Smutje MUST GO!")"AY Captain! Smutje must GO!"

1.2.5. Convert time with AM and PM to 24-hour count ⭐⭐

Bonny Brain often gets messages where the time is given in AM and PM.

We raid the harbor at 11:00 PM and meet on the amusement mile at 1:30 AM.

Bonny Brain doesn’t like that, she wants only the 24-hour count of Military Time.

Task:

  • Write a converter that converts strings with AM/PM (case-insensitive, even with periods) to Military Time. As a reminder, 12:00 AM is 00:00, and 12:00 PM is 12:00.

Examples:

  • "Harbour: 11:00 PM, entertainment districts: 1:30 a.m.!"Harbour: 2300, entertainment districts: 0130!"

  • "Get out of bed: 12:00AM, bake a cake: 12 PM.""Get out of bed: 0000, bake a cake: 1200"

1.3. Decompose strings into tokens

Tokenizing is the opposite of constructing and formatting a string. A string is split into substrings, with separators determining the separation points. Java provides various classes for tokenizing strings and input. The separators can be symbols or strings described by regular expressions. The split(…​) method of the String class works with regular expressions just like the Scanner, the StringTokenizer class does not.

1.3.1. Split address lines with the StringTokenizer ⭐

The software of Captain CiaoCiao has to evaluate addresses consisting of three or four lines.

The meaning of the lines

linecontentoptional

1

name

no

2

street

no

3

city

no

4

country

yes

The lines are separated with a line break. There are four valid separator symbols or sequences:

Table 1. line break
character (abbreviation)decimalhexadecimalescape sequence

LF

10

0A

\n

CR

13

0D

\r

CR LF

13 10

0D 0A

\r\n

LF CR

10 13

0A 0D

\n\r

LF is the abbreviation for "line feed" and CR for "carriage return"; in old teleprinters, CR moved the carriage to the first column and, LF pushed the paper up.

Traditionally, DOS and Microsoft Windows use the combination \r\n, while Unix systems use \n.

Task:

  • Break a newline-separated string into four lines, and assign the lines to the variables name, street, city, country.

  • If a fourth line with the country name is not given, let country be "Drusselstein".

  • Reassemble the line as a CSV line separated by semicolons.

Examples:

  • "Boots and Bootles\n21 Pickle Street\n424242 Douglas\nArendelle"Boots and Bootles;21 Pickle street;424242 Douglas;Arendelle

  • "Doofenshmirtz Evil Inc.\nStrudelkuschel 4427\nDanville"Doofenshmirtz Evil Inc.;Strudelkuschel 4427;Gimmelshtump;Drusselstein

1.3.2. Split sentences into words and reverse them ⭐

Bonny Brain is waiting for a message, but something went wrong with the transmission — all the words are reversed!

erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!

Task:

  1. Break the string into words. Separators of words are spaces and punctuation marks.

  2. Turn over all words one by one.

  3. Output the words one after the other separated by a space. The punctuation marks and other separators do not matter.

Example:

  • "erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!""Where did the pirate purchase his hook At the hand second shop"

1.3.3. Check relations between numbers ⭐

Captain CiaoCiao practices archery, and he records the scores from 0 to 10 in a list. He also notes if he got better, worse, or if the score stays the same. It may look like this:

1 < 2 > 1 < 10 = 10 > 2

Goldy Goldfish has the task of checking the relation signs <, > and =.

Task:

  • Write a program that gets a string like the one in the example and returns true if all relation characters are correct, and false otherwise.

Examples:

  • 1 < 2 > 1 < 10 = 10 > 2true

  • 1 < 1false

  • 1 <false

  • 1true

1.3.4. Convert A1 notation to columns and rows ⭐⭐

Captain CiaoCiao keeps records of his loot and uses spreadsheets. With his staff he discusses the figures, and to address the cells he uses the column and row index; for example, he says 4-16 to refer to the 4th column and the 16th row. Now he has heard of a whole new way to name cells, A1 notation, which uses a new kind of software called ECKSEL. This involves coding the column with letters from A to Z, according to the following scheme:

A, B, ..., Z, AA, ..., AZ, BA, ..., ZZ, AAA, AAB, ...

The lines are still described with numbers. Thus A2 stands for the cell 1-2.

Since Captain CiaoCiao has its difficulties with A1 notation, the specification is to be converted back to numeric columns and rows.

Task:

  • Write a method parseA1Notation(String) that gets a string in A1 notation and returns an array with two elements, in which at position 0 is the column and at position 1 is the row.

Examples:

  • parseA1Notation( "A1" )[1, 1]

  • parseA1Notation( "Z2" )[26, 2]

  • parseA1Notation( "AA34" )[27, 34]

  • parseA1Notation( "BZ" )[0, 0]

  • parseA1Notation( "34" )[0, 0]

  • parseA1Notation( " " )[0, 0]

  • parseA1Notation( "" )[0, 0]

1.3.5. Parse simple CSV files with coordinates ⭐

Bonny Brain notes the locations with loot in a CSV file coordinates.csv, where the coordinates are floating point numbers separated by commas.

For example, the file looks like this:

com/tutego/exercise/string/coordinates.csv
20.091612,-155.676695
23.087301,-73.643472
21.305452,-71.690421

Task:

  • Create a CSV file by hand. It should contain several lines with coordinates; the coordinates are separated by a comma.

  • A Java program should read the CSV file and output an HTML file with SVG for the polygon course on the screen.

  • Use the class Scanner to parse the file. Make sure to initialize the Scanner with useLocale(Locale.ENGLISH) if your locale is not English.

Example: For the upper block we want to create

<svg height="210" width="500">
<polygon points="20.091612,-155.676695 23.087301,-73.643472 21.305452,-71.690421 " style="fill:lime;stroke:purple;stroke-width:1" />
</svg>

1.3.6. Compress strings lossless by runlength encoding ⭐⭐⭐

To reduce the volume of data, files are often compressed. There are different compression algorithms; some are lossy like vowel removal, others work without loss like ZIP. Lossy compression is found in images, JPEG is a good example. Depending on the degree of compression, the image quality degrades. In JPEG, very high compression results in an image with strong artifacts.

A simple lossless compression is run-length encoding. The idea is to combine a sequence of identical symbols so that only the number and the symbol are written. The graphic format GIF, for example, uses this form of compression. Therefore, images with many monochrome lines are also smaller than, for example, images in which each pixel has a different color.

The next task is about run-length encoding. Suppose a string consists of a sequence of . (dot) and - (minus sign), such as:

--....--------..-

To shorten the length of strings, we can first write the symbol followed by the number of symbols. The string with 17 characters could be shortened to the following string with 9 characters:

-2.4-8.2-

Task:

  1. Create a new class SimpleStringCompressor.

  2. Write a static method String compress(String) that encodes sequences of . and - according to the described algorithm: First comes the character, then the number.

  3. Write a decoder String decompress(String) that unpacks the compressed string. Let decompress(compress(input)) be equal to input.

Extensions:

  • The program shall be able to handle all non-digits.

  • Refine the program so that the number is omitted if the character occurs only exactly once.

1.4. Character encodings and Unicode Collation Algorithm

Over the network and in the file system, everything is stored as a byte. One byte can have 256 different values, but this is not enough for all the characters in the world. Therefore there are character encodings, which can encode all characters in the world in a certain way. The character encodings differ thereby. Sometimes characters are mapped to a fixed number of bytes, sometimes characters are mapped to a different number of bytes. Sometimes characters are mapped to a fixed number of bytes, sometimes characters are mapped to a different number of bytes, or maybe certain characters are simply not recognized at all. Java supports all kinds of character encodings, but the most important character encoding today is the UTF-8 encoding.

1.4.1. Quiz: Encoding for Unicode characters ⭐

What characterizes a UTF-8 encoding?

1.4.2. Quiz: Order of strings with and without collator ⭐

After comparing with the Comparator, are the outputs positive, negative or 0?

Comparator<String> comparator = Comparator.naturalOrder();
System.out.println( comparator.compare( "a", "ä" ) );
System.out.println( comparator.compare( "ä", "z" ) );

Comparator<Object> collator = Collator.getInstance( Locale.GERMAN );
System.out.println( collator.compare( "a", "ä" ) );
System.out.println( collator.compare( "ä", "z" ) );

1.5. Suggested solutions

1.5.1. Build ASCII table

com/tutego/exercise/string/PrintAsciiTable.java
String header = "Dec Hex   Dec Hex   Dec Hex   Dec Hex   Dec Hex   Dec Hex";
System.out.println( header );

for ( int row = 0; row < 16; row++ ) {
  for ( int asciiCode = 32 + row; asciiCode <= 127; asciiCode += 16 ) {
    System.out.printf(
        "%1$3d %1$X %2$s  ", asciiCode,
        asciiCode == 127 ? "DEL" : Character.toString( asciiCode ) );
  }
  System.out.println();
}

The first thing the program does is write the table row, which in principle we could generate dynamically using a loop, but we’ll make it simple and write out the header statically.

The generated table has 16 rows, which generates a loop. In principle, we could also dynamically calculate the number of rows from the start and end values and the number of columns (6 in our case). However, we know that if we start at position 32, and end at 127, with 6 columns we need 16 rows.

The inner loop writes all columns for a given row. At the top left is the first element, the space character. To the right, the character increases not by one, but by 16, which is therefore the loop counter. In the next line we don’t start at 32 but at 33, the pattern here is the following: The start value of the inner loop is 32 + row, so 32 plus the row number. Altogether the loop ends when the ASCII code has reached 127.

In the body of the inner loop the character must be printed. Basically, each character is written out as a string. The condition operator checks for the DEL character at position 127 — all other characters are converted to strings of length 1 via the Character method. The format string contains three parts: The first two parts access the first format argument and start by writing the position of the character as a decimal number, followed by the number in hexadecimal format. The integer is to be padded with white space on the left; for the hexadecimal number, we don’t need to include any width information because it always uses two digits when starting with 32. The third block contains the character as a string and the second format argument. At the end of the line we set a line feed.

Output of PrintAsciiTable
Dec Hex   Dec Hex   Dec Hex   Dec Hex   Dec Hex   Dec Hex
 32 20     48 30 0   64 40 @   80 50 P   96 60 `  112 70 p
 33 21 !   49 31 1   65 41 A   81 51 Q   97 61 a  113 71 q
 34 22 "   50 32 2   66 42 B   82 52 R   98 62 b  114 72 r
 35 23 #   51 33 3   67 43 C   83 53 S   99 63 c  115 73 s
 36 24 $   52 34 4   68 44 D   84 54 T  100 64 d  116 74 t
 37 25 %   53 35 5   69 45 E   85 55 U  101 65 e  117 75 u
 38 26 &   54 36 6   70 46 F   86 56 V  102 66 f  118 76 v
 39 27 '   55 37 7   71 47 G   87 57 W  103 67 g  119 77 w
 40 28 (   56 38 8   72 48 H   88 58 X  104 68 h  120 78 x
 41 29 )   57 39 9   73 49 I   89 59 Y  105 69 i  121 79 y
 42 2A *   58 3A :   74 4A J   90 5A Z  106 6A j  122 7A z
 43 2B +   59 3B ;   75 4B K   91 5B [  107 6B k  123 7B {
 44 2C ,   60 3C <   76 4C L   92 5C \  108 6C l  124 7C |
 45 2D -   61 3D =   77 4D M   93 5D ]  109 6D m  125 7D }
 46 2E .   62 3E >   78 4E N   94 5E ^  110 6E n  126 7E ~
 47 2F /   63 3F ?   79 4F O   95 5F _  111 6F o  127 7F DEL

1.5.2. Aligned outputs

com/tutego/exercise/string/PaidOrNotPaid.java
public static void printList( String[] names, boolean[] paid ) {

  if ( names.length != paid.length )
    throw new IllegalArgumentException(
        "Number of names and paid entries are not the same, but "
        + names.length + " and " + paid.length );

  int maxColumnLength = 0;
  for ( String name : names )
    maxColumnLength = Math.max( maxColumnLength, name.length() );

  String format = "%-" + maxColumnLength + "s    %spaid%n";

  for ( int i = 0; i < names.length; i++ ) {
    System.out.printf( format, names[ i ], paid[ i ] ? "" : "not " );
  }
}

First, the method checks if the two arrays have an equal number of elements, and if not, an IllegalArgumentException is thrown. Accessing length generates the desired NullPointerException if the arrays are null.

Since it is unknown in advance how wide the first left column will be, we have to go over all the strings and determine the maximum length. Using this maximum maxColumnLength we can build a format string. The format string gets a format specifier that determines the width of a string, padded with spaces. The format string has a leading minus sign, so this gives a left-aligned string that is padded on the right with spaces up to the maximum length maxColumnLength. In addition, the format string contains a space to the right column of four spaces.

The right column contains either "paid" or "not paid". That is, the string "paid" always occurs, and only the word "not" is to be set dependent on a boolean value. This is done by the condition operator, which either returns an empty string or the string "not" as a format argument for the format string.

1.5.3. Quiz: Define regex

com/tutego/exercise/string/RegexExamples.java
// A string of exactly 10 digits

Pattern p12 = Pattern.compile( "\\d{10}" );
Matcher m12 = p12.matcher( "0123456789" );
System.out.println( m12.matches() );       // true
Pattern p11 = Pattern.compile( "\\d{10}" );
Matcher m11 = p11.matcher( "1" );
System.out.println( m11.matches() );                // false
Pattern p10 = Pattern.compile( "\\d{10}" );
Matcher m10 = p10.matcher( "abcdefghij" );
System.out.println( m10.matches() );       // false

// A string of 5 to 10 numbers and letters.

Pattern p9 = Pattern.compile( "\\d{5,10}" );
Matcher m9 = p9.matcher( "01234567" );
System.out.println( m9.matches() );       // true
Pattern p8 = Pattern.compile( "\\d{5,10}" );
Matcher m8 = p8.matcher( "0" );
System.out.println( m8.matches() );              // false
Pattern p7 = Pattern.compile( "\\d{5,10}" );
Matcher m7 = p7.matcher( "01234567890123" );
System.out.println( m7.matches() ); // false

// A string that ends with `.`, `!` or `?`, like a sentence.

Pattern p6 = Pattern.compile( ".*?[.!?]" );
Matcher m6 = p6.matcher( "Ja? Ja!" );
System.out.println( m6.matches() );         // true
Pattern p5 = Pattern.compile( ".*?[.!?]" );
Matcher m5 = p5.matcher( "Nein?" );
System.out.println( m5.matches() );           // true
Pattern p4 = Pattern.compile( ".*?[.!?]" );
Matcher m4 = p4.matcher( "Ok." );
System.out.println( m4.matches() );             // true
Pattern p3 = Pattern.compile( ".*?[.!?]" );
Matcher m3 = p3.matcher( "No" );
System.out.println( m3.matches() );              // true

// A string that contains no digits.

Pattern p2 = Pattern.compile( "\\D*" );
Matcher m2 = p2.matcher( "Ciao" );
System.out.println( m2.matches() );                // true
Pattern p1 = Pattern.compile( "\\D*" );
Matcher m1 = p1.matcher( "Cia0" );
System.out.println( m1.matches() );                // false
Pattern p = Pattern.compile( "\\D*" );
Matcher m = p.matcher( "" );
System.out.println( m.matches() );                    // false

// The official title, Prof., Dr., Dr. med., Dr. h.c.

Pattern pattern = Pattern.compile( "(Prof\\.|Dr\\. med\\.|Dr\\. h\\.c\\.|Dr\\.)\\s" );
System.out.println( pattern.matcher( "Hallo Herr Dr. Miles" ).find() ); // true
System.out.println( pattern.matcher( "Nix mit Dr. h.c. Thai med." ).find() ); // true
System.out.println( pattern.matcher( "Megan Dr.Thai" ).find() ); // false

In the last example, it is about finding and not about a comlete match, so the find() method of Matcher is used. In principle, tests of existence can also be formulated by .*FIND.* and matches(…​), but matching makes a little more work for the regex engine than just providing the first find and not having to look all the way to the end.

1.5.4. Determine popularity in social media

com/tutego/exercise/string/Popularity.java
String text = """
    Make me a baby #CaptainCiaoCiao
    Hey @CaptainCiaoCiao, where is the recruitment test?
    What is a hacker’s favorite pop group? The Black IP’s.""";

Pattern pattern = Pattern.compile( "[#@]CaptainCiaoCiao" );
System.out.println( pattern.matcher( text ).results().count() );

The task can be solved with a one-liner, because the Matcher method results() returns a Stream<MatchResult>, and for Stream objects the count() method determines the number of occurrences.

Java 8 Backport

Before Java 9, J there is no direct way to return the number of occurrences for Pattern or Matcher types. Therefore we have to work with a small workaround. After building the Pattern object and requesting it from the Matcher we run over the string with a loop as many times with the find() method until there are no more findings. As we run through, we count up a variable and at the end we know how many occurrences were counted:

Matcher matcher = Pattern.compile( "[#@]CaptainCiaoCiao" ).matcher( text );
int count = 0;
while ( matcher.find() ) count++;

The proposed solution further uses Java 15 text blocks; they would need to be replaced with a concatenation of lines for Java 8.

1.5.5. Detect scanned values

com/tutego/exercise/string/OcrNumbers.java
private final static String[] searches = {
    "000", "11", "22", "333", "44", "55555", "6", "77777", "888", "9999" };

private static int parseOcrNumbers( String string ) {
  String line = new Scanner( string ).nextLine().replaceAll( "\\s+", "" );

  for ( int i = 0; i < searches.length; i++ )
    line = line.replaceAll( searches[ i ], "" + i );

  return Integer.parseInt( line );
}

If we take a closer look at the big numbers, we quickly realize that each large number contains the actual number itself. The large zero contains 0, the large one contains 1 and so on. So we don`t have to evaluate several lines, it`s enough to take any line. We just take the first one.

After extracting the first line, we can search for the substring that makes up the single digits. The blanks interfere a bit, so they are removed in the first step. A new String is created from the first line, in which all spaces have been removed and then the digits of each major character are aligned. For example, if the line starts with 000, we know that the first digit in the result must be 0. We can simply use replaceAll(…​) to replace the sequence 000 with 0. For example, if there are two zeros in a row, 0000 correctly becomes 00.

Since not only three zeros have to be replaced by a zero, but also two ones by a one, and there are ten different replacements, we store the individual strings in an array beforehand. A loop runs through the array, and in the body there are repeated calls to replaceAll(…​), which replaces all partial strings from the search with the loop index, so that, for example, 000 becomes "" + 0, i.e. "0". At the end we convert the number to an integer and return the result.

If the incoming string is null, empty or contains foreign characters, there will be exceptions in the following. This behavior is fine.

1.5.6. Quiet please! Defuse shouting texts

com/tutego/exercise/string/DoNotShout.java
public static String silentShoutingWords( String string ) {
  return Pattern.compile( "\\p{javaUpperCase}{3,}" )
                .matcher( string )
                .replaceAll( matchResult -> matchResult.group().toLowerCase() );
}

The proposed solution proceeds in three steps: Building the pattern object, matching the string, replacing the match group with lowercase strings. The regex string must describe a sequence of uppercase letters. Upper case letters, over the entire Unicode alphabet, determines \p{javaUpperCase}. We want to have at least three uppercase letters in a row, which {3,} takes care of. Whether or not to keep the pattern precompiled in a static variable depends entirely on how often the method is called. In general, it is a good strategy to keep the Pattern objects in memory, because translation always costs a little execution time. On the other hand, you reference an object, and if you rarely need it, you don`t have to.

compile(…​) gives us a Pattern object and the matcher(…​) method gives us a Matcher object. We can well use the replaceAll(Function<MatchResult, String>) method, which can perform a transformation of the found strings. The argument passed is a function that maps a MatchResult to a string. Our function accesses the group, converts the string to lowercase and returns it. replaceAll(…​) finds all places with the selected uppercase letters and calls this function multiple times.

Java 8 Backport

The replaceAll(Function) method only exists since Java 9. Users of Java 8 use the loop and the pair appendReplacement(…​) and appendTail(…​).

1.5.7. Convert time with AM and PM to 24-hour count

com/tutego/exercise/string/AmPmToMilitaryTime.java
private static final Pattern AM_PM_PATTERN =
    Pattern.compile( "(?<hours>\\d\\d?)" +        // hours
                         "(?::(?<minutes>\\d\\d))?" + // minutes
                         "\\s?" +                     // optional whitespace
                         "(?<ampm>[ap])[.]?m[.]?",    // AM/PM
                     Pattern.CASE_INSENSITIVE );

public static String convertToMilitaryTime( String string ) {
  Matcher matcher = AM_PM_PATTERN.matcher( string );
  return matcher.replaceAll( matchResult -> {
    int hours   = Integer.parseInt( matcher.group( "hours" ) );
    int minutes = matcher.group( "minutes" ) == null ?
                  0 :
                  Integer.parseInt( matcher.group( "minutes" ) );
    boolean isTimeInPm = "pP".contains( matcher.group( "ampm" ) );
    if ( isTimeInPm && hours < 12 ) hours += 12;
    else if ( !isTimeInPm && hours == 12 ) hours -= 12;
    return String.format("%02d%02d", hours, minutes);
  } );
}

At the heart of the program is a regular expression that captures times. This regular expression consists of four parts, which are also split into four lines in the code. The first part captures the hours, the part captures the minutes, the third part is an optional white space between the time and the last part, AM/PM.

  1. The hours consist of at least one integer, the second integer being optional if, for example, 1 AM is written. In principle, we could write the expression a bit more precisely so that something like 99 AM is not recognized, but we do not make that check here. The hours themselves are in round brackets, a group named ?<hours>. All regular expression groups can be named so that we can access them more easily later by that name and not have to use a group index.

  2. The minutes part is optional, that is, it is enclosed in round brackets altogether, and there is a question mark at the end. The inside starts with a ?:, which is a small regular expression optimization so that this group is not accessible via the API later. If hours and minutes are specified at the same time, they are separated by a colon. The minutes themselves are also a named group, and they consist of two decimal numbers. Again, we do not make any checks about possible ranges of validity.

  3. The third part is a white space, which is optional.

  4. The last part must capture different notations of AM/PM. A dot could be placed between the two symbols, perhaps even mistakenly just after a letter, so say A.M or AM. So that we don’t have to specify case, we include a special pattern flag that checks case independently, so it doesn’t matter whether we write AM, am, Am, or pM.

The convertToMilitaryTime(String) method gets the string with the time information as a parameter. The Pattern object was stored as a constant, and the matcher(…​) method connects the Pattern with the string method parameter. The result is a Matcher. This type can do all the work with replaceAll(Function<MatchResult, String> replacer): the method runs over all the matches for us, calls our Function, and we can access the match from the MatchResult and replace it with a string.

Regarding the Function: First we access the group for the hours and convert it to an integer. Converting an integer will not throw an exception, because our regular expression ensures that only digits occur. The peculiarity with minutes is that they can be missing. So we have to go back to the group minute and ask if it exists at all. If it does not exist, we assign the variable minutes with 0, otherwise we convert the string with the minutes into an integer and set the variable minutes with it.

Now we evaluate the group ampm and declare a variable isTimeInPm which becomes true if the time is given in PM. For AM the variable remains false. This variable helps with the conversion. If isTimeInPm is true, then the time is "post meridiem", i.e. after noon, and 12 hours must be added. It may happen that the text mistakenly enters 23 PM, for example; in this case we want to correct the error and do not add 12. Moreover, if the time of the hours is equal to 12 o`clock, we also do not make a correction. The next check is specifically for 12:xx, which becomes 00xx clock. So if isTimeInPm equals false, then it is the time "ante meridiem", that is the morning, and we subtract 12.

After the two variables hours and minutes are set, we generate a string with the hours and minutes and use String.format(…​) so that we get the times with only one digit padded with a 0. This is the return of the Function.

Java 8 Backport

The String replaceAll(Function<MatchResult, String> replacer) method has been in Java 9. In Java 8, we must fall back to the pair appendReplacement(…​) and appendTail(…​).

1.5.8. Split address lines with the StringTokenizer

com/tutego/exercise/string/SplitAddressLines.java
// String address = "Boots and Bootles\n21 Pickle Street\r\n424242 Douglas\rArendelle";
String address="Doofenshmirtz Evil Inc.\nStrudelkuschel 4427\nGimmelshtump";
StringTokenizer lines = new StringTokenizer( address, "\n\r" );
String name = lines.nextToken();
String street = lines.nextToken();
String city = lines.nextToken();
final String DEFAULT_COUNTRY = "Drusselstein";
String country = lines.hasMoreTokens() ? lines.nextToken() : DEFAULT_COUNTRY;

System.out.println( name + ";" + street + ";" + city + ";" + country );

In the center of the job is the class StringTokenizer, which is useful whenever .

  1. not (all) tokens are to be extracted in advance like with split(…​), but step by step, and

  2. the separators consist of single characters but no strings.

In the same way we build a StringTokenizer instance, and don`t forget that the two characters mentioned are not a separator together, but two separate characters, which in any combination are a separator of the tokens.

With the method nextToken() we ask for the lines three times, and since we don`t know if there is a fourth line, we look ahead with hasMoreTokens(). If there is a token, we consume it, otherwise we choose the desired default country.

Since StringTokenizer is an Enumeration<Object>, in principle we could have used hasMoreElements() and nextElement(), but the latter method has the awkward Object return type.

1.5.9. Split sentences into words and reverse them

Let’s start with a small helper method that prints words reversed:

com/tutego/exercise/string/PrintReverseWords.java
private static void printWordReversed( String word ) {
  System.out.print( new StringBuilder( word ).reverse() );
  System.out.print( ' ' );
}

The reversing of the string is done by the reverse(…​)- method of the StringBuilder. We then output the result to the screen separated by a space at the end.

Now we have to split the sentence and recognize the words.

com/tutego/exercise/string/PrintReverseWords.java
String string = "erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!";

String pattern = "[\\p{Punct}\\s]+";

new Scanner( string )
    .useDelimiter( pattern )
    .forEachRemaining( PrintReverseWords::printWordReversed );

System.out.println();

for ( String word : string.split( pattern ) )
  printWordReversed( word );

At the center is the regular expression [\p{punct}\s]+. It captures a sequence of punctuation marks, parentheses, etc., and white space separating words. We make use of predefined character classes. It uses \p{punct} for a character out of

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

and \s for the white space; it is the character class [ \t\n\x0B\f\r]. For a connection, we put both into a new character class, and since characters can occur any number of times in a row, we add a plus.

Java supports splitting with regular expressions via the Pattern-Matcher combination, and two well-known facades are the Scanner and the split(…​) method. Both variants are shown in the proposed solution. The Scanner is always good if the number of matches can be large, because with the split(…​) method the answer is always an array with all words. The Scanner implements Iterable and at forEachRemaining(…​) we put a method reference to the helper method printWordReversed(…​) for the Consumer, which writes each word reversed on the screen.

1.5.10. Check relations between numbers

com/tutego/exercise/string/RelationChecker.java
private static boolean isValidRelation( int number1, String operator, int number2 ) {
  return switch ( operator ) {
    case ">" -> number1 > number2;
    case "<" -> number1 < number2;
    case "=" -> number1 == number2;
    default  -> false;
  };
}

public static boolean isValidRelation( String string ) {
  Scanner scanner = new Scanner( string );
  int number1 = scanner.nextInt();

  while ( scanner.hasNext() ) {
    String operator = scanner.next();
    int number2 = scanner.nextInt();
    if ( isValidRelation( number1, operator, number2 ) )
      number1 = number2;
    else
      return false;
  }

  return true;
}

The actual method checkRelation(String) gets the string and checks the relations. However, the method falls back on its own private method isValidRelation(int, String, int) so we want to start with that. isValidRelation(…​) gets a number, a comparison operator, and another number. It checks if the comparison operator with the two numbers returns a true result. If so, the answer is true; if the comparison is false, the answer is false just as in the case of a misplaced symbol, because we evaluate only <, > and =.

The method checkRelation(…​) builds a Scanner with the passed String and now uses a combination of nextInt(), hasNext() and next() to process tokens from the Scanner. At the beginning there must always be a number, that means we can initialize a variable number1 with the first number. It could be that the data stream is empty now, but if the Scanner has next symbols, this will be a comparison operator, which we will refer to. After the comparison operator comes a second integer, which we also read in and store in number2. Now we call our own method isValidRelation(…​), and this method returns true if the comparison was fine. Then number2 will become the new number1, so in the next loop pass number2 will be assigned the following number. If isValidRelation(…​) returns false, then we can abort the method, because at this point a comparison is false. If there was no breakout from the loop, all comparisons were correct, and the method ends with return true.

Java 8 backport

The proposed solution uses the switch expressions with arrow notation that was introduced in Java 14. An alternative solution is:

switch ( relationalOperator ) {
  case ">": return number1 > number2;
  case "<": return number1 < number2;
  case "=": return number1 == number2;
  default: return false;
}

1.5.11. Convert A1 notation to columns and rows

com/tutego/exercise/string/A1Notation.java
private static final int NUMBER_OF_LETTERS = 26;

private static int parseColumnIndex( String string ) {
  int result = 0;
  for ( int i = 0; i < string.length(); i++ ) {
    // Map A..Z to 1..26
    int val = Character.getNumericValue( string.charAt( i ) ) - 9;
    result = NUMBER_OF_LETTERS * result + val;
  }
  return result;
}

public static int[] parseA1Notation( String cell ) {
  Matcher matcher = Pattern.compile( "([A-Z]+)(\\d+)" ).matcher( cell );
  if ( ! matcher.find() || matcher.groupCount() != 2 )
    return new int[]{ 0, 0 };
  int column = parseColumnIndex( matcher.group( 1 ) );
  int row    = Integer.parseInt( matcher.group( 2 ) );
  return new int[]{ column, row };
}

Let us consider the cell AA34 mentioned in the task as an example. In the first step, we need to separate the column from the row. The column here would be AA, the row 34. We then need to convert the column AA to the numeric representation, 27. These two steps are handled by two separate methods. The main method parseA1Notation(String) first extracts the row and column, and then calls an internal method parseColumnIndex(String), which converts the column to a numeric value for us after the A1 notation.

Let’s start with the parseColumnIndex(String) method. We’ll take a few examples to make it easier to read the calculation pattern.

Table 2. calculation of A1 notation
symbol sequencecalculationresult

A

1 × 260

1

Z

26 × 260

26

AA

1 × 261 + 1 × 260

27

IV

9 × 261 + 22 × 260

256

AAA

1 × 262 + 1 × 261 + 1 × 260

703

What we can read is the following:

  • Each letter is transferred to a number — A to 1, B to 2 until finally Z to 26.

  • The example is a place value system. It is similar to our decimal system, but here the factor 26 is used as multiplier.

To convert the whole thing now into an algorithm, we use the Horner scheme. Let us illustrate this with an example:

WDE = 23 × 262 + 4 × 261 + 5 × 260 = ((23 × 26) + 4) × 26 + 5

The Horner scheme is important for us because we don’t need to calculate powers anymore. If we go one place further to the right, we multiply the old result by 26 and repeat the scheme for the other places. This is exactly what the method parseColumnIndex(String) does. A loop runs over all characters, extracts them and queries the numeric representation with Character.getNumericValue(char). This is defined not only for digits, but also for letters. For the letter 'a' the result is 10, the same as for 'A'. For 'Z' it is 35. If we subtract 9 from this, we get the range of values 1 to 26. We take the old result, multiply it by 26 and add the numeric representation of the letter. The next step is to calculate the new numeric value of the next character, multiply the last result by 26 and again add the value of the last letter. This performs the calculation exactly as we planned it.

The method parseA1Notation(String) has little work to do. First, we compile a Pattern that extracts the column and row — since the column is all letters and the row is all digits, we can easily capture that via the groups in the regular expression. If the string is wrong, and if we don’t have two matches, we return an array of {0, 0}, signaling incorrect input. If there are two match groups, we take the column information from the first one and convert it to an integer using our own parseColumnIndex(String) method. The second string, according to the regular expression, is a valid string of digits, which Integer.parseInt(String) converts to an int. The numeric column and row go into a small array, and that goes back to the caller.

1.5.12. Parse simple CSV files with coordinates

com/tutego/exercise/string/GenerateSvgFromCsvCoordinates.java
String filename = "coordinates.csv";
try ( InputStream is =
          GenerateSvgFromCsvCoordinates.class.getResourceAsStream( filename );
      Scanner scanner = new Scanner( is, StandardCharsets.ISO_8859_1 ) ) {
  scanner.useDelimiter( ",|\\s+" ).useLocale( Locale.ENGLISH );

  StringBuilder svg = new StringBuilder( 1024 );
  svg.append( "<svg height=\"210\" width=\"500\">\n<polygon points=\"" );

  while ( scanner.hasNextDouble() ) {
    double x = scanner.nextDouble();

    if ( ! scanner.hasNextDouble() )
      throw new IllegalStateException( "Missing second coordinate" );

    double y = scanner.nextDouble();
    svg.append( x ).append( "," ).append( y ).append( " " );
  }

  svg.append( "\" style=\"fill:lime;stroke:purple;stroke-width:1\" />\n</svg>" );
  System.out.println( svg );
}

The file consists of sequences of integers separated by either a semicolon or a newline, generally speaking by arbitrary white space. We need to find a tokenizer that we can feed with a regular expression that stands for just these separators. Since we want to read from a file and process that with regular expressions, the class Scanner is a good choice. This is also how the proposed solution does it.

The Scanner is connected to an input stream, the file we want to read. The character encoding is explicitly set to UTF-8.

Then the Scanner is initialized with a regular expression for the delimiters. These delimiters are set by the Scanner method useDelimiter(…​). It is important to set the Locale.ENGLISH, because by default the Scanner is preconfigured with the Locale by the operating system, and if that is for example German, the Scanner expects floating point numbers with a comma as separator. But the source always has English formatted numbers.

After the Scanner is prepared, the program can produce the output. It starts with the SVG container and the polygon start tag. The Scanner method hasNext() helps to iterate through the file. When the hasNext() method returns a token, we always expect pairs. We can read the first integer, and now there must be a second integer. But if the Scanner cannot give a new token, this is an error and we raise an exception. If the second number exists, it will also be read in. The pair can then be placed in the SVG container.

At the end of the loop, we close the polygon tag and print the SVG element to the screen. For the output, we don’t need to pay attention to the language for append(double), because the formatting of the double is automatically in English.

Java 8 Backport

The constructor Scanner(InputStream source, String charsetName) has been around since the introduction of the class Scanner, the constructor Scanner(InputStream source, Charset charset) since Java 10. So for Java 8 we have to pass a string instead of a Charset object; StandardCharsets.ISO_8859_1.name() is a workable solution.

1.5.13. Compress strings lossless by runlength encoding

com/tutego/exercise/string/SimpleStringCompressor.java
public static String compress( String string ) {

  if ( string.isEmpty() )
    return "";

  char lastChar = string.charAt( 0 );
  int  count = 1;
  StringBuilder result = new StringBuilder( string.length() );

  for ( int i = 1; i < string.length(); i++ ) {
    char currentChar = string.charAt( i );
    if ( currentChar == lastChar )
      count++;
    else {
      result.append( lastChar );
      if ( count > 1 )
        result.append( count );
      count = 1;
      lastChar = currentChar;
    }
  }

  result.append( lastChar );
  if ( count > 1 )
    result.append( count );

  return result.toString();
}

private static CharSequence decodeToken( String token ) {

  if ( token.isEmpty() )
    return "";

  if ( token.length() == 1 )
    return token;

  int length = Integer.parseInt( token.substring( 1 ) );
  return String.valueOf( token.charAt( 0 ) ).repeat( Math.max( 0, length ) );
}

public static String decompress( String string ) {
  StringBuilder result = new StringBuilder( string.length() * 2 );
  Matcher pattern = Pattern.compile( "[.-](\\d*)" ).matcher( string );

  while ( pattern.find() )
    result.append( decodeToken( pattern.group() ) );

  return result.toString();
}

In the first step we want to compress the string. To do this, we first query whether the string contains any text at all; if not, we are quickly done with the task.

So that we can see how many of the same characters occur in the text in a row, we use a variable lastChar for the last character seen, so that we can compare a new character with the last character. Additionally we note the number of same characters in count. Since this result is freshly built, we add a variable result of the datatype StringBuilder.

The for loop goes over each character and stores it in the auxiliary variable currentChar. Now two things can happen: The character currentChar just read can be the same as the previous character in lastChar, or it can be a different character. We have to handle this difference.

The first case distinction checks if the character we saw last is the same as the currently read character. If so, we don’t have to do anything but increment the counter. Then it goes back into the loop. If the new character read is different from the last character, local compression comes to an end. Basically, we write the character to the buffer first. Next, we have to take care of the count. Exactly when we counted more than one character before, we write the count into the data stream, unless we had only exactly one character, then according to the task not 1 should come into the buffer, but nothing at all. After completing the pair — character and counter — the counter must be reset to 1. We are almost done with the loop. The moment a new character is found, we set lastChar to exactly the current character currentChar.

When we are done with the loop, we still have a character in lastChar and count. We therefore perform exactly the same query as before in the case distinction and append the counter to the string if the counter is genuinely greater than 1.

In the method for unpacking, we fall back on the pattern that results from the compression. There are always pairs of a string and a number, with the special case that the pair is single and the number is missing. Using a regular expression, we run the entire string and look at all the pairs. To keep the method from getting too big, we use a helper method called decodeToken(String) that takes a pair and expands it.

First the method must find out whether the token consists of only one character or of several characters. If the string consists of only one character, then it must have been our symbol and it comes into the output. If the string is longer than 1, then there is a length encoding from the second position upwards.With substring(1) we extract the string and convert it to an integer, so that the repeat(int) method of String can generate us exactly this number of characters token.charAt(0). With substring(1) we extract the string and convert it to an integer, so that the repeat(int) method of String can generate us exactly this number of characters token.charAt(0).

Java 8 Backport

The repeat(int) method exists since Java 11. In older Java versions, the solution can be rewritten like this:

StringBuilder result = new StringBuilder( length );
for ( int i = 0; i < length; i++ ) result.append( token.charAt( 0 ) );
return result;

1.5.14. Quiz: Encoding for Unicode characters

When we write a string in source code, we don’t worry much about encoding because we can write a large number of characters directly in quotes. But when these strings are written to a file, for example, the encoding is relevant because other parties will naturally want to read that file again.

Unicode characters require 4 bytes in their full extent. However, this is a waste if the character can only be encoded in one byte, such as a simple ASCII character. Therefore, one encodes Unicode characters to get the shortest possible representation. Common encodings are UTF-8 and UTF-16, they can represent all Unicode characters. This is not true for Latin-1, for example, because Latin-1 is only 8 bytes, and hundreds of thousands of characters cannot be represented in Latin-1. UTF-8 encoding will use one byte if possible, then use two bytes if possible, otherwise use three or four.

Table 3. bit assignments of UTF8 encoding
number of bytesbitsfirst codepointlast codepointbyte 1byte 2byte 3byte 4

1

7

U+0000

U+007F

0xxxxxxx

2

11

U+0080

U+07FF

110xxxxx

10xxxxxx

3

16

U+0800

U+FFFF

1110xxxx

10xxxxxx

10xxxxxx

5

21

U+10000

U+10FFFF

11110xxx

10xxxxxx

10xxxxxx

10xxxxxx

1.5.15. Quiz: Order of strings with and without collator

The outputs are:

Comparator<String> comparator = Comparator.naturalOrder();
System.out.println( comparator.compare( "a", "ä" ) ); // < 0
System.out.println( comparator.compare( "ä", "z" ) ); // > 0

Comparator<Object> collator = Collator.getInstance( Locale.GERMAN );
System.out.println( collator.compare( "a", "ä" ) ); // < 0
System.out.println( collator.compare( "ä", "z" ) ); // < 0

The natural ordering of strings is lexicographic, meaning that the position of the character in the Unicode alphabet counts. Let’s illustrate this with some characters:

Table 4. Some characters with their positions in the Unicode Standard

character

0

A

a

Ä

ß

ä

code point (decimal)

48

65

97

196

223

228

From the table you can see that the digits are first, followed by the uppercase letters and then the lowercase letters. The umlauts are far behind the capital letters and not sorted between the upper and lower case letters.

Umlauts are regular letters in German, which do not come after "Z" in the order. The standard DIN 5007-1 describes under "Ordering of character sequences (ABC rules)" two sorting procedures: (ABC rules

DIN 5007, variant 1 (used for words, for example in dictionaries)
  • "ä" and "a" are the same.

  • "ö" and "o" are the same.

  • "ü" and "u" are the same.

  • "ß" and "ss" are the same,

DIN 5007, variant 2 (special sorting for lists of names, such as in telephone directories)
  • "ä" and "ae" are the same.

  • "ö" and "oe" are the same.

  • "ü" and "ue" are the same.

  • "ß" and "ss" are the same.

There are similar rules for other languages. The lexicographical order does not support this. Therefore the Unicode Collation Algorithmm describes how the sorting looks like in different national languages. For an introduction see also https://en.wikipedia.org/wiki/Unicode_collation_algorithm.

In Java Collator objects are available. They are initialized with a Locale object. In addition to the language, so-called levels can be passed, the Javadoc gives further examples under the heading "Collator strength value".