1. Advanced String Processing

After dealing with basic data types around characters and strings in another chapter, we will now discuss advanced string processing. The topics are formatted output, regular expressions, string splitting.

Prerequisites

  • be able to format strings

  • be able to match, search, replace with regular expressions

  • be able to split strings

  • understand character encodings and UTF-8

Data types used in this chapter:

1.1. Format strings

To format strings, numbers, and temporal data as a string, there are different ways to do it in Java. In the package java.text you can find the classes MessageFormat, DateFormat and DecimalFormat as well as the class Formatter and in String the method String.format(…​). The next tasks can be solved using the formatting strings from Formatter in a rather easy way.

1.1.1. Build ASCII table ⭐

Bonny Brain has installed a new app on her Aye Phone that shows her the ASCII alphabet:

$ ascii
Usage: ascii [-adxohv] [-t] [char-alias...]
   -t = one-line output  -a = vertical format
   -d = Decimal table  -o = octal table  -x = hex table  -b binary table
   -h = This help screen -v = version information
Prints all aliases of an ASCII character. Args may be chars, C \-escapes,
English names, ^-escapes, ASCII mnemonics, or numerics in decimal/octal/hex.

Dec Hex    Dec Hex    Dec Hex  Dec Hex  Dec Hex  Dec Hex   Dec Hex   Dec Hex
  0 00 NUL  16 10 DLE  32 20    48 30 0  64 40 @  80 50 P   96 60 `  112 70 p
  1 01 SOH  17 11 DC1  33 21 !  49 31 1  65 41 A  81 51 Q   97 61 a  113 71 q
  2 02 STX  18 12 DC2  34 22 "  50 32 2  66 42 B  82 52 R   98 62 b  114 72 r
  3 03 ETX  19 13 DC3  35 23 #  51 33 3  67 43 C  83 53 S   99 63 c  115 73 s
  4 04 EOT  20 14 DC4  36 24 $  52 34 4  68 44 D  84 54 T  100 64 d  116 74 t
  5 05 ENQ  21 15 NAK  37 25 %  53 35 5  69 45 E  85 55 U  101 65 e  117 75 u
  6 06 ACK  22 16 SYN  38 26 &  54 36 6  70 46 F  86 56 V  102 66 f  118 76 v
  7 07 BEL  23 17 ETB  39 27 '  55 37 7  71 47 G  87 57 W  103 67 g  119 77 w
  8 08 BS   24 18 CAN  40 28 (  56 38 8  72 48 H  88 58 X  104 68 h  120 78 x
  9 09 HT   25 19 EM   41 29 )  57 39 9  73 49 I  89 59 Y  105 69 i  121 79 y
 10 0A LF   26 1A SUB  42 2A *  58 3A :  74 4A J  90 5A Z  106 6A j  122 7A z
 11 0B VT   27 1B ESC  43 2B +  59 3B ;  75 4B K  91 5B [  107 6B k  123 7B {
 12 0C FF   28 1C FS   44 2C ,  60 3C <  76 4C L  92 5C \  108 6C l  124 7C |
 13 0D CR   29 1D GS   45 2D -  61 3D =  77 4D M  93 5D ]  109 6D m  125 7D }
 14 0E SO   30 1E RS   46 2E .  62 3E >  78 4E N  94 5E ^  110 6E n  126 7E ~
 15 0F SI   31 1F US   47 2F /  63 3F ?  79 4F O  95 5F _  111 6F o  127 7F DEL

However, their Aye Phone doesn’t have such a widescreen, and the first two blocks are not visible characters anyway.

Build ASCII table

Task:

  • Write a program that prints all ASCII characters from position 32 to 127 in the same formatting as the Unix program ascii does.

  • At position 127, write DEL.

1.1.2. Aligned outputs ⭐

Captain CiaoCiao needs a table of the following type for a listing:

Dory Dab     paid
Bob Banjo    paid
Cod Buri     paid
Bugsy        not paid

Task:

  • Write a method printList(String[] names, boolean[] paid) that prints a collection on the screen. The first string array contains all names, the second array contains information whether the person has paid or not.

  • All names can be of different lengths, but the texts in the second column should be aligned.

  • The longest string in the first column has a distance of four spaces to the second column.

  • If passed arrays are null, a NullPointerException must be thrown.

1.2. Regular expressions and pattern recognition

Regular expressions are a curse and a blessing for many. Used incorrectly, they lead to programs that are impossible to read later, used correctly, they shorten the program massively and contribute to clarity.

The next exercises will show that we can use regular expressions to test whether

  1. a string completely matches a regular expression,

  2. a partial string exists and if so, find out where, and then replace it.

Later, we will also use regular expressions to specify separators and decompose strings.

1.2.1. Quiz: Define regex ⭐

What does the regex look like to match or find the following?

  • a string of exactly 10 digits

  • a string of 5 to 10 digits and letters

  • a string ending in ., ! or ? like a sentence

  • a non-empty string that does not contain digits

  • an official title or a name title: Prof., Dr., Dr. med., Dr. h.c. in the string

1.2.2. Determine popularity in social media ⭐

Of course Captain CiaoCiao is active on social media, his identifier is #CaptainCiaoCiao or @CaptainCiaoCiao.

Now Captain CiaoCiao wants to know how popular he is.

Determine popularity in social media

Task:

  • Given an aggregated text with messages; how often does #CaptainCiaoCiao or @CaptainCiaoCiao occur there?

Example:

  • For the following input, the result is 2.

    Make me a baby #CaptainCiaoCiao
    Hey @CaptainCiaoCiao, where is the recruitment test?
    What is a hacker’s favorite pop group? The Black IP’s.

1.2.3. Detect scanned values ⭐

Bonny Brain receives scanned lists of numbers that need to be processed electronically. In the first step, it sends the scans through an OCR recognizer, and the result is ASCII text. The numbers from the OCR recognition always look like this:

Representation of the numbers 0 to 9
 000   11   22  333  4  4 5555   6   77777  888   9999
0  00 111  2  2    3 4  4 5     6       7  8   8 9   9
0 0 0  11    2   33  4444 555  6666    7    888   9999
00  0  11   2      3    4    5 6   6   7   8   8    9
 000  11l1 2222 333     4 555   666    7    888    9

Task:

  • Given is a line from the scan with numbers from the format shown. Convert the numbers to an integer.

  • There could be missing spaces after the last digit, and there could be several spaces between the large characters.

Example:

  • If the string (written in the text block syntax) is

    String ocr = """
        4  4 77777  11   11    4  4  22
        4  4    7  111  111    4  4 2  2
        4444   7    11   11    4444   2
           4   7    11   11       4  2
           4   7   11l1 11l1      4 2222""";

    so the desired result should be 471142.

If you want to play around with the strings, you can find a way at https://patorjk.com/software/taag/#p=display&f=Alphabet&t=0123456789.

1.2.4. Quiet please! Defuse shouting texts ⭐

Captain CiaoCiao often gets letters, and the senders often SCREAM in capital letters — how nasty for his eyes.

Quiet please

Task:

  • Write a method String silentShoutingWords(String) that converts all capitalized words over three letters in length to lowercase.

Example:

  • silentShoutingWords("AY Captain! Smutje MUST GO!")"AY Captain! Smutje must GO!"

1.2.5. Converting time from AM/PM format to 24-hour format ⭐⭐

Bonny Brain frequently gets messages where the time is given in AM and PM.

We raid the harbor at 11:00 PM and meet on the amusement mile at 1:30 AM.

Bonny Brain doesn’t like that, she wants only the 24-hour count of Military Time.

Convert time with AM PM to 24 hour

Task:

  • Write a converter that converts strings with AM/PM (case-insensitive, even with periods) to Military Time. As a reminder, 12:00 AM is 00:00, and 12:00 PM is 12:00.

Examples:

  • "Harbour: 11:00 PM, entertainment districts: 1:30 a.m.!"Harbour: 2300, entertainment districts: 0130!"

  • "Get out of bed: 12:00AM, bake a cake: 12 PM.""Get out of bed: 0000, bake a cake: 1200"

1.3. Decompose strings into tokens

Tokenizing is the opposite of constructing and formatting a string. A string is split into substrings, with separators determining the separation points. Java provides various classes for tokenizing strings and input. The separators can be symbols or strings described by regular expressions. The split(…​) method of the String class works with regular expressions just like the Scanner, the StringTokenizer class does not.

1.3.1. Split address lines with the StringTokenizer ⭐

The software of Captain CiaoCiao has to evaluate addresses consisting of three or four lines.

The meaning of the lines

linecontentoptional

1

name

no

2

street

no

3

city

no

4

country

yes

The lines are separated with a line break. There are four valid separator symbols or sequences:

Table 1. line break
character (abbreviation)decimalhexadecimalescape sequence

LF

10

0A

\n

CR

13

0D

\r

CR LF

13 10

0D 0A

\r\n

LF CR

10 13

0A 0D

\n\r

LF is the abbreviation for "line feed" and CR for "carriage return"; in old teleprinters, CR moved the carriage to the first column, and LF pushed the paper up.

Traditionally, DOS and Microsoft Windows use the combination \r\n, while Unix systems use \n.

Task:

  • Break a newline-separated string into four lines, and assign the lines to the variables name, street, city, country.

  • If a fourth line with the country name is not given, let country be "Drusselstein".

  • Reassemble the line as a CSV line separated by semicolons.

Examples:

  • "Boots and Bootles\n21 Pickle Street\n424242 Douglas\nArendelle"Boots and Bootles;21 Pickle street;424242 Douglas;Arendelle

  • "Doofenshmirtz Evil Inc.\nStrudelkuschel 4427\nDanville"Doofenshmirtz Evil Inc.;Strudelkuschel 4427;Gimmelshtump;Drusselstein

1.3.2. Split sentences into words and reverse them ⭐

Bonny Brain is waiting for a message, but something went wrong with the transmission — all the words are reversed!

erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!

Task:

  1. Break the string into words. Separators of words are spaces and punctuation marks.

  2. Turn over all the words one by one.

  3. Output the words one after the other, separated by a space. The punctuation marks and other separators are not critical.

Example:

  • "erehW did eht etarip esahcrup sih kooh? tA eht dnah-dnoces pohs!""Where did the pirate purchase his hook At the hand second shop"

1.3.3. Check relations between numbers ⭐

Captain CiaoCiao practices archery, and he records the scores from 0 to 10 in a list. He also notes if he got better, worse, or if the score stays the same. It may look like this:

1 < 2 > 1 < 10 = 10 > 2

Goldy Goldfish has the task of checking the relation signs <, > and =.

Task:

  • Write a program that gets a string like the one in the example and returns true if all relation signs are correct, and false otherwise.

Examples:

  • 1 < 2 > 1 < 10 = 10 > 2true

  • 1 < 1false

  • 1 <false

  • 1true

1.3.4. Convert A1 notation to columns and rows ⭐⭐

Captain CiaoCiao keeps records of his loot and uses spreadsheets. With his staff, he discusses the figures, and to address the cells he uses the column and row index; for example, he says 4-16 to refer to the 4th column and the 16th row. Now he has heard of a whole new way to name cells, A1 notation, which uses a new kind of software called ECKSEL. This involves coding the column with letters from A to Z, according to the following scheme:

A, B, …, Z, AA, …, AZ, BA, …, ZZ, AAA, AAB, …

The lines are still described with numbers. Thus, A2 stands for cell 1-2.

Since Captain CiaoCiao has its difficulties with A1 notation, the specification is to be converted back to numeric columns and rows.

Task:

  • Write a method parseA1Notation(String) that gets a string in A1 notation and returns an array with two elements, in which at position 0 is the column and at position 1 is the row.

Examples:

  • parseA1Notation( "A1" )[1, 1]

  • parseA1Notation( "Z2" )[26, 2]

  • parseA1Notation( "AA34" )[27, 34]

  • parseA1Notation( "BZ" )[0, 0]

  • parseA1Notation( "34" )[0, 0]

  • parseA1Notation( " " )[0, 0]

  • parseA1Notation( "" )[0, 0]

1.3.5. Parse simple CSV files with coordinates ⭐

Bonny Brain notes the locations with loot in a CSV file coordinates.csv, where the coordinates are floating-point numbers separated by commas.

For example, the file looks like this:

com/tutego/exercise/string/coordinates.csv
20.091612,-155.676695
23.087301,-73.643472
21.305452,-71.690421

Task:

  • Create a CSV file manually. It should contain several lines with coordinates; a comma separates the coordinates.

  • A Java program should read the CSV file and output an HTML file with SVG for the polygon course on the screen.

  • Use the class Scanner to parse the file. Make sure to initialize the Scanner with useLocale(Locale.ENGLISH) if your locale is not English.

Example: For the upper block, we want to create

<svg height="210" width="500">
<polygon points="20.091612,-155.676695 23.087301,-73.643472 21.305452,-71.690421 " style="fill:lime;stroke:purple;stroke-width:1" />
</svg>

1.3.6. Compress strings lossless by runlength encoding ⭐⭐⭐

To reduce the volume of data, files are often compressed. There are different compression algorithms; some are lossy like vowel removal, others work without loss like ZIP. Lossy compression is found in images, JPEG is a good example. Depending on the degree of compression, the image quality degrades. In JPEG, very high compression results in an image with strong artifacts.

A simple lossless compression is run-length encoding. The idea is to combine a sequence of identical symbols so that only the number and the symbol are written. The graphic format GIF, for example, uses this form of compression. Therefore, images with many monochrome lines are also smaller than, for example, images in which each pixel has a different color.

The next task is about run-length encoding. Suppose a string consists of a sequence of . (dot) and - (minus sign), such as:

--....--------..-

To shorten the length of strings, we can first write the symbol followed by the number of symbols. The string with 17 characters could be shortened to the following string with 9 characters:

-2.4-8.2-

Task:

  1. Create a new class SimpleStringCompressor.

  2. Write a static method String compress(String) that encodes sequences of . and - according to the described algorithm: First comes the character, then the number.

  3. Write a decoder String decompress(String) that unpacks the compressed string. Let decompress(compress(input)) be equal to input.

Extensions:

  • The program shall be able to handle all non-digits.

  • Refine the program so that the number is omitted if the character occurs only exactly once.