Java streams 11. Create from String using chars(), codePoints(), and lines()

Nick Samoylov
Feb 11, 2020

Class String has the following methods that create streams:
IntStream chars()
IntStream codePoints()
Stream<String> lines()

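The chars() method creates a stream of the char values (UTF-16 code units) that compose the string, while codePoints() creates a stream of its Unicode code points. For characters in the Basic Multilingual Plane the two are the same. The lines() method (added in Java 11) creates a stream of lines extracted from the string, separated by line terminators.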

IntStream intStream = "someString".chars();

The created IntStream emits the char values of the string, widened to int. For example:

IntStream intStream = "abc".chars();  
intStream.forEach(System.out::println); //prints: 97 98 99

To demonstrate that the emitted values are the expected ‘a’, ‘b’, and ‘c’, we can cast the emitted values back to char as follows:

IntStream intStream = "abc".chars();
intStream.mapToObj(c -> (char)c)
.forEach(System.out::println); //prints: a b c
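The same cast can also be used to gather the emitted values back into a String. The following is only a minimal sketch; the class name CharsToStringDemo and the method upperCased are made up for illustration, and it assumes the input contains no supplementary characters:

```java
class CharsToStringDemo {

    // Hypothetical helper: upper-cases a string by streaming its char values.
    // Assumes the input has no surrogate pairs (see codePoints() for those).
    static String upperCased(String s) {
        return s.chars()
                .map(c -> Character.toUpperCase(c))      // int in, int out
                .collect(StringBuilder::new,
                         (sb, c) -> sb.append((char) c), // cast each int back to char
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCased("abc")); //prints: ABC
    }
}
```

The three-argument collect() of IntStream is used here because IntStream has no collect(Collector) overload.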

Processing characters as int values lets you avoid boxing them into Character objects, which improves performance; the gain can be significant when the number of processed characters is large. For example, the following code selects only the upper-case letters, taking advantage of the fact that the upper-case Latin letters come before the lower-case Latin letters in the character set:

int[] chars = "abcDeF".chars()
.filter(c -> c < 97) // 97 is 'a'; comparing ints avoids boxing into Character
.toArray();

Let's now look at what we have in the chars array:

Arrays.stream(chars)
.forEach(c -> System.out.println(c)); //prints: 68 70
Arrays.stream(chars)
.forEach(c -> System.out.println((char)c)); //prints: D F

This example is very simplistic, but I hope it makes the point: processing characters as integers, without converting them to Character, can be more efficient than processing them as objects or boxing and unboxing them unnecessarily. That does not mean one has to process characters this way all the time. For the majority of mainstream applications, those that do not process a significant number of characters, the performance gain does not justify the loss of clarity: code that processes all characters as integers is less human-readable.

There are also characters that are not appropriate for processing in a stream created by the chars() method. They are called supplementary characters, and their code points are greater than U+FFFF (many emoji, for example). You can find out whether a character is supplementary by comparing its code point with Character.MIN_SUPPLEMENTARY_CODE_POINT (U+10000). For example:

System.out.println('a' >= Character.MIN_SUPPLEMENTARY_CODE_POINT);     //false
System.out.println(0x1F600 >= Character.MIN_SUPPLEMENTARY_CODE_POINT); //true
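The JDK also provides this check as library methods, Character.isSupplementaryCodePoint(int) and Character.charCount(int), which handle the boundary value 0x10000 correctly. A minimal sketch follows; the class name SupplementaryCheckDemo and the wrapper method isSupplementary are made up for illustration:

```java
class SupplementaryCheckDemo {

    // Hypothetical wrapper around the JDK check, kept only for readability.
    static boolean isSupplementary(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isSupplementary('a'));          //prints: false
        System.out.println(isSupplementary(0xFFFF));       //prints: false
        System.out.println(isSupplementary(0x10000));      //prints: true
        System.out.println(isSupplementary(0x1F600));      //prints: true

        // charCount() tells how many char values are needed for a code point.
        System.out.println(Character.charCount('a'));      //prints: 1
        System.out.println(Character.charCount(0x1F600));  //prints: 2
    }
}
```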

The character 0x1F600 is an example of an emoji called “grinning face”:

String supplString = new String(Character.toChars(0x1F600));
System.out.println("\n" + supplString); //prints: 😀

The supplementary characters are not appropriate for processing in a stream created by the chars() method because each of them is represented as a pair of char values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). The chars() method emits them as two separate values:

String supplString = new String( Character.toChars(0x1F600) );
supplString.chars()
.forEach(System.out::println); //prints: 55357 56832
supplString.chars()
.forEach(c -> System.out.println((char)c)); //prints: ? ?
supplString.chars()
.forEach(c -> System.out.println(Character.toChars(c)));
//prints: ? ?

As you can see, the information about the character as a whole is lost. If your application has to process supplementary characters too, use the codePoints() method instead.

IntStream intStream = "someString".codePoints();

When produced by the codePoints() method, supplementary characters represented by Unicode surrogate pairs are merged into a single code point:

supplString.codePoints()
.forEach(System.out::println); //prints: 128512
supplString.codePoints()
.forEach(c -> System.out.println((char)c)); //wrong character: the cast truncates 128512 (0x1F600) to the char U+F600
supplString.codePoints()
.forEach(c -> System.out.println(Character.toChars(c)));
//prints: 😀

That is the only reason you would want to use the codePoints() method — to represent a supplementary character as a single code point.
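The difference shows up as soon as you count or filter characters. The following is a minimal sketch under the assumption that a "character" means one Unicode code point; the class name CodePointsDemo and both helper methods are made up for illustration:

```java
class CodePointsDemo {

    // Hypothetical helper: number of code points, counting a surrogate pair once.
    static long codePointLength(String s) {
        return s.codePoints().count();
    }

    // Hypothetical helper: keeps only the supplementary characters of the input.
    static String supplementaryOnly(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint, // writes one or two chars as needed
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(s.length());                     //prints: 4
        System.out.println(s.chars().count());              //prints: 4
        System.out.println(codePointLength(s));             //prints: 3
        System.out.println(supplementaryOnly(s).length());  //prints: 2
    }
}
```

StringBuilder.appendCodePoint() is used instead of a cast to char, so a supplementary code point is written back as a complete surrogate pair.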

Stream<String> lines = "someStringWithLineTerminators".lines();

A line terminator is one of the following:
— a line feed character “\n” (U+000A),
— a carriage return character “\r” (U+000D), or
— a carriage return followed immediately by a line feed “\r\n” (U+000D U+000A).
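Each of these terminators can be checked on its own with a small sketch. It requires Java 11 or later, and the class name LineTerminatorsDemo and the method split are made up for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

class LineTerminatorsDemo {

    // Hypothetical helper: collects the lines of a string into a list.
    static List<String> split(String s) {
        return s.lines().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("a\nb"));    //prints: [a, b]
        System.out.println(split("a\rb"));    //prints: [a, b]
        System.out.println(split("a\r\nb"));  //prints: [a, b]
        System.out.println(split("a\n\nb"));  //prints: [a, , b]
        System.out.println(split("a\n"));     //prints: [a]
    }
}
```

Note that "\r\n" counts as a single terminator, that two terminators in a row produce an empty line, and that a terminator at the very end of the string does not add an empty line.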
For example:

Stream<String> linesStream = "Once\nupon\ra\r\ntime".lines();
linesStream.forEach(System.out::println);

The output is as follows:

Once
upon
a
time

As you can see, the lines() method has broken the string into lines at each of the line terminators. In the next post, we will talk about creating a numeric stream using the following methods:
IntStream Random.ints() and its overloads
LongStream Random.longs() and its overloads
DoubleStream Random.doubles() and its overloads

See other posts on Java 8 streams.
