It seems like different languages ended up using different encodings to support Unicode. I was wondering if someone here has personal experience on the topic.
The Unicode codespace is divided into seventeen planes, numbered 0 to 16; plane 0 is the Basic Multilingual Plane (BMP).
All code points in the BMP are accessed as a single code unit in UTF-16 encoding and can be encoded in one, two or three bytes in UTF-8. Code points in Planes 1 through 16 (supplementary planes) are accessed as surrogate pairs in UTF-16 and encoded in four bytes in UTF-8.
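As a quick illustration, here is a small Python sketch (standard library only; the three characters are just examples I picked, one ASCII, one BMP, one supplementary-plane) comparing the encoded sizes:

```python
# Compare UTF-8 and UTF-16 sizes for a BMP character vs. a supplementary-plane one.
for ch in ("A", "\u20AC", "\U0001F600"):   # U+0041, U+20AC EURO SIGN, U+1F600 (Plane 1)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")         # little-endian, no BOM
    print(f"U+{ord(ch):04X}: {len(utf8)} UTF-8 bytes, "
          f"{len(utf16) // 2} UTF-16 code unit(s)")

# Expected output:
# U+0041:  1 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+20AC:  3 UTF-8 bytes, 1 UTF-16 code unit(s)
# U+1F600: 4 UTF-8 bytes, 2 UTF-16 code unit(s)   <- surrogate pair
```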
Within each plane, characters are allocated within named blocks of related characters. Although blocks are an arbitrary size, they are always a multiple of 16 code points and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks.
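To make the "spread out over several blocks" point concrete, here is a hedged Python sketch; the `BLOCKS` table is a tiny hand-copied excerpt of the official Blocks.txt data, hardcoded purely for illustration, and `block_of` is a helper I made up for this example:

```python
import unicodedata

# Tiny excerpt of the official Blocks.txt data, just for illustration;
# the full file lists every named block and its 16-aligned range.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
]

def block_of(ch: str) -> str:
    cp = ord(ch)
    for lo, hi, name in BLOCKS:
        if lo <= cp <= hi:
            return name
    return "(not in this excerpt)"

# Three Latin-script letters that live in three different blocks.
for ch in ("e", "é", "ē"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {block_of(ch)}")
```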
Each code point has a single General Category property. The major categories are Letter, Mark, Number, Punctuation, Symbol, Separator and Other, each with further subdivisions. The General Category is not sufficient for every use, since legacy encodings assigned multiple characteristics to a single code point. E.g., U+000A Line feed (LF) in ASCII is both a control character and a formatting separator; in Unicode its General Category is "Other, Control". Often, other properties must be used to specify the characteristics and behaviour of a code point.
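In Python, for example, the standard unicodedata module exposes the General Category directly; a quick sketch with a few arbitrarily chosen characters:

```python
import unicodedata

# The General Category is a coarse two-letter classification; U+000A comes
# back as "Cc" (Other, Control) even though ASCII treats it as a separator.
for ch in ("\n", "A", "é", "3", ",", " ", "\u20AC"):
    print(f"U+{ord(ch):04X}: {unicodedata.category(ch)}")

# Expected: Cc, Lu, Ll, Nd, Po, Zs, Sc
```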
I am not sure that I see what your point is. Normally, in any language you can use any encoding if you choose to. The main point is that you need to know what you want to do, just like in Alice in Wonderland.
I think it's safe to say I have gone through some very personal experiences with character encodings in general, having ported both applications and comprehensive lower-layer system libraries and interface layers (like WiFi) between major OS variants in cross-platform projects, from recent iOS, Linux and Windows back to what is imho the first comprehensive, thoughtful design of a Unicode service in the form of the NSString class.

Adding to what Majzoob stated with technical precision: it is imho much more important in what format your project *stores* normalized data (think serialized, frozen objects). Any sane system designer should imho use plain UTF-8 for this, unless there are very good reasons to do otherwise (like severe speed or space constraints in IoT). Windows libraries like MFC tended to use (if not enforce) 16-bit wide chars, and Unicode was later mapped only half-baked into the system APIs, causing all sorts of confusion, trouble and bugs. So what I am trying to express here is that bad design has nothing to do with the language (C, ObjC, C++, Java, Perl, Python, Ruby and the like) but with their *core library design* as well as their so-called best-practice usage (aka patterns).

Before Apple bought them, NeXT designed libraries around the NSString base class (inheriting from NSObject), which internally can have, and for optimization reasons on platforms that allow it does have, multiple cached storage representations per NSString object as needed, while other (mostly much simpler) implementations use UTF-8 or UTF-16 as normalized storage all the time, enforcing useless conversion steps and other problems (at least in practice). Their designers have typically followed the path of hyper-flexibility, which is usually an excuse for not knowing how to choose (and also enforce!) a library design, proven usable by complex, high-quality best-practice examples, that is the best long-term compromise and relieves the individual coder of all these difficult decisions. The fact is, most library programmers are expert-type coders who never have to *use* their typically quick-shot, half-baked v1.0 (or v0.0.1 in the Linux case ...) library designs. Since string types are the single most required and imho the most important types, only a few languages like ObjC and Wolfram Language (aka Mathematica) ever got their design right on the first attempt.

But of course it does not stop at strings: the TIFF format invented a seemingly clever flag defining the endianness of the following data, and it caused trouble in many ported applications. Why not just define a single, crisp, clean file format and think of ways to optimize (perceived) too-slow implementations when required? It is always the hyper-flexibility trap, which unfortunately is in full effect in the Unicode design as well. And, while this is of course only my subjective personal opinion, outside of the Apple/NeXT ecosystem (and maybe the Wolfram Language implementations) *nobody* today delivers a suitable implementation for handling Unicode throughout, which for Linux would start with using it in the kernel code, system libraries and all shell commands; for Windows I do not even want to start (and yes, Windows 10 can do all that if one knows exactly what and how). Please excuse the long reply, I hope I stayed on topic.
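To make the "store normalized data as plain UTF-8" advice concrete, here is a minimal Python sketch (the file name and record contents are made up) that pins the storage encoding explicitly instead of relying on whatever the platform default or the in-memory string representation happens to be:

```python
import json

# Hypothetical record; how Python represents str objects in memory is
# irrelevant to how we choose to store them.
record = {"name": "Grüße 😀", "note": "stored text"}

# Serialize explicitly as UTF-8; never rely on the platform default encoding.
with open("record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

# Read it back the same way: the encoding is part of the file format contract.
with open("record.json", "r", encoding="utf-8") as f:
    assert json.load(f) == record
```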
Sorry to give the answer "It depends" ... but it does depend on what you want to do. That said, UTF-8 is the default to prefer.
UTF-8 is more interoperable, which means that more kinds of software can handle it. UTF-8 will be the better choice for programs that send/receive data to/from other programs, that read/write files to be shared with people on other systems, or that use libraries, including Unicode-unaware libraries like the old C 'string.h'. That is, UTF-8 is the better choice for most programs that you may write.
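One property behind that interoperability, shown in a small Python sketch (the sample text is arbitrary): in UTF-8 the bytes of a multi-byte sequence are all >= 0x80, so they can never be mistaken for an ASCII delimiter, and byte-oriented, Unicode-unaware code can still split, copy or terminate strings safely.

```python
# Byte-level split on an ASCII comma cannot cut a UTF-8 character in half,
# because continuation bytes never collide with ASCII values.
line = "Grüße,東京,naïve".encode("utf-8")
fields = line.split(b",")                   # Unicode-unaware byte operation
print([f.decode("utf-8") for f in fields])  # ['Grüße', '東京', 'naïve']

# The same is not safe for UTF-16: U+002C is the bytes 2C 00 (LE), and a
# 0x2C byte can also appear inside other characters' code units.
```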
UTF-16 is more computation-friendly. If your program does a lot of string manipulation, UTF-16's fixed-size 2-byte code units make it easier to do things like extract substrings or get/compare/update individual characters inside a string by index. Unfortunately, if a string includes characters from the (lesser-used) supplementary planes, each of which takes two code units (4 bytes), then you lose that computation-friendliness.
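A small Python sketch of that caveat (Python's own str indexes by code point, so the helper `utf16_units` below just counts UTF-16 code units explicitly; languages whose strings are UTF-16 code-unit arrays, such as Java or JavaScript, hit this directly when indexing):

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2

for s in ("hello", "héllo", "h😀llo"):
    print(f"{s!r}: {len(s)} code points, {utf16_units(s)} UTF-16 code units")

# 'hello':  5 code points, 5 code units -> index matches character position
# 'héllo':  5 code points, 5 code units -> still fine (BMP character)
# 'h😀llo': 5 code points, 6 code units -> U+1F600 needs a surrogate pair,
#                                          so code-unit indexes no longer
#                                          line up with characters
```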