The implementation of the IMCI ISA on Intel's Xeon Phi microarchitecture relies on a dedicated VPU, which is very different from how the Execution Engines for SSE/AVX are implemented on the Intel Nehalem/Sandy Bridge/Haswell microarchitectures.
This is basically due to how SSE/AVX instructions are implemented on the P6/Core family processors: there are typically several Execution Ports within the Execution Engine to which different types of SIMD instructions can be dispatched.
For example, on the Nehalem microarchitecture, a SIMD FP Multiply would go to Port-0, whereas a SIMD FP Add would go to Port-1. The same applies to some common Integer SIMD instructions (for both SSE and AVX).
On the MIC, however, which is a derivative of the former Larrabee GPU, the whole VPU that executes all SIMD instructions is connected to a single Execution Port/Pipe within the core (Port-0 on a 64-bit P54C core).
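To make the contrast concrete, here is a minimal sketch using SSE intrinsics. The multiply and the add below are independent, so an out-of-order core like Nehalem can dispatch them to Port-0 and Port-1 in the same cycle, whereas on Knights Corner the corresponding 512-bit operations would both have to go through the single VPU pipe. The function and array names are just for illustration.

    #include <immintrin.h>

    /* Two independent SSE operations per iteration: on Nehalem the FP multiply
     * and the FP add can issue in the same cycle on different ports; on the
     * Knights Corner VPU the equivalent operations share one pipe.
     * Assumes n is a multiple of 4. */
    void mul_add_independent(const float *x, const float *y,
                             float *prod_out, float *sum_out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&x[i]);
            __m128 b = _mm_loadu_ps(&y[i]);
            __m128 prod = _mm_mul_ps(a, b);   /* FP multiply port (Port-0) */
            __m128 sum  = _mm_add_ps(a, b);   /* independent FP add (Port-1) */
            _mm_storeu_ps(&prod_out[i], prod);
            _mm_storeu_ps(&sum_out[i], sum);
        }
    }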
I hope the attached papers also help to explain my question.
Intel has not disclosed implementation details for AVX-512 on the Skylake Xeon or Knights Landing processors, but they have said a few things about performance that have implications for the implementation.
Intel has said that AVX-512 performance on Knights Landing will be twice the peak performance of the Xeon Phi (Knights Corner) core on a per-cycle basis -- https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
Intel has strongly implied that this doubling of performance will also be provided in the Skylake Xeon processors (but has made no such comments concerning the Skylake client processors). This will likely be implemented as two 512-bit wide FMA functional units attached to Port 0 and Port 1 -- just like on Haswell, but with twice the operand width.
My analysis of the implementation of the DGEMM kernel on an AVX-512 system suggests that Intel will need to support two 512-bit loads per cycle from the L1 Data Cache to keep the two AVX-512 FMA units busy. This will also require doubling the number of registers from 16 to 32, which Intel has already documented.
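To make the arithmetic explicit: two 512-bit FMA units would deliver 2 x 8 x 2 = 32 double-precision FLOPs per cycle, twice the 16 per cycle of a Knights Corner core. As a rough sketch of the register-blocking argument (not any vendor's actual kernel), an AVX-512 DGEMM micro-kernel inner loop might look like the following; the 8x16 tile size, the packing layout, and the function name are my own assumptions for illustration.

    #include <immintrin.h>

    /* Hypothetical 8x16 register-blocked DGEMM micro-kernel.  Assumes packed
     * panels: a[k*8 + i] = A[i][k] and b[k*16 + j] = B[k][j], with the C tile
     * stored row-major with leading dimension 16. */
    void dgemm_8x16_kernel(const double *a, const double *b, double *c, int kc)
    {
        __m512d acc[8][2];
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 2; j++)
                acc[i][j] = _mm512_loadu_pd(&c[i * 16 + j * 8]);

        for (int k = 0; k < kc; k++) {
            __m512d b0 = _mm512_loadu_pd(&b[k * 16]);      /* B[k][0..7]  */
            __m512d b1 = _mm512_loadu_pd(&b[k * 16 + 8]);  /* B[k][8..15] */
            for (int i = 0; i < 8; i++) {
                __m512d ai = _mm512_set1_pd(a[k * 8 + i]); /* broadcast A[i][k] */
                acc[i][0] = _mm512_fmadd_pd(ai, b0, acc[i][0]);
                acc[i][1] = _mm512_fmadd_pd(ai, b1, acc[i][1]);
            }
        }

        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 2; j++)
                _mm512_storeu_pd(&c[i * 16 + j * 8], acc[i][j]);
    }

Each k-iteration issues 2 full-width loads of B plus 8 broadcast loads of A against 16 FMAs, which stays within a two-loads-per-cycle budget even if the broadcasts occupy a load port, and the 16 accumulators plus temporaries only fit comfortably with the 32 architectural registers.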
I also think this would be the most reasonable architectural decision (i.e., doubling both the Port-0 and Port-1 FMA units from 256 to 512 bits), as something similar was already done when widening from 128 to 256 bits in the SSE-to-AVX transition on Sandy Bridge.
The performance doubling of AVX-512 on Knights Landing over the previous Knights Corner is also a good clue supporting this hypothesis.
Furthermore, AVX-512 on Knights Landing looks pretty much like an alias of the original IMCI ISA, aimed at binary compatibility between the mainstream Xeon chips and the Xeon Phis. This makes me wonder whether the original IMCI of Knights Corner will be discontinued in future Xeon Phi generations.
Thank you very much for your comments, Mr. McCalpin.
There is no indication that the IMCI instruction set used on the Knights Corner Xeon Phi processor will ever be used again.
Careful reading of the Intel publications on AVX-512 support for Knights Landing and Skylake Xeon shows that they will both support the main subset of AVX-512 instructions, but each processor will support optional subsets that are not supported by the other processor. It should be easy to generate a binary that runs on either processor, but the performance penalty for not using each processor's specialized subset will be strongly application-dependent (and hard to predict in advance).
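As an illustration of how a single binary could sort this out at run time, here is a minimal sketch that reads the AVX-512 feature bits from CPUID leaf 7 (sub-leaf 0, EBX) using GCC's cpuid.h; the XSAVE/OS-support check and the actual dispatch policy are omitted for brevity.

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang: __cpuid, __cpuid_count */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __cpuid(0, eax, ebx, ecx, edx);        /* highest standard CPUID leaf */
        if (eax < 7) {
            printf("CPUID leaf 7 not available\n");
            return 1;
        }
        __cpuid_count(7, 0, eax, ebx, ecx, edx);

        int f  = (ebx >> 16) & 1;  /* AVX-512F  (Foundation)          */
        int dq = (ebx >> 17) & 1;  /* AVX-512DQ (Double/Quadword)     */
        int pf = (ebx >> 26) & 1;  /* AVX-512PF (Prefetch)            */
        int er = (ebx >> 27) & 1;  /* AVX-512ER (Exponential/Recip.)  */
        int cd = (ebx >> 28) & 1;  /* AVX-512CD (Conflict Detection)  */
        int bw = (ebx >> 30) & 1;  /* AVX-512BW (Byte/Word)           */
        int vl = (ebx >> 31) & 1;  /* AVX-512VL (128/256-bit lengths) */

        printf("F=%d CD=%d ER=%d PF=%d BW=%d DQ=%d VL=%d\n",
               f, cd, er, pf, bw, dq, vl);
        return 0;
    }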
I guess you are referring to AVX3.1 (AVX-512F, AVX-512 CDI), AVX-512 ERI, AVX-512 PFI and AVX3.2 (AVX-512 BW, AVX-512 DQ, AVX-512 VL), right?
Well... This pretty much answers another question of mine: whether the Integer SIMD units would also be upgraded, just as happened in the AVX-to-AVX2 transition, to support Byte/Word operations, which SSE/AVX do but AVX-512F/IMCI don't.
Also, it is pretty reasonable to think that the IMCI ISA will be discontinued, since it would be redundant with AVX-512F.
I am not aware of any Intel documentation that uses the "AVX3" terminology, so I will stick with the more verbose nomenclature.
The "Intel Architecture Instruction Set Extensions Programming Reference" (document 319433, revision 022, October 2014) is probably the most authoritative reference, and it only refers to the various subsets as AVX-512 (plus modifiers) as you note above.
Information about which subsets will be supported in each product offering is limited, and mostly indirect (e.g., in compiler documentation, and compiler-related documents such as https://software.intel.com/sites/default/files/managed/9f/d1/Intrinsics%20mapping%20KNC%20to%20KNL-v1.0.pdf)
For the initial product offerings supporting AVX-512, we have https://software.intel.com/en-us/blogs/additional-avx-512-instructions that says:
All products supporting AVX-512 will support AVX512F (Foundation) + AVX512CD (Conflict Detection)
Knights Landing will include AVX-512 ERI (Exponential and Reciprocal) and AVX-512 PFI (Prefetch) support.
The first Xeon processors to support AVX-512 will include AVX-512DQ (Double and Quadword), AVX-512BW (Byte and Word), and AVX-512VL (128-bit and 256-bit Vector Length) support.
The article says (reasonably enough) that future processors in each product line will maintain support for these ISA groups, but may add support for more extensions over time.
Interestingly, I can't find any mention of the AVX-512 IFMA or AVX-512 VBMI extensions on the Intel web site, except in the ISA Extensions Manual and in the Software Development Emulator documentation. So it is not clear when these extensions might be supported, or in which products.
So it looks like Xeon users won't need to wait an extra generation to get integer SIMD support (as we did with the AVX to AVX2 transition), but Xeon Phi will not have this support initially (and may or may not get it in later generations).
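To illustrate the distinction in intrinsics terms: 512-bit operations on 32-bit (and 64-bit) integers are part of AVX-512F, while 512-bit operations on 8-bit and 16-bit integers require AVX-512BW. A minimal sketch (function names are just for illustration):

    #include <immintrin.h>

    /* AVX-512F: sixteen 32-bit integer adds per 512-bit vector. */
    __m512i add_dwords(__m512i a, __m512i b)
    {
        return _mm512_add_epi32(a, b);
    }

    /* AVX-512BW: sixty-four 8-bit integer adds per 512-bit vector
     * (initially Xeon-only, per the product plans above). */
    __m512i add_bytes(__m512i a, __m512i b)
    {
        return _mm512_add_epi8(a, b);
    }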
Some more in-depth (though still compiler-related) information regarding the availability of the ISA subsets can be found in Kirill Yukhin's presentation at the GNU Tools Cauldron 2014.
One interesting thing to note is that ERI and PFI were said to remain exclusive to Xeon Phi.
Although the "AVX3" terminology was reportedly lifted from some supposedly leaked Intel roadmap presentation (which is also the only IFMA and VBMI information available so far), I still think the AVX3.1/3.2 naming is quite reasonable, as the same was done with SSE4.1/4.2 when a subset of a planned ISA extension was split across different processor generations/families.
As a sort of speculation exercise, I guess one could make some inferences about the AVX-512 implementation in Skylake by drawing an equivalence between AVX-512 and previous SSE/AVX generations on currently available processors.
I'm currently referring to both the "Intel Architecture Instruction Set Extensions Programming Reference" and Agner Fog's Instruction Tables ( http://www.agner.org/optimize/instruction_tables.pdf ) to get an in-depth understanding.
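Along the same lines, once the hardware is available, one could also test the two-FMA-pipe hypothesis empirically. Below is a rough sketch of such a throughput measurement using today's 256-bit FMA so it can be tried on Haswell (the same structure with __m512d would apply to an AVX-512 part); it is not a rigorous benchmark (no frequency pinning, warm-up, or turbo control), and the choice of 12 independent chains is just meant to cover the FMA latency-times-throughput product.

    #include <stdio.h>
    #include <immintrin.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void)
    {
        /* 12 independent accumulator chains: enough to cover an FMA latency
         * of ~5 cycles at a throughput of 2 per cycle (5 x 2 = 10 in flight). */
        __m256d acc[12];
        for (int i = 0; i < 12; i++)
            acc[i] = _mm256_set1_pd(1.0 + i);
        const __m256d x = _mm256_set1_pd(1.000001);
        const __m256d y = _mm256_set1_pd(0.999999);

        const long iters = 100000000L;
        unsigned long long t0 = __rdtsc();
        for (long n = 0; n < iters; n++)
            for (int i = 0; i < 12; i++)
                acc[i] = _mm256_fmadd_pd(x, y, acc[i]);
        unsigned long long t1 = __rdtsc();

        /* Use the results so the compiler cannot discard the loop. */
        double tmp[4], sum = 0.0;
        for (int i = 0; i < 12; i++) {
            _mm256_storeu_pd(tmp, acc[i]);
            sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];
        }
        printf("approx. cycles per 256-bit FMA: %.3f  (checksum %g)\n",
               (double)(t1 - t0) / ((double)iters * 12.0), sum);
        return 0;
    }

Compiled with something like gcc -O3 -mfma, a core with two FMA pipes should approach 0.5 cycles per FMA, while a single-pipe design would be limited to about 1.0.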