The obvious approach is to fetch an entire I-cache line every cycle and use a multiplexer to select the desired instructions. Alternatively, the I-cache can be rearchitected to perform auto-realignment. That is how high-end CPUs fetch multiple instructions per cycle.
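As a rough illustration of the line-plus-multiplexer idea, here is a minimal Python sketch. The names `LINE_SIZE`, `FETCH_WIDTH` and `fetch_group` are made up for this example, the PC is counted in instructions rather than bytes, and the cache is modelled as a plain list of lines, so it is a behavioural model only and not how any particular CPU implements it.

```python
LINE_SIZE = 16    # instructions per I-cache line (assumed for this sketch)
FETCH_WIDTH = 4   # instructions the front end wants per cycle (assumed)

def fetch_group(icache_lines, pc):
    """Read the whole line containing pc, then select up to FETCH_WIDTH
    instructions starting at pc. The slice plays the role of the multiplexer."""
    line_index, offset = divmod(pc, LINE_SIZE)
    line = icache_lines[line_index]           # one full line per cycle
    return line[offset:offset + FETCH_WIDTH]  # truncated at the line boundary

# Two lines whose "instructions" are simply their own addresses.
lines = [list(range(0, 16)), list(range(16, 32))]
print(fetch_group(lines, 5))   # a full group of four
print(fetch_group(lines, 14))  # only two: the group is cut at the end of the line
```

The second call shows the cost of this simple scheme: a group that starts near the end of a line comes back short.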
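Note that this technique has some inefficiencies, due to the limited cache line size. For example, with cache lines of 16 instructions and a 4-wide fetch window, and assuming the fetch address is equally likely to fall at any position in the line, the average fetch bandwidth is (13 ÷ 16) × 4 + (1 ÷ 16) × 3 + (1 ÷ 16) × 2 + (1 ÷ 16) × 1 = 3.625, short of the ideal 4. The loss comes from the last three positions of the line, where the window is cut off at the line boundary.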
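The average above can be checked with a few lines of Python. This is only a sketch of the arithmetic: `average_fetch_width` is a name invented here, and it assumes every starting offset within the line is equally likely, which real code with branches will not satisfy exactly.

```python
from fractions import Fraction

def average_fetch_width(line_size, window):
    """Average number of instructions delivered per fetch when the fetch
    group cannot cross a cache-line boundary and all start offsets are
    equally likely (an assumption of this sketch)."""
    total = 0
    for offset in range(line_size):
        # Instructions left in the line from this offset, capped by the window.
        total += min(window, line_size - offset)
    return Fraction(total, line_size)

print(float(average_fetch_width(16, 4)))  # 3.625
```

Trying other values shows the trend: longer lines or narrower windows bring the average closer to the full window width.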
Also look at papers on the trace cache (TC). A TC collects sequences of instructions at run time. Instructions belonging to the same trace may come from different, non-sequential locations. This keeps the miss rate low.
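To make the trace idea concrete, here is a toy Python model. `ToyTraceCache`, `record` and `fetch` are hypothetical names for this sketch, and traces are indexed by starting PC alone; the designs described in the literature also match on predicted branch outcomes and have bounded capacity, which this ignores.

```python
class ToyTraceCache:
    """Toy model: remembers the dynamic instruction sequence seen from a
    given starting PC, even when it spans taken branches."""

    def __init__(self, max_trace_len=4):
        self.max_trace_len = max_trace_len
        self.traces = {}  # starting PC -> tuple of PCs in execution order

    def record(self, executed_pcs):
        """Store the first max_trace_len PCs of an executed sequence."""
        trace = tuple(executed_pcs[:self.max_trace_len])
        if trace:
            self.traces[trace[0]] = trace

    def fetch(self, pc):
        """Return the stored trace starting at pc, or None on a miss."""
        return self.traces.get(pc)

tc = ToyTraceCache()
# Dynamic stream: 100, 101, then a taken branch to 200, 201.
tc.record([100, 101, 200, 201])
print(tc.fetch(100))  # (100, 101, 200, 201)
print(tc.fetch(300))  # None
```

The hit on PC 100 returns four instructions from two non-sequential regions in one access, which a line-based fetch could not deliver in a single cycle.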