Based on my own experience, I can propose the following roadmap:
1) Try to find an open-source implementation of this algorithm, for example in GPL-licensed image conversion software for *nix operating systems.
2) Decompose the function into loops, calculations, read/write and lookup operations, and evaluate the computational cost of each part. The cost is easy to estimate with simple timing profiling, using the OS time API or the MATLAB tic and toc functions.
3) Determine how each part depends on the result of the part executed before it. This shows whether, and to what degree, the future FPGA implementation can be parallelized.
4) To transfer data between algorithm stages, it is good practice to use on-chip dual-port RAM blocks in place of the data I/O buffers used in software.
5) Since the output format is a two-dimensional pixel array, it is a good idea to implement an output interface similar to a video controller interface. It will be very useful for debugging the algorithm.
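As a rough illustration of the timing profiling in step 2, here is a minimal Python sketch. The two stage functions are hypothetical stand-ins, not real JPEG decoder stages; in practice you would substitute the parts of the reference software you decomposed. Wall-clock timings on a PC only hint at the relative cost of each part and do not predict FPGA performance.

```python
import time


def time_stage(func, *args, repeats=5):
    """Run func(*args) several times; return (result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Hypothetical stand-in for a lookup-dominated part of the algorithm.
def lookup_stage(data):
    table = [(i * 7) % 256 for i in range(256)]
    return [table[b] for b in data]


# Hypothetical stand-in for an arithmetic-dominated part of the algorithm.
def arithmetic_stage(data):
    return [(3 * v + 1) // 2 for v in data]


def profile_pipeline(data):
    """Time each stage in order, feeding each stage the previous stage's output."""
    timings = {}
    out, timings["lookup"] = time_stage(lookup_stage, data)
    out, timings["arithmetic"] = time_stage(arithmetic_stage, out)
    return out, timings


if __name__ == "__main__":
    _, timings = profile_pipeline(list(range(256)) * 100)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1e3:.3f} ms")
```

Because each stage is timed on the output of the previous one, the same structure also makes the data dependence between parts (step 3) explicit.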
If your program creates a JPG image and uses a BMP image for further processing in the same execution pass, you only need to know the BMP format and can create the BMP dataset yourself. If your program does not need the BMP image in the same execution pass, do the conversion with another program or manually, and then pass the BMP image as an input argument.
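To show how little is needed to create the BMP dataset yourself, here is a minimal Python sketch that encodes an uncompressed 24-bit BMP from an array of pixels. The function name encode_bmp and the input layout (rows from top to bottom, each pixel an (r, g, b) tuple) are assumptions for illustration; the resolution value of 2835 pixels per metre (about 72 DPI) is an arbitrary choice.

```python
import struct


def encode_bmp(pixels):
    """Encode rows of (r, g, b) tuples (top row first) as a 24-bit BMP byte string."""
    height = len(pixels)
    width = len(pixels[0])
    row_bytes = width * 3
    padding = b"\x00" * (-row_bytes % 4)  # each row is padded to a multiple of 4 bytes

    body = bytearray()
    for row in reversed(pixels):  # BMP stores rows bottom-up when height is positive
        for r, g, b in row:
            body += bytes((b, g, r))  # channel order in the file is blue, green, red
        body += padding

    header_size = 14 + 40  # file header + BITMAPINFOHEADER
    file_header = struct.pack(
        "<2sIHHI", b"BM", header_size + len(body), 0, 0, header_size
    )
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40,         # size of this header
        width,
        height,
        1,          # colour planes
        24,         # bits per pixel
        0,          # compression: none
        len(body),  # size of the pixel data
        2835,       # horizontal resolution, pixels per metre (assumed)
        2835,       # vertical resolution, pixels per metre (assumed)
        0,          # colours in palette (none)
        0,          # important colours (all)
    )
    return file_header + info_header + bytes(body)


if __name__ == "__main__":
    image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
    with open("out.bmp", "wb") as f:
        f.write(encode_bmp(image))
```

The same fixed 54-byte header followed by padded rows is what a hardware output stage would have to produce if the BMP is generated on the FPGA.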
The JPG format is intricate, and coding hardware to handle it is not trivial. If this is an undergrad project, you are aiming too high.
There is, however, a middle-ground solution. Use a soft core (e.g. MicroBlaze) and a memory, both fully coded in VHDL/Verilog. Then write software that runs on the soft core; that software does the actual work of parsing and understanding the JPG format.
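To give a feel for the first thing such software has to do, here is a minimal Python sketch that walks the marker segments at the start of a JPG file and reads the image size from the frame header. On a soft core this would normally be written in C; Python is used here only to keep the sketch short. The function name scan_jpeg_header is hypothetical, only baseline (SOF0) and progressive (SOF2) frame headers are recognized, and nothing after the start-of-scan marker is decoded, which is where the hard part (Huffman decoding, dequantization, inverse DCT) begins.

```python
import struct

SOF0, SOF2, SOS = 0xC0, 0xC2, 0xDA


def scan_jpeg_header(data):
    """Return ([(marker, segment_length), ...], (width, height) or None)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    segments = []
    size = None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before the real marker code
            pos += 1
            continue
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        segments.append((marker, length))
        if marker in (SOF0, SOF2):
            _precision, height, width = struct.unpack_from(">BHH", data, pos + 4)
            size = (width, height)
        if marker == SOS:  # entropy-coded data follows; stop here
            break
        pos += 2 + length
    return segments, size


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        segments, size = scan_jpeg_header(f.read())
    for marker, length in segments:
        print(f"marker 0xFF{marker:02X}, length {length}")
    print("size:", size)
```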