I am looking to learn representations from location data, similar to what word2vec does for the words of a sentence. I am aiming to develop a system that can understand a region based on the location input and relate it to other locations.
Assuming you're referring to images of particular locations: attached are two photos of the same aisle in a supermarket. The pixels have been reduced to adjacent, possibly nested, areas. The operations of choice were a Gaussian convolution (blur), followed by establishing the areas by clustering pixels by color.
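A minimal sketch of that preprocessing, assuming the image is given as an (H, W, 3) RGB array; the blur width `sigma` and the number of color clusters `k` are hypothetical parameters you would tune, and the choice of k-means plus connected-component labeling is just one way to realize "clustering pixels by color":

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def segment_into_areas(image, sigma=2.0, k=8):
    """Blur an (H, W, 3) RGB image, cluster pixels by color,
    then split each color cluster into connected areas."""
    image = np.asarray(image, dtype=float)
    # Blur each color channel separately (no blurring across channels).
    blurred = ndimage.gaussian_filter(image, sigma=(sigma, sigma, 0))
    h, w, _ = blurred.shape
    # Cluster the pixel colors with k-means.
    colors = blurred.reshape(-1, 3)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(colors).reshape(h, w)
    # Within each color cluster, label connected regions -> the "areas".
    areas = np.zeros((h, w), dtype=int)
    next_id = 1
    for c in range(k):
        comp, n = ndimage.label(labels == c)
        areas[comp > 0] = comp[comp > 0] + next_id - 1
        next_id += n
    return areas  # integer map: every pixel carries the id of its area
```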
In order to recognize such an encoding of a place (or person or thing), you have to define a distance measure. In the simplest case you choose one prominent area (e.g. the green emergency light on top), take the midpoint of that area, and determine its distance to the midpoint of each other area. You may restrict this to areas that have low eccentricity or extremal brightness. This gives you a vector which can be compared to the analogous vector of the other image.
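A sketch of that single-reference-point descriptor, assuming `areas` is the integer area map from above and `ref_id` (the prominent area, e.g. the green lamp) has already been identified somehow; sorting the distances is my own choice to make the vector independent of the arbitrary area numbering:

```python
import numpy as np
from scipy import ndimage

def distance_vector(areas, ref_id):
    ids = [i for i in np.unique(areas) if i != 0]
    # Midpoint (centroid) of every area.
    centers = dict(zip(ids, ndimage.center_of_mass(np.ones_like(areas), areas, ids)))
    ref = np.array(centers[ref_id])
    # Distances from the reference midpoint to all other midpoints.
    dists = [np.linalg.norm(np.array(centers[i]) - ref) for i in ids if i != ref_id]
    return np.sort(dists)
```

Two images can then be compared by, for example, the Euclidean distance between their vectors, after padding or truncating them to a common length.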
Note that in this case there is only one reference point (the green lamp on the ceiling), which can be identified easily. In reality there might be no single prominent point to refer to. Each of the midpoints of n suitable areas could then serve as a reference point, which would give you n vectors with n components each, i.e. an n×n matrix. So you will have to match matrices.
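Building that n×n matrix is just a pairwise-distance computation over the area midpoints, assuming `centroids` is a list of the n midpoints (in an arbitrary order at this point):

```python
import numpy as np

def distance_matrix(centroids):
    pts = np.asarray(centroids, dtype=float)   # shape (n, 2)
    diff = pts[:, None, :] - pts[None, :, :]   # pairwise coordinate differences
    return np.linalg.norm(diff, axis=-1)       # (n, n) matrix of distances
```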
Simply adding up the squared differences of corresponding cells of the two matrices could serve as a basis. But since the numbering of the areas is arbitrary, you should minimize that sum by simultaneously swapping pairs of rows and columns with the same index (i.e. by relabeling the areas).
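A brute-force sketch of that matching, assuming both images yield n×n distance matrices `A` and `B` of the same size; it tries every relabeling, which is only feasible for small n, whereas the pairwise row/column swaps described above amount to a local-search version of the same minimization:

```python
import itertools
import numpy as np

def matching_cost(A, B):
    n = A.shape[0]
    best = np.inf
    # Relabel the areas of B by a permutation p, applied to rows and columns
    # simultaneously, and keep the smallest sum of squared cell differences.
    for p in itertools.permutations(range(n)):
        p = list(p)
        cost = np.sum((A - B[np.ix_(p, p)]) ** 2)
        best = min(best, cost)
    return best
```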