Please see my comment to the question — not enough information.
On second thought, I can give you a very general idea. Yes, the data structure holding your training data is far too big to keep in memory. One common and universal approach is to keep the same interface to this data that you have right now, but replace access to memory with access to a file behind it.
Create a file with an appropriate structure (it can differ from your input format; you will need to convert the input into it), open it for read-only, non-shared access, and keep it open for the lifetime of your application. You will need to seek to records in this file on request, so it should probably be binary, with well-defined records. Yes, I understand that YAML data is hierarchical, not sequential, but you can artificially subdivide it into records at some reasonable level of granularity.
Then you need to read this file once and remember the file positions of its records. Most likely, this should be a hash table of file positions that supports fast lookup by a key; if you need to search by several keys, there could be several hash tables. Please see:
http://en.wikipedia.org/wiki/Hash_table.
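To make the idea of scanning the file once and remembering record positions more concrete, here is a minimal sketch in Python. The record layout (two 4-byte lengths, then key bytes, then payload bytes) and the names write_records, build_index and read_record are my assumptions for illustration only, and io.BytesIO stands in for a real file opened in binary mode.

```python
import io
import struct

# Hypothetical record layout: 4-byte key length, 4-byte payload length,
# then the key bytes, then the payload bytes (all little-endian).
HEADER = struct.Struct("<II")

def write_records(f, records):
    """Write (key, payload) pairs as length-prefixed binary records."""
    for key, payload in records:
        key_bytes = key.encode("utf-8")
        f.write(HEADER.pack(len(key_bytes), len(payload)))
        f.write(key_bytes)
        f.write(payload)

def build_index(f):
    """Scan the file once and map each key to the offset of its record."""
    index = {}
    f.seek(0)
    while True:
        offset = f.tell()
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of file
        key_len, payload_len = HEADER.unpack(header)
        key = f.read(key_len).decode("utf-8")
        f.seek(payload_len, io.SEEK_CUR)  # skip the payload, we only need positions
        index[key] = offset
    return index

def read_record(f, index, key):
    """Seek straight to one record and return only its payload."""
    f.seek(index[key])
    key_len, payload_len = HEADER.unpack(f.read(HEADER.size))
    f.seek(key_len, io.SEEK_CUR)
    return f.read(payload_len)

# io.BytesIO stands in for a file opened with open(path, "rb").
data_file = io.BytesIO()
write_records(data_file, [("sample-a", b"\x01\x02\x03"), ("sample-b", b"hello")])
index = build_index(data_file)
```

Only the dictionary of offsets stays in memory; each payload is read from the file when it is requested.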
In the case of hierarchical data like YAML, the search key could be the position of an item in the YAML tree, such as 1-3-203313-9. I hope you understand that this key should be stored as a binary structure, not text. You can also store some metadata in this structure, such as the number of children of each node.
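One possible binary form of such a tree path, as a sketch only: a 1-byte depth followed by one unsigned 32-bit integer per level. The encoding and the names pack_path and unpack_path are hypothetical; any compact fixed layout would serve the same purpose.

```python
import struct

def pack_path(path):
    """Encode a tree path such as (1, 3, 203313, 9) as bytes:
    a 1-byte depth, then one little-endian unsigned 32-bit integer per level."""
    return struct.pack(f"<B{len(path)}I", len(path), *path)

def unpack_path(blob):
    """Decode the bytes produced by pack_path back into a tuple of integers."""
    depth = blob[0]
    return struct.unpack_from(f"<{depth}I", blob, 1)

key = pack_path((1, 3, 203313, 9))
```

The resulting bytes object is hashable, so it can be used directly as a key in a hash table of file positions.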
You can store this hash table permanently in another file, to be reused on the next run. That file should then be kept until you update your training data. If each index entry can be a binary structure of fixed size, you will not need a secondary index, "the index of the index file": you can calculate the position of any entry from its number and the size of one entry in the file.
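Here is a small sketch of the fixed-size case. The entry layout (an 8-byte record offset plus a 4-byte child count) and the names save_index and load_entry are assumptions for illustration; io.BytesIO again stands in for the real index file.

```python
import io
import struct

# Hypothetical fixed-size index entry: 8-byte record offset, 4-byte child count.
ENTRY = struct.Struct("<QI")

def save_index(f, entries):
    """Write (offset, child_count) pairs one after another."""
    for offset, child_count in entries:
        f.write(ENTRY.pack(offset, child_count))

def load_entry(f, n):
    """Read entry number n directly; its position is simply n * ENTRY.size."""
    f.seek(n * ENTRY.size)
    return ENTRY.unpack(f.read(ENTRY.size))

index_file = io.BytesIO()
save_index(index_file, [(0, 2), (19, 0), (40, 5)])
```

Because every entry has the same size, no second-level index is needed to find entry n.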
Now the next step. Apparently, you already have some software interface giving you fast access to your YAML structure, or to whatever else is taking too much space. It could be a usual interface for access to a tree node, or anything else. Do the following: replace the implementation of this interface with another implementation that wraps access to the same structure stored in a file. This will help you keep the rest of the code intact.
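A sketch of what this replacement could look like, assuming the existing interface boils down to a single get(key) call. The class names, the dump helper and the 4-byte length prefix are all hypothetical; your real interface is whatever your code already uses.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<I")  # hypothetical: 4-byte payload length before each record

class InMemoryStore:
    """Original style of access: everything is held in a dict."""
    def __init__(self, items):
        self._items = dict(items)

    def get(self, key):
        return self._items[key]

class FileBackedStore:
    """Same get() interface, but the payloads stay on disk."""
    def __init__(self, path, index):
        self._file = open(path, "rb")  # kept open for the lifetime of the store
        self._index = index            # key -> file offset

    def get(self, key):
        self._file.seek(self._index[key])
        (length,) = HEADER.unpack(self._file.read(HEADER.size))
        return self._file.read(length)

    def close(self):
        self._file.close()

def dump(path, items):
    """Write the items to path and return the key -> offset index."""
    index = {}
    with open(path, "wb") as f:
        for key, payload in items.items():
            index[key] = f.tell()
            f.write(HEADER.pack(len(payload)))
            f.write(payload)
    return index

def total_size(store, keys):
    """Client code that depends only on get(), so either store works."""
    return sum(len(store.get(k)) for k in keys)

items = {"a": b"12345", "b": b"xy"}
path = os.path.join(tempfile.mkdtemp(), "training.bin")
index = dump(path, items)
memory_store = InMemoryStore(items)
file_store = FileBackedStore(path, index)
```

The function total_size does not know or care which store it is given, which is the whole point of keeping the interface unchanged.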
[EDIT]
If the data structure in question is Matrix, this problem with indexing becomes even simpler, much simpler. Please see my comment on this in the comments to the question.
Basically, any item is indexed by just two indices: Y and X (row, column). Even if matrix elements take different sizes when written to a file, this is not a problem: the index table can be an array of rank 2 holding the file positions. So, you keep this position table (index table) as an additional structure for fast lookup of a position in the bigger file by the Y-X index. This is very, very simple to do.
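A minimal sketch of such a position table, assuming each element is written with a 4-byte length prefix so that elements of different size can be read back. The names write_matrix and read_element and the byte-string elements are for illustration only; io.BytesIO stands in for the big data file.

```python
import io
import struct

LENGTH = struct.Struct("<I")  # hypothetical 4-byte length prefix per element

def write_matrix(f, matrix):
    """Write variable-size elements row by row and return the table
    positions[y][x] holding the file offset of each element."""
    positions = []
    for row in matrix:
        row_positions = []
        for element in row:
            row_positions.append(f.tell())
            f.write(LENGTH.pack(len(element)))
            f.write(element)
        positions.append(row_positions)
    return positions

def read_element(f, positions, y, x):
    """Look up the offset by (row, column) and read just that element."""
    f.seek(positions[y][x])
    (length,) = LENGTH.unpack(f.read(LENGTH.size))
    return f.read(length)

matrix = [[b"1.5", b"22"], [b"-3", b"4.25e3"]]
matrix_file = io.BytesIO()
positions = write_matrix(matrix_file, matrix)
```

Only the table of offsets is held in memory; the elements themselves are fetched from the file one at a time.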
—SA