What is the exact numerical format for representing a base pair (e.g. byte, int8, or similar)?
Hi @Zoran_Kostic , welcome to the community!
I assume you mean what we pass to the model to represent the DNA sequence? We pass that as a one-hot encoded float32 array. The shape we use is [B, S, 4], where B is the number of examples, S is the sequence length (e.g. 1 Mb), and 4 is the A, C, G, T one-hot encoding (e.g. [1, 0, 0, 0] for A). All zeros represents Ns.
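To make that layout concrete, here is a minimal NumPy sketch of the encoding described above. It is an illustration only, not the actual pipeline code: the function names `one_hot_encode` and `one_hot_encode_batch` are hypothetical, and raising an error on any character other than A, C, G, T, or N is a choice made for this sketch rather than confirmed model behavior.

```python
import numpy as np

# Channel order A, C, G, T, matching the description above.
_BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode one DNA string as a float32 array of shape [S, 4].

    N is left as an all-zero row. Any other character raises ValueError
    (an assumption of this sketch, not confirmed behavior).
    """
    encoded = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        if base == "N":
            continue  # all zeros represents N
        if base not in _BASE_TO_INDEX:
            raise ValueError(f"Unsupported base {base!r} at position {position}")
        encoded[position, _BASE_TO_INDEX[base]] = 1.0
    return encoded


def one_hot_encode_batch(sequences: list[str]) -> np.ndarray:
    """Stack equal-length sequences into a [B, S, 4] float32 array."""
    return np.stack([one_hot_encode(s) for s in sequences])


batch = one_hot_encode_batch(["ACGT", "NNAC"])
print(batch.shape)  # (2, 4, 4), i.e. [B, S, 4]
print(batch.dtype)  # float32
```

So each base occupies four float32 values (16 bytes) rather than a single byte or int8.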
Does that mean other letters in the IUPAC table (IUPAC Codes) are not supported? I was also wondering whether any forms of methylation, which may play significant roles, are considered at the moment.